Archive for December, 2007

New York State’s Office for Technology released a Request for Public Comment on selecting an XML-based office data format. The choices are OASIS’ ODF and Microsoft’s OOXML. Responses were due by 5 p.m. today, Dec. 28. My response is below, submitted just in time to meet the deadline. I didn’t have time to answer all the questions, so I focused on the ones I felt I could address in the greatest detail.

RESPONSE TO RFPC # 122807:

Background Information

For the past year I have been the lead programmer on AltLaw.org, a project to promote public access to federal court opinions by creating a free, full-text database of those opinions with an easy-to-use search interface. In the process of developing this web site, we have encountered many obstacles because of the way the federal courts store and publish their records. The problems we have encountered using electronic government records illustrate issues that New York State should consider in developing its electronic records policy. I will provide more details in my answers to the questions below.

Question 2

To encourage public access to State electronic records, it is important that the electronic record be the official State record rather than a draft or proxy for the “official” paper document. The greatest weakness of AltLaw.org as a legal reference is that the opinions we download from the federal courts’ web sites are not the final, official versions; those are published by and only available from the West corporation, at considerable cost.

Since the federal courts rely on West to copy-edit and correct their opinions, they are careless with respect to dates, names, and other important data. For example, we have downloaded several opinions that were decided in the year “2992”!

Furthermore, opinions published on federal court web sites lack any citation information — which is also assigned by West based on the pagination of their print volumes — making them useless for legal scholarship or court preparation.

As most professions (including the law) come increasingly to rely exclusively on electronic sources of information, it is critical that those sources become 100% reliable. To this end, electronic State records must be 1) complete, 2) accurate, 3) easily cited, and 4) acceptable for use in all official State business.

Question 3

To encourage interoperability and data sharing with citizens, business partners, and other jurisdictions, it is important that State electronic records be machine-readable. This is a more demanding requirement than simply having records in electronic form. A PDF document, for example, is electronic, but it is difficult or impossible to extract discrete data from a PDF document. This is because the PDF format is optimized toward preserving the visual appearance of a document rather than its structure.

There are two issues to consider when creating machine-readable documents. The first is “metadata.” Metadata is information about a document that may or may not be contained within the text of the document itself. For example, the metadata for a court opinion would include the name of the court, the date the opinion was released, the name of the judge writing the opinion, and the names of the parties in the case, among other data. Since most federal courts publish their opinions on their web sites in PDF format, with no metadata, AltLaw.org must rely on custom software to extract essential metadata such as titles and dates. The process is slow, difficult, and inaccurate.
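To make the difficulty concrete, here is a minimal sketch of the kind of heuristic such extraction software might use. It is not AltLaw.org's actual code. The "Decided: Month D, YYYY" pattern, the function names, and the 1789 lower bound are all assumptions chosen for illustration, and the month-name parsing assumes an English locale.

```python
import re
from datetime import date, datetime

# Hypothetical pattern: assumes the opinion text contains a line such as
# "Decided: December 28, 2007". Real opinions vary widely from court to court.
DECIDED = re.compile(r"Decided:?\s+([A-Z][a-z]+ \d{1,2}, \d{4})")


def extract_decided(text):
    """Return the first 'Decided: Month D, YYYY' date found in text, or None."""
    match = DECIDED.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%B %d, %Y").date()
    except ValueError:
        # The text matched the pattern but is not a real date (e.g. "Foo 99, 2007").
        return None


def is_plausible(d, earliest_year=1789):
    """Reject missing dates and years outside an assumed sensible range."""
    return d is not None and earliest_year <= d.year <= date.today().year
```

A heuristic like this will parse a mistyped year such as 2992 without complaint, so a separate plausibility check is needed, and even then it can only flag the error, not recover the correct date. Metadata stored explicitly by the author of the document avoids the guesswork.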

It is worth noting that ODF supports metadata using the Resource Description Framework (RDF), an international standard which already forms the foundation of powerful data-analysis software. OOXML does not provide comparable metadata support.

See: http://blogs.sun.com/GullFOSS/entry/new_extensible_metadata_support_with

The second requirement for creating machine-readable documents is information about document structure. Document structure includes elements such as sections, headings, paragraphs, lists, and tables. These structures must be identifiable in the document independent of the visual formatting used to display those structures. For example, a human reader can recognize bold-face type as a section heading, but a computer program cannot. Structural information is important for automated document analysis, information retrieval (search), accessibility to physically-impaired users, and conversion to alternate formats (such as HTML). ODF provides more structural information than does OOXML.
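As a rough illustration of what explicit structure buys, the sketch below pulls an outline out of a small XML fragment using only Python's standard library. The namespace, the text:h heading element, and its text:outline-level attribute come from ODF. The enclosing body element and the sample text are simplifications made up for this example and do not reproduce a complete ODF file.

```python
import xml.etree.ElementTree as ET

# The ODF "text" namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Simplified, made-up fragment: a real ODF content.xml has more wrapper elements.
SAMPLE = f"""<body xmlns:text="{TEXT_NS}">
  <text:h text:outline-level="1">Background Information</text:h>
  <text:p>For the past year I have been the lead programmer.</text:p>
  <text:h text:outline-level="2">Question 2</text:h>
  <text:p>The electronic record should be the official record.</text:p>
</body>"""


def outline(xml_text):
    """Return (level, heading text) pairs for every text:h element, in order."""
    root = ET.fromstring(xml_text)
    result = []
    for heading in root.iter(f"{{{TEXT_NS}}}h"):
        level = int(heading.get(f"{{{TEXT_NS}}}outline-level", "1"))
        result.append((level, "".join(heading.itertext()).strip()))
    return result


for level, title in outline(SAMPLE):
    print("  " * (level - 1) + title)
```

Because headings are marked as headings, the program never has to guess from fonts or spacing. The same outline could not be recovered reliably from a document in which headings are merely paragraphs set in bold type.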

In addition to including metadata and structural information in electronic documents, the State should implement rigorous standards to ensure that information is produced consistently. Metadata is only useful when it is stored in a known, consistent format. For example, simple information such as a date can be recorded in a dozen different ways. A date written as “02/03/04” could be February 3, 2004, or March 2, 2004, or March 4, 2002, depending on the convention the writer had in mind. Work on AltLaw.org has shown us that there are as many ways of writing dates as there are courts to write them, and dates are by far the simplest piece of metadata to store. For New York State, if different State agencies (or, worse, individual State employees) were to record metadata in different ways, it would be almost as useless as having no metadata at all. After selecting a format for electronic records, the State must then establish formal procedures for using the metadata capabilities of that format. These procedures should be made freely available to the public for the purposes of encouraging interoperability and data sharing.
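The date ambiguity is easy to demonstrate. The sketch below, using only Python's standard library, parses the same string under three conventions and prints each reading in ISO 8601 form (YYYY-MM-DD), which is one unambiguous format a metadata standard could require. The choice of these three conventions, and of ISO 8601 as the target, is an assumption made for illustration.

```python
from datetime import datetime

AMBIGUOUS = "02/03/04"

# Three plausible conventions for the same string of digits.
CONVENTIONS = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
    "year/month/day": "%y/%m/%d",
}


def interpretations(text):
    """Return each convention's reading of text as an ISO 8601 date string."""
    return {
        name: datetime.strptime(text, fmt).date().isoformat()
        for name, fmt in CONVENTIONS.items()
    }


readings = interpretations(AMBIGUOUS)
for name, iso in readings.items():
    print(f"{name}: {iso}")
```

All three readings are valid dates, so no program can tell which one was intended without knowing the convention in advance.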

Question 4

To encourage appropriate government control of its electronic records, the State should rely wherever possible on public-key encryption technology. I am not an expert on this subject, but I urge the State to consult with computer security and encryption experts when choosing the protocols to implement. When properly used, public-key encryption can provide communications that are secure and verifiably authentic to a degree exceeding that of physical documents.

Question 5

To encourage choice and vendor neutrality when creating, maintaining, exchanging, and preserving electronic records, the State should consider the vested interests of parties promoting a particular data format. A format developed by a single corporate entity, such as Microsoft in the case of OOXML, gives that entity strong incentives to design the format to make interoperability difficult to achieve in practice, either through incomplete specification or proprietary extensions. OOXML has been criticized by others as being difficult to implement because of both of these factors. In contrast, ODF was developed by the independent, international OASIS group, with input from a variety of sources. To ensure the success of ODF, its creators have a vested interest in making interoperability as easy as possible. Thus, in the future, ODF is likely to be supported by a wider range of vendors and products than OOXML, and its adoption will promote greater competition in the marketplace.

Question 7

Regarding public access to long-term archives, the State should ensure that its archived records are available in bulk. “In bulk” means that a computer program should be able to obtain large quantities of archived records in an automated fashion, without human intervention. While developing AltLaw.org, we were often forced to write computer programs that simulate the behavior of a human user clicking through a court web site, as that was the only way to download opinions from those sites. In contrast, some court web sites make all their opinions available for download via FTP, a very simple Internet protocol designed for bulk file transfer. The latter approach made our job easier, and it better promotes public access to archived records.
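To show how little code bulk access over FTP requires, here is a minimal sketch using Python's standard ftplib module. The function name, the host, and the directory paths are hypothetical, and the sketch assumes the remote directory contains only files, with no subdirectories.

```python
import os
from ftplib import FTP


def download_all(ftp, remote_dir, local_dir):
    """Download every file listed in remote_dir; return the local paths written.

    ftp is an already-connected, logged-in ftplib.FTP object.
    """
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    saved = []
    for name in ftp.nlst():
        local_path = os.path.join(local_dir, os.path.basename(name))
        with open(local_path, "wb") as out:
            # retrbinary calls out.write once for each block received.
            ftp.retrbinary(f"RETR {name}", out.write)
        saved.append(local_path)
    return saved


# Usage, with a made-up host and path:
# with FTP("ftp.courts.example") as ftp:
#     ftp.login()  # anonymous login
#     download_all(ftp, "/opinions/2007", "opinions")
```

Compare this with a program that must imitate a person clicking through search forms: that program has to be rewritten every time the web site's layout changes, while the FTP version depends only on the directory listing.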

In general, State agencies should not take responsibility for providing the public with the tools to search and retrieve electronic records. Those tasks are better handled by corporate entities (such as Google) and non-profit institutions (such as AltLaw.org) that have technical expertise in those areas. The State should take responsibility for making consistent, accurate, complete data available in bulk at little or no cost to users of that data.

Question 10

Regarding the management of highly specialized data formats such as CAD, digital imaging, Geographic Information Systems, and multimedia, the State should use open, published, freely-available standards whenever possible. When open standards are not available for a particular type of data, the State should attempt to make that data available in as many competing formats as possible. For example, many database and statistical applications which use proprietary data formats can “dump” their data into simple, standardized formats such as comma-separated values (CSV). Imaging software which uses proprietary formats can usually convert files to non-proprietary standard formats as well. Wherever possible, these conversions should be “lossless,” that is, they should not lose any information in the conversion. Ideally, it should be possible to completely reconstruct the specialized data from the information contained in the simpler data formats. These practices provide insurance against future loss or corruption of data in highly-specialized formats.
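One way to check that a conversion is lossless is a round trip: write the data out in the simple format, read it back, and compare the result with the original. The sketch below does this for CSV with Python's standard csv module. The helper names and the sample records are made up for illustration.

```python
import csv
import io


def dump_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def load_csv(text):
    """Parse CSV text with a header row back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Made-up records, including quotes, commas, and a line break inside a field.
records = [
    {"court": "2d Cir.", "decided": "2007-12-28", "title": 'Doe v. "Roe", Inc.'},
    {"court": "9th Cir.", "decided": "2007-11-02", "title": "Smith v. Jones\n(en banc)"},
]

text = dump_csv(records, ["court", "decided", "title"])
restored = load_csv(text)
print(restored == records)
```

The round trip succeeds here because every value is already text. CSV has no notion of numbers, dates, or nested data, so a lossless dump of richer data also needs a published convention for how each type is written as text.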

Conclusion

I strongly urge the State to prefer ODF to OOXML. As one engaged in extracting data from large quantities of government records (over half a million documents, at last count), I would find ODF much more conducive to enabling my work than OOXML. I have looked at examples of the internal XML schemas used by both formats, and I find ODF much easier to read, understand, and manipulate than OOXML.


Dan Weinreb posted common Complaints About Common Lisp. My personal complaint is in there — the lack of libraries that are well-documented and regularly updated. I think it’s a critical mass problem: so few people are using Common Lisp in their day-to-day work that there’s not enough momentum to keep the libraries going and make them fool-proof. Too many Common Lisp libraries are weekend projects that never make it out of alpha.

I’m guilty of the same offense: my one and only (very small) Common Lisp library — a bridge to run an embedded Perl 5 interpreter from Common Lisp — went a year before I heard from one lone user. By that time I had switched to Ruby.

The test of a really good library is not that it’s there, but that you don’t notice it’s there. If I want to scrape some HTML in Ruby, I don’t need to think about it, I just type “require ‘hpricot’” and it works. If I have a problem, odds are someone else has had the same problem and Google will find it. With Common Lisp, one can feel like a lone voice crying out in the wilderness. There’s a bit of a frontier mentality, too: “Well, if it’s broke, fix it yourself, citizen.”


Further thoughts on my new XO-1 Laptop:

  1. It is possible to type on it, albeit not as fast as on a regular keyboard.
  2. It’s a real Linux installation (Red Hat) on an x86-compatible processor. You can run “yum” in a root shell to install any package you want.
  3. The hardware/software integration needs some more work. For example, there’s a cool button that rotates the screen to any orientation, but it doesn’t re-map the arrow keys or touchpad axes, so it’s confusing to scroll through an ebook in portrait mode.
  4. There’s a lot of room for the platform to grow — the designers included keys on the keyboard that don’t do anything yet, in anticipation of future features.
  5. It is reasonably clever in remembering which WiFi networks you prefer.
  6. There’s no Ethernet port — if you want network access, you gotta have WiFi (or perhaps a USB network adapter).
  7. The bundled web browser only works with file types the XO is configured to handle. I downloaded a .tar.bz2 file but I couldn’t find where it got stashed on the filesystem.
  8. The “Sugar” interface is cute, and the “Journal” feature is downright innovative, but neither is complete enough to serve as a general-purpose computing platform. This isn’t necessarily bad — they were designed to be restricted to child-oriented tasks — but may limit the XO’s usefulness in other areas.
  9. The interface features (menus, icons, transitions) are slow. Unfortunately, I think this is due to the designers’ reliance on Python. Now, Python is a great language, and probably the best choice for an educational tool like the XO, but more optimization — perhaps from the PyPy project — is needed.


Just got my XO-1 laptop today, and I’m using it to write this post. First impressions:

  1. It’s light: it weighs about as much as a hardback book.
  2. The screen is sharp and readable, with or without the backlight.
  3. The built-in rubber keyboard is difficult for an adult to touch-type on. I’m hoping I’ll get used to it.
  4. It comes with a terminal app pre-installed!
  5. It’s slow to boot and start apps. I hear a suspend/resume feature is in the works.

Need to play with it more to get a clear idea of its capabilities.


Larry Wall writes about scripting, “I can’t define it, but I’ll know it when I see it.” So I thought I’d throw out my idea of a definition. A scripting language is a programming language that relies on and is designed to run within an ecosystem based on other languages. So Perl 5 and Ruby run within the C/Unix ecosystem, PHP runs within a web server, and Clojure runs within Java.

In contrast, non-scripting languages are designed to be a complete ecosystem on their own. In a non-scripting language, it’s C/Java/Smalltalk/Common Lisp all the way down. (Of course, no language exists in complete isolation, and everything has to be transformed into machine code eventually.)

I’ve always admired Java’s completeness (you can do everything you would ever want to do without leaving the Java world) even while I shun its complexity. Likewise, I admire small, special-purpose languages like AWK that do one thing well. Each approach has its attractions, and I believe we will always need both. That’s why I’m encouraged by recent efforts to develop scripting languages on the platforms of large-system languages like C# and Java. It’s a recognition that there will never be “one language to rule them all.” I think (hope) we’ll see more task-specific languages integrated with these platforms in the future.

The web already exemplifies this trend, with five or more languages (HTML, CSS, JavaScript, SQL, PHP/Perl/Ruby/…) all working together to produce a single result.
