New York State’s Office for Technology released a Request for Public Comment on selecting an XML-based office data format. The choices are OASIS’ ODF and Microsoft’s OOXML. Responses were due by 5 p.m. today, Dec. 28. My response is below, submitted just in time to meet the deadline. I didn’t have time to answer all the questions, so I focused on the ones I felt I could address in the greatest detail.
RESPONSE TO RFPC # 122807:
For the past year I have been the lead programmer on AltLaw.org, a project to promote public access to federal court opinions by creating a free, full-text database of those opinions with an easy-to-use search interface. In the process of developing this web site, we have encountered many obstacles because of the way the federal courts store and publish their records. The problems we have encountered using electronic government records illustrate issues that New York State should consider in developing its electronic records policy. I will provide more details in my answers to the questions below.
To encourage public access to State electronic records, it is important that the electronic record be the official State record rather than a draft or proxy for the “official” paper document. The greatest weakness of AltLaw.org as a legal reference is that the opinions we download from the federal courts’ web sites are not the final, official versions; those are published by and only available from the West corporation, at considerable cost.
Since the federal courts rely on West to copy-edit and correct their opinions, they are careless with respect to dates, names, and other important data. For example, we have downloaded several opinions that were decided in the year “2992”!
Furthermore, opinions published on federal court web sites lack any citation information — which is also assigned by West based on the pagination of their print volumes — making them useless for legal scholarship or court preparation.
As most professions (including the law) come increasingly to rely exclusively on electronic sources of information, it is critical that those sources become 100% reliable. To this end, electronic State records must be 1) complete, 2) accurate, 3) easily cited, and 4) acceptable for use in all official State business.
To encourage interoperability and data sharing with citizens, business partners, and other jurisdictions, it is important that State electronic records be machine-readable. This is a more demanding requirement than simply having records in electronic form. A PDF document, for example, is electronic, but it is difficult or impossible to extract discrete data from a PDF document. This is because the PDF format is optimized toward preserving the visual appearance of a document rather than its structure.
There are two issues to consider when creating machine-readable documents. The first is “metadata.” Metadata is information about a document that may or may not be contained within the text of the document itself. For example, the metadata for a court opinion would include the name of the court, the date the opinion was released, the name of the judge writing the opinion, and the names of the parties in the case, among other data. Since most federal courts publish their opinions on their web sites in PDF format, with no metadata, AltLaw.org must rely on custom software to extract essential metadata such as titles and dates. The process is slow, difficult, and inaccurate.
It is worth noting that ODF supports metadata using the Resource Description Framework (RDF), an international standard which already forms the foundation of powerful data-analysis software. OOXML does not provide comparable metadata support.
The second requirement for creating machine-readable documents is information about document structure. Document structure includes elements such as sections, headings, paragraphs, lists, and tables. These structures must be identifiable in the document independent of the visual formatting used to display those structures. For example, a human reader can recognize bold-face type as a section heading, but a computer program cannot. Structural information is important for automated document analysis, information retrieval (search), accessibility to physically-impaired users, and conversion to alternate formats (such as HTML). ODF provides more structural information than does OOXML.
In addition to including metadata and structural information in electronic documents, the State should implement rigorous standards to ensure that information is produced consistently. Metadata is only useful when it is stored in a known, consistent format. For example, simple information such as a date can be recorded in a dozen different ways. A date written as “02/03/04” could be February 3, 2004 or it could be March 4, 2002. Work on AltLaw.org has shown us that there are as many ways of writing dates as there are courts to write them, and dates are by far the simplest piece of metadata to store. For New York State, if different State agencies (or, worse, individual State employees) were to record metadata in different ways, it would be almost as useless as having no metadata at all. After selecting a format for electronic records, the State must then establish formal procedures for using the metadata capabilities of that format. These procedures should be made freely available to the public for the purposes of encouraging interoperability and data sharing.
To encourage appropriate government control of its electronic records, the State should rely wherever possible on public-key encryption technology. I am not an expert on this subject, but I urge the State to consult with computer security and encryption experts when choosing the protocols to implement. When properly used, public-key encryption can provide communications that are secure and verifiably authentic to a degree exceeding that of physical documents.
To encourage choice and vendor neutrality when creating, maintaining, exchanging, and preserving electronic records, the State should consider the vested interests of parties promoting a particular data format. A format developed by a single corporate entity, such as Microsoft in the case of OOXML, gives that entity strong incentives to design the format to make interoperability difficult to achieve in practice, either through incomplete specification or proprietary extensions. OOXML has been criticized by others as being difficult to implement because of both of these factors. In contrast, ODF was developed by the independent, international OASIS group, with input from a variety of sources. To ensure the success of ODF, its creators have a vested interest in making interoperability as easy as possible. Thus, in the future, ODF is likely to be supported by a wider range of vendors and products than OOXML, and its adoption will promote greater competition in the markeplace.
Regarding public access to long-term archives, the State should ensure that its archived records are available in bulk. “In bulk” means that a computer program should be able to obtain large quantities of archived records in an automated fashion, without human intervention. While developing AltLaw.org, we were often forced to write computer programs that simulate the behavior of a human user clicking through a court web site, as that was the only way to download opinions from those sites. In contrast, some court web sites make all their opinions available for download via FTP, a very simple Internet protocol designed for bulk file transfer. The latter made our job easier, and better promotes public access to archived records.
In general, State agencies should not take responsibility for providing the public with the tools to search and retrieve electronic records. Those tasks are better handled by corporate entities (such as Google) and non-profit institutions (such as AltLaw.org) that have technical expertise in those areas. The State should take responsibility for making consistent, accurate, complete data available in bulk at little or no cost to users of that data.
Regarding the management of highly specialized data formats such as CAD, digital imaging, Geographic Information Systems, and multimedia, the State should use open, published, freely-available standards whenever possible. When open standards are not available for a particular type of data, the State should attempt to make that data available in as many competing formats as possible. For example, many database and statistical applications which use proprietary data formats can “dump” their data into simple, standardized formats such as comma-separated values (CSV). Imaging software which uses proprietary formats can usually convert files to non-proprietary standard formats as well. Wherever possible, these conversions should be “lossless,” that is, they should not lose any information in the conversion. Ideally, it should be possible to completely reconstruct the specialized data from the information contained in the simpler data formats. These practices provide insurance against future loss or corruption of data in highly-specialized formats.
I strongly urge the State to prefer ODF to OOXML. As one engaged in extracting data from large quantities of government records (over half a million documents, at last count), I would find ODF much more conducive to enabling my work than OOXML. I have looked at examples of the internal XML schemas used by both formats, and I find ODF much easier to read, understand, and manipulate than OOXML.