Posts Tagged “XML”

Back in February, in a slightly plaintive post, the W3 sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.

After that post, I spent two whole days fighting with XML catalogs — possibly the worst-documented XML spec ever — to make sure my Java code wasn’t downloading a DTD every time it read an XHTML document.

To my annoyance, no one seems to have posted any cut-and-paste solutions to this problem. Setting properties on the SAX parser is no help, and the XML catalogs solution is a pain to set up.

So what if someone wrote a “dummy” XML entity resolver that does nothing? Here’s what I came up with:

public class DummyEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        return new InputSource(new StringReader(""));
    }
}

Lo and behold, it works! The key is the return line — if you return null, the SAX parser reverts to its default behavior and downloads the DTD.

Use it like this:

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(new YourContentHandler());
reader.parse(your_xml_source);

The catch is that this will break any externally-defined entities, including standard XHTML entities like ©. The built-in XML entities such as &, and numeric character entities like &x43;, will still work.

You can check that you’re not downloading any DTD’s by watching the output of ngrep -q DTD while running your XML parser. If it doesn’t print anything, you’re good.

Comments 1 Comment »

New York State’s Office for Technology released a Request for Public Comment on selecting an XML-based office data format. The choices are OASIS’ ODF and Microsoft’s OOXML. Responses were due by 5 p.m. today, Dec. 28. My response is below, submitted just in time to meet the deadline. I didn’t have time to answer all the questions, so I focused on the ones I felt I could address in the greatest detail.

RESPONSE TO RFPC # 122807:

Background Information

For the past year I have been the lead programmer on AltLaw.org, a project to promote public access to federal court opinions by creating a free, full-text database of those opinions with an easy-to-use search interface. In the process of developing this web site, we have encountered many obstacles because of the way the federal courts store and publish their records. The problems we have encountered using electronic government records illustrate issues that New York State should consider in developing its electronic records policy. I will provide more details in my answers to the questions below.

Question 2

To encourage public access to State electronic records, it is important that the electronic record be the official State record rather than a draft or proxy for the “official” paper document. The greatest weakness of AltLaw.org as a legal reference is that the opinions we download from the federal courts’ web sites are not the final, official versions; those are published by and only available from the West corporation, at considerable cost.

Since the federal courts rely on West to copy-edit and correct their opinions, they are careless with respect to dates, names, and other important data. For example, we have downloaded several opinions that were decided in the year “2992″!

Furthermore, opinions published on federal court web sites lack any citation information — which is also assigned by West based on the pagination of their print volumes — making them useless for legal scholarship or court preparation.

As most professions (including the law) come increasingly to rely exclusively on electronic sources of information, it is critical that those sources become 100% reliable. To this end, electronic State records must be 1) complete, 2) accurate, 3) easily cited, and 4) acceptable for use in all official State business.

Question 3

To encourage interoperability and data sharing with citizens, business partners, and other jurisdictions, it is important that State electronic records be machine-readable. This is a more demanding requirement than simply having records in electronic form. A PDF document, for example, is electronic, but it is difficult or impossible to extract discrete data from a PDF document. This is because the PDF format is optimized toward preserving the visual appearance of a document rather than its structure.

There are two issues to consider when creating machine-readable documents. The first is “metadata.” Metadata is information about a document that may or may not be contained within the text of the document itself. For example, the metadata for a court opinion would include the name of the court, the date the opinion was released, the name of the judge writing the opinion, and the names of the parties in the case, among other data. Since most federal courts publish their opinions on their web sites in PDF format, with no metadata, AltLaw.org must rely on custom software to extract essential metadata such as titles and dates. The process is slow, difficult, and inaccurate.

It is worth noting that ODF supports metadata using the Resource Description Framework (RDF), an international standard which already forms the foundation of powerful data-analysis software. OOXML does not provide comparable metadata support.

See: http://blogs.sun.com/GullFOSS/entry/new_extensible_metadata_support_with

The second requirement for creating machine-readable documents is information about document structure. Document structure includes elements such as sections, headings, paragraphs, lists, and tables. These structures must be identifiable in the document independent of the visual formatting used to display those structures. For example, a human reader can recognize bold-face type as a section heading, but a computer program cannot. Structural information is important for automated document analysis, information retrieval (search), accessibility to physically-impaired users, and conversion to alternate formats (such as HTML). ODF provides more structural information than does OOXML.

In addition to including metadata and structural information in electronic documents, the State should implement rigorous standards to ensure that information is produced consistently. Metadata is only useful when it is stored in a known, consistent format. For example, simple information such as a date can be recorded in a dozen different ways. A date written as “02/03/04″ could be February 3, 2004 or it could be March 4, 2002. Work on AltLaw.org has shown us that there are as many ways of writing dates as there are courts to write them, and dates are by far the simplest piece of metadata to store. For New York State, if different State agencies (or, worse, individual State employees) were to record metadata in different ways, it would be almost as useless as having no metadata at all. After selecting a format for electronic records, the State must then establish formal procedures for using the metadata capabilities of that format. These procedures should be made freely available to the public for the purposes of encouraging interoperability and data sharing.

Question 4

To encourage appropriate government control of its electronic records, the State should rely wherever possible on public-key encryption technology. I am not an expert on this subject, but I urge the State to consult with computer security and encryption experts when choosing the protocols to implement. When properly used, public-key encryption can provide communications that are secure and verifiably authentic to a degree exceeding that of physical documents.

Question 5

To encourage choice and vendor neutrality when creating, maintaining, exchanging, and preserving electronic records, the State should consider the vested interests of parties promoting a particular data format. A format developed by a single corporate entity, such as Microsoft in the case of OOXML, gives that entity strong incentives to design the format to make interoperability difficult to achieve in practice, either through incomplete specification or proprietary extensions. OOXML has been criticized by others as being difficult to implement because of both of these factors. In contrast, ODF was developed by the independent, international OASIS group, with input from a variety of sources. To ensure the success of ODF, its creators have a vested interest in making interoperability as easy as possible. Thus, in the future, ODF is likely to be supported by a wider range of vendors and products than OOXML, and its adoption will promote greater competition in the markeplace.

Question 7

Regarding public access to long-term archives, the State should ensure that its archived records are available in bulk. “In bulk” means that a computer program should be able to obtain large quantities of archived records in an automated fashion, without human intervention. While developing AltLaw.org, we were often forced to write computer programs that simulate the behavior of a human user clicking through a court web site, as that was the only way to download opinions from those sites. In contrast, some court web sites make all their opinions available for download via FTP, a very simple Internet protocol designed for bulk file transfer. The latter made our job easier, and better promotes public access to archived records.

In general, State agencies should not take responsibility for providing the public with the tools to search and retrieve electronic records. Those tasks are better handled by corporate entities (such as Google) and non-profit institutions (such as AltLaw.org) that have technical expertise in those areas. The State should take responsibility for making consistent, accurate, complete data available in bulk at little or no cost to users of that data.

Question 10

Regarding the management of highly specialized data formats such as CAD, digital imaging, Geographic Information Systems, and multimedia, the State should use open, published, freely-available standards whenever possible. When open standards are not available for a particular type of data, the State should attempt to make that data available in as many competing formats as possible. For example, many database and statistical applications which use proprietary data formats can “dump” their data into simple, standardized formats such as comma-separated values (CSV). Imaging software which uses proprietary formats can usually convert files to non-proprietary standard formats as well. Wherever possible, these conversions should be “lossless,” that is, they should not lose any information in the conversion. Ideally, it should be possible to completely reconstruct the specialized data from the information contained in the simpler data formats. These practices provide insurance against future loss or corruption of data in highly-specialized formats.

Conclusion

I strongly urge the State to prefer ODF to OOXML. As one engaged in extracting data from large quantities of government records (over half a million documents, at last count), I would find ODF much more conducive to enabling my work than OOXML. I have looked at examples of the internal XML schemas used by both formats, and I find ODF much easier to read, understand, and manipulate than OOXML.

Comments No Comments »

Here’s a question that’s been bugging me for a while: what’s the best way to store information that is a mixture of highly- and loosely-structured data? For example, a collection of documents like Project Posner. Certain attributes of each document like the title, date, and citation fit easily into a normalized relational database model. But the body can only be described with some kind of markup.

I could just use HTML, except for one problem: my documents have to handle footnotes, for which HTML does not provide a tag. (As an aside, footnotes are a pain whether you’re doing web design or typesetting.)

On Project Posner, I compromised: everything is stored in a MySQL database, and the documents table has a “body” column that contains my own made-up XML syntax.

I could, in theory, normalize everything, even individual paragraphs. But that would be a nightmare to code and deadly slow. I could also store everything as XML documents. But then I’d have to reinvent all the facilities that MySQL (and ActiveRecord) provide, like transaction handling, auto-incrementing IDs, and so on.

For another project, I’m trying to create a pseudo-database that stores everything as XML files and uses Ferret for searching. I was going to use Ferret for full-text search anyway, so my original thought was to save overhead by not bothering with MySQL indexes. It works, but looking over it I realize that most of the data could be normalized to fit into the standard relational model. I’d still need a blob of XML data somewhere, but it could be in the database as easily as a file. What have I really gained, besides an impressively large and complex pile of code?

Comments 3 Comments »