Arc

The most famous piece of Lisp-related vaporware is vapor no longer: Arc has been released. After paging through the tutorial, I’m a bit underwhelmed. It looks like little more than syntactic sugar implemented on top of Scheme. Clojure is more interesting and more innovative. Clojure and Arc have some things in common: data structures are functions on their keys, strings are sequences, and lambda is “fn”. But Arc seems to stay within the traditional, semi-imperative model, whereas Clojure borrows heavily from functional programming. Arc even has for-loops, for goodness’ sake. Obviously, I have no idea what else Paul Graham may have up his sleeve, so maybe there’s much more exciting stuff to come. But for now, Arc looks to me like a modest improvement on Common Lisp and Scheme rather than a major new language.
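For reference, here is what those three shared features look like on the Clojure side (Arc spells them a little differently); this is my own illustrative sketch, not code from either tutorial:

    ;; A map is a function of its keys:
    (def scores {:alice 3 :bob 5})
    (scores :bob)          ;=> 5

    ;; Strings behave as sequences:
    (first "lisp")         ;=> \l

    ;; Lambda is spelled "fn":
    ((fn [x] (* x x)) 4)   ;=> 16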

Basking in the Solr Glow

I am happy to report that AltLaw.org’s switch to Solr has worked very well. Solr is a RESTful search engine built on Lucene. The setup was more complicated than just using a search library, but the rewards were worth it.
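“RESTful” here means that every operation, indexing and searching alike, is just an HTTP request, so any language that can speak HTTP can talk to Solr. Here is a minimal query sketch in Clojure; the host, port, and /select handler are stock Solr defaults, not anything specific to AltLaw:

    ;; Fetch Solr's raw XML response for a query string,
    ;; assuming a default Solr instance on localhost:8983.
    (import '(java.net URLEncoder))

    (defn solr-query [q]
      (slurp (str "http://localhost:8983/solr/select?q="
                  (URLEncoder/encode q "UTF-8"))))

    ;; (solr-query "habeas corpus") ;=> "<?xml version=\"1.0\" ..."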

Before, I was using Ferret, which I still like. It’s a great library, and Dave Balmain has done incredible work producing a fast search engine that integrates well with Ruby. I still use it on other sites. But Solr was a better fit for AltLaw.

With Ferret, I was trying to shoehorn large, unstructured documents into a system — ActiveRecord, MySQL, and acts_as_ferret — that is better suited to small, structured records. Now I use Solr as both a search engine and a document store, eliminating MySQL from the picture. That, combined with Solr’s built-in caching, has dramatically decreased the server load average (from around 2.00 to under 0.30) while visibly improving search performance.

It also helps that Solr is not tightly integrated with Ruby. The solr-ruby gem is not well documented, but it is easy to figure out, since it is just a thin wrapper over the Solr APIs. Having the search engine in a separate process made it easier to separate the indexing and searching code from the web application. As a result, the Rails code shrank to one-fourth its former size.

ODF vs. OOXML in New York State

New York State’s Office for Technology released a Request for Public Comment on selecting an XML-based office data format. The choices are OASIS’ ODF and Microsoft’s OOXML. Responses were due by 5 p.m. today, Dec. 28. My response is below, submitted just in time to meet the deadline. I didn’t have time to answer all the questions, so I focused on the ones I felt I could address in the greatest detail.

RESPONSE TO RFPC # 122807:

Background Information

For the past year I have been the lead programmer on AltLaw.org, a project to promote public access to federal court opinions by creating a free, full-text database of those opinions with an easy-to-use search interface. In the process of developing this web site, we have encountered many obstacles because of the way the federal courts store and publish their records. The problems we have encountered using electronic government records illustrate issues that New York State should consider in developing its electronic records policy. I will provide more details in my answers to the questions below.

Question 2

To encourage public access to State electronic records, it is important that the electronic record be the official State record rather than a draft or proxy for the “official” paper document. The greatest weakness of AltLaw.org as a legal reference is that the opinions we download from the federal courts’ web sites are not the final, official versions; those are published by and only available from the West corporation, at considerable cost.

Because the federal courts rely on West to copy-edit and correct their opinions, the courts themselves are careless with dates, names, and other important data. For example, we have downloaded several opinions that were supposedly decided in the year “2992”!

Furthermore, opinions published on federal court web sites lack any citation information — which is also assigned by West based on the pagination of their print volumes — making them useless for legal scholarship or court preparation.

As most professions (including the law) come increasingly to rely exclusively on electronic sources of information, it is critical that those sources become 100% reliable. To this end, electronic State records must be 1) complete, 2) accurate, 3) easily cited, and 4) acceptable for use in all official State business.

Question 3

To encourage interoperability and data sharing with citizens, business partners, and other jurisdictions, it is important that State electronic records be machine-readable. This is a more demanding requirement than simply having records in electronic form. A PDF document, for example, is electronic, but it is difficult or impossible to extract discrete data from a PDF document. This is because the PDF format is optimized toward preserving the visual appearance of a document rather than its structure.

There are two issues to consider when creating machine-readable documents. The first is “metadata.” Metadata is information about a document that may or may not be contained within the text of the document itself. For example, the metadata for a court opinion would include the name of the court, the date the opinion was released, the name of the judge writing the opinion, and the names of the parties in the case, among other data. Since most federal courts publish their opinions on their web sites in PDF format, with no metadata, AltLaw.org must rely on custom software to extract essential metadata such as titles and dates. The process is slow, difficult, and inaccurate.

It is worth noting that ODF supports metadata using the Resource Description Framework (RDF), an international standard which already forms the foundation of powerful data-analysis software. OOXML does not provide comparable metadata support.

See: http://blogs.sun.com/GullFOSS/entry/new_extensible_metadata_support_with

The second requirement for machine-readable documents is information about document structure. Document structure includes elements such as sections, headings, paragraphs, lists, and tables. These structures must be identifiable in the document independent of the visual formatting used to display them. For example, a human reader can recognize boldface type as a section heading, but a computer program cannot. Structural information is important for automated document analysis, information retrieval (search), accessibility for physically impaired users, and conversion to alternate formats (such as HTML). ODF provides more structural information than does OOXML.

In addition to including metadata and structural information in electronic documents, the State should implement rigorous standards to ensure that the information is produced consistently. Metadata is only useful when it is stored in a known, consistent format. For example, a piece of information as simple as a date can be recorded in a dozen different ways: a date written as “02/03/04” could be February 3, 2004, or March 4, 2002. Work on AltLaw.org has shown us that there are as many ways of writing dates as there are courts to write them, and dates are by far the simplest piece of metadata to store. If different New York State agencies (or, worse, individual State employees) were to record metadata in different ways, the metadata would be almost as useless as no metadata at all. After selecting a format for electronic records, the State must then establish formal procedures for using the metadata capabilities of that format. These procedures should be made freely available to the public to encourage interoperability and data sharing.

Question 4

To encourage appropriate government control of its electronic records, the State should rely wherever possible on public-key encryption technology. I am not an expert on this subject, but I urge the State to consult with computer security and encryption experts when choosing the protocols to implement. When properly used, public-key encryption can provide communications that are secure and verifiably authentic to a degree exceeding that of physical documents.

Question 5

To encourage choice and vendor neutrality when creating, maintaining, exchanging, and preserving electronic records, the State should consider the vested interests of parties promoting a particular data format. A format developed by a single corporate entity, such as Microsoft in the case of OOXML, gives that entity strong incentives to design the format so that interoperability is difficult to achieve in practice, whether through incomplete specification or proprietary extensions. OOXML has been criticized by others as being difficult to implement for both of these reasons. In contrast, ODF was developed by the independent, international OASIS group, with input from a variety of sources. Because ODF’s success depends on broad adoption, its creators have a vested interest in making interoperability as easy as possible. Thus, in the future, ODF is likely to be supported by a wider range of vendors and products than OOXML, and its adoption will promote greater competition in the marketplace.

Question 7

Regarding public access to long-term archives, the State should ensure that its archived records are available in bulk. “In bulk” means that a computer program should be able to obtain large quantities of archived records in an automated fashion, without human intervention. While developing AltLaw.org, we were often forced to write computer programs that simulate the behavior of a human user clicking through a court web site, because that was the only way to download opinions from those sites. In contrast, some court web sites make all their opinions available for download via FTP, a very simple Internet protocol designed for bulk file transfer. The latter approach made our job far easier and does more to promote public access to archived records.

In general, State agencies should not take responsibility for providing the public with the tools to search and retrieve electronic records. Those tasks are better handled by corporate entities (such as Google) and non-profit institutions (such as AltLaw.org) that have technical expertise in those areas. The State should take responsibility for making consistent, accurate, complete data available in bulk at little or no cost to users of that data.

Question 10

Regarding the management of highly specialized data formats such as CAD, digital imaging, Geographic Information Systems, and multimedia, the State should use open, published, freely-available standards whenever possible. When open standards are not available for a particular type of data, the State should attempt to make that data available in as many competing formats as possible. For example, many database and statistical applications which use proprietary data formats can “dump” their data into simple, standardized formats such as comma-separated values (CSV). Imaging software which uses proprietary formats can usually convert files to non-proprietary standard formats as well. Wherever possible, these conversions should be “lossless,” that is, they should not lose any information in the conversion. Ideally, it should be possible to completely reconstruct the specialized data from the information contained in the simpler data formats. These practices provide insurance against future loss or corruption of data in highly-specialized formats.

Conclusion

I strongly urge the State to prefer ODF to OOXML. As one engaged in extracting data from large quantities of government records (over half a million documents, at last count), I would find ODF much more conducive to enabling my work than OOXML. I have looked at examples of the internal XML schemas used by both formats, and I find ODF much easier to read, understand, and manipulate than OOXML.

Critical Mass

Dan Weinreb posted Common Complaints About Common Lisp. My personal complaint is in there — the lack of libraries that are well documented and regularly updated. I think it’s a critical-mass problem: so few people are using Common Lisp in their day-to-day work that there’s not enough momentum to keep the libraries going and make them foolproof. Too many Common Lisp libraries are weekend projects that never make it out of alpha.

I’m guilty of the same offense: my one and only (very small) Common Lisp library — a bridge to run an embedded Perl 5 interpreter from Common Lisp — sat for a year before I heard from one lone user. By that time I had switched to Ruby.

The test of a really good library is not that it’s there, but that you don’t notice it’s there. If I want to scrape some HTML in Ruby, I don’t need to think about it; I just type “require 'hpricot'” and it works. If I have a problem, odds are someone else has had the same problem and Google will find it. With Common Lisp, one can feel like a lone voice crying out in the wilderness. There’s a bit of a frontier mentality, too: “Well, if it’s broke, fix it yourself, citizen.”

XO-1 Laptop: Second Impressions

Further thoughts on my new XO-1 Laptop:

  1. It is possible to type on it, albeit not as fast as on a regular keyboard.
  2. It’s a real Linux installation — Red Hat — on an x86-compatible processor. You can run “yum” in a root shell to install any package you want.
  3. The hardware/software integration needs some more work. For example, there’s a cool button that rotates the screen to any orientation, but it doesn’t re-map the arrow keys or touchpad axes, so it’s confusing to scroll through an ebook in portrait mode.
  4. There’s a lot of room for the platform to grow — the designers included keys on the keyboard that don’t do anything yet, in anticipation of future features.
  5. It is reasonably clever in remembering which WiFi networks you prefer.
  6. There’s no Ethernet port — if you want network access, you gotta have WiFi (or perhaps a USB network adapter).
  7. The bundled web browser only works with file types the XO is configured to handle. I downloaded a .tar.bz2 file but I couldn’t find where it got stashed on the filesystem.
  8. The “Sugar” interface is cute, and the “Journal” feature is downright innovative, but neither is complete enough to serve as a general-purpose computing platform. This isn’t necessarily bad — they were designed to be restricted to child-oriented tasks — but may limit the XO’s usefulness in other areas.
  9. The interface features (menus, icons, transitions) are slow. Unfortunately, I think this is due to the designers’ reliance on Python. Now, Python is a great language, and probably the best choice for an educational tool like the XO, but more optimization — perhaps from the PyPy project — is needed.

Blogging XO Style

Just got my XO-1 laptop today, and I’m using it to write this post. First impressions:

  1. It’s light, weighing about as much as a hardback book.
  2. The screen is sharp and readable, with or without the backlight.
  3. The built-in rubber keyboard is difficult for an adult to touch-type on. I’m hoping I’ll get used to it.
  4. It comes with a terminal app pre-installed!
  5. It’s slow to boot and start apps. I hear a suspend/resume feature is in the works.

Need to play with it more to get a clear idea of its capabilities.

The Definition of Scripting

Larry Wall writes about scripting, “I can’t define it, but I’ll know it when I see it.” So I thought I’d throw out my idea of a definition. A scripting language is a programming language that relies on and is designed to run within an ecosystem based on other languages. So Perl 5 and Ruby run within the C/Unix ecosystem, PHP runs within a web server, and Clojure runs within Java.

In contrast, non-scripting languages are designed to be a complete ecosystem on their own. In a non-scripting language, it’s C/Java/Smalltalk/Common Lisp all the way down. (Of course, no language exists in complete isolation, and everything has to be transformed into machine code eventually.)

I’ve always admired Java’s completeness — you can do everything you would ever want to do without leaving the Java world — even while I shun its complexity. Likewise, I admire small, special-purpose languages like AWK that do one thing well. Both approaches have their attractions, but I believe we will always need both. That’s why I’m encouraged by recent efforts to develop scripting languages on the platforms of large-system languages like C# and Java. It’s a recognition that there will never be “one language to rule them all.” I think (hope) we’ll see more task-specific languages integrated with these platforms in the future.

The web already exemplifies this trend, with five or more languages (HTML, CSS, JavaScript, SQL, PHP/Perl/Ruby/…) all working together to produce a single result.

Clojure: A Lisp Worth Talking About

A couple nights ago I walked down to LispNYC in the East Village to hear Rich Hickey talk about Clojure, his new Lisp-like language. To be honest, I wasn’t expecting much. Another Lisp? Ho hum. I’m sure it’s very clever and cool and all, but not something I can actually use.

Instead, I was blown away by Rich’s presentation. Clojure might just be the Lisp I’ve been waiting for. Here’s why:

  1. Clojure targets the JVM. That doesn’t just mean it’s written in Java. It compiles directly to Java bytecode, so it can take advantage of all the optimization work that’s been done on just-in-time bytecode compilers. So it’s both fast and portable.
  2. Clojure is functional: the built-in data structures are immutable, and idiomatic code is largely side-effect free. At the same time, you can have side effects (via external Java calls) when you need them, without needing monads.
  3. Clojure has modern data structures. Vectors and hash maps are built in, and both can be used as “functions” that take a key or index as their argument (I immediately thought: yes, that’s the way it should be). See the sketch after this list.
  4. Clojure fits into the Java ecosystem. All squintillion Java library methods can be called directly from Clojure. (Hickey demonstrated a small Swing app with fewer lines than the Java version.) Clojure data structures implement the Java collection interfaces.
  5. Clojure has built-in concurrency support. I barely understand this, but Hickey has done his homework and built a sophisticated concurrency and transaction system right into the language.
  6. Clojure abstracts Lisp sequences. Standard functions like “map” and “filter” work on any sequence, not just lists. In fact, they work on any object that implements the Java “Iterable” interface.
  7. Clojure has metadata. Objects can have metadata attached to them, which can be used by the program without affecting the value of the data (e.g. two objects with identical contents but different metadata are still considered “equal”).
  8. Clojure is still Lisp! It has first-class functions, Common Lisp-style macros (with automatic gensyms), a reader, “eval”, and a REPL.
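To make a few of these concrete, here is a short REPL sketch of items 3, 6, and 7. The exact forms are my own illustration, not code from Hickey’s talk:

    ;; 3. Maps and vectors are functions of their keys/indices:
    ({:court "2d Cir." :year 2007} :year)  ;=> 2007
    ([\a \b \c] 1)                         ;=> \b

    ;; 6. Sequence functions work on anything seq-able, strings included:
    (map int "abc")                        ;=> (97 98 99)
    (filter odd? [1 2 3 4 5])              ;=> (1 3 5)

    ;; 7. Metadata rides along with a value without affecting equality:
    (def v (with-meta [:a :b] {:source "lispnyc"}))
    (meta v)                               ;=> {:source "lispnyc"}
    (= v [:a :b])                          ;=> true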

I haven’t been this excited about Lisp since Peter Seibel came to talk about Practical Common Lisp. Hickey’s full presentation should show up soon at LispNYC.org. There’s also a (beautiful) Clojure web site and a Google group.

Update Nov. 21: Audio and slides from the presentation are available at LispNYC.

When You Have a Hammer, Everything Looks Like Rails

AltLaw.org began life as a Ruby on Rails project. I was enamored of Rails at the time (who wasn’t?), and rapid prototyping helped me get it up and running in just a few months.

The problem is that I’ve been trying to follow Rails’ conventions even when they aren’t exactly in line with what I need to do. Right now, AltLaw is a very standard Rails setup: Apache2, Mongrel, MySQL, and Ferret for searching. That setup works perfectly for the other sites I run, but each of those sites has a few hundred pages, not AltLaw’s hundreds of thousands. Ferret is a great search-engine library, but it’s still a work in progress. ActiveRecord is a great framework, but it’s slow and cumbersome when you’re trying to do batch operations on thousands of records.

To put it simply: dealing with this much data requires more than convention.

So as we look toward a non-beta 1.0 release some time in 2008, I’m trying to figure out what the site architecture should be. Rails can still provide the web interface. Solr looks like a good stable search solution, with the added benefit that we can offer the search API to other developers.

The part with no obvious solution is the data store. We have thousands of documents, from more than a dozen different sources, in half a dozen different formats (PDF, HTML, text, even WordPerfect). Different documents have different kinds of metadata, and we need to be able to store, index, and retrieve all of that in a consistent way.

I’ve tried putting everything in a relational database, but databases weren’t really designed to hold large amounts of text. I’ve tried putting everything into structured XML documents, but it’s really hard to get the schema right. I’ve also used a combination of a relational database and static files, which works but can get complicated.

So if anybody knows of open-source software designed for this type of thing, or books I should be reading about information architecture, I’m all ears.