Dirty Necessary Money

Up at Cornell, Tom Bruce has a post about the problem of funding open access to legal materials. This brings to mind a conversation I had with a doctor friend recently about AltLaw. My friend, accustomed to the open-access requirements of NIH grants, was frankly shocked that there are no comparable rules for legal decisions.

[NIH Public Access Policy: screenshot of the PubMed home page]

A related problem is how to make people aware of what free services are available. AltLaw has been around for two years, and while traffic has grown steadily, it has not gotten as much attention as commercial startups operating similar services. Admittedly, we have done no advertising at all, and that’s our fault. “If you build it, they will come,” we thought, naïvely. But how would we advertise? I’m a programmer; the people I work with are law professors. None of us know the first thing about marketing, and quite frankly, none of us care. Seen in that light, Cornell’s recent partnership with Justia.com is a smart move that will benefit everyone working on open-access law, since it will expose more lawyers to the idea.

Tests Are Code

It’s interesting to see the first signs of rebellion against RSpec. I jumped on the RSpec bandwagon when it first appeared, mostly so I wouldn’t have to write “assert_equals” all the time. But while I liked and used RSpec, I don’t think it made my tests any better. If anything, they were a little bit worse. I found myself testing things that were not really relevant to the code I was writing, asserting obvious things like “a newly-created record should be empty.”

When I got interested in Clojure, one of the first things I wrote was a testing library called “test-is”. I borrowed from a lot of Common Lisp testing frameworks, especially the idea of a generic assertion macro called “is”. It looks like this:

(deftest test-my-function
  (is (= 7 (my-function 3 4)))
  (is (even? (my-function 10 2))))

This is pretty basic, but it’s sufficient for low-level unit testing. So far, I think that’s how the library has been typically used. There have been occasional requests, however, for RSpec-style syntax. I can see how this would be useful for testing at a level higher than individual functions, but I have come to believe that the added semantics of RSpec are not really necessary.

Right now, the test-is library is built on the same abstractions as Clojure itself. Tests are functions, so you can apply all the same tools that already exist for handling functions. Tests can be called by name, organized into namespaces, and composed. There is almost no extra bookkeeping code that I need to write to make all of this work.
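For example (a sketch, not from the library’s documentation; the second test name and the namespace are assumptions), the test defined above can be called directly, composed into a suite by hand, or run with everything else in its namespace:

(test-my-function)                 ; call one test like any other function

(defn math-suite []                ; compose tests by composing functions
  (test-my-function)
  (test-my-other-function))

(run-tests 'my.app.math-test)      ; or run a whole namespace of tests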

In contrast, if I were to adopt the RSpec style, I would have to write code to call, store, and organize tests. That’s more work for me, and ultimately restricts the flexibility of the library for people who use it. Furthermore, RSpec has its own set of semantics, above and beyond the language itself, which must be learned.

This is my first experience supporting a library for anyone other than myself, and I don’t want to force anyone into a particular style. A library like RSpec is a complete environment that attempts to anticipate all possible usage scenarios, so it’s grown correspondingly complicated. I want to provide a set of small tools that can be combined with other tools to do interesting things.

Of course, by making that decision I’m already dictating, to some extent, how the library can be used. But really, what I’m trying to do is set limits for myself. I will commit to providing a flexible, extensible set of functions and macros for writing tests. I am explicitly not trying to provide a complete testing framework. If someone wants to build an RSpec-style framework on top of test-is, more power to them. I will happily try to make test-is easier to integrate into that framework.

But there’s one other thing that struck me about that article that I linked to at the beginning — the idea of putting tests and code in the same file. I think that’s a great idea, and Clojure comes ready-made to implement it. Clojure supports the idea of “metadata” on definitions. You can attach a set of arbitrary properties to any object, without affecting the value of that object.
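For example, a test can ride along as :test metadata on a var and be run later with clojure.core’s test function. This is just a sketch, and it assumes the “is” macro from my library is referred into the namespace:

(defn ^{:test (fn [] (is (= 9 (square 3))))}
  square [x]
  (* x x))

(test #'square)   ; runs the :test fn attached to the var, returns :ok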

It’s easy to attach a test function as metadata on a definition in Clojure, but the syntax is a little ugly, and there is no easy way to remove the tests from production code. So I came up with an addition to my library: the “with-test” macro. It lets you wrap any definition in a set of tests. It looks like this:

(with-test
 (defn add-numbers [a b]
   (+ a b))
 (is (= 7 (add-numbers 3 4)))
 (is (= -4 (add-numbers -6 2))))

This is equivalent to adding metadata to the function, but the syntax is a little cleaner. I’ve also added a global variable, “*load-tests*”, which can be set to false to omit tests when loading production code.
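For instance, a production build might do something like this (a sketch; exactly where *load-tests* lives depends on how the library is loaded):

(binding [*load-tests* false]     ; definitions are loaded, tests are skipped
  (require 'my.app.core :reload))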

I like having each function right next to its tests. It makes it easier to remember to write tests, and easier to see how the function is supposed to behave. So to the extent that test-is will promote a testing style, this is it. But it’s a pretty radical departure from the traditional style of testing, so I’m not sure how others will react to it.

Data Are Data, Not Objects

A small post for the new year (in which I have resolved to write more).

The standard introduction to object-oriented programming teaches you to create a class for each type of thing you want to deal with in your program. So if you’re writing a payroll program, you would have an Employee class, a Department class, and so on. The methods of those classes are supposed to model some sort of real-world behavior. You quickly realize that real-world objects are not so easily separable, so you learn about inheritance, abstract classes, polymorphism, virtual methods, overloading, and all the other gobbledygook that is supposed to bring object-oriented programming closer to the real world, but usually just confuses programmers.

Like many others, I became suspicious of this technique after trying to use library classes that did not provide the methods I needed. Subclassing is supposed to provide a way to add new behavior to classes, but it is often thwarted by private variables, final methods, and the admonition that modifying the internals of a class is “breaking encapsulation.”

The joy I experienced when I first discovered Perl was due in large part to the realization that it provided just a few simple data structures — scalar, array, hash — that were sufficient for any sort of object I wanted to model. Moreover, every CPAN library used those same data structures, so it was easy to link them together and extend them where I needed to. Even Perl “objects” are just data structures with some added functionality. I was similarly delighted by Clojure’s abstract data structures — list, vector, map, set — all of which are manipulated with a few generic functions.

The problem with defining your own classes is that every class you define has its own, unique semantics. Someone who wants to use your class has to learn those semantics, which may not be suitable for how they want to use it. I once read (I don’t remember where) that classes are good for modeling abstract, mathematical entities like sets, but they fall apart when trying to model the real world.

So here’s a slightly radical notion: don’t use classes to model the real world. Treat data as data. Every modern programming language has at least a few built-in data structures that usually provide all the semantics you need. Even Java, the prince of “everything is a class” languages, has an excellent collections library. If your program has a list of names, you don’t need to invent a NameList object, just use a List<String>. Don’t hide it behind a specialized interface. The interface is already there: it’s a List. If somebody wants to sort the list, they already know how to do it, and you never have to write a SortedNameList class.
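The same argument, sketched in Clojure (the names are made up): a list of names is just a sequence of strings, so every existing sequence function already applies to it.

(def names ["Carol" "Alice" "Bob"])

(sort names)                          ; => ("Alice" "Bob" "Carol")
(filter #(.startsWith % "B") names)   ; => ("Bob")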

This is an important idea behind JSON (and YAML) — the semantics are deliberately limited, so you know what to expect. That’s also why JSON is popular for sharing data between programs written in different languages — the semantics are simple, so they’re easy to implement. The point is, don’t create new semantics when you don’t need to — you’re only making it harder to understand, extend, and reuse your code.

Antidenormalizationism

When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize. Normalized data is good for flexibility — you can write queries to recombine things in any combination. Denormalized data is more efficient when you know, in advance, what the queries will be. There’s a maxim, “normalize until it hurts, denormalize until it works.”
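To make the trade-off concrete, here is a toy example in Clojure (invented field names, not AltLaw’s schema):

;; Normalized: courts and cases live in separate tables, joined by id.
(def courts {1 {:id 1 :name "Second Circuit"}})
(def cases  [{:id 100 :caption "Smith v. Jones" :court-id 1}])

;; Denormalized: the court name is copied into every case record, so the
;; queries that need it never have to do a join.
(def cases* [{:id 100 :caption "Smith v. Jones" :court "Second Circuit"}])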

In the past two years, I’ve become a big fan of three software paradigms:

  1. RDF and the Semantic Web
  2. RESTful architectures
  3. Hadoop and MapReduce

The first two were made for each other — RDF is the ideal data format for the Semantic Web, and Semantic Web applications are (ideally) implemented with RESTful architectures.

Unfortunately, as I discovered through painful experience, RDF is an inefficient representation for large amounts of data. It’s the ultimate in normalization — each triple represents a single fact — and in flexibility, but even simple queries become prohibitively slow when each item of interest is spread across dozens of triples.
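For illustration (invented predicates, not real AltLaw data), here is the same item as RDF-style triples and as one self-contained record:

;; One triple per fact; a real case carries dozens of these.
(def triples
  [["case/100" "dc:title"  "Smith v. Jones"]
   ["case/100" "dc:date"   "1999-03-02"]
   ["case/100" "law:court" "Second Circuit"]])

;; The same case as a single record.
(def case-100
  {:id 100 :title "Smith v. Jones" :date "1999-03-02" :court "Second Circuit"})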

At the other end of the spectrum, MapReduce programming forces you into an entirely different way of thinking. MapReduce offers exactly one data access mechanism: read a list of records from beginning to end. No random access, do not pass GO, do not collect $200. That restriction is what enables Hadoop to distribute a job across many machines, and to process data at the maximum rate supported by the hard disk. Obviously, to take full advantage of these optimizations, your MapReduce program needs to be able to process each record in isolation, without referring to any other resources. In effect, everything has to be completely denormalized.
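Here is a toy sketch of that style in plain Clojure (not Hadoop, and not AltLaw code): each record is mapped entirely on its own, and only the per-key reduce ever sees more than one value.

(defn map-step [record]
  [(:court record) 1])                 ; emit one key/value pair per record

(defn reduce-step [court counts]
  [court (reduce + counts)])           ; combine all the values for one key

;; Count cases per court over a stream of fully denormalized records.
(def records [{:court "2d Cir."} {:court "9th Cir."} {:court "2d Cir."}])

(for [[court pairs] (group-by first (map map-step records))]
  (reduce-step court (map second pairs)))
;; => (["2d Cir." 2] ["9th Cir." 1])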

For my work on AltLaw, I’ve tended to use fully denormalized data, because anything else is too slow. But I want to be working with normalized data. I want to have every datum — plus information about its derivation — stored in one giant RDF graph. But I also want to be able to process this graph efficiently. Maybe if I had a dozen machines with 64 GB of memory apiece, this wouldn’t be a problem. But with one desktop, one server, and occasional rides on EC2, that’s not an option.

The ideal design, I think, would be a hybrid system which can do efficient bulk processing of RDF data. There are some pilot projects to do this with Hadoop, and I’m interested to see how they pan out. But until then, I’ll have to make do as best I can.

Update: The projects I remembered were HRDF and Heart, which turn out to be the same thing: http://code.google.com/p/hrdf/ and http://wiki.apache.org/incubator/HeartProposal

Fragmentation and the Failure of the Web

What makes an on-line community? In the past two weeks I have received announcements of three new “communities,” all interested in using open-source software to retrieve, share, and analyze data from or about governments. Most of these announcements say the same thing: “A lot of people seem to be working on this, but they aren’t talking to each other.” Each group has a slightly different slant, but in my mind I lump them all under the heading “Semantic Government,” i.e. building the semantic web for government data.

I started casting out a few search queries, and quickly compiled a list of eight different mailing lists and/or wikis devoted to this subject. That doesn’t include for-profits like Justia.com or larger non-profits like the Sunlight Foundation.

This is a problem. Not only do I have to subscribe to half a dozen mailing lists to keep abreast of what others are doing, but I also have to cross-post to several lists when I want to announce something myself. So far, nothing I have posted to these lists has garnered as much response as private emails sent directly to people who I know are subscribed to the lists.

Perhaps the very idea of a “web-based community” has become a victim of its own success. Back in the olden days, when I was still learning how to type, creating an on-line community was hard. You had to wrangle with BBS software, mailing list managers, or content management systems. It took dedicated individuals willing to invest considerable time and money. Now? Just go to Google / Yahoo / Facebook / whatever flavor-of-the-month service, type in the name of your group and presto, you’re a “community.”

The problem is that it’s now easier to start a group than to join one. Every project wants to be the center of its own community, but what most projects actually get is a lonely soapbox in the wilderness from which to cry, “Announcing version 0.x…”

I’m equally guilty in this trend, having founded one of the sites I referred to above (LawCommons) and built a wiki for another (IGOTF). Once you’ve started a site it’s easier to leave it there than to formally announce “I am shutting down X and throwing my lot in with Y.” It’s also a hedge against the (very likely) possibility that group Y won’t be around in a year. But I worry that a broad movement (Semantic Government) fragmented into so many tiny sub-groups will never gather enough momentum to succeed. The very thing we all want — to share information better — is lost through the scattered efforts to achieve it.

Whither RDF?

RDF is seductive. I can’t get away from it. Something about the ability to represent anything and everything in one consistent model just tugs at my engineer’s heartstrings.

The problem with RDF, as I’ve discovered through painful experience, is that the ability to represent everything sacrifices the ability to represent anything efficiently. Certainly that is partly the fault of existing toolkits, but it’s still a fault.

But maybe I’m simply missing the point of RDF. Its stated design goals do not include “efficient storage and retrieval.” The W3C’s RDF recommendations make no mention of performance. Tim Berners-Lee doesn’t mention it in his Semantic Web talks, except in airy references to future general purpose RDF engines.

RDF, it seems, is intended less as a data storage model than as a communication model. In short, it’s a protocol. One of the goals Berners-Lee does talk about is exposing the information currently trapped inside relational databases. So maybe I shouldn’t think about storing RDF and think instead about presenting it as a view of my own domain-specific data structures.

There are still some benefits to be had from storing and analyzing data in a graph-like structure. But RDF is not necessarily the be-all and end-all of graph-based models either.

Clojure for the Semantic Web

I dropped in to hear Rich Hickey talk about Clojure at the New York Semantic Web meetup group. Some highlights:

• Some programs, like compilers or theorem provers, are themselves functions. They take input and produce output. Purely functional languages like Haskell are good for these kinds of programs. But other programs, like GUIs or automation systems, are not functions. For example, a program that runs continuously for months or years is not a function in the mathematical sense. Clojure is mostly functional, but not purely functional.

• Most Clojure programmers go through an arc. First they think “eww, Java” and try to hide all the Java. Then they think “ooh, Java” and realize that Clojure is a powerful way to write Java code. Rich frowns upon “wrapper” functions in Clojure that do nothing but wrap a Java method. Calling the Java method directly is faster and easier to look up in JavaDoc.

• Rich recommended a paper, Out of the Tar Pit, for a discussion of functional and relational techniques to manage state.

• Clojure’s data structures are persistent. This isn’t “persistent” in the stored-in-a-database sense; it refers to immutability. For example, adding an element to a vector creates a new vector that shares structure with the old one. Because all data structures are immutable, this is both safe and efficient. Lookups in Clojure’s hash maps, for example, take time proportional to log base 32 of the number of entries, which grows so slowly it’s practically constant. (A quick illustration follows these notes.)

• The first thing Rich did when experimenting with the semantic web was to pull data out of the Jena API and get it into Clojure data structures. That allows him to leverage the full power of Clojure’s data manipulation functions and opens up a world of possibilities he wouldn’t have if he stuck with Jena objects. Basically, having your data trapped inside objects is bad, because you’re limited to whatever methods those objects provide. With generic data structures, you can reuse and compose all the functions that Clojure already provides. (A sketch of the idea follows these notes.)
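To make the persistence point concrete, here is a quick illustration (mine, not from the talk):

(def v [1 2 3])
(def v2 (conj v 4))   ; builds a new vector that shares structure with v

v    ; => [1 2 3]     ; the original is untouched
v2   ; => [1 2 3 4]

And a rough sketch of what “pull the data out of Jena” might look like. This is my guess, not Rich’s code, and it assumes Jena’s Model/Statement API:

(defn statements->triples [model]
  (for [stmt (iterator-seq (.listStatements model))]
    [(str (.getSubject stmt))      ; subject URI
     (str (.getPredicate stmt))    ; predicate URI
     (str (.getObject stmt))]))    ; object URI or literal

;; From here, ordinary filter, map, and group-by work on the triples.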

Screencasts and code from the talk should appear soon — watch clojure.org or the Clojure Google group for an announcement.

The Document-Blob Model

Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point.

Gosh darn, it’s hard to get this right. In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing using Hadoop and Thrift.

I’ve described this before, but here’s a summary of my situation:

  1. I have a few million files, ~15 GB total.
  2. Many files represent the same logical entity, sometimes in different formats or from different sources.
  3. Every file needs 5-10 steps of clean-up, data extraction, and format conversion.
  4. The files come from ~15 different sources, each requiring different processing.
  5. I get 20-50 new files daily.

I want to be able to process all this efficiently, but I also want to be able to change my mind. I can never say, “I’ve run this process on this batch of files, so I never need to do it again.” I might improve the code, or I might find that I need to go back to the original files to get some other kind of data.

Hadoop was the obvious choice for efficiency. Thrift is a good, compact data format that’s easier to use than Hadoop’s native data structures. The only question is, what’s the schema? Or, more simply, what do I want to store?

I’ve come up with what, for want of a better term, I call the Document-Blob Model.

I start with a collection of Documents. Each Document represents a single, logical entity, like a court case or a section of statute. A Document contains an integer identifier and an array of Blobs, nothing more.

What is a Blob? Good question. It’s data, any data. It may represent a normal file, in which case it stores the content of that file and some metadata like the MIME type. It may also represent a data structure. Because unused fields do not occupy any space in Thrift’s binary format, the Blob type can have fields for every structure I might want to use now or in the future. In effect, a Blob is a polymorphic type that can become any other type.

So how do I know which type it is? By where it came from. Each Blob is tagged with the name of its “provider”. For files downloaded in bulk, the provider is the web site or service where I got them. For generated files, the provider is the class or script that generated them, along with a version number.

So I have a few hundred thousand Documents, each containing several Blobs. I represent each conversion/extraction/processing step as its own Java class. All those classes implement the same, simple interface: take one Blob as input and return another Blob as output.

Helper functions allow me to say things like, “Take this Document, find the Blob that was generated by class X. Run class Y on that Blob, and append result to the original Document.” In this way, I can stack multiple processing steps into a single Hadoop job, but retain the option of reusing or rearranging those steps later on.
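The real implementation is Java plus Thrift, but the shape of those helpers, paraphrased in Clojure with plain maps standing in for the Thrift structs (all names here are invented), is roughly:

(defn blob-from [document provider]
  (first (filter #(= provider (:provider %)) (:blobs document))))

(defn apply-step [document in-provider step out-provider]
  (let [input  (blob-from document in-provider)
        output (assoc (step input) :provider out-provider)]
    (update-in document [:blobs] conj output)))

;; Steps compose: find the blob produced by "class-x", run a step on it,
;; and append the result to the same document, tagged as "class-y".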

Will this work? I have no idea. I just came up with it last week. Today I successfully ran a 5-step job on ~700,000 documents from the public.resource.org federal case corpus. It took about an hour on a 10-node Hadoop/EC2 cluster.

The real test will come when I apply this model to the much messier collection of files we downloaded directly from the federal courts.

Thrift vs. Protocol Buffers

Google recently released its Protocol Buffers as open source. About a year ago, Facebook released a similar product called Thrift. I’ve been comparing them; here’s what I’ve found:

Backers
  Thrift: Facebook, Apache (accepted for incubation)
  Protocol Buffers: Google

Bindings
  Thrift: C++, Java, Python, PHP, XSD, Ruby, C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell
  Protocol Buffers: C++, Java, Python (Perl, Ruby, and C# under discussion)

Output Formats
  Thrift: Binary, JSON
  Protocol Buffers: Binary

Primitive Types
  Thrift: bool, byte, 16/32/64-bit integers, double, string, byte sequence, map<t1,t2>, list<t>, set<t>
  Protocol Buffers: bool, 32/64-bit integers, float, double, string, byte sequence; “repeated” properties act like lists

Enumerations
  Thrift: Yes
  Protocol Buffers: Yes

Constants
  Thrift: Yes
  Protocol Buffers: No

Composite Type
  Thrift: struct
  Protocol Buffers: message

Exception Type
  Thrift: Yes
  Protocol Buffers: No

Documentation
  Thrift: So-so
  Protocol Buffers: Good

License
  Thrift: Apache
  Protocol Buffers: BSD-style

Compiler Language
  Thrift: C++
  Protocol Buffers: C++

RPC Interfaces
  Thrift: Yes
  Protocol Buffers: Yes

RPC Implementation
  Thrift: Yes
  Protocol Buffers: No

Composite Type Extensions
  Thrift: No
  Protocol Buffers: Yes

Overall, I think Thrift wins on features and Protocol Buffers win on documentation. Implementation-wise, they’re quite similar. Both use integer tags to identify fields, so you can add and remove fields without breaking existing code. Protocol Buffers support variable-width encoding of integers, which saves a few bytes. (Thrift has an experimental output format with variable-width ints.)

The major difference is that Thrift provides a full client/server RPC implementation, whereas Protocol Buffers only generate stubs to use in your own RPC system.

Update July 12, 2008: I haven’t tested for speed, but from a cursory examination it seems that, at the binary level, Thrift and Protocol Buffers are very similar. I think Thrift will develop a more coherent community now that it’s under Apache incubation. It just moved to a new web site and mailing list, and the issue tracker is active.

Moving the ‘C’ in MVC

I’m sure I’m not the first to suggest this, but here goes.

Ever since somebody first thought of applying the Model-View-Controller paradigm to the web, we’ve had this:

[Diagram: MVC With Controller on the Server]

The View is a conflation of HTML and JavaScript. JavaScript is an afterthought, a gimmick to make pages “dynamic.” All the real action is in the Controller, which talks to the database, processes the internal application logic, and renders templates before sending complete pages back to the client.

But what if we implement the Controller entirely in JavaScript?

[Diagram: MVC With Controller on the Client]

Now we can put the Controller on the client, and build a RESTful HTTP interface to communicate with the database.

Obviously there are many issues to consider. First and foremost is making sure that rogue clients cannot do anything to the database they’re not supposed to. But that’s a manageable problem — Amazon S3 is a good example. Apps that run entirely in the client can even be made more secure than their server-based counterparts, because encryption can be implemented entirely in the client, so that the server never sees the unencrypted data. (Clipperz, a password-storage service, calls this a zero-knowledge web app.)

There are some interesting possibilities. For example, the entire application, including the current state of the model, can be downloaded as a single web page for off-line use. (Clipperz supports this.) Also, the same application could connect to multiple data sources. And as with any RESTful architecture, back-end scaling is relatively easy.

Update July 10, 2008: I’m always amazed when one of my posts shows up on Reddit. Maybe it was the diagrams. In any case, thanks to everyone who sent in comments. A couple of responses:

  1. Yes, in a sense I’ve described Ajax. But most Ajax-related code around the web these days is still in the “dynamic view” mode rather than the “client-side controller” mode.
  2. I like Sun’s MVC diagram in which the View takes an active role in rendering the model rather than being just a template. It’s actually quite similar to what I’m suggesting here.
  3. Some MVC frameworks, such as Ruby on Rails, insist that logic in the View is bad, but then they include all these Ajax view helpers, so it’s a bit of a mixed message.
  4. I’m not insisting that all business logic be implemented client-side. Rather, I’m assuming some kind of “smart” database, with a RESTful front-end, that’s capable of containing business logic. Back in the day, these were SQL stored procedures. Now it’s probably something like CouchDB.
  5. Yes, this design is bad for search engines, bookmarks, and deep linking. But there are plenty of cases where those don’t matter. Look at Google Mail, for example. It basically follows the design I’ve laid out here: the entire app is one HTML page (or a very few pages) with behavior implemented in client-side JavaScript.

Update August 7, 2008: This is an example of the code-on-demand style described in Roy Fielding’s REST thesis. Link from Joe Gregorio.