Archive for October, 2009

Data formats are annoying.  As much as half the code in any large software project consists of translating from one data representation — objects, SQL tables, files, XML, RDF, JSON, YAML, CSV, Protocol Buffers, Avro, XML-RPC — to another.

Each format has its own strengths and weaknesses.  Often, no single representation is complete enough to be considered “canonical.”  The only canonical representation is an abstract one, a platonic ideal in the mind of some developer.  Since this platonic ideal cannot be implemented in code, different people have different expectations for how a particular model is supposed to work.

There are two options: Either you re-implement the model, with all its features and constraints, for each format, and hand-code all the translations; or you use a “smart” library that automatically translates between different representations.  ActiveRecord and Hibernate are popular examples of the latter.

The problem with “smart” libraries is that they can never be smart enough.  At some point you always have to dig into the generated SQL or whatever to make them work efficiently, or even correctly.  Frequently this is impossible without hacking the library sources, a daunting tangle of generated and meta-programmed code.  The library that was supposed to make your life easier instead makes it hell.

Do these “smart” libraries really save any time?  Would it be easier to just write the translation code in the first place?  We’ll never know, because programmers can’t resist “smart” systems, the myth that you can “do more with less code.”  You can never do more with less, unless what you’re doing is the lowest common denominator of what everyone else is doing.  And if that is what you’re doing, then why bother?

Comments 2 Comments »

I’ve been fascinated with RDF for years, but I always end up frustrated when I try to use it. How do you read/write/manipulate RDF data in code? Sure, there are lots of libraries, but they all represent RDF data as its primitive structures: statements, resources, literals, etc. Working with data through these APIs feels like using a glovebox. To get anything useful done, you have to define mappings between RDF properties/classes and normal data structures in your programming language — classes, maps, lists, whatever. In effect, you have to define everything twice.

Some Java APIs allow one to add annotation properties to classes and methods, with the annotations defining the mapping between Java objects and RDF triples. It’s convenient, and familiar if you’ve used Java persistence frameworks like Hiberante, but you still have to define everything twice — once in your RDF schema, once in Java code.

Other libraries generate Java source code from RDFS or OWL ontologies. This means you don’t have to define everything twice, but adds another step to the write-compile-run cycle, and limits you to the semantics that the code generator can understand. In particular, certain features of RDFS/OWL — multiple inheritance, sub-properties — do not map well into Java.

What I really wanted was a way to create and work with RDF data in Clojure, using the same map/set/sequence APIs that I use for any other Clojure data structure. I flirted with implementing RDF in Clojure but lost interest when I realized that 1) there’s a lot more to implementing RDF than datatype conversions; and 2) my Clojure library suffered from the same glovebox problem as the Java RDF libraries.

The solution, however, was staring me in the face all along. Clojure is a Lisp. I can generate functions directly, without any intermediate “source” representation. I can use my own customized validation and type-checking functions. Furthermore, I can extend the definitions in my RDF schema with new Clojure functions.

Here’s what I ended up with: I designed a simple OWL ontology using Protege 4 and saved it as RDF/XML. Then I used the Sesame 2 library to find all the RDF classes and properties defined in my ontology, and create the appropriate getter, setter, and constructor functions in Clojure. It looks something like this:

(defn intern-classes []
  (doseq [cls (find-all-classes *ontology*)]
    (let [name (resource-to-symbol cls)]
      (intern *ns* name (fn [] {:type name})))))

The resource-to-symbol function creates a symbol named for the local name of the RDF class, with the full URI of its XML namespace in the symbol’s metadata. The call to intern defines a new function that takes no arguments and returns a Clojure map with the symbol as its :type.

Suppose I have a class named Document in my ontology. I now have a Clojure function named Document that creates a new instance of that class, represented as a Clojure map. Furthermore, using Clojure hierarchies and the isa? function, I can generate Clojure code that implements the subclass relationships defined in the ontology. Whee!

I don’t entirely know where I’m headed with this, but I like the way it’s going. I can define my own data types, decide how they map to Clojure data structures, and have code that’s always up-to-date with my RDF vocabulary.

Comments 3 Comments »

My Hadoop World NYC talk went off well; here are my slides [PDF]

Comments 1 Comment »