Posts Tagged “RDF”

I’ve been fascinated with RDF for years, but I always end up frustrated when I try to use it. How do you read/write/manipulate RDF data in code? Sure, there are lots of libraries, but they all represent RDF data as its primitive structures: statements, resources, literals, etc. Working with data through these APIs feels like using a glovebox. To get anything useful done, you have to define mappings between RDF properties/classes and normal data structures in your programming language — classes, maps, lists, whatever. In effect, you have to define everything twice.

Some Java APIs allow one to add annotation properties to classes and methods, with the annotations defining the mapping between Java objects and RDF triples. It’s convenient, and familiar if you’ve used Java persistence frameworks like Hiberante, but you still have to define everything twice — once in your RDF schema, once in Java code.

Other libraries generate Java source code from RDFS or OWL ontologies. This means you don’t have to define everything twice, but adds another step to the write-compile-run cycle, and limits you to the semantics that the code generator can understand. In particular, certain features of RDFS/OWL — multiple inheritance, sub-properties — do not map well into Java.

What I really wanted was a way to create and work with RDF data in Clojure, using the same map/set/sequence APIs that I use for any other Clojure data structure. I flirted with implementing RDF in Clojure but lost interest when I realized that 1) there’s a lot more to implementing RDF than datatype conversions; and 2) my Clojure library suffered from the same glovebox problem as the Java RDF libraries.

The solution, however, was staring me in the face all along. Clojure is a Lisp. I can generate functions directly, without any intermediate “source” representation. I can use my own customized validation and type-checking functions. Furthermore, I can extend the definitions in my RDF schema with new Clojure functions.

Here’s what I ended up with: I designed a simple OWL ontology using Protege 4 and saved it as RDF/XML. Then I used the Sesame 2 library to find all the RDF classes and properties defined in my ontology, and create the appropriate getter, setter, and constructor functions in Clojure. It looks something like this:

(defn intern-classes []
  (doseq [cls (find-all-classes *ontology*)]
    (let [name (resource-to-symbol cls)]
      (intern *ns* name (fn [] {:type name})))))

The resource-to-symbol function creates a symbol named for the local name of the RDF class, with the full URI of its XML namespace in the symbol’s metadata. The call to intern defines a new function that takes no arguments and returns a Clojure map with the symbol as its :type.

Suppose I have a class named Document in my ontology. I now have a Clojure function named Document that creates a new instance of that class, represented as a Clojure map. Furthermore, using Clojure hierarchies and the isa? function, I can generate Clojure code that implements the subclass relationships defined in the ontology. Whee!

I don’t entirely know where I’m headed with this, but I like the way it’s going. I can define my own data types, decide how they map to Clojure data structures, and have code that’s always up-to-date with my RDF vocabulary.

Comments 3 Comments »

When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize.  Normalized data is good for flexibility — you can write queries to recombine things in any combination.  Denormalized data is more efficient when you know, in advance, what the queries will be.  There’s a maxim, “normalize until it hurts, denormalize until it works.”

In the past two years, I’ve become a big fan of three software paradigms:

  1. RDF and the Semantic Web
  2. RESTful architectures
  3. Hadoop and MapReduce

The first two were made for each other — RDF is the ideal data format for the Semantic Web, and Semantic Web applications are (ideally) implemented with RESTful architectures.

Unfortunately, as I discovered through painful experience, RDF is an inefficient representation for large amounts of data.  It’s the ultimate in normalization — each triple represents a single fact — and in flexibility, but even simple queries become prohibitively slow when each item of interest is spread across dozens of triples.

At the other end of the spectrum, MapReduce programming forces you into an entirely different way of thinking.  MapReduce offers exactly one data access mechanism: read a list of records from beginning to end.  No random access, do not pass GO, do not collect $200.  That restriction is what enables Hadoop to distribute a job across many machines, and to process data at the maximum rate supported by the hard disk.  Obviously, to take full advantage of these optimizations, your MapReduce program needs to be able to process each record in isolation, without referring to any other resources.  In effect, everything has to be completely denormalized.

For my work on AltLaw, I’ve tended to use fully denormalized data, because anything else is too slow.  But I want to be working with normalized data.  I want to have every datum — plus information about its derivation — stored in one giant RDF graph.  But I also want to be able to process this graph efficiently.  Maybe if I had a dozen machines with 64 GB of memory apiece, this wouldn’t be a problem.  But with one desktop, one server, and occasional rides on EC2, that’s not an option.

The ideal design, I think, would be a hybrid system which can do efficient bulk processing of RDF data.  There are some pilot projects to do this with Hadoop, and I’m interested to see how they pan out.  But until then, I’ll have to make do as best I can.

Update: The projects I remebered were HRDF and Heart, which turn out to be the same thing: http://code.google.com/p/hrdf/ and http://wiki.apache.org/incubator/HeartProposal

Comments 3 Comments »

RDF is seductive.  I can’t get away from it.  Something about the ability to represent anything and everything in one consistent model just tugs at my engineer’s heartstrings.

The problem with RDF, as I’ve discovered through painful experience, is that the ability to represent everything sacrifices the ability to represent anything efficiently.  Certainly that is partly the fault of existing toolkits, but it’s still a fault.

But maybe I’m simply missing the point of RDF.  Its stated design goals do not include “efficient storage and retrieval.”  The W3C’s RDF recommendations make no mention of performance.  Tim Berners-Lee doesn’t mention it in his Semantic Web talks, except in airy references to future general purpose RDF engines.

RDF, it seems, is intended not as so much as a data storage model but as a communication model.  In short, it’s a protocol.  One of the goals Berners-Lee does talk about is exposing the information currently trapped inside relational databases.  So maybe I shouldn’t think about storing RDF and think instead about presenting it as a view of my own domain-specific data structures.

There are still some benefits to be had from storing and analyzing data in a graph-like structure.  But RDF is not necessarily the be-all and end-all of graphical models either.

Comments 2 Comments »

There’s a school of thought that URIs should be opaque identifiers with no inherent meaning or structure. I think this is clearly a bad idea on the human-facing web, but it is more reasonable for computer-facing web services.

However, I’ve been generating a lot of RDF lately, trying to organize piles of metadata in AltLaw. I use URIs that have meaning to me, although they aren’t formally specified anywhere. I realized that a URI can represent a lot of information — not just the identity of a resource, but also its type, version, date, etc. — in a compact form. I can write code that manipulates URIs to get key information about a resource more quickly than I could by querying the RDF database.

Unfortunately, RDF query languages like SPARQL assume that URIs themselves are just pointers that do not contain any information. I could easily generate additional RDF statements containing all the same information encoded in the URI, but that would triple the size of my database and slow down my queries. (My experience so far with Sesame 2.0 is that complex queries are slow.)

What I need is a rule-based language for describing the structure of a URI and what it means. This would be similar to URI templates, and would map parts of the URI to RDF statements about that URI.

So if I make a statement about a resource at
http://altlaw.org/cases/US/1204
(note: not a real URI), the RDF database automatically infers the following (in Turtle syntax):

<http://altlaw.org/cases/US/2008-02-14/1204>
        rdf:type        <http://altlaw.org/rdf/Case> ;
        dc:identifier   "1204" ;
        dc:jurisdiction "US" ;
        dc:date         "2008-02-14" .

Comments No Comments »

Let’s have the database models strut down the runway:

Relational (SQL):

  • Data consist of rows in tables.
  • Each row has multiple columns.
  • Each column has a fixed type.
  • Queries use filters and joins.
  • Fixed schema is defined separately from the data.
  • User-defined indexes improve query performance.
  • Robust transaction/data-integrity support.

Graph (RDF):

  • Data consist of nodes and labeled links between them.
  • Nodes may have type information.
  • Optional schema is defined in the same model (e.g. RDFS, OWL).
  • Queries use graph-based constraints and filters.
  • Indexes are not optimized for specific datasets.
  • Limited transaction support.

Tree (XML, filesystem):

  • Data are identified by a name and a position within a hierarchy.
  • Queries use hierarchies and filters.
  • Data may have type information.
  • Optional schema may be defined in the same model (XML-Schema) or outside it (DTD).
  • Transaction support varies by implementation.

Hash (BDB, memcached):

  • Data are untyped binary blobs.
  • Data are identified by unique keys in a flat namespace.
  • Only key-based lookups are possible.
  • Limited transaction support.

Inverted Index (Lucene/Solr):

  • Data are text documents.
  • Queries are free-form text with some special operators.
  • Text queries are fast.
  • Query results can be scored/ranked.
  • Joins are inefficient; denormalized data are preferred.
  • Limited transaction support.

Document-Oriented (CouchDB):

  • Data are identified by unique keys in a flat namespace.
  • Data have named fields.
  • Optional schema defined by application code.
  • Denormalized data are preferred.
  • Limited type information.

Column-Oriented (Hbase, BigTable):

  • Data consist of rows in tables.
  • Each row has multiple columns, rows are assumed to be sparse.
  • Data from a single column are stored contiguously to improve performance.
  • Designed for distributed processing environments such as Hadoop and MapReduce.

Comments No Comments »

Since I started working on AltLaw.org a little over a year ago, I’ve struggled with the data model. I’ve got around half a million documents, totally unstructured, from a dozen different sources. Each source has its own quirks and problems that have to be corrected before it’s useful.

When I started, I was using MySQL + the filesystem + Ferret. Keeping all that in sync, even with Rails’ ActiveRecord, was tricky, and searches were slow. Now I’ve got everything stored directly in a Solr index, which is very fast for searching. But it’s not so great when I want to change, say, how I convert the source PDFs into HTML, and I have to re-index half a million documents. And it can’t store structured data very well, so I augment it with serialized hash tables and SQLite.

What I want is a storage engine that combines properties of a relational database and a full-text search engine. It would put structured fields into SQL-like tables and unstructured fields into an inverted index for text searches. I want to be able to do queries that combine structured and unstructured data, like “cases decided in the Second Circuit between 1990 and 2000 that cite Sony v. Universal and contain the phrase ‘fair use’.” I also want to store metadata about my metadata, like where it came from, how accurate I think it is, etc.

I’ve been exploring some interesting alternatives: CouchDB is probably the closest to what I want, but it’s still under heavy development. DSpace is oriented more towards archival than search. Then there’s the big RDF monster, maybe the most flexible data model ever, which makes it correspondingly difficult to grasp.

I’m come to the conclusion that there is no perfect data model. Or, to put it another way, no storage engine is going to do my work for me. What I need to do is come up with a solid, flexible API that provides just the features I need, then put that API in front of one or more back-ends (Lucene, SQL, the filesystem) that handle whatever they’re best at.

Comments 3 Comments »