When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize.Â Normalized data is good for flexibility — you can write queries to recombine things in any combination.Â Denormalized data is more efficient when you know, in advance, what the queries will be.Â There’s a maxim, “normalize until it hurts, denormalize until it works.”
In the past two years, I’ve become a big fan of three software paradigms:
- RDF and the Semantic Web
- RESTful architectures
- Hadoop and MapReduce
The first two were made for each other — RDF is the ideal data format for the Semantic Web, and Semantic Web applications are (ideally) implemented with RESTful architectures.
Unfortunately, as I discovered through painful experience, RDF is an inefficient representation for large amounts of data.Â It’s the ultimate in normalization — each triple represents a single fact — and in flexibility, but even simple queries become prohibitively slow when each item of interest is spread across dozens of triples.
At the other end of the spectrum, MapReduce programming forces you into an entirely different way of thinking.Â MapReduce offers exactly one data access mechanism: read a list of records from beginning to end.Â No random access, do not pass GO, do not collect $200.Â That restriction is what enables Hadoop to distribute a job across many machines, and to process data at the maximum rate supported by the hard disk.Â Obviously, to take full advantage of these optimizations, your MapReduce program needs to be able to process each record in isolation, without referring to any other resources.Â In effect, everything has to be completely denormalized.
For my work on AltLaw, I’ve tended to use fully denormalized data, because anything else is too slow.Â But I want to be working with normalized data.Â I want to have every datum — plus information about its derivation — stored in one giant RDF graph.Â But I also want to be able to process this graph efficiently.Â Maybe if I had a dozen machines with 64 GB of memory apiece, this wouldn’t be a problem.Â But with one desktop, one server, and occasional rides on EC2, that’s not an option.
The ideal design, I think, would be a hybrid system which can do efficient bulk processing of RDF data.Â There are some pilot projects to do this with Hadoop, and I’m interested to see how they pan out.Â But until then, I’ll have to make do as best I can.