When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize. Normalized data is good for flexibility — you can write queries to recombine things in any combination. Denormalized data is more efficient when you know, in advance, what the queries will be. There’s a maxim, “normalize until it hurts, denormalize until it works.”
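As a concrete (entirely made-up) illustration in Python: the normalized form keeps each fact in one place and recombines the pieces at query time, while the denormalized form bakes the answer to the anticipated query into each record.

```python
# Normalized: shared facts stored once; a query joins the pieces back together.
courts = {"ca2": "2d Cir."}
cases = {42: {"title": "United States v. Carroll Towing", "court_id": "ca2"}}

def case_with_court(case_id):
    case = cases[case_id]
    return {"title": case["title"], "court": courts[case["court_id"]]}

# Denormalized: the record already contains everything the known query needs.
cases_flat = {42: {"title": "United States v. Carroll Towing", "court": "2d Cir."}}
```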
In the past two years, I’ve become a big fan of three software paradigms:
- RDF and the Semantic Web
- RESTful architectures
- Hadoop and MapReduce
The first two were made for each other — RDF is the ideal data format for the Semantic Web, and Semantic Web applications are (ideally) implemented with RESTful architectures.
Unfortunately, as I discovered through painful experience, RDF is an inefficient representation for large amounts of data. It’s the ultimate in normalization — each triple represents a single fact — and in flexibility, but even simple queries become prohibitively slow when each item of interest is spread across dozens of triples.
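Continuing the toy example above (the identifiers and predicate names are invented for illustration, not any real vocabulary), this is what "ultimate normalization" looks like: one opinion fans out into many (subject, predicate, object) triples, and even a trivial question has to match several of them.

```python
# The same toy case, exploded into (subject, predicate, object) triples.
triples = [
    ("case:42",    "title",      "United States v. Carroll Towing"),
    ("case:42",    "decidedBy",  "court:ca2"),
    ("case:42",    "authoredBy", "judge:hand"),
    ("court:ca2",  "label",      "2d Cir."),
    ("judge:hand", "name",       "Learned Hand"),
    # ...plus dozens more for dates, citations, derivation, etc.
]

def value_of(subject, predicate):
    """A naive lookup: scan the triples for a single fact."""
    return next(o for (s, p, o) in triples if s == subject and p == predicate)

# "Who wrote case 42, and in which court?" already touches four triples.
author = value_of(value_of("case:42", "authoredBy"), "name")
court = value_of(value_of("case:42", "decidedBy"), "label")
```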
At the other end of the spectrum, MapReduce programming forces you into an entirely different way of thinking. MapReduce offers exactly one data access mechanism: read a list of records from beginning to end. No random access, do not pass GO, do not collect $200. That restriction is what enables Hadoop to distribute a job across many machines, and to process data at the maximum rate supported by the hard disk. Obviously, to take full advantage of these optimizations, your MapReduce program needs to be able to process each record in isolation, without referring to any other resources. In effect, everything has to be completely denormalized.
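Here is a minimal sketch of that constraint in plain Python (not Hadoop's actual API, and a word-count-style toy rather than anything from AltLaw): the map step sees one record at a time, with no way to reach into other data, and the reduce step only sees values grouped by key.

```python
from collections import defaultdict

def map_record(record):
    """Emit (key, value) pairs from a single record, in complete isolation."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_pairs(pairs):
    """Group the emitted pairs by key and combine each key's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

records = ["the court held", "the appeal was denied"]
counts = reduce_pairs(pair for rec in records for pair in map_record(rec))
# {'the': 2, 'court': 1, 'held': 1, 'appeal': 1, 'was': 1, 'denied': 1}
```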
For my work on AltLaw, I’ve tended to use fully denormalized data, because anything else is too slow. But I want to be working with normalized data. I want to have every datum — plus information about its derivation — stored in one giant RDF graph. But I also want to be able to process this graph efficiently. Maybe if I had a dozen machines with 64 GB of memory apiece, this wouldn’t be a problem. But with one desktop, one server, and occasional rides on EC2, that’s not an option.
The ideal design, I think, would be a hybrid system which can do efficient bulk processing of RDF data. There are some pilot projects to do this with Hadoop, and I’m interested to see how they pan out. But until then, I’ll have to make do as best I can.
Update: The projects I remembered were HRDF and Heart, which turn out to be the same thing: http://code.google.com/p/hrdf/ and http://wiki.apache.org/incubator/HeartProposal
Does this kind of stuff help?
http://www.youtube.com/watch?v=BT-piFBP4fE
Hi,
I feel your pain. We’re currently doing a fair bit of research into column-based DBMSs like Hadoop’s HBase vs. row-based DBMSs.
In the most extreme cases, that amounts to RDF-style (key, field, value) vs. denormalised data. Heck, we’d even be happy running both kinds, as long as we can have them interact with each other.
I’m quite interested in the pilot projects you mentioned, though. The only one I’m aware of is HEART http://wiki.apache.org/incubator/HeartProposal, which is still just an incubation proposal. Could you list them or link out?
Regards
Ronald Hobbs
To Doug:
Thanks for the link. That lecture in particular doesn’t seem to immediately solve my problems — I’m not looking to build my own graph database from scratch! But I need to study the other lectures too and see if I could be using MapReduce more efficiently.