I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer applies, and I’m back to writing giant for-each loops. It’s perl -n gone mad.
I’ve been trying for months to find the most efficient database for AltLaw: SQL, Lucene, RDF, even Berkeley DB. But it still takes hours and hours to process things. Maybe the secret is to get rid of the databases and just mash together some giant files and throw them at Hadoop.
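
To make the "giant for-each loop" idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop's API; the function names (`map_line`, `reduce_counts`, `run_job`) and the word-count task are assumptions chosen purely for illustration. The point is that nothing is ever looked up by key: the input is read once from beginning to end, and grouping happens by sorting.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line):
    """Map step (hypothetical): emit a (word, 1) pair for each word on a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce step (hypothetical): add up all the counts seen for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce over any iterable of text lines."""
    # Map: a single sequential pass over the input, with no index lookups.
    pairs = [pair for line in lines for pair in map_line(line)]
    # Shuffle: sort by key so that equal keys end up next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: walk the sorted pairs once, collapsing each run of equal keys.
    return dict(
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    sample = ["the court held", "The court reversed"]
    print(run_job(sample))
```

Because `run_job` accepts any iterable of lines, an open file object can be passed in directly, which is the single-machine equivalent of scanning a big file from start to finish. In a real cluster the map and reduce steps would run on many machines and the sort would be distributed, but the shape of the computation is the same.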
