I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer applies, and I’m back to writing giant for-each loops. It’s perl -n
gone mad.
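To make that concrete, a mapper in Hadoop's Java API really is just the body of that loop. Here's a rough sketch (the class and the line-counting task are made up for illustration, not anything from AltLaw):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each call to map() sees one line of an input file, much like the body of a
// perl -n loop; the framework drives the scan from start to end, and across
// machines.
public class GrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Made-up task: count lines that mention "hadoop".
        if (line.toString().toLowerCase().contains("hadoop")) {
            context.write(new Text("matching-lines"), ONE);
        }
    }
}
```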
I’ve been trying for months to find the most efficient database for AltLaw — SQL, Lucene, RDF, even Berkeley DB. But it still takes hours and hours to process things. Maybe the secret is to get rid of the databases and just mash together some giant files and throw them at Hadoop.
Your Hadoop link is broken.
And when you say “Hadoop”, that’s really a larger project; I assume you just meant the MapReduce part of it. The main parts:
1. HDFS (a GFS equivalent).
2. A MapReduce implementation.
3. And now HBase, a Bigtable-equivalent database.
While many of the MapReduce examples floating around scan large flat files, the “there is no database” statement isn’t technically true. In reality you can create your own InputFormats and RecordReaders to read from any number of databases, or whatever else.
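For example, a bare-bones custom InputFormat/RecordReader pair might look like the sketch below. The class names are invented, and it just wraps the stock line reader rather than hitting a real database, but the same hooks (createRecordReader, nextKeyValue, etc.) are where a database-backed reader would run its queries instead:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A toy custom InputFormat: the InputFormat decides how the input is split,
// and hands each split to a RecordReader, which turns raw bytes into
// key/value pairs for map(). This one just lower-cases every line.
public class LowercaseLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LowercaseLineRecordReader();
    }

    public static class LowercaseLineRecordReader
            extends RecordReader<LongWritable, Text> {

        private final LineRecordReader delegate = new LineRecordReader();
        private final Text current = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Advance the underlying line reader, then rewrite the value.
            if (!delegate.nextKeyValue()) {
                return false;
            }
            current.set(delegate.getCurrentValue().toString().toLowerCase());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return current;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```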
Specifically, HBase has a table mapper that you can use with your MR jobs. And if you really wanted to, you could write MySQL (or whatever) input formats, input splits, and record readers for your MR.
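A rough sketch of such a table mapper, assuming the org.apache.hadoop.hbase.mapreduce API (the column family and qualifier names here are made up):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Each call to map() receives one HBase row (its key plus the Result of the
// scan) instead of a line from a flat file.
public class DocCountMapper extends TableMapper<Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // Made-up column family and qualifier, purely for illustration.
        byte[] text = row.getValue(Bytes.toBytes("doc"), Bytes.toBytes("text"));
        if (text != null) {
            context.write(new Text("rows-with-text"), ONE);
        }
    }
}
```

You'd wire that into a job with TableMapReduceUtil.initTableMapperJob(...), and on the relational side Hadoop also ships a JDBC-backed DBInputFormat these days, so you might not even have to write that plumbing yourself.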