An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database sequentially and write it back out again, it takes 5.6 hours.
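The back-of-the-envelope arithmetic behind those numbers can be checked directly. The sequential figure follows straight from the stated throughput; for the seek-based figure I have to assume a record size (100 bytes, which is my assumption, not stated above) and that each in-place update costs roughly three 10ms seeks. A sketch:

```python
TERABYTE = 10**12          # bytes
THROUGHPUT = 100 * 10**6   # 100 MB/s, in bytes per second
SEEK = 0.01                # 10 ms per seek, in seconds

# Sequential rewrite: read the full terabyte, then write it all back.
sequential_s = 2 * TERABYTE / THROUGHPUT
print(sequential_s / 3600)   # hours -> about 5.6

# Random-access update: ASSUMING 100-byte records (10^10 records total),
# 1% of them is 10^8 updates. If each update costs ~3 seeks (seek to
# read, seek to write, rotational latency lumped in), transfer time for
# 100 bytes is negligible next to the seeks.
records = TERABYTE // 100
updates = records // 100
random_s = updates * 3 * SEEK
print(random_s / 86400)      # days -> about 35
```

Under those assumptions the numbers come out to roughly 5.6 hours versus 35 days, matching the scenario; the exact random-access total depends entirely on the record size and seeks-per-update you pick, but the two-to-three-orders-of-magnitude gap does not.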
This is why Hadoop only supports linear access to the filesystem. It’s also why Hadoop coder Tom White says disks have become tapes. All this is contrary to the way I think about data — building hash tables and indexes to do fast random-access lookups. I’m still trying to get my head around this concept of “linear” data processing. But I have found that I can do some things faster by reading sequentially through a batch of files than by stuffing everything into a database (RDF or SQL) and running big join queries.