An interesting scenario from Doug Cutting: say you have a terabyte of data on a disk with a 10 ms seek time and 100 MB/s maximum throughput, and you want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database sequentially and write it back out again, it takes 5.6 hours.
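Here is a back-of-the-envelope check of those figures. This is only a sketch: the 100-byte record size and the roughly three seeks per random update are my assumptions, not part of Cutting's scenario, but with them the numbers come out as stated.

```python
# Back-of-the-envelope check of Cutting's scenario.
# Assumed (not given above): 100-byte records, and about three disk
# seeks per random update (e.g. an index traversal plus the write-back).

TB = 10**12               # one terabyte, in bytes
SEEK_TIME = 0.010         # 10 ms per seek, in seconds
THROUGHPUT = 100 * 10**6  # 100 MB/s sequential transfer rate
RECORD_SIZE = 100         # assumed record size, in bytes
SEEKS_PER_UPDATE = 3      # assumed seeks to locate and rewrite one record

records = TB // RECORD_SIZE        # 10 billion records
updates = records // 100           # update 1% of them

random_seconds = updates * SEEKS_PER_UPDATE * SEEK_TIME
sequential_seconds = 2 * TB / THROUGHPUT   # read it all, write it all back

print(f"random-access updates: {random_seconds / 86400:.0f} days")     # ~35 days
print(f"sequential rewrite:    {sequential_seconds / 3600:.1f} hours") # ~5.6 hours
```

The sequential figure falls out directly from the transfer rate; the 35-day figure only appears if each update costs a few seeks, which is plausible for any index-backed store.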
This is why Hadoop is built around linear, streaming access to the filesystem. It's also why Hadoop committer Tom White says disks have become tapes. All this is contrary to the way I think about data: building hash tables and indexes to do fast random-access lookups. I'm still trying to get my head around this concept of "linear" data processing, but I have found that I can do some things faster by reading sequentially through a batch of files than by stuffing everything into a database (RDF or SQL) and doing big join queries.
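A minimal sketch of that pattern, with made-up file names, CSV layout, and update rule: instead of seeking to the handful of records you want to change, stream everything through once and write the whole batch back out.

```python
import csv

def rewrite_sequentially(in_path, out_path):
    """One linear pass: read every record, update the ones that need it,
    and write everything back out. No per-record seeks."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["status"] == "pending":   # the small fraction we actually touch
                row["status"] = "processed"
            writer.writerow(row)             # but every record gets written back

# Hypothetical file names, for illustration only.
rewrite_sequentially("records.csv", "records.updated.csv")
```

The counterintuitive part is that writing back the 99% you didn't change is cheaper than seeking to the 1% you did.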
[…] it comes to scaling Web applications, every experienced Web architect eventually realizes that Disk is the New Tape. Getting data off of the hard drive is slow compared to getting it from memory or from […]
[…] to the system. How important is improving seek time when accessing data on a hard drive? It can make the difference between hours and days when flushing a hundred gigabytes of writes to disk. Disk is the new […]
[…] In the early aughts, Jeffrey Dean and Sanjay Ghemawat developed the MapReduce programming model at Google to optimize the process of ranking web pages. MapReduce works well for I/O-bound problems where the computation on each record is small but the number of records is large. It specifically addresses the performance characteristics of modern commodity hardware, especially “disk is the new tape.” […]