An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database sequentially and write it back out again, it takes 5.6 hours.
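The back-of-the-envelope arithmetic behind those numbers can be checked directly. The sequential figure follows straight from the stated throughput; for the seek-based figure I have to assume a record size (100 bytes, which is my assumption, not stated above) and that each in-place update costs roughly three 10ms seeks. A sketch:

```python
TERABYTE = 10**12          # bytes
THROUGHPUT = 100 * 10**6   # 100 MB/s, in bytes per second
SEEK = 0.01                # 10 ms per seek, in seconds

# Sequential rewrite: read the full terabyte, then write it all back.
sequential_s = 2 * TERABYTE / THROUGHPUT
print(sequential_s / 3600)   # hours -> about 5.6

# Random-access update: ASSUMING 100-byte records (10^10 records total),
# 1% of them is 10^8 updates. If each update costs ~3 seeks (seek to
# read, seek to write, rotational latency lumped in), transfer time for
# 100 bytes is negligible next to the seeks.
records = TERABYTE // 100
updates = records // 100
random_s = updates * 3 * SEEK
print(random_s / 86400)      # days -> about 35
```

Under those assumptions the numbers come out to roughly 5.6 hours versus 35 days, matching the scenario; the exact random-access total depends entirely on the record size and seeks-per-update you pick, but the two-to-three-orders-of-magnitude gap does not.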
This is why Hadoop only supports linear access to the filesystem. It’s also why Hadoop coder Tom White says disks have become tapes. All this is contrary to the way I think about data — building hash tables and indexes to do fast random-access lookups. I’m still trying to get my head around this concept of “linear” data processing. But I have found that I can do some things faster by reading sequentially through a batch of files than by stuffing everything into a database (RDF or SQL) and running big join queries.