My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.

Here’s some code (Apache license), including all the Apache jars needed to make it work:

tar-to-seq.tar.gz (6.1 MB)

Unpack it and run:

java -jar tar-to-seq.jar tar-file sequence-file

The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop “Text”), and each value is the binary contents of that file (a BytesWritable).
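
To make the key/value layout concrete, here is a minimal sketch of the tar-reading half of the job. It is not the code from the download above. It uses only the JDK, handles uncompressed tar input only, skips checksum verification, ignores long-name extensions, and holds every file in memory. The class name TarPairs and its helper methods are made up for this illustration. In the real tool, each pair would presumably be appended to a SequenceFile.Writer as a Text key and a BytesWritable value rather than collected in a Map.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of tar-to-seq.jar.
public class TarPairs {
    private static final int BLOCK = 512; // tar headers and data are padded to 512-byte blocks

    /** Reads an uncompressed tar stream into (file name -> contents) pairs, regular files only. */
    public static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> pairs = new LinkedHashMap<>();
        DataInputStream data = new DataInputStream(in);
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                data.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without an end-of-archive marker
            }
            if (header[0] == 0) {
                break; // a zero-filled block marks the end of the archive
            }
            String name = field(header, 0, 100);                           // name: bytes 0..99
            int size = Integer.parseInt(field(header, 124, 12).trim(), 8); // size: octal ASCII
            byte[] contents = new byte[size];                              // assumes entries under 2 GB
            data.readFully(contents);
            data.readFully(new byte[padding(size)]);                       // skip padding to next block
            byte type = header[156];                                       // '0' or NUL = regular file
            if (type == '0' || type == 0) {
                pairs.put(name, contents);
            }
        }
        return pairs;
    }

    /** Extracts a NUL-terminated ASCII field from a header block. */
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.US_ASCII);
    }

    private static int padding(int size) {
        return (BLOCK - size % BLOCK) % BLOCK;
    }

    /** Builds one tar entry for the demo; no checksum, so real tar tools would reject it. */
    public static byte[] entry(String name, byte[] contents) {
        byte[] header = new byte[BLOCK];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, nameBytes.length);
        byte[] sizeBytes = String.format("%011o", contents.length).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(sizeBytes, 0, header, 124, sizeBytes.length);
        header[156] = '0';
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, BLOCK);
        out.write(contents, 0, contents.length);
        out.write(new byte[padding(contents.length)], 0, padding(contents.length));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream tar = new ByteArrayOutputStream();
        byte[] a = entry("a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        byte[] b = entry("dir/b.bin", new byte[600]);
        tar.write(a, 0, a.length);
        tar.write(b, 0, b.length);
        tar.write(new byte[2 * BLOCK], 0, 2 * BLOCK); // end-of-archive marker
        Map<String, byte[]> pairs = readEntries(new ByteArrayInputStream(tar.toByteArray()));
        for (Map.Entry<String, byte[]> e : pairs.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

Collecting everything in a Map keeps the sketch short, but a real converter would write each pair out as soon as it is read, so memory use stays bounded by the largest single file.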

It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.

9 Responses to “A Million Little Files”
  1. Hi, many thanks for this great post and your work!
    It’s definitely what I needed.

    Do I need to write my own input reader to use binary content as value in hadoop?
    I was working on a modified InputReader, but got bored and couldn’t make it work.
    Some source code would be appreciated, if you already have some.

    Thanks.

  2. Rasit said, “Do I need to write my own input reader to use binary content as value in hadoop?”

    No, you can use Hadoop’s BytesWritable class.

  3. Stuart, I mean: which InputReader should I use (the InputReader that sends key-value pairs to the Mapper class)?
    Does Hadoop offer one, or should I extend an existing one?

  4. You don’t need to do anything special. The code I posted here produces Hadoop SequenceFiles. You can use the built-in Hadoop class SequenceFile.Reader to read them. Normally, all you need to do is:

    yourJobConf.setInputFormat(SequenceFileInputFormat.class);
    
  5. [...] to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile — tools like this are very useful, and it [...]

  6. Stuart,

    I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of the 40 MB/sec promised by Tom White’s book, it seems to be 40 sec/MB. Your tars would work about 5 times faster. But still, why is it so slow? Is there a way to speed this up?

    Thanks!

  7. Mark,

    I’m afraid I don’t know. From my understanding of HDFS, it depends on a lot of factors — the size of the files, the bandwidth to and within the cluster, and the hardware itself. Try the Hadoop mailing list. One thing I do know is that copying lots of small files to HDFS will be slower than copying a few big files.

    -Stuart

  8. Stuart,

    Thanks for posting this. It will help me a lot! It would be nice to see some options to set SequenceFile.CompressionType, or perhaps guess based on the incoming extension like you do in openInputFile().

    Thanks,
    Ryan

  9. [...] are they large, but we have a lot of them. We are using Hadoop after all. Stuart Sierra’s Tar-to-Seq utility was working quite nicely until this new input set, as those files were much [...]
