My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.

Here’s some code (Apache license), including all the Apache jars needed to make it work:

tar-to-seq.tar.gz (6.1 MB)

Unpack it and run:

java -jar tar-to-seq.jar tar-file sequence-file

The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop “Text”), the value is the binary contents of the file (a BytesWritable).

It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.

12 Responses to “A Million Little Files”
  1. Hi, many thanks for this great post and your work!
    It’s definitely what I needed.

    Do I need to write my own input reader to use binary content as value in hadoop?
    I was working on a modified InputReader, but got bored, couldn’t make it work.
    Some source code is appreciated, if you have already.

    Thanks.

  2. Rasit said, “Do I need to write my own input reader to use binary content as value in hadoop?”

    No, you can use Hadoop’s BytesWritable class.

  3. Stuart, I mean, which InputReader should I use (InputReader which sends key-value pairs to Mapper class.).
    Do hadoop offer some? or should I extend one of existing?

  4. You don’t need to do anything special. The code I posted here produces Hadoop SequenceFiles. You can use the built-in Hadoop class SequenceFile.Reader to read them. Normally, all you need to do is:

    yourJobConf.setInputFormat(SequenceFileInputFormat.class);
    
  5. [...] to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile — tools like this are very useful, and it [...]

  6. Stuart,

    I have written to code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by the Tom White’s book, it seems to be 40 sec/Meg. Your tars would work about 5 times faster. But still, why is it so slow? Is there a way to speed this up?

    Thanks!

  7. Mark,

    I’m afraid I don’t know. From my understanding of HDFS, it depends on a lot of factors — the size of the files, the bandwidth to and within the cluster, and the hardware itself. Try the Hadoop mailing list. One thing I do know is that copying lots of small files to HDFS will be slower than copying a few big files.

    -Stuart

  8. Stuart,

    Thanks for posting this. It will help me a lot! It would be nice to see some options to set SequenceFile.CompressionType, or perhaps guess based on the incoming extension like you do in openInputFile().

    Thanks,
    Ryan

  9. [...] are they large, but we have a lot of them. We are using Hadoop after all. Stuart Sierra’s Tar-to-Seq utility was working quite nicely until this new input set, as those files were much [...]

  10. Hi,

    I tried to create a seq file with text files inside, using your tool but unfortunately when I later open it in HADOOP, I get different content:
    …6b 20 74 68 65 20 61 72 65 61 73 20 77 68 65 72 65 20 77 65 20 65 73 74 69 6d 61 74 65 64 20 74 68 65 20 74 72 61 69 6c 20 75 70 70 65 72 2d 6c 69 6d 69 74 20 66 72 6f 6d 20 74 68 65 20 73 74 61 6e 64 61 72 64 20 64 65 76 69 61 74 69 6f 6e 20 6f 66 20 61 6e 20 31 31 20 c3 97 20 31 31 20 70 69 78 65 6c 20 61 70 65 72 74 75 7…

    What could be the reason?

    Thank you,
    Jan

  11. Sorry, how to make it work? I got an error:

    prod2@ot-9h30d06:~/QSense/hadoop/hadoop-0.20.2/tar-to-seq$ java -jar tar-to-seq.jar ../change.tar.gz seq_file
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.

  12. oh. sorry. ignore my last post, it is just a warning. it works! thanks~
    BTW, do you happen to have a tool to convert multiple files from a directory to a single sequence file directly? I mean, to save the outside tar compress and un-compress cost.

Leave a Reply