A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.

Here’s some code (Apache license), including all the Apache jars needed to make it work:

tar-to-seq.tar.gz (6.1 MB)

Unpack it and run:

java -jar tar-to-seq.jar tar-file sequence-file

The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop “Text”); each value is the binary contents of that file (a BytesWritable).
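
For the curious, the heart of the conversion is just a loop over tar entries, appending each one to a SequenceFile.Writer. Here’s a rough sketch of the idea (not the actual tar-to-seq source), assuming Apache Commons Compress for the tar reading and the old-style Hadoop SequenceFile API:

import java.io.BufferedInputStream;
import java.io.FileInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TarToSeqSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        // BLOCK compression packs many records into each compressed block.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]),
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        TarArchiveInputStream tar = new TarArchiveInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue; // skip directories, links
                // Read the whole entry into memory (assumes each file
                // fits in the heap): key = file name, value = raw bytes.
                byte[] buf = new byte[(int) entry.getSize()];
                int off = 0;
                while (off < buf.length) {
                    int n = tar.read(buf, off, buf.length - off);
                    if (n < 0) break;
                    off += n;
                }
                writer.append(new Text(entry.getName()),
                        new BytesWritable(buf));
            }
        } finally {
            tar.close();
            writer.close();
        }
    }
}

A compressed archive would need the matching decompressor stream (gzip or bzip2) wrapped around the FileInputStream before the tar reader sees it.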

It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.

23 thoughts on “A Million Little Files”

  1. Rasit

    Hi, many thanks for this great post and your work!
    It’s definitely what I needed.

    Do I need to write my own input reader to use binary content as a value in Hadoop?
    I was working on a modified InputReader but got bored; I couldn’t make it work.
    Some source code would be appreciated, if you have any already.

    Thanks.

  2. Stuart Post author

    Rasit said, “Do I need to write my own input reader to use binary content as value in hadoop?”

    No, you can use Hadoop’s BytesWritable class.

  3. Rasit

    Stuart, I mean: which InputReader should I use (the InputReader that sends key-value pairs to the Mapper class)?
    Does Hadoop offer one, or should I extend one of the existing ones?

  4. Stuart Post author

    You don’t need to do anything special. The code I posted here produces Hadoop SequenceFiles. You can use the built-in Hadoop class SequenceFile.Reader to read them. Normally, all you need to do is:

    yourJobConf.setInputFormat(SequenceFileInputFormat.class);
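
    If you just want to inspect the file outside of a MapReduce job, a standalone reader is only a few lines. A minimal sketch against the same old-style API (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DumpSeqFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(args[0]), conf);
            try {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    // getLength() is the real data size; the backing
                    // array from getBytes() may be padded.
                    System.out.println(key + "\t" + value.getLength() + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }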
    
  5. Pingback: Cloudera Hadoop & Big Data Blog » Blog Archive » The Small Files Problem

  6. Mark Kerzner

    Stuart,

    I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40 MB/sec, as promised by Tom White’s book, it seems to be 40 sec/MB. Your tars would work about five times faster. But still, why is it so slow? Is there a way to speed this up?

    Thanks!

  7. Stuart Post author

    Mark,

    I’m afraid I don’t know. From my understanding of HDFS, it depends on a lot of factors — the size of the files, the bandwidth to and within the cluster, and the hardware itself. Try the Hadoop mailing list. One thing I do know is that copying lots of small files to HDFS will be slower than copying a few big files.

    -Stuart

  8. Ryan

    Stuart,

    Thanks for posting this. It will help me a lot! It would be nice to see some options to set SequenceFile.CompressionType, or perhaps guess based on the incoming extension like you do in openInputFile().

    Thanks,
    Ryan

  9. Pingback: The Case for Babar: A Tool for Creating Hadoop Sequence Files « Ryan Balfanz

  10. Jan

    Hi,

    I tried to create a seq file with text files inside using your tool, but unfortunately when I later open it in Hadoop, I get different content:
    …6b 20 74 68 65 20 61 72 65 61 73 20 77 68 65 72 65 20 77 65 20 65 73 74 69 6d 61 74 65 64 20 74 68 65 20 74 72 61 69 6c 20 75 70 70 65 72 2d 6c 69 6d 69 74 20 66 72 6f 6d 20 74 68 65 20 73 74 61 6e 64 61 72 64 20 64 65 76 69 61 74 69 6f 6e 20 6f 66 20 61 6e 20 31 31 20 c3 97 20 31 31 20 70 69 78 65 6c 20 61 70 65 72 74 75 7…

    What could be the reason?

    Thank you,
    Jan

  11. wenxiu

    Sorry, how do I make it work? I got an error:

    prod2@ot-9h30d06:~/QSense/hadoop/hadoop-0.20.2/tar-to-seq$ java -jar tar-to-seq.jar ../change.tar.gz seq_file
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.

  12. wenxiu

    Oh, sorry, ignore my last post; it is just a warning. It works! Thanks~
    BTW, do you happen to have a tool to convert multiple files from a directory into a single sequence file directly? I mean, to save the cost of the external tar compress and uncompress.

  13. Pingback: Hadoop binary files processing entroduced by image duplicates finder « eldad levy's playground

  14. Bowen

    Stuart,

    I’m using Hadoop Streaming (C code) to process binary input files. After using your code to get the sequence file, how do I read the data in my C code? Should I read it byte by byte?

    Thanks,
    Bowen

  15. Stuart Post author

    The file contents are stored as a normal Hadoop BytesWritable object. You would access it as you would any other Hadoop Writable datatype. But if I recall correctly, Hadoop Streaming only supports text.
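
    In a plain Java job the mapper receives the pairs directly. A minimal sketch against the old mapred API (class and output names are illustrative):

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FileBytesMapper extends MapReduceBase
            implements Mapper<Text, BytesWritable, Text, Text> {
        public void map(Text filename, BytesWritable contents,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // The backing array may be longer than the data, so copy
            // only the first getLength() bytes before using them.
            byte[] data = Arrays.copyOf(contents.getBytes(),
                    contents.getLength());
            out.collect(filename, new Text(data.length + " bytes"));
        }
    }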

  16. Bowen

    Thanks for the prompt reply.

    So is there any way for Hadoop Streaming to take binary files as input? Currently, I have to first download those files from HDFS to the local machine and then process them. It’s super slow…

    Thanks,
    Bowen

  17. Chris

    Would you be willing to share the source code? It would be very helpful if I could rewrite it so that your seq file has the filename as the key and the file text as the value, instead of a BytesWritable for the value.

  18. net_ma

    Hi Stuart,

    I downloaded your tool. But when I tried to convert a tar file, I got the following error.

    The tar file I tried to convert is about 4.5GB in size and contains about 200 files.

    Could you tell me what I should do?

    Thank you.

    java -jar tar-to-seq.jar /home/hduser/sample-archive/2011/a-250.tar a-250.seq
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:78)
    at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:71)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1224)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1247)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.append(SequenceFile.java:1297)
    at org.altlaw.hadoop.TarToSeqFile.execute(TarToSeqFile.java:95)
    at org.altlaw.hadoop.TarToSeqFile.main(TarToSeqFile.java:165)

  19. nitin

    Wouldn’t it be more efficient to write an InputFormat/RecordReader implementation for tar files instead of converting them to sequence files? Otherwise it duplicates the data, no?

  20. Stuart Post author

    Nitin- If I recall correctly, you can’t use bzip2-compressed TAR files as Hadoop input files because they are not splittable.
