A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.

Here’s some code (Apache license), including all the Apache jars needed to make it work:

tar-to-seq.tar.gz (6.1 MB)

Unpack it and run:

java -jar tar-to-seq.jar tar-file sequence-file

The output sequence file is BLOCK-compressed and about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop Text), and each value is the binary contents of that file (a BytesWritable).

It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.


20 Responses to A Million Little Files

  1. Rasit says:

    Hi, many thanks for this great post and your work!
    It’s definitely what I needed.

    Do I need to write my own input reader to use binary content as value in hadoop?
    I was working on a modified InputReader, but got bored, couldn’t make it work.
    Some source code would be appreciated, if you already have any.

    Thanks.

  2. Stuart says:

    Rasit said, “Do I need to write my own input reader to use binary content as value in hadoop?”

    No, you can use Hadoop’s BytesWritable class.

  3. Rasit says:

    Stuart, I mean, which InputReader should I use (the InputReader that sends key-value pairs to the Mapper class)?
    Does Hadoop offer one, or should I extend an existing one?

  4. Stuart says:

    You don’t need to do anything special. The code I posted here produces Hadoop SequenceFiles. You can use the built-in Hadoop class SequenceFile.Reader to read them. Normally, all you need to do is:

    yourJobConf.setInputFormat(SequenceFileInputFormat.class);
    
  5. Pingback: Cloudera Hadoop & Big Data Blog » Blog Archive » The Small Files Problem

  6. Mark Kerzner says:

    Stuart,

    I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by Tom White’s book, it seems to be 40 sec/Meg. Your tars would work about 5 times faster. But still, why is it so slow? Is there a way to speed this up?

    Thanks!

  7. Stuart says:

    Mark,

    I’m afraid I don’t know. From my understanding of HDFS, it depends on a lot of factors — the size of the files, the bandwidth to and within the cluster, and the hardware itself. Try the Hadoop mailing list. One thing I do know is that copying lots of small files to HDFS will be slower than copying a few big files.

    -Stuart

  8. Ryan says:

    Stuart,

    Thanks for posting this. It will help me a lot! It would be nice to see some options to set SequenceFile.CompressionType, or perhaps guess based on the incoming extension like you do in openInputFile().

    Thanks,
    Ryan

  9. Pingback: The Case for Babar: A Tool for Creating Hadoop Sequence Files « Ryan Balfanz

  10. Jan says:

    Hi,

    I tried to create a seq file with text files inside using your tool, but unfortunately when I later open it in Hadoop, I get different content:
    …6b 20 74 68 65 20 61 72 65 61 73 20 77 68 65 72 65 20 77 65 20 65 73 74 69 6d 61 74 65 64 20 74 68 65 20 74 72 61 69 6c 20 75 70 70 65 72 2d 6c 69 6d 69 74 20 66 72 6f 6d 20 74 68 65 20 73 74 61 6e 64 61 72 64 20 64 65 76 69 61 74 69 6f 6e 20 6f 66 20 61 6e 20 31 31 20 c3 97 20 31 31 20 70 69 78 65 6c 20 61 70 65 72 74 75 7…

    What could be the reason?

    Thank you,
    Jan
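
A likely explanation, offered as an assumption because the thread never confirms it: the bytes Jan quotes look like the original file's UTF-8 text printed as hex. Hadoop's BytesWritable renders itself as space-separated hex pairs when converted to a string, so the content is probably intact and only needs decoding. The sketch below uses only the JDK; the class and method names are hypothetical and are not part of the tar-to-seq tool.

```java
import java.nio.charset.StandardCharsets;

/** Decodes a space-separated hex dump (e.g. "6b 20 74") back into UTF-8 text. */
final class HexDumpDecoder {

    /** Parses whitespace-separated hex pairs into raw bytes. */
    static byte[] parseHex(String dump) {
        String trimmed = dump.trim();
        if (trimmed.isEmpty()) {
            return new byte[0];
        }
        String[] pairs = trimmed.split("\\s+");
        byte[] out = new byte[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            // parseInt accepts 00..ff; the cast wraps values above 7f into a signed byte.
            out[i] = (byte) Integer.parseInt(pairs[i], 16);
        }
        return out;
    }

    /** Interprets the dumped bytes as UTF-8 text. */
    static String decode(String dump) {
        return new String(parseHex(dump), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two short excerpts of the bytes quoted in the comment above.
        System.out.println(decode("74 68 65 20 61 72 65 61 73 20 77 68 65 72 65"));
        System.out.println(decode("31 31 20 c3 97 20 31 31 20 70 69 78 65 6c"));
    }
}
```

The two excerpts decode to "the areas where" and "11 × 11 pixel", which reads like ordinary English text rather than corrupted data. Inside a Hadoop job the same idea would apply to the value's bytes directly (only the first getLength() bytes of getBytes() are valid) rather than to its string form.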

  11. wenxiu says:

    Sorry, how do I make it work? I got an error:

    prod2@ot-9h30d06:~/QSense/hadoop/hadoop-0.20.2/tar-to-seq$ java -jar tar-to-seq.jar ../change.tar.gz seq_file
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.

  12. wenxiu says:

    Oh, sorry. Ignore my last post; it is just a warning. It works! Thanks~
    BTW, do you happen to have a tool to convert multiple files from a directory into a single sequence file directly? I mean, to avoid the cost of creating and then unpacking the tar archive.

  13. Pingback: Hadoop binary files processing entroduced by image duplicates finder « eldad levy's playground

  14. Bowen says:

    Stuart,

    I’m using Hadoop Streaming (C code) to process binary input files. After using your code to get the sequence file, how do I read the data in my C code? Should I read it byte by byte?

    Thanks,
    Bowen

  15. Stuart says:

    The file contents are stored as a normal Hadoop BytesWritable object. You would access it as you would any other Hadoop Writable datatype. But if I recall correctly, Hadoop Streaming only supports text.

  16. Bowen says:

    Thanks for the prompt reply.

    So is there any way for Hadoop Streaming to take binary files as input? Currently, I have to first download those files from HDFS to the local machine and then process them. It’s super slow…

    Thanks,
    Bowen

  17. Stuart says:

    Sorry, I don’t know anything else about Hadoop Streaming.

  18. Chris says:

    Would you be willing to share the source code? It would be very helpful if I could rewrite it so that the sequence file has the filename as the key and the file’s text as the value, instead of a BytesWritable value.

  19. Stuart says:

    Chris: the source code is included in the .tar.gz download. Apache license.

  20. net_ma says:

    Hi Stuart,

    I downloaded your tool. But when I tried to convert a tar file, I got the following error.

    The tar file I tried to convert is about 4.5GB in size and contains about 200 files.

    Could you tell me what I should do?

    Thank you.

    java -jar tar-to-seq.jar /home/hduser/sample-archive/2011/a-250.tar a-250.seq
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:78)
    at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:71)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1224)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1247)
    at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.append(SequenceFile.java:1297)
    at org.altlaw.hadoop.TarToSeqFile.execute(TarToSeqFile.java:95)
    at org.altlaw.hadoop.TarToSeqFile.main(TarToSeqFile.java:165)
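
The trace above shows the JVM heap running out while the block-compressed writer was buffering data. One common remedy, untested with this tool and therefore only a suggestion, is to raise the heap ceiling with the standard -Xmx launcher flag, for example `java -Xmx2g -jar tar-to-seq.jar tar-file sequence-file` (the 2g figure is an arbitrary illustration). The minimal sketch below, with a hypothetical class name, reports the ceiling the JVM actually received so the setting can be checked before a long conversion.

```java
/** Reports the maximum heap the current JVM may use, to confirm an -Xmx setting took effect. */
final class HeapLimit {

    /** Returns the JVM's heap ceiling in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Returns true if the heap ceiling is at least the requested number of mebibytes. */
    static boolean hasAtLeast(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same -Xmx value intended for the conversion shows whether the larger limit is in force. If the archive contains individual files larger than the available heap, a bigger heap alone may still not be enough.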
