My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite: big files are stored in large, contiguous blocks on disk, so they can be read and written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly, because a tar file doesn’t contain enough index information to be split efficiently. So I wrote a program to convert tar files into Hadoop sequence files.
Here’s the code (Apache license), bundled with all the Apache jars needed to make it work:
tar-to-seq.tar.gz (6.1 MB)
Unpack it and run:
java -jar tar-to-seq.jar tar-file sequence-file
The output sequence file is BLOCK-compressed and comes out to about 1.4 times the size of a bzip2-compressed tar of the same data. Each key is the name of a file in the archive (a Hadoop Text), and each value is the binary contents of that file (a BytesWritable).
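That record format is easy to reproduce if you want to roll your own converter. Here’s a minimal sketch of the core loop, assuming Apache Commons Compress for the tar side and the classic SequenceFile.createWriter API; the class name is made up, and the actual tar-to-seq internals may differ:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TarToSeqSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            TarArchiveInputStream tar = new TarArchiveInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]),
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);  // BLOCK compression, as above
            try {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (!entry.isFile()) continue;  // skip directories, links, etc.
                    // Sketch only: assumes each entry fits in memory, under 2 GB.
                    byte[] data = new byte[(int) entry.getSize()];
                    IOUtils.readFully(tar, data, 0, data.length);
                    writer.append(new Text(entry.getName()),  // key: file name
                                  new BytesWritable(data));   // value: file bytes
                }
            } finally {
                writer.close();
                tar.close();
            }
        }
    }

To feed a .tar.bz2 straight in, you’d wrap the FileInputStream in Commons Compress’s BZip2CompressorInputStream before handing it to the tar reader.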
It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.
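And once the conversion is done, reading the records back, whether in a map task or a standalone program, is straightforward with Hadoop’s standard SequenceFile.Reader. A minimal sketch (the class name here is just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DumpSeq {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
            try {
                Text key = new Text();                      // file name from the tar
                BytesWritable value = new BytesWritable();  // that file's bytes
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value.getLength() + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }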