My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.
Here’s some code (Apache license), including all the Apache jars needed to make it work:
tar-to-seq.tar.gz (6.1 MB)
Unpack it and run:
java -jar tar-to-seq.jar tar-file sequence-file
The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop “Text”); each value is the binary contents of that file (a BytesWritable).
It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.
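For the curious, the heart of the conversion is just a loop that reads each tar entry into memory and appends it to a SequenceFile.Writer. Here is a rough sketch of the idea; it uses Apache Commons Compress’s TarArchiveInputStream for illustration (the bundled tool has its own tar-reading classes and also decompresses .gz/.bz2 input based on the file extension):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TarToSeqSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        TarArchiveInputStream tar =
            new TarArchiveInputStream(new FileInputStream(args[0]));
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]),
            Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        TarArchiveEntry entry;
        byte[] buf = new byte[4096];
        while ((entry = tar.getNextTarEntry()) != null) {
            if (entry.isDirectory()) continue;
            // Buffer the whole entry in memory before appending;
            // very large entries therefore need a large Java heap.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            int n;
            while ((n = tar.read(buf)) != -1) {
                bytes.write(buf, 0, n);
            }
            writer.append(new Text(entry.getName()),
                          new BytesWritable(bytes.toByteArray()));
        }
        writer.close();
        tar.close();
    }
}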
Hi, many thanks for this great post and your work!
It’s definitely what I needed.
Do I need to write my own input reader to use binary content as the value in Hadoop?
I was working on a modified InputReader, but got bored; I couldn’t make it work.
Some source code would be appreciated, if you already have it.
Thanks.
Rasit said, “Do I need to write my own input reader to use binary content as the value in Hadoop?”
No, you can use Hadoop’s BytesWritable class.
Stuart, I mean: which InputReader should I use (the InputReader that sends key-value pairs to the Mapper class)?
Does Hadoop offer one, or should I extend an existing one?
You don’t need to do anything special. The code I posted here produces Hadoop SequenceFiles. You can use the built-in Hadoop class SequenceFile.Reader to read them. Normally, all you need to do is:
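Something like this (a sketch of the standard SequenceFile.Reader loop; the printing is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text key = new Text();                      // file name
        BytesWritable value = new BytesWritable();  // file contents
        while (reader.next(key, value)) {
            System.out.println(key + ": " + value.getLength() + " bytes");
        }
        reader.close();
    }
}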
[…] to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile — tools like this are very useful, and it […]
Stuart,
I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by Tom White’s book, it seems to be 40 sec/Meg. Your tar approach works about 5 times faster. But still, why is it so slow? Is there a way to speed this up?
Thanks!
Mark,
I’m afraid I don’t know. From my understanding of HDFS, it depends on a lot of factors — the size of the files, the bandwidth to and within the cluster, and the hardware itself. Try the Hadoop mailing list. One thing I do know is that copying lots of small files to HDFS will be slower than copying a few big files.
-Stuart
Stuart,
Thanks for posting this. It will help me a lot! It would be nice to see some options to set SequenceFile.CompressionType, or perhaps guess based on the incoming extension like you do in openInputFile().
Thanks,
Ryan
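For reference, the compression type is just an argument to SequenceFile.createWriter, so an option like Ryan suggests would be a small change. A hypothetical helper (the command-line flag parsing is omitted; NONE, RECORD, and BLOCK are the valid values):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriterOptions {
    // compressionType would come from a command-line flag,
    // e.g. "NONE", "RECORD", or "BLOCK" (the posted tool
    // hard-codes BLOCK).
    static SequenceFile.Writer openWriter(Configuration conf, String out,
                                          String compressionType)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        return SequenceFile.createWriter(
            fs, conf, new Path(out),
            Text.class, BytesWritable.class,
            SequenceFile.CompressionType.valueOf(compressionType));
    }
}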
[…] are they large, but we have a lot of them. We are using Hadoop after all. Stuart Sierra’s Tar-to-Seq utility was working quite nicely until this new input set, as those files were much […]
Hi,
I tried to create a seq file with text files inside, using your tool, but unfortunately when I later open it in Hadoop, I get different content:
…6b 20 74 68 65 20 61 72 65 61 73 20 77 68 65 72 65 20 77 65 20 65 73 74 69 6d 61 74 65 64 20 74 68 65 20 74 72 61 69 6c 20 75 70 70 65 72 2d 6c 69 6d 69 74 20 66 72 6f 6d 20 74 68 65 20 73 74 61 6e 64 61 72 64 20 64 65 76 69 61 74 69 6f 6e 20 6f 66 20 61 6e 20 31 31 20 c3 97 20 31 31 20 70 69 78 65 6c 20 61 70 65 72 74 75 7…
What could be the reason?
Thank you,
Jan
Sorry, how do I make it work? I got an error:
prod2@ot-9h30d06:~/QSense/hadoop/hadoop-0.20.2/tar-to-seq$ java -jar tar-to-seq.jar ../change.tar.gz seq_file
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
Oh, sorry, ignore my last post; it is just a warning. It works! Thanks~
BTW, do you happen to have a tool to convert multiple files from a directory into a single sequence file directly? I mean, to save the cost of the external tar compression and decompression.
[…] all the images as a tar file and using the tool written by Stuart Sierra to convert it to a sequence […]
Stuart,
I’m using Hadoop Streaming (C code) to process binary input files. After using your code to get the sequence file, how do I read the data in my C code? Should I read it byte by byte?
Thanks,
Bowen
The file contents are stored as a normal Hadoop BytesWritable object. You would access it as you would any other Hadoop Writable datatype. But if I recall correctly, Hadoop Streaming only supports text.
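In Java, for example, a map task over one of these sequence files would look roughly like this (assuming SequenceFileInputFormat and the old mapred API; the class name and output are just illustrative). Note that getBytes() can return a buffer longer than the valid data, so always pair it with getLength():

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileSizeMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {
    public void map(Text fileName, BytesWritable contents,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Only the first getLength() bytes of getBytes() are the
        // actual file contents; the rest of the buffer is padding.
        byte[] bytes = new byte[contents.getLength()];
        System.arraycopy(contents.getBytes(), 0, bytes, 0, bytes.length);
        output.collect(fileName, new Text(Integer.toString(bytes.length)));
    }
}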
Thanks for the prompt reply.
So is there any way for Hadoop Streaming to take binary files as input? Currently, I have to first download those files from HDFS to the local machine, and then process them. It’s super slow…
Thanks,
Bowen
Sorry, I don’t know anything else about Hadoop Streaming.
Would you be willing to share the source code? It would be very helpful if I could rewrite it so your seq file has the filename as the key and the file text (a Text) as the value, instead of a BytesWritable for the value.
Chris: the source code is included in the .tar.gz download. Apache license.
Hi Stuart,
I downloaded your tool. But when I tried to convert a tar file, I got the following error.
The tar file I tried to convert is about 4.5GB in size and contains about 200 files.
Could you tell me what I should do?
Thank you.
java -jar tar-to-seq.jar /home/hduser/sample-archive/2011/a-250.tar a-250.seq
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:78)
at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:71)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1224)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1247)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.append(SequenceFile.java:1297)
at org.altlaw.hadoop.TarToSeqFile.execute(TarToSeqFile.java:95)
at org.altlaw.hadoop.TarToSeqFile.main(TarToSeqFile.java:165)
Wouldn’t it be more efficient to write an InputFormat/RecordReader implementation for tar files instead of converting them to SequenceFile format? Otherwise it would duplicate the data, no?
Nitin- If I recall correctly, you can’t use bzip2-compressed TAR files as Hadoop input files because they are not splittable.
You may use one or more files as one split for tasks.