Archive for April, 2008
Posted by: Stuart in Programming, tags: Hadoop
My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.
Here’s some code (Apache license), including all the Apache jars needed to make it work:
tar-to-seq.tar.gz (6.1 MB)
Unpack it and run:
java -jar tar-to-seq.jar tar-file sequence-file
The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop Text); each value is the binary contents of that file (a BytesWritable).
It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.
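For a feel of the record model, here is a minimal Python sketch that walks a tar archive and yields one (name, contents) pair per file, which is the same key/value shape described above. It is not the Java program in the download and it does not write a real Hadoop sequence file; the function names and the demo file names are made up for illustration.

```python
import io
import tarfile


def tar_records(fileobj):
    """Yield (name, contents) pairs for each regular file in a tar archive."""
    # Mode "r|*" reads the archive as a forward-only stream (with transparent
    # decompression), so entries are visited strictly in order, never by seeking.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                # Read the contents before moving on to the next entry.
                yield member.name, archive.extractfile(member).read()


def make_demo_tar():
    """Build a tiny in-memory tar holding two hypothetical files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in [("a.txt", b"hello"), ("dir/b.txt", b"world!")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for key, value in tar_records(make_demo_tar()):
        print(key, len(value))
```

In a real conversion each pair would be appended to a sequence file writer instead of printed.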
12 Comments »
I just discovered the paper The End of an Architectural Era (It’s Time for a Complete Rewrite), about re-designing database software from the ground up. It contains some unsurprising predictions — “the next decade will bring domination by shared-nothing computer systems, often called grid computing” — and some interesting ideas:
- Any database smaller than 1 TB will fit entirely in main memory, distributed across multiple machines.
- We should scrap SQL in favor of “modifying little languages [Ruby, Python, etc.] to include clean embeddings of DBMS access.” (CouchDB is a good example of this.)
- A database shouldn’t require an expert to tune and optimize it; instead it should all be automated to “produce a system with no visible knobs.”
In their implementation, H-Store, they claim to run over 70,000 transactions per second on a standard benchmark on a modest server, compared to 850 per second from a commercial DB tuned by an expert. They also plan to “move from C++ to Ruby on Rails as our stored procedure language.” (!)
No Comments »
Posted by: Stuart in Programming, tags: EC2
I just ran my first Amazon EC2 instance. Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away. I got the same feeling the first time I logged in as root on a dedicated web server.
I gotta say, though, that the ticking meter — even at just $0.10/hour — will make me think real hard about how I use it. I guess I’ll get used to it in time. I’ll have to, since I’m running out of disk space on my local hard drive again.
No Comments »
I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer applies, and I’m back to writing giant for-each loops. It’s perl -n gone mad.
I’ve been trying for months to find the most efficient database for AltLaw — SQL, Lucene, RDF, even Berkeley DB. But it still takes hours and hours to process things. Maybe the secret is to get rid of the databases and just mash together some giant files and throw them at Hadoop.
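To make the "giant for-each loop" idea concrete, here is a toy, single-process Python sketch of the map, group, reduce pattern over lines of a file. It is only an illustration of the model, not Hadoop's API; the tab-separated "court, case title" line format and the function names are assumptions invented for the example.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (court, 1) for one tab-separated 'court<TAB>title' line."""
    court, _title = line.rstrip("\n").split("\t", 1)
    yield court, 1


def reduce_counts(key, values):
    """Reduce step: total the values collected for one key."""
    return key, sum(values)


def run(lines):
    """Scan every line once, group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:  # one pass from beginning to end, no index lookups
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["ca2\tA v. B\n", "scotus\tC v. D\n", "ca2\tE v. F\n"]
    print(run(sample))
```

Nothing here looks up a record by key; the whole input is read once, in order.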
1 Comment »
Posted by: Stuart in Programming, tags: Hadoop
An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database sequentially and write it back out again, it takes 5.6 hours.
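The two figures can be reproduced with a little arithmetic, under two assumptions that are not stated above: records are 100 bytes each, and each random-access update costs three seeks (roughly one each to seek, read, and write). A rough Python sketch:

```python
DATABASE_BYTES = 10**12       # 1 TB of data
RECORD_BYTES = 100            # ASSUMPTION: record size is not given in the post
SEEKS_PER_UPDATE = 3          # ASSUMPTION: seek + read + write per updated record
SEEK_SECONDS = 0.010          # 10 ms seek time
THROUGHPUT_BYTES = 100 * 10**6  # 100 MB/s sequential throughput


def random_access_seconds():
    """Time to update 1% of the records one at a time with seeks."""
    records = DATABASE_BYTES // RECORD_BYTES
    updates = records // 100  # 1% of the records
    return updates * SEEKS_PER_UPDATE * SEEK_SECONDS


def sequential_seconds():
    """Time to read the whole database and write it all back out."""
    return 2 * DATABASE_BYTES / THROUGHPUT_BYTES


if __name__ == "__main__":
    print(f"random access: {random_access_seconds() / 86400:.1f} days")
    print(f"sequential:    {sequential_seconds() / 3600:.1f} hours")
```

With those assumptions the random-access case comes to about 34.7 days and the sequential rewrite to about 5.6 hours, matching the numbers quoted.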
This is why Hadoop only supports linear access to the filesystem. It’s also why Hadoop coder Tom White says disks have become tapes. All this is contrary to the way I think about data — building hash tables and indexes to do fast random-access lookups. I’m still trying to get my head around this concept of “linear” data processing. But I have found that I can do some things faster by reading sequentially through a batch of files rather than trying to stuff everything in a database (RDF or SQL) and doing big join queries.
3 Comments »
As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain.
What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a hodge-podge of Ruby, Perl, Clojure, C, shell scripts, SQL, XML, RDF, and text files that could make the most dedicated Unix hacker blanch. It works, but just barely, and I panic every time I think about implementing a new feature.
This is a software engineering problem rather than a pure computer science problem, and I never pretended to be a software engineer. (I never pretended to be a computer scientist, either.) It might also be a problem for a larger team than an army of one (plus a few volunteers).
But given that I can get more processing power (via Amazon) more easily than I can get more programmers, how can I make use of the resources I have to enhance my own productivity?
I’m studying Hadoop and Cascading in the hopes that they will help. But those systems are inherently batch-oriented. I’d like to move away from a batch processing model if I can. Given that AltLaw acquires 50 to 100 new cases per day, adding to a growing database of over half a million, what I would really like to have is a kind of “continuous integration” process for data. I want a server that runs continuously, accepting new data and new code and automatically re-running processes as needed to keep up with dependencies. Perhaps given a year of free time I could invent this myself, but I’m too busy debugging my shell scripts.
No Comments »
Since we started putting court cases on the interwebs, first with Project Posner and then with AltLaw, we’ve had the occasional angry email from someone who Googles himself/herself and finds a court case from 20 years ago that reveals embarrassing and career-damaging facts. They usually want the page taken down.
Now, sometimes I’m sympathetic with the people making these requests — sexual harassment plaintiffs, asylum-seekers, and so on. Other times, I’m not — usually when the person writing to us was convicted of sexual harassment, fraud, etc.
We’ve canvassed for opinions on what we should do. The responses generally fall into 3 categories:
- Lawyers say: Tough, it’s public record. Without public case law the American legal system would cease to function. Only the courts can (and do) decide to anonymize a case.
- Techies say: Tough, information wants to be free. If you don’t like what the web says about you, make your own web site to tell your side of the story.
- Others say: I don’t know. It’s wrong to censor public records, but it’s also wrong to make people suffer for something that happened 20 years ago.
There are also suggested solutions:
1. Refuse to take anything down. Don’t answer the phone.
2. Anonymize names in “sensitive” cases. Provide a protected link to a non-censored version. The problem is, cases are routinely identified by the names of the parties. If you take out the names, you don’t know what case it is anymore.
3. Block search engines from the entire site, either with robots.txt or free registration. And say goodbye to 50% of our traffic.
4. Refuse to modify or take down cases, but block individual cases in robots.txt on request.
Our current policy is #4: leave cases up unmodified, but block individual ones in robots.txt on request. But is that good enough? For the appeals and Supreme Court cases we currently host, probably. But we hope to expand AltLaw to every U.S. court, down to the state level. What happens when we start hosting, say, bankruptcy court decisions?
This gets into bigger questions of open access versus individual privacy. We’re not the only ones struggling with the issue — our friends at Justia and public.resource.org have similar problems. Ultimately, it’s a question for society at large. Perhaps, as legal research on the web expands, courts will develop stricter standards for how they publish cases containing sensitive information. But legal institutions are extremely resistant — and slow — to change. The web of free legal information is growing fast — in the eight months since AltLaw launched, at least three commercial competitors have appeared.
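For reference, blocking individual cases in robots.txt amounts to keeping a list of paths and writing one Disallow line per path. A minimal Python sketch, assuming hypothetical URL paths like /cases/12345 (AltLaw's real URL scheme is not described here):

```python
def robots_txt(blocked_paths):
    """Build robots.txt text that asks all crawlers to skip the given paths."""
    lines = ["User-agent: *"]
    # De-duplicate and sort so the file is stable from one run to the next.
    lines += [f"Disallow: {path}" for path in sorted(set(blocked_paths))]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths for cases someone asked to have hidden from search.
    print(robots_txt(["/cases/2002", "/cases/1001", "/cases/2002"]), end="")
```

One caveat: Disallow rules match by prefix, so a rule for /cases/1 would also hide /cases/12. And robots.txt is only a request to well-behaved crawlers; the page itself stays public.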
No Comments »
I recently read the expression “going non-linear” describing a person, where most people would say something like “going nuts.” Incredibly geeky; I like it.
No Comments »
… as explained by Sir Kenny,
From: Ken Tilton
Newsgroups: comp.lang.lisp
Date: Tue, 01 Apr 2008 14:53:07 -0400
Subject: Re: Newbie FAQ #2: Where’s the GUI?
Jonathan Gardner wrote:
> I know this is a FAQ, but I still don’t have any answers, at least answers that I like.
That’s because you missed FAQ #1 (“Where are the damn libraries?”) and the answer (“The Open Source Fairy has left the building. Do them your own damn self.”)
… message truncated …
No Comments »