Category: Programming

A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information…

Read the full article

The Great Database Rewrite

I just discovered the paper “The End of an Architectural Era (It’s Time for a Complete Rewrite),” about redesigning database software from the ground up. It contains some unsurprising predictions, such as “the next decade will bring domination by shared-nothing computer systems, often called grid computing,” and some interesting ideas: Any database smaller than 1…

Read the full article

Power At Your Fingertips

I just ran my first Amazon EC2 instance. Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away. I got the same feeling the first time I logged in as root on a dedicated web server. I gotta say, though, that the ticking meter — even at just $0.10/hour —…

Read the full article

There Is No Database

I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer…

Read the full article

Disk is the New Tape

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database…

Read the full article
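
The arithmetic behind the 35-day figure can be roughed out in a few lines of Python. This is only a sketch: the excerpt gives the data size, seek time, throughput, and update fraction, but not the record size or the number of seeks per update. The 100-byte records and three seeks per update below are my assumptions, picked because they reproduce roughly 35 days. The sequential figure is simply what the stated throughput implies for one pass over the data; a real update would also have to write the data back out.

```python
# Back-of-the-envelope check of the "disk is the new tape" scenario.

TERABYTE = 10**12
MEGABYTE = 10**6

# Values stated in the scenario.
DATA_BYTES = 1 * TERABYTE
SEEK_SECONDS = 0.010                  # 10 ms per seek
THROUGHPUT_BYTES_PER_SEC = 100 * MEGABYTE
FRACTION_UPDATED = 0.01               # update 1% of the records

# ASSUMPTIONS, not stated in the excerpt; chosen to match the 35-day figure.
RECORD_BYTES = 100
SEEKS_PER_UPDATE = 3


def random_access_days(data_bytes, record_bytes, fraction_updated,
                       seeks_per_update, seek_seconds):
    """Days spent seeking if every updated record is reached by random access."""
    records = data_bytes / record_bytes
    updates = records * fraction_updated
    seconds = updates * seeks_per_update * seek_seconds
    return seconds / 86_400


def sequential_pass_hours(data_bytes, throughput_bytes_per_sec):
    """Hours needed to stream over all the data once at full throughput."""
    return data_bytes / throughput_bytes_per_sec / 3_600


if __name__ == "__main__":
    days = random_access_days(DATA_BYTES, RECORD_BYTES, FRACTION_UPDATED,
                              SEEKS_PER_UPDATE, SEEK_SECONDS)
    hours = sequential_pass_hours(DATA_BYTES, THROUGHPUT_BYTES_PER_SEC)
    print(f"random-access updates: {days:.1f} days")
    print(f"one sequential pass:   {hours:.1f} hours")
```

Under those assumptions the seeks alone add up to about 35 days, while a single pass over the whole terabyte takes under three hours, which is why scanning everything wins even when only 1% of the records change.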

Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a…

Read the full article

Going Non-Linear

I recently read the expression “going non-linear” describing a person, where most people would say something like “going nuts.” Incredibly geeky; I like it.

The Problem With Common Lisp

… as explained by Sir Kenny,

From: Ken Tilton
Newsgroups: comp.lang.lisp
Date: Tue, 01 Apr 2008 14:53:07 -0400
Subject: Re: Newbie FAQ #2: Where’s the GUI?

Jonathan Gardner wrote:
> I know this is a FAQ, but I still don’t have any answers, at least answers that I like.

That’s because you missed FAQ #1…

Read the full article

All Your Base

Let’s have the database models strut down the runway: Relational (SQL): Data consist of rows in tables. Each row has multiple columns. Each column has a fixed type. Queries use filters and joins. Fixed schema is defined separately from the data. User-defined indexes improve query performance. Robust transaction/data-integrity support. Graph (RDF): Data consist of nodes…

Read the full article