Category: Programming

A Million Little Files

Posted on April 24, 2008 by Stuart

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information…

Read the full article

The Great Database Rewrite

Posted on April 23, 2008 by Stuart

I just discovered the paper “The End of an Architectural Era (It’s Time for a Complete Rewrite),” about re-designing database software from the ground up. It contains some unsurprising predictions (“the next decade will bring domination by shared-nothing computer systems, often called grid computing”) and some interesting ideas: Any database smaller than 1…

Read the full article

Power At Your Fingertips

Posted on April 22, 2008 by Stuart

I just ran my first Amazon EC2 instance. Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away. I got the same feeling the first time I logged in as root on a dedicated web server. I gotta say, though, that the ticking meter — even at just $0.10/hour —…

Read the full article

There Is No Database

Posted on April 21, 2008 (updated August 28, 2015) by Stuart

I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer…

Read the full article

Disk is the New Tape

Posted on April 17, 2008 by Stuart

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database…
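The arithmetic behind that scenario can be sketched in a few lines of Python. The excerpt gives the data size, seek time, throughput, and update fraction; the record size (100 bytes), the number of seeks per update (3), and the two-pass scan are my own assumptions, picked because they land close to the 35-day figure. Treat this as a rough check, not the original calculation.

```python
# Back-of-the-envelope sketch of the seek-versus-scan scenario.
# Values marked "from the post" come from the text; the rest are assumptions.

DATA_BYTES = 10**12                    # from the post: one terabyte
SEEK_TIME_MS = 10                      # from the post: 10 ms per seek
THROUGHPUT_BYTES_PER_S = 100 * 10**6   # from the post: 100 MB/s
PERCENT_UPDATED = 1                    # from the post: update 1% of the records

RECORD_BYTES = 100      # ASSUMPTION: the excerpt does not give a record size
SEEKS_PER_UPDATE = 3    # ASSUMPTION: e.g. locate, read, write back
SCAN_PASSES = 2         # ASSUMPTION: read everything once, write it once


def random_access_seconds() -> float:
    """Time spent seeking if every updated record is reached by random access."""
    records = DATA_BYTES // RECORD_BYTES
    updates = records * PERCENT_UPDATED // 100
    return updates * SEEKS_PER_UPDATE * SEEK_TIME_MS / 1000


def full_scan_seconds() -> float:
    """Time to stream the whole data set sequentially, ignoring seeks."""
    return SCAN_PASSES * DATA_BYTES / THROUGHPUT_BYTES_PER_S


if __name__ == "__main__":
    print(f"random access: {random_access_seconds() / 86400:.1f} days")
    print(f"full scan:     {full_scan_seconds() / 3600:.1f} hours")
```

Under these assumptions the random-access estimate comes out at about 34.7 days, while streaming the full terabyte twice takes a few hours. Change RECORD_BYTES or SEEKS_PER_UPDATE and the gap moves, but it stays large as long as the records are small.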

Read the full article

Continuous Integration for Data

Posted on April 17, 2008 by Stuart

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a…

Read the full article

Going Non-Linear

Posted on April 14, 2008 by Stuart

I recently read the expression “going non-linear” describing a person, where most people would say something like “going nuts.” Incredibly geeky; I like it.

Useful Reminders

Posted on April 11, 2008 by Stuart

Epigrams on Programming

The Problem With Common Lisp

Posted on April 4, 2008 by Stuart

… as explained by Sir Kenny,

From: Ken Tilton
Newsgroups: comp.lang.lisp
Date: Tue, 01 Apr 2008 14:53:07 -0400
Subject: Re: Newbie FAQ #2: Where’s the GUI?

Jonathan Gardner wrote:
> I know this is a FAQ, but I still don’t have any answers, at least answers that I like.

That’s because you missed FAQ #1…

Read the full article

All Your Base

Posted on March 21, 2008 by Stuart

Let’s have the database models strut down the runway:

Relational (SQL): Data consist of rows in tables. Each row has multiple columns. Each column has a fixed type. Queries use filters and joins. Fixed schema is defined separately from the data. User-defined indexes improve query performance. Robust transaction/data-integrity support.

Graph (RDF): Data consist of nodes…

Read the full article