At the Edge of Feasibility

Well, it happened. I ran out of space on the 250 GB drive I use to develop AltLaw. Not all that surprising, although it did happen sooner than I expected. I’m deleting gigabytes of cached data — file conversions, mostly — just so I can get enough space to work again.

But this makes me wonder: am I operating at the edge of what is feasible (or practical) on a single machine? The latest batch of case updates, courtesy of public.resource.org, took nearly three days to run. And I had to divide the process up into smaller batches because a memory leak — traceable, I think, to Ruby’s HTTP library — was bogging down the system. When you’re dealing with hundreds of thousands of files at a time, even the standard UNIX tools like ls and grep start to feel inadequate. It’s damn hard to get a grip on a million files at once. Just clearing some space with a simple “rm -rf” can take hours.
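A minimal sketch of the batching idea, written in Python rather than Ruby and not taken from the actual AltLaw code: walk the directory tree lazily so the full file list never sits in memory, then hand the paths off in fixed-size chunks. The names `iter_files`, `batches`, and `process_in_batches`, and the `handle` callback, are hypothetical and exist only for illustration.

```python
import os
from itertools import islice


def iter_files(root):
    """Yield file paths under root one at a time, without building a full list."""
    stack = [root]
    while stack:
        current = stack.pop()
        # os.scandir streams directory entries instead of loading them all.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def batches(paths, size):
    """Group an iterator of paths into lists of at most `size` items."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_in_batches(root, size, handle):
    """Call handle(batch) for each batch of files; return the number of batches."""
    count = 0
    for batch in batches(iter_files(root), size):
        handle(batch)  # hypothetical per-batch work, e.g. one conversion run
        count += 1
    return count
```

Running each batch in a fresh process (rather than in one long-lived loop like this) would be one way to sidestep a leak in a library you can't fix, at the cost of some startup overhead per batch.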

Maybe it’s time for Amazon Web Services. But that’s a big, scary, enterprisey world. I’m just one little script monkey. I can barely keep track of data on one machine; how could I keep track of it on a dozen? Then again, AWS would give me more powerful tools for managing data: a bigger pair of tongs, so to speak. Decisions, decisions.