I have, to my chagrin, recently discovered Twitter. I was at a conference at which the attendees twittered (tweeted?) every presentation as it happened. One speaker accidentally/deliberately left his Twitter client running during his presentation, resulting in a stream of jokes and off-color comments in the corner of his PowerPoint slides.  Maybe every presentation should do this. That way you’d know if you were boring your audience.


I haven’t posted in a while — look for more later this summer.

But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time?

By large scale, I mean applying a transformation or analysis efficiently over every record in a multi-gigabyte collection.  Hadoop is very good at this, but it achieves its efficiency by working with collections in large chunks, typically 64 MB and up.

What you can’t do with Hadoop — at least, not efficiently — is retrieve a single record.  Systems layered on top of Hadoop, like HBase, attempt to mitigate the problem, but they are still slower than, say, a relational database.

In fact, considering the history of storage and data access technologies, most of them (RAM, filesystems, hard disks, RDBMSs, etc.) have been geared toward efficient random access to individual records.  But as Hadoop demonstrates, random-access systems tend to be inefficient for handling very large data collections in the aggregate.

This is not merely theoretical musing; it’s a problem I’m trying to solve with AltLaw.  I can use Hadoop to process millions of small records.  The results come out in large Hadoop SequenceFiles.  But then I want to provide random access to those records via the web site.  So I have to somehow “expand” the contents of those SequenceFiles into individual records and store those records in some format that provides efficient random access.

Right now, I use two very blunt instruments: Lucene indexes and plain old files.  In the final stage of my processing chain, metadata and searchable text get written to a Lucene index, and the pre-rendered HTML content of each page gets written to a file on an XFS filesystem.  This works, but it ends up being one of the slower parts of the process.  Building multiple Lucene indexes and merging them into one big (~6 GB) index takes an hour; writing all the files to the XFS filesystem takes about 20 minutes.  There is no interesting data manipulation going on here; I’m just moving data around.


Update: Slides and video available at LispNYC.

Ok, it’s really happening this time:

Stuart Sierra presents: Implementing AltLaw.org in Clojure

This talk demonstrates the power of combining Clojure with large Java frameworks, such as:

  • Hadoop - distributed map/reduce processing
  • Solr - text indexing/searching
  • Restlet - REST-oriented web framework
  • Jets3t - Amazon S3

Join us from 7:00 - 9:00 at Trinity Church in the heart of the East Village. Afterward the discussion will continue at the Sunburnt Cow on 9th and C.


We had to cancel my talk for tomorrow night, due to problems with the venue. LispNYC will still meet at the Sunburnt Cow, 137 Avenue C, for drinks and discussion. My presentation has been postponed to the June meeting.


My favorite programming language has made a 1.0 release!  [announcement]


CANCELED: My presentation is canceled.  LispNYC will still meet at the Sunburnt Cow, 137 Avenue C, and I’ll be there to talk about Clojure.  But no slides, no video, etc.  My presentation is postponed to the June meeting.


I’ll be talking about my work with Clojure at LispNYC on the evening of Tuesday, May 12.   Time and location to be announced.   Slides and (hopefully) video available after the fact.

Possible topics:

  • Why Java + Lisp was such a great idea
  • Using Clojure with Java libraries like Hadoop & Solr
  • Deploying a web server with Clojure and Restlet
  • Clojure libraries I’ve written, such as test-is

Official announcement with time & location:
from LispNYC.org

Stuart Sierra presents: Implementing AltLaw.org in Clojure

This talk demonstrates the power of combining Clojure with large Java frameworks, such as:

  • Hadoop - distributed map/reduce processing
  • Solr - text indexing/searching
  • Restlet - REST-oriented web framework
  • Jets3t - Amazon S3

Join us from 7:00 - 9:00 at Trinity Church in the heart of the East Village. Afterward the discussion will continue at the Sunburnt Cow on 9th and C.

Directions to Trinity:

Trinity Lutheran
602 E. 9th St. & Ave B., on Tompkins Square Park
http://trinitylowereastside.org/

From N,R,Q,W (8th Street NYU Stop) and the 4,5 (Astor Street Stop):
Walk East 4 blocks on St. Marks, cross Tompkins Square Park.

From F&V (2nd Ave Stop):
Walk E one or two blocks, turn north for 8 short blocks

From L (1st Ave Stop):
Walk E one block, turn south for 5 short blocks

The M9 bus line drops you off at the doorstep, and the M15 is near (get off on St. Marks & 1st).

To get there by car, take the FDR (East River Drive) to Houston then go NW till you’re at 9th & B.  Week-night parking isn’t bad at all, but if you’re paranoid about your Caddy or in a hurry, there is a parking garage on 9th between 1st and 3rd Ave.


There’s a big ol’ thread going on down at comp.lang.lisp about Clojure vs. Common Lisp. I’m biased, of course, but I have to say that Clojure and Rich Hickey are holding their own against some of the top c.l.l. flamers.

But all the arguments about functional programming, software transactional memory, and reader macros miss what was, for me, the biggest reason to switch to Clojure. It’s about the libraries, stupid. Building on the JVM and providing direct access to Java classes/methods was the best decision in Clojure’s design. ‘Cause if it’s ever been done, anywhere, by anyone, someone’s done it in Java. Twice.

A few years ago, I tried to solve the Common Lisp library problem by writing a bridge from CL to Perl 5, and was laughed out of town. Rich Hickey, I’m told, spent years trying to bridge CL to Java, and never got very far. But Clojure works, and it works great. It’s a Lisp with a squintillion libraries. Who else can claim that?

So, if I wanted Lisp with Java libraries, why not use Kawa or ABCL … or heck, JRuby? Those are all fine projects, but they all suffer from mismatches between the “source” language (Scheme, CL, Ruby) and the “host” language (Java). There is never a one-to-one mapping between types in the source language and types in the host language. So you end up needing conversions like jclass/jmethod/jcall (ABCL) or primitive-static-method (Kawa). (JRuby is slightly better, but only because Ruby is closer to Java than CL/Scheme.)

Clojure doesn’t have this problem because it was designed from the ground up for the JVM. Clojure strings are java.lang.String, Clojure maps are java.util.Map, even Clojure functions are java.lang.Runnable (and java.util.concurrent.Callable). This makes it supremely easy to mix-n-match Clojure code with Java libraries and vice-versa. I know, because every day I use Clojure with complex Java libraries like Hadoop, Restlet, Lucene, and Solr. Everything just works. I don’t have to write any foreign-function interfaces or bridge code. In fact, using Java libraries in Clojure is often easier than using them in Java!
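To make the type correspondence concrete, here is a minimal sketch using only classes from the JDK itself. The var names (string-is-java-string?, shouted, and so on) are made up for illustration; nothing here comes from AltLaw's code.

```clojure
;; Clojure values are instances of ordinary Java classes and interfaces.
(def string-is-java-string? (instance? String "hello"))
(def map-is-java-map?       (instance? java.util.Map {:a 1}))
(def fn-is-runnable?        (instance? Runnable inc))
(def fn-is-callable?        (instance? java.util.concurrent.Callable inc))

;; Java methods can be called directly on Clojure values, with no bridge code.
(def shouted   (.toUpperCase "hello"))   ; java.lang.String method
(def looked-up (.get {:a 1} :a))         ; java.util.Map method

;; A Clojure function passed to a Java API that expects a Runnable.
(def result (atom nil))
(doto (Thread. ^Runnable (fn [] (reset! result :done)))
  (.start)
  (.join))
```

The same pattern should carry over to larger libraries: wherever a Java API asks for a String, a Map, or a Runnable, the Clojure value can be handed over as-is.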

Clojure may not be a programming language for the next hundred years, as Arc aspires to be. But it’s a great language if you want to get stuff done right now.


Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10 at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and using Clojure with Hadoop.

Update 2/28: My slides from this presentation are available from the Meetup group files.


Up at Cornell, Tom Bruce has a post about the problem of funding open access to legal materials. This brings to mind a conversation I had with a doctor friend recently about AltLaw. My friend, accustomed to the open-access requirements of NIH grants, was frankly shocked that there are no comparable rules for legal decisions.

NIH Public Access Policy

Screenshot: PubMed home page

A related problem is how to make people aware of what free services are available. AltLaw has been around for two years, and while traffic has grown steadily, it has not gotten as much attention as commercial startups operating similar services. Admittedly, we have done no advertising at all, and that’s our fault. “If you build it they will come” we thought, naïvely. But how would we advertise? I’m a programmer; the people I work with are law professors. None of us know the first thing about marketing, and quite frankly, none of us care. Seen in that light, Cornell’s recent partnership with Justia.com is a smart move that will benefit everyone working on open-access law, since it will expose more lawyers to the idea.


It’s interesting to see the first signs of rebellion against RSpec. I jumped on the RSpec bandwagon when it first appeared, mostly so I wouldn’t have to write “assert_equal” all the time. But while I liked and used RSpec, I don’t think it made my tests any better. If anything, they were a little bit worse. I found myself testing things that were not really relevant to the code I was writing, asserting obvious things like “a newly-created record should be empty.”

When I got interested in Clojure, one of the first things I wrote was a testing library called “test-is”. I borrowed from a lot of Common Lisp testing frameworks, especially the idea of a generic assertion macro called “is”. It looks like this:

(deftest test-my-function
  (is (= 7 (my-function 3 4)))
  (is (even? (my-function 10 2))))

This is pretty basic, but it’s sufficient for low-level unit testing. So far, I think that’s how the library has been typically used. There have been occasional requests, however, for RSpec-style syntax. I can see how this would be useful for testing at a level higher than individual functions, but I have come to believe that the added semantics of RSpec are not really necessary.

Right now, the test-is library is built on the same abstractions as Clojure itself. Tests are functions, so you can apply all the same tools that already exist for handling functions. Tests can be called by name, organized into namespaces, and composed. There is almost no extra bookkeeping code that I need to write to make all of this work.
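As a rough sketch of what “tests are functions” means in practice, the example below uses clojure.test, on the assumption that it behaves like test-is here (test-is was later merged into Clojure under that name). The function my-function and the test names are placeholders.

```clojure
(require '[clojure.test :refer [deftest is]])

;; Placeholder function under test.
(defn my-function [a b] (+ a b))

(deftest test-sum
  (is (= 7 (my-function 3 4))))

(deftest test-even
  (is (even? (my-function 10 2))))

;; Composition: a test that calls other tests by name.
;; (If a runner such as run-tests also picks up test-sum and test-even
;; on their own, they will run twice; that is fine for a sketch.)
(deftest test-arithmetic
  (test-sum)
  (test-even))

;; Each test is a plain function held in a var, so it can be inspected
;; and called like any other function.
(def tests-are-fns? (every? fn? [test-sum test-even test-arithmetic]))

;; The test body is stored as :test metadata on the var.
(def sum-test-fn (:test (meta #'test-sum)))

;; Calling a test by name runs its assertions.
(test-arithmetic)
```

No registry or runner object is involved: organizing tests is just a matter of organizing vars and namespaces.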

In contrast, if I were to adopt the RSpec style, I would have to write code to call, store, and organize tests. That’s more work for me, and ultimately restricts the flexibility of the library for people who use it. Furthermore, RSpec has its own set of semantics, above and beyond the language itself, which must be learned.

This is my first experience supporting a library for anyone other than myself, and I don’t want to force anyone into a particular style. A library like RSpec is a complete environment that attempts to anticipate all possible usage scenarios, so it’s grown correspondingly complicated. I want to provide a set of small tools that can be combined with other tools to do interesting things.

Of course, by making that decision I’m already dictating, to some extent, how the library can be used. But really, what I’m trying to do is set limits for myself. I will commit to providing a flexible, extensible set of functions and macros for writing tests. I am explicitly not trying to provide a complete testing framework. If someone wants to build an RSpec-style framework on top of test-is, more power to them. I will happily try to make test-is easier to integrate into that framework.

But there’s one other thing that struck me about that article that I linked to at the beginning — the idea of putting tests and code in the same file. I think that’s a great idea, and Clojure comes ready-made to implement it. Clojure supports the idea of “metadata” on definitions. You can attach a set of arbitrary properties to any object, without affecting the value of that object.
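For readers who haven't seen metadata before, here is a small sketch using only clojure.core. The :source key and file name are invented for the example; the :test key is the one clojure.core's own test function looks for.

```clojure
;; Metadata on a value: equality ignores it.
(def plain  [1 2 3])
(def tagged (with-meta plain {:source "hypothetical-file.txt"}))
(def same-value? (= plain tagged))
(def tag (meta tagged))

;; Metadata on a definition: a test function stored under :test.
(defn add-numbers
  {:test (fn [] (assert (= 7 (add-numbers 3 4))))}
  [a b]
  (+ a b))

;; clojure.core/test finds the :test function on the var and calls it,
;; returning :ok if it ran without throwing.
(def outcome (test #'add-numbers))
```

The function itself is unchanged by the attached test; callers of add-numbers never see it.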

It’s easy to attach a test function as metadata on a definition in Clojure, but the syntax is a little ugly, and there is no easy way to remove the tests from production code. So I added the “with-test” macro to my library. It lets you wrap any definition in a set of tests. It looks like this:

(with-test
 (defn add-numbers [a b]
   (+ a b))
 (is (= 7 (add-numbers 3 4)))
 (is (= -4 (add-numbers -6 2))))

This is equivalent to adding metadata to the function, but the syntax is a little cleaner. I’ve also added a global variable, “*load-tests*”, which can be set to false to omit tests when loading production code.
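Here is a hedged sketch of the *load-tests* switch, again assuming clojure.test matches test-is on this point. Because with-test checks the flag when the macro is expanded, the sketch uses eval to simulate loading the same source under both settings; load-form and example-form are hypothetical helpers, not part of the library.

```clojure
(require '[clojure.test :as t])

;; The source form we pretend to load from a file.
(def example-form
  '(clojure.test/with-test
     (defn add-numbers [a b] (+ a b))
     (clojure.test/is (= 7 (add-numbers 3 4)))))

;; Hypothetical helper: evaluate the form with *load-tests* bound as given.
;; Returns the var that the defn inside with-test creates.
(defn load-form [load-tests?]
  (binding [t/*load-tests* load-tests?]
    (eval example-form)))

;; "Production" load: the tests are dropped at macro-expansion time.
(def prod-has-test? (some? (:test (meta (load-form false)))))

;; "Development" load: the tests are attached as :test metadata.
(def dev-has-test? (some? (:test (meta (load-form true)))))
```

In a real project the flag would more likely be set once, before the production code is loaded, rather than toggled per form.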

I like having each function right next to its tests. It makes it easier to remember to write tests, and easier to see how the function is supposed to behave. So to the extent that test-is will promote a testing style, this is it. But it’s a pretty radical departure from the traditional style of testing, so I’m not sure how others will react to it.
