Two Upcoming Clojure/Hadoop Talks

Hello, everyone.

I’ll be performing my Clojure+Hadoop magic tricks at the following events:

Friday, October 2: Hadoop World NYC. Use the code hadoopworld_friend for 10% off the registration fee.

Monday, October 5: NoSQL NYC Meetup. Free!

At both events I’ll be talking about:

  • Why Clojure and Hadoop are a perfect fit.
  • How to write Hadoop jobs in Clojure.
  • My clojure-hadoop library.
  • Storage options for Clojure data structures.

Will post slides after, and recordings if they are available.

It’s About the Platform

I’ve said It’s About the Libraries, and indeed, one of the major selling points of Clojure is that it can call Java libraries directly.

But there’s more to it than that. Libraries are just one benefit to building Clojure on top of Java, or, more accurately, on top of Java the platform.

Look around you, and you’ll see that 99% of all the software in the world runs on just three platforms:

  1. Unix/C
  2. Java Virtual Machine
  3. .NET Common Language Runtime

Where did these platforms come from? Let’s see:

  1. AT&T
  2. Sun
  3. Microsoft

Notice something? All three were developed by huge corporations.

Building a new platform isn’t just about writing the code. In fact, very little of it is about code. You need books, articles, conferences, workshops, and university courses. You need multinational corporations to trust their entire business to your platform. It takes millions of dollars and tens of thousands of hours of labor to create a new platform. Think of the massive ad campaigns Sun ran for Java. Can you do that? Of course not.

So when you’re designing a new language, you have to build on an existing platform. Most of the so-called “scripting” languages grew up on Unix, so they’re written in C. Now, Unix/C is a great platform, still going strong after 40 years. It provides powerful tools and standardized interfaces such as files, sockets, and pipes.

The problem is that each of the “scripting” languages has developed into its own mini-platform. Perl, Python, and Ruby each define their own set of data structures for fundamental types like strings, lists, and maps. The only “types” that Unix recognizes are text and binary. You can’t exchange data between two languages without serializing everything to some agreed-upon format. And you can’t do callbacks between languages below the level of a whole process.

The other problem with languages written in C is, well, C. Pointers are hard. Memory management is hard. I know from bitter experience that Ruby libraries can have segfaults or memory leaks. That just doesn’t happen in Java.

Clojure was created to leverage capabilities of Java-the-platform — garbage collection, dynamic code generation, JIT compilation, threads, locks — some of which are difficult to use effectively in Java-the-language. To implement Clojure in C, for example, you would first have to build your own platform with these features. That’s effectively what most Common Lisp implementations do, and they suffer because the Common Lisp world is too small to sustain its own platform.

The brilliant thing about Java-the-platform is that it allows many languages to coexist. I can mix code written in Java, Clojure, JRuby, Jython, etc. and it’s pretty easy, because they all implement the same fundamental interfaces like java.util.List and java.lang.Runnable. For example, right now I have Hadoop (Java code) calling Clojure code calling JRuby code. It all just works.

(The .NET CLR provides similar capabilities, and there is a Clojure CLR port.)

Run Your Own Maven Repository With Nothing but an FTP Server

I hope I’ve demonstrated in the last few posts that Maven is pretty cool, not so scary. But the public Maven repositories sometimes leave a bit to be desired. They don’t have entries for every possible library, and occasionally they have incorrect dependencies or other metadata. Also, the process of adding new libraries to the central repositories is somewhat involved.

Maybe you want to depend on a project that isn’t in the public repos. Or maybe you want to publish development snapshots of your own projects. In either case, you need your own Maven repository.

Fortunately, running your own Maven repository is dirt simple. All you need is a web server where you can upload files. Maven understands FTP, SCP, WebDAV, and more exotic protocols like Subversion. Here I’m going to describe the simplest one, plain old FTP. If you have a web site on a cheap, shared web host, odds are you can use FTP to manage the files on the server.

Step 1: Get Some Web Space

I’m assuming you have a web site somewhere. Let’s say it’s at http://www.example.net/. Furthermore, let’s say you can publish files on this web site by uploading them to the FTP server ftp.example.net. Your FTP user name is samiam with the password greeneggs. You put the files for your web site in the directory /home/samiam/public_html on the server.

Using your favorite FTP client, create a directory named maven2 inside your web site directory (public_html in our example).

Congratulations, you just created a Maven 2 repository!

Check that you can visit http://www.example.net/maven2 in a web browser. You should see a directory listing; it’s empty, because we haven’t added any files yet.

Step 2: Configure Your Project Deployment

Now you’re ready to deploy a project to your Maven repository. If you’ve been following along with my Maven blog posts, you know that each Maven project has a project description file named pom.xml.

Open up your awesome software project and edit the pom.xml file. You’re going to add two new sections, ending up with a file that looks like this:

<project ...>
  ...
  <groupId>net.example</groupId>
  <artifactId>awesome</artifactId>
  <name>The Awesome Library</name>
  <version>1.0-SNAPSHOT</version>
  ...

  <build>
    ...
    <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ftp</artifactId>
        <version>1.0-alpha-6</version>
      </extension>
    </extensions>
    ...
  </build>

  ...
  <distributionManagement>
    <repository>
      <id>example-ftp</id>
      <url>ftp://ftp.example.net/home/samiam/public_html/maven2</url>
    </repository>
  </distributionManagement>
</project>

The <extensions> section loads the Maven plugin that handles FTP uploads. The <distributionManagement> section tells Maven where to publish the project. Here we specified a <url> with the full URL path to our maven2 directory. The <id> tag in the <repository> is a name that you choose — just remember it for the next step.

Step 3: Configure Your Server Credentials

Remember that the pom.xml will be part of the public distribution of your project. It’s OK to put the name of the FTP server there, but you wouldn’t want to include private information like your user name and password. Those go in a special Maven configuration file called settings.xml that you keep private.

You can find settings.xml in your personal Maven cache directory. On Unix-like systems, it should be at ~/.m2/settings.xml. You may have to create the file if it doesn’t already exist. Here’s what it should contain:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <servers>
    <server>
      <id>example-ftp</id>
      <username>samiam</username>
      <password>greeneggs</password>
    </server>
  </servers>
</settings>

The <id> is the same as in our pom.xml. The user name and password are the credentials for your FTP server.

Step 4: Deploy!

Now all you have to do is run this command in your project directory:

mvn deploy

Maven builds your project and uploads it to your public repository. That’s all there is to it!

Take a look with your web browser at http://www.example.net/maven2/net/example/awesome/1.0-SNAPSHOT and you’ll see the JAR and POM files there.
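
The URL above follows the standard Maven 2 repository layout: the dots in the groupId become path separators, followed by the artifactId and then the version. Here is a minimal shell sketch of that mapping. The function name artifact_url is made up for illustration, and the coordinates are the ones from this example.

```shell
# Hypothetical helper: build the directory URL for an artifact in a
# Maven 2 repository from its coordinates.
# Usage: artifact_url REPO_URL GROUP_ID ARTIFACT_ID VERSION
artifact_url() {
  # Dots in the groupId become slashes: net.example -> net/example
  group_path=$(printf '%s' "$2" | tr '.' '/')
  printf '%s/%s/%s/%s\n' "$1" "$group_path" "$3" "$4"
}

url=$(artifact_url http://www.example.net/maven2 net.example awesome 1.0-SNAPSHOT)
echo "$url"
# prints http://www.example.net/maven2/net/example/awesome/1.0-SNAPSHOT
```

The same rule tells you where to look for any other project you deploy to the repository.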

Step 5: Tell the World

Now, anyone who wants to use your awesome library can just add a dependency to their pom.xml, like this:

  ...
  <dependencies>
    ...
    <dependency>
      <groupId>net.example</groupId>
      <artifactId>awesome</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    ...

  </dependencies>
  ...

  <repositories>
    ...
    <repository>
      <id>example-net</id>
      <name>The Example Repository</name>
      <url>http://www.example.net/maven2</url>
    </repository>
    ...

  </repositories>
  ...

The <repository> section is necessary, since your repository is not on the list of “central” repositories that Maven searches by default.

In general, if your project has a stable release that is widely used, then it’s worth the effort to get it into the central Maven repository. This is easier to do when you already have your own personal repository to point to. Note that the central repository does not accept SNAPSHOT releases, nor does it allow any changes to a release after it has been uploaded.

Cutting-Edge Clojure Development with Maven

I promised, in my previous post, that I would show you how to use the latest-and-greatest versions of Clojure and clojure-contrib in your Maven projects. Here’s that post.

Formos Software maintains a Maven server with nightly builds of Clojure and contrib at http://tapestry.formos.com/maven-snapshot-repository/

Here’s a complete pom.xml file with dependencies on both Clojure and clojure-contrib:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>YOUR.GROUP.ID</groupId>
  <artifactId>YOUR-PROJECT-NAME</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>YOUR-PROJECT-NAME</name>
  <dependencies>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure-lang</artifactId>
      <version>1.1.0-alpha-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure-contrib</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>tapestry.formos.com</id>
      <name>Formos Software snapshot repository</name>
      <url>http://tapestry.formos.com/maven-snapshot-repository</url>
    </repository>
  </repositories>
</project>

Yes, that’s a pile of XML. But it’s not that complicated once you break it down. Here’s what’s going on:

Dependencies

The <dependencies> section lists the libraries our project depends on. We have one <dependency> for Clojure (called clojure-lang in the Formos repository) and one for clojure-contrib. We’re depending on SNAPSHOT versions, which tells Maven to follow the most recent version on a particular branch.

The current development branch of clojure-lang is called 1.1.0-alpha-SNAPSHOT. The development branch of contrib, which has never had a formal release, is just 1.0-SNAPSHOT.

How did I find these version numbers? I just looked at the repository in a web browser. In the org/clojure/clojure-lang directory I found directories named for each development branch, 1.0-SNAPSHOT, 1.0.0-RC1-SNAPSHOT, and 1.1.0-alpha-SNAPSHOT. I chose the latest one, 1.1.0-alpha-SNAPSHOT. Then I did the same with clojure-contrib.

If you look inside a branch directory like 1.1.0-alpha-SNAPSHOT, you’ll find hundreds of files, one for each daily snapshot, named with timestamps.

Repositories

The <repositories> section tells Maven where to look for JAR files to download. We added the Formos repository by specifying its URL.

The <id> and <name> tags inside <repository> are purely for our own reference. Maven only cares about the URL. We could have used any id and name to describe the Formos repository; those names will be used in Maven’s console logging.

Managing Dependency Versions

The problem with tracking the latest snapshot is that sometimes there’s a release that breaks your code. It might be a bug, or it might just be a change in behavior that makes the library incompatible with previous versions.

The Versions Maven Plugin can help to alleviate this problem by “locking” dependencies to specific releases and updating them in a controlled way.

First, we have to make the Versions plugin available to our project. Do this by adding the following lines just before the final </project> in your pom.xml:

  <pluginRepositories>
    <pluginRepository>
      <id>Codehaus</id>
      <name>Codehaus Maven Plugin Repository</name>
      <url>http://repository.codehaus.org/org/codehaus/mojo</url>
    </pluginRepository>
  </pluginRepositories>

We’ve added a “plugin repository,” which is just a Maven repository that happens to contain Maven plugins. (Technically, you don’t need to add this if your local Maven cache already has a copy of the Versions plugin, but putting it in pom.xml ensures that other developers coming to your project have access to all the same plugins.)

Now we can use the following mvn command:

mvn versions:lock-snapshots

This modifies your pom.xml file, setting the version string of every SNAPSHOT dependency to the current snapshot timestamp. For example, when I run this command, I end up with the following:

  <dependencies>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure-lang</artifactId>
      <version>1.1.0-alpha-20090904.093041-38</version>
    </dependency>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure-contrib</artifactId>
      <version>1.0-20090904.093531-59</version>
    </dependency>
  </dependencies>

Now, whenever you build the project, you know exactly which Clojure release you’re getting.

So you work with a particular release for a while, then you want to upgrade to the latest one. No problem! Just run:

mvn versions:unlock-snapshots
mvn -U install

The first command modifies pom.xml, replacing all the timestamped version numbers with SNAPSHOT versions.

The -U option on the second command forces Maven to check for updated versions of all the snapshot dependencies.

Note: Both lock-snapshots and unlock-snapshots create a backup file called pom.xml.versionsBackup. To remove this file (and accept the Versions Plugin’s changes to your pom.xml) run:

mvn versions:commit

Likewise, to go back to the pom.xml file you had before the Versions Plugin messed with it, run:

mvn versions:revert
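
The commands in this section can be strung together into one upgrade step. What follows is only a sketch: the function name upgrade_snapshots is hypothetical, and it assumes that a failing "mvn -U install" is your signal that the new snapshots broke something. It uses only the goals described above.

```shell
# Hypothetical helper: try the newest snapshots and keep them only if the
# project still builds. Run it from the directory containing pom.xml.
upgrade_snapshots() {
  # Replace the timestamped versions with SNAPSHOT versions
  # (this also writes pom.xml.versionsBackup).
  mvn versions:unlock-snapshots || return 1

  if mvn -U install; then
    # Build passed: pin the new timestamps, then discard the backup.
    mvn versions:lock-snapshots && mvn versions:commit
  else
    # Build failed: restore the pom.xml saved in the backup file.
    mvn versions:revert
    return 1
  fi
}

# Example call:
# upgrade_snapshots
```

If you would rather inspect the changes by hand, run the individual commands and look at pom.xml before calling versions:commit.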

Explore

There’s a lot more to the Versions plugin, and to Maven dependency management in general. Check the documentation for details.

Of course, the example here can be combined with the clojure-maven-plugin demonstrated in my previous post. The syntax of the combined pom.xml file is left as an exercise for the reader.

One other thing: if you really want the absolute latest, up-to-the-second, still-hot-from-the-github version of something, there’s nothing for it but to use Git submodules. I’ll demonstrate that in another post.


Maven’s Not So Bad: Further Thoughts on Clojure Package Management

Update Sept. 4: How to get the latest builds of Clojure & Contrib

Maven is a touchy subject. People tend to have strong opinions about it. But like it or not, it’s the de-facto standard for dependency management in the Java world. Clojure lives in the Java world, so that means we have to live with Maven.

Here are some good things about Maven:

  • “Convention over configuration.”
  • Plugins are downloaded & installed automatically.
  • Handles dependencies of dependencies.
  • Declarative configuration, not imperative like Ant.
  • Only stores one copy of each JAR, shared by all projects.

Here are some bad things about Maven:

  • XML configuration file.
  • Verbose command line options.
  • Doesn’t track latest source code of projects. (Correction: it does; see comments. Thanks, Tim!)
  • First run takes forever to download all the plugins.
  • Verbose console output.

In my estimation, the good outweigh the bad. And nothing outweighs the huge fact that Maven is already there.

So let’s develop a Clojure app using Maven.

Step 1: Install Maven.

If you don’t already have it, that is. This is pretty easy: just visit maven.apache.org and follow the instructions.

Step 2: Create a new project.

Type the following at the command line:

mvn archetype:generate

Maven will ask a series of questions:

  1. archetype: At the “Choose a number” prompt, press enter to accept the default project type, maven-archetype-quickstart.
  2. groupId: Enter a name to identify yourself in the global Maven namespace. All your Maven projects will use the same groupId. This is typically a reverse domain name in the style of Java package names. For example, I could use the groupId com.stuartsierra
  3. artifactId: Enter a name to identify this specific project in the Maven repository. For example, my-great-clojure-library
  4. version: Press enter to accept the default, 1.0-SNAPSHOT
  5. package: Press enter to accept the default, which is the same as your groupId.
  6. Confirmation: Press enter.

Now you have a directory named my-great-clojure-library containing the skeleton of a new Java project.
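
If you would rather skip the prompts, the same answers can be passed as properties on the command line. This is a sketch, not part of the original walkthrough. The function name new_maven_project is hypothetical, and the exact behavior of these properties may vary between versions of the archetype plugin.

```shell
# Hypothetical helper: create a quickstart project without the prompts.
# Usage: new_maven_project GROUP_ID ARTIFACT_ID
new_maven_project() {
  mvn archetype:generate \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DgroupId="$1" \
    -DartifactId="$2" \
    -Dversion=1.0-SNAPSHOT \
    -DinteractiveMode=false
}

# Example call, using the names from this post:
# new_maven_project com.stuartsierra my-great-clojure-library
```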

Step 3: Configure your pom.xml.

Go into your new project directory and edit the pom.xml file. The basic information will already be filled out.

We can remove the dependency on JUnit, so delete these lines:

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

Then we need to add the Clojure Maven plugin, which tells Maven how to compile Clojure source code. Before the final </project> tag, add these lines:

  <build>
    <plugins>
      <plugin>
        <groupId>com.theoryinpractise</groupId>
        <artifactId>clojure-maven-plugin</artifactId>
        <version>1.0</version>
        <executions>
          <execution>
            <id>compile-clojure</id>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

We also need to add Clojure itself as a dependency of our project (the Clojure Maven plugin does not do this automatically). Inside the <dependencies> tag, add the following lines:

    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure</artifactId>
      <version>1.0.0</version>
    </dependency>

That’s if you want the official Clojure 1.0.0 release. If you want a cutting-edge version, I’ll explain how in a later post.

Step 4: Delete Java sources.

Your project directory comes pre-equipped with two Java source directories at src/main/java and src/test/java. You can delete both of them, unless, of course, you’re developing a mixed Clojure-Java project.

Step 5: Add dependencies.

If your project does anything interesting, chances are it’s going to depend on some external Java libraries. You can find libraries in the public Maven repositories at mvnrepository.com. Search for a library name, and it will show you the code to put in your pom.xml file.

For example, say we want to use the Apache Commons IO library. At mvnrepository.com, we find the dependency code for this library:

  <dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>1.4</version>
  </dependency>

And we can add that inside the <dependencies> section of pom.xml.

Step 6: Start coding!

Create the directory src/main/clojure. This is where all your .clj source files will go.

Follow the standard Clojure/Java convention for file names. That is, if you have a Clojure namespace called my.great.library, it should be in a file named src/main/clojure/my/great/library.clj
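
As a rough sketch of that convention, this shell snippet turns a namespace name into its file path and creates an empty source file there. It uses the namespace from the example above. One detail not mentioned above: hyphens in a namespace name become underscores in the file name, so the snippet handles those too.

```shell
# Namespace from the example above; substitute your own.
ns="my.great.library"

# Dots become directory separators, and any hyphens become underscores.
file="src/main/clojure/$(printf '%s' "$ns" | sed 'y|.-|/_|').clj"

mkdir -p "$(dirname "$file")"
# Start the file with a matching ns declaration, unless it already exists.
[ -e "$file" ] || printf '(ns %s)\n' "$ns" > "$file"
echo "$file"
# prints src/main/clojure/my/great/library.clj
```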

Step 7: Compile and install.

Run the following command:

mvn install

That will compile all your .clj source files into Java .class files, package them into a JAR, and install that JAR in your local Maven cache. On Unix-like systems, the cache should be at ~/.m2/repository/
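
To check where the JAR ended up, you can work out its path from the project coordinates. This sketch assumes the default cache location and uses the groupId, artifactId, and version from this post.

```shell
# Coordinates from the pom.xml in this post.
group_id="com.stuartsierra"
artifact_id="my-great-clojure-library"
version="1.0-SNAPSHOT"

# Default local cache; dots in the groupId become directories.
repo="$HOME/.m2/repository"
jar="$repo/$(printf '%s' "$group_id" | tr '.' '/')/$artifact_id/$version/$artifact_id-$version.jar"

if [ -f "$jar" ]; then
  echo "installed: $jar"
else
  echo "not found (has 'mvn install' run yet?): $jar"
fi
```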

Step 8: Live and learn.

There’s a whole lot more to learn about Maven. It’s a very flexible tool, and it can do almost anything. Yes, you will have to write some XML, but it’s really not that much.

Things I hope to cover in future posts:

  1. Using git submodules to track development versions of Clojure libraries.
  2. Running tests written in Clojure.
  3. Including .clj source files in your JAR.
  4. Creating a stand-alone JAR including all dependencies.
  5. Setting up a private Maven repository.

I hope this was a reasonable introduction to developing with Maven and Clojure, and that I have shown that Maven isn’t nearly as scary as people make it out to be. I think Maven suffered for a long time from poor documentation, but that’s changing rapidly. I found the (free) book Maven: The Definitive Guide extremely helpful.

Appendix: Complete pom.xml

Here’s the complete pom.xml file for the project I developed in this post:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.stuartsierra</groupId>
  <artifactId>my-great-clojure-library</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>my-great-clojure-library</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.4</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>com.theoryinpractise</groupId>
        <artifactId>clojure-maven-plugin</artifactId>
        <version>1.0</version>
        <executions>
          <execution>
            <id>compile-clojure</id>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Thoughts on Clojure Package Management

Update Sept. 3: Maven’s Not So Bad.

A lot of Ruby types come to Clojure and ask, “Where’s the package manager?” The answer is usually, “Maven or Ivy,” which isn’t really an answer.

I discussed this in the latter half of my Philly Lambda talk (PDF slides). The problem is that Clojure is built on Java, and any Clojure library that does something interesting is going to need some Java libraries beyond what the JDK provides.

Java has only one established dependency management system, Maven. (Ivy is an alternative, but it uses the Maven repositories.) Maven works, but it’s a big, complicated beast, built in the best giant-XML-configuration-file Java tradition. It’s also slow to accept new libraries into the public repositories. The central Maven 2 repository contains fewer than 700 libraries. Rubyforge, by contrast, lists over 8,000.

Maven seems to work well for large organizations that can benefit from setting up their own, private repositories, but it’s kind of a headache for the independent developer.

There’s a Clojure Maven plugin, some shell-based hacks like Corkscrew, and some Ivy-related code floating around, but none really provides what people want: one simple command to download and install all the dependencies for a project, without needing any XML.

What everyone wants, of course, is CPAN. Thousands of documented, tested modules for just about any task you could imagine, and quite a few you couldn’t (e.g., Acme::Buffy).

But CPAN was not created in a day. Most of its imitators (Rubygems, PEAR, Python Eggs) have failed to reach the same level of quality. Perl is also much older, and therefore more stable, than Python or Ruby. 10-year-old Perl code probably still works.

Part of CPAN’s success, I think, has to do with the environment in which it evolved. When Perl was the hot new language, running a web server was an expensive proposition. Even domain names weren’t cheap. If you were going to publish code on the web, there was a cost to doing so, either in time or money, so you wanted to make sure that it was worth publishing.

These days, when everyone has a blog and a Github account, sharing code is easy. Doing “git push” requires almost no thought, no investment of time. Why not release everything, even when it’s untested, undocumented, or unfinished?

So this weekend I started working on a package repository for Clojure. I modeled it after CPAN, but designed it to support anything that could be packaged in a JAR file, including compiled Java libraries and Clojure source code.

I got started. Then I thought, who would actually use this? Of the few dozen Clojure libraries that have been published on Github, only a handful are “production-ready.” Most aren’t even finished. Very few have been thoroughly tested. (I’m equally guilty in this regard.)

I concluded that it’s just too early. Clojure is scarcely two years old. It just released “1.0” this year, and is still developing rapidly. The libraries are evolving equally rapidly. If you want to build a project using, say, Compojure, the best way to do it is with Git submodules.

The one place a package manager would really be useful is in downloading and installing the standard Java packages that get used in almost every project, like the Apache Commons libraries. For this, Maven/Ivy works, if not brilliantly.

Update: another Maven helper: Clojure-POM

Scalability for dummies like me

Alex Barrera wrote a nice little article about why “scalability issues” can prevent any visible progress on a web project for months at a time: Scalability Issues for Dummies.

I’ve been in this position — no visible progress while redesigning a back-end — with AltLaw several times now. I’m contemplating yet another redesign right now, and I don’t know if I can stand it.