What Is a Program?

What is a program? Is it the source code that a programmer typed? The physical state of the machine on which it is run? Or something more abstract, like the data structures it creates? This is not a purely philosophical question: it has consequences for how programming languages are designed, how development tools work, and what programmers do when we go to work every day.

How much information can we determine about a program without evaluating it? At the high end, we have the Halting Problem: for any Turing-complete programming language, we cannot determine if a given program written in that language will terminate. At the low end, consider an easier problem: can we determine the names of all of the invokable functions/procedures/methods in a given program? In a language in which functions are not first-class entities, such as C or Java, this is generally easy, because every function must have a specific lexical representation in source code. If we can parse the source code, we can find all the functions. At the other extreme, finding all function names in a language like Ruby is nearly impossible in the face of method_missing, define_method, and other “metaprogramming” tricks.

Again, these decisions have consequences. Ruby code is flexible and concise, but it’s hard to parse: ten thousand lines of YACC at last count. I find that documentation generated from Ruby source code tends to be harder to read — and less useful — than JavaDoc.

Here’s another example: can I determine, without evaluating it, what other functions a given function calls? In Java this is easy. In C it’s much harder, because of the preprocessor. To build a complete call graph of a C function, we need to run the preprocessor first, in effect evaluating some of the code. In Ruby — Ha! Looking at a random piece of Ruby code, I sometimes have trouble distinguishing method calls from local variables. Again because of method_missing, I’m not sure it’s even possible to distinguish the two without evaluating the code.

I do not want to argue that method_missing, macros, and other metaprogramming tools in programming languages are bad. But they do make secondary tooling harder. The idea for this blog post came into my head after observing the following Twitter conversation:

Dave Ray: “seesaw.core is 3500 lines of which probably 2500 are docstrings. I wonder if they could be externalized somehow…”

Chas Emerick: “Put your docs in another file you load at the *end* of seesaw.core, w/ a bunch of (alter-meta! #'f assoc :doc "docs") exprs”

Anthony Grimes: “I hate both of your guts for even thinking about doing this.”

Chas Emerick: “Oh! [Anthony] is miffed b/c Marginalia doesn’t load namespaces it’s generating docs for. What was the rationale for that?”

Michael Fogus: “patches welcomed.”

Chas Emerick: “Heh, sure. I’m just wondering if avoiding loading was intentional, i.e. in case someone has a (launch-missiles) top-level.”

(As an aside, it’s hard to link to a Twitter conversation instead of individual tweets.)

Clojure — or any Lisp, really — exemplifies a tension between tools that operate on source code and tools that operate on the running state of a program. For any code-analysis task one might want to automate in Clojure, there are two possible approaches. One is to read the source as a sequence of data structures and analyze it. After all, in a Lisp, code is data. The other approach is to eval the source and analyze the state it generates.

For example, suppose I wanted to determine the names of all invokable functions in a Clojure program. I could write a program that scanned the source looking for forms beginning with defn, or I could evaluate the code and use introspection functions to search for all Vars in all namespaces whose values are functions. Which method is “correct” depends on the goal.

Michael Fogus wrote Marginalia, a documentation tool for Clojure which takes the first approach. Marginalia was inspired by Literate Programming, which advocates writing source code as a human-readable document, so the unevaluated “text” of the program is what it deals with. In contrast, Tom Faulhaber’s Autodoc for Clojure takes the second approach: loading code and then examining namespaces and Vars. Autodoc has to work this way to produce documentation for the core Clojure API: many functions in core.clj are defined before documentation strings are available, so their doc strings are loaded later from separate files. (As an alternative, those core functions could be rewritten in standard syntax after the language has been bootstrapped.)

One of the talking-point features of any Lisp is “all of the language, all of the time.” All features of the language are always available in any context: macros give access to the entire runtime of the language at compile time, and eval gives access to the entire compiler at runtime. These are a powerful tools for developers, but they make both the compiler and the runtime more complicated. ClojureScript, on the other hand, does not have eval and its macros are written in a different language (Clojure). These choices were deliberate because ClojureScript was designed to be compiled into compressed JavaScript to run on resource-constrained platforms, but I think it also made the ClojureScript compiler easier to write.

Chas’s last point about code at the top-level comes up often in discussions around Clojure tooling. The fact that we can write arbitrary code anywhere in a Clojure source file — launch-missiles being the popular example — means we can never be sure that evaluating a source file is free of side effects. But not evaluating the source means we can never be sure that we have an accurate model of what the code does. Welcome back to the Halting Problem.

Furthermore, maintaining all the metadata associated with a dynamic runtime — namespaces, Vars, doc strings, etc. — has measurable costs in performance and (especially) memory. If Clojure is ever to be successful on resource-constrained devices like phones, it will need the ability to produce compiled code that omits most or all of that metadata. At the same time, developers accustomed to the “creature comforts” of modern IDEs will continue to clamor for even more metadata. Fortunately, this isn’t a zero-sum game. ClojureScript proves that Clojure-the-language can work without some of those features, and there have been discussions around making the ClojureScript analyzer more general and implementing targeted build profiles for Clojure. But I don’t think we’ll ever have a perfect resolution of the conflict between tooling and optimization. Programs are ideas, and source code is only one representation among many. To understand a program, one needs to build a mental representation of how it operates. And to do that we’re stuck with the oldest tool we have, our brains.

JDK Version Survey Results

After a month and about 175 responses, here are the results of my JDK Version Usage Survey (now closed):

Versions: Almost everyone uses 1.6. A few are still using 1.5, and a few are trying out 1.7. Only a handful are still on 1.4. Fortunately, no one is on a version older than 1.4.

Reasons: These are more varied. The most common reason for not upgrading is lack of time, with “it just works” running a close second. A little less than half of respondents are limited by external forces: either operations/management or third-party dependencies.

Not much came out in the comments. Banks and other large institutions seem to be the most resistant to upgrades, especially if they’ve been bitten by past JDK changes.

Design Philosophies of Developer Tools

I’ve been thinking about some of the tools that I use every day, and about the different design philosophies they reflect.

Git

First and foremost, Git. We use Git on every single project, internal and external. Git is a great example of the Unix design philosophy: many small programs — 153 of them by my count — each of which does exactly one thing and does it well. But this is not “loose coupling.” The components of Git are tightly integrated: they all depend on the same repository structure and file formats.

One of the nice things about Git is how its internals are both exposed for the world to see and thoroughly documented. We can easily write scripts to automate common tasks or create different workflows. With a bit more effort, we could even write new tools that integrate with the Git suite. These tools can do things that Git’s authors never intended, as long as they follow the documented repository structure. Git isn’t so much a version control system as the means to construct one.

Still, Git is one project with many components, not many separate projects. All 153 executables in Git are governed by a single release cycle, tested and known to work together. We never have to worry about incompatible versions of, say, git-branch and git-merge on the same machine. Older versions of Git can read repositories created with newer versions even if they don’t provide all the same features.

Maven

In stark contrast to Git, we have tools from the Java world like Ant and Maven. The JVM cannot fork/exec, so the many-small-programs design is a non-starter. Instead, the Java tools usually favor some sort of plug-in architecture, which is a great idea in theory but hard to get right in practice.

I’ve tried writing a Maven plugin. Hacking up a one-off for a single project is not too difficult, but designing a general-purpose plugin that works everywhere is maddeningly complicated. Maven plugins are just Java code, so they can do whatever they want, but the APIs for interacting with the rest of the Maven system are woefully underdocumented. The contract of a Maven plugin, what it can and cannot do, is not well-defined. The internals of Maven itself are largely a black box.

The core Maven plug-ins have independent release cycles, so there is the possibility for unexpected incompatibilities, but I’ve never encountered such. On the whole, the Maven ecosystem is quite stable. The struggle comes once you venture outside the realm of what the standard plugins provide. Maven plugins are not designed to be composed, so adding new capabilities is rarely as simple as scripting plugins that already exist. You have to start from scratch every time.

Ruby / Rubygems / RVM / Bundler

Finally, we have tools from the Ruby world, the ever-changing cornucopia of Ruby implementations, libraries, and tools to manage it all. The problem with the Ruby tools is that they are both tightly-coupled and uncoordinated. Despite having separate tools for each task, each tool reaches into at least one of the others: Rubygems modifies the behavior of the Ruby interpreter, Bundler modifies the behavior of Rubygems, RVM modifies the behavior of the shell, and so on. Each one adds another layer of indirection, making debugging harder.

All of the Ruby development tools have independent release cycles, and they don’t seem to plan or coordinate with one another in advance of each release. Integration testing is left up to the users.

I admire the speed and eagerness with which the Ruby community produces new tools. But on almost every Ruby project I’ve worked on, we’ve spent hours or days sorting out incompatibilities among some combination of libraries, language implementations, and development tools. Our internal mailing list is littered with advice like “Don’t use Bundler version X with RVM version Y.” The speed of development comes with its own cost.

Thoughts

So what do I take from all this? Just a few principles to keep in mind when writing software tools:

  1. Plan for integration
  2. Rigorously specify the boundaries and extension points of your system
  3. Do not depend on unspecified behavior

And a couple of ideas if you’re starting a new project from scratch:

  1. The filesystem is the universal integration point
  2. Fork/exec is the universal plugin architecture

Update 8/31: More comments at Hacker News.

The Naming of Namespaces

From time to time I’m asked, “How do you organize namespaces in Clojure projects?” The question surprised me at first, because I hadn’t thought about it much. But then I was using Clojure back when the only way to load code was “load-file.”

Most programming languages, especially object-oriented languages, provide strong hints on how to structure your source files. Everything is a class, and (almost) every class is a file. In Clojure, everything (almost) is a function. Functions are much smaller units than classes. So how do we group them?

The important thing to remember about namespaces is that, from the compiler’s point of view, they don’t matter. Namespaces are a convenience for the programmer, to help you avoid name clashes without having to write longer names. There’s no reason why an entire application can’t be defined in a single namespace. Most of Clojure itself is defined in one namespace with over 500 symbols. (Common Lisp has 978 symbols in a single namespace.)

You can think of namespaces as a tool to express something about your application. Here are some ideas to get you started:

  • Group functions into namespaces based on type of data they manipulate. For example, functions to manipulate customer data go in the “customer” namespace. This technique is familiar from object-oriented languages, but it has the same limitations: where do you put functions concerning relationships among two or more types? The OO answer would be to make a new name for the relationship. This style leads to a proliferation of small namespaces, which can become a burden.
  • Divide a library into a public API namespace and a internal implementation namespace. Or define a high-level API for common cases and a low-level API for more advanced usage.
  • Divide an application into namespaces representing architectural layers. You can examine “ns” declarations to prove that each layer calls functions only from the layer below it.
  • Divide an application into namespaces representing functional modules, with well-defined contracts for communication between modules.
  • Try to separate decision-making code from the code that carries out those decisions. That is, keep your business logic purely functional and free of side-effects, so it is easy to test. You don’t necessarily have to put side-effect code in a separate namespace, but doing so may help keep it cleanly separated.

With all these techniques, the point to remember is that namespaces are there to help you, not to get in your way. If you have a large namespace, you can still divide it up into multiple files.

The one hard-and-fast rule is that you cannot have a circular dependency between namespaces. That is, if namespace A needs to call functions defined in namespace B, then namespace B cannot call functions in namespace A. There is no workaround, it simply can’t be done. In practice, this is rarely a problem. If you encounter a situation where two namespaces are mutually dependent, it’s probably a sign that they should be merged into a single namespace.

In client projects at Relevance (home of Clojure/core) we often end up with one namespace for each aspect of an application — data access, UI, logging, and so on. Then there’s one “main” namespace that depends on all the others and ties it all together.

Update 9/5/2011: Chris Houser wrote a nice answer on StackOverflow about how to split a Clojure namespace over several files.

ClojureScript Launch, New York

As you may have heard, last night we (Clojure/core) announced ClojureScript at the Clojure NYC Meetup. Rich Hickey gave a talk, which was streamed live over the web, while we monitored Twitter and IRC for feedback.

The event was a great success, with loads of excitement expressed by both the local New York crowd and the Internet at large.

Screenshot of IRC / Twitter during the ClojureScript announcement

Screenshot of IRC / Twitter during the ClojureScript announcement

Video was also recorded, which will be posted soon.

Thanks to Google New York for hosting.

Dependency Management First-Aid Kit

This article attempts to unravel some of the mysteries of dependency management with Maven and Maven-based tools.

Help, something’s missing!

Say you have a project named “my-new-project” which declares a dependency on version 3 of the “awesome-sauce” library by the Example.com corporation. You add the dependency to your pom.xml, project.clj, or whatever configuration file your build tool uses. You take a deep breath and start a build. And it fails!

If you’re using Maven 2, you see something like this:

[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.

Missing:
----------
1) com.example:awesome-sauce:jar:3.0.0

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=com.example -DartifactId=awesome-sauce -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=com.example -DartifactId=awesome-sauce -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
        1) my.group:my-new-project:jar:1.0.0-SNAPSHOT
        2) com.example:awesome-sauce:jar:3.0.0

----------
1 required artifact is missing.

for artifact: 
  my.group:my-new-project:jar:1.0.0-SNAPSHOT

from the specified remote repositories:
  central (http://repo1.maven.org/maven2),
  clojars (http://clojars.org/repo/)

Leiningen, which uses Maven 2 under the covers, produces similar output, but it mistakenly prints the current project name as org.apache.maven:super-pom:jar:2.0.

Maven 3 prints a less verbose (and less informative) error message, but the gist is the same.

What happened?

What is all this verbosity saying? Well, obviously, the build failed because something was missing. What was missing? Maven tells you:

Missing:
----------
1) com.example:awesome-sauce:jar:3.0.0

The JAR file for the project “awesome-sauce”, version 3.0.0, published in the “com.example” group, is missing. That just means Maven didn’t find it in any of the places it looked.

Where did it look? Maven tells you that too:

from the specified remote repositories:
  central (http://repo1.maven.org/maven2),
  clojars (http://clojars.org/repo/)

These are the public repositories where Maven searched for the file. Each repository has an ID (“central” and “clojars” in this case) and a URL. Both are specified in the configuration of:

  1. Your project, in pom.xml or project.clj

  2. Your build tool’s global configuration file

    • settings.xml for Maven
    • N/A for Leiningen
  3. Your build tool’s built-in defaults

If you visit http://repo1.maven.org/maven2/com/example/awesome-sauce or http://clojars.org/repo/com/example/awesome-sauce in a browser you will see that those directories do not, in fact, exist.

Although it’s not listed, the first place Maven checks for a dependency is your local Maven repository. The local repository is just a big cache of everything Maven has downloaded in the past. It’s typically located at $HOME/.m2/repository.

What to do next

You have two options at this point:

  1. Find a public repository containing “awesome-sauce”
  2. Install “awesome-sauce” in your local repository

The first option is generally less work, and more repeatable if you ever build your project on another machine.

Finding a repository

Odds are, if the library you are looking for is free, open-source, and popular, it will already be in a public Maven repository somewhere. Start with the source: who released the library? Large organizations with a lot of open-source projects often host their own repositories, like Google and Codehaus. Failing that, search engines such as Mvnbrowser may help you find it.

Once you’ve found a repository, you need to add it to your build. For example, to add the Codehaus repository to a Maven project, add these lines to pom.xml inside the <project> tag:

<repositories>
  <repository>
    <id>codehaus</id>
    <name>Codehaus</name>
    <url>http://repository.codehaus.org/</url>
  </repository>
</repositories>

(You can pick your own <id> and <name>.)

For Leiningen, add the following lines inside the (defproject ...) block:

  :repositories {"codehaus" "http://repository.codehaus.org/"}

Installing locally

If the library you want is not available in any public repository, you’re not stuck, you just have to do a bit more work. You need to get the JAR file for the library, either by downloading it manually or building from source. Then you need to install that JAR file in your local Maven repository. That’s easy, because Maven has already told you exactly how to do it:

  Then, install it using the command: 
      mvn install:install-file -DgroupId=com.example -DartifactId=awesome-sauce \
 -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file

Copy that command verbatim, changing only /path/to/file to the path to the library’s JAR file. Maven will copy the file to $HOME/.m2/repository/com/example/awesome-sauce/3.0.0/awesome-sauce-3.0.0.jar. The next time you build your project, Maven knows exactly where to find it.

Installing remotely

If you want others to be able to build your project without having to go through these manual steps, you need your own public Maven repository to which you can upload files. Hosting a Maven repository isn’t hard: all you need is a web server.

If you work with a team, consider setting up a shared repository for everyone to use. A repository manager such as Nexus can help you take care of user accounts and authentication.

If you publish open-source libraries, I strongly encourage you to get an account on Sonatype OSS, a free service provided by the makers of Nexus. Releasing your projects to Sonatype OSS gives them a path to get added to the Maven Central Repository. While the requirements for projects in Maven Central are more stringent than just tossing code into your own repository, it’s worth the effort. In Maven Central, your project will have greater visibility and will be easier for anyone in the world to use.

But what if I don’t want that dependency?

Maven dependencies are transitive: if your project depends on project X, which depends on projects Y and Z, then your build will try to download X, Y, and Z.

But sometimes projects declare dependencies that aren’t strictly necessary. Or they declare dependencies on something you want, but the wrong version. How can you avoid including those extra dependencies in your build?

Maven supports dependency exclusions for these cases. For example, suppose the “awesome-sauce” library declares a dependency on “com.example:stupidity:0.0.1″. You know that you don’t need “stupidity” in your project, so you want to prevent the build from including it. In pom.xml, you write:

<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>awesome-sauce</artifactId>
    <version>3.0.0</version>
    <exclusions>
      <exclusion>
        <groupId>com.example</groupId>
        <artifactId>stupidity</artifactId>
      </exclusion>
    </exclusions> 
  </dependency>
</dependencies>

Or in Leiningen’s project.clj, you write:

  :dependencies [[com.example/awesome-sauce "3.0.0"
                  :exclusions [com.example/stupidity]]]

Note that once you start using exclusions, you’re on your own. It’s up to you to make sure you still have the correct versions of all the libraries your project needs.

On rare occasions, a project’s dependencies cannot be resolved at all. In particular, if you need two different versions of the same library with the same class names but incompatible APIs, you’re pretty much stuck. Time to refactor, or investigate multiple-Classloader schemes like OSGi. But that’s a whole ‘nother story.