Blog

Parallel Processing with core.async

Update August 13, 2041: This approach may now be obsolete with the introduction of pipeline in core.async.

Update April 27, 2015: the date 2041 in the previous update is a typo, but updates from the future ties in nicely with the async theme so I decided to leave it in.

* * *

Say you have a bunch of items to process, and you want to parallelize the work across N threads. Using core.async, one obvious way to do this is to create N go blocks, all reading from the same input channel.
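
The code below omits the ns declarations; it assumes core.async and clojure.test are required roughly like this (the namespace name is illustrative):

;; Illustrative ns declaration; the namespace name is arbitrary.
(ns blog.parallel
  (:require [clojure.core.async :as async
             :refer [go go-loop thread chan to-chan timeout close!
                     <! >! <!! alts!]]
            [clojure.test :refer [deftest is]]))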

(defn parallel
  "Processes values from input channel in parallel on n 'go' blocks.

  Invokes f on values taken from input channel. Values returned from f
  are written on output channel.

  Returns a channel which will be closed when the input channel is
  closed and all operations have completed.

  Note: the order of outputs may not match the order of inputs."
  [n f input output]
  (let [tasks (doall
               (repeatedly n
                #(go-loop []
                   (let [in (<! input)]
                     (when-not (nil? in)
                       (let [out (f in)]
                         (when-not (nil? out)
                           (>! output out))
                         (recur)))))))]
    (go (doseq [task tasks]
          (<! task)))))
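
A quick usage sketch (the doubling function is only a placeholder; the output channel is buffered so the workers never park on their writes):

;; Hypothetical example: double ten numbers on up to 4 go blocks.
(let [input  (to-chan (range 10))
      output (chan 10)]                            ; buffered so workers never park on >!
  (<!! (parallel 4 (partial * 2) input output))    ; block until all work is done
  (close! output)
  (<!! (async/into [] output)))
;;=> [0 2 4 6 8 10 12 14 16 18]  (order may vary)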

This might create more go blocks than you need, but inactive go blocks don’t cost much except a little memory.

But this isn’t always ideal: if f is going to block or do I/O, then you might want to create a thread instead of a go. Threads are more expensive. Suppose you don’t know how quickly the inputs will arrive: you might end up creating more threads than you need.

What I typically want is to process things in parallel with as many threads as necessary, but at most N. If processing with two threads is fast enough to keep up with the input, then we should create only two threads. The same applies to other kinds of resources besides threads, network calls for example.

After many attempts, here is what I came up with:

(defn pmax
  "Process messages from input in parallel with at most max concurrent
  operations.

  Invokes f on values taken from input channel. f must return a
  channel, whose first value (if not closed) will be put on the output
  channel.

  Returns a channel which will be closed when the input channel is
  closed and all operations have completed.

  Creates new operations lazily: if processing can keep up with input,
  the number of parallel operations may be less than max.

  Note: the order of outputs may not match the order of inputs."
  [max f input output]
  (go-loop [tasks #{input}]
    (when (seq tasks)
      (let [[value task] (alts! (vec tasks))]
        (if (= task input)
          (if (nil? value)
            (recur (disj tasks task))  ; input is closed
            (recur (conj (if (= max (count tasks))  ; max - 1 tasks running
                           (disj tasks input)  ; temporarily stop reading input
                           tasks)
                         (f value))))
          ;; one processing task finished: continue reading input
          (do (when-not (nil? value) (>! output value))
              (recur (-> tasks (disj task) (conj input)))))))))

The function f is responsible for both processing the input and creating the response channel. So f could be a go, a thread, or something else that returns a channel, such as an asynchronous I/O operation. There’s a little bit of extra overhead to shuffle around data structures in this go-loop, but I’m assuming that the cost of processing inputs will dominate.

So how to test it? First a few helpers.

We want to make sure that the output channel doesn’t hold up anything else, so we’ll make a helper function to consume everything from it:

(defn sink
  "Returns an atom containing a vector. Consumes values from channel
  ch and conj's them into the atom."
  [ch]
  (let [a (atom [])]
    (go-loop []
      (let [val (<! ch)]
        (when-not (nil? val)
          (swap! a conj val)
          (recur))))
    a))

What we want to keep track of is how many parallel operations are running at any given time. We can have our “processing” function increment a counter when it starts, wait a random interval of time, then decrement the counter before returning.

My colleague @timbaldridge suggested a watch function to keep track of how high the counter gets. This will produce a record of how many tasks were active at any time during the test.

(defn watch-counter [counter thread-counts]
  (add-watch counter
             :thread-count
             (fn [_ _ _ thread-count]
               (swap! thread-counts conj thread-count))))

Here’s a test of pmax using a go block:

(deftest t-pmax-go
  (let [input (to-chan (range 50))
        output (chan)
        result (sink output)
        max-threads 5
        counter (atom 0)
        f (fn [x]
            (go
             (swap! counter inc)
             (<! (timeout (rand-int 100)))
             (swap! counter dec)
             x))
        thread-counts (atom [])]
    (watch-counter counter thread-counts)
    (<!! (pmax max-threads f input output))
    (is (= (set (range 50)) (set @result)))
    (is (every? #(<= % max-threads) @thread-counts))))

And the same test, using a thread:

(deftest t-pmax-thread
  (let [input (to-chan (range 50))
        output (chan)
        result (sink output)
        max-threads 5
        counter (atom 0)
        f (fn [x]
            (thread
             (swap! counter inc)
             (<!! (timeout (rand-int 100)))
             (swap! counter dec)
             x))
        thread-counts (atom [])]
    (watch-counter counter thread-counts)
    (<!! (pmax max-threads f input output))
    (is (= (set (range 50)) (set @result)))
    (is (every? #(<= % max-threads) @thread-counts))))

But what we really want to know is whether pmax avoids creating more threads than necessary when the input source is slower than the processing. Here’s that test, with a deliberately slow input channel:

(deftest t-pmax-slow-input
  (let [input (chan)
        output (chan)
        result (sink output)
        max-threads 5
        actual-needed-threads 3
        counter (atom 0)
        f (fn [x]
            (go
             (swap! counter inc)
             (<! (timeout (rand-int 100)))
             (swap! counter dec)
             x))
        thread-counts (atom [])]
    (watch-counter counter thread-counts)
    ;; Slow input:
    (go-loop [i 0]
      (if (< i 50)
        (do (<! (timeout 50))
            (>! input i)
            (recur (inc i)))
        (close! input)))
    (<!! (pmax max-threads f input output))
    (is (= (set (range 50)) (set @result)))
    (is (every? #(<= % actual-needed-threads) @thread-counts))))

This still isn’t suitable for every scenario: maybe each thread needs some expensive set-up before it can process inputs. Then you would need some more elaborate mechanism to keep track of how many threads you have and whether or not they are keeping up with the input.

I have released the code in this blog post under the MIT License. View the source code on GitHub.

Update January 1, 2014: As my colleague @craigandera pointed out, this code doesn’t do any error handling. It’s easy enough to add once you make a decision about how to handle errors: ignore, log, or abort the whole process.

Command-Line Intransigence

In the early days of Clojure, I was skeptical of Clojure-specific build tools like Lancet, Leiningen, and Cake. Why would Clojure, a part of the Java ecosystem, need its own build tool when there were already so many Java-based tools?

At the time, I thought Maven was the last word in build tooling. Early Leiningen felt like a thin wrapper around Maven for people with an inconsolable allergy to XML. Maven was the serious build tool, with a rich declarative model for describing dependency relationships among software artifacts. That model was imperfect, but it worked well enough to power one of the largest repositories of open-source software on the planet.

But things change. Leiningen has evolved rapidly. Maven has also evolved, but more slowly, and the promised non-XML POM syntax (“polyglot Maven”) has not materialized.

Meanwhile, I learned why everyone eventually hates Maven, through the experience of crafting custom Maven builds for two large-ish projects: the Clojure language and its contributed libraries. It was a challenge to satisfy the (often conflicting) requirements of developers, continuous integration, source repositories, and the public Maven repository network. Even with the help of Maven books from Sonatype, it took months of trial and error and nearly all my “open-source” time to get everything working.

At the end of this process I discovered, to my dismay, that I was the only one who understood it. As my colleague Stuart Halloway put it, “Maven breeds heroes.” For end-users and developers, there’s a nice interface: Clojure-contrib library authors can literally click a button to make a release. But behind that button are so many steps and moving parts (Git, Hudson, Maven, Nexus, GPG, and all the Maven plugins) that even I can barely remember how it all works. I never wanted to be the XML hero.

So I have come around to Leiningen, and even incorporate it into my Clojure development workflow. It’s had some bumps, as one might expect from a fast-moving open-source project with lots of contributors, but most of the time it does what I need and doesn’t get in the way.

What puzzles me, however, is the stubbornness of developers who want to do everything via Leiningen. Some days it seems like every new tool or development utility for Clojure comes wrapped up in a Leiningen plugin so it can be invoked at the command line. I don’t get it. When you have a Clojure REPL, why would you limit yourself to the UNIX shell?

I think this habit comes partly from scripting languages, which were born at the command line, and still live there to a great extent. But it puzzled me a bit even in Ruby: if it takes 3 seconds for rake to load your 5000-line Rails app, do you really want to use rake for critical administrative tasks like database migrations? IRB is not a REPL in the Lisp sense, but it’s a pretty good interactive shell. I’d rather work with a large Ruby app in IRB than via rake.

Start-up time remains a major concern for Leiningen, and its contributors have gone to great lengths (sometimes too far) to ameliorate it. Why not just avoid the problem altogether? Start Leiningen once and then work at the REPL. Admittedly, this takes some discipline and careful application design, but on my own projects I’ve gotten to the point where I only need to launch Leiningen once a day. Occasionally I make a mistake and get my application into such a borked state that the only remedy is restarting the JVM, but those occasions are rare.

I pretty much use Leiningen for just three things: 1) getting dependencies, 2) building JARs, and 3) launching REPLs. Once I have a REPL I can do my real work: running my application, testing, and profiling. The feedback cycles are faster and the debugging options much richer than what I can get on the command-line.

“Build plugins,” for Leiningen or Maven or any other tool, always suffer from running in a different environment from the code they are building. But isn’t one of the central tenets of Lisp that the compiler is part of your application? There isn’t really a sharp boundary between “build” code and “application” code. It’s all just code.

I used to write little “command-line interfaces” for running tests, builds, deployments, and so on. Now I’m more likely to just put those functions in a Clojure namespace and call them from the REPL. Sometimes I wonder: why not go further? Use Leiningen (or Maven, or Gradle, or whatever) just to download dependencies and bootstrap a REPL, then execute builds and releases from the REPL.
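
Concretely, that looks something like this (the namespace, regex, and task names are illustrative):

;; Hypothetical "tasks" namespace, called from the REPL instead of the shell.
(ns dev.tasks
  (:require [clojure.test :as test]))

(defn run-tests []
  ;; The regex matching test namespaces is illustrative.
  (test/run-all-tests #"com\.example\..*-test"))

(defn release []
  ;; ... build the JAR, tag it, deploy it ...
  )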

Lifecycle Composition

I’ve been thinking about how to build up software systems out of
stateful components. In past presentations and blog posts I’ve alluded
to a standard interface I use for starting and stopping stateful
components:

(defprotocol Lifecycle
  (start [this] "Begins operation of this component.")
  (stop [this] "Ceases operation of this component."))

Most of my Clojure programs follow the same basic structure: Each
subsystem or service is represented by a record which implements this
protocol. At the top, there is a “system” record which contains all
the other components.

I’ve gone through several versions of this protocol with the same
function names but slightly different semantics.

Side Effects, Mutable State

In the first version, start and stop were side-effecting
procedures which returned a future or promise:

(defprotocol Lifecycle
  (start [this]
    "Begins operation of this component. Asynchronous, returns a
  promise on which the caller can block to wait until the component is
  started.")
  (stop [this]
    "Ceases operation of this component. Asynchronous, returns a
  promise on which the caller can block to wait until the component is
  stopped."))

The calling code could dereference the future to block until the
service had successfully started. For example, a database-access
component might look like this:

(defrecord Database [uri connection-atom]
  Lifecycle
  (start [_]
    (future (reset! connection-atom (connect uri))))
  (stop [_]
    (.close @connection-atom)
    (future (reset! connection-atom nil))))

(defn database [uri]
  (->Database uri (atom nil)))

My idea was that multiple services could be started in parallel if
they didn’t depend on one another, but in practice I always ended up
blocking on every call to start:

(defrecord System [database scheduler web]
  Lifecycle
  (start [_]
    (future
      @(start database)
      @(start scheduler)
      @(start web)))
  (stop [_]
    (future
      @(stop web)
      @(stop scheduler)
      @(stop database))))

(defn system [database-uri]
  (let [database (database database-uri)
        scheduler (scheduler)
        web (web-server database)]
    (->System database scheduler web)))

Second Attempt

I decided to drop the requirement to return a promise from start/stop,
which meant that those functions became synchronous and had no return
value:

(defprotocol Lifecycle
  (start [this]
    "Begins operation of this component. Synchronous, does not return
  until the component is started.")
  (stop [this]
    "Ceases operation of this component. Synchronous, does not return
  until the component is stopped."))

This simplified the code calling start/stop, because I didn’t have to
worry about dereferencing any futures.

(defrecord System [database scheduler web]
  Lifecycle
  (start [_]
    (start database)
    (start scheduler)
    (start web))
  (stop [_]
    (stop web)
    (stop scheduler)
    (stop database)))

This also made it very clear that I was using start/stop only for
side-effects, forcing all of my components to contain mutable state.

Also, I had to manually place the calls to start/stop in the correct
order, to ensure that components were not started before other
components which depended on them.

Immutable Values

I decided to try to make the component objects more like immutable
values by redefining start/stop to return updated versions of the
components:

(defprotocol Lifecycle
  (start [this]
    "Begins operation of this component. Synchronous, does not return
  until the component is started. Returns an updated version of this
  component.")
  (stop [this]
    "Ceases operation of this component. Synchronous, does not return
  until the component is stopped. Returns an updated version of this
  component."))

In this version, start and stop feel more like functions. They are
still not pure functions, because they will have to execute
side-effects such as connecting to a database or opening a web server
port, but at least they return something meaningful.

This version has the added benefit of removing some mutable state, at
the cost of making the start/stop implementations slightly more
complicated. Now these functions have to return a new instance of the
component record:

(defrecord Database [uri connection]
  Lifecycle
  (start [this]
    (assoc this :connection (connect uri)))
  (stop [this]
    (.close connection)
    (assoc this :connection nil)))

(defn database [uri]
  (->Database uri nil))

One interesting feature of this pattern is that the system record can
reduce over its own keys to start/stop all the components:

(defrecord System [database scheduler web]
  Lifecycle
  (start [this]
    (reduce (fn [system key]
              (update-in system [key] start))
            this
            ;; Keys are returned in the order they were declared.
            (keys this)))
  (stop [this]
    (reduce (fn [system key]
              (update-in system [key] stop))
            this
            ;; Reverse the order to stop.
            (reverse (keys this)))))

However, this relies on implementation behavior of Clojure records:
the keys function will return the keys in the order they are
declared in the record. I’m reluctant to rely on that undocumented
behavior, so instead I’ll declare the ordering explicitly:

(def component-order
  [:database :scheduler :web])

(defrecord System [database scheduler web]
  Lifecycle
  (start [this]
    (reduce (fn [system key]
              (update-in system [key] start))
            this
            component-order))
  (stop [this]
    (reduce (fn [system key]
              (update-in system [key] stop))
            this
            ;; Reverse the order to stop.
            (reverse component-order))))

Dependency Order

I still don’t have a good solution for specifying the order in which
components must be started/stopped. I’ve tried building a graph of
dependencies and computing the correct order. This would be similar to
what tools.namespace does with namespaces. But I haven’t been able to
come up with a good syntax for representing these relationships that
isn’t more cumbersome than just declaring them in order.
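
To illustrate the general idea, here is roughly the kind of thing I
mean (only a sketch: the dependency map and the naive topological
sort are hypothetical, and it is arguably no less cumbersome than
declaring the order by hand):

;; Hypothetical dependency map: each component names the components
;; it depends on.
(def dependencies
  {:database  #{}
   :scheduler #{}
   :web       #{:database}})

(defn start-order
  "Returns component keys ordered so that each component appears after
  the components it depends on. Naive: loops forever on a cycle."
  [deps]
  (loop [order [] remaining deps]
    (if (empty? remaining)
      order
      (let [ready (for [[k ds] remaining
                        :when (every? (set order) ds)]
                    k)]
        (recur (into order ready)
               (apply dissoc remaining ready))))))

;; (start-order dependencies)           => [:database :scheduler :web]
;; (reverse (start-order dependencies)) gives the stop order.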

I’ve also tried using Prismatic’s Graph library or my own Flow library
to define the graph of dependency relationships, but neither of those
libraries can produce a structure that remembers the graph after it
has computed its output, so I have no way to recover the dependency
relationships after constructing the system object.

Dependency Injection and State

This technique is a form of dependency injection through constructors.
The choice of whether to make the individual components mutable,
stateful objects has an impact on how I can use them later on. In the
original version of this pattern using mutable objects, each component
gets stable references to other components it depends on. In the later
version using immutable data structures, each component gets
references to the constructed versions of other components it
depends on, but not the started versions of those components,
i.e. the values returned from start.

So far, this has not been a problem in the programs I write. For
example, a Datomic database connection is always recoverable from the
URI, so I don’t need to store it explicitly. But other components,
particularly components which rely on external state to function
properly, might need to be mutable so that their dependents can still
use them via the references they received in their constructors. I
could still have start and stop return new values, but they would
also have to modify some mutable state (such as a Ref or Atom) along
the way. As always, mutable objects muddy the distinction between
values and identities.

I’ve also experimented with variations of start and stop that
pass in other “started” components, but this was cumbersome and hard
to generalize through a single interface.

So I don’t have a perfect system. It works well enough for the
applications I’ve developed with it so far, and it facilitates my
REPL-driven development workflow, but I always have to adapt to
circumstance. Eliminating mutable state is generally a good thing,
but it can also be limiting, especially when you have to deal with
external state.

The Amateur Problem

We have a problem. We are professional software developers who work with open-source software. The problem is that we are in the minority. Most open-source software is written by amateurs.

Every time a hot new technology comes on the scene, developers flock to it like ants to a picnic. Those early adopters are, by definition, people for whom choosing a new technology is less risky. Which means, mostly, that their work doesn’t really matter. Students, hobbyists, “personal” projects: nobody’s life or career is on the line. It doesn’t matter if the program is entirely correct, efficient, or scalable. It doesn’t matter if it ignores lots of edge cases.

I’ve been one of those amateurs. It’s fun. New technologies need amateurs. But as a technology matures, it attracts professionals with real jobs who do care about those details. And those professionals are immediately confronted with a world of open-source software written by amateurs.

I used to write code for myself. Since I started getting paid to write code for other people, I’ve become wary of code written by people writing for themselves. Every time I see a README that begins “X is a dead simple way to do Y,” I shudder. Nothing in software is simple. “Dead simple” tells me the author has “simplified” by “deadening” vast swaths of the problem space, either by making unfounded assumptions or by ignoring them completely.

We like to carp about “bloated” APIs in “mainstream” languages like Java. Truly, lots of APIs are more complicated than they need to be. But just because an API is big doesn’t mean it’s bloated. I like big APIs: they show me that someone has thought about, and probably encountered, all of the edges and corners in the problem space.

Simplifying assumptions do not belong in libraries; they belong in applications, where you know the boundaries of the problem space. On rare occasions, the ground of one problem is trod often enough to warrant a framework. Emphasis on rare. A framework is almost always unnecessary, and, in these days of rapidly-changing technological capabilities, likely to be obsolete before it’s finished.

Frameworks written by amateurs are the worst of the worst: brittle constructs that assume everything in service of one or two “dead simple” demos but collapse under the weight of a real-world application.

I don’t want to be a code snob. Let’s be amateurs. Let’s have fun. Explore. Learn. Publish code as we go. Show a solution to a problem without assuming it’s the solution. Be cognizant of and vocal about what assumptions we’re making. Don’t call something a library unless it really attempts to reach every nook and cranny of the problem space.

And don’t write frameworks. Ever. ;)

Update August 8, 2013: Based on the comments, I feel like too many people have gotten hung up on the words amateur and professional. Those were just convenient labels which I found amusing. The important difference is between “easy” single-purpose code and thorough, general-purpose code.

On the Perils of Dynamic Scope

Common Lisp the Language (CLtL) devotes an entire chapter to the subject of Scope and Extent. It defines scope as the textual region of a program in which an entity may be used, where “entity” could be a symbol, a value, or something more abstract like a variable binding.

So scope is about where you can use something. CLtL defines two different kinds of scope:

Lexical scope is usually the body of a single expression like let or defun.

Indefinite scope is everything else, effectively the global set of symbols that exist in a program.

In contrast, extent is about when: it’s the interval of time during which something may be used. CLtL defines two different kinds of extent:

Dynamic extent refers to things that exist for a fixed period of time and are explicitly “destroyed” at the end of that period, usually when control returns to the code that created the thing.

Indefinite extent is everything else: things that get created, passed around, and eventually garbage-collected.

In any language with a garbage collector, most things have indefinite extent. You can create strings, lists, or hash tables and pass them around with impunity. When the garbage collector determines that you are done using something, it reclaims the memory. The process is, in most cases, completely transparent to you.

But what about the so-called dynamic scope? The authors of CLtL have this to say:

The term “dynamic scope” is a misnomer. Nevertheless it is both traditional and useful.

They also define “dynamic scope” to be the combination of indefinite scope and dynamic extent. That is, things with dynamic scope are valid in any place in a program, but only for a limited time. In Common Lisp, these are called “special variables,” and are created with the macros defparameter and defvar.

Vars

So what does this have to do with Clojure? Clojure has these things called Vars. Every time you write def or defn or one of its variants in Clojure, you’re creating a Var.

Vars have indefinite scope: no matter where you def a Var, it’s visible everywhere in the program.1

Vars usually have indefinite extent as well. Usually. This is where things get tricky. Clojure, unlike Common Lisp, was designed for multi-threaded programs.2 The meaning of extent gets a lot muddier in the face of multiple threads. Each thread has its own timeline, its own view of “now” which may or may not conform to any other thread’s view.

In Clojure versions 1.2 and earlier, all Vars had dynamic scope by default, but this meant that there was a performance cost to look up the current dynamic binding of a Var on every function call. Leading up to 1.3, Rich Hickey experimented with allowing Vars to be declared ^:static, before settling on static by default with ^:dynamic as an option. You can still find ^:static declarations littered through the Clojure source code. Maybe someday they’ll be useful again.
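
To make the terminology concrete, here is a minimal illustration (the names are mine): within the binding form the Var has the thread-bound value, and outside it the root value is back in effect.

;; Illustrative names, not from any library.
(def ^:dynamic *log-level* :info)

(defn log [msg]
  (println (name *log-level*) msg))

(binding [*log-level* :debug]
  (log "inside the binding"))  ; prints "debug inside the binding"

(log "outside the binding")    ; prints "info outside the binding"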

The definition of “dynamic scope” in Clojure is even fuzzier than it is in Common Lisp. How do we define “extent” in the face of multiple threads, each potentially with its own thread-local binding? If a resource can be shared across multiple threads, we have to coordinate the cleanup of that resource. For example, if I open a socket and then hand it off to another piece of code, who is responsible for closing the socket?

Resource management is one concurrency bugaboo that Clojure developers have not managed to crack. Various attempts have been made: you can see the artifacts in wiki and mailing list discussions of Resource Scopes. So far, no solution has been found that doesn’t just shift the problem somewhere else.

It ends up looking a bit like garbage collection: how do I track the path of every resource used by my program and ensure that it gets cleaned up at the appropriate time? But it’s even harder than that, because resources like file handles and sockets are much scarcer than memory: they need to be reclaimed as soon as possible. In a modern runtime like the JVM, garbage collection is stochastic: there’s no guarantee that it will happen at any particular time, or even that it will happen at all.

To make matters worse, Clojure has laziness to contend with. It’s entirely possible to obtain a resource, start consuming it via a lazy sequence, and never finish consuming it.

The Wrong Solution

This brings me to one of my top anti-patterns in Clojure: the Dynamically-Scoped Singleton Resource (DSSR).

The DSSR is popular in libraries that depend on some external resource such as a socket, file, or database connection. It typically looks like this:

(ns com.example.library)

(def ^:dynamic *resource*)

(defn- internal-procedure []
  ;; ... uses *resource* ...
  )

(defn public-api-function [arg]
  ;; ... calls internal-procedure ...
  )

That is, there is a single dynamic Var holding the “resource” on which the rest of the API operates. The DSSR is often accompanied by a with-* macro:

(defmacro with-resource [src & body]
  `(binding [*resource* (acquire ~src)]
     (try ~@body
       (finally
         (dispose *resource*)))))

This looks harmless enough. It’s practically a carbon copy of Clojure’s with-open macro, and it ensures that the resource will get cleaned up even if body throws an exception.

The problem with this pattern, especially in libraries, is the constraints it imposes on any code that wants to use the library. The with-resource macro severely constrains what you can do in the body:

You can’t dispatch to another thread. Say goodbye to Agents, Futures, thread pools, non-blocking I/O, or any other kind of asynchrony. The resource is only valid on the current thread.3

You can’t return a lazy sequence backed by the resource because the resource will be destroyed as soon as body returns.

You can’t have more than one resource at a time. Hence the “singleton” in the name of this pattern. Using a thread-bound Var throughout the API means that you can never operate on more than one instance of the resource in a single thread. Lots of apps need to work with multiple databases, which really sucks with this kind of library.

The last problem with this pattern is a more subtle one: hidden dependencies. The public API functions, which have global scope, depend on the state (thread-local binding) of another Var with global scope. This dependency isn’t explicitly stated anywhere in the definition of those functions. That might not seem like such a big deal in small examples, and it isn’t. But as programs (and development teams) grow larger, it’s one additional piece of implicit knowledge that you have to keep in your head. If there are seventeen layers of function calls between the resource binding and its usage, how certain are you going to be that the resource has the right extent?

Friends Don’t Let Friends Use Dynamic Scope

The alternative is easy: don’t do it. Don’t try to “solve” resource management in every library.

By all means, provide the functions to acquire and dispose of resources, but then let the application programmer decide what to do with them. Define API functions to take the resource as an argument.
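
Sketched out, that shape looks something like this (acquire, dispose, and the function names are illustrative):

;; acquire/dispose and the names are illustrative; src is whatever the
;; library needs in order to open the resource.
(defn public-api-function [resource arg]
  ;; ... uses resource directly, no dynamic Var ...
  )

;; The application decides the extent of the resource:
(let [resource (acquire src)]
  (try
    (public-api-function resource :some-arg)
    (finally
      (dispose resource))))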

Applications can manage their own resources, and only the application programmer knows what the extent of those resources should be. Maybe you can pass it around as a value. Maybe you want to use dynamic binding after all. Maybe you want to stash it in a global state Var.4 That’s for you to decide.

Datomic is a good example to follow: it creates connection objects that have a lot of state attached to them — sockets, queues, and threads. But it says nothing about how you should manage the extent of those connections.5 Most functions in the Datomic API take either a connection or a database (a value obtained from the connection) as an argument.

Safe Dynamic Scope

So dynamic scope is totally evil, right? Not totally. There are situations where dynamic scope can be helpful without causing the cascade of problems I described above.

Remember that dynamic scope in Clojure is really thread-local binding. Therefore, it’s best suited to operations that are confined to a single thread. There are plenty of examples of this: most popular algorithms are single-threaded, after all. Consider the classic recursive-descent parser: you start with one function call at the top and you’re not done until that function returns. The entire operation happens on a single thread, in a single call stack. It has dynamic extent.

I took advantage of this fact in a Clojure JSON parser. There were a number of control flags that I needed to make available to all the functions. Rather than pass around extra arguments all over the place, I created private dynamic Vars to hold them. Those Vars get bound in the entry-point to the parser, based on options passed in as arguments to the public API function. The thread-local state never leaks out of the initial function call.
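
The shape of it is roughly this (the Var and option names here are illustrative, not the parser’s actual internals):

;; Illustrative names, not the parser's actual internals.
(def ^:private ^:dynamic *keywordize-keys* false)

(defn- parse-value [input]
  ;; ... consults *keywordize-keys* while parsing ...
  )

(defn parse-string
  "Public entry point: binds the control flags for the duration of the
  parse, on this thread only."
  [s & {:keys [keywordize-keys] :or {keywordize-keys false}}]
  (binding [*keywordize-keys* keywordize-keys]
    (parse-value s)))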

As another example, the Clojure compiler, although written in Java, uses dynamic Vars to keep track of internal state.

And what about our friend with-open? I said that the example with-resource macro was nearly a copy of it, but only nearly. clojure.core/with-open creates lexical (i.e. local) bindings. It still suffers from some limitations around what you can do in the body, but at least it doesn’t limit you to one resource at a time.
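
For comparison, with-open binds each resource to a lexical name, so several can coexist and each is closed when the body exits (the file names are placeholders):

(require '[clojure.java.io :as io])

;; The file names are placeholders.
(with-open [in  (io/reader "a.txt")
            out (io/writer "b.txt")]
  (io/copy in out))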

Global state is the zombie in the closet of every Clojure program, about which I’ll have more to say in future posts. For now, I hope I’ve convinced you that dynamic scope is easily abused and has a lot of unintended consequences.

Footnotes:

1 Technically, a Var is visible only after the point at which it was defined. This is significant with regard to the order of definitions, but Vars are still globally visible once they have been defined.

2 CLtL has this note embedded in the chapter on scope and extent: “Behind the assertion that dynamic extents nest properly is the assumption that there is only a single program or process. Common Lisp does not address the problems of multiprogramming (timesharing) or multiprocessing (more than one active processor) within a single Lisp environment.” Modern Common Lisp implementations have added multi-threading, but it remains absent from the language specification.

3 You can use bound-fn to capture the bindings and pass them to another thread, but you still have the problem that the resource may be destroyed before the other thread is finished with it.

4 Not recommended, to be discussed in a future post.

5 There’s some caching of connection objects under the hood, but this is not relevant to the consumer.

Affordance and Concision

Quick, Clojure programmers, what does the following expression do?

(get x k)

If you answered, It looks up the key k in an associative data structure x and returns its associated value, you’re right, but only partially.

What if x is not an associative data structure? In every released version of Clojure up to and including 1.5.0, get will return nil in that case.

Is that a bug or a feature? It can certainly lead to some hard-to-find bugs, such as this one which I’ve often found in my own code:

(def person (ref {:name "Stuart" :job "Programmer"}))

(get person :name)
;;=> nil

Spot the bug? person is not a map but rather a Ref whose state is a map. I should have written (get @person :name). One character between triumph and defeat! To make matters worse, that nil might not show up until it triggers a NullPointerException several pages of code later.

It turns out that several core functions in Clojure behave this way: if called on an object which does not implement the correct interface, they return nil rather than throwing an exception.

The contains? function is a more bothersome example. Not only is the name difficult to remember — it’s an associative function that checks for keys, not a linear search of values like java.util.Collection#contains — but it also returns false on objects which do not implement clojure.lang.Associative. Or at least it did up through Clojure 1.4.0. I submitted a patch (CLJ-932), included in Clojure 1.5.0, which changed contains? to throw an exception instead.[1]

I submitted a similar patch (CLJ-1107) to do the same thing for get, but not in time for consideration in the 1.5.0 release.

A few weeks later, I was writing some code that looked like this:

(defn my-type [x]
  (or (get x :my-namespace/type)
      (get (meta x) :my-namespace/type)
      (get x :type)
      (clojure.core/type x)))

I wanted a flexible definition of “type” which worked on maps or records with different possible keys, falling back on the clojure.core/type function, which looks for a :type key in metadata before falling back to clojure.core/class.

Before the patch to get in CLJ-1107, this code works perfectly well. After the patch, it won’t. I would have to write this instead:

(defn my-type [x]
  (or (when (associative? x)
        (get x :my-namespace/type))
      (get (meta x) :my-namespace/type)
      (when (associative? x)
        (get x :type))
      (clojure.core/type x)))

But wait! The meta function also returns nil for objects which do not support metadata. Maybe that should be “fixed” too. Then I would have to write this:

(defn my-type [x]
  (or (when (associative? x)
        (get x :my-namespace/type))
      (when (instance? clojure.lang.IMeta x)
        (get (meta x) :my-namespace/type))
      (when (associative? x)
        (get x :type))
      (clojure.core/type x)))

And so on.

Every language decision means trade-offs. Clojure accepts nil as a logical false value in boolean contexts, like Common Lisp (and also many scripting languages). This “nil punning” enables a concise style in which nil stands in for an empty collection or missing data.[2] For example, Clojure 1.5.0 introduces two new macros some-> and some->>, which keep evaluating expressions until one of them returns nil.
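
To see the effect: some-> stops threading as soon as an expression returns nil, so the final inc is never reached on missing data:

(some-> {:a {:b 41}} :a :b inc)  ;;=> 42
(some-> {:a {:b 41}} :a :c inc)  ;;=> nil, inc is never reached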

Is Clojure’s get wrong? It depends on what you think get should mean. If you’re a fan of more strictly-typed functional languages you might think get should be defined to return an instance of the Maybe monad:

;; made-up syntax:
get [Associative[K,V], K] -> Maybe[V]

You can implement the Maybe monad in Clojure, but there’s less motivation to do so without the support of a static type checker. You could also argue that, since Clojure is dynamically-typed, get can have a more general type:

;; made-up syntax:
get [Any, Any] -> Any | nil

This latter definition is effectively the type of get in Clojure right now.

Which form is better is a matter of taste. What I do know is that the current behavior of get doesn’t give much affordance to a Clojure programmer, even an experienced one.[3]

Again, tradeoffs. Clojure’s definition of get is flexible but can lead to subtle bugs. The stricter version would be safer but less flexible.

An even stricter version of get would throw an exception if the key is not present instead of returning nil. Sometimes that’s what you want. The Simulant testing framework defines a utility function getx that does just that.
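
Something along these lines (a sketch of the idea, not necessarily Simulant’s exact definition):

;; Sketch only; not necessarily Simulant's exact definition.
(defn getx
  "Like two-argument get, but throws if the key is not present."
  [m k]
  (let [v (get m k ::not-found)]
    (if (= ::not-found v)
      (throw (ex-info "Missing required key" {:map m :key k}))
      v)))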

Over the past five years, Rich Hickey has gradually led Clojure in the direction of “fast but correct by default.” This is particularly evident in the numeric primitives since release 1.3.0, which throw an exception on overflow (correct) but do not automatically promote from fixed- to arbitrary-precision types (fast).
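
For example, with the default operators versus their auto-promoting variants:

(+ Long/MAX_VALUE 1)
;; ArithmeticException: integer overflow

(+' Long/MAX_VALUE 1)
;;=> 9223372036854775808N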

I believe the change to get in CLJ-1107 will ultimately be more help than hindrance. But it might also be useful to have a function which retains the “more dynamic” behavior. We might call it get' in the manner of the auto-promoting arithmetic functions such as +'. Or perhaps, with some cleverness, we could define a higher order function that transforms any function into a function that returns nil when called on a type it does not support. This would be similar in spirit to fnil but harder to define.[4]

Update #1: changed (instance? x clojure.lang.Associative) to (associative? x), suggested by Luke VanderHart.

Update #2: Some readers have pointed out that I could make my-type polymorphic, thereby avoiding the conditional checks. But that would be even longer and, in my opinion, more complicated than the conditional version. The get function is already polymorphic, a fact which I exploited in the original definition of my-type. It’s a contrived example anyway, not a cogent design.

Footnotes:

[1] We can’t do anything about the name of contains? without breaking a lot more code. This change, at least, is unlikely to break any code that wasn’t already broken.

[2] There’s a cute poem about nil-punning in Common Lisp versus Scheme or T.

[3] I am slightly abusing the definition of affordance here, but I think it works to convey what I mean: the implementation of get in the Clojure runtime does not help me to write my code correctly.

[4] I don’t actually know how to do it without catching IllegalArgumentException, which would be bad for performance and potentially too broad. Left as an exercise for the reader!

A Brief Rant About Versioning

Version numbers are meaningless. By that, I mean they convey no useful information. Oh sure, there are conventions: major.minor.patch, even/odd for stable/development versions, and designations like release candidate. But they’re just conventions. Version numbers are chosen by people, so they are subject to all the idiosyncrasies and whims of individuals.

Semantic Versioning, you say? Pshaw. Nobody does semantic versioning. If they did, we’d see dozens of libraries and applications with major-version numbers in the double or triple digits. It’s almost impossible to change software without breaking something. Even a change which is technically a bugfix can easily break a downstream consumer that relied, intentionally or not, on the buggy behavior.

That’s not to say you shouldn’t try to follow semantic versioning. It’s a good idea, and even its author admits that some versioning decisions boil down to Use your best judgment.

The trouble with semantic versioning is that everyone wants others to follow it, but no one wants to follow it themselves. Everyone thinks there’s room for one more quick fix, or this change isn’t big enough to warrant a major-version bump, or simply my project is special. It’s a slippery slope from there to redesigning your entire API between versions 2.7.4-RC13 and 2.7.4-RC14.

Everybody does it. I could name names, but that would be redundant. I’m not sitting in a glass house here, either. I caught major flack for breaking the API of a JSON parser – a JSON parser! – between two 0.x releases. People don’t like change, even improvements, if it means the tiniest bit more work for them. Even if the new API is cleaner and more logical, even if you change things that were never explicitly promised by the old API, there will be grumbles and calls for your resignation. It’s enough to make you want to stop releasing things altogether, or to throw up your hands and just number all your releases sequentially, or to go totally off the reservation and have your version numbers converge towards an irrational constant.

Did I mention this was a rant? Please don’t take it too seriously.

The Reluctant Dictator

I have a confession to make. I’m bad at open-source. Not writing the code. I’m pretty good at that. I can even write pretty good documentation. I’m bad at all the rest: patches, mailing lists, chat rooms, bug reports, and anything else that might fall under the heading of “community.” I’m more than bad at it: I don’t like doing it and generally try to avoid it.

I write software to scratch an itch. I release it as open-source in the vague hope that someone else might find it useful. But once I’ve scratched the itch, I’m no longer interested. I don’t want to found a “community” or try to herd a bunch of belligerent, independent-minded cats. I’m not in it for the money. I’m not even in it for the fame and recognition. (OK, maybe a little bit for the fame.)

But this age of “social” insists that everything be a community. Deodorant brands beg us to “like” their Facebook pages and advertising campaigns come accessorized with Twitter hash tags. In software, you can’t just release a bit of code as open-source. You have to create a Google Group and a blog and an IRC channel and a novelty Twitter account too.

The infrastructure of “social coding” has codified this trend into an expectation that every piece of open-source software participate in a world-wide collaboration / popularity contest. The only feature of GitHub that can’t be turned off is the pull request.

Don’t get me wrong, I love GitHub and use it every day. On work projects, I find pull requests to be an efficient tool for doing code reviews. GitHub’s collaboration tools are great when you’re only trying to collaborate with a handful of people, all of whom are working towards a common, mutually-understood goal.

But when it comes to open-source work, I use GitHub primarily as a hosting platform.[1] I put code on GitHub because I want people to be able to find it, and use it if it helps them. I want them to fork it, fix it, and improve it. But I don’t want to be bothered with it. If you added something new to my code, great! It’s open-source – have at it!

I’m puzzled by people who write to me saying, “If I were to write a patch for your library X to make it do Y, would you accept it?” First of all, you don’t need my or anybody else’s permission to modify my code. That’s the whole point of open-source! Secondly, how can I decide whether or not I’ll accept a patch I haven’t seen yet? Finally, if you do decide to send me a pull request, please don’t be offended if I don’t accept it, or if I ignore it for six months and then take the idea and rewrite it myself.

Why didn’t I accept your pull request? Not because I want to hog all the glory for myself. Not because I want to keep you out of my exclusive open-source masters’ club. Not even because I can find any technical fault with your implementation. I’ve just got other things to do, other itches to scratch.

If everyone thought that way, would open-source still work? Probably. Maybe not as well.

To be sure, there’s a big difference between one-off utilities written in a weekend and major projects sustained for years by well-funded organizations. Managing a world-wide collaborative open-source project is a full-time job. The benevolent-dictator-for-life needs an equally-benevolent corporate-sponsor-for-life.[2] You can’t expect the same kind of support from individuals working in their spare time, for free.

I sometimes dream of an open-source collaboration model that is truly pull-based instead of GitHub’s they-should-have-called-it-push request. I don’t want to be forced to look at anything on any particular schedule. Don’t give me “notifications” or send me email. Instead, and only when I ask for it, allow me to browse the network of forks spawned by my code. Let me see who copied it, how they used it, and how they modified it. Be explicit about who owns the modifications and under what terms I can copy them back into my own project. And not just direct forks — show me all the places where my code was copied-and-pasted too.

Imagine if you could free open-source developers from all the time spent on mailing lists, IRC, bug trackers, wikis, pull requests, comment threads, and patches and channel all that energy into solving problems. Who knows? We might even solve the hard problems, like dependency management.

Update Jan 17, 8:52am EST: I should mention that I have nothing but admiration and respect for people who are good at the organizational/community aspects of open-source software. I’m just not one of them.

Footnotes:

[1] I’m not the only one. Linus Torvalds famously pointed out flaws in the GitHub pull-request model, in particular its poor support for more rigorous submission/signoff processes.

[2] Even with a cushy corporate sponsor, accepting patches is far more work than the authors of those patches typically realize. See The story with #guava and your patches.

Playing the Obstacle

When I was in acting school (yes, I was in acting school, see my bio) one of my teachers had an expression: playing the obstacle. When studying for a role, one of an actor’s most important jobs is to determine the character’s overall objective: What’s my motivation? The plot of any play or movie typically centers around how the character overcomes obstacles to achieve that objective.

What my teacher had noticed was a tendency of young actors to focus too much on the obstacles themselves, attempting to build characters out of what they can’t do rather than what they want to do.

I think there’s a similar tendency in programmers. We start out with a clear objective, but when we encounter an obstacle to that objective we obsess over it. How many times has a programmer said, “I wanted to do X, but I couldn’t because Y got in the way,” followed by a 10-minute rant about how much language / framework / library / tool Y sucks? That’s playing the obstacle.

If you’re lucky enough to make software that real people (not programmers) actually use, then Y is irrelevant. No one cares how many ugly hacks you had to put in to make Y do something it wasn’t quite designed to do. All that matters is X.