Open-source Bundling

Cast your mind back to the halcyon days of the late ’90s. Windows 95/98. Internet Explorer 4. Before you laugh, consider that IE4 included some pretty cutting-edge technology for the time: Dynamic HTML, TLS 1.0, single sign-on, streaming media, and “Channels” before RSS. IE4 even pioneered — unsuccessfully — the idea of “web browser as operating system” a decade before Google Apps.

But if you remember anything about IE in the ’90s, it’s probably the word bundling. United States v. Microsoft centered on the tight integration of IE with Windows. If you had Windows, you had to have IE. By the time the lawsuit reached a settlement, IE was entrenched as the dominant browser.

Fast forward to the present. What an enlightened age we live in. Open-source has won and the browser market has fragmented. Firefox broke the IE hegemony, and Chrome killed it. The web browser really is an operating system.

But if you look around at software today, “bundling” is still with us, even in open-source software, that champion of choice and touchstone of tinkering.

To take an example (and to get the taste of IE out of your brain) let’s look at Hystrix, a Java fault-tolerance framework written at Netflix. First let me say that Hystrix is a fantastic piece of engineering. Netflix has given a great gift to the open-source community by releasing, for free, an essential part of their software infrastructure. I’ve learned a lot by studying the Hystrix documentation and source code.

But if you want to use Hystrix in your application, you have to use RxJava and Netflix’s Archaius configuration management framework. Via transitive dependencies, you also have to use Google’s Guava, the Jackson JSON processor, SLF4J, and Apache’s Commons Configuration, Commons Lang, and Commons Logging. For those of you keeping score at home, that’s two different logging APIs, two configuration APIs, and two grab-bag “utility” libraries.

There’s nothing wrong with these library choices. They may be suitable for your application or they may not. But either way, you don’t get a choice. If you want Hystrix, you have to have RxJava and all the rest. Even if you choose to ignore, say, Archaius, it’s still there, linked into your application code, with whatever bugs and security holes it might carry.

I don’t mean to pick on Netflix here either. As I said, Hystrix is a fantastic piece of engineering, and I’m very happy that Netflix released it. But it points to a mismatch between the goals of “internal-use” software and “open-source” software.

If you’re developing a tool or library for internal use within an organization, it makes sense to integrate closely with other software internal to that organization. It saves time, reduces development effort, and makes the software organization more efficient. When software is tightly integrated, each new tool or library multiplies the value of all the other software which came before it. That’s how technology companies like Netflix or Google can deliver consistently high-quality products and rapid innovation at scale.

The downside to this approach, from the open-source point of view, is that each new tool or library released by a software organization tends to be tightly coupled to the software which preceded it. More dependencies mean more opportunities for bugs, security holes, and misconfiguration. For the application developer using open-source libraries, each new dependency multiplies the cost of development and maintenance.

It’s not just corporate-sponsored open source that suffers from this problem — just look at the dependency tree of any Apache project.

The root problem is that great, hairy Minotaur which stalks the labyrinthine passages of any large code base: cross-cutting concerns. Almost any piece of code in an application will need, at some point, to deal with at least some of:

  • Logging
  • Configuration
  • Error handling & recovery
  • Process/thread management
  • Resource management
  • Startup/shutdown
  • Network communication
  • Filesystems
  • Data persistence
  • (De)serialization
  • Caching
  • Internationalization/translation
  • Build/provisioning/deployment

It’s much easier to write code if you know how each of these cross-cutting concerns will be handled. So when you’re developing something in-house, obviously you use the tools and libraries your organization has standardized on. Even if you’re writing something which you plan to make open-source, it’s easier to rely on the tools and patterns you already know.

It’s difficult to avoid coupling library code to one or more of these concerns. Take logging, for example. Java has had a built-in logging framework since 1.4. But many developers preferred Log4j or one of a handful of others. To avoid coupling libraries to a single logging framework, there is Apache Commons Logging, which tries to abstract over different logging frameworks with clever class-loading tricks. That turned out to be a brittle solution, so we got SLF4J, which puts responsibility for linking the correct logging APIs back in the hands of the application developer. But no one wants to take an entire day to slog through the SLF4J manual in the middle of building an application. Throw in the mysterious interactions of transitive dependencies in Maven-style build tools, and it’s no wonder every Java app starts up with an error message about logging. And logging is the easy case — most programmers could probably agree on what, broadly speaking, a logging framework needs to do. But still we have half a dozen widely-used, slightly-different logging APIs.

Developing a library which avoids making decisions about cross-cutting concerns is possible, but it takes painstaking attention to detail, with lots of extra extension points. (See Chris Houser’s talk on Exception Handling for an example.) Unfortunately, the resulting library is often less-than-satisfying to potential users because it has so many “holes” that need to be filled in. Who wants to spend half a day writing “glue” code and callbacks before you can even try out a new library? Busy application developers have an incentive to choose libraries that work “out of the box,” so library creators have an incentive to make arbitrary decisions about cross-cutting concerns. We justify this with the oxymoron “sensible defaults.”

The conclusion I draw from all this is that modern programming languages have succeeded at making software out of reusable parts, but have largely failed at making software out of interchangeable parts. You cannot just “swap in,” say, a different thread-management library. Hystrix itself exists to solve a problem with libraries and cross-cutting concerns in a services architecture. Quoting from the Hystrix docs:

Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point. If the host application is not isolated from these external failures, it risks being taken down with them.

These issues are exacerbated when network access is performed through a third-party client — a “black box” where implementation details are hidden and can change at any time, and network or resource configurations are different for each client library and often difficult to monitor and change.

Even worse are transitive dependencies that perform potentially expensive or fault-prone network calls without being explicitly invoked by the application.

Netflix has so many “API client” libraries, each making their own network calls with unpredictable behavior, that to make their systems robust they have to isolate each library in its own thread pool. Again, this is amazing engineering, but it was necessary precisely because too many libraries came bundled with their own networking, error handling, and resource management decisions.

A robust solution would seem to require everyone to agree on standards for every possible cross-cutting concern. That will obviously never happen. Even a so-called batteries-included language cannot keep the same batteries forever. This is a hard problem, and like all truly hard problems in software, it’s more about people than code.

I wish I had a perfect solution, but the best I can offer is some guidance. If you’re writing an open-source library, do everything in your power to avoid dependencies. Use only the features of the core language, and use those conservatively. Don’t pull in a library that deals with some cross-cutting concern just because it might be more convenient for your users. Build your API around plain functions and standard data structures.

Some examples, specific to Clojure:

  • Don’t depend on a logging framework unless it’s SLF4J.

  • Don’t use an error-handling framework: Throw ex-info with enough data for a handler to decide what to do.

  • If you need to do something asynchronous, use callbacks instead of core.async. Callbacks are easily integrated with core.async if that’s what the user wants to do. Likewise, if you need some kind of inversion of control, use function callbacks or protocols.

  • Don’t depend on any state-management framework or “ambient” state. Pass everything needed by an API function in its arguments. Provide operations for resource initialization and termination as part of your API. Same for configuration: pass a Clojure map as an argument.

  • Network communication and serialization: these are, admittedly, almost impossible to avoid if you’re writing a library for some network API. But you can at least give users the option of controlling their own networking by providing APIs to prepare requests and parse responses independently of making the actual network calls.
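
The same shape works outside Clojure too. As a rough Java sketch of the last few points, assume a hypothetical API client called WeatherClient (every name, URL, and the "city=temperature" response format here is invented for illustration). Configuration arrives as a map argument, preparing a request and parsing a response are plain functions that never touch the network, and the caller supplies both the transport and the callback.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical client library: all names are illustrative, not a real API.
final class WeatherClient {
    private WeatherClient() {}

    // Configuration is an ordinary map passed as an argument, not ambient state.
    // Returns a plain-data description of the request; no network I/O happens here.
    static Map<String, String> prepareRequest(Map<String, String> config, String city) {
        Map<String, String> request = new LinkedHashMap<>();
        request.put("method", "GET");
        request.put("url", config.get("baseUrl") + "/forecast?city=" + city);
        return request;
    }

    // Parses a response body of the assumed form "city=temperature".
    static Map<String, String> parseResponse(String body) {
        int eq = body.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Unparseable response: " + body);
        }
        Map<String, String> result = new LinkedHashMap<>();
        result.put("city", body.substring(0, eq));
        result.put("temperature", body.substring(eq + 1));
        return result;
    }

    // Convenience wrapper: the caller supplies the transport and a callback,
    // so the library imposes no HTTP client, thread pool, or async framework.
    static void forecast(Map<String, String> config, String city,
                         Function<Map<String, String>, String> transport,
                         Consumer<Map<String, String>> callback) {
        String body = transport.apply(prepareRequest(config, city));
        callback.accept(parseResponse(body));
    }
}
```

An application that already has an HTTP client can call prepareRequest and parseResponse directly and ignore forecast entirely; a test can pass a lambda that returns a canned response as the transport.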

On the other hand, some “libraries” really are more like “embeddable services,” with their own internal state. Large frameworks like Hystrix fall into this category, as do a few sophisticated “client” libraries. These libraries might be expected to manage their own resources and state “under the hood.” That’s a reasonable design choice, but at least be clear about which goal you’re pursuing and what trade-offs you’re making. In most language runtimes, the behavior and dependencies of these libraries cannot be fully isolated from the rest of the code. As an application developer, I might be willing to invest time and effort arranging my code to accommodate one or two embedded services that offer significant power in exchange for the added complexity. For everything else, when I need a library, just give me some ordinary functions.

Command-Line Intransigence

In the early days of Clojure, I was skeptical of Clojure-specific build tools like Lancet, Leiningen, and Cake. Why would Clojure, a part of the Java ecosystem, need its own build tool when there were already so many Java-based tools?

At the time, I thought Maven was the last word in build tooling. Early Leiningen felt like a thin wrapper around Maven for people with an inconsolable allergy to XML. Maven was the serious build tool, with a rich declarative model for describing dependency relationships among software artifacts. That model was imperfect, but it worked well enough to power one of the largest repositories of open-source software on the planet.

But things change. Leiningen has evolved rapidly. Maven has also evolved, but more slowly, and the promised non-XML POM syntax (“polyglot Maven”) has not materialized.

Meanwhile, I learned why everyone eventually hates Maven, through the experience of crafting custom Maven builds for two large-ish projects: the Clojure language and its contributed libraries. It was a challenge to satisfy the (often conflicting) requirements of developers, continuous integration, source repositories, and the public Maven repository network. Even with the help of Maven books from Sonatype, it took months of trial and error and nearly all my “open-source” time to get everything working.

At the end of this process I discovered, to my dismay, that I was the only one who understood it. As my colleague Stuart Halloway put it, “Maven breeds heroes.” For end-users and developers, there’s a nice interface: Clojure-contrib library authors can literally click a button to make a release. But behind that button are so many steps and moving parts (Git, Hudson, Maven, Nexus, GPG, and all the Maven plugins) that even I can barely remember how it all works. I never wanted to be the XML hero.

So I have come around to Leiningen, and have even incorporated it into my Clojure development workflow. It’s had some bumps, as one might expect from a fast-moving open-source project with lots of contributors, but most of the time it does what I need and doesn’t get in the way.

What puzzles me, however, is the stubbornness of developers who want to do everything via Leiningen. Some days it seems like every new tool or development utility for Clojure comes wrapped up in a Leiningen plugin so it can be invoked at the command line. I don’t get it. When you have a Clojure REPL, why would you limit yourself to the UNIX shell?

I think this habit comes partly from scripting languages, which were born at the command line, and still live there to a great extent. But it puzzled me a bit even in Ruby: if it takes 3 seconds for rake to load your 5000-line Rails app, do you really want to use rake for critical administrative tasks like database migrations? IRB is not a REPL in the Lisp sense, but it’s a pretty good interactive shell. I’d rather work with a large Ruby app in IRB than via rake.

Start-up time remains a major concern for Leiningen, and its contributors have gone to great lengths (sometimes too far) to ameliorate it. Why not just avoid the problem altogether? Start Leiningen once and then work at the REPL. Admittedly, this takes some discipline and careful application design, but on my own projects I’ve gotten to the point where I only need to launch Leiningen once a day. Occasionally I make a mistake and get my application into such a borked state that the only remedy is restarting the JVM, but those occasions are rare.

I pretty much use Leiningen for just three things: 1) getting dependencies, 2) building JARs, and 3) launching REPLs. Once I have a REPL I can do my real work: running my application, testing, and profiling. The feedback cycles are faster and the debugging options much richer than what I can get on the command line.

“Build plugins,” for Leiningen or Maven or any other tool, always suffer from running in a different environment from the code they are building. But isn’t one of the central tenets of Lisp that the compiler is part of your application? There isn’t really a sharp boundary between “build” code and “application” code. It’s all just code.

I used to write little “command-line interfaces” for running tests, builds, deployments, and so on. Now I’m more likely to just put those functions in a Clojure namespace and call them from the REPL. Sometimes I wonder: why not go further? Use Leiningen (or Maven, or Gradle, or whatever) just to download dependencies and bootstrap a REPL, then execute builds and releases from the REPL.

Dependency Management First-Aid Kit

This article attempts to unravel some of the mysteries of dependency management with Maven and Maven-based tools.

Help, something’s missing!

Say you have a project named “my-new-project” which declares a dependency on version 3 of the “awesome-sauce” library by the Example.com corporation. You add the dependency to your pom.xml, project.clj, or whatever configuration file your build tool uses. You take a deep breath and start a build. And it fails!

If you’re using Maven 2, you see something like this:

[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.

Missing:
----------
1) com.example:awesome-sauce:jar:3.0.0

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=com.example -DartifactId=awesome-sauce -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=com.example -DartifactId=awesome-sauce -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
        1) my.group:my-new-project:jar:1.0.0-SNAPSHOT
        2) com.example:awesome-sauce:jar:3.0.0

----------
1 required artifact is missing.

for artifact: 
  my.group:my-new-project:jar:1.0.0-SNAPSHOT

from the specified remote repositories:
  central (http://repo1.maven.org/maven2),
  clojars (http://clojars.org/repo/)

Leiningen, which uses Maven 2 under the covers, produces similar output, but it mistakenly prints the current project name as org.apache.maven:super-pom:jar:2.0.

Maven 3 prints a less verbose (and less informative) error message, but the gist is the same.

What happened?

What is all this verbosity saying? Well, obviously, the build failed because something was missing. What was missing? Maven tells you:

Missing:
----------
1) com.example:awesome-sauce:jar:3.0.0

The JAR file for the project “awesome-sauce”, version 3.0.0, published in the “com.example” group, is missing. That just means Maven didn’t find it in any of the places it looked.

Where did it look? Maven tells you that too:

from the specified remote repositories:
  central (http://repo1.maven.org/maven2),
  clojars (http://clojars.org/repo/)

These are the public repositories where Maven searched for the file. Each repository has an ID (“central” and “clojars” in this case) and a URL. Both are specified in the configuration of:

  1. Your project, in pom.xml or project.clj

  2. Your build tool’s global configuration file

    • settings.xml for Maven
    • N/A for Leiningen

  3. Your build tool’s built-in defaults

If you visit http://repo1.maven.org/maven2/com/example/awesome-sauce or http://clojars.org/repo/com/example/awesome-sauce in a browser you will see that those directories do not, in fact, exist.

Although it’s not listed, the first place Maven checks for a dependency is your local Maven repository. The local repository is just a big cache of everything Maven has downloaded in the past. It’s typically located at $HOME/.m2/repository.
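
To make the lookup concrete, here is a small Java sketch of the standard repository layout, which turns the coordinates in the error message into the file paths and URLs being searched. The class and method names are made up for illustration and are not part of Maven, and the lookup order is simplified to "local first, then each remote in the order given".

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Maven: shows the standard repository layout.
final class RepositoryLayout {
    private RepositoryLayout() {}

    // groupId dots become directory separators, followed by
    // artifactId/version/artifactId-version.packaging.
    static String artifactPath(String groupId, String artifactId,
                               String version, String packaging) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + "." + packaging;
    }

    // Candidate locations in lookup order: the local repository first,
    // then each remote repository in the order given.
    static List<String> candidates(String localRepo, List<String> remoteUrls,
                                   String groupId, String artifactId,
                                   String version, String packaging) {
        String path = artifactPath(groupId, artifactId, version, packaging);
        List<String> result = new ArrayList<>();
        result.add(stripTrailingSlash(localRepo) + "/" + path);
        for (String url : remoteUrls) {
            result.add(stripTrailingSlash(url) + "/" + path);
        }
        return result;
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }
}
```

For com.example:awesome-sauce:jar:3.0.0, artifactPath yields com/example/awesome-sauce/3.0.0/awesome-sauce-3.0.0.jar, which is the path appended to the local repository directory and to each remote URL.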

What to do next

You have two options at this point:

  1. Find a public repository containing “awesome-sauce”
  2. Install “awesome-sauce” in your local repository

The first option is generally less work, and more repeatable if you ever build your project on another machine.

Finding a repository

Odds are, if the library you are looking for is free, open-source, and popular, it will already be in a public Maven repository somewhere. Start with the source: who released the library? Large organizations with a lot of open-source projects often host their own repositories, like Google and Codehaus. Failing that, search engines such as Mvnbrowser may help you find it.

Once you’ve found a repository, you need to add it to your build. For example, to add the Codehaus repository to a Maven project, add these lines to pom.xml inside the <project> tag:

<repositories>
  <repository>
    <id>codehaus</id>
    <name>Codehaus</name>
    <url>http://repository.codehaus.org/</url>
  </repository>
</repositories>

(You can pick your own <id> and <name>.)

For Leiningen, add the following lines inside the (defproject ...) block:

  :repositories {"codehaus" "http://repository.codehaus.org/"}

Installing locally

If the library you want is not available in any public repository, you’re not stuck, you just have to do a bit more work. You need to get the JAR file for the library, either by downloading it manually or building from source. Then you need to install that JAR file in your local Maven repository. That’s easy, because Maven has already told you exactly how to do it:

  Then, install it using the command: 
      mvn install:install-file -DgroupId=com.example -DartifactId=awesome-sauce \
          -Dversion=3.0.0 -Dpackaging=jar -Dfile=/path/to/file

Copy that command verbatim, changing only /path/to/file to the path to the library’s JAR file. Maven will copy the file to $HOME/.m2/repository/com/example/awesome-sauce/3.0.0/awesome-sauce-3.0.0.jar. The next time you build your project, Maven knows exactly where to find it.

Installing remotely

If you want others to be able to build your project without having to go through these manual steps, you need your own public Maven repository to which you can upload files. Hosting a Maven repository isn’t hard: all you need is a web server.

If you work with a team, consider setting up a shared repository for everyone to use. A repository manager such as Nexus can help you take care of user accounts and authentication.

If you publish open-source libraries, I strongly encourage you to get an account on Sonatype OSS, a free service provided by the makers of Nexus. Releasing your projects to Sonatype OSS gives them a path to get added to the Maven Central Repository. While the requirements for projects in Maven Central are more stringent than just tossing code into your own repository, it’s worth the effort. In Maven Central, your project will have greater visibility and will be easier for anyone in the world to use.

But what if I don’t want that dependency?

Maven dependencies are transitive: if your project depends on project X, which depends on projects Y and Z, then your build will try to download X, Y, and Z.

But sometimes projects declare dependencies that aren’t strictly necessary. Or they declare dependencies on something you want, but the wrong version. How can you avoid including those extra dependencies in your build?

Maven supports dependency exclusions for these cases. For example, suppose the “awesome-sauce” library declares a dependency on “com.example:stupidity:0.0.1”. You know that you don’t need “stupidity” in your project, so you want to prevent the build from including it. In pom.xml, you write:

<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>awesome-sauce</artifactId>
    <version>3.0.0</version>
    <exclusions>
      <exclusion>
        <groupId>com.example</groupId>
        <artifactId>stupidity</artifactId>
      </exclusion>
    </exclusions> 
  </dependency>
</dependencies>

Or in Leiningen’s project.clj, you write:

  :dependencies [[com.example/awesome-sauce "3.0.0"
                  :exclusions [com.example/stupidity]]]

Note that once you start using exclusions, you’re on your own. It’s up to you to make sure you still have the correct versions of all the libraries your project needs.
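
To see why an exclusion can remove more than you expect, here is a toy model of transitive resolution in Java. It is only a sketch: it ignores versions, scopes, and conflict mediation, and it treats exclusions as one global set rather than attaching them to a particular dependency as Maven does. The ToyResolver class is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of transitive resolution; ignores versions, scopes, and conflict mediation.
final class ToyResolver {
    private ToyResolver() {}

    // graph maps each artifact to its direct dependencies.
    // exclusions lists artifacts that must not be pulled in at all.
    static Set<String> resolve(Map<String, List<String>> graph,
                               List<String> directDeps,
                               Set<String> exclusions) {
        Set<String> resolved = new LinkedHashSet<>();
        for (String dep : directDeps) {
            visit(graph, dep, exclusions, resolved);
        }
        return resolved;
    }

    private static void visit(Map<String, List<String>> graph, String artifact,
                              Set<String> exclusions, Set<String> resolved) {
        // Skip excluded artifacts and anything already visited (this also guards against cycles).
        if (exclusions.contains(artifact) || !resolved.add(artifact)) {
            return;
        }
        for (String child : graph.getOrDefault(artifact, List.of())) {
            visit(graph, child, exclusions, resolved);
        }
    }
}
```

In this model, excluding an artifact also drops everything that was reachable only through it. If some other library needed one of those dropped artifacts at runtime, nothing in the build would tell you.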

On rare occasions, a project’s dependencies cannot be resolved at all. In particular, if you need two different versions of the same library with the same class names but incompatible APIs, you’re pretty much stuck. Time to refactor, or investigate multiple-Classloader schemes like OSGi. But that’s a whole ’nother story.

Single Abstract Method Macro

John Rose’s article, Scheme in One Class, introduced me to the notion of Single Abstract Method, or SAM, classes. One of the proposed APIs for JSR-292 allows a MethodHandle (the Java version of a closure) to be cast to any SAM class.

In Java, a SAM can be either an interface or a class, but if it’s a class then it’s usually abstract. The interface or class declares exactly one abstract method. Callbacks are often specified as SAM interfaces. The standard Java library has lots of SAM interfaces, such as Runnable and ActionListener.
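
For comparison, here is roughly what implementing each kind of SAM looks like from Java itself. Since Java 8, a lambda can stand in for a SAM interface such as Runnable, while a SAM abstract class (java.util.TimerTask is used here as an example) still needs an anonymous subclass. The SamExamples class and its method names are made up for this sketch.

```java
import java.util.TimerTask;

final class SamExamples {
    private SamExamples() {}

    // SAM interface: a lambda supplies the body of the single method, run().
    static Runnable incrementer(int[] counter) {
        return () -> counter[0]++;
    }

    // SAM abstract class: lambdas do not apply, so use an anonymous subclass.
    static TimerTask incrementerTask(int[] counter) {
        return new TimerTask() {
            @Override
            public void run() {
                counter[0]++;
            }
        };
    }

    public static void main(String[] args) {
        int[] counter = {0};
        incrementer(counter).run();
        incrementerTask(counter).run();
        System.out.println(counter[0]);
    }
}
```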

In Clojure, it’s easy to interoperate with these Java interfaces with reify (or proxy for abstract classes) but it’s tiresome to type out. Since reflection can give us the name of the method, why not let the compiler do the work for us?

(defmacro single-method
  "Returns a proxied or reified instance of c, which must be a class
  or interface with exactly one method. Forwards method calls to the
  function f, which must accept the same number of arguments as the
  method, not including the 'this' argument.

  If c is a class, not an interface, then it must have a public,
  no-argument constructor."
  [c f]
  {:pre [(symbol? c)]}
  (let [klass (resolve c)]
    (assert (instance? java.lang.Class klass))
    (let [methods (.getDeclaredMethods klass)]
      (assert (= 1 (count methods)))
      (let [method (first methods)
            method-name (symbol (.getName method))
            arg-count (count (.getParameterTypes method))
            args (repeatedly arg-count gensym)]
        (if (.isInterface klass)
          `(let [f# ~f]
             (reify ~c (~method-name ~(vec (cons (gensym "this") args))
                                     (f# ~@args))))
          `(let [f# ~f]
             (proxy [~c] [] (~method-name ~(vec args) (f# ~@args)))))))))
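
The reflection step can also be sketched in plain Java. Note that this variant is a little more lenient than the macro: instead of requiring that the type declare exactly one method in total, it looks for exactly one public abstract method, inherited ones included. The SamFinder class is hypothetical, and non-public abstract methods are ignored for simplicity.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class SamFinder {
    private SamFinder() {}

    // Returns the single public abstract method of type,
    // or throws if there is not exactly one.
    static Method singleAbstractMethod(Class<?> type) {
        List<Method> abstractMethods = new ArrayList<>();
        for (Method m : type.getMethods()) {
            if (Modifier.isAbstract(m.getModifiers())) {
                abstractMethods.add(m);
            }
        }
        if (abstractMethods.size() != 1) {
            throw new IllegalArgumentException(type.getName() + " has "
                    + abstractMethods.size() + " public abstract methods, expected 1");
        }
        return abstractMethods.get(0);
    }
}
```

Calling SamFinder.singleAbstractMethod(Runnable.class) returns the Method for run, whose name and parameter count are the same two pieces of information the macro extracts.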

Generating Clojure from an Ontology

I’ve been fascinated with RDF for years, but I always end up frustrated when I try to use it. How do you read/write/manipulate RDF data in code? Sure, there are lots of libraries, but they all represent RDF data as its primitive structures: statements, resources, literals, etc. Working with data through these APIs feels like using a glovebox. To get anything useful done, you have to define mappings between RDF properties/classes and normal data structures in your programming language — classes, maps, lists, whatever. In effect, you have to define everything twice.

Some Java APIs allow one to add annotation properties to classes and methods, with the annotations defining the mapping between Java objects and RDF triples. It’s convenient, and familiar if you’ve used Java persistence frameworks like Hibernate, but you still have to define everything twice — once in your RDF schema, once in Java code.

Other libraries generate Java source code from RDFS or OWL ontologies. This means you don’t have to define everything twice, but adds another step to the write-compile-run cycle, and limits you to the semantics that the code generator can understand. In particular, certain features of RDFS/OWL — multiple inheritance, sub-properties — do not map well into Java.

What I really wanted was a way to create and work with RDF data in Clojure, using the same map/set/sequence APIs that I use for any other Clojure data structure. I flirted with implementing RDF in Clojure but lost interest when I realized that 1) there’s a lot more to implementing RDF than datatype conversions; and 2) my Clojure library suffered from the same glovebox problem as the Java RDF libraries.

The solution, however, was staring me in the face all along. Clojure is a Lisp. I can generate functions directly, without any intermediate “source” representation. I can use my own customized validation and type-checking functions. Furthermore, I can extend the definitions in my RDF schema with new Clojure functions.

Here’s what I ended up with: I designed a simple OWL ontology using Protege 4 and saved it as RDF/XML. Then I used the Sesame 2 library to find all the RDF classes and properties defined in my ontology, and create the appropriate getter, setter, and constructor functions in Clojure. It looks something like this:

(defn intern-classes [] 
  (doseq [cls (find-all-classes *ontology*)]
    (let [name (resource-to-symbol cls)]
      (intern *ns* name (fn [] {:type name})))))

The resource-to-symbol function creates a symbol named for the local name of the RDF class, with the full URI of its XML namespace in the symbol’s metadata. The call to intern defines a new function that takes no arguments and returns a Clojure map with the symbol as its :type.

Suppose I have a class named Document in my ontology. I now have a Clojure function named Document that creates a new instance of that class, represented as a Clojure map. Furthermore, using Clojure hierarchies and the isa? function, I can generate Clojure code that implements the subclass relationships defined in the ontology. Whee!

I don’t entirely know where I’m headed with this, but I like the way it’s going. I can define my own data types, decide how they map to Clojure data structures, and have code that’s always up-to-date with my RDF vocabulary.

It’s About the Platform

I’ve said It’s About the Libraries, and indeed, one of the major selling points of Clojure is that it can call Java libraries directly.

But there’s more to it than that.  Libraries are just one benefit to building Clojure on top of Java, or, more accurately, on top of Java the platform.

Look around you, and you’ll see that 99% of all the software in the world runs on just three platforms:

  1. Unix/C
  2. Java Virtual Machine
  3. .NET Common Language Runtime

Where did these platforms come from?  Let’s see:

  1. AT&T
  2. Sun
  3. Microsoft

Notice something?  All three were developed by huge corporations.

Building a new platform isn’t just about writing the code.  In fact, very little of it is about code.  You need books, articles, conferences, workshops, and university courses.  You need multinational corporations to trust their entire business to your platform.  It takes millions of dollars and tens of thousands of hours of labor to create a new platform.  Think of the massive ad campaigns Sun ran for Java.  Can you do that?  Of course not.

So when you’re designing a new language, you have to build on an existing platform.  Most of the so-called “scripting” languages grew up on Unix, so they’re written in C.  Now, Unix/C is a great platform, still going strong after 40 years.  It provides powerful tools and standardized interfaces such as files, sockets, and pipes.

The problem is that each of the “scripting” languages has developed into its own mini-platform.  Perl, Python, and Ruby each define their own set of data structures for fundamental types like strings, lists, and maps.  The only “types” that Unix recognizes are text and binary.  You can’t exchange data between two languages without serializing everything to some agreed-upon format.  And you can’t do callbacks between languages below the level of a whole process.

The other problem with languages written in C is, well, C.  Pointers are hard.  Memory management is hard.  I know from bitter experience that Ruby libraries can have segfaults or memory leaks.  That just doesn’t happen in Java.

Clojure was created to leverage capabilities of Java-the-platform — garbage collection, dynamic code generation, JIT compilation, threads, locks — some of which are difficult to use effectively in Java-the-language.  To implement Clojure in C, for example, you would first have to build your own platform with these features.  That’s effectively what most Common Lisp implementations do, and they suffer because the Common Lisp world is too small to sustain its own platform.

The brilliant thing about Java-the-platform is that it allows many languages to coexist.  I can mix code written in Java, Clojure, JRuby, Jython, etc. and it’s pretty easy, because they all implement the same fundamental interfaces like java.util.List and java.lang.Runnable.  For example, right now I have Hadoop (Java code) calling Clojure code calling JRuby code.  It all just works.

(The .NET CLR provides similar capabilities, and there is a Clojure CLR port.)

Stop Your Java SAX Parser from Downloading DTDs

Back in February, in a slightly plaintive post, the W3C sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.

After that post, I spent two whole days fighting with XML catalogs — possibly the worst-documented XML spec ever — to make sure my Java code wasn’t downloading a DTD every time it read an XHTML document.

To my annoyance, no one seems to have posted any cut-and-paste solutions to this problem. Setting properties on the SAX parser is no help, and the XML catalogs solution is a pain to set up.

So what if someone wrote a “dummy” XML entity resolver that does nothing? Here’s what I came up with:

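import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DummyEntityResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {
        // Return an empty document rather than null, so the parser
        // never falls back to fetching the DTD itself.
        return new InputSource(new StringReader(""));
    }
}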

Lo and behold, it works! The key is the return line — if you return null, the SAX parser reverts to its default behavior and downloads the DTD.

Use it like this:

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(new YourContentHandler());
reader.parse(your_xml_source);

The catch is that this will break any externally-defined entities, including standard XHTML entities like &copy;. The built-in XML entities such as &amp;, and numeric character references like &#x43;, will still work.
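If you want to see all of this in one runnable file, here is a sketch under a few assumptions of mine: it gets the XMLReader from SAXParserFactory rather than XMLReaderFactory (which newer JDKs mark as deprecated), it writes the resolver as a lambda, and the names NoDtdDemo and parseText are made up for the example. It parses a tiny document with an XHTML DOCTYPE, counts how often the parser asks the resolver for an external entity, and collects the text content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class NoDtdDemo {
    // Parses xml with an empty-DTD resolver and returns the text content.
    // resolverCalls[0] is incremented each time the parser asks for an
    // external entity (such as the XHTML DTD).
    static String parseText(String xml, int[] resolverCalls) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Same idea as DummyEntityResolver: hand back an empty document.
        reader.setEntityResolver((publicId, systemId) -> {
            resolverCalls[0]++;
            return new InputSource(new StringReader(""));
        });
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html><body>&amp; &#x43;</body></html>";
        int[] calls = {0};
        System.out.println(parseText(xhtml, calls));
        System.out.println("resolver calls: " + calls[0]);
    }
}
```

The built-in entity and the numeric reference in the body should come through as ordinary text. Swap in &copy; and the parse should fail instead, which is the catch described above.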

You can check that you’re not downloading any DTDs by watching the output of ngrep -q DTD while running your XML parser. If it doesn’t print anything, you’re good.

Calling Java Constructors with this()

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

public ArrayWritable(Class valueClass) {
    // ...
}

public ArrayWritable(Class valueClass, Writable[] values) {
  this(valueClass);
  this.values = values;
}

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code of its own. I had no idea you could do that.
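As far as I can tell, the most common use for this is faking default arguments: one "full" constructor does the real work, and the shorter ones delegate to it. Here is a self-contained sketch. Temperature is a hypothetical class of my own, not anything from Hadoop. One caveat I'm fairly sure of: in the Java versions I've used, the this(...) call has to be the first statement in the constructor.

```java
// Hypothetical class, not the Hadoop source: it only illustrates the syntax.
class Temperature {
    private final double degrees;
    private final String unit;

    // Full constructor: the one place where fields are assigned.
    Temperature(double degrees, String unit) {
        this.degrees = degrees;
        this.unit = unit;
    }

    // Convenience constructor: delegates to the one above with a default unit.
    Temperature(double degrees) {
        this(degrees, "C");
    }

    String describe() {
        return degrees + " " + unit;
    }

    public static void main(String[] args) {
        System.out.println(new Temperature(21.5).describe()); // prints "21.5 C"
    }
}
```

Because both fields are final, the compiler also checks that every constructor path ends up assigning them exactly once, which the delegation takes care of.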

Not So Slow

Perhaps I was premature worrying about how slow Ruby is. John Wiseman was benchmarking Montezuma, his Common Lisp port of Ferret/Lucene, and found out in the process that Ferret is 10 times faster than Java Lucene! As he says, Ferret gets help from about 65,000 lines of C code.

I’ve heard this before, perhaps not often enough to make a generalization, but at least enough to identify a trend: if you want performance from Ruby code, rewrite it in C. (The same is sometimes said of Python, or really any interpreted language.) The basic approach seems to be to extract the most performance-critical parts of your dynamic, interpreted language program and rewrite them in a static, compiled language, thus retaining most of the benefits of both.

It’s an interesting contrast to what I see as the Common Lisp approach to optimization, which is to keep everything in Lisp but add compiler declarations in hopes of speeding it up. Trouble is, unless you’re an expert on the inner workings of your compiler (or can read the disassembled code) it’s hard to know exactly what effects a particular declaration will have.

Eventually, I think manual optimization will become unnecessary. Experimental compilers like Stalin have been shown to produce faster machine code than hand-coded C. Stalin compiles a subset of Scheme down to a subset of C, making heavy use of type inference and static analysis. If it can be done with Scheme, surely it can be done with Python, Ruby, or any other dynamic language.

Who Needs Data Structures?

Ran across an interesting remark in a discussion of Microsoft hiring interviews:

If I remember, a lot of MIT people back in the 70s broke the computer world into the Lisp and non-Lisp data typers. The Lisp folk took a casual attitude towards data structures – just shove them in a list, put them on a plist, stash them in a cache. If it gets slow or confusing, add some tags and a hash algorithm. Most non-Lisp folk were appalled at this. They wanted to see the data structure design up front, the data relationship dictionary, complete and comprehensive, even before any coding started.

This sounds like it could be a language-induced habit as much as a programming style. E.g., in Java you have to write “class foo” at the start of a program before you can do anything else, so it makes sense that you’d start by defining data structures. In Lisp, it’s really easy to jump straight to the algorithms, making up “casual” data structures as you go along, so that’s what you do.

The “stuff everything into a list, add some tags later when it gets too big or confusing” approach is exactly how my largest-to-date Lisp programming project panned out. It worked fine, and made the code pretty short and simple. As the big list got too big to handle in-place, I wrote a handful of functions that pulled out specific pieces of data. Since function calls look identical to method calls in Lisp, the code looked like I had defined a class with a bunch of accessor methods, while in fact it was still just a big list. This was especially helpful given that I was working with free-form data parsed from text files, not a table of predetermined fields. (This could also have been handled somewhat less elegantly with XML.)
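For readers who don't speak Lisp, here is a rough Java rendering of the same habit. Everything in it is hypothetical (the class name, the tags, the sample data), and it is not the code from my project. The data stays a flat list of alternating tags and values, like a Lisp property list, and the "accessors" are just functions bolted on afterwards.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example data and names; a Java rendering of the Lisp habit.
class CasualRecords {
    // Look up a tag in a flat list of alternating tags and values,
    // the way a Lisp property list works. Returns null if absent.
    static Object getf(List<Object> plist, String tag) {
        for (int i = 0; i + 1 < plist.size(); i += 2) {
            if (tag.equals(plist.get(i))) {
                return plist.get(i + 1);
            }
        }
        return null;
    }

    // Accessor functions added later, once the list got confusing.
    static String title(List<Object> entry) {
        return (String) getf(entry, "title");
    }

    // Assumes the "year" tag is present; a missing tag would throw here.
    static int year(List<Object> entry) {
        return (Integer) getf(entry, "year");
    }

    public static void main(String[] args) {
        List<Object> entry =
            Arrays.<Object>asList("title", "On Lisp", "year", 1993);
        System.out.println(title(entry) + " (" + year(entry) + ")");
    }
}
```

The casts are the price Java charges for this style. In Lisp there is nothing to cast, which is part of why the casual approach feels so natural there.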

Update Oct. 3, 2006: In case anyone was misled, the title was a joke. Data structures are certainly useful. I was merely commenting on the programming styles induced by different languages.