Open-source Bundling

Cast your mind back to the halcyon days of the late ’90s. Windows 95/98. Internet Explorer 4. Before you laugh, consider that IE4 included some pretty cutting-edge technology for the time: Dynamic HTML, TLS 1.0, single sign-on, streaming media, and “Channels” before RSS. IE4 even pioneered — unsuccessfully — the idea of “web browser as operating system” a decade before Google Apps.

But if you remember anything about IE in the ’90s, it’s probably the word bundling. United States v. Microsoft centered on the tight integration of IE with Windows. If you had Windows, you had to have IE. By the time the lawsuit reached a settlement, IE was entrenched as the dominant browser.

Fast forward to the present. What an enlightened age we live in. Open-source has won and the browser market has fragmented. Firefox broke the IE hegemony, and Chrome killed it. The web browser really is an operating system.

But if you look around at software today, “bundling” is still with us, even in open-source software, that champion of choice and touchstone of tinkering.

To take an example (and to get the taste of IE out of your brain), let’s look at Hystrix, a Java fault-tolerance framework written at Netflix. First let me say that Hystrix is a fantastic piece of engineering. Netflix has given a great gift to the open-source community by releasing, for free, an essential part of their software infrastructure. I’ve learned a lot by studying the Hystrix documentation and source code.

But if you want to use Hystrix in your application, you have to use RxJava and Netflix’s Archaius configuration management framework. Via transitive dependencies, you also have to use Google’s Guava, the Jackson JSON processor, SLF4J, and Apache’s Commons Configuration, Commons Lang, and Commons Logging. For those of you keeping score at home, that’s two different logging APIs, two configuration APIs, and two grab-bag “utility” libraries.

There’s nothing wrong with these library choices. They may be suitable for your application or they may not. But either way, you don’t get a choice. If you want Hystrix, you have to have RxJava and all the rest. Even if you choose to ignore, say, Archaius, it’s still there, linked into your application code, with whatever bugs and security holes it might carry.

I don’t mean to pick on Netflix here either. As I said, Hystrix is a fantastic piece of engineering, and I’m very happy that Netflix released it. But it points to a mismatch between the goals of “internal-use” software and “open-source” software.

If you’re developing a tool or library for internal use within an organization, it makes sense to integrate closely with other software internal to that organization. It saves time, reduces development effort, and makes the software organization more efficient. When software is tightly integrated, each new tool or library multiplies the value of all the other software which came before it. That’s how technology companies like Netflix or Google can deliver consistently high-quality products and rapid innovation at scale.

The downside to this approach, from the open-source point of view, is that each new tool or library released by a software organization tends to be tightly coupled to the software which preceded it. More dependencies mean more opportunities for bugs, security holes, and misconfiguration. For the application developer using open-source libraries, each new dependency multiplies the cost of development and maintenance.

It’s not just corporate-sponsored open source that suffers from this problem — just look at the dependency tree of any Apache project.

The root problem is that great, hairy Minotaur which stalks the labyrinthine passages of any large code base: cross-cutting concerns. Almost any piece of code in an application will need, at some point, to deal with at least some of:

  • Logging
  • Configuration
  • Error handling & recovery
  • Process/thread management
  • Resource management
  • Startup/shutdown
  • Network communication
  • Filesystems
  • Data persistence
  • (De)serialization
  • Caching
  • Internationalization/translation
  • Build/provisioning/deployment

It’s much easier to write code if you know how each of these cross-cutting concerns will be handled. So when you’re developing something in-house, obviously you use the tools and libraries your organization has standardized on. Even if you’re writing something which you plan to make open-source, it’s easier to rely on the tools and patterns you already know.

It’s difficult to avoid coupling library code to one or more of these concerns. Take logging, for example. Java has had a built-in logging framework since 1.4. But many developers preferred Log4j or one of a handful of others. To avoid coupling libraries to a single logging framework, there is Apache Commons Logging, which tries to abstract over different logging frameworks with clever class-loading tricks. That turned out to be a brittle solution, so we got SLF4J, which puts responsibility for linking the correct logging APIs back in the hands of the application developer. But no one wants to take an entire day to slog through the SLF4J manual in the middle of building an application. Throw in the mysterious interactions of transitive dependencies in Maven-style build tools, and it’s no wonder every Java app starts up with an error message about logging. And logging is the easy case — most programmers could probably agree on what, broadly speaking, a logging framework needs to do. But still we have half a dozen widely-used, slightly-different logging APIs.
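
The intended division of labor is simple to express in dependency coordinates. Here is a rough sketch in Leiningen’s project.clj form (the library and application names are hypothetical, and the version numbers are illustrative): the library depends only on the SLF4J API artifact, and the application alone chooses the backend.

    ;; project.clj for a hypothetical library: depend only on the SLF4J
    ;; API, never on a concrete logging backend.
    (defproject my.group/my-lib "0.1.0"
      :dependencies [[org.slf4j/slf4j-api "1.7.5"]])

    ;; project.clj for the application: the application, not the library,
    ;; picks the backend that SLF4J binds to at runtime.
    (defproject my.group/my-app "0.1.0"
      :dependencies [[my.group/my-lib "0.1.0"]
                     [ch.qos.logback/logback-classic "1.0.13"]])

When it works, this is exactly the “responsibility back in the hands of the application developer” arrangement described above; when it doesn’t, you get the startup warnings every Java developer has seen.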

Developing a library which avoids making decisions about cross-cutting concerns is possible, but it takes painstaking attention to detail, with lots of extra extension points. (See Chris Houser’s talk on Exception Handling for an example.) Unfortunately, the resulting library is often less-than-satisfying to potential users because it has so many “holes” that need to be filled in. Who wants to spend half a day writing “glue” code and callbacks before you can even try out a new library? Busy application developers have an incentive to choose libraries that work “out of the box,” so library creators have an incentive to make arbitrary decisions about cross-cutting concerns. We justify this with the oxymoron “sensible defaults.”
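
For instance, here is a minimal sketch of what such an extension point might look like in Clojure (the function and option names are hypothetical): the library accepts a logging callback and defaults to doing nothing, so it never has to choose a logging framework at all.

    ;; Hypothetical library function with an explicit extension point for
    ;; logging. The caller supplies :log-fn; the default is a no-op, so the
    ;; library itself depends on no logging framework.
    (defn fetch-widget
      [id {:keys [log-fn] :or {log-fn (fn [_ _ _] nil)}}]
      (log-fn :info "fetching widget" {:id id})
      ;; ... the real work would happen here ...
      {:id id :status :ok})

    ;; The application decides what log events mean:
    (fetch-widget 42 {:log-fn (fn [level msg data]
                                (println level msg data))})

This is the kind of “hole” I mean: one more thing the user must wire up before the library does anything.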

The conclusion I draw from all this is that modern programming languages have succeeded at making software out of reusable parts, but have largely failed at making software out of interchangeable parts. You cannot just “swap in,” say, a different thread-management library. Hystrix itself exists to solve a problem with libraries and cross-cutting concerns in a services architecture. Quoting from the Hystrix docs:

Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point. If the host application is not isolated from these external failures, it risks being taken down with them.

These issues are exacerbated when network access is performed through a third-party client — a “black box” where implementation details are hidden and can change at any time, and network or resource configurations are different for each client library and often difficult to monitor and change.

Even worse are transitive dependencies that perform potentially expensive or fault-prone network calls without being explicitly invoked by the application.

Netflix has so many “API client” libraries, each making its own network calls with unpredictable behavior, that to make their systems robust they have to isolate each library in its own thread pool. Again, this is amazing engineering, but it was necessary precisely because too many libraries came bundled with their own networking, error handling, and resource management decisions.

A robust solution would seem to require everyone to agree on standards for every possible cross-cutting concern. That will obviously never happen. Even a so-called batteries-included language cannot keep the same batteries forever. This is a hard problem, and like all truly hard problems in software, it’s more about people than code.

I wish I had a perfect solution, but the best I can offer is some guidance. If you’re writing an open-source library, do everything in your power to avoid dependencies. Use only the features of the core language, and use those conservatively. Don’t pull in a library that deals with some cross-cutting concern just because it might be more convenient for your users. Build your API around plain functions and standard data structures.

Some examples, specific to Clojure:

  • Don’t depend on a logging framework unless it’s SLF4J.

  • Don’t use an error-handling framework: throw ex-info with enough data for a handler to decide what to do.

  • If you need to do something asynchronous, use callbacks instead of core.async. Callbacks are easily integrated with core.async if that’s what the user wants to do. Likewise, if you need some kind of inversion of control, use function callbacks or protocols.

  • Don’t depend on any state-management framework or “ambient” state. Pass everything needed by an API function in its arguments. Provide operations for resource initialization and termination as part of your API. Same for configuration: pass a Clojure map as an argument.

  • Network communication and serialization: these are, admittedly, almost impossible to avoid if you’re writing a library for some network API. But you can at least give users the option of controlling their own networking by providing APIs to prepare requests and parse responses independently of making the actual network calls. The sketch after this list illustrates several of these points.
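
Putting several of these guidelines together, here is a minimal sketch of a hypothetical client library; every name in it is illustrative, not a real API. Configuration is a plain map, errors are ex-info, asynchronous work takes a callback, and request preparation and response parsing are separated from the actual I/O.

    ;; A hypothetical client library; all names are illustrative.
    (ns example.client)

    ;; Configuration is a plain map passed in by the caller; no ambient
    ;; state. Preparing a request is pure data transformation, independent
    ;; of any networking library.
    (defn prepare-request
      [{:keys [base-url timeout-ms] :or {timeout-ms 5000}} widget-id]
      {:method     :get
       :url        (str base-url "/widgets/" widget-id)
       :timeout-ms timeout-ms})

    ;; Parsing is likewise pure. Errors are thrown as ex-info with enough
    ;; data attached for the application's handler to decide what to do.
    (defn parse-response
      [{:keys [status body]}]
      (if (= 200 status)
        body
        (throw (ex-info "Unexpected response"
                        {:type ::bad-response :status status :body body}))))

    ;; Placeholder transport so the sketch is self-contained; a real
    ;; library would let the user substitute their own networking here.
    (defn send!
      [request]
      {:status 200 :body {:fetched (:url request)}})

    ;; Asynchronous work takes a plain callback. A user who prefers
    ;; core.async can adapt it with one line, e.g.
    ;;   (fetch-widget-async config 42 #(clojure.core.async/put! ch %))
    (defn fetch-widget-async
      [config widget-id callback]
      (future
        (callback
         (try
           (parse-response (send! (prepare-request config widget-id)))
           (catch Exception e e)))))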

On the other hand, some “libraries” really are more like “embeddable services,” with their own internal state. Large frameworks like Hystrix fall into this category, as do a few sophisticated “client” libraries. These libraries might be expected to manage their own resources and state “under the hood.” That’s a reasonable design choice, but at least be clear about which goal you’re pursuing and what trade-offs you’re making. In most language runtimes, the behavior and dependencies of these libraries cannot be fully isolated from the rest of the code. As an application developer, I might be willing to invest time and effort arranging my code to accommodate one or two embedded services that offer significant power in exchange for the added complexity. For everything else, when I need a library, just give me some ordinary functions.

The Amateur Problem

We have a problem. We are professional software developers who work with open-source software. The problem is that we are in the minority. Most open-source software is written by amateurs.

Every time a hot new technology comes on the scene, developers flock to it like ants to a picnic. Those early adopters are, by definition, people for whom choosing a new technology is less risky. Which means, mostly, that their work doesn’t really matter. Students, hobbyists, “personal” projects: nobody’s life or career is on the line. It doesn’t matter if the program is entirely correct, efficient, or scalable. It doesn’t matter if it ignores lots of edge cases.

I’ve been one of those amateurs. It’s fun. New technologies need amateurs. But as a technology matures, it attracts professionals with real jobs who do care about those details. And those professionals are immediately confronted with a world of open-source software written by amateurs.

I used to write code for myself. Since I started getting paid to write code for other people, I’ve become wary of code written by people writing for themselves. Every time I see a README that begins “X is a dead simple way to do Y,” I shudder. Nothing in software is simple. “Dead simple” tells me the author has “simplified” by “deadening” vast swaths of the problem space, either by making unfounded assumptions or by ignoring them completely.

We like to carp about “bloated” APIs in “mainstream” languages like Java. Truly, lots of APIs are more complicated than they need to be. But just because an API is big doesn’t mean it’s bloated. I like big APIs: they show me that someone has thought about, and probably encountered, all of the edges and corners in the problem space.

Simplifying assumptions do not belong in libraries; they belong in applications, where you know the boundaries of the problem space. On rare occasions, the ground of one problem is trod often enough to warrant a framework. Emphasis on rare. A framework is almost always unnecessary, and, in these days of rapidly-changing technological capabilities, likely to be obsolete before it’s finished.

Frameworks written by amateurs are the worst of the worst: brittle constructs that assume everything in service of one or two “dead simple” demos but collapse under the weight of a real-world application.

I don’t want to be a code snob. Let’s be amateurs. Let’s have fun. Explore. Learn. Publish code as we go. Show a solution to a problem without assuming it’s the solution. Be cognizant of and vocal about what assumptions we’re making. Don’t call something a library unless it really attempts to reach every nook and cranny of the problem space.

And don’t write frameworks. Ever. ;)

Update August 8, 2013: Based on the comments, I feel like too many people have gotten hung up on the words amateur and professional. Those were just convenient labels which I found amusing. The important difference is between “easy” single-purpose code and thorough, general-purpose code.

ODF vs. OOXML in New York State

New York State’s Office for Technology released a Request for Public Comment on selecting an XML-based office data format. The choices are OASIS’ ODF and Microsoft’s OOXML. Responses were due by 5 p.m. today, Dec. 28. My response is below, submitted just in time to meet the deadline. I didn’t have time to answer all the questions, so I focused on the ones I felt I could address in the greatest detail.

RESPONSE TO RFPC # 122807:

Background Information

For the past year I have been the lead programmer on AltLaw.org, a project to promote public access to federal court opinions by creating a free, full-text database of those opinions with an easy-to-use search interface. In the process of developing this web site, we have encountered many obstacles because of the way the federal courts store and publish their records. The problems we have encountered using electronic government records illustrate issues that New York State should consider in developing its electronic records policy. I will provide more details in my answers to the questions below.

Question 2

To encourage public access to State electronic records, it is important that the electronic record be the official State record rather than a draft or proxy for the “official” paper document. The greatest weakness of AltLaw.org as a legal reference is that the opinions we download from the federal courts’ web sites are not the final, official versions; those are published by and only available from the West corporation, at considerable cost.

Since the federal courts rely on West to copy-edit and correct their opinions, they are careless with respect to dates, names, and other important data. For example, we have downloaded several opinions that were decided in the year “2992”!

Furthermore, opinions published on federal court web sites lack any citation information — which is also assigned by West based on the pagination of their print volumes — making them useless for legal scholarship or court preparation.

As most professions (including the law) come increasingly to rely exclusively on electronic sources of information, it is critical that those sources become 100% reliable. To this end, electronic State records must be 1) complete, 2) accurate, 3) easily cited, and 4) acceptable for use in all official State business.

Question 3

To encourage interoperability and data sharing with citizens, business partners, and other jurisdictions, it is important that State electronic records be machine-readable. This is a more demanding requirement than simply having records in electronic form. A PDF document, for example, is electronic, but it is difficult or impossible to extract discrete data from it, because the PDF format is optimized for preserving the visual appearance of a document rather than its structure.

There are two issues to consider when creating machine-readable documents. The first is “metadata.” Metadata is information about a document that may or may not be contained within the text of the document itself. For example, the metadata for a court opinion would include the name of the court, the date the opinion was released, the name of the judge writing the opinion, and the names of the parties in the case, among other data. Since most federal courts publish their opinions on their web sites in PDF format, with no metadata, AltLaw.org must rely on custom software to extract essential metadata such as titles and dates. The process is slow, difficult, and inaccurate.

It is worth noting that ODF supports metadata using the Resource Description Framework (RDF), an international standard which already forms the foundation of powerful data-analysis software. OOXML does not provide comparable metadata support.

See: http://blogs.sun.com/GullFOSS/entry/new_extensible_metadata_support_with

The second requirement for creating machine-readable documents is information about document structure. Document structure includes elements such as sections, headings, paragraphs, lists, and tables. These structures must be identifiable in the document independent of the visual formatting used to display those structures. For example, a human reader can recognize bold-face type as a section heading, but a computer program cannot. Structural information is important for automated document analysis, information retrieval (search), accessibility to physically-impaired users, and conversion to alternate formats (such as HTML). ODF provides more structural information than does OOXML.

In addition to including metadata and structural information in electronic documents, the State should implement rigorous standards to ensure that information is produced consistently. Metadata is only useful when it is stored in a known, consistent format. For example, simple information such as a date can be recorded in a dozen different ways. A date written as “02/03/04” could be February 3, 2004 or it could be March 4, 2002. Work on AltLaw.org has shown us that there are as many ways of writing dates as there are courts to write them, and dates are by far the simplest piece of metadata to store. For New York State, if different State agencies (or, worse, individual State employees) were to record metadata in different ways, it would be almost as useless as having no metadata at all. After selecting a format for electronic records, the State must then establish formal procedures for using the metadata capabilities of that format. These procedures should be made freely available to the public for the purposes of encouraging interoperability and data sharing.

Question 4

To encourage appropriate government control of its electronic records, the State should rely wherever possible on public-key encryption technology. I am not an expert on this subject, but I urge the State to consult with computer security and encryption experts when choosing the protocols to implement. When properly used, public-key encryption can provide communications that are secure and verifiably authentic to a degree exceeding that of physical documents.

Question 5

To encourage choice and vendor neutrality when creating, maintaining, exchanging, and preserving electronic records, the State should consider the vested interests of parties promoting a particular data format. A format developed by a single corporate entity, such as Microsoft in the case of OOXML, gives that entity strong incentives to design the format to make interoperability difficult to achieve in practice, either through incomplete specification or proprietary extensions. OOXML has been criticized by others as being difficult to implement because of both of these factors. In contrast, ODF was developed by the independent, international OASIS group, with input from a variety of sources. To ensure the success of ODF, its creators have a vested interest in making interoperability as easy as possible. Thus, in the future, ODF is likely to be supported by a wider range of vendors and products than OOXML, and its adoption will promote greater competition in the marketplace.

Question 7

Regarding public access to long-term archives, the State should ensure that its archived records are available in bulk. “In bulk” means that a computer program should be able to obtain large quantities of archived records in an automated fashion, without human intervention. While developing AltLaw.org, we were often forced to write computer programs that simulate the behavior of a human user clicking through a court web site, as that was the only way to download opinions from those sites. In contrast, some court web sites make all their opinions available for download via FTP, a very simple Internet protocol designed for bulk file transfer. The latter made our job easier, and better promotes public access to archived records.

In general, State agencies should not take responsibility for providing the public with the tools to search and retrieve electronic records. Those tasks are better handled by corporate entities (such as Google) and non-profit institutions (such as AltLaw.org) that have technical expertise in those areas. The State should take responsibility for making consistent, accurate, complete data available in bulk at little or no cost to users of that data.

Question 10

Regarding the management of highly specialized data formats such as CAD, digital imaging, Geographic Information Systems, and multimedia, the State should use open, published, freely-available standards whenever possible. When open standards are not available for a particular type of data, the State should attempt to make that data available in as many competing formats as possible. For example, many database and statistical applications which use proprietary data formats can “dump” their data into simple, standardized formats such as comma-separated values (CSV). Imaging software which uses proprietary formats can usually convert files to non-proprietary standard formats as well. Wherever possible, these conversions should be “lossless,” that is, they should not lose any information in the conversion. Ideally, it should be possible to completely reconstruct the specialized data from the information contained in the simpler data formats. These practices provide insurance against future loss or corruption of data in highly-specialized formats.

Conclusion

I strongly urge the State to prefer ODF to OOXML. As one engaged in extracting data from large quantities of government records (over half a million documents, at last count), I would find ODF much more conducive to enabling my work than OOXML. I have looked at examples of the internal XML schemas used by both formats, and I find ODF much easier to read, understand, and manipulate than OOXML.

Easterbrook on GPL, Presages AltLaw

While playing with my current all-consuming project, AltLaw.org, I came across this case: Wallace v. IBM. In 2006 a man named Daniel Wallace sued various distributors of GNU/Linux, including IBM, Red Hat, and Novell, for “price-fixing.” Since the GPL ensures Linux will always be free, Wallace argued, he could not afford to enter the market with his own operating system; thus Linux distributors hold an illegal monopoly. Judge Frank Easterbrook, in an opinion that mentions Open Office and GIMP, denied the claim, concluding “The GPL and open-source software have nothing to fear from the antitrust laws.”

Easterbrook also mentions the popularity of commercial legal databases such as WestLaw and Lexis-Nexis, “even though courts give away their work (this opinion, for example, is not covered by copyright and may be downloaded from the court’s web site and copied without charge).”

Brilliant! Easterbrook captures both AltLaw’s reason for existence and its methodology in one sentence. Couldn’t have said it better myself, Frank.

Note-taking on the Web

I just started playing with this, and already I love it: Zotero. It’s like a bookmark manager crossed with a note-taking program crossed with BibTeX.

Zotero is an extension that runs inside Firefox 2.0 — click the icon, and it captures a complete bibliographic record of the page you’re looking at, and saves a copy. This is vital when you need to cite web pages that may not be permanent.

Even better, it has scrapers (they call them “translators”) for a bunch of online databases, the kind you get in university libraries. Say you’re reading the online PDF version of an article that appeared in a journal. Click on Zotero, and it saves the PDF, then stores the URL along with the journal name, volume, number, page, author, title, etc. Then click another button and it spits out a bibliography in APA, MLA, or Chicago form.

It works for offline resources too. Look up a book in a card catalog, and click to record the bibliography. Add your own notes and links to files on your hard drive. Of course, it has a search function, with tagging promised in a future release. It’s open-source.

Unlike the cool-but-geeky note-taking programs (like desktop Wikis), Zotero is designed for scholarly work, and it has some big-name research institutions behind it. Here’s hoping it continues to grow.

Abstract Interfaces

Office 2003 uses a table of 1500 colors to render the user interface. That’s 1500 different colors designers have to choose for each color scheme. Overkill? Probably. But it says something about commercial software that sets it apart from most open-source software. Despite the greater theme/skin-ability of KDE, Gnome, and friends, open-source GUIs tend to look less “polished” than big commercial ones. To be fair, the same could be said of an awful lot of shareware and commercial software from smaller companies. The fact is, only a huge company like Microsoft has the resources to pay professional designers just to pick colors all day.

Can independent developers compete? I think they can, with a different approach. By adopting a standard API for interfaces that goes beyond the usual widget set to encompass entire interaction paradigms, developers of many different applications can all take advantage of a consistent interface. Then that interface can be beautified by a relatively small number of designers.

I’m not talking about user-interface guidelines like the Gnome and KDE projects have; I’m talking about abstracting the entire interface away from application programming.

Emacs provides a simple example in the same vein: if you are writing an Elisp application to run inside Emacs, you do not need to provide an interface for setting user preferences in that application. You only need to describe the user-customizable variables and their possible values, and Emacs itself renders those options in a customize buffer, which looks and behaves consistently with every other Emacs customize buffer.

Now, Emacs is text-based and programmer-oriented, and the customize interface reflects that. But there’s no reason why the same concept cannot be applied to graphical interfaces. Suppose, instead of writing classes for MyAppMainWindow, MyAppMenuBar, and MyAppToolBar, you just specify “My App manipulates Foo objects. It provides the following functions: Qux Bar Baz. It can read and write Foo objects as plain text or XML. It has these user preferences: 1 2 3.” In other words, specify the hooks into the application’s functionality and let the GUI framework worry about things like where to position controls in a dialog box.
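
As a thought experiment, here is what such a description might look like as plain Clojure data. Everything in it is hypothetical, since no such framework exists; the point is that the application supplies only data and functions, and an imagined framework entry point (render-application, below) would decide how to turn them into windows, menus, and dialogs.

    ;; A hypothetical declarative application description. All names,
    ;; including render-application, are illustrative only.
    (def my-app
      {:name        "My App"
       :manipulates :foo
       :operations  {:qux (fn [foo] (assoc foo :quxed? true))
                     :bar (fn [foo] (update-in foo [:count] inc))
                     :baz (fn [foo] (dissoc foo :stale))}
       :formats     {:read  #{:plain-text :xml}
                     :write #{:plain-text :xml}}
       :preferences [{:key :pref-1 :type :boolean :default true}
                     {:key :pref-2 :type :string  :default ""}
                     {:key :pref-3 :type :choice  :options [:a :b :c]}]})

    ;; A framework would consume this description and render the interface:
    ;; (render-application my-app)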

Creating such a sophisticated framework would not be easy, but I believe it could be done, and would open up possibilities for faster application development with more sophisticated interfaces.