Posts Tagged “Java”

I’ve been fascinated with RDF for years, but I always end up frustrated when I try to use it. How do you read/write/manipulate RDF data in code? Sure, there are lots of libraries, but they all represent RDF data as its primitive structures: statements, resources, literals, etc. Working with data through these APIs feels like using a glovebox. To get anything useful done, you have to define mappings between RDF properties/classes and normal data structures in your programming language — classes, maps, lists, whatever. In effect, you have to define everything twice.

Some Java APIs allow one to add annotation properties to classes and methods, with the annotations defining the mapping between Java objects and RDF triples. It’s convenient, and familiar if you’ve used Java persistence frameworks like Hiberante, but you still have to define everything twice — once in your RDF schema, once in Java code.

Other libraries generate Java source code from RDFS or OWL ontologies. This means you don’t have to define everything twice, but adds another step to the write-compile-run cycle, and limits you to the semantics that the code generator can understand. In particular, certain features of RDFS/OWL — multiple inheritance, sub-properties — do not map well into Java.

What I really wanted was a way to create and work with RDF data in Clojure, using the same map/set/sequence APIs that I use for any other Clojure data structure. I flirted with implementing RDF in Clojure but lost interest when I realized that 1) there’s a lot more to implementing RDF than datatype conversions; and 2) my Clojure library suffered from the same glovebox problem as the Java RDF libraries.

The solution, however, was staring me in the face all along. Clojure is a Lisp. I can generate functions directly, without any intermediate “source” representation. I can use my own customized validation and type-checking functions. Furthermore, I can extend the definitions in my RDF schema with new Clojure functions.

Here’s what I ended up with: I designed a simple OWL ontology using Protege 4 and saved it as RDF/XML. Then I used the Sesame 2 library to find all the RDF classes and properties defined in my ontology, and create the appropriate getter, setter, and constructor functions in Clojure. It looks something like this:

(defn intern-classes []
  (doseq [cls (find-all-classes *ontology*)]
    (let [name (resource-to-symbol cls)]
      (intern *ns* name (fn [] {:type name})))))

The resource-to-symbol function creates a symbol named for the local name of the RDF class, with the full URI of its XML namespace in the symbol’s metadata. The call to intern defines a new function that takes no arguments and returns a Clojure map with the symbol as its :type.

Suppose I have a class named Document in my ontology. I now have a Clojure function named Document that creates a new instance of that class, represented as a Clojure map. Furthermore, using Clojure hierarchies and the isa? function, I can generate Clojure code that implements the subclass relationships defined in the ontology. Whee!

I don’t entirely know where I’m headed with this, but I like the way it’s going. I can define my own data types, decide how they map to Clojure data structures, and have code that’s always up-to-date with my RDF vocabulary.

Comments 3 Comments »

I’ve said It’s About the Libraries, and indeed, one of the major selling points of Clojure is that it can call Java libraries directly.

But there’s more to it than that.  Libraries are just one benefit to building Clojure on top of Java, or, more accurately, on top of Java the platform.

Look around you, and you’ll see that 99% of all the software in the world runs on just three platforms:

  1. Unix/C
  2. Java Virtual Machine
  3. .NET Common Language Runtime

Where did these platforms come from?  Let’s see:

  1. AT&T
  2. Sun
  3. Microsoft

Notice something?  All three were all developed by huge corporations.

Building a new platform isn’t just about writing the code.  In fact, very little of it is about code.  You need books, articles, conferences, workshops, and university courses.  You need multinational corporations to trust their entire business to your platform.  It takes millions of dollars and tens of thousands of hours of labor to create a new platform.  Think of the massive ad campaigns Sun ran for Java.  Can you do that?  Of course not.

So when you’re designing a new language, you have to build on an existing platform.  Most of the so-called “scripting” languages grew up on Unix, so they’re written in C.  Now, Unix/C is a great platform, still going strong after 40 years.  It provides powerful tools and standardized interfaces such as files, sockets, and pipes.

The problem is that each of the “scripting” languages has developed into its own mini-platform.  Perl, Python, and Ruby each define their own set of data structures for fundamental types like strings, lists, and maps.  The only “types” that Unix recognizes are text and binary.  You can’t exchange data between two languages without serializing everything to some agreed-upon format.  And you can’t do callbacks between languages below the level of a whole process.

The other problem with languages written in C is, well, C.  Pointers are hard.  Memory management is hard.  I know from bitter experience that Ruby libraries can have segfaults or memory leaks.  That just doesn’t happen in Java.

Clojure was created to leverage capabilities of Java-the-platform — garbage collection, dynamic code generation, JIT compilation, threads, locks — some of which are difficult to use effectively in Java-the-language.  To implement Clojure in C, for example, you would first have to build your own platform with these features.  That’s effectively what most Common Lisp implementations do, and they suffer because the Common Lisp world is too small to sustain its own platform.

The brilliant thing about Java-the-platform is that it allows many languages to coexist.  I can mix code written in Java, Clojure, JRuby, Jython, etc. and it’s pretty easy, because they all implement the same fundamental interfaces like java.util.List and  java.lang.Runnable.  For example, right now I have Hadoop (Java code) calling Clojure code calling JRuby code.  It all just works.

(The .NET CLR provides similar capabilities, and there is a Clojure CLR port.)

Comments 1 Comment »

Back in February, in a slightly plaintive post, the W3 sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.

After that post, I spent two whole days fighting with XML catalogs — possibly the worst-documented XML spec ever — to make sure my Java code wasn’t downloading a DTD every time it read an XHTML document.

To my annoyance, no one seems to have posted any cut-and-paste solutions to this problem. Setting properties on the SAX parser is no help, and the XML catalogs solution is a pain to set up.

So what if someone wrote a “dummy” XML entity resolver that does nothing? Here’s what I came up with:

public class DummyEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        return new InputSource(new StringReader(""));
    }
}

Lo and behold, it works! The key is the return line — if you return null, the SAX parser reverts to its default behavior and downloads the DTD.

Use it like this:

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(new YourContentHandler());
reader.parse(your_xml_source);

The catch is that this will break any externally-defined entities, including standard XHTML entities like ©. The built-in XML entities such as &, and numeric character entities like &x43;, will still work.

You can check that you’re not downloading any DTD’s by watching the output of ngrep -q DTD while running your XML parser. If it doesn’t print anything, you’re good.

Comments 3 Comments »

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

public ArrayWritable(Class valueClass) {
    // ...
}

public ArrayWritable(Class valueClass, Writable[] values) {
  this(valueClass);
  this.values = values;
}

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code of its own. I had no idea you could do that.

Comments No Comments »

Perhaps I was premature worrying about how slow Ruby is. John Wiseman was benchmarking Montezuma, his Common Lisp port of Ferret/Lucene, and found out in the process that Ferret is 10 times faster than Java Lucene! As he says, Ferret gets help from about 65,000 lines of C code.

I’ve heard this before, perhaps not often enough to make a generalization, but at least enough to identify a trend: if you want performance from Ruby code, rewrite it in C. (The same is sometimes said of Python, or really any interpreted language.) The basic approach seems to be to extract the most performance-critical parts of your dynamic, interpreted language program and rewrite them in a static, compiled language, thus retaining most of the benefits of both.

It’s an interesting contrast to what I see as the Common Lisp approach to optimization, which is to keep everything in Lisp but add compiler declarations in hopes of speeding it up. Trouble is, unless you’re an expert on the inner workings of your compiler (or can read the disassembled code) it’s hard to know exactly what effects a particular declaration will have.

Eventually, I think manual optimization will become unnecessary. Experimental compilers like Stalin have been shown to produce faster machine code than hand-coded C. Stalin compiles a subset of Scheme down to a subset of C, making heavy use of type-inferencing and static analysis. If it can be done with Scheme, surely it can be done with Python, Ruby, or any other dynamic language.

Comments No Comments »

Ran across an interesting remark in a discussion of Microsoft hiring interviews:

If I remember, a lot of MIT people back in the 70s broke the computer world into the Lisp and non-Lisp data typers. The Lisp folk took a casual attitude towards data structures – just shove them in a list, put them on a plist, stash them in a cache. If it gets slow or confusing, add some tags and a hash algorithm. Most non-Lisp folk were appalled at this. They wanted to see the data structure design up front, the data relationship dictionary, complete and comprehensive, even before any coding started.

This sounds like it could be a language-induced habit as much as a programming style. E.g., in Java you have to write “class foo” at the start of a program before you can do anything else, so it makes sense that you’d start by defining data structures. In Lisp, it’s really easy to jump straight to the algorithms, making up “casual” data structures as you go along, so that’s what you do.

The “stuff everything into a list, add some tags later when it gets too big or confusing” is exactly how my largest-to-date Lisp programming project panned out. It worked fine, and made the code pretty short and simple. As the big list got too big to handle in-place, I wrote a handful of functions that pulled out specific pieces of data. Since function calls look identical to method calls in Lisp, the code looked like I had defined a class with a bunch of accessor methods, while in fact it was still just a big list. This was especially helpful given that I was working with free-form data parsed from text files, not a table of predetermined fields. (This could also have been handled somewhat less elegantly with XML.)

Update Oct. 3, 2006: In case anyone was misled, the title was a joke. Data structures are certainly useful. I was merely commenting on the programming styles induced by different languages.

Comments 2 Comments »

I was looking at a problem in the early chapters of Artificial Intelligence: A Modern Approach. It’s called Vacuum World. This is a very simple agent problem consisting of a vacuum placed on a grid. The grid has only two squares, each of which is either dirty or clean. The vacuum has three actions: move left, move right, and suck.

I wrote a simple implementation of this in Java (because the course I’m taking uses Java), starting out something like this:

public class Vacuum {
    private GridSquare location;
    public void moveLeft() { ... };
    public void moveRight() { ... };
    public void suck() { ... };
}

Obvious, beginner’s OOP, right? But then I start to second-guess myself. Is location really a property of the vacuum? Or is it a property of the vacuum in its environment? That is, does the location variable belong in the Vacuum class, in the Grid class (the environment), or in some “world” object that encompasses both the vacuum and the grid? Should Vacuum even be allowed to have a reference directly to its location, or should it query the environment indirectly, as in “Is the square I am currently occupying (wherever it may be) dirty?”

The example code on the A.I.M.A. web site seems to go out of its way to be inscrutably OO, with the behavior of the vacuum divided among half a dozen abstract “agent” classes. I think the vacuum there stores its location as an integer, but I couldn’t swear it to you.

Comments No Comments »

I decide to play around with the Java logging facility. I write a simple test program:

import java.util.logging.*;

public class LoggingTest {
    public static void main(String[] args) {
	Logger log = Logger.getLogger("com.stuartsierra");
	log.entering("LoggingTest", "main");
	log.info("Info Message");
	log.warning("Warning Message");
	log.exiting("LoggingTest", "main");
    }
}

Compile, run, and I get this:

Aug 30, 2006 10:48:45 AM LoggingTest main
INFO: Info Message
Aug 30, 2006 10:48:45 AM LoggingTest main
WARNING: Warning Message

Hmm. Not quite what I expected. Two out of my four log messages don’t appear. Back to the API docs. Ah, yes, I need to set the logging level if I want to get finer-grained messages like “entering” and “exiting.” So I add the line,

log.setLevel(Level.FINEST);

just after getLogger(), compile, and run again. Same exact result — the “entering” and “exiting” messages don’t appear. I look at the API docs again. Nothing strikes me. The Java (TM) Logging Overview doesn’t enlighten me either. I try Level.ALL, just to make sure. Nope, Java doesn’t want to print those messages.

Now if I were on a deadline, I would say screw it and go do some real work. But I want to understand why this doesn’t work. After three or four web searches I find An Introduction to the Java Logging API. As it turns out, I need to set the level not just of my Logger, but also of each of the Handlers receiving messages from it. The article suggests this code:

Handler[] handlers =
   Logger.getLogger( "" ).getHandlers();
for ( int index = 0; index < handlers.length; index++ ) {
   handlers[index].setLevel( Level.FINE );
}

I insert that into my test code, and lo and behold, it works:

Aug 31, 2006 11:05:53 AM LoggingTest main
FINER: ENTRY
Aug 31, 2006 11:05:53 AM LoggingTest main
INFO: Info Message
Aug 31, 2006 11:05:53 AM LoggingTest main
WARNING: Warning Message
Aug 31, 2006 11:05:53 AM LoggingTest main
FINER: RETURN

So what's the lesson of all this? It's easy just to say "Java's logging API is weird." But I'm sure those Handlers are in there for a reason. I can imagine wanting to send the same stream of log messages to a file and a window at the same time. I can even imagine wanting to send different levels of log detail to different outputs. I do think it is a mistake to have the level-setting code duplicated in both Logger and Handler, as it means more work for me, the client programmer who gets frustrated when his code doesn't do what he expects it to. At the least, the API documentation for Logger.setLevel() should mention that you need to set the level on all the Handlers too.

More generally, I think this is an example of how a library can be too customizable. Picture the feature list: "Custom logging levels per-logger and per-output." Sounds great, until you have to remember to set them both to the same value to get what you want.

Comments 2 Comments »

Or, How the Lisp-n Shall Inherit the Earth

Humans like to name things. Like ourselves, Homo sapiens, Latin for “Primate that has taken leave of its senses.”

Then there are engineers. Engineers like to name things too. Like SCSI, pronounced “scuzzy.” Or WYSIWYG, pronounced “wizzy-wig.” Or TTY, pronounced (I couldn’t believe this at first) “titty.”

Then there are programmers, who are singularly uncreative when it comes to naming things. With Apache’s Gregor Samsa class as a brilliant exception, programmers tend to give their variables and functions dull, predictable names like string, number, or do_not_use_this_variable_ever.

Of course, in most programming languages a given name can refer to one thing and one thing only for ever and ever until your program crashes and dumps core Amen. To get around this sad limitation, programming boldly took advantage of the eighth-century’s greatest typographical innovation, lower case letters. Thus we have the Dada-esque beauty of case-sensitive code:

Widget* WIDGET = new Widget.widget();

As languages got more complex, adding first-class functions and first-class types, there were more and more things that needed to be called by the same name. Few programming languages picked up the idea of meaning based on context. So hierarchical namespaces were born:

java.awt.event.ActionEvent myEvent = new java.awt.event.ActionEvent();

My tendinitis starts acting up at the mere sight of that.

Common Lisp to the rescue! As many have pointed out, CL is not just a Lisp-2 (functions and variables in separate namespaces), but more of a Lisp-n. It has many overlapping namespaces — functions, variables, type specifiers, block names, tagbody tags. With a few macros and a hash table, one can easily add a new namespace to the language. For example, testing frameworks like FiveAM and LispUnit put test names in a separate namespace. A programmer using these libraries doesn’t have to think about the extra namespace. Just type symbols and Lisp will do the right thing.

I like to think of this approach as parallel namespaces. Everything is kept neatly separated, without endless qualifiers attached to every symbol. Macros determine how symbols will be evaluated — or not — in a block of code.

This isn’t perfect. For one thing, there’s no canonical way to add a new namespace to a language. Some macros require names to be quoted symbols, some not. Some use keywords. Sometimes an unquoted symbol will pollute the namespace of the current package; sometimes it won’t.

The solution, to my mind, would be something that lets us do this:


(define-namespace foo)
(defmacro-with-namespaces my-macro ((arg1 foo))
. . . )

Then my-macro would always treat its first argument as a symbol which is not evaluated, internally binding arg1 to whatever value the given symbol holds in the foo namespace. I’m not going to attempt to implement such a beast just now, but I’m sure an experienced macro-writer (i.e. someone else) could dash it off in half an hour.

Comments 3 Comments »

Ah, the loop, so fundamental to programming it’s hard to imagine a single program without one. After all, what’s the use of calculating just one thing? Usually you have a big pile of things you want to calculate, which is why you need a computer in the first place.

I think one of the quickest ways to get a feel for a language is to study its looping constructs. I make no pretense that this is a complete or even an accurate list, but these are some of the general iteration patterns I’ve noticed in different languages.

Counter Variable Loop

for ( int i = 0; i <= 10; ++i ) {
  do stuff with list[i];
}

The old C classic, with deeper roots, I believe, in FORTRAN or PASCAL or both. Successive values of an integer counter are used to retrieve input from an array. Very efficient for small data sets, but requires the entire input to be stored as an array in memory. Also requires the looping code to know about the structure of the input, so not very adaptable. But surprisingly resiliant: Perl and Java support the same syntax.

For-Each Loop

foreach item in list
   do stuff with item
done

Probably the most popular syntactic looping construct, and easy to see why: it’s very easy to read and understand. The for-each loop shows up in Perl, Python, Visual Basic, and a host of other languages. Because it’s usually built in to the language syntax, it can rarely be extended to non-standard container types.

Container Method Loop

list.each do |item|
   do stuff with item
end

In purely object-oriented languages like Smalltalk and Ruby (shown above), looping constructs can be implemented as methods of container classes. This has the great advantages that new looping constructs can be added and standard loops can be implemented for new container types. Since the code “inside” the loop is just an anonymous function that takes a single item as its argument, it doesn’t need to know anything about the structure or type of the container.

List Comprehensions

[ item.do_stuff() for item in list ]

Although Python (syntax above) has gotten a lot of press, both good and bad, for its adoption of list comprehensions, they’ve been around a lot longer. I believe they were originally developed to describe lists in a way that looks more like mathematics. For simple patterns list comprehensions are easy to understand, but I don’t yet grok their full significance. They can be nested and combined to produce complex looping patterns that would be awkward to write with C-style iteration.

Half-Nelson Functional Loop

(map  (lambda (item)  do stuff with item)  list)

(Update 26 July 2006: Replaced the idiotic example (map (lambda (item) (process item)) list) with the above. I’m talking about block structures in code here, using the body of the lambda like the body of a for loop in the other languages. Obviously, if your loop body was just a single function, you would just use (map #'process list). Same changes apply to the next example below.)

Functional languages, including most dialects of Lisp, usually have a map operator that takes a function and a list and applies that function successively to each element of that list. I call this the Half-Nelson Functional Loop because it’s not, to my mind, the ultimate of functional behavior. For that, we turn to…

Full-Nelson Functional Loop

((lift (lambda (item)  do stuff with item))  list)

This looks pretty much like the previous example. But here, instead of map, we have lift, which takes only one argument, a function, and returns a new function that applies that function to every element of a list. That new function is then applied (here, in a Scheme-like syntax) to the list argument. I learned about lift from this blog article.

Comments 6 Comments »