Posts Tagged “Java”

Back in February, in a slightly plaintive post, the W3 sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.

After that post, I spent two whole days fighting with XML catalogs — possibly the worst-documented XML spec ever — to make sure my Java code wasn’t downloading a DTD every time it read an XHTML document.

To my annoyance, no one seems to have posted any cut-and-paste solutions to this problem. Setting properties on the SAX parser is no help, and the XML catalogs solution is a pain to set up.

So what if someone wrote a “dummy” XML entity resolver that does nothing? Here’s what I came up with:

public class DummyEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        return new InputSource(new StringReader(""));
    }
}

Lo and behold, it works! The key is the return line — if you return null, the SAX parser reverts to its default behavior and downloads the DTD.

Use it like this:

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(new YourContentHandler());
reader.parse(your_xml_source);

The catch is that this will break any externally-defined entities, including standard XHTML entities like ©. The built-in XML entities such as &, and numeric character entities like &x43;, will still work.

You can check that you’re not downloading any DTD’s by watching the output of ngrep -q DTD while running your XML parser. If it doesn’t print anything, you’re good.

Comments 1 Comment »

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

public ArrayWritable(Class valueClass) {
    // ...
}

public ArrayWritable(Class valueClass, Writable[] values) {
  this(valueClass);
  this.values = values;
}

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code of its own. I had no idea you could do that.

Comments No Comments »

Perhaps I was premature worrying about how slow Ruby is. John Wiseman was benchmarking Montezuma, his Common Lisp port of Ferret/Lucene, and found out in the process that Ferret is 10 times faster than Java Lucene! As he says, Ferret gets help from about 65,000 lines of C code.

I’ve heard this before, perhaps not often enough to make a generalization, but at least enough to identify a trend: if you want performance from Ruby code, rewrite it in C. (The same is sometimes said of Python, or really any interpreted language.) The basic approach seems to be to extract the most performance-critical parts of your dynamic, interpreted language program and rewrite them in a static, compiled language, thus retaining most of the benefits of both.

It’s an interesting contrast to what I see as the Common Lisp approach to optimization, which is to keep everything in Lisp but add compiler declarations in hopes of speeding it up. Trouble is, unless you’re an expert on the inner workings of your compiler (or can read the disassembled code) it’s hard to know exactly what effects a particular declaration will have.

Eventually, I think manual optimization will become unnecessary. Experimental compilers like Stalin have been shown to produce faster machine code than hand-coded C. Stalin compiles a subset of Scheme down to a subset of C, making heavy use of type-inferencing and static analysis. If it can be done with Scheme, surely it can be done with Python, Ruby, or any other dynamic language.

Comments No Comments »

Ran across an interesting remark in a discussion of Microsoft hiring interviews:

If I remember, a lot of MIT people back in the 70s broke the computer world into the Lisp and non-Lisp data typers. The Lisp folk took a casual attitude towards data structures - just shove them in a list, put them on a plist, stash them in a cache. If it gets slow or confusing, add some tags and a hash algorithm. Most non-Lisp folk were appalled at this. They wanted to see the data structure design up front, the data relationship dictionary, complete and comprehensive, even before any coding started.

This sounds like it could be a language-induced habit as much as a programming style. E.g., in Java you have to write “class foo” at the start of a program before you can do anything else, so it makes sense that you’d start by defining data structures. In Lisp, it’s really easy to jump straight to the algorithms, making up “casual” data structures as you go along, so that’s what you do.

The “stuff everything into a list, add some tags later when it gets too big or confusing” is exactly how my largest-to-date Lisp programming project panned out. It worked fine, and made the code pretty short and simple. As the big list got too big to handle in-place, I wrote a handful of functions that pulled out specific pieces of data. Since function calls look identical to method calls in Lisp, the code looked like I had defined a class with a bunch of accessor methods, while in fact it was still just a big list. This was especially helpful given that I was working with free-form data parsed from text files, not a table of predetermined fields. (This could also have been handled somewhat less elegantly with XML.)

Update Oct. 3, 2006: In case anyone was misled, the title was a joke. Data structures are certainly useful. I was merely commenting on the programming styles induced by different languages.

Comments 2 Comments »

I was looking at a problem in the early chapters of Artificial Intelligence: A Modern Approach. It’s called Vacuum World. This is a very simple agent problem consisting of a vacuum placed on a grid. The grid has only two squares, each of which is either dirty or clean. The vacuum has three actions: move left, move right, and suck.

I wrote a simple implementation of this in Java (because the course I’m taking uses Java), starting out something like this:

public class Vacuum {
    private GridSquare location;
    public void moveLeft() { ... };
    public void moveRight() { ... };
    public void suck() { ... };
}

Obvious, beginner’s OOP, right? But then I start to second-guess myself. Is location really a property of the vacuum? Or is it a property of the vacuum in its environment? That is, does the location variable belong in the Vacuum class, in the Grid class (the environment), or in some “world” object that encompasses both the vacuum and the grid? Should Vacuum even be allowed to have a reference directly to its location, or should it query the environment indirectly, as in “Is the square I am currently occupying (wherever it may be) dirty?”

The example code on the A.I.M.A. web site seems to go out of its way to be inscrutably OO, with the behavior of the vacuum divided among half a dozen abstract “agent” classes. I think the vacuum there stores its location as an integer, but I couldn’t swear it to you.

Comments No Comments »

I decide to play around with the Java logging facility. I write a simple test program:

import java.util.logging.*;

public class LoggingTest {
    public static void main(String[] args) {
	Logger log = Logger.getLogger("com.stuartsierra");
	log.entering("LoggingTest", "main");
	log.info("Info Message");
	log.warning("Warning Message");
	log.exiting("LoggingTest", "main");
    }
}

Compile, run, and I get this:

Aug 30, 2006 10:48:45 AM LoggingTest main
INFO: Info Message
Aug 30, 2006 10:48:45 AM LoggingTest main
WARNING: Warning Message

Hmm. Not quite what I expected. Two out of my four log messages don’t appear. Back to the API docs. Ah, yes, I need to set the logging level if I want to get finer-grained messages like “entering” and “exiting.” So I add the line,

log.setLevel(Level.FINEST);

just after getLogger(), compile, and run again. Same exact result — the “entering” and “exiting” messages don’t appear. I look at the API docs again. Nothing strikes me. The Java (TM) Logging Overview doesn’t enlighten me either. I try Level.ALL, just to make sure. Nope, Java doesn’t want to print those messages.

Now if I were on a deadline, I would say screw it and go do some real work. But I want to understand why this doesn’t work. After three or four web searches I find An Introduction to the Java Logging API. As it turns out, I need to set the level not just of my Logger, but also of each of the Handlers receiving messages from it. The article suggests this code:

Handler[] handlers =
   Logger.getLogger( "" ).getHandlers();
for ( int index = 0; index < handlers.length; index++ ) {
   handlers[index].setLevel( Level.FINE );
}

I insert that into my test code, and lo and behold, it works:

Aug 31, 2006 11:05:53 AM LoggingTest main
FINER: ENTRY
Aug 31, 2006 11:05:53 AM LoggingTest main
INFO: Info Message
Aug 31, 2006 11:05:53 AM LoggingTest main
WARNING: Warning Message
Aug 31, 2006 11:05:53 AM LoggingTest main
FINER: RETURN

So what’s the lesson of all this? It’s easy just to say “Java’s logging API is weird.” But I’m sure those Handlers are in there for a reason. I can imagine wanting to send the same stream of log messages to a file and a window at the same time. I can even imagine wanting to send different levels of log detail to different outputs. I do think it is a mistake to have the level-setting code duplicated in both Logger and Handler, as it means more work for me, the client programmer who gets frustrated when his code doesn’t do what he expects it to. At the least, the API documentation for Logger.setLevel() should mention that you need to set the level on all the Handlers too.

More generally, I think this is an example of how a library can be too customizable. Picture the feature list: “Custom logging levels per-logger and per-output.” Sounds great, until you have to remember to set them both to the same value to get what you want.

Comments 2 Comments »

Or, How the Lisp-n Shall Inherit the Earth

Humans like to name things. Like ourselves, Homo sapiens, Latin for “Primate that has taken leave of its senses.”

Then there are engineers. Engineers like to name things too. Like SCSI, pronounced “scuzzy.” Or WYSIWYG, pronounced “wizzy-wig.” Or TTY, pronounced (I couldn’t believe this at first) “titty.”

Then there are programmers, who are singularly uncreative when it comes to naming things. With Apache’s Gregor Samsa class as a brilliant exception, programmers tend to give their variables and functions dull, predictable names like string, number, or do_not_use_this_variable_ever.

Of course, in most programming languages a given name can refer to one thing and one thing only for ever and ever until your program crashes and dumps core Amen. To get around this sad limitation, programming boldly took advantage of the eighth-century’s greatest typographical innovation, lower case letters. Thus we have the Dada-esque beauty of case-sensitive code:

Widget* WIDGET = new Widget.widget();

As languages got more complex, adding first-class functions and first-class types, there were more and more things that needed to be called by the same name. Few programming languages picked up the idea of meaning based on context. So hierarchical namespaces were born:

java.awt.event.ActionEvent myEvent = new java.awt.event.ActionEvent();

My tendinitis starts acting up at the mere sight of that.

Common Lisp to the rescue! As many have pointed out, CL is not just a Lisp-2 (functions and variables in separate namespaces), but more of a Lisp-n. It has many overlapping namespaces — functions, variables, type specifiers, block names, tagbody tags. With a few macros and a hash table, one can easily add a new namespace to the language. For example, testing frameworks like FiveAM and LispUnit put test names in a separate namespace. A programmer using these libraries doesn’t have to think about the extra namespace. Just type symbols and Lisp will do the right thing.

I like to think of this approach as parallel namespaces. Everything is kept neatly separated, without endless qualifiers attached to every symbol. Macros determine how symbols will be evaluated — or not — in a block of code.

This isn’t perfect. For one thing, there’s no canonical way to add a new namespace to a language. Some macros require names to be quoted symbols, some not. Some use keywords. Sometimes an unquoted symbol will pollute the namespace of the current package; sometimes it won’t.

The solution, to my mind, would be something that lets us do this:


(define-namespace foo)
(defmacro-with-namespaces my-macro ((arg1 foo))
. . . )

Then my-macro would always treat its first argument as a symbol which is not evaluated, internally binding arg1 to whatever value the given symbol holds in the foo namespace. I’m not going to attempt to implement such a beast just now, but I’m sure an experienced macro-writer (i.e. someone else) could dash it off in half an hour.

Comments 3 Comments »

Ah, the loop, so fundamental to programming it’s hard to imagine a single program without one. After all, what’s the use of calculating just one thing? Usually you have a big pile of things you want to calculate, which is why you need a computer in the first place.

I think one of the quickest ways to get a feel for a language is to study its looping constructs. I make no pretense that this is a complete or even an accurate list, but these are some of the general iteration patterns I’ve noticed in different languages.

Counter Variable Loop

for ( int i = 0; i <= 10; ++i ) {
  do stuff with list[i];
}

The old C classic, with deeper roots, I believe, in FORTRAN or PASCAL or both. Successive values of an integer counter are used to retrieve input from an array. Very efficient for small data sets, but requires the entire input to be stored as an array in memory. Also requires the looping code to know about the structure of the input, so not very adaptable. But surprisingly resiliant: Perl and Java support the same syntax.

For-Each Loop

foreach item in list
   do stuff with item
done

Probably the most popular syntactic looping construct, and easy to see why: it’s very easy to read and understand. The for-each loop shows up in Perl, Python, Visual Basic, and a host of other languages. Because it’s usually built in to the language syntax, it can rarely be extended to non-standard container types.

Container Method Loop

list.each do |item|
   do stuff with item
end

In purely object-oriented languages like Smalltalk and Ruby (shown above), looping constructs can be implemented as methods of container classes. This has the great advantages that new looping constructs can be added and standard loops can be implemented for new container types. Since the code “inside” the loop is just an anonymous function that takes a single item as its argument, it doesn’t need to know anything about the structure or type of the container.

List Comprehensions

[ item.do_stuff() for item in list ]

Although Python (syntax above) has gotten a lot of press, both good and bad, for its adoption of list comprehensions, they’ve been around a lot longer. I believe they were originally developed to describe lists in a way that looks more like mathematics. For simple patterns list comprehensions are easy to understand, but I don’t yet grok their full significance. They can be nested and combined to produce complex looping patterns that would be awkward to write with C-style iteration.

Half-Nelson Functional Loop

(map  (lambda (item)  do stuff with item)  list)

(Update 26 July 2006: Replaced the idiotic example (map (lambda (item) (process item)) list) with the above. I’m talking about block structures in code here, using the body of the lambda like the body of a for loop in the other languages. Obviously, if your loop body was just a single function, you would just use (map #'process list). Same changes apply to the next example below.)

Functional languages, including most dialects of Lisp, usually have a map operator that takes a function and a list and applies that function successively to each element of that list. I call this the Half-Nelson Functional Loop because it’s not, to my mind, the ultimate of functional behavior. For that, we turn to…

Full-Nelson Functional Loop

((lift (lambda (item)  do stuff with item))  list)

This looks pretty much like the previous example. But here, instead of map, we have lift, which takes only one argument, a function, and returns a new function that applies that function to every element of a list. That new function is then applied (here, in a Scheme-like syntax) to the list argument. I learned about lift from this blog article.

Comments 6 Comments »

An excellent article introducing the concepts and advantages of functional programming in non-academic language. Now if I could just understand monads…

Comments No Comments »

So it turns out that nearly every binary search algorithm is broken. Why? Fixed-sized integers, a holdover from machine programming that still lingers in supposedly modern languages like Java. LtU’ers are haggling to decide if it’s a type error, a bad data structure, a language defect, a failure of correctness proofs, or just a bug. Lispers are talking smugly about their automatic fixnum-to-bignum conversion.

Of course, even Common Lisp arbitrarily limits the size of arrays. On my 64-bit SBCL, it’s a little over one quintillion, or (assuming one-byte array elements) one exabyte. That’s a bigger array than even Google is likely to need in the near future. But I think it’s valid to ask why code should be limited by anything except the hardware it’s running on.

So here’s my take on the whole issue: Yes, Java/C++/C suck. But a mathmatically perfect computer will never be built. Every premature optimization is just a concession to the reality of chips and electrons. Computer Science likes to believe it’s pure Mathematics, but Engineering will always be there to keep it humble. Double-check those proofs, and then start running tests.

Comments No Comments »