Comment Spam as Popularity Index

I just noticed that the Akismet spam filter plugin for WordPress provides PHP code to display its spam-comments-blocked counter on your blog. I wonder: could one use the number of spam comments received as a measure of a blog’s popularity? Presumably spam bots target sites that are more heavily linked from other sites, making spam counts rough approximations of Google’s PageRank.

Just imagine it: Spam is the new hits. “My blog gets more spam than your blog.” “I need to add more porn links to my site to attract more spam.” “We need to focus on our spam-engine-optimization strategy.” “This blog brought to you by Cia|is.”

For the record, Akismet has caught 1,958 spam comments in the few months since I installed it, or about 30 times the number of legitimate comments I have received since I started writing the blog.

LaTeX for the Rest of Us

I really like LaTeX. So much so that I bought a used copy of the original LaTeX “blue book” just so I could write a class file to print my freshman English papers in MLA format, which requires breaking most of the typesetting rules that make LaTeX output look so professional to begin with.

But there’s no question about the ugliness of LaTeX (or plain TeX) source. At times it borders on incomprehensible. LyX helps, but only if you like LyX’s editor.

When I was an office temp I saw dozens of people struggle to typeset 100+ page documents in Microsoft Word. Word is a pretty powerful tool, but it’s just not up to the task. Small, maddening inconsistencies appear that are difficult to correct. Large documents require a lot of memory and cause crashes. And when you have multiple people working on the same document, only one of whom understands how to use styles properly, it becomes nightmarish.

The trouble with WYSIWYG is that what you see is all you get. Why should an author — who should only be thinking about the content, not the presentation — be faced with decisions about line and page breaks while working on a first draft?

So here’s what I want: an editor that looks and feels like Microsoft Word, but that only permits structural editing — section headings, emphasis, etc. Then use TeX or a similar typesetting system to generate printed documents. To make this useful to a general audience, one would also need a “style editor” to modify the behavior of the typesetter.
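
To make that concrete, here is a purely hypothetical sketch (invented markup, not the format of any existing tool) of what a structural document might look like underneath, written as a Lisp-style s-expression:

(document
 (section "Introduction"
  (paragraph "The author marks " (emphasis "structure") " only;"
             " the typesetter decides fonts, line breaks, and page breaks.")))

The “style editor” would then map each structural element to typesetting rules, much as a LaTeX class file does.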

The problem with this is that the idea of “structural editing” seems quite alien to a lot of computer users. People like WYSIWYG. They want to work on screen with something that looks like a finished document. I believe this is actually a very inefficient way to work, since formatting concerns distract from the actual writing, but I don’t know how to convince anyone else of this. But if the “structural editor” could be styled to match the user’s expectations of what the final document will look like, we could have the best of both worlds: writers would work in an environment they already know and feel comfortable with, and editors and publishers would be spared the frustration of fixing formatting inconsistencies introduced by writers.

Ruby More Memory-Efficient than Lisp?

I continue to sweat (see previous entry) over the question of language choice for future iterations of Project Posner (and some as-yet-unnamed similar projects). Ruby on Rails is the obvious mainstream choice, mainstream at least compared to Lisp. But a part of me really wants to do it in Common Lisp, just to prove I can.

One concern I do have is speed. Ruby is pooh-poohed for being slow, which, it’s true, is not really fair for a 1.x-version scripting language, but the Programming Language Shootout does support the accusation. I tried comparing Ruby and SBCL on the Shootout. As I expected, SBCL is up to several hundred times faster than Ruby, but I did not expect that Ruby would use two to five times less memory.

Maybe Ruby’s data structures are very close to their C analogs, lacking the extra padding that Lisp needs for type identification? But no, Ruby is dynamically typed, too, so surely it needs just as many tag bits. Ah, I know: The test must be counting the large size of the SBCL runtime (over 20MB, I recall reading somewhere) compared to Ruby’s (less than 2MB). For a limited-duration algorithmic test, this would certainly dominate the results.
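
For what it’s worth, the standard ROOM function gives a rough view of this from inside a running Lisp; on SBCL it reports how much of the dynamic space is in use, which includes all that runtime baggage. (The output is implementation-dependent, so I won’t reproduce it here.)

CL-USER> (room nil)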

I wonder, though: over longer run times, which language would use less memory for actual data storage? I suspect that carefully optimized Lisp arrays would win, but Ruby’s arrays, the standard way to represent lists in Ruby, might fit in less space than a linked list structure, the standard way to represent lists in Lisp.
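
As a hedged back-of-the-envelope comparison (the exact numbers are implementation-dependent): a cons cell costs two words, so a list of 1,000 small integers needs at least 8 KB on a 32-bit system, while a specialized vector can pack the same data into roughly 1 KB:

CL-USER> (make-list 1000 :initial-element 0)                ; two words per element, just for the conses
CL-USER> (make-array 1000 :element-type '(unsigned-byte 8)) ; about one byte per element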

Note-taking on the Web

I just started playing with this, and already I love it: Zotero. It’s like a bookmark manager crossed with a note-taking program crossed with BibTeX.

Zotero is an extension that runs inside Firefox 2.0 — click the icon, and it captures a complete bibliographic record of the page you’re looking at, and saves a copy. This is vital when you need to cite web pages that may not be permanent.

Even better, it has scrapers (they call them “translators”) for a bunch of online databases, the kind you get in university libraries. Say you’re reading the online PDF version of an article that appeared in a journal. Click on Zotero, and it saves the PDF and stores the URL along with the journal name, volume, number, pages, author, title, and so on. Then click another button and it spits out a bibliography in APA, MLA, or Chicago style.

It works for offline resources too. Look up a book in a library’s online catalog, and click to record its bibliographic information. Add your own notes and links to files on your hard drive. Of course, it has a search function, with tagging promised in a future release. It’s open-source.

Unlike the cool-but-geeky note-taking programs (like desktop Wikis), Zotero is designed for scholarly work, and it has some big-name research institutions behind it. Here’s hoping it continues to grow.

Borrowability

The first draft of Project Posner was written in Common Lisp. I thought it would be fun to see how Common Lisp fared as a language for doing heavy text processing with a web front end. It worked well, and I’m convinced it made the process easier than it would have been with any other language. But everything I’ve done with it up till now is off-line: I used Lisp to statically generate the site on my desktop machine, then uploaded the HTML pages to the server. Search is handled by ht://Dig, an old-school CGI app written in C.

I’d love to continue to develop Project Posner in Lisp, especially since I’m currently the only programmer working on it. But to add any more features I need server-side programming. I find myself wondering … do I dare try to use Lisp? First off, I’d have to get a new web host, probably a virtual server, since no shared host offers Lisp pre-installed. That would cost more. Then I’d have to set up and maintain the OS on the server, which I’d frankly rather not be bothered to do. Then there are the multiple headaches of getting Apache, mod_lisp, PostgreSQL, and a CL implementation all running and talking to one another. Then, and only then, can I start work on the application itself. And even then I don’t have much pre-written code to draw on. Sure, HTML and JavaScript generation is in the bag, but there aren’t any drop-in libraries for forums, guestbooks, user authentication, or any of that good stuff.

I could probably write that stuff myself in Lisp. But could I do it faster and better than the hundreds of other people who have already done it in Perl/PHP/Python/Ruby? I don’t think so. I’m not that good.

So there it is. Web application development is an evolving problem, but by and large a solved one. And it wasn’t solved in Lisp. When Paul Graham was creating Viaweb, no one else was even thinking of web applications, so he had to create his own tools. But the biggest recent poster child for Lisp on the Web, Reddit, gave up and switched to Python (to much gnashing of teeth in the Lisp world). It has nothing to do with the language itself. Lisp is still great. It’s all about the tools, the libraries, the “borrowability” of other people’s code.

So I’ll continue using Lisp for off-line stuff, private projects and such. But for building Project Posner version 2.0, I’ll probably look elsewhere.

Premature Cleverization

“Premature optimization is the root of all evil,” say Hoare and Knuth. I have determined that I suffer from a slightly different but related malady: premature cleverization. At the start of a programming task, before I’ve written any real code, I get some wacky idea like “I could do this all with binary XORs” or “this would be cool in Lisp” or “I could write a library that would solve this and a whole bunch of other problems and make me rich and famous.” I sit up late at night, doodling diagrams that prove just how brilliant my idea is. Three days later I have a stack of diagrams and I’m ready to write some real code. I try to implement the idea. It doesn’t work. In the excitement of inspiration I glossed over — or just forgot — some flaw that makes the whole thing impossible. In effect, I wasted three days doodling and trying to be clever instead of doing real work on a boring piece of code that actually solves my problem. It’s a high-energy kind of procrastination, the sober equivalent of a drug addict’s rambling theories.

The Woz

I got to see Steve Wozniak speak at Columbia University last night, promoting his new book iWoz. Strongest impression: The man has an incredible amount of energy. He talked in a strong voice at high speed for nearly an hour before rushing off to tape a spot on The Colbert Report. Most interesting fact: he hooked up his Apple I to his television set — creating the first practical video monitor — because he needed a cheaper input/output method than the teletypes everyone was using at the time.

He’s obviously obsessed with engineering and programming to a degree that most mortals can barely imagine. Who else would think of building a 10-bit digital computer as a middle school science fair project?

Another thing that struck me was his stated desire to start fresh on each computer he designed, building in capabilities — like color for the Apple II — from the ground up. That seems like the complete opposite of the approach most of the PC industry has taken of layering new functionality on top of old, to the point where it takes 5 minutes to boot a modern PC. What might the Woz come up with if he started today designing a new PC from the ground up, taking advantage of all the advances of the past two decades but leaving behind the legacies?

Project Posner: first look

Been too busy with work and class to post much, but here’s a link for all the IANALs out there: Project Posner. It’s an on-line database collecting the case opinions of Richard A. Posner, judge on the 7th Circuit Court of Appeals. This was the brainchild of law professor and former Posner clerk Tim Wu. I wrote all the code to parse and format the cases, in 100% Common Lisp! Specifically, about 800 lines of code that spit out 36 MB of static HTML in about 5 minutes — whee! Currently having some problems with Google’s free web search; maybe they’ll crawl the site now that I’ve linked to it. Or maybe I’ll break down and implement my own search function. In any case, take a look; comments welcome.
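
Just to give a flavor of the general shape of the generator (this is a hypothetical sketch with made-up names like RENDER-CASE-HTML, not the actual Project Posner code):

(defun generate-site (cases output-dir)
  ;; Write one static HTML file per case into OUTPUT-DIR.
  (dolist (c cases)
    (with-open-file (out (merge-pathnames (format nil "~A.html" (case-id c))
                                          output-dir)
                         :direction :output :if-exists :supersede)
      (write-string (render-case-html c) out))))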

Breadth-first and Depth-first Searching

I’m playing some more with the early chapters of Artificial Intelligence: A Modern Approach, looking at basic tree-search techniques for the 8-puzzle. I wrote simple breadth-first and depth-first search algorithms for this puzzle in Common Lisp. Here’s the code. It’s an interesting demonstration of how inefficient these algorithms really are.

I represent the puzzle as a simple list containing the integers 0 through 7, with the symbol B representing the “hole.” The “winning” position is (0 1 2 3 4 5 6 7 B). I also added a little code to have the algorithm track the number of iterations; how many “nodes,” or board positions, it has to expand in the search tree; and the maximum number of positions it has to keep stored in memory.
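
The linked code has the full details; here is a simplified sketch (hypothetical names, not the code linked above) of the board representation and move generation:

(defparameter *goal* '(0 1 2 3 4 5 6 7 b))

(defun swap (board i j)
  ;; Return a copy of BOARD with elements I and J exchanged.
  (let ((new (copy-list board)))
    (rotatef (nth i new) (nth j new))
    new))

(defun neighbors (board)
  ;; All boards reachable by sliding one tile into the hole.
  (let* ((hole (position 'b board))
         (row (floor hole 3))
         (col (mod hole 3))
         (moves '()))
    (when (> row 0) (push (swap board hole (- hole 3)) moves))
    (when (< row 2) (push (swap board hole (+ hole 3)) moves))
    (when (> col 0) (push (swap board hole (- hole 1)) moves))
    (when (< col 2) (push (swap board hole (+ hole 1)) moves))
    moves))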

Just to test that I’ve coded it right, I make sure the algorithm does nothing when given a solved puzzle to start with:

CL-USER> (breadth-first-search '(0 1 2 3 4 5 6 7 B))
"Found (0 1 2 3 4 5 6 7 B) in 0 steps.
Expanded 0 nodes, stored a maximum of 0 nodes."

Next, I try it on a puzzle that is just one move away from a solution:

CL-USER> (breadth-first-search '(0 1 2 3 4 5 6 b 7))
"Found (0 1 2 3 4 5 6 7 B) in 3 steps.
Expanded 9 nodes, stored a maximum of 5 nodes."

All well and good. Now let’s try shuffling the pieces a bit more:

CL-USER> (breadth-first-search '(0 1 2 b 3 6 4 5 7))
"Found (0 1 2 3 4 5 6 7 B) in 18327 steps.
Expanded 49075 nodes, stored a maximum of 8422 nodes."

Ouch. That’s kind of slow. And that wasn’t even the hardest possible starting arrangement. So much for that.
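
For reference, the algorithm is the naive tree search from the book’s early chapters, with no duplicate detection. A stripped-down sketch (not the linked code; the counters are omitted, and it reuses the NEIGHBORS function sketched above):

(defun bfs-sketch (start)
  ;; FIFO frontier: newly expanded positions go to the back of the queue.
  (let ((frontier (list start)))
    (loop while frontier
          do (let ((board (pop frontier)))
               (when (equal board *goal*)
                 (return board))
               (setf frontier (append frontier (neighbors board)))))))

Depth-first search is the same loop with new positions pushed onto the front of the frontier instead of appended to the back, which is exactly why it dives endlessly down one branch.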

Let’s see how depth-first search does on the one-move-away puzzle:

CL-USER> (depth-first-search '(0 1 2 3 4 5 6 b 7))

Dum de dum dum. Tick tock.

Half an hour later, I get bored and interrupt it. Obviously depth-first search is not a very good algorithm for solving the 8-puzzle. I’ll have to wait for the next chapter, which introduces A* and other informed searches.

It’s funny to me, because both of these algorithms seem like reasonable ways of searching a tree of possibilities. I guess it’s hard for my mind to grasp just how overwhelmingly large exponential growth really is.

Who Needs Data Structures?

Ran across an interesting remark in a discussion of Microsoft hiring interviews:

If I remember, a lot of MIT people back in the 70s broke the computer world into the Lisp and non-Lisp data typers. The Lisp folk took a casual attitude towards data structures – just shove them in a list, put them on a plist, stash them in a cache. If it gets slow or confusing, add some tags and a hash algorithm. Most non-Lisp folk were appalled at this. They wanted to see the data structure design up front, the data relationship dictionary, complete and comprehensive, even before any coding started.

This sounds like it could be a language-induced habit as much as a programming style. E.g., in Java you have to write “class foo” at the start of a program before you can do anything else, so it makes sense that you’d start by defining data structures. In Lisp, it’s really easy to jump straight to the algorithms, making up “casual” data structures as you go along, so that’s what you do.

The “stuff everything into a list, add some tags later when it gets too big or confusing” approach is exactly how my largest-to-date Lisp programming project played out. It worked fine, and made the code pretty short and simple. As the big list got too big to handle in-place, I wrote a handful of functions that pulled out specific pieces of data. Since function calls look identical to method calls in Lisp, the code looked like I had defined a class with a bunch of accessor methods, while in fact it was still just a big list. This was especially helpful given that I was working with free-form data parsed from text files, not a table of predetermined fields. (This could also have been handled, somewhat less elegantly, with XML.)
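
For illustration (hypothetical names and data, not my actual code), the pattern looks like this: the “record” is just a property list, but the accessors read like methods on a class:

(defparameter *example-case*
  '(:title "Smith v. Jones" :court "7th Cir." :year 1997))

(defun case-title (record) (getf record :title))
(defun case-year (record) (getf record :year))

CL-USER> (case-title *example-case*)
"Smith v. Jones"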

Update Oct. 3, 2006: In case anyone was misled, the title was a joke. Data structures are certainly useful. I was merely commenting on the programming styles induced by different languages.