Here’s a question that’s been bugging me for a while: what’s the best way to store information that is a mixture of highly- and loosely-structured data? For example, a collection of documents like Project Posner. Certain attributes of each document like the title, date, and citation fit easily into a normalized relational database model. But the body can only be described with some kind of markup.
I could just use HTML, except for one problem: my documents have to handle footnotes, for which HTML does not provide a tag. (As an aside, footnotes are a pain whether you’re doing web design or typesetting.)
On Project Posner, I compromised: everything is stored in a MySQL database, and the documents table has a “body” column that contains my own made-up XML syntax.
I could, in theory, normalize everything, even individual paragraphs. But that would be a nightmare to code and deadly slow. I could also store everything as XML documents. But then I’d have to reinvent all the facilities that MySQL (and ActiveRecord) provide, like transaction handling, auto-incrementing IDs, and so on.
For another project, I’m trying to create a pseudo-database that stores everything as XML files and uses Ferret for searching. I was going to use Ferret for full-text search anyway, so my original thought was to save overhead by not bothering with MySQL indexes. It works, but looking over it I realize that most of the data could be normalized to fit into the standard relational model. I’d still need a blob of XML data somewhere, but it could be in the database as easily as a file. What have I really gained, besides an impressively large and complex pile of code?