Data formats are annoying. As much as half the code in any large software project consists of translating from one data representation — objects, SQL tables, files, XML, RDF, JSON, YAML, CSV, Protocol Buffers, Avro, XML-RPC — to another.
Each format has its own strengths and weaknesses. Often, no single representation is complete enough to be considered “canonical.” The only canonical representation is an abstract one, a platonic ideal in the mind of some developer. Since this platonic ideal cannot be implemented in code, different people have different expectations for how a particular model is supposed to work.
There are two options: Either you re-implement the model, with all its features and constraints, for each format, and hand-code all the translations; or you use a “smart” library that automatically translates between different representations. ActiveRecord and Hibernate are popular examples of the latter.
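To make the first option concrete, here is a minimal sketch of hand-coded translation, in Python with a hypothetical `User` model (the model, field names, and formats are illustrative, not from any particular project): the same data has an object shape, a SQL row shape, and a JSON shape, and each translation is written out by hand.

```python
import json
from dataclasses import dataclass

# Hypothetical model: a User that must exist as an object,
# a SQL row tuple, and a JSON document.
@dataclass
class User:
    id: int
    name: str
    email: str

    # Object -> SQL parameter tuple (e.g. for an INSERT statement).
    def to_row(self):
        return (self.id, self.name, self.email)

    # SQL row tuple -> object.
    @classmethod
    def from_row(cls, row):
        return cls(id=row[0], name=row[1], email=row[2])

    # Object -> JSON string.
    def to_json(self):
        return json.dumps({"id": self.id, "name": self.name, "email": self.email})

    # JSON string -> object.
    @classmethod
    def from_json(cls, s):
        d = json.loads(s)
        return cls(id=d["id"], name=d["name"], email=d["email"])

u = User(1, "Ada", "ada@example.com")
assert User.from_row(u.to_row()) == u    # round-trip through the SQL shape
assert User.from_json(u.to_json()) == u  # round-trip through the JSON shape
```

Every pair of formats costs another two hand-written functions like these; a “smart” library promises to generate them for you, which is exactly the trade the post is about.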
The problem with “smart” libraries is that they can never be smart enough. At some point you always have to dig into the generated SQL or whatever to make them work efficiently, or even correctly. Frequently this is impossible without hacking the library sources, a daunting tangle of generated and meta-programmed code. The library that was supposed to make your life easier instead makes it hell.
Do these “smart” libraries really save any time? Would it be easier to just write the translation code in the first place? We’ll never know, because programmers can’t resist “smart” systems, or the myth that you can “do more with less code.” You can never do more with less, unless what you’re doing is the lowest common denominator of what everyone else is doing. And if that is what you’re doing, then why bother?
Excellent post. I can add some ammunition for your argument. We recently wrote some simple tools to maintain and update information in a database. The decision to model the schema inside the software (rather than simply having the software operate on the data in the database) substantially reduced the code needed to complete the project, but fleshing out the nuances of the model (and making exceptions for the model’s assumptions) substantially delayed the project.
Well done, Stuart!
I think this is a fascinating area and, perversely in the internet age of data sharing, somewhat under-researched. Personally, I have big issues with current techniques for abstracting structured (hierarchical) data. The obvious candidate is XML, but I think it has a number of flaws. Some are niggles (it’s too verbose), but the main criticism is that it does not separate structure from content. This severely compromises its usefulness. Suppose I wanted to represent a file system using XML: every time I move a file from one folder to another, I have to write back the entire XML file, when in fact all I really need to do is move the file from one part of the structure to another.
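The mismatch the commenter describes can be seen in a few lines. This sketch uses Python’s standard `xml.etree.ElementTree` with a toy, made-up file-system schema: the “move” is two pointer operations on the in-memory tree, but persisting the change still means serializing the entire document.

```python
import xml.etree.ElementTree as ET

# A toy file system encoded as XML (hypothetical schema, for illustration only).
doc = ET.fromstring(
    "<fs>"
    "<folder name='a'><file name='notes.txt'/></folder>"
    "<folder name='b'/>"
    "</fs>"
)

src = doc.find("folder[@name='a']")
dst = doc.find("folder[@name='b']")
f = src.find("file[@name='notes.txt']")

# The "move" itself is cheap: detach from one parent, attach to the other.
src.remove(f)
dst.append(f)

# ...but there is no way to persist just that edit; the whole document
# must be re-serialized and written back.
xml_text = ET.tostring(doc, encoding="unicode")
```

A format that separated structure from content could record the move as a small structural delta instead of rewriting every byte of content.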
I’ve spent a lot of time thinking about and implementing what in my view are more elegant and efficient solutions. If you would like to discuss these issues further, feel free to drop me a line!