Better Abstractions – Digital Digressions by Stuart Sierra

A common complaint about Object-Oriented Programming (OOP) is that classes can make simple data hard to deal with: “I don’t want a DirectoryList object, I just want a list of strings.” I would say this is not a flaw inherent in OOP but rather in the way it is commonly used. “Encapsulation” and “data hiding” encourage creation of abstract interfaces completely divorced from their underlying implementation, even when that underlying implementation is in fact very simple. The problem is that abstract interfaces usually only provide a tiny subset of the operations one might potentially want to perform on the data hiding behind them.

What we have, essentially, are interfaces that abstract “down” to increasing levels of speficity, farther and farther away from the data themselves. Wouldn’t it be better if we could abstract “up,” choosing the representation of the data based on what we want to do with it? For example, take a very common, very general case: a collection of data elements with similar structure. There are only a few basic operations we are likely to perform on such a collection:

Add an item
Remove an item
Modify an item
Sort the items based on some criteria
Find one or more items that satisfy certain criteria
Perform agregate calculations on the items (max, min, etc.)

This is the genius of SQL: it provides the first three with simple commands, INSERT, UPDATE, and DELETE; and the last three with the multipurpose SELECT. An SQL query looks the same regardless of how big the data set is or what algorithm is used to index it. Unfortunately, this idea has never quite made it into general purpose programming languages.

Ultimately, it shouldn’t matter whether data are stored in an array in local memory or on a cluster of relational database servers located across the internet. The things we want to do with the data remain the same. A compiler should be able to translate abstract operations on collections into the actual work needed to perform them in a specific implementation. Ideally, we could just start feeding data into the system and let the compiler choose the most efficient implementation, switching over automatically when the collection becomes too big for a particular method. This is similar to the way Common Lisp (and other languages) automatically switch to arbitrary-precision integers when a calculation grows too large for fixed-width integers.

For an example of this, I turn, strangely enough, to one of the hairier aspects of Perl: tied variables. Basically, a tied variable looks and behaves exactly like one of Perl’s built-in data types (scalar, array, or hash table) when you use it, but it is actually an object that calls methods in response to the operations performed on it. For example, the Tie::File module represents a file as an array of strings, one per line. Delete an item from the array, and you remove a line from the file. Same with insertion, modification, and sorting. Presto! A complete random-access file library that doesn’t require the user to learn a new interface. The implementation does not load the whole file into memory, so it is efficient even for very large files. Iterating over lines in a file used to require a special syntax unique to files; with Tie::File it can be done with standard Perl control structures that iterate over lists.

The even more ambitious Data::All module attempts the same thing in a more general way: tying any external data format (CSV, RDBMS, Excel spreadsheets…) to standard Perl data structures. It’s still alpha code, but the idea is brilliant.

In short, I would like object-oriented interfaces that abstract data towards its simplest possible representation rather than the most specific representation. I do not want objects that limit what I can do with the data they contain.

One Reply to “Better Abstractions”

Digital Digressions » Blog Archive » The Only Data Structures You’ll Ever Need says:

July 6, 2006 at 2:48 pm

[…] These basic types can also be conveniently mapped to external data. A CSV file can be represented as an array of hashes. A database table can be an array of arrays, an array of hashes, or a hash of hashes, whichever you prefer. […]

Comments are closed.