Programming by Linguists II

Given the task of designing a programming language, which must be exactly defined for engineering purposes, what would a linguist (as opposed to a mathematician) do?

The first step would probably be to define parts of speech. Most programming languages really have only two parts of speech: nouns (variables) and verbs (functions). Named or keyword arguments might count as adverbs, and object properties are often treated like adjectives, but they are not used the same way as in real languages, where they can completely change the meaning of the words they modify.
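To make the analogy concrete, here is a minimal Python sketch of the four "parts of speech" described above. The names Door, open_door, and slowly are invented for illustration and are not drawn from any real library.

```python
from dataclasses import dataclass


@dataclass
class Door:
    # "Adjectives": properties that describe the noun.
    color: str
    is_open: bool = False


def open_door(door: Door, slowly: bool = False) -> str:
    """The "verb": it acts on a noun. The keyword argument plays the adverb."""
    door.is_open = True
    manner = "slowly" if slowly else "quickly"
    return f"The {door.color} door opens {manner}."


front_door = Door(color="red")                 # a noun with an adjective
sentence = open_door(front_door, slowly=True)  # verb + noun + adverb
print(sentence)  # The red door opens slowly.
```

Note how little the "adverb" does here: it selects between two behaviors the verb already knows about, rather than changing what the verb means.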

Next comes grammar: sentence structure, connecting words. Again, despite the complexity of their punctuation rules (which a linguist would classify as mechanics, not grammar), most programming languages have a very simple grammar. We have statements, which correspond to sentences; and expressions, which are usually mathematical. (Lisp does away with the former, and as a result is much less like natural languages, a barrier to understanding that contributes to its obscurity.)
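The statement/expression split can be seen directly in a parser's output. As a small illustration, the sketch below uses Python's standard ast module on a made-up line of code (the variable names total and price are placeholders).

```python
import ast

source = "total = price * 2"
tree = ast.parse(source)

statement = tree.body[0]      # the whole line: the "sentence"
expression = statement.value  # the right-hand side: the mathematical part

print(type(statement).__name__)   # Assign
print(type(expression).__name__)  # BinOp

# Python's grammar keeps the two categories in separate class hierarchies.
print(isinstance(statement, ast.stmt), isinstance(expression, ast.expr))  # True True
```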

The third and most complicated step would probably be defining vocabulary, i.e. coming up with names for everything. Programmers are notoriously bad at this. One can hardly blame them for not being linguists themselves, but I think computer science ought to include at least some basic theory of naming. One difficulty is the usual requirement that every name used in a single program be unique. Scoped namespaces alleviate some of the trouble, but it is still difficult to come up with names that are both distinct enough to have an obvious meaning and concise enough to be convenient. Lisp’s panoply of car/cdr functions (cadr, cdar, caadr, caddr, …) might win the prize for the worst name choices. One also encounters library functions with exceedingly long names such as libfooRemoveAllFoosFromBaz. (Even with nested scopes and object-oriented programming, this example would probably be written as Foo.Baz.foos.removeAll, which is not much better.)
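For readers who have not met the car/cdr family, the names are built by rule: each "a" or "d" between the "c" and the "r" stands for one car (first element) or cdr (rest of the list), applied right to left. The sketch below imitates that rule with Python lists; cxr is a hypothetical helper written for this illustration, not a real Lisp or Python function.

```python
def car(lst):
    """First element of a list."""
    return lst[0]


def cdr(lst):
    """Everything after the first element."""
    return lst[1:]


def cxr(name):
    """Build the function spelled by a Lisp-style name such as 'caddr'."""
    letters = name[1:-1]
    if not (name.startswith("c") and name.endswith("r")
            and letters and set(letters) <= {"a", "d"}):
        raise ValueError(f"not a car/cdr name: {name!r}")
    # The letters are applied right to left: caddr(x) == car(cdr(cdr(x))).
    steps = [car if ch == "a" else cdr for ch in reversed(letters)]

    def apply(lst):
        for step in steps:
            lst = step(lst)
        return lst

    return apply


print(cxr("cadr")([1, 2, 3]))      # 2
print(cxr("caddr")([1, 2, 3, 4]))  # 3
print(cxr("cdar")([[1, 2], 3]))    # [2]
```

The names are perfectly systematic, which is arguably the problem: they describe how to reach a value, not what the value means.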

Vocabulary for a programming language typically includes both the core language and a set of standard libraries. One flaw with nearly all programming languages, Lisp being the notable exception, is the demarcation between the “core” language and its libraries. In all but a few highly specialized languages, libraries will comprise the meat of almost every program. In the linguistic analogy, one could almost say that the core of a language is its grammar, while the libraries are its vocabulary.

It is tempting to assume that human languages have infinite precision. That is, they can express any concept that the mind can conceive. This is not entirely true. There are many documented examples of concepts that exist in one language but for which there is no adequate word, sometimes not even a combination of words, in another language. Much of what is considered “cultural difference” is caused by different vocabularies that define the set of concepts that can be expressed, and therefore understood, by different cultures.

The fact that nearly every programming language draws its words from English makes things harder still. English is terribly dependent on context for meaning, using the same simple words to express many different concepts. For example, consider the verb to put, whose English definition fills a full dictionary page. Add to this the habit, shared by almost all languages, of using compound phrases to name single concepts. German habitually mashes these phrases together to make new words, but names like “ListOfPrimeNumbers” feel awkward in English.

So vocabulary is difficult every way around. Of course, linguists are rarely consulted when new English words are being invented, although they will decide whether those words appear in a dictionary. In naming functions, invented words would be a barrier to comprehension. And yet, it is difficult to find existing words to identify the many operations which have no direct relevance to the problem at hand but which are nonetheless necessary to help the computer do its job. Think of prototypes, declarations, typecasts, and conversions.