I’m writing an annotation specification. If that sounds really tedious and science-y, it probably is, but I think that when most people think “linguistics,” the first free-association word isn’t “science.”
But depending on how you look at it, linguistics is either the softest hard science or the hardest soft science. I prefer the former, because I like my science like I like my eggs—hard enough for Occam’s razor but squishy enough for Hume’s fork.
If “writing an annotation specification” sounds boring beyond belief, it’s because it probably is and I’m just too close to the subject to realize it. But it needs to be done. Natural language processing is one of the hardest subjects within the larger domain of computer science. We’ve been at it since the days of Alan Turing and the Enigma machine, back when the attitude was that language was just a composition of rules, so it couldn’t be that hard to crack. Boy, were we wrong. Now, 70 years later, there are some kinds of language processing that computers do really well, and others at which we still suck completely. The upshot is that if you want to stand a chance of drawing any kind of conclusion by examining natural language data with computer algorithms, you have to mark that data up in a way a computer can understand. There are ways to do the markup computationally, such as part-of-speech tagging, but that’s an entire process unto itself, with imperfect results. If you want your data to be a gold standard over which you run some kind of machine learning algorithm, you really want a human to mark it up, and you have to give your annotators a standard to work from, one that will produce something your code can read, so your computer can linguistics for you.
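For a sense of what the computational route looks like, here’s a minimal sketch using NLTK’s off-the-shelf part-of-speech tagger. The sentence is mine, and the tagger and its pretrained models are assumptions of the example, not anything my spec depends on.

```python
# Minimal sketch: automatic part-of-speech tagging with NLTK.
# Assumes NLTK is installed and its tokenizer/tagger models have been fetched,
# e.g. via nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

sentence = "They saw how well each other were doing."
tokens = nltk.word_tokenize(sentence)   # split the raw string into word tokens
tagged = nltk.pos_tag(tokens)           # list of (token, tag) pairs, e.g. ('saw', 'VBD')
print(tagged)
```

Off-the-shelf taggers like this one are trained on someone else’s annotated data; whether their tags mean what your project needs them to mean is exactly the kind of thing a specification has to pin down.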
But language is notoriously messy. Technically, everyone’s language is a different one. For example, having grown up in New Mexico, I say “card” with an “r,” like a normal person, and think that “wicked” means “evil.” But now I live in Massachusetts, and all these freaky Bostonians I live around say “caahd” and think that playing them is “wicked awesome.” As humans, we can make lots of different sounds, which get classed in different ways (phonetics). These classes of sounds combine to make words (morphology). A stereotypical Bostonian and I pronounce an “r” following a vowel using different sounds, but put whatever that sound is in a word and we will interpret the word the same way. In any given language, the number of words is large and ever-growing (English is particularly good at making up new words), but it is also finite. These words can be broken up into a number of discrete classes (“yellow” can’t be an adverb, arguments of “the” must be noun phrases, etc.) that combine in certain, describable ways to create phrases and sentences (syntax), which in turn combine to create meaning (semantics and pragmatics).
Out of these discrete combinatorics comes a virtually infinite number of ways to express the thoughts in your head, with infinitesimal shades of nuance and poetry, in any language.
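To make that “describable ways” claim concrete, here’s a toy illustration: a hypothetical handful of grammar rules, written out and parsed with NLTK. The rules and the sentence are invented for the example; no real grammar of English is anywhere near this small.

```python
# A toy illustration of discrete classes combining in describable ways:
# a tiny made-up context-free grammar, parsed with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det Adj N | Det N
    VP  -> V NP
    Det -> 'the'
    Adj -> 'yellow'
    N   -> 'card' | 'game'
    V   -> 'plays'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the yellow card plays the game".split()):
    tree.pretty_print()   # draw the single parse this grammar licenses
```

Swap in something like “wicked awesome,” with “wicked” doing duty as an intensifying adverb, and this toy immediately needs new classes and new rules.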
The problem arises in that the classes into which we divide linguistic units are exclusive and non-overlapping, while the shades and nuances that appear in actual language use don’t lend themselves well to hard-and-fast rules. “We gift gifts,” “They saw how well each other were doing,” and “You wanna come with? It’s gonna be wicked awesome” are all grammatical constructions according to some speakers, but not others. In addition, because there are no rules that will give you a 100%-accurate prediction of what kind of language a person will use, the only way to know exactly what a person’s language is like is to hear them use it. As in any science, you can only draw conclusions from observed data, and because the search space for language is so large, you have to gather enormous amounts of data in the form of actual language usage. That much data is usually too much for a human to go through and process if you want a result in less than a cosmic timescale. Since we’ve developed algorithms that can process language data using statistical methods, we feed it to a computer. But you have to mark the data up for computer usage first, and provide the program parameters over which it can build a model of the language data; hence, the annotation specification.
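To give a concrete, drastically simplified picture of that pipeline, here’s a sketch that uses NLTK’s bundled sample of the Penn Treebank as a stand-in for human-annotated gold-standard data, with a unigram tagger as the statistical model. The corpus, the split, and the choice of model are assumptions made for illustration; a real project would use its own spec, its own annotators, and a far less naive learner.

```python
# Sketch of the gold-standard -> statistical-model loop, using NLTK's bundled
# sample of the Penn Treebank as a stand-in for human-annotated data.
# Assumes NLTK is installed and the corpus fetched via nltk.download('treebank').
import nltk
from nltk.corpus import treebank

gold = treebank.tagged_sents()          # human-annotated (token, tag) sentences
train, test = gold[:3000], gold[3000:]  # hold some annotated data out for evaluation

model = nltk.UnigramTagger(train)       # a very simple statistical tagger
print(model.accuracy(test))             # how often it matches the held-out gold tags
                                        # (older NLTK versions call this .evaluate())
```

The point isn’t the number the script prints; it’s that none of this works unless the tags in the gold data were applied consistently, which is what the specification exists to guarantee.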
The pithy saying holds true with a slight addendum: Statistics don’t lie, given a well-trained model and a large enough sample size.
If my real interest is language and how people use it, why am I messing about with computers and all these numbers? This, apparently, is the curse of academia: no one else realizes your work is as cool as you do, and they look at you funny when you try and convince them.
But what I do is awesome! I keep telling myself. I study meaning. If language is a primary medium humans use to communicate with each other, then shouldn’t studying it be a primary way to draw some base-level conclusions about the way we humans interact and the way we bring what exists in our brains out into the real world? That’s the operating assumption, anyway. I’m not doing this to improve some company’s product, even though that’s where the money is, a fact about which I occasionally grow despondent. I’m doing this because there are things that can be learned—there’s knowledge hidden in the data, and knowledge demands to be known.