UC Berkeley is wedding two of my linguistic loves: large amounts of data and dead languages. I have to try and bring this kind of thing to Brandeis. We love large amounts of properly formatted data.
I’ve always had a thing for dead languages and historical linguistics. I don’t know what it is, but the notions of sound change and language families have fascinated me for most of my life. I don’t know if it started when I saw a map of the major language families of Europe in a book, or when my dad dropped the bombshell that English and German actually had something in common, or when I read the last appendix to The Return of the King and realized you could actually simulate this stuff wholesale, but as a result, language construction and language reconstruction are areas I wish I could do some real research in. Maybe someday.
But this Berkeley thing is really cool. Basically it’s the same thing that historical linguists have done by hand for years, but done by statistical machine learning algorithms, which makes it a hell of a lot faster.
For example, take a few cognates: words of similar shape and meaning that descend from a common ancestor. Say, English wheel and Dutch wiel. If these two words are the only correspondence you know, you can make a rough assumption that English wh corresponds to Dutch w, that both languages have a long i (ee sound) in the middle position, and that both end in l. Thus the ancestor of these words would have had the form of either *wīl or *hwīl. Which one? To answer this question, you pull in data from other related languages, say, Norwegian hjul. From the h at the beginning of the Norwegian word, you can hypothesize that the ancestral form was more likely *hwīl (actually it was probably more like *hwehul, because vowels are extra fluid). The more examples you pull in from various languages, the closer your algorithm gets to a confident approximation of the ancestral form.
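Here’s a toy sketch of that voting idea in Python. To be clear, this is nothing like the Berkeley system, which fits a probabilistic model of sound change across an entire language family; the alignments and correspondence rules below are hand-made assumptions, just enough to make the wheel/wiel/hjul example concrete.

```python
from collections import Counter

# Cognates for "wheel", hand-aligned segment by segment.
# (The alignment itself is an assumption made for this example.)
cognates = {
    "English":   ["wh", "ee", "l"],
    "Dutch":     ["w",  "ie", "l"],
    "Norwegian": ["hj", "u",  "l"],
}

# Hand-written guesses at sound correspondences: which ancestral
# segment could have produced each attested segment. These rules are
# made up for illustration; a real system infers them statistically.
candidates = {
    ("English",   "wh"): "hw",
    ("Dutch",     "w"):  "hw",  # Dutch dropped the h, so w still points to hw
    ("Norwegian", "hj"): "hw",  # Norwegian kept the h and dropped the w
    ("English",   "ee"): "ī",
    ("Dutch",     "ie"): "ī",
    ("Norwegian", "u"):  "ī",   # vowels drift freely; call it a long i
    ("English",   "l"):  "l",
    ("Dutch",     "l"):  "l",
    ("Norwegian", "l"):  "l",
}

def reconstruct(cognates, candidates):
    """At each position, the ancestral segment backed by the most
    daughter languages wins the vote."""
    length = len(next(iter(cognates.values())))
    proto = []
    for i in range(length):
        votes = Counter(candidates[(lang, word[i])]
                        for lang, word in cognates.items())
        proto.append(votes.most_common(1)[0][0])
    return "*" + "".join(proto)

print(reconstruct(cognates, candidates))  # prints *hwīl
```

The real system’s advantage is that it learns those correspondence rules from the data instead of being handed them, and it does so across hundreds of languages at once.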
So what they’ve built is a program that does all that hand-work for you.
Why should we care? Anybody who spoke a dead language lived thousands of years ago. At Berkeley, they’re reconstructing Proto-Austronesian, whose speakers not only lived thousands of years ago, but thousands of years ago in the middle of the South China Sea. So what?
So, the simple fact of what words can be reconstructed can tell us tons about what life was like in that time and place. Because if you have a word for something, it means you have that something. The fact that reconstructed Proto-Indo-European has words for “wheel,” “horse,” and “snow,” but not for “sea” or “orange tree” already tells us its speakers had wheeled vehicles, horses to pull them, and lived in a temperate-to-cold location far from a coastline. See that? Five words, and I’ve already outlined a picture of a culture that was dead a thousand years before anyone ever heard of Ancient Greece.
I haven’t seen Berkeley’s data, but I’ll bet we’ll find that Proto-Austronesians had words for “boat” and “ocean” and maybe “coconut” or something, but had never thought of “snow” and thought “horse” was something you got after a long night of screaming. Now, we could probably have extrapolated that anyway, given what we know about modern-day Pacific Islanders and their ancestors, but what we would also find are words for deities, mythology, and social customs, which could be invaluable when trying to crack open the culture that built the Easter Island statues. And we can apply the same techniques to data from language families about which we know even less, like those of Sub-Saharan Africa and the Americas, and learn things about them and the cultures they existed in.
In a world governed by Einsteinian physics, this is a time machine indeed.