On Linguistics and Public Outreach

A couple of weeks ago, I posted three questions about science communication in the field of linguistics:

  1. Who is the Carl Sagan or Neil deGrasse Tyson of linguistics? In other words, who do we have in linguistics who is an effective presenter of the ongoing work in the field?
  2. Is there one? (Implicitly, do we need one at all? Or why not many?)
  3. Leaving the question of specific personality(ies) aside, how do we as linguists (of any stripe) better present the work we and our colleagues do for effective public consumption?

I got a number of good responses, so here’s a summary along with thoughts about each. These are my personal opinions, so feel free to disagree—vehemently if you like.

I Should Not Love the Olympics

The controversial Sochi Olympics are over and this makes me very sad. After 2+ weeks of feverishly checking medal counts and goggling at interminable ski races and falls on hard ice and wondering if NHL teammates on opposing national squads take out repressed rage on each other during the semifinals, I’m now left with… a desire and no time to go snowboarding again and a lot of residual guilt over not hitting the gym often enough.

I should not love the Olympics. But I do.

Statistics Aren’t Enough

There’s a cute quip: statistics don’t lie.

There’s an accurate quip: statistics don’t lie, given a well-trained model and a large enough sample.

With statistics on your side, you can be 99% sure that you’re right. Unfortunately, people who are 99% sure are wrong 40% of the time.

I was recently thinking about the way Hindi and Urdu represent possession. The language doesn’t have a verb “to have,” and so uses three different constructions depending on the type of possession and the object possessed.

Long story short, if the possessed object is physical and alienable (possession is impermanent), you use ke pās, or “near.”

Rajesh ke pās ek kitāb hai
Rajesh.GEN-near one book is
“Rajesh has a book.”

If it’s nonphysical and alienable, you use ko, or “to.”

Rajesh ko ek bukhār hai
Rajesh-to one fever is
“Rajesh has a fever.”

Finally, if it’s inalienable (things that always belong to you, like body parts or relatives), you use a copula construction.

Rajesh kā bhāī hai
Rajesh-of brother is
“Rajesh has a brother.”
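The three-way choice above can be written as a small decision function. This is only a sketch: the function name and the two boolean features are hypothetical labels invented for illustration, and the inalienable case is shown in the kā form used in the example above.

```python
def possession_construction(physical: bool, alienable: bool) -> str:
    """Pick the possession marker for the three cases described above."""
    if not alienable:
        # Inalienable: body parts, relatives ("Rajesh kā bhāī hai").
        return "kā"
    if physical:
        # Physical and alienable: a book ("Rajesh ke pās ek kitāb hai").
        return "ke pās"
    # Nonphysical and alienable: a fever ("Rajesh ko ek bukhār hai").
    return "ko"


for label, physical, alienable in [
    ("book", True, True),
    ("fever", False, True),
    ("brother", True, False),
]:
    print(label, "->", possession_construction(physical, alienable))
```

Note that the function needs to know what kind of thing is possessed before it can answer, which is exactly the information a translator has to get from somewhere.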

Now, let’s say you were trying to learn Hindi using the state-of-the-art machine translator, Google Translate. How would it treat you when it comes to possession?

The answer, it turns out, is not so well.

Input: “Rajesh has a fever.” Output: “Rajesh is a fever.”

Input: “Rajesh has a book.” Output: “Rajesh is a book.”

Input: “Rajesh has a brother.” Output: “Rajesh is a brother.”

Clearly, Google Translate is missing something here–the entirety of the possession entailed by the English “have.” But watch what happens below:

Input: “I have a fever.” Output: “I have a fever.”

Yay! So maybe it works with “I”, but not a third person?

Input: “I have a book.” Output: “I is a book.”

Okay, maybe not?

We get a clue from “I have a fever.” Notice how below the translation, it notes it as a phrase? “I have a fever” is a common enough phrase that it has shown up in some Hindi-English corpus that Google Translate was trained on, so it “knows” the phrase “I have a fever,” and that it corresponds to mujhe ek bukhār hai (mujhe is a contraction of mujhko, “to me”).

It turns out the ko construction is used in a number of common phrases regarding feeling (including sickness) and emotion:

mujhe yah kitāb pasaṃd hai = “I like this book”
mujhe tumse pyār hai = “I love you”

All these kinds of phrases are common enough that Google Translate has surely seen them in a corpus. It can scan along its input, find familiar phrases and then replace them with their equivalents: “I have a fever,” she said becomes “X,” she said and since the translator knows that there’s a high probability that X as a block translates to mujhe ek bukhār hai, it just swaps it in, and is left with “mujhe ek bukhār hai,” she said, and only two words left to translate. This is called “phrase-based” translation, which Google Translate uses to take a lot of work out of its translation task, breaking inputs up not into the individual words, but into bigger chunks that it can translate wholesale.
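The swap described above can be sketched in a few lines of Python. This is a toy illustration, not how Google Translate is actually implemented: a real phrase-based system scores competing phrase pairs by probability, whereas this sketch does a plain lookup. The table name and function name are made up for the example, and the single entry is the one from this post.

```python
# Toy phrase table; the only entry is the example phrase from the text.
PHRASE_TABLE = {
    "I have a fever": "mujhe ek bukhār hai",
}


def swap_known_phrases(text: str, table: dict) -> str:
    """Replace every known source phrase with its stored translation."""
    # Try longer phrases first so a short entry cannot break up a longer match.
    for phrase in sorted(table, key=len, reverse=True):
        text = text.replace(phrase, table[phrase])
    return text


print(swap_known_phrases('"I have a fever," she said', PHRASE_TABLE))
print(swap_known_phrases("Rajesh has a book", PHRASE_TABLE))
```

The first input comes back with the whole block swapped in and only "she said" left to translate. The second comes back untouched, because nothing in the table matches it.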

Which brings us back to the “Rajesh is a book” problem. “Rajesh has a book” is not a common phrase that the translator algorithm can just swap in, so it tries breaking it apart into chunks: maybe “Rajesh” and “has a book.” The name goes through all right, but as we’ve seen, “has a book” is not a phrase that has an easy Hindi equivalent in isolation. You have to do some stuff to the possessor as well, which is no longer part of this chunk. The same thing happens if you split it differently: say, “Rajesh has” and “a book.” It can translate “a book” just fine, but “Rajesh has” is now a problem, because it can’t translate “has” without knowing what is possessed, and “a book” is outside the chunk being examined.

What it might do is break it down until it gets translatable portions, and so ends up with “Rajesh”, “has”, and “a book.”

Rajesh ⟶ Rajesh
a book ⟶ ek kitāb
has ⟶ NULL

“has” goes to NULL, because there’s no direct translation. It might look in its knowledge and see a lot of “has” sentences in English that end in hai in Hindi. Knowing that Hindi sentences often end in the verb (tree-based translation), it seems like there might be reason to translate “has” as hai, which it does, though perhaps not very confidently. has ⟶ hai, and Rajesh is a book.

But what if we were to invoke some kind of semantic category? Turn “a book” into, say ALIENABLE PHYSICAL OBJECT. Seeing that, we could train the translator to know that “has ALIENABLE PHYSOBJ” should translate to “ke pās ALIENABLE PHYSOBJ hai.” Suddenly “has a book” has a distinct translation, because we know what kind of thing a book is. The process would then look something like this:

a book ⟶ ALIENABLE PHYSOBJ (store “a book” somewhere so you can get it back)
has a book ⟶ has ALIENABLE PHYSOBJ
Rajesh ⟶ Rajesh
has ALIENABLE PHYSOBJ ⟶ ke pās ALIENABLE PHYSOBJ hai ⟶ ke pās a book hai
a book ⟶ ek kitāb

It takes a little longer and you have to have this extra semantic layer in the middle, but you get it right in the end.
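Here is a minimal Python sketch of that extra semantic layer, under some loud assumptions: the lexicon and its category annotations are hand-written (which is precisely the human effort at issue), the labels ALIENABLE NONPHYS and INALIENABLE are my own extensions of the ALIENABLE PHYSOBJ label, the function name is hypothetical, and the inalienable template hard-codes the kā form from the brother example.

```python
# Hypothetical hand-annotated lexicon:
# English noun phrase -> (semantic category, Hindi noun phrase).
LEXICON = {
    "a book": ("ALIENABLE PHYSOBJ", "ek kitāb"),
    "a fever": ("ALIENABLE NONPHYS", "ek bukhār"),
    "a brother": ("INALIENABLE", "bhāī"),
}

# One output template per category, following the three constructions above.
TEMPLATES = {
    "ALIENABLE PHYSOBJ": "{owner} ke pās {thing} hai",
    "ALIENABLE NONPHYS": "{owner} ko {thing} hai",
    "INALIENABLE": "{owner} kā {thing} hai",
}


def translate_has(sentence: str) -> str:
    """Translate 'X has Y' by routing Y through its semantic category."""
    owner, _, thing = sentence.rstrip(".").partition(" has ")
    if thing not in LEXICON:
        raise ValueError(f"no semantic annotation for {thing!r}")
    category, hindi_thing = LEXICON[thing]
    return TEMPLATES[category].format(owner=owner, thing=hindi_thing)


for english in ["Rajesh has a book.", "Rajesh has a fever.", "Rajesh has a brother."]:
    print(english, "->", translate_has(english))
```

The lookup itself is trivial. The hard part is filling in LEXICON for every noun in the language, which is why this does not scale without someone (or something) that knows what a book is.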

Basically, the machine currently fails at this task because we haven’t cracked the question of meaning yet. The computer doesn’t know that a book is a physical object that can be given away or that a fever is an impermanent affliction or that your relative will always be your relative. That requires a human to go in there and annotate those things as such. The holy grail of machine translation is to be high quality (correct), general domain (you can talk about anything), and machine exclusive (you don’t need a person to either format things before it’s translated, or to fix it up afterwards). So far, we can usually hit two out of three with the most advanced computational linguistic techniques. Statistics do very well in some cases, especially between languages where there are very large parallel corpora that allow things to be restricted to a big but closed set of phrases and word chunks. However, this is not a general solution, and has to be tweaked for each language pair, and we usually still have to have a human in the loop to come in and clean things up, so that poor Rajesh isn’t a book.

But we’re working on that.

A Big-Ass Chart of Indo-European Languages

Other people make and eat massive amounts of food over the Thanksgiving weekend. I did that, too, but I also made a chart. A big-ass chart.

There are many ways to chart a language family. You can do the traditional tree view showing genetic relationships. You can show a map of the distribution of various languages and subfamilies. You can view each one individually in terms of its internal history.

Inspired by a brief discussion on Tumblr, I attempted to do all three. Someone asked what Germanic languages were contemporaneous with Latin. I found out the answer, but not before having to read through three Wikipedia articles. I started thinking about ways to represent such a question graphically and came up with this (previews below).

ie_preview1 ie_preview2

Done in LaTeX, the linguist’s best friend, the chart attempts to capture the genetic relationships between the various Indo-European languages, the relative locations they occupied in the Indo-European sprachraum (at least before the age of exploration), and the time periods in which each language flourished.

How to read this chart:

  • Languages on a red background are extinct.
  • Languages on a green background are still extant.
  • Languages on a yellow background have no native speakers but are still in use as liturgical or scholarly languages in certain traditions.
  • Read from left to right, the chart shows the languages from west to east based on the center of each one’s speaking area (i.e. Celtic is the westernmost subfamily, Tocharian the easternmost).  If two languages occupied the same longitudinal area within their subgroup, the more northerly one is on the left and the more southerly one is on the right (thus Baltic is to the left of Slavic, since they occupied more or less the same east-west area, but Baltic is on average more northerly).
  • The date at which the language first appears on the chart is the date at which linguists hypothesize that language existed as a distinct idiom.  This may be the same as the date of first attestation, but is not necessarily, especially with the older languages.  With some of those, the date is rather speculative.
  • When a language’s children appear on the chart, you can assume that the parent language went extinct at about that point.  If a language had no living children, the line extends downward from that language to the point of extinction.
  • The time-scale is very loosely logarithmic; centuries pass much slower as you get to the bottom of the chart.

I chose Indo-European to try this on as it’s the language family I’m most familiar with (I think all but one of the languages I know much of anything about are Indo-European), and because most branches and subfamilies have been well-placed in space and time at this point.  I assumed the Pontic-Caspian hypothesis of Indo-European origins; quite simply, it’s the one that makes the most sense to me and the date it gives also allowed the time scale to look reasonable in chart form.

This chart doesn’t include all the Indo-European languages–that would take forever, and we don’t know much about many of the smaller ones (the Slavic family alone would take weeks to catalogue)–but I think you can get a good picture of when and where diversifications happened within the Indo-European family of languages.

Full version here: 8267×1426 px


Fixed a few errors: added Romani, Romansh, fixed left-right direction of Eastern Iranian branch, status of Classical Armenian, spelling errors

Book Review — Holy Sh*t!: A Brief History of Swearing

Author: Melissa Mohr
Publisher: Oxford University Press

Every once in a while you come across a book on a bookstore shelf that catches your eye simply because it’s got a dirty word right in the title. So it was with me and this one, which is really quite a good data point in favor of one of the book’s primary arguments: that swear words and taboo language speak to us on a fundamentally different level from other language.

Though brief, this history of swearing is pretty comprehensive, at least for English. Mohr (whose author picture on the back jacket shows her with her young son–that must have been fun trying to explain mom’s latest project) divides swears and cursing into two categories, the Holy (“Oh my god!”, “Sweet Jesus”, “Damn you”, etc.) and the Shit (shit, fuck, asshole, and the like). If you found yourself reacting more strongly to the latter than the former, that’s because, according to Mohr, we’re living in the age of the Shit right now. Which is to say that language referencing obscenity and bodily functions has greater taboo force than the other kind of historically offensive language, language that references theology.

Apparently we Gen-Yers care more about “a fulfilling career” than “a secure career”.

Cal Newport points out that “follow your passion” is a catchphrase that has only gotten going in the last 20 years, according to Google’s Ngram viewer, a tool that shows how prominently a given phrase appears in English print over any period of time. The same Ngram viewer shows that the phrase “a secure career” has gone out of style, just as the phrase “a fulfilling career” has gotten hot.

Images follow. Click the link to see them.

While the article is kind of ambivalent about what this statistic means, many of the comments regard the idea that fulfillment be valued over security with a significant amount of derision. Yeah, I know, never read the bottom half of the Internet, but why is the idea that seeking fulfillment in your livelihood is somehow bad (or at least to be valued less than security) so prevalent?

This is interesting to me now, because reading that article came at a time when my life is at a confluence. To wit:

I have two jobs–software engineer and graduate student. Engineer pays pretty well, grad student pays marginally above the poverty line. However, combining those, I’m living pretty high, easily within the top quarter for income-earners in the United States. In terms of actual work content, my engineering work I can take or leave, though I try to do the best I can at it. It’s just often not particularly exciting and sometimes very frustrating. My doctoral research, on the other hand, gets the blood flowing like nothing else besides writing (which, not coincidentally, is a large part of a Ph.D. program). Engineering is a secure career–people are always going to need halfway-literate code monkeys, and will be willing to pay them well. Research is a fulfilling career–I get to meet people who are excited to learn about the work I’m doing, have opportunities for great collaborations, and can end up adding something concrete to a growing and exciting field.

This can’t last.

Burgers, metal, and religious offense

Really, really kicking myself for never visiting Kuma’s Corner when I was living in Chicago.

Pisa Wrap-up

I haven’t blogged in a while, which is nothing new (not that anyone reads this), but this time I have a reason beyond my own laziness: my doctoral advisor invited me to a conference in Pisa, Italy, which is where I was for the past week.

This was my first time in Italy and only my second in Europe outside of an airport. The conference was great. Some very interesting work was presented and I met some people who were quite interested in my doctoral research, despite the fact that I haven’t really started doing it yet. Some of the best parts about conferences are when you just get to talk to people on an informal level and hear their thoughts about a field that you already have in common. There’s some really exciting and diverse research going on in computational linguistics right now.


I tried to make the most of sightseeing while I was there. Most of that revolved around the Leaning Tower and the Cathedral and Baptistry in the Piazza dei Miracoli, which was just a short walk from my hotel. Even for an atheist, religious structures can be impressive in their own right.


The acoustics of the Baptistry are a remarkable feat of engineering where even something as simple as a toe tap echoes like a gunshot.


And of course, the Leaning Tower of Pisa has the association with Galileo’s alleged gravitational experiments, which makes it interesting from a scientific perspective.

19th century plaque commemorating Galileo’s supposed experiment

I also took a short trip out to Lucca, a medieval walled city about a half hour from Pisa. I’m particularly happy about this because I was able to negotiate my way there, to a tower, a bookstore, through lunch, and back to the bus station, all in Italian, which I had only started learning a week before.

Panorama of Lucca from the top of the Torre delle Ore (clock tower), looking west

Pisa and the surrounding area are great, and I would highly recommend them to anyone.

A Use for Bibles in Disaster Zones


You know those folks who travel to disaster zones (Haiti, Oklahoma, etc.) and hand out not food, not medicine, but Bibles? Turns out they’re not useless after all–as long as the Bibles have leather covers.

You see, back in Renaissance Europe, Catholics and Protestants were doing the whole war thing all over the western half of the continent, especially France. The Huguenots, French Protestants, had an especially bad time of it, with Catholic armies besieging every Huguenot-controlled city they could find, down to small towns and villages. One such village was under siege for so long they were reduced to eating leather. Being French, of course, they made it edible.

From a letter found in a 1969 collection called The Huguenot Wars: An Eyewitness Account (via The Cartoon History of the Modern World, Part I) comes the recipe: soak the leather (shoes or something, I guess) in brine for at least a day, changing the water often, braise until tender with roots and herbs, and sauté it in fat. The author of the account apparently describes it as “one of the most delicious things [I] have ever eaten.”

Another letter (or maybe the same one–I haven’t seen the original source), from a pastor called Jean de Lery, has a similar recipe for boiled drum skins: first soaked for 2 days, then scraped with a knife and boiled until tender enough for you to scratch with your fingers and see if they’re glutinous. Cut into small pieces, the whole affair is then seasoned with herbs and spices.

See? Bibles are useful, after all. They can feed the people, if they’re the nice fancy kind with leather covers.

But somehow I doubt people in disaster zones are receiving nice leather Bibles. Maybe some enterprising soul will come up with some kind of lentil soup equivalent you can make out of the wood pulp from paper.

“Pain from an old wound”

“Nostalgia” means “pain from an old wound,” so says common knowledge. The problem with common knowledge is that it is often wrong. Nostalgia has nothing to do with old wounds, and only a little to do with pain.

The other problem with common knowledge is that it is often almost right.

“Nostalgia” means “the affliction of homecoming.” Think “neuralgia,” the affliction of the nerves. Despite its Greek appearance, “nostalgia” is actually a new-ish word, a translation of the German Heimweh, because of course the Germans would have a word for this thing.

On the way back from vacation, I decide to stop by the old house where I grew up. It’ll be the first time I see it in sixteen years.

