So the latest scandal is that the NSA has allegedly been gathering large amounts of data on American phone calls and internet traffic. I’m pissed off about this, for the usual reasons and more.
Is it a violation of the 4th amendment? I would say yes, but I’m not the judicial system. Digital information is obviously not one of the things the framers of the Constitution foresaw, so it is, in kind terms, a clusterfuck trying to sort out how the Bill of Rights applies.
I’m not a constitutional scholar, so I’m not here to argue about the legality of it all. My opinion on it is pretty set (it sucks). I’m pissed off at the Obama administration for continuing the Bush policy (it started after 9/11, of course), and I don’t believe most of the Republicans taking him to task over it are doing so in good faith (where were they in 2006?), but none of the politics excuses the act. What I would like to talk about is the data itself and why it’s mostly useless (and why that pisses me off).
The problem is simply this: too much data. If you’re taking data from the communications of all Americans, or every tenth American with a cell phone, or even every twentieth American who sends e-mails between the hours of 10pm Tuesday and 5am Wednesday, you’re getting too much data to do a whole lot with.
First, some terms and a little math:
precision: for a given category and set of data points, the percent of positive predictions that were correct. The formula is true positives ÷ (true positives + false positives). A low precision system will retrieve a high number of false positives, like a spam filter that catches all the spam you receive, but also catches lots of non-spam.
recall: for a given category and set of data points, the percent of positive cases caught. The formula is true positives ÷ (true positives + false negatives). A low recall system will produce a high number of false negatives, like a spam filter that catches some spam and no non-spam, but lets lots of spam slip through.
accuracy: for a given set of categories and a set of data points, the percent of correct predictions over all categories. Formula: (true positives + true negatives) ÷ (true positives + false positives + true negatives + false negatives).
For reference, the Gmail spam filter is both high precision and high recall, but probably has higher recall than precision (that is, it catches virtually all the spam you receive, but also catches some non-spam).
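If it helps to see the three definitions as code, here is a minimal Python sketch. The function names and the counts (a made-up filter run on 100 messages) are mine, purely for illustration.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positive cases that were caught."""
    return tp / (tp + fn)


def accuracy(tp, fp, tn, fn):
    """Share of all predictions, positive and negative, that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: 8 spam caught, 2 good messages wrongly caught,
# 85 good messages let through, 5 spam missed.
tp, fp, tn, fn = 8, 2, 85, 5
print(f"precision = {precision(tp, fp):.2f}")         # precision = 0.80
print(f"recall    = {recall(tp, fn):.2f}")            # recall    = 0.62
print(f"accuracy  = {accuracy(tp, fp, tn, fn):.2f}")  # accuracy  = 0.93
```

Note that accuracy comes out highest of the three here, mostly because the true negatives dominate the count.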
Now, the problem.
Here’s how the wiretapping program works per the Electronic Frontier Foundation:
This equipment [installed by the NSA in telecommunications facilities] gave the NSA unfettered access to large streams of domestic and international communications in real time—what amounted to at least 1.7 billion emails a day, according to the Washington Post.
1.7 billion e-mails a day. Let’s look at this from the perspective of a spam filter.
Let’s say our filter has two categories, spam and non-spam, and we know very well what the features of spam are, so it has, say, 100% recall (we focus on recall because we want a low number of false negatives–our primary concern is getting all the spam that’s out there), and we’re able to get it up to 90% total accuracy. In most natural language processing, these are pretty good numbers. Assuming we’ve got 10000 e-mails to look through, ten percent of which really are spam, our numbers will look something like this:
Accuracy = 0.9 = (1000 true pos. + 8000 true neg.) ÷ (1000 true pos. + 1000 false pos. + 8000 true neg. + 0 false neg.)
Out of 1000 spam e-mails, our filter correctly catches all of them. Eight thousand non-spam messages are let through the filter. But 1000 non-spam messages are also caught, which means only half of what the filter flags is actually spam (50% precision). Not so great.
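The numbers in the spam example can be rebuilt from just the mailbox size, the amount of spam, and the accuracy. This is a rough sketch that leans on the 100% recall assumption; the helper name `counts_at_perfect_recall` is made up for this post.

```python
def counts_at_perfect_recall(total, positives, accuracy):
    """Return (tp, fp, tn, fn) for a filter that misses nothing.

    With 100% recall there are no false negatives, so every error
    the accuracy figure allows for has to be a false positive.
    """
    fn = 0
    tp = positives
    fp = round(total * (1 - accuracy))
    tn = total - tp - fp
    return tp, fp, tn, fn


# 10000 e-mails, 1000 of them spam, 90% accuracy.
tp, fp, tn, fn = counts_at_perfect_recall(total=10_000, positives=1_000, accuracy=0.90)
spam_precision = tp / (tp + fp)

print(tp, fp, tn, fn)                       # 1000 1000 8000 0
print(f"precision = {spam_precision:.0%}")  # precision = 50%
```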
Now let’s look at the same problem from the point of view of a test for a rare but deadly disease. Let’s say we have a million people to test, and our test has 100% recall and 95% accuracy. In most natural language processing, those are numbers you can only dream of, but disease classification isn’t NLP. Assuming that one percent of the test group has the disease we’re testing for, our numbers now look like this:
Accuracy = 0.95 = (10000 true pos. + 940,000 true neg.) ÷ (10000 true pos. + 50000 false pos. + 940,000 true neg. + 0 false neg.)
Out of 10000 people who have the disease, our test catches all of them. Nine hundred forty thousand un-infected people are correctly classified as healthy. But, oops, 50000 non-diseased people are told they have it and lose $1000 each to unnecessary pharmaceuticals. Our company has a slight PR mess on its hands.
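To put a number on that PR mess, here is a short back-of-the-envelope script using the counts from the disease example. The variable names are mine; the figures come straight from the paragraph above.

```python
# Counts from the disease example.
true_pos = 10_000             # sick people correctly flagged
false_pos = 50_000            # healthy people wrongly flagged
cost_per_false_alarm = 1_000  # dollars lost per wrongly flagged person

flagged = true_pos + false_pos
chance_really_sick = true_pos / flagged
wasted_dollars = false_pos * cost_per_false_alarm

print(f"people told they are sick: {flagged:,}")                      # 60,000
print(f"chance a positive result is real: {chance_really_sick:.1%}")  # 16.7%
print(f"spent on unnecessary drugs: ${wasted_dollars:,}")             # $50,000,000
```

So even at 95% accuracy, five out of six people who get the bad news are actually healthy.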
Now, regarding the NSA, supposedly the data has already allowed intelligence services to thwart two terror attacks. Given the amount of data allegedly gathered, if this is the best they’ve done, I am extremely unimpressed.
Let’s look at the same problem, except our categories are “this communication is a lead to a terrorist plot” and “it isn’t.” We also need to scale up. Let’s focus on a single day and assume the NSA’s got a billion e-mails to troll through. Let’s also assume they have a great working knowledge of what makes a good terror lead, and that they have an awesome algorithm that classifies terror leads with 100% recall (because we can’t risk false negatives, i.e. letting any real terror chatter through the filter) and 99% accuracy. Our final assumption will be that in those billion e-mails, a thousand contain information that is a lead on a terror plot.
Accuracy = 0.99 = (1000 true pos. + 989,999,000 true neg.) ÷ (1000 true pos. + 10,000,000 false pos. + 989,999,000 true neg. + 0 false neg.)
In a single day, they capture all communications that do in fact discuss terror plots. The number of communications classified as uninteresting is 989,999,000. However, due to that 1% error rate, ten million innocuous e-mails are flagged as terror leads. Now the agency has to assess and throw out those ten million before they find the thousand they actually need. Needle, meet haystack. The sheer amount of data acquired at the outset leads to a bottleneck and wasted time and resources searching through it all for a true positive threat.
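Here is the same arithmetic for the billion-e-mail day as a quick sketch. It uses only the assumptions already stated (100% recall, 99% accuracy, a thousand real leads), and it sticks to whole-number percentages so the counts stay exact.

```python
total_emails = 1_000_000_000
real_leads = 1_000
accuracy_pct = 99  # whole-number percent keeps the arithmetic in integers

# With perfect recall, every misclassified e-mail is a false positive.
false_pos = total_emails * (100 - accuracy_pct) // 100
flagged = real_leads + false_pos
precision = real_leads / flagged
false_alarms_per_lead = false_pos // real_leads

print(f"e-mails flagged: {flagged:,}")                          # 10,001,000
print(f"precision: {precision:.4%}")                            # 0.0100%
print(f"false alarms per real lead: {false_alarms_per_lead:,}") # 10,000
```

Put another way, an analyst would have to throw out about ten thousand flagged e-mails for every one that matters.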
As the size of your data grows, your impressive accuracy numbers grow steadily less impressive in terms of raw hits, even with perfect recall, because the precision is low. In order to get precision high enough to get single digit numbers of false positives out of a billion e-mails, your accuracy has to be orders of magnitude better: above 99.999999%, or so close to perfect as to be impossible.
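For anyone who wants to check how close to perfect the accuracy would have to be, here is a small sketch. It assumes perfect recall, so every allowed error is a false positive, and it uses `Decimal` so the long string of nines doesn't get mangled by floating point. The function name `required_accuracy` is mine.

```python
from decimal import Decimal


def required_accuracy(total, max_false_positives):
    """Lowest accuracy that allows at most max_false_positives errors,
    assuming perfect recall (so every error is a false positive)."""
    return 1 - Decimal(max_false_positives) / Decimal(total)


# A billion e-mails, and we will tolerate at most nine false positives.
needed = required_accuracy(1_000_000_000, 9)
print(f"accuracy needed: {needed * 100}%")
```

The result is 0.999999991 as a fraction, which is a long way past the 99% that already counts as excellent.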
But look what happens when you reduce your data intake by focusing your search on, say, known risky individuals, like, oh, I don’t know, Tamerlan Tsarnaev.
Say you’ve gathered 10000 e-mails from a known target, ten of which are valid terror plot leads. With 100% recall and 99% accuracy, you’ll get the ten true positives and only 100 false positives. I assume one hundred and ten e-mails can be made short work of by a few good analysts in a few good hours.
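To compare the targeted search with the billion-e-mail dragnet side by side, here is a rough sketch using the same 100% recall and 99% accuracy assumptions for both. The helper name is made up.

```python
def precision_at_perfect_recall(total, positives, accuracy_pct):
    """Precision when nothing is missed, so all errors are false positives."""
    false_pos = total * (100 - accuracy_pct) // 100
    return positives / (positives + false_pos)


bulk = precision_at_perfect_recall(1_000_000_000, 1_000, 99)  # dragnet
targeted = precision_at_perfect_recall(10_000, 10, 99)        # one known target

print(f"bulk precision:     {bulk:.4%}")      # 0.0100%
print(f"targeted precision: {targeted:.1%}")  # 9.1%
print(f"targeted is roughly {targeted / bulk:.0f}x more precise")
```

Same filter, same accuracy. The only thing that changed is how much hay got piled on the needle.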
Maybe the NSA has a really awesome high-precision filter for terror chatter, but I doubt it. The face of terrorism is constantly changing, which makes this less like wrestling with a spam-bot and more like wrestling with a spam-bot that records the messages that get caught in the spam filter and adjusts accordingly. Oh, and you take the legit messages that get caught in the filter and throw them in jail, and while you’re at the jail, the spam messages you didn’t catch blow up your computer with a pressure cooker.
Now “if you have nothing to hide, you have nothing to fear” is technically true, unless your fear is some creepy NSA analyst knowing what kind of porn you like to search for. But the amount of data would appear to give no clue as to the number and nature of potential threats. Coupled with the very small percentage of that communication that is likely to be useful as a lead on a terror plot, it makes the NSA’s apparent strategy akin to standing at the far end of the room from an enormous dartboard that has a target the size of a pinhead and expecting to hit the target on a single throw. This seems like an ineffective way to fight terrorism.
Really, the only way to get the bad guys out of all this data is if you give up trying to sort the false positives out from the true positives–that is, if for every lead you get that is a real terror plot, you also get n number of innocent people who just miss the ’90s and think everything is DA BOMB, and put them on a list, too. Oh, wait…
The wiretapping initiative punishes innocent people in the pursuit of the guilty and doesn’t even have the benefit of being good at what it’s supposed to be doing. I give up. Why should we have to sacrifice our privacy because our lawmakers and security agencies fail at basic statistics?