“Ridiculously optimistic” machine learning algorithm is “completely bullshit,” expert says
In 2014, Michael Hayden, the former director of both the CIA and the NSA, declared that “we kill people based on metadata.” Now, a new examination of previously published Snowden documents suggests that many of those people may have been innocent.
Last year, documents detailing the NSA’s SKYNET programme were published. According to the documents, SKYNET engages in mass surveillance of Pakistan’s mobile phone network, then applies a machine learning algorithm to the cellular network metadata of 55 million people in an attempt to rate each person’s likelihood of being a terrorist.
Patrick Ball, a data scientist and the director of research at the Human Rights Data Analysis Group who has previously given expert testimony before war crimes tribunals, described the NSA’s methods as “ridiculously optimistic” and “completely bullshit.” A flaw in how the NSA trains SKYNET’s machine learning algorithm to analyse cellular metadata, Ball told Ars, makes the results scientifically unsound.
Somewhere between 2,500 and 4,000 people have been killed by drone strikes in Pakistan since 2004, most of them classified by the US government as “extremists,” the Bureau of Investigative Journalism has documented. Judging by the classification date of “20070108” on one of the SKYNET slide decks (which themselves seem to date from 2011 and 2012), the machine learning program may have been in development as early as 2007.
In the years since, thousands of innocent people in Pakistan may have been mislabelled as terrorists by that “scientifically unsound” algorithm, possibly resulting in their untimely demise.
The siren song of big data
SKYNET works like a typical modern Big Data business application. The program collects metadata, stores it on NSA cloud servers, extracts relevant information, and then applies machine learning to identify leads for a targeted campaign. Except instead of trying to sell the targets something, this campaign, given the US government’s overall business in Pakistan, likely involves another branch of government, the CIA or the military, executing their “Find-Fix-Finish” strategy using Predator drones and on-the-ground death squads.
In addition to processing logged cellular phone call data (so-called “DNR,” or Dialled Number Recognition data, such as time, duration, and who called whom), SKYNET also collects user location, allowing it to build detailed travel profiles. Turning off a mobile phone gets flagged as an attempt to evade mass surveillance. Users who swap SIM cards, naively believing this will prevent tracking, also get flagged (the ESN/MEID/IMEI burned into the handset makes the phone trackable across multiple SIM cards).
Even handset swapping gets detected and flagged, the slides boast. Since the slides do not go into detail on this point, we can only guess that such detection most likely relies on other metadata, such as the user’s real-world location and social network, remaining unchanged.
Given the complete set of metadata, SKYNET pieces together people’s typical daily routines: who travels together, who has shared contacts, who stays overnight with friends, who visits other countries, and who moves permanently. Overall, the slides indicate, the NSA machine learning algorithm uses more than 80 different properties to rate people on their terroristiness.
The program, the slides tell us, is based on the assumption that the behaviour of terrorists differs significantly from that of ordinary citizens with respect to some of these properties. However, as reporting last year made clear, the highest-rated target according to this machine learning program was Ahmad Zaidan, Al-Jazeera’s long-time bureau chief in Islamabad.
As that reporting noted, Zaidan regularly travels to regions with known terrorist activity in order to interview insurgents and report the news. But rather than questioning the machine learning model that produced such a bizarre result, the NSA engineers behind the algorithm instead held Zaidan up as an example of a SKYNET success in their in-house presentation, which includes a slide labelling Zaidan a “MEMBER OF AL-QA’IDA.”
Feeding the machine
Training a machine learning algorithm is like training a Bayesian spam filter: you feed it known spam and known non-spam, and from these “ground truths” the algorithm learns how to filter spam correctly.
In the same way, a critical part of the SKYNET program is feeding “known terrorists” to the machine learning algorithm so that it learns to spot similar profiles.
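The spam-filter analogy can be made concrete with a toy example. The sketch below is purely illustrative and has nothing to do with the SKYNET documents: the messages, words, and scoring scheme are all invented. A tiny naive Bayes filter learns word frequencies from labelled “ground truth” messages and then scores a new message.

```python
from collections import Counter
import math

# Labelled "ground truths": known spam and known non-spam (invented).
spam = ["win cash now", "cheap cash prizes now"]
ham = ["meeting at noon", "lunch at noon tomorrow"]

def word_counts(messages):
    return Counter(w for m in messages for w in m.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def spam_score(message):
    # Log-likelihood ratio with add-one smoothing; positive means
    # the message looks more like the known spam than the known ham.
    score = 0.0
    for w in message.split():
        p_spam = (spam_counts[w] + 1) / (spam_total + len(vocab))
        p_ham = (ham_counts[w] + 1) / (ham_total + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

print(spam_score("cash prizes now") > 0)   # True: spam-like
print(spam_score("meeting at noon") > 0)   # False: ham-like
```

The filter never sees a rule like “cash is suspicious”; it infers everything from the labelled examples, which is why the quality of those ground truths matters so much.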
The problem is that there are relatively few “known terrorists” to feed the algorithm, and real terrorists are unlikely to answer a hypothetical NSA survey on the matter. The internal NSA documents suggest that SKYNET uses a set of “known couriers” as ground truths and assumes, by default, that the rest of the population is innocent.
Pakistan has a population of around 192 million people, with about 120 million cellular handsets in use at the end of 2012, when the SKYNET presentation was created. The NSA analysed 55 million of those mobile phone records. With 80 variables on 55 million Pakistani mobile phone users, there is far too much data to make sense of manually. So like any Big Data application, the NSA uses machine learning as an aid (or perhaps a substitute; the slides do not say) for human reason and judgement.
SKYNET’s classification algorithm analyses the metadata and ground truths, and then generates a score for each individual based on their metadata. The goal is to assign high scores to real terrorists and low scores to the rest of the innocent population.
To do this, SKYNET uses the random forest algorithm, commonly applied in this kind of Big Data setting. Indeed, the UK’s GCHQ also appears to use similar machine learning techniques, as newly released Snowden documents revealed last week. “It seems the technique of choice when it comes to machine learning is Random Decision Forests,” George Danezis, associate professor of Security and Privacy Engineering at University College London, wrote in a blog post analysing the released documents.
The random forest method uses random subsets of the training data to build a “forest” of decision “trees,” then combines them by averaging the predictions of the individual trees. SKYNET’s algorithm takes the 80 properties of each cellphone user and assigns them a numerical score, just like a spam filter.
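The random forest idea can be sketched in a few lines. This is not the NSA’s implementation: the synthetic data, the deliberately simple depth-one “trees,” and every parameter below are invented for illustration. Each tree is trained on a bootstrap sample of the data, and the forest averages the trees’ votes into a score between 0 and 1.

```python
import random

random.seed(42)

def sample_point(label):
    # Synthetic data: class 1 tends to have larger feature values than class 0.
    base = 0.7 if label == 1 else 0.3
    return [base + random.uniform(-0.25, 0.25) for _ in range(3)], label

data = [sample_point(1) for _ in range(20)] + [sample_point(0) for _ in range(20)]

def train_stump(sample):
    # Depth-one "tree": pick a random feature and split at the midpoint
    # between the two class means in this bootstrap sample.
    i = random.randrange(3)
    pos = [x[i] for x, y in sample if y == 1]
    neg = [x[i] for x, y in sample if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[i] > thr else 0

def train_forest(data, n_trees=25):
    forest = []
    for _ in range(n_trees):
        bootstrap = [random.choice(data) for _ in data]  # sample with replacement
        forest.append(train_stump(bootstrap))
    return forest

def forest_score(forest, x):
    # Average of the trees' votes: a numerical score, like a spam filter's.
    return sum(tree(x) for tree in forest) / len(forest)

forest = train_forest(data)
print(forest_score(forest, [0.8, 0.75, 0.7]))  # scores high
print(forest_score(forest, [0.2, 0.25, 0.3]))  # scores low
```

Real random forests use deeper trees and many more features, but the core mechanics are the same: randomised training subsets, independent trees, averaged votes.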
SKYNET then picks a threshold value above which a cellphone user is classified as a “terrorist.” The slides present the evaluation results with the threshold set to a 50 percent false negative rate. At this rate, half of the actual “terrorists” are misclassified as innocent, in order to keep the number of false positives, innocents falsely classified as “terrorists,” as low as possible.
We cannot be sure, of course, that the 50 percent false negative rate chosen for this presentation is the same threshold used to generate the final kill list. Regardless, the problem of what to do with innocent false positives remains.
“The reason they’re doing this,” Ball said, “is because the fewer false negatives they have, the more false positives they’re certain to have. It’s not symmetric: there are so many true negatives that lowering the threshold in order to reduce the false negatives by 1 will mean accepting many thousands of additional false positives. Hence this decision.”
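Ball’s asymmetry can be sketched with back-of-the-envelope numbers. Only the 0.18 percent false positive figure comes from the slides; the number of real targets and the looser-threshold rates below are invented for illustration.

```python
# Invented numbers illustrating the false-negative / false-positive
# trade-off in a population where real targets are rare.
population = 55_000_000
true_positives_total = 5_000  # hypothetical number of real targets

# Two hypothetical operating points: the slides' 50 percent miss rate
# at 0.18 percent false positives, and an invented looser threshold.
for miss_rate, false_pos_rate in [(0.5, 0.0018), (0.25, 0.009)]:
    caught = true_positives_total * (1 - miss_rate)
    false_alarms = (population - true_positives_total) * false_pos_rate
    print(f"miss {miss_rate:.0%}: catch {caught:,.0f}, "
          f"falsely flag {false_alarms:,.0f} innocents")
```

Because the innocent pool is tens of millions strong, catching a few thousand more true positives at the looser threshold sweeps in hundreds of thousands of additional innocents. Hence the decision to tolerate a 50 percent miss rate.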
One NSA slide boasts, “Statistical algorithms are able to find the couriers at very low false alarm rates, if we’re allowed to miss half of them.”
But just how low is the NSA’s idea of “very low”?
The problem, Ball told Ars, lies in how the NSA trains the algorithm with ground truths.
The NSA evaluated the SKYNET program using a subset of 100,000 randomly selected people (identified by the MSIDN/MSI pairs of their mobile phones) and a known group of seven terrorists. The NSA then trained the learning algorithm by feeding it six of the terrorists and tasking SKYNET with finding the seventh. This data supplies the false positive percentages in the slide above.
“First, there are very few ‘known terrorists’ to use to train and test the model,” Ball said. “If they are using the same records to train the model as they are using to test the model, their assessment of the fit is completely bullshit. The usual practice is to hold some of the data out of the training process so that the test includes records the model has never seen before. Without this step, their classification fit assessment is ridiculously optimistic.”
The reason is that the 100,000 citizens were selected at random, while the seven terrorists come from a known cluster. Randomly selecting a tiny subset, less than 0.1 percent of the total population, massively reduces the density of the citizens’ social graph, while the “terrorist” cluster remains highly interconnected. Scientifically sound statistical analysis would have required the NSA to mix the terrorists into the population set before randomly selecting a subset, but this is not practical given their tiny number.
This may sound like a mere academic problem, but, Ball said, it is in fact highly damaging to the quality of the results, and thus ultimately to the accuracy of the classification, and killing, of people as “terrorists.” A quality evaluation is especially important here, as the random forest method is known to overfit its training sets, producing results that are overly optimistic. The NSA’s evaluation therefore does not give a good indication of the quality of the method.
Even allowing 50 percent of the actual “terrorists” (the false negatives) to slip through, the NSA’s 0.18 percent false positive rate would still mean thousands of innocents misclassified as “terrorists” and potentially killed. Even the NSA’s most optimistic result, the 0.008 percent false positive rate, would still mean many innocent people dying.
“On the slide with the false positive rates, note the final line that says ‘+ Anchory Selectors,'” Danezis told Ars. “This is key, and the figures are unreported… if you apply a classifier with a false-positive rate of 0.18 percent to a population of 55 million you are indeed likely to kill thousands of innocent people. [0.18 percent of 55 million = 99,000]. If however you apply it to a population where you already expect a very high prevalence of ‘terrorism’-because for example they are in the two-hop neighbourhood of a number of people of interest-then the prior goes up and you will kill fewer innocent people.”
Aside from the obvious objection of how many innocent people it is ever acceptable to kill, this also assumes there are a lot of terrorists to identify. “We know that the ‘true terrorist’ proportion of the full population is very small,” Ball said. “As Cory [Doctorow] says, if this were not true, we would all be dead already. Therefore a small false positive rate will lead to misidentification of lots of people as terrorists.”
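The base-rate problem Ball and Danezis describe can be made concrete. In the sketch below, the 50 percent miss rate and the 0.18 percent false positive rate come from the slides, but the prevalence of real targets is an invented assumption.

```python
# Base-rate arithmetic: with rare true positives, even a small false
# positive rate means most flagged people are innocent.
population = 55_000_000
prevalence = 0.0001      # invented: 1 in 10,000 is a real target
sensitivity = 0.5        # the slides' 50 percent miss rate
false_pos_rate = 0.0018  # the slides' 0.18 percent false positive rate

real = population * prevalence
flagged_real = real * sensitivity
flagged_innocent = (population - real) * false_pos_rate

# Precision: of everyone flagged, what fraction is actually a real target?
precision = flagged_real / (flagged_real + flagged_innocent)
print(f"{flagged_innocent:,.0f} innocents flagged; "
      f"{precision:.1%} of flagged people are real targets")
```

Under these assumptions, the overwhelming majority of people the classifier flags are innocent, which is exactly why Danezis stresses that the prior, the expected prevalence in the scored population, dominates the outcome.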
“The bigger point,” Ball added, “is that the model will totally overlook ‘true terrorists’ who are statistically different from the ‘true terrorists’ used to train the model.”
In many cases, a failure rate of 0.008 percent would be excellent…
A 0.008 percent false positive rate would be remarkably low for traditional business applications. Such a rate is acceptable when the consequence is showing an ad to the wrong person or mistakenly charging someone a premium price. But even 0.008 percent of the Pakistani population still corresponds to about 15,000 people potentially being misclassified as “terrorists” and targeted by the military, not to mention innocent bystanders or first responders who happen to get in the way.
Security guru Bruce Schneier agreed. “Government uses of big data are inherently different from corporate uses,” he told Ars. “The accuracy requirements mean that the same technology doesn’t work. If Google makes a mistake, people see an ad for a car they don’t want to buy. If the government makes a mistake, they kill innocents.”
Killing civilians is prohibited under the Geneva Conventions, to which the United States is a signatory. Many details about the SKYNET program remain unknown, however. For example, is SKYNET a closed-loop system, or do analysts review each mobile phone user’s profile before condemning them to death based on metadata? Are attempts made to capture these suspected “terrorists” and put them on trial? How can the US government be sure it is not killing innocent people, given the obvious flaws in the machine learning algorithm on which that kill list is based?
“On whether the use of SKYNET is a war crime, I defer to lawyers,” Ball said. “It’s bad science, that’s for damn sure, because classification is inherently probabilistic. If you’re going to condemn someone to death, usually we have a ‘beyond a reasonable doubt’ standard, which is not at all the case when you’re talking about people with ‘probable terrorist’ scores anywhere near the threshold. And that’s assuming that the classifier works in the first place, which I doubt because there simply aren’t enough positive cases of known terrorists for the random forest to get a good model of them.”
The leaked NSA slide decks offer strong evidence that thousands of innocent people are being labelled as terrorists; what happens after that, we don’t know. We don’t have the full picture, nor is the NSA likely to fill in the gaps for us. (We repeatedly sought comment from the NSA for this story, but at the time of publication it had not responded.)
Algorithms increasingly rule our lives. It’s a small step from applying SKYNET logic to look for “terrorists” in Pakistan to applying the same logic domestically to look for “drug dealers” or “protesters” or just people who disagree with the state. Killing people “based on metadata,” as Hayden said, is easy to ignore when it happens far away in a foreign land. But what happens when SKYNET gets turned on us-assuming it hasn’t been already?