The Author

Nick Selby (@NSelby) is co-founder and CEO of StreetCred Software Inc., which provides law enforcement agencies with fugitive case management products that leverage predictive intelligence. He is co-author of Blackhatonomics: An Inside Look at the Economics of Cybercrime, published by Elsevier/Syngress in 2012, and technical editor of Investigating Internet Crimes (Elsevier/Syngress, 2013).
Selby has consulted with US and European governments, investment banks, start-ups and venture capital firms on matters related to security intelligence. Since 2008 he has focused on law enforcement intelligence, and he was sworn in as a Texas police officer in 2010. He is based in Fort Worth, Texas.
The story of the week is the revelation by a 29-year-old National Security Agency (NSA) contractor named Edward Snowden that the US government has been ingesting large quantities of metadata and data about telephony and Internet traffic for the purposes of gathering intelligence.
This is highly controversial, to say the least.
For those of you who have spent the last week on Pluto, head on over to Doc Searls’ excellent blog for a comprehensive set of links to the stories about this spectacular news event. It is no overstatement to say that this leak is, in terms of its public and operational impact, the most explosive and significant since the 1971 publication in The New York Times of the classified history of the Vietnam conflict – what became known as The Pentagon Papers — and the most profound leak of intelligence tactics since Duncan Campbell’s 1976 reporting on GCHQ in Time Out, and the subsequent ABC Trials in 1978.
To put this in perspective, Bradley Manning’s leaks to WikiLeaks pale in significance when compared to any of the above but certainly to Snowden’s. This is not about cables, communiques and field intelligence, it’s about the very essence of Signals Intelligence (SIGINT) and specific methods and targets.
There is so much about to happen on this story that to say anything else about it would be folly. So I won’t.
What I will discuss is the “Why?”
I don’t think I’m alone in feeling that the explanations given by talking heads (including myself) of what intelligence analysts do with metadata – “pattern matching,” “dot connecting,” “link analysis” – leave people confused. If you already understand the topic, those phrases are highly descriptive. To the uninitiated, though, they are as opaque as viewing your kid’s ballet recital through milk-filled goggles.
So what is it that people would want to do with all that data and metadata that would lead to an agency apparently Hoovering it up wholesale?
And what the hell is metadata, anyway?
Intelligence is about making the pile smaller

Here’s a super-obvious case to get the gist: if you were in San Francisco looking for the person who stole your red Swingline stapler, and you also, magically, had access to the GPS signals from every mobile device in the Bay Area, what would you do?
I’ll tell you what I’d do: I’d draw a geo-fence around the area in which my Swingline stapler was last seen, and I would slice the huge pile of data by seeking only those devices that were in that area at around the time of the theft.
By doing that, I have just reduced the pool from 7.15 million people – the population of the San Francisco metropolitan area – to probably a couple of dozen who I can be pretty sure were near the scene of the crime at the time. Is this proof of anything? Absolutely not. But it is a very short list of people who may have information about who stole the stapler, and that list may even include someone who later turns out to be (for other reasons) a good suspect or in fact the culprit.
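The slicing I just described is easy to sketch in a few lines of Python. This is purely illustrative – the record layout, device IDs, coordinates and timestamps are all invented – but it shows how a simple bounding-box-and-time-window filter collapses a huge pile of location pings into a short list of devices:

```python
from dataclasses import dataclass

@dataclass
class Ping:
    device_id: str
    lat: float
    lon: float
    timestamp: int  # Unix epoch seconds

def near_scene(pings, lat_min, lat_max, lon_min, lon_max, t_start, t_end):
    """Return the distinct devices seen inside the geo-fence during the window."""
    return {
        p.device_id
        for p in pings
        if lat_min <= p.lat <= lat_max
        and lon_min <= p.lon <= lon_max
        and t_start <= p.timestamp <= t_end
    }

# Millions of pings would go in; a handful of device IDs come out.
pings = [
    Ping("A", 37.7749, -122.4194, 1000),  # inside the fence, inside the window
    Ping("B", 37.7749, -122.4194, 5000),  # inside the fence, but too late
    Ping("C", 36.0000, -120.0000, 1000),  # outside the fence entirely
]
suspects = near_scene(pings, 37.77, 37.78, -122.42, -122.41, 900, 1100)
# suspects is {"A"}: one device near the scene at the time
```

Notice that the filter proves nothing about device "A"; it only shrinks the pile of things worth looking at, which is the whole point.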
This exercise is a simplified and streamlined version of the work conducted by SIGINT people each day: find an event, a person, a time and place or even just a place that is for some reason interesting, and work to reduce the quantity of data you have on that thing until you’ve reached a manageable quantity. The difference is that in intelligence, we’re predicting which things are of interest, as opposed to reactively finding things of interest and working backwards to a root cause.
We do this reduction because we have limited resources – even the powerful NSA has limited resources. We need to use broadly-sourced SIGINT like this as a starting point, or as a place to return once an “interesting thing” is discovered.
The best explanation I’ve heard of the nature and composition of metadata came from Eric Olson, who writes the Digital Water Blog. Eric tells a story: you have to fly across the country to make a speech on a subject you’re not familiar with, and you have $20, enough to buy one book on the subject to prepare. You go into the bookshop and you start looking at the shelves, and as you do, you start asking yourself questions: Do you know the author’s name? Is the author someone you trust? Is the author someone who’s been recommended to you? Do you trust the person who recommended it? Have you read any reviews of the book, and were they good? Does the jacket have blurbs from people you trust, praising the book? Is the book heavy (because you do your best reading in the bath)? Is it too large (because you’ll be reading it on a plane and you’re sitting in the cheap seats)? Is it too thick (because you have to give this speech tomorrow and won’t have time to read a tome)?
Olson points out that these questions are all information about the book that can lead to decisions about whether you will buy it, and you haven’t even looked at the table of contents.
Those questions are metadata about the book that you’ve generated, dynamically. Metadata, as you can see, can be extremely powerful. That’s why some are unsatisfied by President Obama’s statements that the NSA is “only” looking at telephony metadata. As pointed out by the wonderful Electronic Frontier Foundation (a group to which I personally donate money each year), metadata is truly important and can be extremely telling. It’s misleading to say that it’s “just” metadata.
It would have been far better for the administration to say, “Yeah, this is what we do, there’s oversight and that’s it.”
At my company, we put these concepts to work and derive and create metadata that is in fact more immediately applicable and telling than specific records might be. For example, it’s highly specific information that on March 15, 2009, a given fugitive was involved in an altercation with Officer Jack Smith (#1028) of the Tulsa, Oklahoma, Police Department at 18:25, at the corner of B Street and 3rd Avenue, while he was driving a white, 1997 Dodge Ram 1500, OKLP 192BRYS, during which he swung at the officer with his fist and threatened to kill the officer.
We would simply set the “Violent” flag to a 1. Officers don’t need to know all those details of a case from four or five years ago but they do need to have an operational indication that this fugitive has been known in the past to be violent, or to present a threat of violence. Does this mean that this fugitive is guilty of violence now? Of course not. But it tells the officer viewing the information to be on his toes, using an indicator derived solely from metadata.
An officer viewing this fugitive’s case file within StreetCred would not have any idea of the details of the case, but he’d have everything he needs in that one metadatum to take steps to ensure his safety.
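Deriving that single flag from a pile of case detail can be sketched like this. To be clear, this is my own hypothetical illustration, not StreetCred’s actual implementation: the keyword list and record format are invented, and a real system would work from structured charge codes rather than free text:

```python
# Hypothetical keywords suggesting violence in an incident narrative.
VIOLENCE_KEYWORDS = ("assault", "swung", "threatened", "weapon", "struck")

def derive_flags(incident_narratives):
    """Collapse a detailed case history into one operational metadatum:
    violent = 1 if any past incident suggests violence, else 0."""
    violent = int(any(
        kw in narrative.lower()
        for narrative in incident_narratives
        for kw in VIOLENCE_KEYWORDS
    ))
    return {"violent": violent}

history = [
    "2009-03-15: subject swung at the officer and threatened to kill him",
    "2011-07-02: failure to appear on traffic citation",
]
flags = derive_flags(history)
# flags is {"violent": 1} -- the officer sees the flag, not the case detail
```

The details of the 2009 altercation never leave the system; only the derived metadatum does, which is exactly the trade described above.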
Taking metadata from others
Of course, what is under frantic and frenetic discussion these days is metadata describing telephone calls, and even of Internet communication. Here again, descriptions of “pattern matching” and “link analysis” can create confusion. One great way to look at this is the creation of a “social map” showing links between people; when it’s done, it resembles an airline route map.
Say you’ve got information that a telephone number is associated with Dr. Evil. One extremely fast way of determining who really knows Dr. Evil – as opposed to who just sort of casually knows Dr. Evil – would be to look at call data records. This has been standard practice for a very long time in the European Union (Germany banned it as unconstitutional a couple of years ago).
These call records would allow me to understand whom Dr. Evil calls and who calls Dr. Evil, the number of times they call, and their general or even specific location when they do. They would let me see how long people spoke (so, for example, later, when a person says, “But I don’t know him,” the question can be asked, “Oh yeah? So why did you make 51 phone calls to him in the last six months averaging four minutes and 15 seconds each?”).
Very quickly, this would allow me to go out one level to seek more connections that are relevant. It’s really, really easy to see how investigators could use this information to, say, map out a potential terror cell, but here’s the big thing: this is how investigators disqualify people from being looked at further.
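A toy version of that link analysis, assuming a flat list of (caller, callee, duration) records with made-up names, looks like this. Ranking contacts by talk time is one simple way to let strong ties float up and casual contacts fall away:

```python
from collections import defaultdict

# Each record: (caller, callee, duration in seconds). All data is invented.
call_records = [
    ("dr_evil", "number_two", 300),
    ("dr_evil", "number_two", 250),
    ("dr_evil", "pizza_shop", 45),
    ("mini_me", "dr_evil", 600),
]

def contact_strength(records, target):
    """Aggregate call count and total talk time for each contact of `target`."""
    stats = defaultdict(lambda: [0, 0])  # contact -> [call count, total seconds]
    for caller, callee, duration in records:
        if caller == target:
            other = callee
        elif callee == target:
            other = caller
        else:
            continue  # record doesn't involve the target at all
        stats[other][0] += 1
        stats[other][1] += duration
    # Rank by total talk time, strongest ties first.
    return sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)

ranking = contact_strength(call_records, "dr_evil")
# mini_me (600s) and number_two (2 calls, 550s) outrank pizza_shop (45s)
```

Repeating the same aggregation for each contact in the ranking is what “going out one level” means, and the pizza shop at the bottom of the list is the disqualification at work: most numbers drop out immediately.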
That last bit is the really important thing to remember. As Chris Swan points out over at Police Led Intelligence:
Your data is mostly safe because it’s mostly uninteresting (and that’s not the same as ‘nothing to hide, nothing to fear’, which is only ever said by politicians with something to hide who fear being found out).
The analyses are generally there to reduce from petabytes to kilobytes the size of the data-pile in which analysts search for good targets. This is not to opine that there is nothing wrong with what is happening – just as it is not to opine there’s nothing right about it. I’m making no statement at all about whether these revelations are “good” or “bad.”
This post was to give some context about just what someone might do with a lot of that kind of data, and to explain some of the kinds of thinking that goes into exploiting it once collected.