The Spider That Crawls the Dark Web Looking for Stolen Data

A start-up alerts organizations when their sensitive information pops up for sale online.

When police officers respond to a theft or a mugging, they’ll usually ask for the serial numbers of any valuable electronics that were taken. Those identifiers can help police know if a stolen item turns up at a local pawn shop, in a second-hand store, or on eBay. In many states, resellers have to check the serial numbers of certain items against a registry of stolen goods when the items come in.

But it’s a lot harder to track a stolen database of personal information, which can contain everything from names and addresses to financial details and fingerprints. Hackers routinely make off with massive hauls of sensitive data by breaking into databases held by government agencies, retailers, hospitals, banks, and just about every other kind of organization.

But most intrusions are discovered by a third party rather than the organization that actually lost the data, according to a 2014 report from Verizon. Some network tools detect intrusions, and scrutinizing detailed logs can reveal unauthorized access, but often, an organization won’t realize what happened until after a security researcher or a journalist catches wind of the intrusion. If your TV is stolen, it’s hard not to notice its conspicuous absence in the living room; if hackers nab data from a server, however, it’s not nearly as obvious.


Once it’s stolen, valuable data tends to crop up for sale in the shady alleyways of the Internet. Online forums frequented by hackers are popular places for hawking data dumps, and full-blown marketplaces on the dark web provide anonymity to buyers and sellers.

For an organization that’s charged with protecting sensitive data—which is nearly any company with payroll records or employee health files—one good way to know when a data breach has occurred is to monitor these markets. That’s where Matchlight, a service from Baltimore-based Terbium Labs, comes in.

Matchlight scans the recesses of hacker forums and marketplaces on both the surface web and the dark web—a part of the Internet accessible only through the anonymizing Tor network—and notifies clients if their confidential data turns up.

The service has two parts: The first is a web crawler, also known as a spider, that automatically searches and indexes the websites where stolen data is likely to appear. On the part of the Internet that most people browse every day, Google is the king of indexing. Every traffic-hungry site conforms to certain standards in order to get picked up by Google’s spiders and rise up as far as possible in its search results.

This makes Matchlight’s job relatively easy on the surface web. But there’s a different story on the dark web. “We’re trying to index what people don’t want indexed,” said Danny Rogers, Terbium’s CEO. “There’s no desire to make things easy to find. Fundamentally, it’s a more hostile environment to crawl.”

Marketplaces that sell illicit goods on the dark web come and go: The FBI shut down the notorious Silk Road market in 2013, and a set of coordinated raids in 2014 took down 400 dark-web markets hosted in 17 different countries. That’s a small slice of all the sites on the dark web, but these fluctuations make it difficult to monitor the most active marketplaces. To help its spider out, Terbium’s employees are also on the lookout for new sites in need of indexing.
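
To make the idea of a spider concrete, here is a minimal sketch in Python. It is not Terbium’s crawler: the `crawl` function, the `fetch` callback, and the `FAKE_WEB` pages are all hypothetical, and an in-memory dictionary stands in for real sites so the example runs without a network connection. The hand-picked seed list plays the role of the new sites that employees feed to the spider, and a page that has vanished is simply skipped.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from hand-picked seed addresses.

    `fetch` is a hypothetical callback that returns a page's HTML,
    or None if the site has disappeared.
    """
    index = {}                      # address -> page content
    seen = set(seeds)               # addresses already queued
    queue = deque(dict.fromkeys(seeds))
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:            # site is gone; move on
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical pages standing in for a forum; no network access involved.
FAKE_WEB = {
    "forum.example/home": '<a href="forum.example/dumps">dumps</a>'
                          '<a href="forum.example/gone">gone</a>',
    "forum.example/dumps": '<p>data for sale</p>'
                           '<a href="forum.example/home">back</a>',
}

index = crawl(["forum.example/home"], FAKE_WEB.get)
```

Here `index` ends up holding the two pages that still exist, while the dead link is dropped without stopping the crawl.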

Once Matchlight has an index of what’s being traded on the Internet, it needs to compare it against its clients’ data. But instead of keeping a database of sensitive and private client information to compare against, Terbium uses cryptographic hashes to find stolen data.

Hash functions create an effectively unique fingerprint of a file or a message. They’re particularly useful here because they only work in one direction: You can’t figure out what the original input was just by looking at a fingerprint. So clients can use hashing to create fingerprints of their sensitive data, and send them on to Terbium; Terbium then uses the same hash function on the data its web crawler comes across. If anything matches, the red flag goes up. Rogers says the program can find matches in a matter of minutes after a dataset is posted.
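
The fingerprint comparison can be sketched in a few lines of Python. The article does not say which hash function Terbium uses or how records are split up, so SHA-256, the whitespace-and-case normalization, and the names `fingerprint`, `fingerprint_all`, and `find_matches` are assumptions made purely for illustration. The sample records are invented.

```python
import hashlib


def fingerprint(record: str) -> str:
    """Return a SHA-256 hex digest of a normalized record.

    Normalization (lowercasing, collapsing whitespace) is an assumption;
    it lets trivially reformatted copies produce the same fingerprint.
    """
    normalized = " ".join(record.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_all(records):
    """Client side: hash every record so only fingerprints are shared."""
    return {fingerprint(record) for record in records}


def find_matches(client_fingerprints, crawled_records):
    """Monitoring side: hash crawled text and flag anything that matches."""
    return [
        record
        for record in crawled_records
        if fingerprint(record) in client_fingerprints
    ]


# The client hashes its own data and sends only the fingerprints.
client_records = [
    "Alice Example, 4111 1111 1111 1111",
    "Bob Sample, 5500 0000 0000 0004",
]
client_fps = fingerprint_all(client_records)

# The monitoring service hashes what the crawler found and compares.
crawled = [
    "alice example,  4111 1111 1111 1111",
    "unrelated forum post",
]
matches = find_matches(client_fps, crawled)
```

In this sketch the monitoring side never sees the client’s original records, only their digests. An exact-match scheme like this one flags a record only if it normalizes to the same text the client hashed.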

When a client gets an alert from Terbium that its sensitive data has turned up someplace it shouldn’t be, the client’s next move depends entirely on the kind of data that was stolen. If it’s customer financial data, the client might offer a year of free identity-fraud protection services to the affected individuals; if it’s passwords, the organization might force all its users to create new ones.

Hackers and criminals appear to be getting busy: Rogers says Matchlight sends clients thousands to tens of thousands of alerts every day. And while getting an alert means the damage has already been done (you can’t put the data back in the box once it’s out), speedy notifications can keep a company from reading about its own data breach on the morning news.