How anonymous is anonymized data?
In 2006, the Netflix Prize competition challenged teams to create an algorithm that would predict how individuals would rate a movie. Netflix provided a dataset of roughly 100 million ratings submitted by 480,000 users for more than 17,000 movies. Its engineers replaced user names with random identifiers and swapped some of the real ratings for fake, randomly chosen ones. This sounds like a valid approach to anonymizing a dataset; however, the strategy turned out to be ineffective. Two researchers from the University of Texas at Austin published a paper, Robust De-anonymization of Large Sparse Datasets, detailing how linkage attacks could be used to identify people in the Netflix dataset by combining it with data from IMDb. A linkage, or re-identification, attack is an attempt to re-identify individuals in an anonymized dataset by combining that data with other sources of information. Computer scientist Latanya Sweeney published a study finding that 87% of Americans can be identified by just three pieces of information: ZIP code, date of birth and gender.
A brief introduction
Differential privacy is a data governance practice that makes it possible to analyze large datasets and generate insights while maintaining the privacy of the individual data owners.
Ensuring data privacy is crucial for numerous applications, from maintaining the integrity of sensitive information to denying adversaries the opportunity to track people based on personally identifiable information. Differential privacy provides a mathematically provable guarantee of privacy protection against a wide range of attacks, including linkage attacks.
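For readers who want the formal statement (this is the standard definition from the differential privacy literature, not anything specific to OOLoop): a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ that differ in a single person's data and for every possible set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. The smaller the value of ε, the less the output can reveal about whether any one person's data was included at all.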
How does it work?
Let’s consider a product manager at a tech startup who wants to understand how many people have had at least one negative experience when trying to track their heart health on a wearable device. They download OOLoop on their smartphone and create a company brand, including its products and features. They then receive a series of reports about the problems their products aim to solve, as well as insights on the individuals experiencing those problems. Under the hood, OOLoop sends anonymized database queries to the devices of the Loopers who allowed access via the new data economy.
The product manager never has access to the real dataset that contains the Loopers’ experiences. Instead, the query is performed using differential privacy on each Looper’s device, which adds noise to the result before it leaves the phone. The anonymized, noisy results are aggregated on an OOLoop server and the insight is returned to the product manager.
In this example, the product manager wants to know the number of Loopers who had a negative experience when trying to monitor their heart health. The query runs locally on each Looper’s phone, and if it finds a negative experience it introduces some noise into the result before returning it to the server.
To decide which value to return, the differential privacy algorithm flips a coin. If it lands on Heads, the algorithm returns the real result (in this case the value “true”, indicating the presence of a negative experience). If the coin lands on Tails, the algorithm flips a second coin: Heads returns a fake value of “false”, indicating the absence of a negative experience, and Tails returns “true” regardless of the real value. This means there is a 25% chance that a returned value of “true” is just the result of the coin tosses.
This provides plausible deniability, especially for sensitive questions that involve revealing illegal or deeply personal behaviour. It also makes it difficult for anyone to determine with certainty that the result returned was the Looper’s real experience. Because we know how the noise is distributed, we can compensate for it and end up with a fairly accurate estimate of how many Loopers had at least one negative experience when trying to monitor their heart health, without compromising the privacy of any individual Looper who participated.
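To make the mechanics concrete, here is a minimal sketch of the two-coin scheme in Python, together with the compensation step described above. The function names and the simulated data are illustrative only; this is not OOLoop’s actual code.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report an answer using the two-coin-flip scheme described above."""
    if random.random() < 0.5:      # first coin: Heads -> report the truth
        return true_answer
    # First coin was Tails: a second coin decides, ignoring the real answer.
    # Heads -> "false", Tails -> "true".
    return random.random() >= 0.5

def estimate_true_fraction(reports: list[bool]) -> float:
    """Compensate for the known noise: P(report "true") = 0.25 + 0.5 * p,
    so the real fraction is p = (observed - 0.25) / 0.5."""
    observed = sum(reports) / len(reports)
    return max(0.0, min(1.0, (observed - 0.25) / 0.5))

# Simulate 10,000 Loopers, 30% of whom really had a negative experience.
true_answers = [random.random() < 0.30 for _ in range(10_000)]
reports = [randomized_response(a) for a in true_answers]
print(estimate_true_fraction(reports))  # close to 0.30, no individual exposed
```

Each individual report is mostly deniable noise, yet the aggregate estimate comes out close to the real rate, which is exactly the trade-off differential privacy is designed to make.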
Of course, a coin toss is a simplified explanation. OOLoop uses mechanisms such as the Laplace mechanism, which draws noise from the Laplace distribution, to spread results over a larger range and increase the level of anonymity. The paper The Algorithmic Foundations of Differential Privacy notes that differential privacy promises that the outcome of a survey will be essentially the same whether or not a particular individual participated in it. Because of this guarantee, Loopers know that the insights derived from their experiences can’t be linked back to them. Differential privacy is also used by companies like Apple and Google to generate insights without violating privacy.
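For a sense of what that looks like in practice, here is a small sketch of the standard Laplace mechanism applied to a counting query. The epsilon value is arbitrary and the code is a generic illustration, not OOLoop’s implementation.

```python
import random

def noisy_count(true_count: int, epsilon: float = 0.5) -> float:
    """Return a count protected by the Laplace mechanism.

    A counting query changes by at most 1 when any one person is added or
    removed (sensitivity = 1), so Laplace noise with scale 1/epsilon hides
    any individual's contribution."""
    scale = 1.0 / epsilon
    # The difference of two independent exponential samples with the same
    # rate is Laplace-distributed around zero.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

# Example: 1,234 Loopers reported a negative experience; publish a noisy count.
print(noisy_count(1234, epsilon=0.5))
```

Smaller values of epsilon mean more noise and stronger privacy; larger values mean more accurate counts.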