Chapter 1. History

The idea that computing statistics on a dataset can leak information about individual data points is not new. In fact, many of the foundational differential privacy papers 1 2 from the early twenty-first century cite research from the 1970s and 1980s. Those earlier studies acknowledge challenges that are today of widespread concern. For example, the inference problem was described by Denning in 1980 as the “deduction of confidential data by correlating the declassified statistical summaries and prior information” 3. Data storage and computation are far more pervasive now than in 1980, so the problem of preserving privacy has gained increased attention. Further, “prior information” is now more readily purchased and analyzed, making privacy violations via inference easier to carry out. Denning’s paper also identifies a key concept in privacy: if you compute a statistic on a dataset, and then compute the same statistic on the dataset minus one of the data points, you have learned something about that data point. In her words, “comparing the mean salary of two groups differing only by a single record may reveal the salary of the individual whose record is in one group but not the other.”

A variety of approaches have been proposed to counter these vulnerabilities. Output perturbation involves modifying (perturbing) the result of a query so that methods like inference are less likely to cause a privacy leak. Early attempts at output perturbation included adding random noise of a fixed, arbitrary magnitude to each query answer, and swapping values within the data. The first approach can be broken by executing a sufficiently large number of queries and averaging the results, while data swapping can make the data so different from the true values as to be unusable. What was needed instead was random noise added to query outputs, carefully scaled to protect privacy while preserving the utility of the resulting statistic.
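
To see why naive noise fails, consider the averaging attack just described. The following sketch (using NumPy, with made-up numbers) shows that if each repetition of a query is perturbed with fresh, unscaled noise, averaging the answers recovers the true value almost exactly:

import numpy as np

rng = np.random.default_rng(0)
true_count = 42          # the exact (sensitive) query answer
num_queries = 10_000     # the attacker repeats the same query many times

# each repetition of the query gets fresh random noise of fixed magnitude
noisy_answers = true_count + rng.normal(loc=0.0, scale=10.0, size=num_queries)

# averaging washes the noise out and recovers the true answer
print(noisy_answers.mean())   # prints a value very close to 42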

While there was recognition that adding noise was a promising technique for privatizing statistical queries, it was only with the work of researchers like Dinur and Nissim 4 that the theory of output perturbation was formalized and the concept of a private database became mathematically provable. In 2006, a paper called “Calibrating Noise to Sensitivity in Private Data Analysis” was published by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith to address the problem of scaling noise. In this paper, the authors address “privacy-preserving statistical databases” 5 and demonstrate privacy guarantees when adding noise to any statistical query. Let’s look carefully at the words in the previous sentence - you may have encountered these words before, but in this context they have subtly different meanings.

Statistical database

A “system that enables its users to retrieve only aggregate statistics for a subset of the entities represented in the database.” 6 You may already be familiar with databases like PostgreSQL - with these, you can execute queries that return individual rows matching certain conditions. By contrast, a statistical database only allows you to ask questions like “What is the average age of the people in this database?” and prevents you from executing queries like

SELECT * FROM users WHERE age = 32
Noise

In the context of privacy, noise refers to values sampled from a statistical distribution. You will encounter terms like Laplace noise and Gaussian noise which specify the kind of distribution that the values were sampled from.
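
For instance, here is a minimal sketch (using NumPy; the distribution parameters are arbitrary) of what sampling noise looks like in code:

import numpy as np

rng = np.random.default_rng(123)

# Laplace noise: sharper peak and heavier tails, parameterized by a scale
laplace_noise = rng.laplace(loc=0.0, scale=1.0, size=5)

# Gaussian noise: the familiar bell curve, parameterized by a standard deviation
gaussian_noise = rng.normal(loc=0.0, scale=1.0, size=5)

print(laplace_noise)
print(gaussian_noise)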

Privacy

Depending on context, the term privacy can mean the absence of surveillance, or the inability to be identified. In contrast to the previous two terms, which are technical in nature, the word privacy is used in philosophical and legal literature under various definitions. In fact, even the Stanford Encyclopedia of Philosophy admits “there is no single definition or analysis or meaning of the term.”

This chain of research culminated in Dwork’s 2006 paper “Differential Privacy” 7 , in which the author unified the previously mentioned concepts into a formal measure of the risk that an individual takes on by being in a statistical database. This paper, along with her textbook “The Algorithmic Foundations of Differential Privacy,” co-authored with Aaron Roth 8, demonstrates many of the concepts that will be centrally important in this book. Dwork’s work was partially motivated by an earlier paper by Tore Dalenius that sought to formalize the notion of a statistical disclosure. As part of this work, Dalenius admitted that the goal should be control of disclosure rather than elimination, since elimination is “not operationally feasible,” adding “it may be argued that elimination of disclosure is possible only by elimination of statistics.” 9 This idea of controlling disclosure will be important throughout the book - the goal is to quantify and calibrate the risk of disclosure, rather than to eliminate it altogether.
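
To preview the formal definition that makes this quantification possible (it is developed in detail later in the book): a randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in a single individual’s record, and every set S of possible outputs,

\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]

The parameter ε bounds how much any one person’s data can change the distribution of results - exactly the kind of controlled, quantified disclosure that Dalenius argued for.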

Previous Privatization Techniques

Differential privacy is certainly not the first technique that has been developed to analyze data while protecting individuals in the data. Consider a situation where a researcher releases aggregate statistics about hospital patients over the past year. For example, the researcher publishes a paper stating that there have been 3 cases of a particular type of cancer in that hospital this year. On the surface, you may assume this is safe - there were no names, addresses, social security numbers, or other sensitive personally identifiable aspects released about these patients.

Previously, standard privatization techniques relied on anonymization to protect privacy - for example, removing or modifying the names of individuals. This is insufficient and can lead to privacy leakages. Consider the case of Dr. Latanya Sweeney who, while still a graduate student, purchased Massachusetts voter registration information and was able to identify the governor of Massachusetts within anonymized medical data. 10 She was able to do this by isolating those individuals in the data with the same attributes as the governor, finding that “six people had [the governor’s] particular birthdate, only three of them were men, and, he was the only one in his 5-digit zip code.”11

This work led Sweeney to the discovery that many people are uniquely identifiable using only these three pieces of information: birthdate, government-recorded gender, and zip code. 12 A 2006 study estimated that about 63% of the US population can be uniquely identified from these three attributes, 13 while 44% of the population in the 2010 census is identifiable via census block, age, and sex. 14

To protect against vulnerabilities like this, Sweeney introduced k-anonymity in 2002 as a generalized protection against this type of identification. The “k” in k-anonymity refers to its central principle: an individual cannot be distinguished from at least k-1 other individuals in the dataset. In the Massachusetts scenario, this would have meant that Sweeney could not distinguish the governor from k-1 other individuals in the voter roll, preventing her from re-identifying his medical records. However, k-anonymity is vulnerable to linking - if an attribute is considered non-identifying but can be linked to another dataset, it can become an identifying attribute and lead to privacy violations.
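
Checking whether a table satisfies k-anonymity for a chosen set of quasi-identifiers is straightforward in code. The sketch below (using pandas, with made-up records and column names) counts how often each combination of quasi-identifiers appears; the smallest count is the k the table achieves:

import pandas as pd

# toy records containing only quasi-identifiers (all values are made up)
df = pd.DataFrame({
    "zip":       ["02138", "02138", "02139", "02139", "02139"],
    "gender":    ["M", "M", "F", "F", "F"],
    "birthdate": ["1945-07-31", "1945-07-31", "1960-01-01", "1960-01-01", "1960-01-01"],
})

quasi_identifiers = ["zip", "gender", "birthdate"]

# a table is k-anonymous if every combination of quasi-identifier
# values appears at least k times
group_sizes = df.groupby(quasi_identifiers).size()
print(f"This table is {group_sizes.min()}-anonymous")   # here: 2-anonymous

Even when this check passes, k-anonymity can still fail to protect individuals.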

The two major attacks on k-anonymity are:

Homogeneity attack

This attack takes advantage of a situation where the data has many identical values. Consider a situation where a hostile actor knows that someone has been admitted to the hospital on a certain date, and looks at the admission logs. If everyone admitted to the hospital on that day had cancer, then the attacker has learned that this individual has cancer. Clearly, this attack relied on the homogeneity of the data - if everyone admitted on that date had a different disease, then the attacker would have learned nothing.

Background knowledge attack

This attack relies on the attacker knowing things about the individual that are not present in the relevant dataset. For example, if the attacker wants to know what type of cancer the individual has, and knows that the hospital admissions were all for prostate or cervical cancer, then the attacker simply needs to know whether the individual has a prostate or a cervix in order to learn their cancer diagnosis with certainty. 15

As you can see, k-anonymity does not always prevent a bad actor from identifying individuals in a dataset. Even if the data is sufficiently heterogeneous to not fall victim to a homogeneity attack, with the availability of third-party data for sale, the risk of a linkage attack is significant. Methods that suppress individual features in a dataset are not sufficient to guarantee privacy.

Differential privacy gives a more general definition of privacy than techniques like k-anonymity. Instead of focusing on removing or anonymizing certain attributes from a dataset, it gives an analyst a probabilistic guarantee about an attacker’s ability to learn if a certain individual was in a dataset or not. In data privacy, implementation is key to guaranteeing protection - a flawed implementation can leak information and cause privacy violations. With this in mind, the next section will introduce several libraries for conducting DP data analysis.

Available Tools

This is fundamentally a book about understanding and implementing differential privacy, meaning that the software packages are of central importance. Luckily, there are already high-quality open source tools available for working with data in a differentially private manner.

The majority of the code examples you will see in this book are written using the OpenDP Library and SmartNoise. The OpenDP Library provides a flexible suite of lower-level functions for building differentially private data analysis pipelines, while SmartNoise contains higher-level abstractions.

The OpenDP Library

The OpenDP Library is a “modular collection of statistical algorithms” that follow differentially private principles. It can be used to build differentially private data analysis pipelines using a number of different models of privacy. The library is implemented in Rust, with bindings in Python for ease of use. The architecture of the OpenDP Library is based on a conceptual framework for expressing privacy-aware computations. 16

The following OpenDP Library sample, taken from the documentation, creates a chain of three operations and applies them to a dataset:

import opendp.prelude as dp

# opt in to "contrib" components that have not yet completed the vetting process
dp.enable_features("contrib")

# construct the query
dp_sum = (
  dp.space_of(list[float]) >>         # public knowledge
  dp.t.then_clamp(bounds=(0., 5.)) >> # stable transformation
  dp.t.then_sum() >>                  # stable transformation
  dp.m.then_laplace(scale=5.)         # private mechanism
)

# evaluate the privacy expenditure
max_contributions = 1
print("epsilon:", dp_sum.map(d_in=max_contributions))

# make a DP release
mock_dataset = [0.7, -0.3, 1., -1.]
print("DP sum release:", dp_sum(arg=mock_dataset))

The OpenDP Library is part of the larger OpenDP Project, a community effort to build trustworthy, open source software tools for analysis of private data. The full documentation for OpenDP can be found at https://docs.opendp.org.

SmartNoise

SmartNoise is a collaboration between Microsoft, Harvard’s Institute for Quantitative Social Science (IQSS), and the School of Engineering and Applied Sciences (SEAS) as part of the Open Differential Privacy (OpenDP) initiative. The project aims to connect solutions from the research community with the lessons learned from real-world deployments to make differential privacy broadly accessible.

Building upon the foundation of the OpenDP library, the SmartNoise SDK includes two Python packages:

  • smartnoise-sql: Allows data owners to run differentially private SQL queries. This package is useful when generating reports or data cubes over tabular data stored in SQL databases or Spark, or when the dataset is very large.

  • smartnoise-synth: Provides utilities for generating differentially private synthetic datasets. Useful when you can’t predict the workload in advance and want to be able to share data that is structurally similar to the real data with collaborators.

Both of these packages focus on the global model of differential privacy, as opposed to the local model. In the global model, a trusted data collector is presumed to have access to the unprotected data and wishes to protect the public releases of information computed from it.
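
As a flavor of the smartnoise-sql workflow, here is a sketch modeled on the package’s documentation. The CSV file, the metadata YAML file (which describes the table and the bounds of its columns), and the table name are placeholders you would supply for your own data:

import pandas as pd
import snsql
from snsql import Privacy

# total privacy budget for queries issued through this reader
privacy = Privacy(epsilon=1.0, delta=0.01)

# the data and its metadata description (placeholder file names)
pums = pd.read_csv("PUMS.csv")
reader = snsql.from_df(pums, privacy=privacy, metadata="PUMS.yaml")

# an ordinary-looking SQL query, answered with differential privacy
result = reader.execute("SELECT sex, AVG(age) AS avg_age FROM PUMS.PUMS GROUP BY sex")
print(result)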

Tumult Labs

Tumult Core is a set of components for building differentially private computations. Tumult Analytics is a Python library, built on top of Tumult Core, that allows the user to perform differentially private queries on tabular data. Tumult has also created an online platform for both batch and interactive data analysis. In May 2023, Tumult and Google announced a strategic partnership integrating Tumult’s differential privacy techniques into Google’s BigQuery platform. This allows BigQuery users to make differentially private queries on the platform by adding a differential privacy clause to their query. In other words, you only have to add WITH DIFFERENTIAL_PRIVACY to your SELECT statement, along with several options, in order to make the query differentially private:

SELECT
  WITH DIFFERENTIAL_PRIVACY
    OPTIONS(epsilon=1.0, delta=.01, privacy_unit_column=id)
    item,
    AVG(quantity) average_sales
FROM inventory
GROUP BY item;

Opacus

Opacus is a library for training differentially private PyTorch models. Training a DP model with Opacus requires minimal code changes, and in many cases does not have a significant impact on training performance. The library also supports online tracking of the privacy budget.
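
The sketch below, loosely based on the Opacus documentation, shows the core wiring: PrivacyEngine.make_private wraps an ordinary model, optimizer, and data loader so that training performs DP-SGD (per-example gradient clipping plus noise). The model, data, and parameter values are toy placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# a tiny model and some random toy data, just to show the wiring
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

# wrap everything so gradients are clipped and noised during training
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# train for one epoch as usual
loss_fn = nn.CrossEntropyLoss()
for features, labels in loader:
    optimizer.zero_grad()
    loss_fn(model(features), labels).backward()
    optimizer.step()

# the engine tracks the privacy budget spent so far
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))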

TensorFlow Privacy

TensorFlow Privacy (also known as TF Privacy) was created by Google Research. The library includes DP implementations of TensorFlow optimizers for training machine learning models. TF Privacy was designed so that users can train DP models with TensorFlow by changing a minimal amount of code. Differentially private implementations of some Keras models are available as well.
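
The pattern, following the TF Privacy tutorials, is to swap a standard Keras optimizer for its DP counterpart and compute the loss per example so that gradients can be clipped individually. The layer sizes and parameter values below are placeholders:

import tensorflow as tf
import tensorflow_privacy

# DP version of the Keras SGD optimizer: clips per-example gradients and adds noise
optimizer = tensorflow_privacy.DPKerasSGDOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=1.1,
    num_microbatches=32,      # must evenly divide the batch size
    learning_rate=0.15,
)

# the loss is left unreduced so each example's gradient can be clipped separately
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer=optimizer, loss=loss)
# model.fit(...) then proceeds exactly as it would without differential privacy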

Diffprivlib

Diffprivlib, from IBM, is a “general-purpose library for experimenting with” differential privacy. The library is written in Python and can be used to compare DP and non-DP machine learning models, and to build privacy-preserving applications. It includes a variety of mechanisms, as well as differentially private implementations of common scikit-learn machine learning models.

If you have a background with scikit-learn, this code sample from the Diffprivlib documentation will likely seem familiar:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from diffprivlib.models import GaussianNB

# a differentially private naive Bayes classifier, trained like any scikit-learn model
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), test_size=0.2)
clf = GaussianNB()
clf.fit(X_train, y_train)

The U.S. Census

Differential privacy made the jump from theory to application relatively quickly - in 2018, the U.S. Census Bureau announced that it would privatize the results of the End-to-End (E2E) Census Test. 17 This process was a practice run for future censuses, and systems that passed validation in 2018 were expected to be put into production for the 2020 Census. In light of the vulnerabilities pointed out by the database reconstruction theorem of Dinur and Nissim, the Census Bureau decided to build a differentially private statistical publication system. The stakes here were real - the data in question is used to construct legislative districts and to ensure compliance with the Voting Rights Act of 1965.

In 2021, the State of Alabama sued the U.S. Census Bureau for its use of differential privacy, alleging that the Bureau was providing “manipulated redistricting data to the States.” 18 There is precedent for the Census Bureau introducing techniques to protect individual privacy. For example, starting in 1930, Census releases stopped including data from small areas to protect residents of sparsely populated regions from being re-identified. A panel of three judges ultimately rejected Alabama’s complaint, responding that the state had no claim of harm stemming from skewed results, since the results had yet to be released. 19

Differential privacy, though a relatively recent development, has led to the creation of powerful tools for analyzing sensitive data. It is not enough to simply remove sensitive attributes from data, nor is it sufficient to anonymize data in order to protect privacy. Relatively simple methods can be used to re-identify individuals in a dataset, for example, cross-referencing demographic information with voter rolls in order to identify patients in a medical dataset. With the weaknesses of previous privatization techniques in mind, you can begin to weigh your options so that you will be able to safely work with sensitive data in the future. The goal of this book is that, upon completing it, you will have a firm grasp of the concepts behind differential privacy and also feel confident and prepared to work with sensitive data without falling into common traps.

Each chapter has a set of exercises at the end - these exercises are divided between conceptual, theoretical, and programming. A conceptual question will ask you to answer a “why” question in sentences. These questions will probe whether you understand the fundamentals that have been covered in the chapter. Theoretical questions will rely on mathematical reasoning - sometimes it will involve solving a problem, other times proving a theorem. Programming questions will challenge you to implement concepts in Python with the OpenDP Library. All questions are answered at the end of the book, with explanations and further comments.

In the next chapter, you will follow a real-world scenario that demonstrates the risks of non-private statistical releases. Though the stakes are relatively low (an exam score), the principles are the same for more sensitive types of data. This chapter will motivate the ways that a seemingly innocuous action can lead to the indirect disclosure of information. From there, you will dive deeper into the theory and application of differential privacy - what it guarantees, what it doesn’t, and how to analyze data with privacy guarantees in mind.

1 C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis,” in Theory of Cryptography, S. Halevi and T. Rabin, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2006, pp. 265–284. doi: 10.1007/11681878_14.

2 C. Dwork, “Differential Privacy,” in Automata, Languages and Programming, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2006, pp. 1–12. doi: 10.1007/11787006_1.

3 D. E. Denning, “Secure statistical databases with random sample queries,” ACM Trans. Database Syst., vol. 5, no. 3, pp. 291–315, Sep. 1980, doi: 10.1145/320613.320616.

4 I. Dinur and K. Nissim, “Revealing information while preserving privacy,” in Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, in PODS ’03. New York, NY, USA: Association for Computing Machinery, Jun. 2003, pp. 202–210. doi: 10.1145/773153.773173.

5 C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis,” in Theory of Cryptography, S. Halevi and T. Rabin, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2006, pp. 265–284. doi: 10.1007/11681878_14.

6 N. Adam, H. Lu, J. Vaidya, and B. Shafiq, “Statistical Databases,” in Encyclopedia of Cryptography and Security, H. C. A. van Tilborg and S. Jajodia, Eds., Boston, MA: Springer US, 2011, pp. 1256–1260. doi: 10.1007/978-1-4419-5906-5_767.

7 C. Dwork, “Differential Privacy,” in Automata, Languages and Programming, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2006, pp. 1–12. doi: 10.1007/11787006_1.

8 C. Dwork and A. Roth, “The Algorithmic Foundations of Differential Privacy,” FNT in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2013, doi: 10.1561/0400000042.

9 T. Dalenius, “Towards a methodology for statistical disclosure control,” 1977. Accessed: Jun. 21, 2023. [Online]. Available: https://ecommons.cornell.edu/handle/1813/111303

10 L. Sweeney, “Simple Demographics Often Identify People Uniquely,” Data Privacy Working Paper 3, Carnegie Mellon University, Pittsburgh, PA, 2000.

11 L. Sweeney, “Recommendations to Identify and Combat Privacy Problems in the Commonwealth,” testimony before the Pennsylvania House Select Committee on Information Security (House Resolution 351), Pittsburgh, PA, Oct. 2005.

12 L. Sweeney, “Uniqueness of Simple Demographics in the US Population,” technical report, Carnegie Mellon University, 2000.

13 P. Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population,” in Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (WPES ’06), New York, NY, USA: ACM, 2006, pp. 77–80.

14 John M. Abowd. Supplemental Declaration, 2021. State of Alabama v. US Department of Commerce.

15 A. Hussien, N. Hamza and H. Hefny, “Attacks on Anonymization-Based Privacy-Preserving: A Survey for Data Mining and Data Publishing,” Journal of Information Security, Vol. 4 No. 2, 2013, pp. 101-112. doi: 10.4236/jis.2013.42012.

16 M. Hay, M. Gaboardi, and S. Vadhan, “A Programming Framework for OpenDP,” 6th Workshop on the Theory and Practice of Differential Privacy (TPDP 2020), 2020.

17 J. M. Abowd, “The U.S. Census Bureau Adopts Differential Privacy,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London United Kingdom: ACM, Jul. 2018, pp. 2867–2867. doi: 10.1145/3219819.3226070.

18 The State of Alabama v. United States Department of Commerce & United States Census Bureau, United States District Court for the Middle District of Alabama

19 The full case is called “Memorandum Opinion and Order, The State of Alabama v. United States Department of Commerce & United States Census Bureau, United States District Court for the Middle District of Alabama.” A brief of amicus curae was filed by the Electronic Privacy Information Center (EPIC) in support of the U.S. Census.

20 R. Chetty and J. N. Friedman, “A Practical Method to Reduce Privacy Loss when Disclosing Statistics Based on Small Samples.” in Working Paper Series. National Bureau of Economic Research, Mar. 2019. doi: 10.3386/w25626.
