Chapter 6. A System for Relating Distances

This chapter builds on concepts discussed in previous chapters to construct a higher-level perspective on differential privacy. This perspective will help you gain a more mathematical understanding of differential privacy, as well as unify concepts discussed in previous chapters.

The goal of this chapter is to give you a consistent framework under which you can feel confident in constructing your own differentially private algorithms.

To accomplish this objective, this chapter will:

  • lay a theoretical framework for understanding differential privacy

  • adapt topics discussed in previous chapters to the framework

  • introduce new stable transformations and private mechanisms within the framework

The high-level idea is that differential privacy is a system for relating distances. The system is composed of a set of functions, each of which is stable. Stable functions are functions for which close inputs result in close outputs. If you know how much your private dataset could possibly change, and then apply a stable function to it, then you also know how much the function’s output could possibly change. Knowing how much the output could possibly change can be equivalent to knowing how private the output is.

To help familiarize yourself with this perspective, the chapter will then walk through a series of examples, each of which will demonstrate how differentially private algorithms break down into provably stable functions.

To get started, it will be helpful to familiarize yourself with the concept of a metric space first.

Metric Spaces

A metric space is a pair consisting of a domain and a metric, where the metric may be used to compute the distance between any two members of the domain. That is, a metric space is a set of values, equipped with a way of computing the distance between said values. In Chapter 1, the claim was made that differentially private algorithms consist of a pipeline of smaller functions. Each function in the pipeline maps from one metric space to another.

A domain is the set of possible values that a dataset may take on. The set may consist of virtually anything, and is usually defined with some descriptors. The descriptors may be very restrictive-- like the set of booleans. Alternatively, the set may contain an infinite number of values-- like the set of all real numbers, or the set of all vectors of bounded integers. The domain may even be the set of all dataframes, or the set of all graphs of bounded degree.

A metric is a kind of function that computes the distance between any two members of a domain. Common metrics in differential privacy are the absolute distance, L1 distance, L2 distance, symmetric distance, Hamming distance and discrete distance. Other possible metrics arise naturally when you consider how to best measure distances between values after a function has been applied. This chapter will discuss various situations where these metrics, and other metrics, arise.

In Practice

One of the first steps towards making a DP release is to define the initial metric space that your sensitive dataset resides in. If you have made a DP release in the past, you may not have even recognized this step! Many DP libraries make an implicit assumption about the metric space. For the guarantees of any DP library to hold, the metric space of your dataset must match the metric space the DP library uses. In this step, you should describe the set of possible values your dataset may take on, choose a metric that characterizes the distance between neighboring datasets, and then set an upper bound on the distance, called din.

For instance, say your dataset is a single integer-- the number of individuals in your household. The dataset resides in the domain of integers, and the distance between neighboring datasets is measured with the absolute distance. You know intuitively that any two neighboring datasets differ by at most one (this is din).

Any two datasets are close (or adjacent/neighboring) if the distance between them is no greater than the bound. In the above example, any two datasets/integers are close if they differ by no more than one. This same concept of closeness can be applied to any choice of metric space, dataset, and distance.
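To make this concrete, here is a minimal sketch of the household example; the absolute_distance helper and the variable names are illustrative, not taken from any particular library.

def absolute_distance(a, b):
    return abs(a - b)

d_in = 1                             # an individual changes the household count by at most one
household_a, household_b = 4, 5      # two neighboring datasets: household sizes
assert absolute_distance(household_a, household_b) <= d_in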

Privacy Loss as a Distance

DP releases follow a known probability distribution. Privacy loss parameters like ϵ represent an upper bound on the distance between any two probability distributions.

Notice how privacy loss parameters can be interpreted similarly to din, in that both are upper bounds on a distance. They differ in that privacy loss parameters bound the distance between distributions, rather than the distance between datasets. Because of this distinction, instead of describing these parameters with metrics, the rest of this chapter will use the term privacy measure.

These privacy loss parameters complement din: din describes a bound on distances between input datasets, while dout describes a bound on distances between output distributions.

Together, the input and output distance fully characterize the privacy guarantee of a differentially private algorithm. The input distance is a bound on how much an individual can influence a dataset, and the output distance is a bound on how much an individual can influence the output distribution/how private the data release is. The next section will discuss how to relate these two distances.

Stable Functions

A function defines a relationship between two metric spaces. That is, a function maps a member of the first metric space to a member of the second metric space. A function encompasses any private or non-private computation you might perform on a dataset. Of particular interest are stable functions.

A function is considered stable if close inputs result in close outputs. “Close” means there is an upper bound on the distance between any two values. Restated, a stable function is a function for which, if you know any two inputs to the function differ by no more than some upper bound din, then the outputs differ by no more than some upper bound dout.

To get started, assume both the input metric and the output metric are the absolute distance. Consider the following function:

def double(x):
    return x * 2

This function is stable because it can be proven that if the inputs to the function differ by at most din, then the respective outputs differ by at most din·2. Therefore dout ≤ din·2.
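As a quick, informal check of this claim, you can compare distances before and after applying double to an arbitrary pair of inputs:

a, b = 3.0, 7.0                        # any two inputs to `double`
d_in = abs(a - b)                      # distance between the inputs
d_out = abs(double(a) - double(b))     # distance between the outputs
assert d_out <= 2 * d_in               # the stability claim: d_out <= d_in * 2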

Stable functions don’t necessarily use the absolute distance metric! Other common choices of metrics are the symmetric distance, Hamming distance or Lp distance for some choice of p. Later sections will talk about these in more detail.

Differentially private algorithms naturally break down into a sequence of functions, where each function can be shown to be stable. You can think about differential privacy as a system for relating distances. Under the perspective of function stability, differential privacy is a special kind of stability constraint on a function where the upper bound on output distances is computed via a privacy measure.

Data analysts start their analysis with a limit on how much an individual can influence a dataset (an input distance), and if all of the functions in the pipeline are stable, then that bound can be translated all the way to the end of the pipeline. If the final function in the pipeline measures distances with a privacy measure, then the output distance is a privacy guarantee.
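The following sketch illustrates how a bound propagates through a small pipeline. The stability constants used here (for clamping, summing, and Laplace noise) are justified later in the chapter; the function names are purely illustrative.

def clamp_stability(d_in):
    return d_in                          # clamping records never increases dataset distance

def sum_stability(d_in, L=0.0, U=1.0):
    return d_in * max(abs(L), abs(U))    # bound on how much a sum of clamped records can change

def laplace_privacy(d_in, b=1.0):
    return d_in / b                      # the final stage maps a distance to a privacy loss (epsilon)

# an individual can add or remove at most one record, so the input distance is 1
epsilon = laplace_privacy(sum_stability(clamp_stability(1)))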

We’ll show more examples of metric spaces and stable functions in the following sections.

Bounded vs. Unbounded Differential Privacy

Often in the literature of differential privacy, a distinction is made between bounded and unbounded differential privacy. This distinction is based on the existence of one piece of public information: the dataset size.

When the dataset size is unknown, then you are operating under unbounded differential privacy. In unbounded differential privacy, the data domain consists of all datasets of any size. The distance between any two datasets is computed by counting the fewest number of records that must be added or removed to convert any one dataset into another dataset. A metric that can express unbounded DP is the symmetric distance. In unbounded DP, the user must consider the greatest number of additions or removals an individual may make to a dataset.

When the dataset size is known, then you are operating under bounded differential privacy. In bounded differential privacy, the data domain consists of all datasets that have exactly as many records as the known dataset size. In this regime, the distance between any two datasets is computed by counting the fewest number of rows that must be edited to convert one dataset into the other. In bounded DP, you must consider the greatest number of edits an individual may make to a dataset.

Unbounded differential privacy considers additions and removals of individuals from a dataset, whereas bounded differential privacy counts edits instead. Bounded differential privacy is a special case of unbounded differential privacy, where the data domain is restricted to a known size. Notice that in bounded differential privacy, the number of additions and removals is equivalent to two times the number of edits, because each edit constitutes one addition and one removal. This relationship makes it possible to convert between the different notions of neighboring datasets.
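To make the two metrics concrete, here is a minimal sketch that treats datasets as multisets of records; the helper functions are illustrative, not from any particular library.

from collections import Counter

def symmetric_distance(x, y):
    """Unbounded DP: fewest additions and removals to turn x into y."""
    a, b = Counter(x), Counter(y)
    return sum(((a - b) + (b - a)).values())

def change_one_distance(x, y):
    """Bounded DP: fewest record edits, assuming len(x) == len(y)."""
    a, b = Counter(x), Counter(y)
    return sum((a - b).values())

x = [23, 35, 35, 41]
y = [23, 35, 35, 62]                   # one record edited
assert change_one_distance(x, y) == 1
assert symmetric_distance(x, y) == 2   # each edit is one addition plus one removal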

Domains and Restricted Sensitivity

Restrictions placed on the data domain reduce the sensitivity because the restrictions exclude extreme dataset pairs.

Even clamping each individual datum is a form of restricted sensitivity: clamping restricts the domain of possible datasets to those whose values lie within the clamping bounds, and under this restriction to the domain, the unbounded-DP sensitivity of the sum becomes finite. A size descriptor on the domain may also reduce the sensitivity, as shown in the following table.

In the following table, an individual may add or remove up to din rows. In the first data column, the domain consists of all numeric vectors of any length; in the second, the domain consists of numeric vectors of any length with entries bounded between L and U; and in the third, the domain consists of bounded numeric vectors of fixed length N.

Statistic | any numeric vector | with L, U        | with L, U, N
--------- | ------------------ | ---------------- | ----------------
Count     | din                | din              | 0
Sum       | unbounded          | din·max(|L|, U)  | din/2·(U−L)
Mean      | unbounded          | unbounded        | din/2·(U−L)/N
Variance  | unbounded          | unbounded        | din/2·(U−L)²/N

Under bounded DP, din must be even: since the dataset size is known, the number of additions must equal the number of removals. One addition paired with one removal can change the sum by at most U−L.

Any kind of descriptor based on what is considered “public information” can reduce the sensitivity, including a lower bound on the dataset size, the data types the data takes on, or a limit on the norm of each data row.

These descriptors, or restrictions, on the domain may come from two sources:

  • descriptors may be public information about the dataset

  • descriptors may be established via data processing

In the first case, descriptors may only be safely considered public information about a dataset if, for all potential datasets under consideration, that descriptor holds invariant. In many real-world applications, descriptors can be derived from common-sense relationships, like how the age of parents may not be less than that of their children.

Restricting the domain by incorporating public knowledge you have about your data, and judiciously applying preprocessing, can significantly reduce the sensitivity of your queries.
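As a rough sketch, the sum column of the table above can be captured in a small helper that returns tighter bounds as more descriptors become available; the function and its signature are illustrative, not from any library.

def sum_sensitivity(d_in, L=None, U=None, N=None):
    """Upper bound on how much a sum can change between adjacent datasets."""
    if L is None or U is None:
        return float("inf")                  # without clamping bounds, the sum is unbounded
    if N is None:
        return d_in * max(abs(L), abs(U))    # unknown size: whole rows can be added or removed
    return (d_in // 2) * (U - L)             # known size: rows are edited (one addition + one removal)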

Sensitivity of Vector Aggregates

In the previous section, one kind of data descriptor was a bound on the L1 or L2 norm of each row in a dataset. Such a bound is useful when releasing a query result that spans multiple columns at once. In this context, the sensitivity of the sum scales according to the norm bound:

l1 Norm

The l1 norm of a vector x is defined as:

$$\|x\|_1 = \sum_{i=1}^{N} |x_i|$$

When each row of a dataset has norm bounded by some constant C, it is also possible to bound the distance between the outputs of common queries, like counts and sums, evaluated on adjacent datasets (this bound is the sensitivity).

Note that, while this definition gives an example where datasets are considered neighboring under the symmetric distance, this choice of metric is arbitrary.

The l1 distance is another possible metric that can be used to characterize the distance between datasets (in this case, datasets in an aggregated form). When the dimension of the data is one, it is identical to the absolute distance.

A Stable Transformation into l1 Sensitivity

A very common situation where this arises is in computing histograms. Say you were asked to prepare a differentially private release of fitness data to show the effectiveness of a product. You want to share counts of the number of unique individuals using your product, grouped by binned ages and binned activity levels. Given the structure of this query, it is safe to claim that any one individual may influence at most one partition: any one individual cannot be of multiple ages, or multiple activity levels, at once! This makes it possible to bound the l1 sensitivity of the entire count query. Adding or removing any one individual can influence at most one of the counts, so the l1 sensitivity, or ΔF, is at most one.

A very simple adjustment of the query can cause this very nice property to degrade. When also binning by time range (say there are k time ranges), an individual may now contribute a record to each of the k time periods. This naturally increases the sensitivity: ΔF=k.

In the context of relating distances, a function that computes the histogram (without noise addition) is (din, dout)-stable, where dout ≤ din, din is with respect to the symmetric distance between datasets, and dout is with respect to the l1 distance.
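A minimal sketch of such a histogram transformation, assuming each record has already been reduced to a single partition label, might look like the following:

import numpy as np

def histogram(x, partitions):
    """Count how many records fall into each partition."""
    return np.array([(x == p).sum() for p in partitions])

# adding or removing one record changes exactly one count by one, so the l1
# distance between output vectors is at most the symmetric distance between inputs
x1 = np.array(["18-29/low", "30-44/high", "30-44/high"])
x2 = np.array(["18-29/low", "30-44/high"])            # one record removed
parts = ["18-29/low", "18-29/high", "30-44/low", "30-44/high"]
assert np.abs(histogram(x1, parts) - histogram(x2, parts)).sum() <= 1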

A Private Mechanism over l1 Sensitivity

Stable transformations with l1 sensitivity bounded by dout can also be privatized with the Laplace mechanism. In this section, the frame of reference is changed. The dout previously discussed for the stable histogram becomes the din for the Laplace mechanism. The new objective is to find the smallest dout that satisfies a privacy measure.

To show this for pure differential privacy, where dout corresponds to ϵ, start with the definition of pure differential privacy. For any possible noisy output of the mechanism T:

$$\epsilon \ge \ln\frac{\Pr[Y = T]}{\Pr[Y' = T]}$$

In this case, Y and Y' represent the distributions of the Laplace mechanism applied on any two input datasets. Remember, in the context of the previous fitness data example with histograms, the input datasets are any two vectors of counts, termed y and y'.

$$\epsilon = \ln\frac{\Pr[\mathrm{Laplace}(y \mid b) = z]}{\Pr[\mathrm{Laplace}(y' \mid b) = z]}$$

Consider b to be the noise scale. Now substitute the distribution of the multivariate Laplace mechanism, keeping in mind that y, y' and z are vectors:

$$= \ln\frac{\prod_{i=1}^{k} \frac{1}{2b}\exp\left(-\frac{|y_i - z_i|}{b}\right)}{\prod_{i=1}^{k} \frac{1}{2b}\exp\left(-\frac{|y'_i - z_i|}{b}\right)}$$

This may seem like too much, but the math quickly simplifies:

$$= \sum_{i=1}^{k} \ln\frac{\exp\left(-\frac{|y_i - z_i|}{b}\right)}{\exp\left(-\frac{|y'_i - z_i|}{b}\right)} = \sum_{i=1}^{k} \frac{|y'_i - z_i| - |y_i - z_i|}{b} \le \sum_{i=1}^{k} \frac{|y_i - y'_i|}{b} = \frac{\|y - y'\|_1}{b}$$

At this stage, observe how well the Laplace mechanism complements queries with known l1 sensitivity: the l1 distance appears directly in the formula! Substituting $\|y - y'\|_1 \le d_{in}$ gives the final bound:

$$\epsilon \le \frac{d_{in}}{b}$$

In the overall scheme of relating distances, this function that samples Laplace noise is (din, dout)-private, where dout ≤ din/b, din is with respect to the l1 distance between aggregated datasets, and dout is the ε of pure differential privacy.

Together with the previous section, applying the histogram and Laplace functions sequentially is (din, dout)-private, where dout ≤ din/b, din is with respect to the symmetric distance between the original datasets, and dout is the ε of pure differential privacy.
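Putting the two pieces together, a minimal sketch of the vector Laplace mechanism might look like this; the function is illustrative, and a production library would use a more careful noise sampler.

import numpy as np

def mechanism_laplace(y, epsilon, d_in):
    """Privatize a count vector whose l1 sensitivity is at most `d_in`."""
    b = d_in / epsilon                                 # noise scale such that d_in / b = epsilon
    return y + np.random.laplace(scale=b, size=len(y))

# usage: privatize the histogram counts from the fitness example
counts = np.array([12, 45, 31, 8])
dp_counts = mechanism_laplace(counts, epsilon=1.0, d_in=1)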

L2 Distance

Another common choice of metric is the l2 distance.

l2 Norm

The l2 norm of a vector x is defined as:

$$\|x\|_2 = \sqrt{\sum_{i=1}^{N} x_i^2}$$

Just as with the l1 norm, a bound on the l2 norm of each row makes it possible to bound the l2 sensitivity of common queries.

Again, the choice of dataset metric (the symmetric distance) is arbitrary. Any dataset metric may be used, and the choice will influence the resulting, derived sensitivity. This section will continue using the symmetric distance.

A Stable Transformation into l2 Sensitivity

A significant benefit of using the l2 distance is that the sensitivity is much smaller-- by a square root-- for certain kinds of queries.

First, recall how the l1 sensitivity of the histogram on fitness data scaled by the number of partitions an individual may influence. The sensitivity of that same query under the l2 distance is as follows:

$$d_{out} = \Delta F = \max_{x, y} \|F(x) - F(y)\|_2$$

For this choice of query, on this dataset, you derive significant benefit from the knowledge that an individual may influence each partition by at most one.

$$= \max_{x, y} \sqrt{\sum_i \left(F(x)_i - F(y)_i\right)^2} = \max_{x, y} \sqrt{\sum_{i}^{d_{in}} 1^2} = \sqrt{d_{in}}$$

As you can see, this sensitivity is far tighter than before. Unfortunately, not all queries satisfy the constraint that an individual may influence each partition by at most one! If, in the fitness data example, you were to partition on only binned ages and binned activity levels, then you would have no way of knowing how many records from an individual may fall into each bin. Therefore, the sensitivity of this query, under the l2 distance, degrades to dout ≤ din (gaining no benefit from the square root).
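A quick numeric illustration of both cases, assuming din = 4 added records, shows where the square-root saving comes from:

import numpy as np

y = np.zeros(4)
y_spread = np.ones(4)                       # one extra record in each of 4 partitions
assert np.abs(y - y_spread).sum() == 4                  # l1 distance = d_in
assert np.isclose(np.linalg.norm(y - y_spread), 2.0)    # l2 distance = sqrt(d_in)

y_conc = np.array([4.0, 0.0, 0.0, 0.0])     # all 4 extra records in one partition
assert np.isclose(np.linalg.norm(y - y_conc), 4.0)      # l2 distance degrades to d_in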

A Private Mechanism on l2 Sensitivity

A similar relationship can be shown between the Gaussian mechanism and the l2 distance. The proof of privacy for the Gaussian mechanism is similar to that of the Laplace mechanism, but in this case, the probability distribution substituted is that of the Gaussian distribution. The Gaussian mechanism doesn’t satisfy pure differential privacy, but it does satisfy a number of other measures of privacy. One such measure of privacy is approximate differential privacy.

The proof is somewhat lengthy, but it simplifies to the following guarantee (which holds for ε < 1):

$$\epsilon \ge \frac{\sqrt{2 \cdot \ln(1.25/\delta)} \cdot \Delta_2(F)}{\sigma}$$

Together with the stable transformation, since each function is stable, the composed function is also stable. Starting from a bound on the distance between the input datasets, you can bound the distance between aggregates (the sensitivity). Then, you can use this sensitivity to bound the distance between any two output distributions. This gives the final bound on privacy.
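A minimal sketch of the Gaussian mechanism, calibrated with the bound above, might look like the following; the helper is illustrative, not a library function.

import numpy as np

def mechanism_gaussian(y, scale):
    """Add Gaussian noise with standard deviation `scale` to a vector."""
    return y + np.random.normal(scale=scale, size=len(y))

# usage: calibrate the noise scale from the bound (valid for epsilon < 1), then release
epsilon, delta, l2_sensitivity = 0.5, 1e-6, 1.0
scale = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
dp_counts = mechanism_gaussian(np.array([12.0, 45.0, 31.0, 8.0]), scale)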

The Exponential Mechanism

The exponential mechanism is used for private selection from a set of candidates. In this setting, you know what the set of possible outputs can be (your candidate set), and you want to choose one (or more) candidates from that set that best answers a query on the private dataset.

A motivating example (similar to the seminal paper) involves pricing: Imagine that you have decided to sell off each of your baseball cards, and want to choose the best price point to maximize profit.

If you set the price too high, you will fail to find buyers, whereas setting the price too low will also lose money. In this setting, assume that prospective buyers will tell you what they are willing to spend, so long as this information isn’t leaked to other buyers.

To use the exponential mechanism, you must define a set of potential price points before conferring with the buyers. This set of potential price points forms the candidate set. The exponential mechanism selects one of those candidates based on the private information from the buyers.

Just like in the prior examples, you can divide the algorithm into a stable transformation and a private mechanism. The stable transformation is the scoring function, which maps the private dataset to a vector of scores. The private mechanism is the exponential mechanism, which selects one of the candidates based on the scores.

You can think of the scoring as a dataset aggregation, just like you might think of a dataset sum as an aggregation. The scoring function F aggregates a dataset down to a vector of scores, where each score corresponds to a different candidate price point.

The sensitivity of this score vector is based on the greatest amount any one score may change.

That is, the sensitivity of the exponential mechanism is based on the l∞ norm.

l∞ Norm

The l∞ norm of a vector x is defined as:

$$\|x\|_\infty = \max_i |x_i|$$

You can use this norm, just as in the prior examples, to define the sensitivity of the scoring function.

This time F is a function that computes a vector of scores.

A Stable Transformation into l∞ Sensitivity

In the following score function F, the candidate set is denoted by C. The score function maps a dataset x to a vector of scores F(x). This function is written in terms of computing the ith score, F(x)i.

F(x)i=-|(1-α)·#(X<Ci)-α·#(X>Ci)|

An equivalent way to write this function in Python is as follows:

import numpy as np

def score_quantile_discrete(x, candidates, alpha=0.5):
    """Scores each candidate in `candidates` according to its
    rank distance from the `alpha`-quantile of the dataset `x`."""
    # number of records less than or equal to each candidate
    num_leq = (x[None] <= candidates[:, None]).sum(axis=1)
    return -np.abs(num_leq - alpha * len(x))

The score function is based on the number of records in the dataset that are less than (#(X<Ci)) or greater than (#(X>Ci)) each candidate price point.

If you’d like to find the median, then let α=0.5. Candidates that evenly divide the dataset will have the greatest possible score of zero, because the two terms will cancel each other out. The more unevenly that a candidate divides the dataset, the more negative the score will become.
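For example, under the assumption that the dataset is the array [1, 2, 3, 4, 5, 6], the candidate 3 splits the data evenly and scores zero, while candidates far from the median score more negatively:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
candidates = np.array([0, 3, 6])
print(score_quantile_discrete(x, candidates, alpha=0.5))
# the median candidate (3) scores 0; the extreme candidates (0 and 6) score -3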

The scoring function is stable because, when the distance between the input datasets is bounded, the distance between the output vectors is also bounded.

$$\max_{x, y} \|F(x) - F(y)\|_\infty \le \max(|\alpha|, |1-\alpha|) \cdot d_{in}$$

This bound skips over many steps for brevity, but if you were to substitute the definition of F, and then consider the cases of adding or removing a single record from the dataset, you would find that the bound holds.

A Private Mechanism on l∞ Sensitivity

The exponential mechanism selects the candidate with likelihood proportional to the exponentiated score. In other words, the likelihood of selecting the ith candidate is proportional to the following:

$$L(i) \propto \exp\left(\frac{\tau \cdot x_i}{2\Delta F}\right)$$

The τ parameter acts like an inverse temperature: the larger τ is, the more likely the mechanism is to select candidates with higher scores. With the normalization term, the probability of selecting the ith candidate is:

$$\Pr[Y = i] = \frac{\exp\left(\frac{\tau x_i}{2\Delta F}\right)}{\sum_{j=1}^{N} \exp\left(\frac{\tau x_j}{2\Delta F}\right)}$$

To show that this is private, again appeal to the definition of pure differential privacy.

$$\epsilon \ge \ln\frac{\Pr[Y = T]}{\Pr[Y' = T]}$$

The exponential in the likelihood lends nice utility guarantees. Relatively small distances between scores can result in a much higher probability of selecting the better candidate.

When you substitute the probability of selecting candidate i, you get the following:

$$\ln\frac{\Pr[Y = i]}{\Pr[Y' = i]} = \ln\frac{\exp\left(\frac{\tau x_i}{2\Delta F}\right) \Big/ \sum_{j=1}^{N}\exp\left(\frac{\tau x_j}{2\Delta F}\right)}{\exp\left(\frac{\tau x'_i}{2\Delta F}\right) \Big/ \sum_{j=1}^{N}\exp\left(\frac{\tau x'_j}{2\Delta F}\right)} = \frac{\tau (x_i - x'_i)}{2\Delta F} + \ln\frac{\sum_{j=1}^{N}\exp\left(\frac{\tau x'_j}{2\Delta F}\right)}{\sum_{j=1}^{N}\exp\left(\frac{\tau x_j}{2\Delta F}\right)} \le \frac{\tau}{2} + \frac{\tau}{2} = \tau$$

Since the math simplifies to ε ≤ τ, we’ll substitute τ=ϵ in the Python implementation. One thing to be mindful of is that the exponentiation can run into numerical issues. The following script, implementing the discrete exponential mechanism, shifts the scores so that none are positive in order to improve numerical stability.

import numpy as np

def mechanism_exponential_discrete(x, candidates, epsilon, scorer, sensitivity):
    """Return an `epsilon`-DP sample from the set of discrete `candidates`.

    :param x: 1d dataset for which to release the estimate
    :param candidates: a list of candidate values to select from
    :param epsilon: privacy parameter
    :param scorer: accepts `data` and `candidates` and returns scores
    :param sensitivity: the greatest that `scorer` can change
    :returns a sample from `candidates` with prob proportional to `scorer`
    """

    # score each candidate
    scores = scorer(x, candidates)

    # for numerical stability; omitting this line gives same probabilities
    scores -= scores.max()

    # compute likelihood of selecting each candidate
    likelihoods = np.exp(epsilon * scores / (2 * sensitivity))

    # normalize to a probability
    probabilities = likelihoods / likelihoods.sum()

    # sample one index wrt the selection probabilities
    cdf = probabilities.cumsum()
    index = np.argmax(cdf >= np.random.uniform())

    # return the respective candidate
    return candidates[index]
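As a usage sketch, here is a hypothetical ε = 1 release of a median price point, assuming the score_quantile_discrete function above with its default α = 0.5 and assuming any one buyer contributes a single valuation (din = 1), for which the score sensitivity is max(α, 1 − α) = 0.5:

import numpy as np

prices = np.random.uniform(0, 100, size=1000)     # hypothetical buyer valuations
candidates = np.arange(0, 101, 5)                 # candidate price points

dp_median_price = mechanism_exponential_discrete(
    prices, candidates, epsilon=1.0,
    scorer=score_quantile_discrete, sensitivity=0.5)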

As is shown, this mechanism privately selects the single candidate with approximately the highest score. Together with the scorer transformation, since each function is stable, the composed function is also stable. Starting from a bound on the distance between the input datasets, you can bound the distance between score vectors (the sensitivity). Then, you can use this sensitivity to bound the distance between the probabilities of selecting each candidate. This gives the final bound on privacy.

Relating Distances

The previous sections have shown how to construct differentially private algorithms from stable functions, under the framework of relating distances between metric spaces. This perspective is particularly powerful, because it unifies many concepts of differential privacy under a single framework. You have the power to isolate stable functions mapping from one metric space to another, where each metric space represents common differentially private concepts-- like unbounded-DP, bounded-DP, function sensitivity, and restricted sensitivity.

The system allows you to reason about the stability, or privacy, of any subset of functions in a differentially private algorithm. This can be useful in many situations, like when you need to start your reasoning about the closeness of two datasets after already applying a dataset transformation in an external system, or when you need to tailor an existing differentially private algorithm to your use-case.

There are many more examples of stable functions that can be used to construct differentially private algorithms, which will be shown in the following chapters.

Exercises

  1. Similar to the double function discussed earlier in the chapter, consider the following function that duplicates each record in a list:

def duplicate(x):
    assert isinstance(x, list)
    return x * 2

Under what metric could you argue that this is a stable function? What would the stability of this transformation be?

  2. If you don’t know the size of a dataset, are there other domain descriptors you can use to reduce the sensitivity of a mean?

  3. Is it possible to transform a dataset with an unknown number of records to a dataset with a known number of records? What would the stability of this transformation be?

The following questions involve dataset domains corresponding to other kinds of datasets, such as those consisting of text or graphs.

  4. You have a graph dataset where each individual can contribute one node. Show that the count of nodes in a graph is a stable transformation from the domain of graphs to the domain of integers.

  5. You have a dataset consisting of tweets and want to conduct sentiment analysis. Can you claim that a transformation from the domain of tweets to a categorical domain of sentiments (where each individual is assigned one sentiment) is stable?

  6. Recall that a dataset domain may consist of any medium of data. Take a moment to visualize the domains of datasets you may want to conduct a DP analysis over. What transformations would you want to apply to them? Are these transformations stable? Are there domain descriptors that would make them stable? If so, are there transformations that would establish these descriptors?
