Today's paper is "Differential Privacy as a Mutual Information Constraint", also available on YouTube, which discusses what the authors claim is an equivalent definition of privacy based on the well-understood concept of mutual information.
The definitions are not equivalent. At least, not in any valuable sense. Their definition fails to distinguish between the Laplace mechanism with parameter epsilon, with its nice relative guarantees on the probability of bad events, and the computation that releases each entry in the clear with probability epsilon.
The paper doesn't say anything patently false (well, some things), so what do we conclude about a not-obviously-incorrect technical result like this?
This paper led me to have a good think about why we have privacy definitions. As in, what is the good thing that a quantitative definition (like differential privacy) does for us in a technical setting? And, what makes one definition better, worse, or at least different from another?
I hope that by the end of the post we will have a better sense of this than I did, at least, when I started reading the paper.
I want to start with a walk through the abstract, which I think sets up several expectations about the paper, both positive and negative.
After introducing differential privacy (DP) as cool and stuff, the abstract presents its main contribution:
Herein we give an equivalent definition of privacy using mutual information that makes plain some of the subtleties of differential privacy.
Oh good! Remember these things: (i) their definition of privacy should be equivalent to differential privacy, and (ii) it will make plain some subtleties about it.
The mutual-information differential privacy is in fact sandwiched between
eps
-differential privacy and (eps
,del
)-differential privacy in terms of its strength.
This claim ends up being a bit of a doozy. We will have to come back to it, but it turns out the authors are using a surprising definition of "sandwiched between". As it turns out, in their terminology one could correctly write
The function
f(x) = exp(x)
is in fact sandwiched between the functionsf(x) = x
andf(x) = x + 1
.
If that seems like a surprising statement, good! Keep reading, and we will together sort out what the authors meant by their statement, and what implications this "sandwiching" actually has for the equivalence of privacy definitions. You can then determine which subtleties have been made plain.
The abstract concludes with some comments on how connecting mutual information and differential privacy might have benefits, which I don't doubt; I just doubt that the connection is even remotely as close as the authors present.
Let's start out with some definitions.
Fortunately, the paper has definitions of both differential privacy and mutual-information differential privacy together. Let's take a peek:
Ok, that wasn't super helpful, sorry. The definition of differential privacy has several other definitions you need to track down, and mutual-information differential privacy has I
, for mutual information, whose definition I hope you remember (click the link otherwise). The authors also use the definition of differential privacy based on Hamming distance rather than symmetric difference, which no one should be using (the authors do footnote the alternate definition, but everyone should just stop using the Hamming distance; please).
So, let's review the definitions of differential privacy. As we will need both eps
-differential privacy and (eps
, del
)-differential privacy, let's do the more general (latter) one and instantiate the former as an instance of it. I'll also try and explain the mutual information definition in terms that make more sense to me, at least. It turns out we won't actually get to the formal definition, because there are bigger problems.
Differential privacy is a property on randomized computations over multisets of input data. We say a randomized computation M provides (eps
, del
)-differential privacy if for any two multisets A
and B
with symmetric difference one (either A
is B
plus one record, or vice-versa) and any set S
of possible computation outputs:
Pr[M(A) in S] <= exp(eps) * Pr[M(B) in S] + del .
We say that a computation provides eps
-differential privacy if it provides (eps
, 0
)-differential privacy, that is del
is 0
.
These definitions relate the behavior of a computation M
when its input changes only slightly, by the addition or removal of one record. The intended intuition is that you, a potential participant in the data set, are assured that nothing (good, bad) is going to change much by your participation.
The definition of mutual-information differential privacy informally asks that, for all distributions over possible datasets, there not be a very strongly correlation between the distribution over each element and the computation output conditioned on all other elements. The definition asks that one should not learn much more about any one element of the input Xi
from the computation output Y
than one learns from seeing all elements in the dataset other than Xi
. We aren't actually going to use the specifics of the definition in our discussion, but as an exercise you can try and rewrite the mutual information definition as guarantees relating P[M(A) in S]
and P[M(B) in S]
.
Maybe wait until the end of the post to start that.
With the definitions in place, let's look at the main claim of the paper, helpfully stated by the authors in the introduction:
The claim herein is that MI-DP is sandwiched between these two definitions in the following sense: It is weaker than
eps
-DP but stronger than (eps
,del
)-DP. That is, a mechanism that satisfieseps
-DP also satisfieseps
-MI-DP. Similarly, ifeps
-MI-DP holds then (eps'
,del
)-DP also must hold, whereeps'
anddel
vanish aseps
goes to zero. In fact, the connection between MI-DP and (eps
,del
)-DP is an equivalence if either the domain or range of the query mechanism is a finite set.
Let me start with what I was hoping the authors were going to claim. I was hoping they would claim that
eps-DP < eps-MI-DP <= (eps, del)-DP
in the sense that the sets of computations satisfying each property are contained (perhaps strictly) as we go from left to right. That is, any mechanism that is eps
-DP is also eps
-MI-DP, and any mechanism that is eps
-MI-DP is also (eps
, del
)-DP.
Instead, we have something else entirely. What the authors show, in terms of containment of mechanisms, is that
eps-DP < eps-MI-DP <= (eps', del)-DP
Where you might notice that eps
does not appear on the right-hand side. The only constraint is that eps'
and del
should go to zero as eps
goes to zero. This is very, very different.
What if I told you that f(x) = exp(x)
is sandwiched between f(x) = x
and f(x) = x + 1
, because for each x
we can write
x < exp(x) <= x' + 1
where x'
goes to zero as x
goes to zero? For example, maybe x'
happens to be something like exp(x) - 1
.
What's your take? Would you agree f(x) = exp(x)
is indeed sandwiched between f(x) = x
and f(x) = x + 1
, and that exponential functions are "equivalent to straight lines"?
Are you ready to re-evaluate the claims that the authors have made about their "equivalent privacy definition"? How about the plainness of the subtleties?
I'm going to spoil the surprise and show you the relationship the authors prove:
eps-DP < eps-MI-DP <= (0, eps)-DP
where the <=
is =
in several cases (when the domain or range are finite). This meets the requirements that the eps'
and del
both go to zero as eps
goes to zero; it turns out that eps'
is even zero to start with.
It turns out that (0
, eps
)-DP is just statistical distance which in this case you could think of as differential privacy with additive eps
rather than multiplicative. The additive bound is much weaker than the multiplicative bound, because rare events stay rare with the multiplicative bound and they do not with the additive bound.
The statistical distance is a pretty well understood concept; it's used all the time in cryptography for example, and statisticians understand it quite well too. So why did we use multiplicative eps
rather than additive eps
when we did the differential privacy definition? Likewise with (eps
, del
)-differential privacy, why did we use two parameters rather than just lumping them together in del
?
We write definitions the way we do to create meaningful distinctions between those things that satisfy the definition and those that do not. The class of computations satisfying an additive eps
definition contains those satisfying the multiplicative eps
definition, but it contains a large amount of additional crap too. For example, here are two computations that the multiplicative definition distinguishes between, but the additive definition does not:
- The Laplace mechanism that adds to counts Laplace noise with parameter
1/eps
. - The mechanism that releases each entry in the database independently with probability
eps
.
One of these mechanisms clearly sucks. Maybe you think they both suck. I don't, but let's agree that the second one sucks for sure. Unless eps
is tiny, someone is going to get their data released. In fact, to make it unlikely that any entries are published we would need to take eps
so small that the resulting distribution wouldn't be much different from that of the computation run on the empty dataset.
Using (0
, eps
)-DP, if we want to rule out crappy mechanisms, we need to rule out all non-trivial mechanisms.
The value of the multiplicative definition of differential privacy is that it simultaneously (i) provides non-trivial protections against leaking individual records, and (ii) permits non-trivial computations on the data. The additive definition cannot do both at the same time.
What about (eps
, del
)-differential privacy, which has an additive term? Wouldn't this mean it is fundamentally flawed too? No, because it also has a multiplicative term. The multiplicative term (eps
) allows computations with vanishingly small additive terms (del
) to have non-trivial behavior. If the multiplicative term were always 1 (corresponding to eps = 0
), then we would be back in the world above where any setting of del
either permits disclosures or prevents utility.
The multiplicative term allows differential privacy definitions to distinguish between computations that are useful and protective, and computations that are neither. The strictly additive formulation, and apparently the mutual-information based definition, do not.
The paper "Differential Privacy as a Mutual Information Constraint" provides a new privacy definition, eps
-MI-DP, with no clear value (not to me, at least). It is not clear that there are values of eps
for which eps
-MI-DP simultaneously contains non-trivial mechanisms and excludes mechanisms that are likely to release user data in the clear.
Although eps
-MI-DP is related to existing differential privacy definitions, most precisely with the relationships
eps-DP < eps-MI-DP <= (0, eps)-DP
this shouldn't provide the MI-DP characterization with any more merit than we give to the (0
, eps)
-DP formulation (for which eps
either obscures the entire dataset or leaks records in the clear).
Privacy definitions may be "weaker" or "stronger" based on whether they contain strictly more or strictly fewer mechanisms, and we can write papers about that, but they are useful only when they simultaneously rule out mechanisms that disclose individual data and permit mechanisms that disclose aggregate data.