
Jane Bambauer and Krish Muralidhar have pushed a blog post responding to criticisms of their "Fool's Gold" article. Their response has roughly the same density of technical errors as the Fool's Gold paper itself, so I thought I would go through it and comment on the points it makes.

To keep you interested, I'm actually going to agree with them on one point (!!!!).

Introduction

The post starts with some background about differential privacy, their article, and then the responses to it. For example,

... a few other computer scientists have written hostile critiques that aim primarily to impugn our intelligence and honesty rather than engaging with our arguments on the merits.

I think the authors have the causality backwards here. At least in the post I wrote, their arguments were engaged on their merits and found (by me) to be lacking. The authors' subsequent attempts to explain their inaccurate statements are what lead to questions about their intelligence or honesty (or both!).

The authors are right that I could have skipped calling them out on their shortcomings, but which would you prefer: civil but dishonest discourse, or candid and honest discourse? The Fool's Gold paper has a position (the clue is in the name): it calls out several people by name for not understanding apparent issues with differential privacy.

The not-so-subtle subtext is “don’t listen to these idiots. They are bad people.”

Don't listen to these idiots. They are bad people. In case my position was unclear.

I still haven't determined if they are "bad" in the moral sense of actively malicious, or "bad" in the technical sense of "do not understand that some statements are true and some are false and that we usually prefer the former in scientific writing". In either case, they are bad people.

Emails and personal conversations have quietly confirmed that data managers understand the significant limitations of pure Differential Privacy and have had to stick with other forms of statistical disclosure controls that have fallen out of vogue.

There absolutely are limitations to differential privacy, and they do have significance. They are not, however, the limitations presented by the authors. I think that pushing on these limitations is a great thing for research to do, but pointing at non-limitations and saying "it can never work" is not helpful. Identify the actual limitations, determine if they are fundamental (often they are not), and accomplish science. I dare you.

Also, I'm guessing that "have fallen out of vogue" is code for "have substandard privacy guarantees". No one is talking about differential privacy because it is easy to use; it is interesting because it provides real and meaningful privacy guarantees. If the "out of vogue" approaches did this satisfactorily, I wouldn't be here.

And just to be totally clear, I don't have any issues with people who have trouble using differential privacy. It is not yet easy to use. I would like to make it easier to use. Until that happens, keep using the tools you understand. However, if you have any interest in the world becoming a better place, the right way to do this is to engage rather than dissemble.

Responses

As we will discuss, the apparent limitations of differential privacy are in fact limitations in the authors' understanding of differential privacy. We'll review their four responses, and see that responses (2) and (4) present fundamental misunderstandings of differential privacy's limitations. Response (1) is rhetorical nonsense, and I agree with response (3).

The heading of each subsection is a quote from their post, each a hypothetical position taken by people like me. Some are actual positions (2, 3, and 4), while (1) is just silly talk. With that...

Response 1: "Differential Privacy should destroy utility—that’s what ensures privacy."

Differential privacy should limit utility; that is what ensures privacy. Whether it "destroys" utility depends on whether your view of utility is "distinguishing between datasets differing in one record". If that is your view of utility, it is fundamentally at odds with differential privacy. However, large swathes of statistics are expressly not interested in this view of utility.
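For reference, the guarantee in question is the standard definition of ε-differential privacy: a randomized mechanism M satisfies it if, for all pairs of datasets D and D' differing in one record, and all sets S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].
```

The guarantee constrains only how well any output can distinguish D from D'; it says nothing directly about how accurately M reflects aggregate properties of the whole dataset, which is where those large swathes of statistics live.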

McSherry therefore seems to be in agreement with us that if we want open access to research data that yields useful insights, Differential Privacy is not an option.

This statement is categorically false. The swathes of statistics mentioned above may still be applied, yielding useful insights. The inability to do one class of analysis, namely accurate analysis of small populations, does not magically invalidate the utility of techniques explicitly designed to prevent that class of analysis. Moreover, the existence of privacy controls only increases the amount of data we can open to public analysis; the whole point of doing this research is to create safe query mechanisms so that they can be used more broadly.

McSherry’s remarks show that the disagreement between us is not technical at all, but philosophical.

This quote highlights the value of getting one's quantifiers correct, an important part of speaking accurately about things.

The disagreement between us is both technical and philosophical. Mostly it is a technical disagreement, as we will see in more detail. I don't insist that anyone actually use differential privacy, but I do require that they speak accurately and honestly about it. This latter part is apparently where we disagree philosophically.

Response 2: "We exaggerate the damage that Differential Privacy does to utility"

This response starts with the issue that taking the average of unbounded (the authors say "skewed" but skew is irrelevant here) variables can make differential privacy's bounds look bad. This is true, but an identical complaint holds for the law of large numbers: taking an average can be an arbitrarily inaccurate statistical estimator if you increase the range of values in the distribution. The normal response is that, holding the distribution fixed, as you increase the number of samples the accuracy increases. The same response applies for differential privacy.

The central limit theorem and the Laplace mechanism have different bounds on their variance (the latter starts larger but improves faster with samples), but the authors' complaint applies to both. I suspect they don't mean to claim that averages also "destroy utility", or whatever it is they think.
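To make the comparison concrete, here is a minimal sketch in Python of the two error sources for a mean of values assumed to lie in [0, 100], with ε = 0.1 (both numbers are illustrative assumptions, not values from either post). The sampling error shrinks like 1/√n, while the Laplace mechanism's noise shrinks like 1/n.

```python
import numpy as np

def dp_mean(samples, lo, hi, epsilon, rng):
    """Release a differentially private mean of values assumed to lie in [lo, hi].

    Under the replace-one-record model, changing one value moves the sum by at
    most (hi - lo), so the mean moves by at most (hi - lo) / n, and Laplace
    noise with scale (hi - lo) / (n * epsilon) gives epsilon-DP.
    """
    clipped = np.clip(np.asarray(samples, dtype=float), lo, hi)
    n = len(clipped)
    return clipped.mean() + rng.laplace(0.0, (hi - lo) / (n * epsilon))

rng = np.random.default_rng(0)
lo, hi, epsilon = 0.0, 100.0, 0.1                         # illustrative assumptions

for n in [100, 1_000, 10_000, 100_000]:
    data = rng.uniform(lo, hi, size=n)
    sampling_err = data.std() / np.sqrt(n)                # ~ 1/sqrt(n): error of the mean itself
    laplace_err = np.sqrt(2) * (hi - lo) / (n * epsilon)  # ~ 1/n: std of the added noise
    print(n, round(sampling_err, 3), round(laplace_err, 3),
          round(dp_mean(data, lo, hi, epsilon, rng), 3))
```

For small n the added noise dominates, which is the regime the authors' examples live in; as n grows the noise becomes the smaller of the two error terms.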

I had previously proposed that the authors use the median when analyzing non-Gaussian data (for which the mean may be a poor estimator). They express concern that the analyst might not know when to use the median versus the mean, which is a pervasive issue in statistics generally, even ignoring issues of privacy; I can't fix that for them.

In any case, if we had taken McSherry’s suggestion and analyzed the median instead of the mean, the noise would have been worse. It would have been 59 times worse, to be specific.

No, it wouldn't.

Recall that the noise added to meet Differential Privacy is based on “global sensitivity.”

No, it isn't.

Adding noise based on global sensitivity is one possible approach, and as has been REPEATEDLY explained to the authors of the post, it is neither the only nor the best approach. The authors then quote from and link to a paper about local sensitivity, which is literally about how to compute values like the median with more accuracy than using global sensitivity. The authors of the post have quoted the motivation from the paper, that global sensitivity can be too large, without revealing the main contribution of the paper, identified in its abstract (the next two quotes are from that paper, not the blog post), that there are alternatives:

NRS09: Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself.

and that

NRS09: We show how to do this efficiently for several different functions, including the median [...]

The linked paper quite literally addresses the problem the authors claim is fundamental to differential privacy. Seriously.
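For readers keeping score, the distinction at issue is between the global sensitivity of a function f and its local sensitivity at a particular dataset D (standard definitions, written here for context, with D ~ D' meaning the two datasets differ in one record):

```latex
GS_f = \max_{D \sim D'} |f(D) - f(D')|,
\qquad
LS_f(D) = \max_{D' \sim D} |f(D) - f(D')|.
```

The calculation in the authors' post calibrates noise to GS_f, which for the median is the full data range. The NRS09 framework they quote instead calibrates noise to a smooth upper bound on LS_f(D), which for a median with data concentrated near it can be far smaller, with differential privacy maintained throughout.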

Returning to quotes from their post:

So, consider ourselves corrected. Differential Privacy will actually perform worse than we say it will when non-idiots submit a proper query.

No, it won't.

This is followed by some text about the alternative I explained for computing correlation coefficients, "but his objection is brief and perplexing." I do believe the authors were perplexed. In such a case, it is reasonable to ask for clarification, as Jane Bambauer did.

[Image: a question]

I provided a citation, literally the first paper on differential privacy, which we are led to believe has perplexed the authors.

[Image: an answer]

Perhaps it was the oblique reference, or maybe the paper itself, that perplexed the authors. They don't say. They didn't ask any follow-up questions either. Despite not understanding the technique, the authors proceed to conclude

McSherry’s technique leaks information about the data set.

No, it doesn't. At least, not in a way that violates differential privacy. When accurate, the result does reveal that there is a large underlying population, which need not be a violation of differential privacy. The paper mentioned above, Calibrating Noise to Sensitivity in Private Data Analysis, has a proof that differential privacy is maintained; if the authors would like to poke holes in the proof they are welcome (and have been repeatedly invited to do so; they haven't). Rather, the authors repeatedly appeal to prose arguments based on their intuitive understanding of differential privacy. Please stop. The mathematics exist for a reason.
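To give a flavor of how such a mechanism can be both private and accurate, here is a minimal sketch, in Python, of estimating a correlation coefficient from noisy sufficient statistics. This is a standard construction rather than necessarily the exact technique from the cited paper; the [0, 1] clipping range and the even budget split are assumptions for illustration.

```python
import numpy as np

def dp_correlation(x, y, epsilon, rng):
    """Estimate the Pearson correlation of x and y, each assumed clipped to [0, 1].

    Releases the five sufficient statistics (sum x, sum y, sum x^2, sum y^2,
    sum x*y) via the Laplace mechanism, splitting the privacy budget evenly.
    Under the replace-one-record model n is public and each sum moves by at
    most 1 when one record changes, so each gets Laplace noise of scale
    5 / epsilon.
    """
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    y = np.clip(np.asarray(y, dtype=float), 0.0, 1.0)
    n = len(x)                                            # public under this model
    sums = np.array([x.sum(), y.sum(), (x * x).sum(), (y * y).sum(), (x * y).sum()])
    sx, sy, sxx, syy, sxy = sums + rng.laplace(0.0, 5.0 / epsilon, size=5)
    cov = sxy / n - (sx / n) * (sy / n)
    varx = max(sxx / n - (sx / n) ** 2, 1e-12)            # guard against noise-driven negatives
    vary = max(syy / n - (sy / n) ** 2, 1e-12)
    return float(np.clip(cov / np.sqrt(varx * vary), -1.0, 1.0))
```

The noise has constant scale while the sums grow with n, so for large populations the estimate is accurate; an accurate answer thereby "reveals" that a large population is present, which, as noted above, is not a violation of differential privacy.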

For interested persons, the code for the examples has been available for months. In the authors' defense, it was not tweeted at them so perhaps they were unaware.

Response 3: "We overlook the evolution in Differential Privacy that has permitted more data utility."

This part responds to complaints by Anand Sarwate that there are other related privacy definitions that might have more accuracy. The authors complain that this is a bait-and-switch tactic, and I completely agree. In a hypothetical world where differential privacy was uniformly bad, a response of "there are other options" wouldn't invalidate a complaint about differential privacy itself. I don't think we are in that world, but I support the authors pushing back on this defense.

Response 4: "There are methods other than adding LaPlace noise that satisfy pure Differential Privacy."

Who taught you how to capitalize French things?

Finally, we have occasionally received feedback that the Differential Privacy method we illustrate in our paper—adding Laplace noise—is not the only method that satisfies the pure Differential Privacy standard.

I know, right? It's almost like you get told this every time you write about differential privacy. It's probably just a coincidence, lol!

Yet these other methods would have to add at least as much uncertainty as Laplace noise does, since the privacy guarantees and the Laplace curve are identical.

Here we come to a critical distinction:

  • uncertainty about the specific value of an individual record.
  • uncertainty about the value of a statistic on the whole collection.

These are different, see? Differential privacy requires a standard for the first one, and you are right that other methods would need to maintain that uncertainty. It emphatically does NOT require techniques to have the same uncertainty for the second quantity. Your readers might be led to believe that the statistical accuracy of other methods must be the same as that of the Laplace mechanism; it is for reasons like this that I call you naughty names.

At the risk of repeating myself, and others apparently, a large number of mechanisms that are not the Laplace mechanism can provide substantially more accurate estimates of statistics than naïve application of the Laplace mechanism. Remember that paper you linked to, about smooth sensitivity? Remember the correlation coefficient example I showed you the accuracy curves for? Remember the exponential mechanism, mentioned in all correspondence with you? Each of these is an example where the privacy guarantee is maintained, but the statistical accuracy changes (usually improving significantly, in its corresponding context).
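As a concrete instance of the last of these, here is a minimal sketch, in Python, of releasing a median via the exponential mechanism. The [0, 1] data range and the grid of candidate outputs are assumptions for illustration; the point is that accuracy depends on how the data crowd around the median, not on the full data range, while the ε-differential privacy guarantee is unchanged.

```python
import numpy as np

def dp_median(samples, epsilon, rng, candidates=None):
    """Release an approximate median of values assumed to lie in [0, 1],
    using the exponential mechanism.

    Utility of a candidate c is -|#{x <= c} - n/2|. Under the
    replace-one-record model this changes by at most 1 when one record
    changes, so the utility has sensitivity 1 and we can weight each
    candidate by exp(epsilon * utility / 2).
    """
    x = np.clip(np.asarray(samples, dtype=float), 0.0, 1.0)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 1001)          # assumed output grid
    ranks = np.searchsorted(np.sort(x), candidates, side="right")
    utility = -np.abs(ranks - len(x) / 2.0)
    scores = (epsilon / 2.0) * utility
    probs = np.exp(scores - scores.max())                 # subtract max for numerical stability
    probs /= probs.sum()
    return float(rng.choice(candidates, p=probs))
```

Candidates far from the median in rank are exponentially unlikely to be chosen, so when many points sit near the median the released value is close to it, regardless of how wide the underlying data range is.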

Conclusion

In summary, our original paper has stood up quite well to criticism.

In summary, it really hasn't.

I think it is important to distinguish between two classes of differential privacy skeptic: there are cranks like the authors, who misrepresent their subject to score points, and there are honest analysts who struggle to get sufficient accuracy out of known tools and techniques. The research space currently suffers from too few practitioners advancing (or delineating) the bounds of differential privacy; we have a volume of esoteric theoretical research, but relatively little work translating it (or even simpler results) into practice as clear and usable tools.

For example, I still think that the work done with Davide Proserpio and Sharon Goldberg, which automatically provides privacy guarantees and does MCMC-style probabilistic inference, has a lot of cool potential. Given that I've recently re-done a bunch of the infrastructure in Rust, perhaps I should port this over as well and start supporting it.