Stop recommending UUID for deviceId/groupId #682

Closed
dontcallmedom opened this issue Apr 7, 2020 · 17 comments · Fixed by #687
Labels: PR exists, privacy-needs-resolution (issue the Privacy Group has raised and looks for a response on)

Comments

@dontcallmedom
Member

Because UUIDs are unique, and until we have double-keyed storage widely available, having deviceId (and groupId) be UUIDs creates a tracking opportunity for getUserMedia() callers embedded as third-party iframes.

I think having deviceIds be simple monotonically incremented integers should be sufficient for our use case. I'm not sure how prescriptive we need to be here, beyond no longer recommending UUIDs as ids.

/cc @pes10k since we were discussing this while reviewing progress on privacy-related work on the spec

@dontcallmedom dontcallmedom added the privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response. label Apr 7, 2020
@youennf
Contributor

youennf commented Apr 8, 2020

Related to or a dupe of #607.

deviceIds be simple monotonically incremented integers should be sufficient for our use case

This has been discussed in the past.
If you are incrementing integers and try to persist them so that they can be used by web pages across navigations, these integers will end up becoming a tracker for those users that use non built-in camera/microphone devices. These integers might be unique to the user and not even tied to an origin.

If you do not have double-keyed storage, users are anyway being tracked with or without device Ids (given IDs are regenerated anytime website data like IDB is cleared). The spec now mentions double keyed storage, maybe the wording should be made stronger?

Note also that the spec now mandates device Ids to only be exposed after the page is capturing.
This makes them difficult to use as a tracker in practice. And once the page is capturing, device IDs are probably not the most personal information that is leaking to the page.

@dontcallmedom
Member Author

thanks @youennf for clarifying the current thinking on that particular sub-issue.

@pes10k - wdyt?

@pes10k

pes10k commented Apr 8, 2020

these integers will end up becoming a tracker for those users

I don't believe this is correct. If you are reusing the same identifiers across users, they're by definition not identifying that user :) They're only "identifying" when you already have an identifying key to join against (in which case, all bets are off).

The goal of making these less identifying is:

  1. to remove more foot guns for vendors who are managing storage for users (for privacy reasons) in ways that aren't easily compatible with the "all or nothing" definitions keyed off in the spec. The different implementations of Storage Access API do this to a degree (especially in storage upgrade cases), Safari's ITP does this in places, Brave does this in places, etc. There are places where storage is cleared / changed, and not having unique identifiers here makes it much easier to reason about the privacy boundaries in such cases (and to prevent the device ids from being the unique key to join tracking session data).

  2. In general, we should have a hard line about adding unique identifiers into the platform. This is the only case I'm aware of where this happens, and it would be very good to address it too.

the spec now mandates device Ids to only be exposed after the page is capturing

This is extremely good news and very appreciated :)

@jan-ivar
Member

jan-ivar commented Apr 16, 2020

If you are incrementing integers and try to persist them so that they can be used by web pages across navigations, these integers will end up becoming a tracker for those users that use non built-in camera/microphone devices.

I agree with @youennf here. E.g. everything may look benign at first:

Your USB camera, mic & speaker ids may be 0, 1 and 2. Your bluetooth headset 3 and 4.

Cut to 6 months later: Think of every device you've plugged into your system since then, even for a brief moment: conference room external speakers, a friend's camera, bluetooth headset of every family member. The "new id" counter may now be much higher.

Say you purchase a new USB camera, mic, and external speaker after trying a couple in a store, and you got a new bluetooth headset a couple of months ago. Their ids may now be 34, 35, 42, 14, 15. These bits may now help correlate you across origins you visit, because they'll be the same across all origins you visit (even in first-party pages, no iframes needed). They may not be enough to identify you uniquely, but along with other bits they might.

We should weigh that risk against today's origin-unique id, which is not correlatable across first party pages, and prevented from showing up in iframes by default, or may have its iframe storage partitioned in the near future (along with equally damaging JS-created ids in local storage).
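
To make the counter scenario above concrete, here is a minimal Python simulation. It is not browser code: the `GlobalCounterIds` class and the device names are hypothetical, and it assumes a single browser-wide counter whose ids persist per device. Under those assumptions, every origin sees the same integers, and their values reflect plug-in history.

```python
from itertools import count


class GlobalCounterIds:
    """Hypothetical allocator: one browser-wide counter, ids persist per device."""

    def __init__(self):
        self._next = count()
        self._ids = {}

    def id_for(self, device: str) -> int:
        # The first time a device is seen it gets the next integer, and keeps it.
        if device not in self._ids:
            self._ids[device] = next(self._next)
        return self._ids[device]

    def enumerate_devices(self, origin: str, attached: list) -> list:
        # The origin is ignored: every site sees the same integers.
        return [self.id_for(device) for device in attached]


browser = GlobalCounterIds()

# Devices plugged in at some point over the past months (illustrative names).
for past in ["usb-cam", "usb-mic", "usb-speaker", "bt-headset", "friend-cam", "conf-speaker"]:
    browser.id_for(past)

# Today only the headset and a newly bought camera are attached.
attached_now = ["bt-headset", "new-cam"]
site_a = browser.enumerate_devices("https://a.example", attached_now)
site_b = browser.enumerate_devices("https://b.example", attached_now)
```

Both `site_a` and `site_b` end up holding the same pair of integers, which is the cross-origin correlation being described; a per-origin identifier would differ between the two sites.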

@jan-ivar
Member

jan-ivar commented Apr 16, 2020

TL;DR: The bits will be based on your USB insertion/removal activity and shopping behavior, which may be quite unique. Sites may even deduce you've made a new purchase since last visit.

@guest271314

Because UUIDs are unique, and until we have double-keyed storage widely available, having deviceId (and groupId) be UUIDs creates a tracking opportunity for getUserMedia() callers embedded as third-party iframes.

How? What would be tracked?

@pes10k

pes10k commented Apr 20, 2020

In general, I understand the points that are being made that a UUID-based identifier isn't in all (many) cases a tracking vector on its own (we might disagree, but I understand the argument). What I'm not getting is any positive argument in favor of a UUID. If we can agree that UUIDs are (at best) a privacy risk that needs to be mediated through other means, why use them? At best it's a foot gun…

@jan-ivar
I take your point, and yes, using (say) just increasing ints would be its own privacy risk (still, way better than a UUID, but yes, not perfect to be sure). That was a straw proposal that's hung around since our conversations before TPAC, so I apologize for it. Here is a slightly less straw suggestion:

  1. device ids are integers, chosen at random from the range [0,255], w/o replacement
  2. if needed for web compat reasons, device ids can be packed into the same format as UUIDs (e.g. 25500000-0000-0000-0000-000000000000)
  3. device ids are in all other ways treated as the standard currently describes (dual-keyed on platforms that support it, reset on storage clear events, etc.).
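
The three steps above can be sketched roughly as follows. This is an illustration only: `SmallIdPool` and `pack_as_uuid` are hypothetical names, and the packing layout (three decimal digits followed by zero padding) is an assumption inferred from the single example value in step 2.

```python
import secrets


class SmallIdPool:
    """Hypothetical per-origin pool of ids drawn from [0, 255] without replacement."""

    def __init__(self):
        self._free = list(range(256))
        self._assigned = {}

    def id_for(self, device: str) -> int:
        if device not in self._assigned:
            if not self._free:
                raise RuntimeError("more than 256 devices seen for this origin")
            # Pick a random unused value and remove it from the pool.
            index = secrets.randbelow(len(self._free))
            self._assigned[device] = self._free.pop(index)
        return self._assigned[device]

    def reset(self) -> None:
        # Step 3: called when the origin's storage is cleared.
        self._free = list(range(256))
        self._assigned = {}


def pack_as_uuid(device_id: int) -> str:
    """Pack a small id into UUID shape (assumed layout: 3 decimal digits, then zeros)."""
    if not 0 <= device_id <= 255:
        raise ValueError("device id out of range")
    digits = f"{device_id:03d}".ljust(32, "0")
    return "-".join([digits[0:8], digits[8:12], digits[12:16], digits[16:20], digits[20:32]])


pool = SmallIdPool()
camera_id = pool.id_for("usb-cam")
print(pack_as_uuid(camera_id))
```

Zero-padding the decimal digits to a fixed width keeps values such as 2, 20 and 200 from packing to the same string.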

@guest271314

I don't think I fully understand your question, but the claim is that (i) having UUIDs when not needed is a bad practice in general, (ii) browsers do increasingly sophisticated and clever things to maintain and minimize storage for users (for privacy, among other reasons), and the idea of a single "clear storage" event increasingly doesn't exist, and (iii) it's important to not let one of these device IDs be the key used to rejoin sessions when cookies or other identifiers have been cleared.

@guest271314

@pes10k

I don't think I fully understand your question

This is what I was referring to:

(iii) it's important to not let one of these device IDs be the key used to rejoin sessions when cookies or other identifiers have been cleared.

The case is not clear. How exactly could or would that happen using only a deviceId or groupId?

If that were the case, and if I gather the gist of the theory correctly, any site where a MediaStreamTrack is used could get the deviceId or groupId and do exactly what with those strings?

@youennf
Contributor

youennf commented Apr 21, 2020

What I'm not getting is any positive argument in favor of a UUID.

I think this is historical. We probably all agree this is not a great model in general and we do not want new APIs to adopt this model.

The current approach is passive fingerprinting neutral, without any additional mitigation.
The additional mitigations we are talking about make it fully fingerprinting neutral. These mitigations are also much more urgently needed by other web technologies, and the cost of applying them to device IDs is very low.

I personally feel like we fixed this particular issue and would prefer focusing on other existing privacy issues, like device/track labels.

@jan-ivar
Member

What I'm not getting is any positive argument in favor of a UUID

Note the spec doesn't actually mandate a UUID, only a "generated unique identifier", so your idea would be compliant.

I don't actually know why the spec recommends UUIDs. I suspect most browsers use hashes, not an actual UUID generator. A positive argument for hashes, to answer your question, is efficient implementation (we store one key per origin vs. one id per device per origin).

The benefit of the large entropy is not worrying about collisions (though I agree it is probably larger than necessary in most browsers). If you have an algorithm with less entropy with the same storage needs that might be interesting.

But since the title of this issue is "Stop recommending UUID for deviceId/groupId", and since I haven't heard anyone defend the UUID recommendation specifically, I would actually be in favor of removing the recommendation.
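
The storage argument for hashes (one key per origin rather than one id per device per origin) can be illustrated with a small sketch. It is not taken from any browser: the `OriginKeyedIds` class is hypothetical, and HMAC-SHA-256 with a 32-byte random key is an assumption chosen only for illustration.

```python
import hashlib
import hmac
import secrets


class OriginKeyedIds:
    """Hypothetical store: one random key per origin, ids derived by hashing."""

    def __init__(self):
        self._keys = {}  # origin -> secret key; the only persisted state

    def _key(self, origin: str) -> bytes:
        if origin not in self._keys:
            self._keys[origin] = secrets.token_bytes(32)
        return self._keys[origin]

    def device_id(self, origin: str, raw_device_id: str) -> str:
        # The same device yields the same id on one origin, a different id elsewhere.
        digest = hmac.new(self._key(origin), raw_device_id.encode(), hashlib.sha256)
        return digest.hexdigest()

    def clear_site_data(self, origin: str) -> None:
        # Dropping the key rotates every deviceId for that origin at once.
        self._keys.pop(origin, None)


store = OriginKeyedIds()
before = store.device_id("https://a.example", "usb-cam-serial")
other_origin = store.device_id("https://b.example", "usb-cam-serial")
store.clear_site_data("https://a.example")
after = store.device_id("https://a.example", "usb-cam-serial")
```

No per-device record is needed: ids are recomputed on demand, and clearing an origin's key is enough to rotate them.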

@youennf
Contributor

youennf commented Apr 21, 2020

A positive argument for hashes, to answer your question, is efficient implementation (we store one key per origin vs. one id per device per origin).

Right, Safari is using that strategy.

I would actually be in favor of removing the recommendation.

I am fine with that too.

@pes10k

pes10k commented Apr 21, 2020

@youennf @jan-ivar Do you know the amount of entropy that goes into those per-origin seeds (that the hashes are generated from)? If it's not a huge number of bits, maybe that's the way to solve this issue.

Again, my main concern here isn't (mainly) situations where storage (and so device identifiers) is dual keyed; it's the majority of browsers where it isn't, where people use a variety of methods to try to reduce third-party storage (extensions, for example), and so where storage may rotate on a different interval than deviceIds. There, having a highly unique device id is a way of linking the storage-based identifiers together. (This covers a large number of people on Gecko- and Blink-based browsers, particularly privacy-conscious users.)

So, hashes instead of UUIDs sounds fine; but constraining the entropy of the input to those hashes would be a very useful step that seems like it could be some middle ground / way forward?
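
The trade-off between lower entropy and the collision concern raised earlier can be estimated with a short calculation. The sketch below assumes each deviceId is an independent, uniformly random value of `bits` bits; that is a simplification for illustration, not something the spec or any browser is known to do.

```python
def collision_probability(devices: int, bits: int) -> float:
    """Chance that at least two of `devices` ids collide among uniform `bits`-bit values."""
    space = 2 ** bits
    if devices > space:
        return 1.0  # pigeonhole: more devices than distinct values
    p_unique = 1.0
    for i in range(devices):
        # The (i+1)-th id must avoid the i values already taken.
        p_unique *= (space - i) / space
    return 1.0 - p_unique


for bits in (8, 16, 32):
    print(bits, collision_probability(10, bits))
```

An allocator that draws without replacement avoids collisions entirely at any width, at the cost of storing one id per device per origin.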

@youennf
Contributor

youennf commented Apr 21, 2020

where storage may rotate on a different interval than deviceIds

The specification mandates this rotation. Maybe there is a bug in some browsers?
Or this rotation mechanism might not kick in if extensions implement this clean-up purely by injecting JS that deletes all the databases (what about the HTTP cache though?).

So, hashes instead of UUIDs sounds fine; but constraining the entropy of the input to those hashes would be a very useful step that seems like it could be some middle ground / way forward?

This is fine to me as long as we do not add needless constraints to browsers implementing partitioning. This seems somehow hard to spec though.

@jan-ivar jan-ivar self-assigned this Apr 23, 2020
@pes10k

pes10k commented Apr 24, 2020

The specification mandates this rotation. Maybe there is a bug in some browsers?
Or this rotation mechanism might not kick in if extensions implement this clean-up purely by injecting JS that deletes all the databases (what about the HTTP cache though?).

Yes, you put it better than I could! Clearing / managing storage in practice is more complicated and non-binary than what the spec seems to imagine. Even setting aside possible bugs, there are ways of clearing storage (injected JS extension code is just one possible example) that won't (and couldn't, given the diversity of possible policies) be mapped into the browser as "storage clear". Privacy in depth really matters here; I don't mean this as a theoretical kind of concern.

This is fine to me as long as we do not add needless constraints to browsers implementing partitioning. This seems somehow hard to spec though.

I appreciate your point here, and again, am not requesting any particular mitigation. Whatever is easy enough to spec and prevents device ids from being unique / identifier-joining material would be terrific. I thought picking identifiers from [0,255] w/o replacement would be an easy-to-specify, privacy-preserving option, but if that's not the case, point taken.

I also appreciate that this issue is long, and I really don't mean to be throwing sand in the gears so close to transition; I really appreciate the work you all do! And I think you did a better job of stating the concern (the first quote above) than I managed to do in a TPAC meeting and a couple thousand rambling words above.

@jan-ivar
Member

The specification mandates this rotation. Maybe there is a bug in some browsers?
Or this rotation mechanism might not kick in if extensions implement this clean-up purely by injecting JS that deletes all the databases (what about the HTTP cache though?).

This seems like something browsers should fix. The sole purpose of all this is recognizing deviceIds in site storage. If browsers can detect sites without storage they should rotate deviceIds.

@pes10k

pes10k commented Apr 24, 2020

This seems like something browsers should fix. … If browsers can detect sites without storage they should rotate deviceIds

I'm not sure what you mean. There are an infinitely diverse number of reasons browser extensions will modify storage; increasingly browsers are doing so too. Sometimes they might delete all storage, sometimes they may delete or modify some storage values and not others.

My point is that the spec seems to imagine there are only two cases: a) browser clears all storage, b) storage as usual. There are many situations in between, and an increasing number, where browsers do storage-related interventions above nothing but below "clear everything", and in those cases having highly identifying identifiers is where the privacy harm occurs.

The two solutions I can see are to either a) be more specific about when deviceIds should be rotated, or b) make the deviceIds less identifying. I've been suggesting "B" since it seems the much easier of the two, but if you think "A" is the better path, that could work too.
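
The in-between case described above, where some storage is removed but deviceIds are not rotated, can be sketched as a toy simulation. Everything here is hypothetical: the `Tracker` class, the cookie values and the id strings are made up, and the sketch assumes the embedded script can read the deviceId at all (per the thread, the spec only exposes it once the page is capturing).

```python
class Tracker:
    """Hypothetical third-party script that logs every (cookie, deviceId) pair it sees."""

    def __init__(self):
        self._cookies_by_device = {}

    def observe(self, cookie: str, device_id: str) -> set:
        # Return every cookie value ever seen alongside this deviceId.
        seen = self._cookies_by_device.setdefault(device_id, set())
        seen.add(cookie)
        return set(seen)


tracker = Tracker()

# Visit 1: the tracker sees its cookie and a highly unique deviceId.
tracker.observe("cookie-A", "device-id-1")

# An extension deletes the cookie, but the browser does not treat this as
# "storage cleared", so the deviceId is not rotated.
linked = tracker.observe("cookie-B", "device-id-1")

# Had the deviceId rotated together with the cookie, nothing would join up.
unlinked = tracker.observe("cookie-C", "device-id-2")
```

Here `linked` contains both cookie values, so the two sessions are rejoined, while `unlinked` contains only the new one. If many users shared the same small id value, the join would be ambiguous rather than exact.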

@w3cbot w3cbot added privacy-needs-resolution Issue the Privacy Group has raised and looks for a response on. and removed privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response. labels Apr 30, 2020
@pes10k

pes10k commented Jun 30, 2020

I do not agree that #687 addresses this concern.

  1. That PR suggests identifiers around 32 bits in length. That is enough bits to identify ~4 billion devices. Why recommend so much identifiability when the common-case user will have < 10 devices on their machine? This seems like far more privacy risk than is warranted.
  2. I appreciate the new text describing the "lower-entropy alternative". However, since this is presented as an alternative (and not the main recommendation), it would be worth describing why this more privacy-friendly approach is not the main suggestion. The text says "storage", but that seems odd, given that the amount of storage needed is minuscule (every device identifier, for every site I've visited, would be less storage than it takes to store my cool Marge avatar).

Put differently, I appreciate that we disagree on how much privacy gain there is from using a less identifying device identifier, but I think it's hard to argue that there isn't at least some privacy improvement (for the reasons given in #682 (comment), among others). If the WG is going to recommend an approach that isn't the most privacy-preserving (and equally user-serving), I think it's important to say why, beyond a trivial storage difference.

(I'm not trying to draw out this disagreement, but I think it's important to fully explain the "why" here.)
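
The 32-bit figure in the comment above can be checked with a few lines of arithmetic. The helper name `bits_needed` is hypothetical, and the device counts are the ones mentioned in the thread (fewer than 10 devices, and the [0,255] range proposed earlier).

```python
def bits_needed(distinct_ids: int) -> int:
    """Smallest number of bits that can label `distinct_ids` different values."""
    if distinct_ids < 1:
        raise ValueError("need at least one id")
    return max(1, (distinct_ids - 1).bit_length())


ids_in_32_bits = 2 ** 32         # roughly 4 billion distinct identifiers
bits_for_ten = bits_needed(10)   # enough for a machine with up to 10 devices
bits_for_256 = bits_needed(256)  # the [0, 255] range
print(ids_in_32_bits, bits_for_ten, bits_for_256)
```

Under these assumptions, ten devices need 4 bits and the [0,255] range needs 8, against the 32 bits suggested in the PR.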

@jan-ivar jan-ivar added the privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response. label Oct 9, 2020
@w3cbot w3cbot removed the privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response. label Oct 13, 2020

7 participants