This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

How to ensure we capture all identifying information exposed to sites? #3

Open
englehardt opened this issue Sep 1, 2019 · 3 comments

Comments

@englehardt

We can measure the total amount of identifying information from simple APIs, such as window.screen.height or navigator.maxTouchPoints, by simply querying these properties across our user base.
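As a rough illustration of what "querying these properties across our user base" could yield, the sketch below computes the Shannon entropy of one property's observed values. The sample readings and the helper name `shannon_entropy_bits` are assumptions made up for this example, not real measurements or an existing tool.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample: one window.screen.height reading per user.
screen_heights = [1080, 1080, 768, 900, 1080, 768, 1440, 1080]
entropy = shannon_entropy_bits(screen_heights)
print(f"screen.height entropy: {entropy:.2f} bits")
```

A per-property figure like this ignores correlations between properties, so summing such figures would likely overestimate the joint entropy.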

The approach is less obvious for APIs like Canvas and WebGL, which extract entropy by rendering text and shapes to a canvas (this concern is not exclusive to those APIs). For such APIs, how can we ensure that we’ve chosen the input that extracts the maximum amount of identifying information? Past research has used a wide variety of inputs: e.g., Figure 2 in [0] and Figure 4 in [1].

Taking this a step further, identifying information may not be exposed directly by an API but rather by the structure of the API (i.e., the presence or absence of certain methods, the order of methods and properties in the object, the type of the return value of a method or property, etc.). These types of attacks are described in [2]. Section 5 of [2] shows that these attacks can be used to detect the browser (and its version), the OS, and the architecture. We could try to include these in our entropy measurements, but we would need some way to discover differences between browsers, and it seems like we’d need to recompute this with each new browser release.
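One way to picture a structural signal is to reduce the ordered list of an object's members and their types to a single digest, which could then be fed into an entropy measurement like any other value. The sketch below assumes the member lists have already been collected from browsers; the lists shown and the helper name `structure_signature` are hypothetical and not taken from [2].

```python
import hashlib


def structure_signature(members):
    """Digest of an ordered list of (member_name, type_name) pairs.

    Order is deliberately preserved, since enumeration order is itself
    part of the structural signal described above.
    """
    canonical = "\n".join(f"{name}:{type_name}" for name, type_name in members)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical enumerations of the same object in two browsers.
browser_a = [("userAgent", "string"), ("maxTouchPoints", "number"), ("share", "function")]
browser_b = [("maxTouchPoints", "number"), ("userAgent", "string")]

sig_a = structure_signature(browser_a)
sig_b = structure_signature(browser_b)
print(sig_a != sig_b)
```

Because the member lists change between releases, any table of known signatures would need to be regenerated for each new browser version.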

Lastly, it’s unclear how we can perform such a measurement (and enforce such a restriction) for information inferred through timing side channels (e.g., [3] [4]). Note that even if we enforce a budget on APIs that are designed to provide timing information (e.g., window.Performance), timing information can still be extracted indirectly, as summarized in this Tor Browser issue.

[0] https://hal.inria.fr/hal-01718234v2/document
[1] https://securehomes.esat.kuleuven.be/~gacar/persistent/the_web_never_forgets.pdf
[2] https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019_01B-4_Schwarz_paper.pdf
[3] https://cseweb.ucsd.edu/~kmowery/papers/js-fingerprinting.pdf
[4] https://www.ieee-security.org/TC/W2SP/2013/papers/s2p1.pdf

@JensenPaul

WRT choosing inputs for APIs like Canvas and WebGL, Chrome chose to let the web provide the inputs. Chrome calculates digests of operations performed on canvases (e.g. https://crrev.com/c/2185756) and digests of the resulting canvas, then measures entropy of outputs for identical inputs.
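A minimal sketch of the "entropy of outputs for identical inputs" idea: group observations by the digest of the operations performed, then compute the entropy of the output digests within each group. This is not Chrome's implementation; the digest strings, the data, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_per_input(observations):
    """Map each input digest to the entropy of the outputs seen for it.

    `observations` is an iterable of (input_digest, output_digest) pairs,
    one pair per canvas readback reported by a client.
    """
    outputs_by_input = defaultdict(list)
    for input_digest, output_digest in observations:
        outputs_by_input[input_digest].append(output_digest)
    return {key: entropy_bits(outs) for key, outs in outputs_by_input.items()}


# Hypothetical reports: the "ops-A" drawing sequence splits users into two
# equal groups, while "ops-B" renders identically for everyone.
observations = [
    ("ops-A", "out-1"), ("ops-A", "out-2"), ("ops-A", "out-1"), ("ops-A", "out-2"),
    ("ops-B", "out-9"), ("ops-B", "out-9"),
]
result = entropy_per_input(observations)
print(result)
```

Inputs that are rarely used in the wild would have few observations, so their estimates would be noisy and might need a minimum sample size before being trusted.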

Measuring the entropy of browser and OS can be done through server-side User-Agent header analysis and doesn’t require in-browser measurement. Measuring consumption of this information when it is implied by API structure (rather than queried directly) is more challenging, but such consumption is believed not to be widespread at present, as most browsers provide this information directly in User-Agent headers or the corresponding Client Hints. If consumption via implication becomes more widespread, Chrome can reassess consumption measurement. Querying APIs whose structure exposes the OS can be treated by the Privacy Budget similarly to querying the OS Client Hint.

Timing side channels are presently of less concern, as they are significantly less stable and require significantly more time to calculate. Their instability can also be exacerbated by inserting random delays to help prevent fingerprinting. Measuring consumption of identity implied from timings is more challenging, but this consumption is believed not to be widespread at present, as most browsers provide at least their name and CPU architecture (the pieces of information easiest to infer from timing side channels) in User-Agent headers. If consumption of these signals becomes more widespread, further measurement will be necessary.

@lknik

lknik commented Oct 12, 2020

@englehardt another tricky place is APIs that provide "changeable" output. Not necessarily speaking of the Battery Status API ;-) But you may imagine ambient light sensors in the same league. It seems these are special cases where the "theoretical" value span could be used.
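One possible reading of the "theoretical" value span is an upper bound on entropy derived from the number of distinct values an API can report, rather than from observed readings. The sketch below assumes that reading; the 101-value sensor is a made-up example, and `max_entropy_bits` is a hypothetical helper.

```python
import math


def max_entropy_bits(distinct_values):
    """Upper bound on entropy, in bits, for an output with N possible values.

    The bound log2(N) is reached only if every value is equally likely.
    """
    if distinct_values < 1:
        raise ValueError("distinct_values must be at least 1")
    return math.log2(distinct_values)


# Hypothetical sensor reporting a level from 0.00 to 1.00 in steps of 0.01,
# i.e. 101 possible readings.
upper_bound = max_entropy_bits(101)
print(f"at most {upper_bound:.2f} bits per reading")
```

Since the reading changes over time, this bound says how much a single reading could reveal, not how long that information stays useful for re-identifying a user.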

@Vonda20

Vonda20 commented Feb 25, 2023

I'm worried about how I'm going to get this right.
