This repository has been archived by the owner on Jan 10, 2019. It is now read-only.

Cache tracker lookups #231

Open · wants to merge 41 commits into base: beta
Conversation

Collaborator

@laurengarcia laurengarcia commented Dec 6, 2017

Reviewer:

Description:

This is the async version for Firefox. For Chrome we'll need to port the onBeforeRequest handler to being all sync. This PR caches tracker lookup data by domain (it caches no-block as well as block decisions). So if 'http://foo.com' is found to be a tracker on 'http://bar.org', that lookup is cached and used again on subsequent tries anywhere on 'http://bar.org'; the inverse is also true -- if 'http://baz.org' is found not to be a tracker on 'http://bar.org', that no-block decision is also cached (as long as tracker blocking is turned on -- we don't cache anything when the trackers setting is off). I don't think we need to cache lookups against individual URLs, but please correct me if I'm wrong -- that's easy enough to change.

One of the assumptions of this PR is that caching by domain/host is granular enough (vs. caching by specific url). So if a block/noblock decision on a 3rd party request is made on http://foo.com/bar, it is cached for all subsequent requests on foo.com.
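As a rough sketch of that assumption (all names here are illustrative, not the PR's actual API), the cached decision can key on the first-party host rather than the full page URL:

```javascript
// Hypothetical sketch of domain-scoped caching; `cacheDecision`,
// `checkCachedDecision`, and `cache` are illustrative names, not from the PR.
const cache = new Map()

function getCacheKey (currLocation, reqUrl) {
    // scope the decision to the first-party host, not the full page URL
    const host = new URL(currLocation).hostname
    return `domain=${host},reqUrl=${reqUrl}`
}

function cacheDecision (currLocation, reqUrl, cancel) {
    cache.set(getCacheKey(currLocation, reqUrl), { cancel })
}

function checkCachedDecision (currLocation, reqUrl) {
    const key = getCacheKey(currLocation, reqUrl)
    // returns {cancel: true|false} on a hit, null on a miss
    return cache.has(key) ? cache.get(key) : null
}
```

With this key shape, a block/no-block decision made on http://foo.com/bar applies to every subsequent page on foo.com, which is exactly the granularity trade-off described above.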

Also added some perf features to the screenshot testing tool (a new checkbox for perf-only measurements which outputs load time and memory usage).

Also added a flag to the background process called 'debugTrackerTimer' that times individual tracker lookups and outputs them to the console in the background process.
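A minimal sketch of what such a timing flag might look like (the flag name comes from this PR; the surrounding code is assumed):

```javascript
// Assumed shape: wrap a lookup and log elapsed time when the debug flag is on.
const settings = { debugTrackerTimer: true }  // stand-in for the real settings store

function timedTrackerLookup (lookupFn, reqUrl) {
    if (!settings.debugTrackerTimer) return lookupFn(reqUrl)
    const start = Date.now()
    const result = lookupFn(reqUrl)
    console.log('tracker lookup for ' + reqUrl + ' took ' + (Date.now() - start) + 'ms')
    return result
}
```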

Steps to test this PR:

  • Pull this branch down, $ npm run build
  • Install fresh in browser

Automated tests:

  • Unit tests
  • Integration tests
Reviewer Checklist:
  • Ensure the PR solves the problem
  • Review every line of code
  • Ensure the PR does no harm by testing the changes thoroughly
  • Get help if you're uncomfortable with any of the above!
  • Determine if there are any quick wins that improve the implementation
PR Author Checklist:
  • Get advice or leverage existing code
  • Agree on technical approach with reviewer (if the changes are nuanced)
  • Ensure that there is a testing strategy (and documented non-automated tests)
  • Ensure there is a documented monitoring strategy (if necessary)
  • Consider systems implications

@laurengarcia laurengarcia removed the WIP label Dec 11, 2017
if (!settings.getSetting('trackerBlockingEnabled')) return
if (isFirstPartyRequest(currLocation, reqUrl)) return
if (cache.size > 3000) {
    cache.delete(cache.keys().next().value)
}


I'm not sure about the implementation but this looks super inefficient. Like, you'll reach 3000 and then this will be called for every single URL from then on out.

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


I thought about this, too, at the outset of this task and wrote a quick script you can copy/paste into the browser and futz around with. It's a timer I wrote to make sure that Map would be just as efficient as a POJO (plain old JavaScript object), and indeed it seemed to be. You'll notice that the very first key added and retrieved from the cache (WITHOUT the delete operation) takes ~2ms total. From iterations 2-2999 on Chrome it gets incredibly efficient: the whole operation (without the delete) takes ~0.04-0.06ms. Once the delete operation is added to the mix after iteration 3000, on my machine all of those operations -- both adding the cache key and then retrieving it -- take ~0.25ms total.

Firefox's performance is slightly slower. Again, the very first key added to the Map and its retrieval takes about 2ms, and from iterations 2-2999 I'm seeing it take about ~0.2-0.6ms. Once the delete operation is added to the mix after iteration 3000, I'm seeing it take ~0.4-0.8ms.

This seems pretty efficient to me, and much faster than the raw tracker lookups (a fast tracker lookup on my machine is in the ~3-5ms range, and the slow ones can take 50ms+). The script that I used is below, feel free to copy/paste into the console of any browser and see how it looks to you if you like. Am open to any further comments/observations, but bear in mind that I took this into consideration quite a bit.

var cache = new Map()
var counter = 0

function addToCache (key, cancel) {
    if (cache.size >= 3000) {
        console.log('DELETE')
        cache.delete(cache.keys().next().value)
    }
    cache.set(key, cancel)
}

function isCached (url) {
    return new Promise((resolve, reject) => {
        var cachedValue = cache.get(url)
        resolve({cancel: cachedValue})
    })
}

for (var i = 0; i <= 3500; i++) {
    setTimeout(function () {
        // capture a local label so the async timeEnd call matches its timer
        var label = '' + (++counter)
        var r = () => (Math.random() * 1000000000000000000000).toString()
        // key length is 64 characters, same length as a SHA256 hash
        var key = r() + r() + r() + 'f'

        // time the add + retrieve round trip, output to console
        console.time(label)
        addToCache(key, true)
        isCached(key).then(() => console.timeEnd(label))
    }, 50)
}

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


Thinking about it more -- one improvement that could be made here: the .addToCache() step can be made non-blocking -- meaning nothing that happens in the onBeforeRequest handler after .addToCache() is called depends on it having run -- so I can actually speed this up even further by returning a promise immediately from .addToCache(). Even though we won't do anything with the promise(!), we'll just use it to make .addToCache() return immediately, which in turn means that the onBeforeRequest() handler's tracker lookup step will return faster, too. https://github.com/duckduckgo/chrome-zeroclickinfo/pull/231/files#diff-2650fb3e14b7f42d68a8766269c54434R141
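A sketch of that fire-and-forget idea, assuming the cache is a plain Map (names illustrative): addToCache returns immediately and performs the insertion on a microtask, so the handler never waits on it.

```javascript
// Fire-and-forget sketch: the returned promise is ignored by the caller;
// its only purpose is to defer the Map work off the handler's critical path.
const cache = new Map()

function addToCache (key, cancel) {
    return Promise.resolve().then(() => {
        if (cache.size >= 3000) {
            // evict the oldest entry; Map iterates keys in insertion order
            cache.delete(cache.keys().next().value)
        }
        cache.set(key, cancel)
    })
}
```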


Good to know.

This 3000 spans all tabs, right? What about the idea of one cache per page? You could delete the whole thing when the page finishes loading, which would also address your concern about storing history.

Also -- do we know the order of cache.keys()? Will there be cases where, after 3000, we're not caching anything (or only partially caching) on subsequent pages? For example, the 3001st entry is my first image URL on amazon.com, and it so happens that the cache.keys().next() call returns this, so nothing ends up getting cached on amazon?
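A rough sketch of the per-page idea (the tab wiring is assumed; the WebExtensions listeners named in the comment are where the cleanup call would plausibly live):

```javascript
// One cache per tab; drop the whole Map when the page finishes loading
// (or the tab closes). Wiring to WebExtensions events is assumed.
const tabCaches = new Map()  // tabId -> Map of cached block/no-block decisions

function getTabCache (tabId) {
    if (!tabCaches.has(tabId)) tabCaches.set(tabId, new Map())
    return tabCaches.get(tabId)
}

function clearTabCache (tabId) {
    // e.g. call from webNavigation.onCompleted or tabs.onRemoved listeners
    tabCaches.delete(tabId)
}
```

This also sidesteps the 3000-entry eviction question entirely: each per-tab Map stays small and is thrown away wholesale.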

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


Actually I love the idea of caching per page instead of in one huge glob that spans all tabs like this. I'm assuming you mean that we could eliminate the hashing operation if this were the case and we cached by page (or tab); that would be the superior design for sure(!). All along I've been weighing the performance pickup we'd get on sites with lots of tracker lookups vs. the drag of the caching operation on pages with lots of 3rd party requests that are not trackers. This also solves the history problem.

(The second concern you raised above is not valid if I'm understanding it correctly, because all the keys are hashed.)

// cached tracker lookups are scoped to a domain
const key = `domain=${utils.extractHostFromURL(currLocation)},reqUrl:${reqUrl}`
return new Promise((resolve, reject) => {
    utils.hashSHA256(key).then((hashedKey) => resolve(hashedKey))
})


  1. why are we double hashing again? security of the domain? are we sure that's really an issue?
  2. this is looking like you want to subclass Map with your special behavior of 3000 and rehashed keys.

Collaborator Author


  1. The crypto (aka double) hash is for privacy. Otherwise this is a record of user history in cleartext.
  2. Sure, I would consider that approach.

An alternate approach that would incorporate both your points above and improve the implementation -- which I was considering last week and will put on the table here -- make the cache its own class and abstract it away from this trackers module (which is publicly available as a global, meaning a user's history could theoretically be discovered by hashing the domain and tracker URL manually and querying trackers.isCached(key) from the console; a highly unlikely scenario but still plausible). Either way, I like the idea of subclassing Map/cache. I would like to leave the hashSHA256() func as a util so it can be repurposed in other places.

Collaborator Author


(... meaning that util.hashSHA256() would be wrapped as a convenience method of the Map subclass but also still available as util.hashSHA256() for other pieces of code that might want to utilize it)


  1. I'm just asking if we're really sure that's the case -- a real risk. We don't want to store such a thing offline anyway, and I think you'd want to prevent that from happening if it was, for performance reasons alone. Not saying don't do it; I'm asking how we know it's a real concern.

Collaborator Author


I like your suggestion re: caching per page, so that invalidates the discussion about history. Thankfully!


2 participants