Cache tracker lookups #231
base: beta
Conversation
if (!settings.getSetting('trackerBlockingEnabled')) return
if (isFirstPartyRequest(currLocation, reqUrl)) return
if (cache.size > 3000) {
  cache.delete(cache.keys().next().value)
I'm not sure about the implementation, but this looks super inefficient. You'll reach 3000 entries and then this delete will be called for every single URL from then on.
I thought about this, too, at the outset of this task and wrote a quick script you can copy/paste into the browser and futz around with. It's a timer I wrote to make sure that Map would be just as efficient as a POJO (plain old JavaScript object), and indeed it seemed to be. You'll notice that the very first key added to and retrieved from the cache (WITHOUT the delete operation) takes ~2ms total. From iterations 2-2999, on Chrome it gets incredibly efficient: the whole operation (without the delete) takes ~0.04-0.06ms. Once the delete operation is added to the mix after iteration 3000, on my machine all of those operations -- both adding the cache key and then retrieving it -- take ~0.25ms total.
Firefox's performance is slightly slower. Again, the very first key added to the Map and its retrieval takes about 2ms, and from iterations 2-2999 I'm seeing it take about ~0.2-0.6ms. Once the delete operation is added to the mix after iteration 3000, I'm seeing it take ~0.4-0.8ms.
This seems pretty efficient to me, and much faster than the raw tracker lookups (a fast tracker lookup on my machine is in the ~3-5ms range, and the slow ones can take 50ms+). The script that I used is below; feel free to copy/paste it into the console of any browser and see how it looks to you. I'm open to any further comments/observations, but bear in mind that I took this into consideration quite a bit.
var cache = new Map()

function addToCache (key, cancel) {
  if (cache.size >= 3000) {
    // evict the oldest entry (Maps iterate keys in insertion order)
    console.log('DELETE')
    cache.delete(cache.keys().next().value)
  }
  cache.set(key, cancel)
}

function isCached (url) {
  return new Promise((resolve) => {
    resolve({cancel: cache.get(url)})
  })
}

var counter = 0
var ms = 50
for (var i = 0; i <= 3500; i++) {
  setTimeout(function () {
    // capture a local label so the async timeEnd matches the right timer
    var label = ++counter
    var r = () => (Math.random() * 1000000000000000000000).toString()
    // key is roughly 64 characters, about the length of a SHA256 hex hash
    var key = r() + r() + r() + 'f'
    // time the add + retrieval, output to console
    console.time(label)
    addToCache(key, true)
    isCached(key).then(() => console.timeEnd(label))
  }, ms)
}
Thinking about it more -- one improvement that could be made here: the .addToCache() step can be made non-blocking. Nothing that happens in the onBeforeRequest handler after .addToCache() is called depends on it having run, so I can actually speed this up even further by returning a promise immediately from .addToCache(). Even though we won't do anything with the promise(!), we'll just use it to make .addToCache() return immediately, which in turn means that the onBeforeRequest handler's tracker lookup step will return faster, too. https://github.com/duckduckgo/chrome-zeroclickinfo/pull/231/files#diff-2650fb3e14b7f42d68a8766269c54434R141
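A minimal sketch of that fire-and-forget idea, assuming the cache and size cap from the script above (illustrative only -- note that work inside a Promise executor still runs synchronously, so the write here is explicitly deferred to a microtask):
var cache = new Map()

// Hypothetical sketch: defer the cache write so the caller returns immediately
function addToCache (key, cancel) {
  return Promise.resolve().then(() => {
    if (cache.size >= 3000) cache.delete(cache.keys().next().value)
    cache.set(key, cancel)
  })
}

// The caller never awaits the returned promise:
// addToCache(hashedKey, true) // onBeforeRequest continues without waiting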
Good to know.
This 3000 spans all tabs, right? What about the idea of one cache per page? You could delete that whole thing when the page is finished loading, which would also address your concern about storing history.
Also -- do we know the order of cache.keys()? Will there be cases where, after 3000, we're not caching anything (or only partially caching) on subsequent pages? For example, the 3001st entry is my first image URL on amazon.com, and it so happens that the cache.keys().next() call returns this, so nothing ends up getting cached on amazon?
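For reference, a JavaScript Map iterates its keys in insertion order, so cache.keys().next().value is always the oldest entry and the eviction is FIFO:
const m = new Map()
m.set('a', 1)
m.set('b', 2)
m.set('c', 3)
console.log(m.keys().next().value) // 'a' -- Maps preserve insertion order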
Actually I love the idea of caching per page instead of in one huge glob that spans all tabs like this. I'm assuming you mean that we could eliminate the hashing operation if this were the case and we cached by page (or tab) -- that would be the superior design for sure(!). All along I've been weighing the performance pickup we'd get on sites with lots of tracker lookups vs. the drag of the caching operation on pages with lots of 3rd-party requests that are not trackers. This also solves the history problem. A sketch of the per-tab idea is below.
(The second concern you raised above is not valid if I'm understanding it correctly, because all the keys are hashed.)
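A minimal sketch of the per-tab caching idea under discussion (the names and cache shape are illustrative, not the PR's code; chrome.tabs.onRemoved is the standard WebExtensions tabs event):
// Hypothetical per-tab caches: one small Map per tab instead of a global one
const tabCaches = new Map() // tabId -> Map of request URL -> cancel decision

function getTabCache (tabId) {
  if (!tabCaches.has(tabId)) tabCaches.set(tabId, new Map())
  return tabCaches.get(tabId)
}

// Drop a tab's entire cache when the tab closes
chrome.tabs.onRemoved.addListener((tabId) => tabCaches.delete(tabId))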
// cached trackers lookups are scoped to a domain
const key = `domain=${utils.extractHostFromURL(currLocation)},reqUrl:${reqUrl}`
return new Promise((resolve, reject) => {
  utils.hashSHA256(key).then((hashedKey) => resolve(hashedKey))
- Why are we double hashing again? Security of the domain? Are we sure that's really an issue?
- This is looking like you want to subclass Map with your special behavior of 3000 entries and rehashed keys (see the sketch below).
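A rough sketch of what that Map subclass might look like (illustrative names; this is the suggestion, not code from the PR):
// Hypothetical subclass bundling the size cap into the cache itself
class TrackerCache extends Map {
  constructor (maxSize = 3000) {
    super()
    this.maxSize = maxSize
  }

  set (key, value) {
    if (this.size >= this.maxSize) {
      // evict the oldest entry; Maps iterate keys in insertion order
      this.delete(this.keys().next().value)
    }
    return super.set(key, value)
  }
}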
- The crypto (aka double) hash is for privacy. Otherwise this is a record of user history in cleartext.
- Sure, I would consider that approach.
An alternate approach that would incorporate both your points above and improve the implementation, which I was considering last week and will put on the table here: make the cache its own class and abstract it away from this trackers module (which is publicly available as a global, meaning a user's history could theoretically be discovered by hashing the domain and tracker URL manually and querying trackers.isCached(key) from the console -- a highly unlikely scenario, but still plausible). Either way, I like the idea of subclassing Map/cache. I would like to leave the hashSHA256() func as a util so it can be repurposed in other places.
(... meaning that utils.hashSHA256() would be set as a convenience method of the Map subclass but also still available as utils.hashSHA256() for other pieces of code that might want to use it)
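For reference, a browser-side hashSHA256() util like the one described could be built on the Web Crypto API. This is a sketch under that assumption; the PR's actual implementation isn't shown in this thread:
// Hash a string to a 64-character hex SHA-256 digest using SubtleCrypto
function hashSHA256 (str) {
  const data = new TextEncoder().encode(str)
  return crypto.subtle.digest('SHA-256', data).then((buf) =>
    Array.from(new Uint8Array(buf))
      .map((b) => b.toString(16).padStart(2, '0'))
      .join('')
  )
}

// hashSHA256('domain=bar.org,reqUrl:http://foo.com/track.js').then(console.log)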
- I'm just asking if we're really sure that's the case, a real risk. We don't want to store such a thing offline anyway, and I think you'd want to prevent that from happening if it were, for performance reasons alone. Not saying don't do it; I'm asking how we know it's a real concern.
I like your suggestion re: caching per page, so that invalidates the discussion about history. Thankfully!
Reviewer:
Description:
This is the async version for Firefox. For Chrome we'll need to port the onBeforeRequest handler to being all sync.
This PR caches tracker lookup data by domain (it caches no-block as well as block decisions). So if 'http://foo.com' is found to be a tracker on 'http://bar.org', that lookup is cached and used again on subsequent tries anywhere on 'http://bar.org'; the inverse is also true -- if 'http://baz.org' is found not to be a tracker on 'http://bar.org', that no-block decision is also cached (as long as tracker blocking is turned on -- we don't cache anything when the trackers setting is off). I don't think we need to cache lookups against individual URLs, but please correct me if I'm wrong -- that's easy enough to change.
One of the assumptions of this PR is that caching by domain/host is granular enough (vs. caching by specific URL). So if a block/no-block decision on a 3rd-party request is made on http://foo.com/bar, it is cached for all subsequent requests on foo.com.
Also added some perf features to the screenshot testing tool (a new checkbox for perf-only measurements, which outputs load time and memory usage).
Also added a flag to the background process called 'debugTrackerTimer' that times individual tracker lookups and outputs them to the console in the background process.
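An illustrative sketch of the cached-lookup flow the description walks through (the isTracker lookup name is hypothetical; the key format matches the diff quoted above):
// Hypothetical flow: consult the cache first, else do the full lookup,
// then cache the decision either way (block AND no-block)
function checkCachedTracker (currLocation, reqUrl) {
  const key = `domain=${utils.extractHostFromURL(currLocation)},reqUrl:${reqUrl}`
  return utils.hashSHA256(key).then((hashedKey) => {
    if (cache.has(hashedKey)) {
      return {cancel: cache.get(hashedKey)} // cache hit: skip the slow lookup
    }
    const block = !!isTracker(reqUrl, currLocation) // hypothetical full lookup
    addToCache(hashedKey, block)
    return {cancel: block}
  })
}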
Steps to test this PR:
1. Pull this branch down, $ npm run build
2. Install fresh in browser
Automated tests:
Reviewer Checklist:
PR Author Checklist: