This repository has been archived by the owner on Jan 10, 2019. It is now read-only.

Cache tracker lookups #231

Open · wants to merge 41 commits into base: beta
Conversation

Collaborator

@laurengarcia laurengarcia commented Dec 6, 2017

Reviewer:

Description:

This is the async version for Firefox. For Chrome we'll need to port the onBeforeRequest handler to being all sync. This PR caches tracker lookup data by domain (it caches no-block as well as block decisions). So if 'http://foo.com' is found to be a tracker on 'http://bar.org', that lookup is cached and used again on subsequent tries anywhere on 'http://bar.org'; the inverse is also true -- if 'http://baz.org' is found not to be a tracker on 'http://bar.org', that no-block decision is also cached (as long as tracker blocking is turned on -- we don't cache anything when the trackers setting is off). I don't think we need to cache lookups against individual URLs, but please correct me if I'm wrong -- that's easy enough to change.

One of the assumptions of this PR is that caching by domain/host is granular enough (vs. caching by specific url). So if a block/noblock decision on a 3rd party request is made on http://foo.com/bar, it is cached for all subsequent requests on foo.com.
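As a rough sketch of that assumption (all names here are illustrative, not the PR's actual API), the cached decision can key on the first-party host rather than the full page URL:

```javascript
// Hypothetical sketch of domain-scoped caching; `cacheDecision`,
// `checkCachedDecision`, and `cache` are illustrative names, not from the PR.
const cache = new Map()

function getCacheKey (currLocation, reqUrl) {
    // scope the decision to the first-party host, not the full page URL
    const host = new URL(currLocation).hostname
    return `domain=${host},reqUrl=${reqUrl}`
}

function cacheDecision (currLocation, reqUrl, cancel) {
    cache.set(getCacheKey(currLocation, reqUrl), { cancel })
}

function checkCachedDecision (currLocation, reqUrl) {
    const key = getCacheKey(currLocation, reqUrl)
    // returns {cancel: true|false} on a hit, null on a miss
    return cache.has(key) ? cache.get(key) : null
}
```

With this key shape, a block/no-block decision made on http://foo.com/bar applies to every subsequent page on foo.com, which is exactly the granularity trade-off described above.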

Also added some perf features to the screenshot testing tool (a new checkbox for perf-only measurements which outputs load time and memory usage).

Also added a flag to the background process called 'debugTrackerTimer' that times individual tracker lookups and outputs them to the console in the background process.
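A minimal sketch of what such a timing flag might look like (the flag name comes from this PR; the surrounding code is assumed):

```javascript
// Assumed shape: wrap a lookup and log elapsed time when the debug flag is on.
const settings = { debugTrackerTimer: true }  // stand-in for the real settings store

function timedTrackerLookup (lookupFn, reqUrl) {
    if (!settings.debugTrackerTimer) return lookupFn(reqUrl)
    const start = Date.now()
    const result = lookupFn(reqUrl)
    console.log('tracker lookup for ' + reqUrl + ' took ' + (Date.now() - start) + 'ms')
    return result
}
```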

Steps to test this PR:

  • Pull this branch down, $ npm run build
  • Install fresh in browser

Automated tests:

  • Unit tests
  • Integration tests
Reviewer Checklist:
  • Ensure the PR solves the problem
  • Review every line of code
  • Ensure the PR does no harm by testing the changes thoroughly
  • Get help if you're uncomfortable with any of the above!
  • Determine if there are any quick wins that improve the implementation
PR Author Checklist:
  • Get advice or leverage existing code
  • Agree on technical approach with reviewer (if the changes are nuanced)
  • Ensure that there is a testing strategy (and documented non-automated tests)
  • Ensure there is a documented monitoring strategy (if necessary)
  • Consider systems implications

@laurengarcia laurengarcia removed the WIP label Dec 11, 2017
if (!settings.getSetting('trackerBlockingEnabled')) return
if (isFirstPartyRequest(currLocation, reqUrl)) return
if (cache.size > 3000) {
    cache.delete(cache.keys().next().value)
}


I'm not sure about the implementation but this looks super inefficient. Like, you'll reach 3000 and then this will be called for every single URL from then on out.

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


I thought about this, too, at the outset of this task and wrote a quick script you can copy/paste into the browser and futz around with. It's a timer I wrote to make sure that Map would be just as efficient as a POJO (plain old JavaScript object), and indeed it seemed to be. You'll notice that the very first key added and retrieved from the cache (WITHOUT the delete operation) takes ~2ms total. From iterations 2-2999 on Chrome it gets incredibly efficient: the whole operation (without the delete) takes ~0.04-0.06ms. Once the delete operation is added to the mix after iteration 3000, on my machine all of those operations -- both adding the cache key and then retrieving it -- take ~0.25ms total.

Firefox's performance is slightly slower. Again, the very first key added to the Map and its retrieval takes about 2ms, and from iterations 2-2999 I'm seeing it take about ~0.2-0.6ms. Once the delete operation is added to the mix after iteration 3000, I'm seeing it take ~0.4-0.8ms.

This seems pretty efficient to me, and much faster than the raw tracker lookups (a fast tracker lookup on my machine is in the ~3-5ms range, and the slow ones can take 50ms+). The script that I used is below, feel free to copy/paste into the console of any browser and see how it looks to you if you like. Am open to any further comments/observations, but bear in mind that I took this into consideration quite a bit.

var cache = new Map()
var counter = 0

function addToCache (key, cancel) {
    if (cache.size >= 3000) {
        console.log('DELETE')
        cache.delete(cache.keys().next().value)
    }
    cache.set(key, cancel)
}

function isCached (url) {
    return new Promise((resolve, reject) => {
        var cachedValue = cache.get(url)
        resolve({cancel: cachedValue})
    })
}

for (var i = 0; i <= 3500; i++) {
    setTimeout(function () {
        // capture a local label so the async timeEnd call matches its timer
        var label = '' + (++counter)
        var r = () => (Math.random() * 1000000000000000000000).toString()
        // key length is 64 characters, same length as a SHA256 hash
        var key = r() + r() + r() + 'f'

        // time the add + retrieve round trip, output to console
        console.time(label)
        addToCache(key, true)
        isCached(key).then(() => console.timeEnd(label))
    }, 50)
}

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


Thinking about it more -- one improvement that could be made here: the .addToCache() step can be made non-blocking -- meaning nothing that happens in the onBeforeRequest handler after .addToCache() is called depends on it having run -- so I can actually speed this up even further by returning a promise immediately from .addToCache(). Even though we won't do anything with the promise(!), we'll just use it to make .addToCache() return immediately, which in turn means that the onBeforeRequest() handler's tracker lookup step will return faster, too. https://github.com/duckduckgo/chrome-zeroclickinfo/pull/231/files#diff-2650fb3e14b7f42d68a8766269c54434R141
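A sketch of that fire-and-forget idea, assuming the cache is a plain Map (names illustrative): addToCache returns immediately and performs the insertion on a microtask, so the handler never waits on it.

```javascript
// Fire-and-forget sketch: the returned promise is ignored by the caller;
// its only purpose is to defer the Map work off the handler's critical path.
const cache = new Map()

function addToCache (key, cancel) {
    return Promise.resolve().then(() => {
        if (cache.size >= 3000) {
            // evict the oldest entry; Map iterates keys in insertion order
            cache.delete(cache.keys().next().value)
        }
        cache.set(key, cancel)
    })
}
```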


Good to know.

This 3000 spans all tabs, right? What about the idea of one cache per page? You could delete the whole thing when the page finishes loading, which would also address your concern about storing history.

Also -- do we know the order of cache.keys()? Will there be cases where, after 3000, we're not caching anything (or only partially caching) on subsequent pages? For example, the 3001st entry is my first image URL on amazon.com, and it so happens that the cache.keys().next() call returns this, so nothing ends up getting cached on amazon?
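A rough sketch of the per-page idea (the tab wiring is assumed; the WebExtensions listeners named in the comment are where the cleanup call would plausibly live):

```javascript
// One cache per tab; drop the whole Map when the page finishes loading
// (or the tab closes). Wiring to WebExtensions events is assumed.
const tabCaches = new Map()  // tabId -> Map of cached block/no-block decisions

function getTabCache (tabId) {
    if (!tabCaches.has(tabId)) tabCaches.set(tabId, new Map())
    return tabCaches.get(tabId)
}

function clearTabCache (tabId) {
    // e.g. call from webNavigation.onCompleted or tabs.onRemoved listeners
    tabCaches.delete(tabId)
}
```

This also sidesteps the 3000-entry eviction question entirely: each per-tab Map stays small and is thrown away wholesale.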

Collaborator Author

@laurengarcia laurengarcia Dec 13, 2017


Actually I love the idea of caching per page instead of in one huge glob that spans all tabs like this. I'm assuming you mean that we could eliminate the hashing operation if this were the case and we cached by page (or tab); that would be the superior design for sure(!). All along I've been weighing the performance pickup we'd get on sites with lots of tracker lookups vs. the drag of the caching operation on pages with lots of 3rd party requests that are not trackers. This also solves the history problem.

(The second concern you raised above is not valid if I'm understanding it correctly, because all the keys are hashed.)

// cached tracker lookups are scoped to a domain
const key = `domain=${utils.extractHostFromURL(currLocation)},reqUrl:${reqUrl}`
return new Promise((resolve, reject) => {
    utils.hashSHA256(key).then((hashedKey) => resolve(hashedKey))
})


  1. why are we double hashing again? security of the domain? are we sure that's really an issue?
  2. this is looking like you want to subclass Map with your special behavior of 3000 and rehashed keys.

Collaborator Author


  1. The crypto (aka double) hash is for privacy. Otherwise this is a record of user history in cleartext.
  2. Sure, I would consider that approach.

An alternate approach that would incorporate both your points above and improve the implementation -- which I was considering last week and will put on the table here -- make the cache its own class and abstract it away from this trackers module (which is publicly available as a global, meaning a user's history could theoretically be discovered by hashing the domain and tracker URL manually and querying trackers.isCached(key) from the console; a highly unlikely scenario but still plausible). Either way, I like the idea of subclassing Map/cache. I would like to leave the hashSHA256() func as a util so it can be repurposed in other places.

Collaborator Author


(... meaning that util.hashSHA256() would be wrapped as a convenience method of the Map subclass but also still available as util.hashSHA256() for other pieces of code that might want to utilize it)


  1. I'm just asking if we're really sure that's the case -- a real risk. We don't want to store such a thing offline anyway, and I think you'd want to prevent that from happening if it was, for performance reasons alone. Not saying don't do it; I'm asking how we know it's a real concern.

Collaborator Author


I like your suggestion re: caching per page, so that invalidates the discussion about history. Thankfully!


2 participants