-
Notifications
You must be signed in to change notification settings - Fork 89
Use etld plus one matching for 3p #172
base: master
Are you sure you want to change the base?
Conversation
'www.example.org' | ||
) | ||
assert.equal(queryResult.matches, true) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suggest adding some tests that use non-trivial etld+1's like, example.co.uk
and example.githubusercontent.com
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many, but there at the C++ level. They're in this file https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see thanks
general question about this approach: is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs |
We could, but then we'd loose the ability to run in node (which has been very valuable for crawling / measurement, getting other folks to use the code, debugging, etc) |
etld/domain.cc
Outdated
@@ -0,0 +1,51 @@ | |||
/* Copyright (c) 2018 The Brave Software Team. Distributed under the MPL2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should be 2019. This applies to all of the new files added in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed with 81cc4f4
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm confused about how GitHub is displaying the latest version of your PR, but it seems like you fixed all files, except for etld/domain.cc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, i goofed, apologies for the butter fingers. Fixed now
I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo? |
You're right no need for this. I removed it from the .gitignore previously, now also removed it from the set of tracked files. Should be good now |
@bbondy my code expects the public suffix list to be in a known location, and lazily parses the list on first use (i.e. at |
be4c62e
to
33ff5d9
Compare
.gitignore
Outdated
.vscode | ||
|
||
# These files are either fetched at build time, or generated from the build | ||
etld/data/public_suffix_list.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I usually try to make these absolute paths with respect to the location of the .gitignore
file (i.e. /etld/data/public_suffix.h
) in order to avoid unexpected matches somewhere else in the codebase. Not very likely in this case, so feel free to ignore this suggestion, but a good habit IMHO.
Makefile
Outdated
@@ -5,7 +5,9 @@ | |||
.PHONY: clean | |||
|
|||
build: | |||
./node_modules/.bin/node-gyp configure && ./node_modules/.bin/node-gyp build | |||
curl -s https://publicsuffix.org/list/public_suffix_list.dat -o etld/data/public_suffix_list.dat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kudos for not using -k
here :)
nit: If you want to make that curl
line even stricter, you could also throw in --tlsv1.2
to enforce a minimum level of TLS.
etld/data/build.js
Outdated
constructorArgs.push(isException ? "true" : "false"); | ||
|
||
const wrappedLabels = labels.map(JSON.stringify); | ||
constructorArgs.push("{" + wrappedLabels.join(", ") + "}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the labels be surrounded by "
in case they're not all bare words? Also, is it possible they include "
characters that should be escaped?
} | ||
|
||
if (previous != 0) { | ||
labels_.push_back(string.substr(previous, current - previous)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this call will fail if you get invalid input like: abcd.efgh.
(note the trailing dot). Might be worth adding a test for it if there isn't one already.
2919690
to
8d72272
Compare
@bbondy this is now ready for review again. The ways to enable the eTLD+1 checking (by parsing a public suffix list) are:
|
d1ada9e
to
86fb8a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a commit to fix npm run perf
, this is unfortunately currently regressing perf by 4-5x overall though. We'll need to optimize that so it's trivially the same in speed for matching.
…/ re-alloc-ing objects
0fcdfc4
to
76bdb66
Compare
etld/internal/public_suffix_rule.cc
Outdated
while (current != std::string::npos) { | ||
current_label = label_text.substr(previous, current - previous); | ||
if (current_label.length() == 0) { | ||
throw PublicSuffixRuleInputException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chromium is built with exceptions disabled, and this would be the first exception.
I'd recommend instead passing in a pointer to a vector and then filling it. And make the return value the result, and propagate failures.
etld/internal/public_suffix_rule.cc
Outdated
// If don't include any trailing whitespace, if there is any. | ||
current_label = label_text.substr(previous, current - previous); | ||
if (current_label == "") { | ||
throw PublicSuffixRuleInputException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto exceptions
etld/internal/public_suffix_rule.cc
Outdated
PublicSuffixRule::PublicSuffixRule(const std::string& rule_text) { | ||
std::string trimmed_rule_text(trim_to_whitespace(rule_text)); | ||
if (trimmed_rule_text.length() == 0) { | ||
throw PublicSuffixRuleInputException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto exceptions
etld/internal/public_suffix_rule.cc
Outdated
break; | ||
|
||
case '/': | ||
throw PublicSuffixRuleInputException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto exceptions
test/js/matchingTest.js
Outdated
before(function () { | ||
this.client = new AdBlockClient() | ||
this.client.parse('||bannersnack.com^$third-party') | ||
const etldRules = fs.readFileSync('./test/data/public_suffix_list.dat', 'utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do some similar tests for when we don't call parsePublicSuffixRules and it falls back to the warning with FQDN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably just add a for loop which loops twice using different clients around all the it
calls.
test/js/matchingTest.js
Outdated
this.client.parsePublicSuffixRules(etldRules) | ||
}) | ||
it('consider eTLD+1 domains as 1p', function () { | ||
const altSubDomainQuery = this.client.findMatchingFilters( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add tests for matches
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash these 3 commits together by using git rebase -i
- linter fixes / cleanup
- further linter fixes / cleanup
- even more lint cleanup
d25f24d
to
ad0dab0
Compare
fixes #171