-
-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore hashtag on pages #35
Conversation
… as WARC records do not contain hashtags
Codecov Report
@@ Coverage Diff @@
## main #35 +/- ##
==========================================
+ Coverage 81.80% 81.87% +0.06%
==========================================
Files 4 4
Lines 676 673 -3
Branches 148 147 -1
==========================================
- Hits 553 551 -2
Misses 79 79
+ Partials 44 43 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good! One comment, and might benefit from a test?
wacz/waczindexer.py
Outdated
new_page = { | ||
"timestamp": ts, | ||
"url": input_page.get("url") or url, | ||
"title": input_page.get("title") or url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what's the motivation to falling back to URL for the title if a title's not present, since we didn't before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually doing that before, just a different way, first setting title as url and then overriding it. But agree that we should add tests for this..
- add test for hashtag in pages - allow extra fields to be passed in from provided pages.jsonl, instead of filtering only some fields - keep 'id' from provided pages.jsonl - keep other fields, such as 'loadState', from existing pages.jsonl
Added some tests, but also additional refactoring to be able to include additional fields in pages.jsonl |
This PR allows matching of pages in a provided pages.jsonl to WARC records, ignoring the hashtag, eg.
https://example.com/#hashtag
is matched to WARC recordhttps://example.com/
. The page is added ashttps://example.com/#hashtag
but the timestamp of the record without the hashtag is used.bumps version 0.4.9