Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore hashtag on pages #35

Merged
merged 5 commits into from
Jun 29, 2023
Merged

Ignore hashtag on pages #35

merged 5 commits into from
Jun 29, 2023

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Jun 9, 2023

This PR allows matching of pages in a provided pages.jsonl to WARC records, ignoring the hashtag, eg. https://example.com/#hashtag is matched to WARC record https://example.com/. The page is added as https://example.com/#hashtag but the timestamp of the record without the hashtag is used.

bumps version 0.4.9

@ikreymer ikreymer requested a review from tw4l June 9, 2023 18:17
@codecov
Copy link

codecov bot commented Jun 9, 2023

Codecov Report

Merging #35 (0d26c44) into main (af3bed7) will increase coverage by 0.06%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
+ Coverage   81.80%   81.87%   +0.06%     
==========================================
  Files           4        4              
  Lines         676      673       -3     
  Branches      148      147       -1     
==========================================
- Hits          553      551       -2     
  Misses         79       79              
+ Partials       44       43       -1     
Impacted Files Coverage Δ
wacz/waczindexer.py 82.88% <86.66%> (+0.18%) ⬆️
wacz/util.py 96.22% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! One comment, and might benefit from a test?

new_page = {
"timestamp": ts,
"url": input_page.get("url") or url,
"title": input_page.get("title") or url,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what's the motivation to falling back to URL for the title if a title's not present, since we didn't before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually doing that before, just a different way, first setting title as url and then overriding it. But agree that we should add tests for this..

- add test for hashtag in pages
- allow extra fields to be passed in from provided pages.jsonl, instead of filtering only some fields
- keep 'id' from provided pages.jsonl
- keep other fields, such as 'loadState', from existing pages.jsonl
@ikreymer
Copy link
Member Author

Added some tests, but also additional refactoring to be able to include additional fields in pages.jsonl
Previously, all but the core fields were being filtered out during WACZ creation process..

@ikreymer ikreymer merged commit 47b3eef into main Jun 29, 2023
@ikreymer ikreymer deleted the ignore-hashtag-on-pages branch June 29, 2023 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants