Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record and use HTTP cache information to skip unnecessary crawling and processing #856

Merged
merged 19 commits into from
Feb 7, 2022

Conversation

dontcallmedom
Copy link
Member

close #850

@dontcallmedom
Copy link
Member Author

this depends on tidoust/fetch-filecache-for-crawling#6

from my exploration, this can reduce a no-change crawl to ~1min30; but there are server-side configuration issues (on www.w3.org and rfc-editor.org, probably on csswg.org although I haven't diagnosed them yet) that makes the actual impact as is less big at the moment.

@dontcallmedom
Copy link
Member Author

@tidoust if you get a chance to review the current direction, this would be very much appreciated

src/lib/specs-crawler.js Outdated Show resolved Hide resolved
src/lib/specs-crawler.js Outdated Show resolved Hide resolved
src/lib/util.js Outdated Show resolved Hide resolved
@dontcallmedom
Copy link
Member Author

Thanks for the review, I've fixed the bugs you've mentioned and improved a bit the cache semantics to avoid relying on etag; the CSS server is still problematic (which is unfortunate given their many many specs), but I'm down to 3min30 of processing despite that.

@dontcallmedom
Copy link
Member Author

dontcallmedom commented Feb 4, 2022

Added another option to skip processing that accounts for mis-behaving servers; it now runs in 1min30 with still a few servers/spec not providing cache info (fxtf, rfcs (!))

@dontcallmedom dontcallmedom requested a review from tidoust February 4, 2022 12:55
@dontcallmedom
Copy link
Member Author

Once tidoust/fetch-filecache-for-crawling#6 is merged and released as an updated npm package, the only missing from this PR should be a bump in package.json (and presumable package-lock) - which should also result in making the added test not fail.

Copy link
Member

@tidoust tidoust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits.

This looks good to me otherwise. I note that one implicit hypothesis here is that the index page of a multipage spec is always going to be updated when one of the pages gets updated. I think that's the case because the index page contains the last modified date. I'm not sure that is always the case though. For instance, is the index page always updated when multiple updates get made to pages in the same day?

More importantly, as-is, this caching mechanism (not surprisingly?) creates a cache invalidation problem: if we fix or update one of the extraction modules in browserlib or even other parts of Reffy's code that could affect extracts, and continue to run Reffy with the fallback parameter, the update will only be reflected in extracts when specs get modified.

That problem already exists without this caching mechanism but is limited to specs that cannot be crawled (which we hope is going to be a temporary condition).

The obvious solution is to force a full crawl when that happens. However, from a Webref perspective:

  1. We don't really know which version of Reffy was used to crawl data, so we cannot easily run Reffy without the fallback parameter when a new version comes out.
  2. We probably want to keep the fallback parameter in any case, so as to reuse previous crawl results in case of errors and not introduce temporary updates in the extracts.

I'm thinking that a relatively simple solution would be to inject the version of Reffy that was used to crawl the data in index.json (which is useful info to have in any case), and to only enable this caching mechanism when the current version matches the one in the fallback crawl result.

requestId,
errorReason: "Failed"
});
reuseExistingData = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit sad that there's no easy way to pass that reason within the Fetch.failRequest message. errorReason: "BlockedByClient" could perhaps be the closest reason to what we're doing here.

In any case, #858 removes that part, so that comment will quickly become moot.

src/lib/util.js Outdated Show resolved Hide resolved
src/lib/specs-crawler.js Outdated Show resolved Hide resolved
@dontcallmedom dontcallmedom marked this pull request as ready for review February 7, 2022 07:14
@dontcallmedom
Copy link
Member Author

I'm thinking that a relatively simple solution would be to inject the version of Reffy that was used to crawl the data in index.json (which is useful info to have in any case), and to only enable this caching mechanism when the current version matches the one in the fallback crawl result.

I had been thinking along the same line, although I had imagined we would do only a full crawl when reffy was bumped in minor or major versions (while having a dispatchable workflow to do a full crawl on demand as well)

Do you want this in this PR, or can this be done separately?

@dontcallmedom dontcallmedom requested a review from tidoust February 7, 2022 07:14
@tidoust
Copy link
Member

tidoust commented Feb 7, 2022

Do you want this in this PR, or can this be done separately?

I would prefer if we had it in this PR so as not to start a loop in Webref that cannot be fixed until we get the feature. But if that's the next thing you do, I'm fine doing it separately as well

(I still wonder about multipage specs though)

@dontcallmedom
Copy link
Member Author

I've added support for checking reffy version to use fallback data; this could made more subtle (i.e. not necessarily require an exact match), but it's probably enough as starting point.

I note that one implicit hypothesis here is that the index page of a multipage spec is always going to be updated when one of the pages gets updated. I think that's the case because the index page contains the last modified date. I'm not sure that is always the case though. For instance, is the index page always updated when multiple updates get made to pages in the same day?

Looking at multi-page specs we have today:

  • HTML, CSS2.2 and ES are generated from a single source file, so the title page is guaranteed to be updated
  • CSS 2.1 and SVG 1.1 are TR-only (so would necessary have their title page updated if they ever get updated)
  • SVG2 is problematic - its title page DOES NOT get updated when subpages are updated

In practice, since we would still get a fresh data update on each reffy upgrade, and given that SVG2 gets updated very rarely, I'm not too concerned by that limitation; we could check when fetching subpages whether their last-modified is (significantly) more recent than the title page and not save the cache info in that situation if we really care.

@tidoust
Copy link
Member

tidoust commented Feb 7, 2022

HTML, CSS2.2 and ES are generated from a single source file, so the title page is guaranteed to be updated

But will that generation always trigger an actual git update? What is going to change in the title page? The date is per day in particular, so would the title page get git-updated the second time a spec receives an update on the same day? That situation is probably not that uncommon for the HTML spec.

Edit: Or are the generated pages not stored in and served from a Git branch?

@dontcallmedom
Copy link
Member Author

But will that generation always trigger an actual git update? What is going to change in the title page? The date is per day in particular, so would the title page get git-updated the second time a spec receives an update on the same day? That situation is probably not that uncommon for the HTML spec.

For HTML, the title page includes a link with the relevant commit (for the snapshot) - so if that's our primary concern, I think we're safe

@dontcallmedom
Copy link
Member Author

Edit: Or are the generated pages not stored in and served from a Git branch?

HTML and CSS2 are not served from a git branch afaict; ES is, but it looks (from a cursory check) that the title page is updated for the date when one of the subpages is tc39/ecma262@65905e9#diff-bf7934c87fd1a409fa27452a6461d0ca2916decce82025154e60df5a3e0e8215 - we might lose some same-day updates, but I don't think that warrants a strong concern. This presumably could be addressed with my earlier proposal discussed in the context of SVG2

Copy link
Member

@tidoust tidoust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! Go go go!

@dontcallmedom dontcallmedom merged commit c371626 into main Feb 7, 2022
@dontcallmedom dontcallmedom deleted the http-cache branch February 7, 2022 13:46
tidoust added a commit that referenced this pull request Feb 7, 2022
New features:
- Record and use HTTP cache information to skip unnecessary crawling and
processing (#856)
- Check cache validity before launching browser tab (#858)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make reffy crawler http-cache friendly
2 participants