Record and use HTTP cache information to skip unnecessary crawling and processing #856
Conversation
Closes #850
This depends on tidoust/fetch-filecache-for-crawling#6. From my exploration, this can reduce a no-change crawl to ~1min30; but there are server-side configuration issues (on www.w3.org and rfc-editor.org, and probably on csswg.org, although I haven't diagnosed them yet) that make the actual impact smaller at the moment.
@tidoust if you get a chance to review the current direction, that would be very much appreciated.
As pointed out in code review: https://github.com/w3c/reffy/pull/856/files#r799292847
As suggested in code review: https://github.com/w3c/reffy/pull/856/files#r799297474
Only use conditional request headers on the relevant URL. Use Last-Modified in preference to ETag, as ETag tends to provide a less reliable caching experience.
Thanks for the review, I've fixed the bugs you mentioned and improved the cache semantics a bit to avoid relying on ETag; the CSS server is still problematic (which is unfortunate given their many, many specs), but I'm down to 3min30 of processing despite that.
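A minimal sketch of the conditional-request logic discussed in this PR. The helper name and the shape of the recorded cache info are assumptions for illustration, not Reffy's or fetch-filecache-for-crawling's actual API:

```javascript
// Build conditional request headers from previously recorded cache info,
// preferring Last-Modified over ETag since some servers return unstable
// ETags that make ETag-based revalidation unreliable.
// (Helper name and cacheInfo shape are hypothetical.)
function buildConditionalHeaders(cacheInfo) {
  const headers = {};
  if (cacheInfo?.lastModified) {
    headers["If-Modified-Since"] = cacheInfo.lastModified;
  }
  else if (cacheInfo?.etag) {
    headers["If-None-Match"] = cacheInfo.etag;
  }
  return headers;
}
```

A server that honors these headers answers `304 Not Modified` when nothing changed, which is what lets the crawler skip re-processing the spec.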
Added another option to skip processing that accounts for misbehaving servers; it now runs in 1min30, with still a few servers/specs not providing cache info (fxtf, rfcs (!)).
Once tidoust/fetch-filecache-for-crawling#6 is merged and released as an updated npm package, the only thing missing from this PR should be a version bump in package.json (and presumably package-lock.json), which should also make the added test pass.
A couple of nits.
This looks good to me otherwise. I note that one implicit hypothesis here is that the index page of a multipage spec is always going to be updated when one of the pages gets updated. I think that's the case because the index page contains the last modified date. I'm not sure that is always the case though. For instance, is the index page always updated when multiple updates get made to pages in the same day?
More importantly, as-is, this caching mechanism (not surprisingly?) creates a cache invalidation problem: if we fix or update one of the extraction modules in browserlib, or even other parts of Reffy's code that could affect extracts, and continue to run Reffy with the fallback parameter, the update will only be reflected in extracts when specs get modified.
That problem already exists without this caching mechanism but is limited to specs that cannot be crawled (which we hope is going to be a temporary condition).
The obvious solution is to force a full crawl when that happens. However, from a Webref perspective:
- We don't really know which version of Reffy was used to crawl data, so we cannot easily run Reffy without the fallback parameter when a new version comes out.
- We probably want to keep the fallback parameter in any case, so as to reuse previous crawl results in case of errors and not introduce temporary updates in the extracts.
I'm thinking that a relatively simple solution would be to inject the version of Reffy that was used to crawl the data into index.json (which is useful info to have in any case), and to only enable this caching mechanism when the current version matches the one in the fallback crawl result.
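The version-gating idea above could be sketched roughly as follows; the `reffyVersion` property name in index.json is an assumption for illustration:

```javascript
// Only allow reuse of fallback crawl results when the Reffy version
// recorded in the fallback's index.json matches the current version.
// (Property name "reffyVersion" is hypothetical.)
function canReuseFallback(currentVersion, fallbackIndex) {
  return Boolean(fallbackIndex &&
                 fallbackIndex.reffyVersion === currentVersion);
}
```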
requestId,
errorReason: "Failed"
});
reuseExistingData = true;
It's a bit sad that there's no easy way to pass that reason within the Fetch.failRequest message. errorReason: "BlockedByClient" could perhaps be the closest reason to what we're doing here.
In any case, #858 removes that part, so that comment will quickly become moot.
as suggested in #856 (review) Co-authored-by: François Daoust <[email protected]>
I had been thinking along the same lines, although I had imagined we would only do a full crawl when Reffy was bumped to a new minor or major version (while having a dispatchable workflow to do a full crawl on demand as well). Do you want this in this PR, or can it be done separately?
I would prefer to have it in this PR so as not to start a loop in Webref that cannot be fixed until we get the feature. But if that's the next thing you do, I'm fine with doing it separately as well (I still wonder about multipage specs though).
I've added support for checking the Reffy version before using fallback data; this could be made more subtle (i.e. not necessarily require an exact match), but it's probably enough as a starting point.
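One way to make the check "more subtle", as floated above, would be to ignore patch-level differences so a patch release of Reffy does not force a full crawl. A sketch, assuming plain "major.minor.patch" version strings:

```javascript
// Treat two versions as compatible when only the patch component
// differs; a minor or major bump would still invalidate fallback data.
function sameMinorVersion(v1, v2) {
  const [major1, minor1] = v1.split(".");
  const [major2, minor2] = v2.split(".");
  return major1 === major2 && minor1 === minor2;
}
```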
Looking at multi-page specs we have today:
In practice, since we would still get a fresh data update on each Reffy upgrade, and given that SVG2 gets updated very rarely, I'm not too concerned by that limitation; if we really care, we could check when fetching subpages whether their Last-Modified is (significantly) more recent than the title page's, and not save the cache info in that situation.
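That safeguard could look roughly like this; the helper name and the 24-hour tolerance are arbitrary assumptions for illustration:

```javascript
// Decide whether to record cache info for a subpage: skip it when the
// subpage's Last-Modified is significantly more recent than the title
// page's, since the title page may not yet reflect that update.
// (Helper name and 24h tolerance are hypothetical.)
function shouldSaveCacheInfo(titleLastModified, subpageLastModified,
                             toleranceMs = 24 * 60 * 60 * 1000) {
  const delta = new Date(subpageLastModified) - new Date(titleLastModified);
  return delta <= toleranceMs;
}
```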
But will that generation always trigger an actual git update? What is going to change in the title page? The date has per-day granularity in particular, so would the title page get git-updated the second time a spec receives an update on the same day? That situation is probably not that uncommon for the HTML spec. Edit: Or are the generated pages not stored in and served from a Git branch?
For HTML, the title page includes a link to the relevant commit (for the snapshot), so if that's our primary concern, I think we're safe.
HTML and CSS2 are not served from a git branch afaict; ES is, but it looks (from a cursory check) like the title page's date is updated when one of the subpages is: tc39/ecma262@65905e9#diff-bf7934c87fd1a409fa27452a6461d0ca2916decce82025154e60df5a3e0e8215. We might lose some same-day updates, but I don't think that warrants strong concern. This could presumably be addressed with my earlier proposal discussed in the context of SVG2.
Good! Go go go!