Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-write throttling logic with queues #1587

Merged
merged 3 commits into from
Jun 7, 2024
Merged

Re-write throttling logic with queues #1587

merged 3 commits into from
Jun 7, 2024

Conversation

tidoust
Copy link
Member

@tidoust tidoust commented Jun 6, 2024

(Leaving as draft as this would need to be updated if the code for throttled-queue changes following review of w3c/browser-specs#1356)

This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor.

On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before).

The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable.

Also, the code for the throttled queue is the exact same one as the code in browser-specs (see w3c/browser-specs#1356). Code is duplicated here out of simplicity and laziness to create and maintain a shared dependency.

This re-writes the convoluted logic introduced in previous commit (and the
also convoluted logic that pre-existed) as suggested in #1582, taking
inspiration from the queue in the CG monitor:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

On top of serializing requests sent to a given origin and sleeping a bit in
between these requests, the new queue controls parallelization, making sure
that the code does not load too many specs at once (I stuck to 4 in parallel as
before).

The net result is essentially the same as that achieved with the previous
commit: specs are processed as if origins had been interleaved. But the new
code is more straightforward, less verbose and more readable.

Also, the code for the throttled queue is the exact same one as the code in
browser-specs (see w3c/browser-specs#1356). Code is
duplicated here out of simplicity and laziness to create and maintain a shared
dependency.
@tidoust tidoust marked this pull request as ready for review June 7, 2024 12:53
@tidoust
Copy link
Member Author

tidoust commented Jun 7, 2024

That seems to work well, with the caveat that crawl takes much longer due to the serialization per origin and the 2s pause in-between specs that target the same origin. The thing is there are >300 nightly specs with a github.io URL. That means crawl will take ~12minutes at best, closer to 15mn in practice. That's fine in theory.

We could split the different github.io origins (I know, I argued against that in browser-specs ;)). We'll still have ~200 specs with a w3c.github.io URL, and ~85 specs with a wicg.github.io URL, but that's better than grouping them together.

(I'll prepare a side PR to avoid requests on SVG and CSS resources, that works fine as well, and seems useful!)

@dontcallmedom
Copy link
Member

my personal sense is that we should only use the 2 seconds pacing for the CSS server, and a much smaller one for other "origins"

@tidoust
Copy link
Member Author

tidoust commented Jun 7, 2024

my personal sense is that we should only use the 2 seconds pacing for the CSS server, and a much smaller one for other "origins"

Done in the latest commit. I kept a 1 second sleep time for www.w3.org because of the 180 requests/minute limit.

Note that getOrigin now also returns the "https://" prefix, to be more consistent with what the origin property returns.

The update does not fundamentally change the duration of a full crawl: loading the specs is what takes the most amount of time. It should speed up the 304 case in incremental crawls though.

@tidoust tidoust requested a review from dontcallmedom June 7, 2024 14:35
This makes the sleep time depend on the origin being crawled:
- The CSS server breaks easily, the code sleeps 2 seconds in between requests.
- The www.w3.org server has a limit of 180 requests per minute, the code sleeps
1 second in between requests.
- Other origins seem more amenable to faster crawls, the code sleeps 100ms in
between requests.
@tidoust tidoust merged commit 0364b86 into main Jun 7, 2024
1 check passed
@tidoust tidoust deleted the throttling branch June 7, 2024 14:55
tidoust added a commit that referenced this pull request Jun 7, 2024
Breaking change:
- Bump required version of Node.js to >=20.12.1 (#1590)

Required by dependency on ReSpec, although note this dependency is only for
tests (`devDependencies` in `package.json`).

New features:
- Reduce crawler load on servers (#1587, #1581)
- Include data-xref-type as another hint for autolinks (#1585)

Feature patches:
- Skip requests on SVG and CSS resources (#1588)
- Fix Puppeteer instance teardown (#1586)

Dependency bumps:
- Bump respec from 35.0.2 to 35.1.0 (#1583)
- Bump ajv from 8.15.0 to 8.16.0 (#1580)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants