Add post-processor and events post-processing module #1015

tidoust · 2022-07-14T20:25:03Z

This is a breaking change. It does two main things:

1. It adds a post-processing stage to Reffy. This makes it possible and easy to run modules that may need to mix data extracted by multiple crawling modules, and/or to mix data generated from other specs. The update converts logic that used to be hardcoded in the crawler such as idlparsed generation to post-processing modules.
2. It adds an "events" post-processing module to consolidate event extracts.

More precisely, changes are:

- Add a post-processor library that can run post-processing modules either at the spec level or at the crawl level.
- Convert generateIDLParsed/saveIDLParsed and generateIDLNames/saveIDLNames to post-processing modules "idlparsed" and "idlnames". The former runs at the spec level. The latter runs at the crawl level.
- Create a "csscomplete" post-processing module for post-crawl CSS completing tasks that were previously hardcoded in the crawler.
- Create an "events" post-processing module that integrates the non-patch logic done in webref's PR w3c/webref#650 ("Adjust scripts to also create @webref/events package").
- Add a `--post` option to Reffy's CLI to specify the post-processing modules to run. None are run by default. This means that "idlparsed" and "idlnames" are no longer generated by default during a crawl. This also means that CSS properties are not completed (e.g. with the IDL attributes they generate) by default either.
- Add a `--crawl` option to Reffy's CLI to make it skip the crawl and rather ingest a crawl result. The option is only really useful when combined with the new `--post` option. It makes it possible to run post-processing modules on an existing crawl result.
- Make the crawler skip the crawl step when the `--crawl` option is used.
- Drop the "generate-idlparsed" and "generate-idlnames" CLI tools. Same result may now be achieved by running the "idlparsed" and "idlnames" post-processing modules through Reffy's CLI.
- Expose the post-processor as an entry in `index.js`.
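
To make the spec-level versus crawl-level distinction concrete, here is a minimal sketch of how such a post-processor could be organized. It is not Reffy's actual code: the module shape (`name`, `level`, `run`), the `postProcess` function, the `post` property and the two toy modules are all assumptions made up for illustration, and the sketch is synchronous to keep it short.

```javascript
// Hypothetical sketch: none of these names come from Reffy's code base.

// A spec-level module sees one spec result at a time.
const idlLength = {
  name: 'idllength',
  level: 'spec',
  run(spec) {
    return spec.idl ? spec.idl.length : 0;
  }
};

// A crawl-level module sees the whole crawl result, including what
// spec-level modules added to each spec.
const idlTotal = {
  name: 'idltotal',
  level: 'crawl',
  run(crawl) {
    return crawl.results.reduce((sum, spec) => sum + (spec.idllength || 0), 0);
  }
};

function postProcess(crawl, modules) {
  // Copy spec results so that the input crawl result is left untouched.
  const results = crawl.results.map(spec => ({ ...spec }));

  // Spec-level modules run first, once per spec.
  for (const mod of modules.filter(m => m.level === 'spec')) {
    for (const spec of results) {
      spec[mod.name] = mod.run(spec);
    }
  }

  // Crawl-level modules run once, after all spec-level modules.
  const processed = { ...crawl, results, post: {} };
  for (const mod of modules.filter(m => m.level === 'crawl')) {
    processed.post[mod.name] = mod.run(processed);
  }
  return processed;
}

const crawl = {
  results: [
    { shortname: 'spec-a', idl: 'interface A {};' },
    { shortname: 'spec-b' }
  ]
};

// Modules are passed in the "wrong" order on purpose: the level decides
// when a module runs, not its position in the list.
const processed = postProcess(crawl, [idlTotal, idlLength]);
console.log(processed.results.map(spec => spec.idllength), processed.post.idltotal);
```

In this sketch, running every spec-level module before any crawl-level module is what lets a crawl-level module rely on spec-level output without any explicit ordering logic.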

Bugs fixed along the way:

- The `isLatestLevelThatPasses` function could crash due to a check being done too early.
- The crawler was not setting the right `crawled` property initially.
- The events extraction expected `crawled` to be set on the spec object, but it is typically set "after" the crawler is done running the modules. The `window.location` URL should be used instead. Tests adjusted accordingly.

Still missing (could be done later on?):

- Post-processor tests.
- Update README to explain the post-processor.
- Smarter workflow logic to order the post-processing modules based on what they depend on. In practice, that's not needed for now because the only theoretical hiccup is between "idlparsed" and "idlnames" and the former runs at the spec level, so the latter always runs after the former.
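
For reference, the "smarter workflow logic" in the last item could be a plain depth-first topological sort over declared dependencies. The sketch below is only one possible approach, not something this PR implements: the `requires` property and the `orderModules` function are hypothetical, and the only dependency taken from the discussion is that "idlnames" needs "idlparsed".

```javascript
// Hypothetical sketch: post-processing modules declare the other
// post-processing modules they depend on in a `requires` list.
function orderModules(modules) {
  const byName = new Map(modules.map(m => [m.name, m]));
  const ordered = [];
  const state = new Map(); // name -> 'visiting' | 'done'

  function visit(name) {
    if (state.get(name) === 'done') return;
    if (state.get(name) === 'visiting') {
      throw new Error(`Circular dependency on "${name}"`);
    }
    const mod = byName.get(name);
    if (!mod) {
      throw new Error(`Unknown module "${name}"`);
    }
    state.set(name, 'visiting');
    // Dependencies are pushed to the list before the module itself.
    for (const dep of mod.requires || []) {
      visit(dep);
    }
    state.set(name, 'done');
    ordered.push(name);
  }

  for (const mod of modules) {
    visit(mod.name);
  }
  return ordered;
}

const order = orderModules([
  { name: 'idlnames', requires: ['idlparsed'] },
  { name: 'csscomplete' }, // no post-processing dependency assumed here
  { name: 'idlparsed' }
]);
console.log(order); // [ 'idlparsed', 'idlnames', 'csscomplete' ]
```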


tidoust · 2022-07-16T07:48:05Z

I ran the crawler locally and can confirm that the new version produces the same extracts as the previous one.

A few additional notes:

1. The events post-processing module generates an `events.json` file in the root output folder and not an `index.json` file under an `events` sub-folder. We'll have to adjust the code that will generate the `@webref/events` package accordingly but that should be a no-brainer.
2. A consequence of 1. is that we could actually envision getting back to `events` for the name of the folder that contains the raw events extracts, instead of `spec-events`.
3. Webref's curation code will need to be adjusted to the new version in any case to run the post-processing modules on the patched results.

dontcallmedom

LGTM, with some reservations on some of the naming choices :)

reffy.js

src/lib/post-processor.js

src/lib/util.js


dontcallmedom · 2022-07-18T09:27:48Z

> 2. A consequence of 1. is that we could actually envision getting back to events for the name of the folder that contains the raw events extracts, instead of spec-events.

+1 to that idea

Breaking changes:

- Add post-processor and events post-processing module (#1015)

Bug fixes:

- Couple of bug fixes in crawler (#1019)
- Fix exported functions crawlSpecs and getInterfaceTreeInfo (#1018)

Bumped dependencies:

- Bump rollup from 2.76.0 to 2.77.0 (#1017)
- Bump commander from 9.3.0 to 9.4.0 (#1016)

What "breaks" compared to v7, a.k.a "how to upgrade":

- Reffy no longer completes CSS extracts per default to add generated IDL attributes and add properties defined in prose. Run the crawler with the `csscomplete` post-processing module to get the same result. Note the `csscomplete` module will look at `dfns` extracts to add properties defined in prose. In other words, command line should include something like: `--module dfns --post csscomplete` (or not mention `--module` at all as Reffy runs all crawl modules by default)
- Reffy no longer outputs parsed CSS structures to the console when CSS extraction runs. This was not used anywhere. It would be trivial to do this in a post-processing module if that seems needed.
- Reffy no longer produces "idlparsed" and "idlnames" extracts per default. Run the crawler with the `idlparsed` and `idlnames` post-processing modules. The `idlparsed` module needs `idl` extracts to run. The `idlnames` module needs `idlparsed` extracts to run. In other words, command line should include something like: `--module idl --post idlparsed --post idlnames` (or not mention `--module` at all as Reffy runs all crawl modules by default)
- Reffy saves events extracts to an `events` folder again (instead of `spec-events`). Events extraction and events merging should be viewed as a beta feature for now, likely to be improved in future versions of Reffy.
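
When upgrading a script, it may help to check that a given combination of `--module` and `--post` values satisfies the requirements listed above. The helper below is a hypothetical sketch, not part of Reffy: the `requirements` table only encodes the three requirements stated in these notes, the `findMissingInputs` function is made up for illustration, and the sketch ignores the case where an existing crawl result already contains the needed extracts.

```javascript
// Requirements as stated in the upgrade notes. The shape of the table
// is an assumption made for this sketch.
const requirements = {
  csscomplete: { crawl: ['dfns'], post: [] },
  idlparsed: { crawl: ['idl'], post: [] },
  idlnames: { crawl: [], post: ['idlparsed'] }
};

// crawlModules: list of `--module` values, or null when `--module` is not
// mentioned (Reffy then runs all crawl modules).
// postModules: list of `--post` values.
function findMissingInputs(crawlModules, postModules) {
  const missing = [];
  for (const name of postModules) {
    const needs = requirements[name] || { crawl: [], post: [] };
    for (const input of needs.crawl) {
      if (crawlModules !== null && !crawlModules.includes(input)) {
        missing.push(`--module ${input} (needed by ${name})`);
      }
    }
    for (const input of needs.post) {
      if (!postModules.includes(input)) {
        missing.push(`--post ${input} (needed by ${name})`);
      }
    }
  }
  return missing;
}

// "idlnames" alone is not enough: it needs "idlparsed" extracts.
console.log(findMissingInputs(['idl'], ['idlnames']));
// No `--module` option: all crawl modules run, nothing is missing.
console.log(findMissingInputs(null, ['idlparsed', 'idlnames']));
```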

tidoust added 4 commits July 13, 2022 18:57

Merge branch 'main' into postprocessing-and-events

29ae456

Fix spec/crawl check

f538eee

Result that receives the post-processor in a real crawl does not yet have a `shortname` property, so check should not have been based on it.

Make getTreeInfo an exported util function

887756b

The function will be useful to run tests in Webref

tidoust marked this pull request as ready for review July 16, 2022 07:48

tidoust requested a review from dontcallmedom July 16, 2022 07:48

dontcallmedom approved these changes Jul 18, 2022


tidoust and others added 3 commits July 18, 2022 10:39

Update src/lib/post-processor.js

718808e

Co-authored-by: Dominique Hazael-Massieux <[email protected]>

Rename "--crawl" option and "getTreeInfo" method

69e189c

Per feedback, rename `--crawl` to `--use-crawl` to make it clearer that Reffy will use the provided parameter as input, and rename generic `getTreeInfo` function to a more specific `getInterfaceTreeInfo` function.

Fix typo in error message

dc16920

Rename "spec-events" to "events"

5abb393

tidoust merged commit 7a2afc6 into main Jul 18, 2022

tidoust deleted the postprocessing-and-events branch July 18, 2022 12:39
