-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editor: Introduce HTML Tag Processor #3920
Conversation
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. ```php // Add missing `rel` attribute to links. $p = new WP_HTML_Tag_Processor( $block_content ); if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) { $p->set_attribute( 'noopener nofollow' ); } return $p->get_updated_html(); ``` Introduced originally in WordPress/gutenberg#42485 and developed within the Gutenberg repository, this HTML parsing system was built in order to address a persistent need (properly modifying HTML tag attributes) and was motivated after a sequence of block editor defects which stemmed from mismatches between actual HTML code and expectectations for HTML input running through existing naive string-search-based solutions. The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system, and only allocates memory as changes to the input HTML document are enqueued, releasing that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via `get_updated_html()`. Rigor has been taken to ensure that the Tag Processor will not be consfused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within `<title>`, `<textarea>`, and `<script>` tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags. The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags and it has no ability to determine which tag openers and tag closers belong to each other, or determine the nesting depth of a given tag. It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark it's possible to back up and continue processsing again content that has already been traversed. Attribute values are sanitized with `esc_attr()` and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors wishing to rely on the Tag Processor therefore are free to pass around data as normal strings. Convenience methods for adding and removing CSS class names exist in order to remove the need to process the `class` attribute. ```php // Update heading block class names $p = new WP_HTML_Tag_Processor( $html ); while ( $p->next_tag() ) { switch ( $p->get_tag() ) { case 'H1': case 'H2': case 'H3': case 'H4': case 'H5': case 'H6': $p->remove_class( 'wp-heading' ); $p->add_class( 'wp-block-heading' ); break; } return $p->get_updated_html(); ``` The Tag Processor is intended to be a reliable low-level library for traversing HTML documents and higher-level APIs are to be built upon it. Immediately, and in Core Gutenberg blocks it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements. See the following for examples of such replacement: WordPress/gutenberg@1315784 https://github.com/WordPress/gutenberg/pull/45469/files#diff-dcd9e1f9b87ca63efe9f1e834b4d3048778d3eca41aa39c636f8b16a5bb452d2L46 WordPress/gutenberg#46625 Co-Authored-By: Adam Zielinski <[email protected]> Co-Authored-By: Bernie Reiter <[email protected]> Co-Authored-By: Grzegorz Ziolkowski <[email protected]>
Trac ticket: https://core.trac.wordpress.org/ticket/57575 |
Wondering if it would be better to move the |
agreed. I've already done this in the Gutenberg side to get around the linter and will move the changes here once that merges, or beforehand maybe if I can. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @dmsnell and to everyone who worked on this!
I've left some thoughts below 🙂
I don't think we should even include those guards at all here -- after all, Core is the most "canonical" source for those classes, and its versions should "prevail" over the ones provided by the Gutenberg plugin for back-compat (which shouldn't happen normally, but in case the order of imports is mixed up for whatever reason). |
Hmm, IMO these guards shouldn't be in Core. Why? Core should expect to load itself. The loading of these classes into memory happens early and before a plugin or theme loads. Thus, That said, code extending Core does need these guards. For example, Gutenberg needs these guards to prevent loading its classes of the same name. |
* Rename data providers to match test per coding standard. * Restructure data provider datasets into a single array form for consistency. * Add `WP_HTML_Tag_Processor::` to @Covers methods per coding standard. * Add empty line between set up and assertion groupings. * Moved well-formed HTML into separate test of updating attributes. * Replaced assertEquals() with assertSame().
It depends :) What I'm getting at here is that both Gutenberg's PHP (in If the Tag Processor's classes (files) are always loaded from wp-settings.php, which happens earlier than any plugin, I agree the |
@costdev I've marked a number of our conversations as resolved to reduce the noise when reading this PR, but if you didn't find them resolved please un-resolve them and let me know. normally I wouldn't have done that but there were so many non-controversial changes you suggested and I figured it'd be easiest if I hid them from the chat once I accepted your changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmsnell Please find my notes below.
Thanks!
public $start; | ||
|
||
/** | ||
* Byte offset into document where span ends. | ||
* | ||
* @since 6.2.0 | ||
* @var int | ||
*/ | ||
public $end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declaring these properties as public breaks encapsulation - one of the basic OOP principles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @anton-vlasenko - do you see a particular problem with them being public or are you sharing your value about how you prefer the code to be written?
what benefit do you see us gaining by hiding these?
of note, these are helper classes specifically for performance reasons, and are essentially stand-ins for PHP's lack of records. there is a massive gain from using these vs. using associative arrays, and if PHP coding style were more comfortable with more than one class per file they'd be in the main tag processor file.
that being said, unless you have a concrete benefit in mind or see a clear problem with the public
visibility I'd very much to prefer keeping them as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
while ( $this->parse_next_attribute() ) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the continue
operator is needed here.
It's the only and the last statement in the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is here explicitly as a compromise with the linter, because if the continue
is gone someone will try again to come in and remove the loop body, which itself has been empirically one of the most dangerous code patterns in history, alongside null
and zero-terminated strings.
// what the linter fixes want
while ( $this->parse_next_attribute() );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just tried to commit this code
while ( $this->parse_next_attribute() ) {
// No code is needed here.
}
and I got no complaints from the linter.
In my case, removing the continue
worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appeared in Gutenberg not in Core, and I'd prefer having separate subtle forms like this in both repos. are you okay leaving the continue
in? it's a lose-lose situation IMO and given that the linter's choice is dangerous and this is benign I'd rather leave the benign option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm ok with that.
The continue
operator in the loop body doesn't seem to affect performance significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The continue operator in the loop body doesn't seem to affect performance significantly.
The continue
in the loop body should not affect the performance at all.
* | ||
* @param WP_HTML_Text_Replacement $a First attribute update. | ||
* @param WP_HTML_Text_Replacement $b Second attribute update. | ||
* @return int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @return int | |
* @return int An integer less than, equal to, or greater than zero if | |
the first argument is considered to be respectively less than, | |
equal to, or greater than the second. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a single-line annotation because I found that some IDEs truncate multi-line descriptions like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found any performance related issues.
However, set_attribute()
s memory usage looks suspicious.
Update: #3920 (comment)
@peterwilsoncc you should find several tests for script tag processing, including the case you mentioned, in |
Changes requested have been resolved. Dismissing this review to unblock the PR's progress.
Status Update: Following up on #3920 (comment).
What's left for it to be committed?
My weekend starts soon (well as soon as I clear my backport review and commit queue - not including this PR). I'll be back on Monday. The decision to commit this PR before Monday is in @azaozz and @felixarntz hands. @azaozz @felixarntz IF there's consensus to commit it, then please commit it as soon as it's approved. Why? Other work is dependent upon this PR. So IF it's going into 6.2, sooner is better than later. Thank you everyone for contributions, patience, and expertise! cc @ntsekouras and @Mamaduka to keep you aware. |
For transparency, I was asked today: Here's my response:
|
I ran a performance test for this via #3971, a PR purely intended for testing, which is directly based on this PR here, but also includes relevant usage from #3914 and dmsnell#1 (see 8d7ebfb). TLDR: The performance impact appears to be negligible, which is a great outcome. There is no clear tendency of increase or decline in performance. Overall the data, if anything, indicates a this PR being a tiny bit faster than The test is based on the Gutenberg demo content, tested with the last three default themes, using the median from 20 requests in each scenario (#3971 vs See this spreadsheet for the data and also notes on how exactly the test was conducted. So I believe this should be unblocked now. Thanks everyone for the patience, I am really glad to see this positive outcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for the patience and continued iterations on this. Following the previous feedback from others being addressed and performance not looking to be a problem, I am happy to approve this.
Of course that performance analysis is a lab analysis and therefore not entirely comprehensive (performance may differ depending on which / how many blocks are used), but the impact with the relatively large Gutenberg demo content is so small that I don't think there's anything to be worried about in terms of performance. 👍
We will still need to commit the actual usage of this API (see e.g. #3914), but we can definitely get this in and then go from there.
What is the performance regression test I've asked for? For blocks or features that are in Core now (<=6.1) but will change to use this API, what is the performance impact when users upgrade to 6.2? Why? The data collected from a regression test is needed to know what the impact will be to users upgrading to 6.2. That information will then help with decision making about if that impact is acceptable or not. There are many benefits to this API. For example, it will make maintaining string manipulation much easier and understandable. It will help resolve the challenges of working with HTML markup in PHP (whitespaces, malformed, etc.), though these challenges have always existed. That said, it also has be performant to not cause issues for users. |
Note to whoever commits this: I've posted a commit message in a details tag in the description at the top. A couple typos are fixed from the |
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. More information: #3920. Props: antonvlasenko, bernhard-reiter, costdev, dmsnell, felixarntz, gziolo, hellofromtonya, zieladam, flixos90, ntsekouras, peterwilsoncc, swissspidy, andrewserong, onemaggie, get_dave, aristath, scruffian, justlevine, andraganescu, noisysocks, dlh, soean, cbirdsong, revgeorge, azaozz. Fixes #57575. git-svn-id: https://develop.svn.wordpress.org/trunk@55203 602fd350-edb4-49c9-b593-d223f7449a82
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. More information: WordPress/wordpress-develop#3920. Props: antonvlasenko, bernhard-reiter, costdev, dmsnell, felixarntz, gziolo, hellofromtonya, zieladam, flixos90, ntsekouras, peterwilsoncc, swissspidy, andrewserong, onemaggie, get_dave, aristath, scruffian, justlevine, andraganescu, noisysocks, dlh, soean, cbirdsong, revgeorge, azaozz. Fixes #57575. Built from https://develop.svn.wordpress.org/trunk@55203 git-svn-id: http://core.svn.wordpress.org/trunk@54736 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Committed to core: https://core.trac.wordpress.org/changeset/55203. |
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. More information: WordPress/wordpress-develop#3920. Props: antonvlasenko, bernhard-reiter, costdev, dmsnell, felixarntz, gziolo, hellofromtonya, zieladam, flixos90, ntsekouras, peterwilsoncc, swissspidy, andrewserong, onemaggie, get_dave, aristath, scruffian, justlevine, andraganescu, noisysocks, dlh, soean, cbirdsong, revgeorge, azaozz. Fixes #57575. Built from https://develop.svn.wordpress.org/trunk@55203 git-svn-id: https://core.svn.wordpress.org/trunk@54736 1a063a9b-81f0-0310-95a4-ce76da25c4cd
This is amazing, I really appreciate everyone's collaboration on this one ❤️ |
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. More information: WordPress/wordpress-develop#3920. Props: antonvlasenko, bernhard-reiter, costdev, dmsnell, felixarntz, gziolo, hellofromtonya, zieladam, flixos90, ntsekouras, peterwilsoncc, swissspidy, andrewserong, onemaggie, get_dave, aristath, scruffian, justlevine, andraganescu, noisysocks, dlh, soean, cbirdsong, revgeorge, azaozz. Fixes #57575. Built from https://develop.svn.wordpress.org/trunk@55203 git-svn-id: http://core.svn.wordpress.org/trunk@54736 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Status
❓ Is it acceptable to use the
u
Unicode flag in a PCRE pattern in Core?^^^
This is the last remaining pertinent question form the author from a design perspective; it should be answered before a merge.
Draft commit message
Description
This commit pulls in the HTML Tag Processor from the Gutenbeg repository (pulled from WordPress/gutenberg@98ce5c5
). The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags.
Introduced originally in WordPress/gutenberg#42485 and in https://core.trac.wordpress.org/ticket/56299 and developed within the Gutenberg repository, this HTML parsing system was built in order to address a persistent need (properly modifying HTML tag attributes) and was motivated after a sequence of block editor defects which stemmed from mismatches between actual HTML code and expectations for HTML input running through existing naive string-search-based solutions.
The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system, and only allocates memory as changes to the input HTML document are enqueued, releasing that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via
get_updated_html()
.Rigor has been taken to ensure that the Tag Processor will not be consfused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within
<title>
,<textarea>
, and<script>
tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags.The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags and it has no ability to determine which tag openers and tag closers belong to each other, or determine the nesting depth of a given tag.
It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark it's possible to back up and continue processsing again content that has already been traversed.
Attribute values are sanitized with
esc_attr()
and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors wishing to rely on the Tag Processor therefore are free to pass around data as normal strings.Convenience methods for adding and removing CSS class names exist in order to remove the need to process the
class
attribute.The Tag Processor is intended to be a reliable low-level library for traversing HTML documents and higher-level APIs are to be built upon it. Immediately, and in Core Gutenberg blocks it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements.
See the following for examples of such replacement:
Co-Authored-By: Adam Zielinski [email protected]
Co-Authored-By: Bernie Reiter [email protected]
Co-Authored-By: Grzegorz Ziolkowski [email protected]
Trac ticket: https://core.trac.wordpress.org/ticket/57575