Validation: Detect block validity by tokenizing content #2210
Conversation
Tested this branch.
It fixes the first two use cases:
- https://wordpress.slack.com/archives/C02QB2JS7/p1501782622676845
- Provide options for invalid block resolution #2132 (comment)
For the third one, there are missing p tags; technically, the HTML strings are different.
Let's ship it
Yep, noted as well in Caveats.
Incompatibility with Jest coverage istanbuljs/babel-plugin-istanbul#116
Allows easier testing of isEquivalentHTML in isolation
Codecov Report
@@ Coverage Diff @@
## master #2210 +/- ##
=========================================
+ Coverage 23.11% 23.9% +0.79%
=========================================
Files 141 142 +1
Lines 4405 4484 +79
Branches 747 763 +16
=========================================
+ Hits 1018 1072 +54
- Misses 2856 2877 +21
- Partials 531 535 +4
Continue to review full report at Codecov.
@aduth did you compare this against lexing the HTML tags in the parser?
@dmsnell This was on my mind over the weekend. I think there's certainly some overlap in effort between the parser and this, and hopefully we can make some strides in simplifying the two. Your advice is welcome! (This is an area where I've generally allowed others with a better understanding of the fundamentals to take the helm 😉 ) Speaking strictly of the requirements of this implementation:
To the last point, it seemed that in order to achieve this, validation would need to occur very early in a parse, during lexical analysis / tokenization ("give me the next relevant token of each input, and let's compare"). My assumption was that we would not need the full syntax tree (children, balanced nodes), and that in fact it could be a hindrance to comparison (instead, compare tokens in a flat array). Although in retrospect, I'm curious whether there are failing cases because we don't perform any checks on the closing tag. Do you envision any flaws with this approach? Any ways we could leverage the existing grammar to achieve these goals?
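The flat token-stream comparison described above might be sketched roughly like this (the regex tokenizer, the token shape, and the early bail are illustrative assumptions for the sake of the sketch, not this PR's actual implementation):

```javascript
// Hypothetical flat tokenizer: yields tag and text tokens one at a time,
// skipping whitespace-only text so formatting differences don't register.
function* tokenize( html ) {
	const pattern = /<[^>]+>|[^<]+/g;
	let match;
	while ( ( match = pattern.exec( html ) ) ) {
		const chunk = match[ 0 ];
		if ( chunk.startsWith( '<' ) ) {
			// Collapse internal whitespace so "<div  class>" ~ "<div class>".
			yield { type: 'Tag', value: chunk.replace( /\s+/g, ' ' ) };
		} else if ( chunk.trim() ) {
			yield { type: 'Text', value: chunk.trim() };
		}
	}
}

// Compare token-by-token over the flat streams, returning false at the
// first mismatch (early bail) rather than building any syntax tree.
function isEquivalentHTML( a, b ) {
	const tokensA = tokenize( a );
	const tokensB = tokenize( b );
	while ( true ) {
		const nextA = tokensA.next();
		const nextB = tokensB.next();
		if ( nextA.done || nextB.done ) {
			return nextA.done === nextB.done;
		}
		if ( nextA.value.type !== nextB.value.type ||
				nextA.value.value !== nextB.value.value ) {
			return false;
		}
	}
}
```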
Well, the PEG parser doesn't have separate lexing and parsing stages, so it would come altogether. The original parser did this lexing, for this and other reasons. I think if we build it there we automatically get this notion of equivalence: syntax changes are irrelevant, while text nodes and attribute values constitute actual changes.
One of the other reasons I've wanted to put it back in for so long is that we can finally break out of text-only storage and choose to "serialize" by saving the JSON data directly in some other mechanism. For example, maybe we write a plugin to store content in Simperium, or the mobile apps don't want to mess with HTML. Validation in the app can look at the structure directly and know that if we get an extra space, it was a space in a "text node" and not spacing between HTML tags, etc…
Also, because of this lexing, all of the self-closing/self-closable tags can be normalized in the Gutenberg data structure and not introduce odd cases to validate. E.g., is "" equal to ""?
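Normalizing self-closing spellings during lexing could amount to canonicalizing each raw tag token before comparison; a hedged sketch (the normalizeTagToken name and the regex are illustrative, not Gutenberg code):

```javascript
// Canonicalize a raw tag token: lowercase the tag name, collapse attribute
// whitespace, and drop the optional trailing "/" so that "<br>", "<br/>",
// and "<BR />" all normalize to the same string before comparison.
function normalizeTagToken( rawTag ) {
	const match = rawTag.match( /^<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9-]*)([\s\S]*?)\/?\s*>$/ );
	if ( ! match ) {
		return rawTag; // Not a recognizable tag; leave untouched.
	}
	const [ , closer, name, attrs ] = match;
	const normalizedAttrs = attrs.trim()
		? ' ' + attrs.trim().replace( /\s+/g, ' ' )
		: '';
	return '<' + closer + name.toLowerCase() + normalizedAttrs + '>';
}
```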
Related: #2197, #2132, #1929
This pull request tries an alternate strategy for block validation. Previously we would compare a "beautified" copy of the block from post content against the result of running it back through its serializer using parsed attributes. The approach here instead tries to determine whether there is an effective difference between two HTML strings by using an HTML tokenizer and comparing the token objects generated from each.
In general, this improves the leniency of the comparison, particularly around whitespace, self-closing tags, whitespace within tags, class attribute value ordering, and style attribute values and ordering.
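The class and style leniency mentioned above boils down to order-insensitive comparison of those attribute values; a rough sketch under that assumption (the function names are illustrative, not this PR's API):

```javascript
// Compare class attribute values ignoring ordering and extra whitespace.
function isEquivalentClass( a, b ) {
	const toSet = ( value ) => new Set( value.trim().split( /\s+/ ).filter( Boolean ) );
	const setA = toSet( a );
	const setB = toSet( b );
	return setA.size === setB.size && [ ...setA ].every( ( name ) => setB.has( name ) );
}

// Compare style attribute values ignoring declaration ordering and spacing.
function isEquivalentStyle( a, b ) {
	const toMap = ( value ) => new Map(
		value.split( ';' )
			.map( ( decl ) => decl.split( ':' ).map( ( part ) => part.trim() ) )
			.filter( ( [ prop, val ] ) => prop && val )
	);
	const mapA = toMap( a );
	const mapB = toMap( b );
	return mapA.size === mapB.size &&
		[ ...mapA ].every( ( [ prop, val ] ) => mapB.get( prop ) === val );
}
```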
Some false negatives can still occur, largely around whitespace, because whitespace handling in HTML is complicated.
The tokenizer is not a spec-compliant validator, nor do I think we need one. It is relatively light-weight (especially compared to other options like parse5), and can be extended to observe events while tokenizing. Future refactoring could introduce optimizations such as early bailing on mismatched tokens (i.e. fail early).
Testing instructions:
Verify that all blocks in demo content are valid.
Verify that modifying post-content.js to cause a block to become invalid (e.g. adding/removing a class or attribute) still causes the "Invalid Block" warning to appear. See also #1929.
Verify that some previously erroneously-invalid content is now considered valid:
Ensure unit tests pass:
Caveats:
Differences in content caused by removep (autop) behavior are still not accounted for. We may want to consider building this into the parser?
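For context on that caveat: the autop/removep filters convert between blank-line-separated text and p-wrapped markup, so stored content and serializer output can legitimately differ by paragraph tags alone. A toy illustration of the transform pair (not WordPress's actual implementation):

```javascript
// Toy autop: wrap double-newline-separated chunks in <p> tags.
function autop( text ) {
	return text
		.split( /\n{2,}/ )
		.map( ( chunk ) => chunk.trim() )
		.filter( Boolean )
		.map( ( chunk ) => '<p>' + chunk + '</p>' )
		.join( '\n' );
}

// Toy removep: the inverse, turning <p> wrappers back into blank-line breaks.
function removep( html ) {
	return html
		.replace( /<p>/g, '' )
		.replace( /<\/p>\s*/g, '\n\n' )
		.trim();
}
```

A tokenizer comparing `autop( content )` against stored markup would see a real Tag difference here, which is why this class of mismatch needs handling elsewhere (e.g. in the parser).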