Validation: Detect block validity by tokenizing content #2210
Conversation
Tested this branch.
It fixes the first two use cases:
- https://wordpress.slack.com/archives/C02QB2JS7/p1501782622676845
- Provide options for invalid block resolution #2132 (comment)
For the third one, there are missing p tags; technically, the HTML strings are different.
Let's ship it
Yep, noted as well in Caveats.
Incompatibility with Jest coverage istanbuljs/babel-plugin-istanbul#116
Allows easier testing of isEquivalentHTML in isolation
Codecov Report
@@ Coverage Diff @@
## master #2210 +/- ##
=========================================
+ Coverage 23.11% 23.9% +0.79%
=========================================
Files 141 142 +1
Lines 4405 4484 +79
Branches 747 763 +16
=========================================
+ Hits 1018 1072 +54
- Misses 2856 2877 +21
- Partials 531 535 +4
Continue to review full report at Codecov.
@aduth did you compare this against lexing the HTML tags in the parser?
@dmsnell This was on my mind over the weekend. I think there's certainly some overlap in effort between the parser and this, and hopefully we can make some strides in simplifying the two. Your advice is welcome! (This is an area where I've generally allowed others with a better understanding of the fundamentals to take the helm 😉 ) Speaking strictly of the requirements of this implementation:
To the last point, it seemed that in order to achieve this, validation would need to occur very early in a parse, during lexical analysis / tokenization ("give me the next relevant token of each input, and let's compare"). My assumption was that we would not need the full syntax tree (children, balanced nodes), and that in fact it could be a hindrance to comparison (instead, compare tokens in a flat array). Although in retrospect, I'm curious whether there are failing cases because we don't perform any checks on the closing tag. Do you envision any flaws with this approach? Any ways we could leverage the existing grammar to achieve these goals?
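The flat token-stream comparison described above might be sketched roughly like this (the regex tokenizer, the token shape, and the early bail are illustrative assumptions for the sake of the sketch, not this PR's actual implementation):

```javascript
// Hypothetical flat tokenizer: yields tag and text tokens one at a time,
// skipping whitespace-only text so formatting differences don't register.
function* tokenize( html ) {
	const pattern = /<[^>]+>|[^<]+/g;
	let match;
	while ( ( match = pattern.exec( html ) ) ) {
		const chunk = match[ 0 ];
		if ( chunk.startsWith( '<' ) ) {
			// Collapse internal whitespace so "<div  class>" ~ "<div class>".
			yield { type: 'Tag', value: chunk.replace( /\s+/g, ' ' ) };
		} else if ( chunk.trim() ) {
			yield { type: 'Text', value: chunk.trim() };
		}
	}
}

// Compare token-by-token over the flat streams, returning false at the
// first mismatch (early bail) rather than building any syntax tree.
function isEquivalentHTML( a, b ) {
	const tokensA = tokenize( a );
	const tokensB = tokenize( b );
	while ( true ) {
		const nextA = tokensA.next();
		const nextB = tokensB.next();
		if ( nextA.done || nextB.done ) {
			return nextA.done === nextB.done;
		}
		if ( nextA.value.type !== nextB.value.type ||
				nextA.value.value !== nextB.value.value ) {
			return false;
		}
	}
}
```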
Well, the PEG parser doesn't have separate lexing and parsing stages, so it would come altogether. The original parser did this lexing, for this and other reasons. I think if we build it there we automatically get this notion of equivalence: syntax changes are irrelevant, while text nodes and attribute values constitute actual changes.
One of the other reasons I've wanted to put it back in for so long is that we can finally break out of text-only storage and choose to "serialize" by saving the JSON data directly in some other mechanism. For example, maybe we write a plugin to store content in Simperium, or the mobile apps don't want to mess with HTML. Validation in the app can look at the structure directly and know that if we get an extra space, it was a space in a "text node" and not spacing between HTML tags, etc…
Also, because of this lexing, all of the self-closing/self-closable tags can be normalized in the Gutenberg data structure and not introduce odd cases to validate. E.g., is "" equal to ""?
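Normalizing self-closing spellings during lexing could amount to canonicalizing each raw tag token before comparison; a hedged sketch (the normalizeTagToken name and the regex are illustrative, not Gutenberg code):

```javascript
// Canonicalize a raw tag token: lowercase the tag name, collapse attribute
// whitespace, and drop the optional trailing "/" so that "<br>", "<br/>",
// and "<BR />" all normalize to the same string before comparison.
function normalizeTagToken( rawTag ) {
	const match = rawTag.match( /^<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9-]*)([\s\S]*?)\/?\s*>$/ );
	if ( ! match ) {
		return rawTag; // Not a recognizable tag; leave untouched.
	}
	const [ , closer, name, attrs ] = match;
	const normalizedAttrs = attrs.trim()
		? ' ' + attrs.trim().replace( /\s+/g, ' ' )
		: '';
	return '<' + closer + name.toLowerCase() + normalizedAttrs + '>';
}
```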
Related: #2197, #2132, #1929
This pull request tries an alternate strategy for block validation. Previously we would compare a "beautified" copy of the block from post content against the result of running it back through its serializer using parsed attributes. The approach here instead tries to determine whether there is an effective difference between two HTML strings by using an HTML tokenizer and comparing the token objects generated from each.
In general, this improves the leniency of the comparison, particularly around whitespace, self-closing tags, whitespace within tags, class attribute value ordering, and style attribute values and ordering.
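The class and style leniency mentioned above boils down to order-insensitive comparison of those attribute values; a rough sketch under that assumption (the function names are illustrative, not this PR's API):

```javascript
// Compare class attribute values ignoring ordering and extra whitespace.
function isEquivalentClass( a, b ) {
	const toSet = ( value ) => new Set( value.trim().split( /\s+/ ).filter( Boolean ) );
	const setA = toSet( a );
	const setB = toSet( b );
	return setA.size === setB.size && [ ...setA ].every( ( name ) => setB.has( name ) );
}

// Compare style attribute values ignoring declaration ordering and spacing.
function isEquivalentStyle( a, b ) {
	const toMap = ( value ) => new Map(
		value.split( ';' )
			.map( ( decl ) => decl.split( ':' ).map( ( part ) => part.trim() ) )
			.filter( ( [ prop, val ] ) => prop && val )
	);
	const mapA = toMap( a );
	const mapB = toMap( b );
	return mapA.size === mapB.size &&
		[ ...mapA ].every( ( [ prop, val ] ) => mapB.get( prop ) === val );
}
```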
Some false negatives can still occur, largely around whitespace, because whitespace handling in HTML is complicated.
The tokenizer is not a spec-compliant validator, nor do I think we need one. It is relatively light-weight (especially compared to other options like parse5), and can be extended to observe events while tokenizing. Future refactoring could introduce optimizations such as early bailing on mismatched tokens (i.e. fail early).
Testing instructions:
Verify that all blocks in demo content are valid.
Verify that modifying post-content.js to cause a block to become invalid (e.g. adding/removing a class or attribute) still causes the "Invalid Block" warning to appear. See also #1929.
Verify that some previously erroneously-invalid content is now considered valid:
Ensure unit tests pass:
Caveats:
Differences in content caused by removep (autop) behavior are still not accounted for. We may want to consider building this into the parser?
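For context on that caveat: the autop/removep filters convert between blank-line-separated text and p-wrapped markup, so stored content and serializer output can legitimately differ by paragraph tags alone. A toy illustration of the transform pair (not WordPress's actual implementation):

```javascript
// Toy autop: wrap double-newline-separated chunks in <p> tags.
function autop( text ) {
	return text
		.split( /\n{2,}/ )
		.map( ( chunk ) => chunk.trim() )
		.filter( Boolean )
		.map( ( chunk ) => '<p>' + chunk + '</p>' )
		.join( '\n' );
}

// Toy removep: the inverse, turning <p> wrappers back into blank-line breaks.
function removep( html ) {
	return html
		.replace( /<p>/g, '' )
		.replace( /<\/p>\s*/g, '\n\n' )
		.trim();
}
```

A tokenizer comparing `autop( content )` against stored markup would see a real Tag difference here, which is why this class of mismatch needs handling elsewhere (e.g. in the parser).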