Handling of "misnested" formatting tags does not match standard HTML behavior #1075

vassudanagunta · 2022-01-13T07:06:59Z

Given the following input:

plain <b>bold <i>italic bold </b>italic </i>plain

htmlparser2 generates events equivalent to the following:

plain <b>bold <i>italic bold </i></b>italic plain

whereas the HTML5 spec (and de facto HTML behavior pre HTML5) interprets it as the following:

plain <b>bold <i>italic bold </i></b><i>italic </i>plain

You can confirm this behavior by opening the attached file in your browser and then looking at the rendered results as well as inspecting the DOM. This is also specified by the HTML Living Standard: 13.2.10.1 Misnested tags: <b><i></b></i>. See also https://stackoverflow.com/a/8766163/8910547

expected behavior

The current behavior with the following changes:

Generate an implied <i> open tag event between the </b> close tag and "italic " text events.
Do NOT skip the </i> close event between the "italic " and "plain" text events.
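
To make the two requested changes concrete, here is a minimal, self-contained sketch of the difference. It is not htmlparser2's code and not the spec's full adoption agency algorithm: the tokenizer only understands attribute-less tags, and the `FORMATTING` set, the `ParseEvent` type, and the `parseEvents` and `serialize` functions are hypothetical names made up for illustration. With `reopenFormatting` set to `false` it closes inner elements implicitly and drops the stray close tag (the output reported above); with `true` it remembers formatting elements that were closed early and reopens them before the next text or open tag.

```typescript
// Hypothetical event shape; htmlparser2's real callbacks differ.
type ParseEvent =
  | { type: "open"; name: string; implied: boolean }
  | { type: "close"; name: string; implied: boolean }
  | { type: "text"; data: string };

// Assumption: a small, illustrative subset of formatting elements.
const FORMATTING = new Set(["b", "i", "em", "strong", "u", "s", "code"]);

function parseEvents(html: string, reopenFormatting: boolean): ParseEvent[] {
  const events: ParseEvent[] = [];
  const stack: string[] = []; // currently open elements, outermost first
  let pending: string[] = []; // formatting elements waiting to be reopened

  // Reopen formatting elements that were closed early, outermost first.
  const reopenPending = (): void => {
    for (const name of pending) {
      stack.push(name);
      events.push({ type: "open", name, implied: true });
    }
    pending = [];
  };

  // Toy tokenizer: attribute-less tags and text only; anything else is skipped.
  const tokenPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(html)) !== null) {
    const text = match[3];
    if (text !== undefined) {
      reopenPending();
      events.push({ type: "text", data: text });
      continue;
    }
    const name = (match[2] ?? "").toLowerCase();
    if (match[1] !== "/") {
      reopenPending();
      stack.push(name);
      events.push({ type: "open", name, implied: false });
      continue;
    }
    const index = stack.lastIndexOf(name);
    if (index === -1) {
      // Close tag for an element that is not open: drop it, and cancel a
      // pending reopen of the same name so no empty element is emitted.
      const pendingIndex = pending.lastIndexOf(name);
      if (pendingIndex !== -1) pending.splice(pendingIndex, 1);
      continue;
    }
    // Implicitly close everything nested inside the matched element.
    const inner = stack.splice(index + 1);
    stack.pop();
    for (const innerName of [...inner].reverse()) {
      events.push({ type: "close", name: innerName, implied: true });
    }
    events.push({ type: "close", name, implied: false });
    if (reopenFormatting) {
      pending = [...inner.filter((n) => FORMATTING.has(n)), ...pending];
    }
  }
  // Close whatever is still open at end of input.
  for (const name of stack.reverse()) {
    events.push({ type: "close", name, implied: true });
  }
  return events;
}

function serialize(events: ParseEvent[]): string {
  return events
    .map((e) =>
      e.type === "text" ? e.data : e.type === "open" ? `<${e.name}>` : `</${e.name}>`
    )
    .join("");
}

const input = "plain <b>bold <i>italic bold </b>italic </i>plain";
console.log(serialize(parseEvents(input, false)));
// plain <b>bold <i>italic bold </i></b>italic plain
console.log(serialize(parseEvents(input, true)));
// plain <b>bold <i>italic bold </i></b><i>italic </i>plain
```

The sketch ignores attributes, block elements, `<template>`, and every other special case in the spec, so it only shows the shape of the change being asked for, not a complete fix.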


47-misnested-formatting-tags.json demonstrates and covers Issue fb55#1075. 48-misnested-formatting-and-block-tags.json demonstrates and covers both Issue fb55#1075 and Issue fb55#1076, and their possible interaction.

vassudanagunta · 2022-04-02T14:48:49Z

@fb55 What's your view of this issue?

I have a library built on top of htmlparser2 that I'd like to release to the public. It normalizes HTML with a focus on facilitating testing of HTML-producing code. All the existing libraries for normalizing HTML are very inadequate. I think what I've created is much needed by the community. A critical guarantee of the library is that the normalized HTML is semantically equivalent to the original. Right now the guarantee fails because of what I've reported above. I would think that htmlparser2 also has the same guarantee: that the parse events it produces are semantically equivalent to the input.

If you agree but simply don't have the time, then I can possibly implement the needed logic, but I'd need your guidance. I'm also very busy and I don't have the time to "reverse engineer" the architecture so that I can figure out the proper place to do it.

fb55 · 2022-04-02T16:00:39Z

Hi @vassudanagunta, sorry for not responding earlier. Yes, this is a shortcoming for htmlparser2. Unfortunately, doing this properly is both quite complicated and bad for performance, and I consider it out of scope for htmlparser2.

I recently joined the parse5 project as a maintainer. parse5 is a spec-compliant HTML parser. Parsing a document takes ~2.5x the time it takes htmlparser2, but all of HTML's weirdnesses are covered. It also features a DOM adapter that produces the same DOM as htmlparser2. You might want to have a look at that project as a basis for your library.

vassudanagunta · 2022-04-02T19:12:57Z

Hey @fb55, no worries, I wasn't expecting a quick answer. We are all so busy!

Are you saying that fixing this particular issue would be too complicated and bad for performance? At first glance it seems to me like it should be easy to solve efficiently; I'd just need guidance on where to make the change to save me some time. But of course my gut sense can be entirely wrong. Or are you saying that this issue is but one case in a set of many other cases where HTML compliance is off, that there is no point in fixing this case without fixing those cases, and that it's the whole set of cases that is out of scope?

I'll migrate to parse5 if that's what I should do, but I want to make sure that there isn't an easy path that's good for me and good for htmlparser2 first.

Totally fine to give me a very short sentence answer :)

fb55 · 2022-04-02T22:08:53Z

There are all sorts of edge cases that would have to be taken care of. E.g. <template> tags should stop tags from closing other tags. Then there is the issue of foster-parenting, where tags are moved because they don't fit in the current context. And there are a lot of other weirdnesses in HTML that I'd like to avoid dealing with.

vassudanagunta · 2022-04-02T23:25:03Z

Gotcha. Thanks. Parse5 it is.


vassudanagunta mentioned this issue Jan 13, 2022

chore(tests): Add test cases for misnested tags #1077

Closed

vassudanagunta added a commit to vassudanagunta/htmlparser2 that referenced this issue Apr 3, 2022

docs(readme): Add message in README steering people to parse5 if they…

77712d7

… need HTML5 compliance. Fixes fb55#1075

vassudanagunta mentioned this issue Apr 3, 2022

docs(readme): Add message in README steering people to parse5 if they need HTML5 compliance #1147

Merged

fb55 closed this as completed in #1147 Apr 23, 2022

fb55 added the wontfix Out of scope for the project label Apr 23, 2022
