Handling of "misnested" formatting tags does not match standard HTML behavior #1075

Closed
vassudanagunta opened this issue Jan 13, 2022 · 5 comments · Fixed by #1147
Labels
wontfix Out of scope for the project

Comments

@vassudanagunta
Contributor

vassudanagunta commented Jan 13, 2022

Given the following input:

plain <b>bold <i>italic bold </b>italic </i>plain

htmlparser2 generates events equivalent to the following:

plain <b>bold <i>italic bold </i></b>italic plain
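
The event sequence above can be reproduced with a small stack model. This is only a sketch of the behavior reported here, not htmlparser2's actual code: the `Token` type and the `simulateReportedBehavior` function are hypothetical names made up for illustration. It assumes that a close tag closes every element opened after its match, and that a close tag with no open match produces no event.

```typescript
// Hypothetical token shape for illustration; not htmlparser2's API.
type Token =
  | { kind: "open"; name: string }
  | { kind: "close"; name: string }
  | { kind: "text"; data: string };

// Simplified model of the behavior described in this issue.
function simulateReportedBehavior(tokens: Token[]): string {
  const stack: string[] = []; // names of currently open elements
  let out = "";
  for (const t of tokens) {
    if (t.kind === "text") {
      out += t.data;
    } else if (t.kind === "open") {
      stack.push(t.name);
      out += `<${t.name}>`;
    } else {
      const pos = stack.lastIndexOf(t.name);
      if (pos === -1) continue; // stray close tag: no event is emitted
      // Close everything opened after the match (innermost first), then the match.
      while (stack.length > pos) {
        out += `</${stack.pop()}>`;
      }
    }
  }
  return out;
}

const misnestedInput: Token[] = [
  { kind: "text", data: "plain " },
  { kind: "open", name: "b" },
  { kind: "text", data: "bold " },
  { kind: "open", name: "i" },
  { kind: "text", data: "italic bold " },
  { kind: "close", name: "b" },
  { kind: "text", data: "italic " },
  { kind: "close", name: "i" },
  { kind: "text", data: "plain" },
];

const reported = simulateReportedBehavior(misnestedInput);
// reported === "plain <b>bold <i>italic bold </i></b>italic plain"
```

In this model the `</b>` closes the still-open `<i>` along the way, so the later `</i>` finds nothing to match and is dropped.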

whereas the HTML5 spec (and de facto HTML behavior pre-HTML5) interprets it as the following:

plain <b>bold <i>italic bold </i></b><i>italic </i>plain

You can confirm this behavior by opening the attached file in your browser and then looking at the rendered results as well as inspecting the DOM. This is also specified by the HTML Living Standard: 13.2.10.1 Misnested tags: <b><i></b></i>. See also https://stackoverflow.com/a/8766163/8910547

expected behavior

The current behavior with the following changes:

  1. Generate an implied <i> open tag event between the </b> close tag and "italic " text events.
  2. Do NOT skip the </i> close event between the "italic " and "plain" text events.
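
A rough sketch of the two changes, written against a hypothetical token stream rather than htmlparser2's internals (the `Token` type, `isFormatting`, and `reopenMisnestedFormatting` are illustrative names, not existing APIs). It assumes only the simple case from this issue: when a formatting element's close tag implicitly closes other formatting elements, those are re-opened right after it. It is not the spec's full adoption agency algorithm and ignores block elements, `<template>`, and foster-parenting.

```typescript
// Hypothetical token shape for illustration; not htmlparser2's API.
type Token =
  | { kind: "open"; name: string }
  | { kind: "close"; name: string }
  | { kind: "text"; data: string };

// Formatting elements as listed in the HTML Standard's parsing section.
const FORMATTING_NAMES = [
  "a", "b", "big", "code", "em", "font", "i",
  "nobr", "s", "small", "strike", "strong", "tt", "u",
];

function isFormatting(name: string): boolean {
  return FORMATTING_NAMES.indexOf(name) !== -1;
}

function reopenMisnestedFormatting(tokens: Token[]): string {
  const stack: string[] = []; // names of currently open elements
  let out = "";
  for (const t of tokens) {
    if (t.kind === "text") {
      out += t.data;
      continue;
    }
    if (t.kind === "open") {
      stack.push(t.name);
      out += `<${t.name}>`;
      continue;
    }
    const pos = stack.lastIndexOf(t.name);
    if (pos === -1) continue; // stray close tag with no open match
    // Elements opened after the match, outermost first.
    const implicitlyClosed = stack.splice(pos + 1);
    stack.pop(); // the matching element itself
    for (let i = implicitlyClosed.length - 1; i >= 0; i--) {
      out += `</${implicitlyClosed[i]}>`;
    }
    out += `</${t.name}>`;
    // Change 1: re-open implicitly closed formatting elements.
    if (isFormatting(t.name)) {
      for (const name of implicitlyClosed) {
        if (isFormatting(name)) {
          stack.push(name);
          out += `<${name}>`;
        }
      }
    }
    // Change 2 follows for free: the later </i> now finds an open <i>.
  }
  return out;
}

const reopened = reopenMisnestedFormatting([
  { kind: "text", data: "plain " },
  { kind: "open", name: "b" },
  { kind: "text", data: "bold " },
  { kind: "open", name: "i" },
  { kind: "text", data: "italic bold " },
  { kind: "close", name: "b" },
  { kind: "text", data: "italic " },
  { kind: "close", name: "i" },
  { kind: "text", data: "plain" },
]);
// reopened === "plain <b>bold <i>italic bold </i></b><i>italic </i>plain"
```

Because the re-opened `<i>` is pushed back onto the stack, the trailing `</i>` is matched normally and no special case is needed to stop it from being skipped.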
vassudanagunta added a commit to vassudanagunta/htmlparser2 that referenced this issue Jan 13, 2022
47-misnested-formatting-tags.json demonstrates and covers Issue fb55#1075.

48-misnested-formatting-and-block-tags.json demonstrates and covers both
Issue fb55#1075 and Issue fb55#1076, and their possible interaction.
@vassudanagunta
Contributor Author

@fb55 What's your view of this issue?

I have a library built on top of htmlparser2 that I'd like to release to the public. It normalizes HTML with a focus on facilitating testing of HTML-producing code. All the existing libraries for normalizing HTML are inadequate, and I think what I've created is much needed by the community. A critical guarantee of the library is that the normalized HTML is semantically equivalent to the original. Right now that guarantee fails because of what I've reported above. I would think that htmlparser2 also has the same guarantee: that the parse events it produces are semantically equivalent to the input.

If you agree but simply don't have the time, then I can possibly implement the needed logic, but I'd need your guidance. I'm also very busy and I don't have the time to "reverse engineer" the architecture so that I can figure out the proper place to do it.

@fb55
Owner

fb55 commented Apr 2, 2022

Hi @vassudanagunta, sorry for not responding earlier. Yes, this is a shortcoming for htmlparser2. Unfortunately, doing this properly is both quite complicated and bad for performance, and I consider it out of scope for htmlparser2.

I recently joined the parse5 project as a maintainer. parse5 is a spec-compliant HTML parser. Parsing a document takes ~2.5x the time it takes htmlparser2, but all of HTML's weirdnesses are covered. It also features a DOM adapter that produces the same DOM as htmlparser2. You might want to have a look at that project as a basis for your library.

@vassudanagunta
Contributor Author

vassudanagunta commented Apr 2, 2022

Hey @fb55, no worries, I wasn't expecting a quick answer. We are all so busy!

Are you saying that fixing this particular issue would be too complicated and bad for performance? At first glance it seems to me like it should be easy to solve efficiently; I'd just need guidance on where to make the change to save me some time. But of course my gut sense can be entirely wrong. Or are you saying that this issue is but one case in a set of many other cases where HTML compliance is off, that there is no point in fixing this case without fixing those cases, and that it's the whole set of cases that is out of scope?

I'll migrate to parse5 if that's what I should do, but I want to make sure first that there isn't an easy path that's good for me and good for htmlparser2.

Totally fine to give me a very short sentence answer :)

@fb55
Owner

fb55 commented Apr 2, 2022

There are all sorts of edge cases that would have to be taken care of. E.g., <template> tags should stop tags from closing other tags. Then there is the issue of foster-parenting, where tags are moved because they don't fit in the current context. And there are a lot of other weirdnesses in HTML that I'd like to avoid dealing with.

@vassudanagunta
Contributor Author

Gotcha. Thanks. Parse5 it is.

vassudanagunta added a commit to vassudanagunta/htmlparser2 that referenced this issue Apr 3, 2022
@fb55 fb55 added the wontfix Out of scope for the project label Apr 23, 2022