Handling of "misnested" formatting tags does not match standard HTML behavior #1075
Comments
@fb55 What's your view of this issue? I have a library built on top of htmlparser2 that I'd like to release to the public. It normalizes HTML with a focus on facilitating testing of HTML-producing code. All the existing libraries for normalizing HTML are very inadequate. I think what I've created is much needed by the community. A critical guarantee of the library is that the normalized HTML is semantically equivalent to the original. Right now the guarantee fails because of what I've reported above. I would think that htmlparser2 also has the same guarantee, that the parse events it produces are semantically equivalent to the input. If you agree, but you simply don't have the time, then I can possibly implement the needed logic, but I'd need your guidance. I'm also very busy and I don't have the time to "reverse engineer" the architecture so that I can figure out the proper place to do it.
Hi @vassudanagunta, sorry for not responding earlier. Yes, this is a shortcoming of htmlparser2. Unfortunately, doing this properly is both quite complicated and bad for performance, and I consider it out of scope for htmlparser2. I recently joined the parse5 project as a maintainer. parse5 is a spec-compliant HTML parser. Parsing a document takes ~2.5x the time it takes htmlparser2, but all of HTML's weirdnesses are covered. It also features a DOM adapter that produces the same DOM as htmlparser2. You might want to have a look at that project as a basis for your library.
Hey @fb55, no worries, I wasn't expecting a quick answer. We are all so busy! Are you saying that fixing this particular issue would be too complicated and bad for performance? Because at first glance it seems to me like it should be easy to solve efficiently. I'd just need guidance on where to make the change to save me some time, but of course my gut sense can be entirely wrong. Or are you saying that this issue is but one case in a set of many other cases where HTML compliance is off, that there is no point in fixing this case without fixing those cases, and that it's the whole set of cases that is out of scope? I'll migrate to parse5 if that's what I should do, but I want to make sure first that there isn't an easy path that's good for me and good for htmlparser2. Totally fine to give me a very short one-sentence answer :)
There are all sorts of edge cases that would have to be taken care of. Eg.
Gotcha. Thanks. Parse5 it is.
… need HTML5 compliance. Fixes fb55#1075
Given the following input:
htmlparser2 generates events equivalent to the following:
whereas the HTML5 spec (and de facto HTML behavior pre HTML5) interprets it as the following:
You can confirm this behavior by opening the attached file in your browser and then looking at the rendered results as well as inspecting the DOM. This is also specified by the HTML Living Standard: 13.2.10.1 Misnested tags: <b><i></b></i>. See also https://stackoverflow.com/a/8766163/8910547
expected behavior
The current behavior with the following changes:
- <i> open tag event between the </b> close tag and "italic " text events.
- </i> close event between the "italic " and "plain" text events.
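To make the expected event sequence concrete, here is a minimal TypeScript sketch of the idea. It is not htmlparser2's API or implementation: the `ParseEvent` shape, the `fixMisnested` function, and the sample input are all hypothetical, chosen only to mirror the "italic " and "plain" text events mentioned above. It is also a simplification of the HTML Standard's handling of misnested formatting tags, not the full algorithm, and it ignores the many edge cases mentioned in the comments. When a close tag implicitly closes formatting elements that were opened after it, the sketch reopens them right after the close event.

```typescript
// Hypothetical event shape for illustration; not htmlparser2's actual API.
type ParseEvent =
  | { type: "open"; name: string }
  | { type: "close"; name: string }
  | { type: "text"; data: string };

// Formatting elements as listed in the HTML Standard.
const FORMATTING = new Set([
  "a", "b", "big", "code", "em", "font", "i",
  "nobr", "s", "small", "strike", "strong", "tt", "u",
]);

// Takes tags in source order and returns properly nested events.
// Elements still open at the end of input are left unclosed.
function fixMisnested(tokens: ParseEvent[]): ParseEvent[] {
  const out: ParseEvent[] = [];
  const stack: string[] = []; // names of currently open elements

  for (const token of tokens) {
    if (token.type === "text") {
      out.push(token);
      continue;
    }
    if (token.type === "open") {
      stack.push(token.name);
      out.push(token);
      continue;
    }

    const pos = stack.lastIndexOf(token.name);
    if (pos === -1) continue; // stray close tag: ignore it

    // Elements opened after the one being closed are implicitly closed.
    const implied = stack.splice(pos + 1);
    stack.pop();
    for (const name of [...implied].reverse()) {
      out.push({ type: "close", name });
    }
    out.push(token);

    // Reopen the implicitly closed formatting elements, outermost first.
    for (const name of implied) {
      if (FORMATTING.has(name)) {
        stack.push(name);
        out.push({ type: "open", name });
      }
    }
  }
  return out;
}

const render = (e: ParseEvent): string =>
  e.type === "text" ? e.data : e.type === "open" ? `<${e.name}>` : `</${e.name}>`;

// Assumed input: <b>bold <i>bold-italic </b>italic </i>plain
const serialized = fixMisnested([
  { type: "open", name: "b" },
  { type: "text", data: "bold " },
  { type: "open", name: "i" },
  { type: "text", data: "bold-italic " },
  { type: "close", name: "b" },
  { type: "text", data: "italic " },
  { type: "close", name: "i" },
  { type: "text", data: "plain" },
]).map(render).join("");

console.log(serialized);
// prints: <b>bold <i>bold-italic </i></b><i>italic </i>plain
```

In this sketch the reopened `<i>` lands between the `</b>` close event and the "italic " text, and the later `</i>` closes it before "plain", which is the pair of changes listed above.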