Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple issues with numeric entities #33

Closed
raphlinus opened this issue May 6, 2015 · 2 comments · Fixed by #37
Closed

Multiple issues with numeric entities #33

raphlinus opened this issue May 6, 2015 · 2 comments · Fixed by #37

Comments

@raphlinus
Copy link

Single digit decimal entities are sometimes recognized, sometimes not. I believe the issue here is the size > 3 test at https://github.com/jgm/cmark/blob/master/src/houdini_html_u.c#L15. When, for example, 	 appears at the end of a line, size == 3 and the test fails.

Handling of � fails to recognize as an entity. This seems to be out of compliance with the current state of the spec, which asks for all 1-8 digit sequences to be recognized. For this issue, perhaps the spec should be changed, and a separate issue commonmark/commonmark-spec#323 open about handling of NULL.

Invalid Unicode characters are passed through to the final render, without replacement. For example, &#xd800; is rendered as b'<p>\xed\xa0\x80</p>\n'. These should be replaced with U+FFFD at parse time.

Entities with more than 8 digits are interpreted as numeric entities. According to the spec, they should be treated as literal text.

Currently, during parsing of entities, the int codepoint is subject to integer overflow, which is undefined behavior in C (yes, I know this is insane, but when you lie down with C, you get up with UB). A sufficiently smart compiler could optimize away the if (cp < codepoint) test because negative values are impossible. This issue would be mitigated somewhat by using a maximum of 8 digits, but &#x80000000 would still provoke it. My recommendation is to use uint32_t and bail when the number of digits exceeds 8.

@nwellnhof
Copy link
Contributor

These are all good points. I can offer to make the following fixes to houdini_unescape_ent:

  • Accept numeric entities with a single digit.
  • Replace invalid Unicode code points and &#0; with U+FFFD.
  • Address integer overflow. It's enough to abort if the accumulated value exceeds 0x10FFFF.

I'll leave the issue of entities with more than 8 digits open for now.

nwellnhof added a commit to nwellnhof/cmark that referenced this issue May 7, 2015
@nwellnhof
Copy link
Contributor

The pull request also fixes the issue of entities with more than 8 digits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants