Single-digit decimal entities are sometimes recognized, sometimes not. I believe the issue here is the `size > 3` test at https://github.com/jgm/cmark/blob/master/src/houdini_html_u.c#L15. When, for example, `&#9;` appears at the end of a line, `size == 3` and the test fails.
`&#0;` is not recognized as an entity. This seems to be out of compliance with the current state of the spec, which asks for all 1-8 digit sequences to be recognized. For this issue, perhaps the spec should be changed; a separate issue, commonmark/commonmark-spec#323, is open about handling of NULL.
Invalid Unicode characters are passed through to the final render without replacement. For example, an entity for the surrogate U+D800 (such as `&#xD800;`) is rendered as `b'<p>\xed\xa0\x80</p>\n'`. These should be replaced with U+FFFD at parse time.
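To illustrate the kind of replacement meant here, a minimal sketch follows. The helper name `sanitize_codepoint` is hypothetical and is not part of cmark; it only shows the check that could run on a parsed code point before it is encoded as UTF-8. NUL is deliberately left alone, since its handling is the subject of the separate spec issue.

```c
#include <stdint.h>

/* Hypothetical helper (not cmark's API): map a parsed code point to a
 * value that is safe to encode as UTF-8. */
static uint32_t sanitize_codepoint(uint32_t cp)
{
    /* Surrogates U+D800..U+DFFF are not Unicode scalar values. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0xFFFD;

    /* Anything above U+10FFFF is outside the Unicode code space. */
    if (cp > 0x10FFFF)
        return 0xFFFD;

    /* NUL is passed through unchanged here; that case is left to the
     * separate discussion about NULL handling. */
    return cp;
}
```

With a check like this, the U+D800 example would be emitted as U+FFFD rather than as the raw bytes `\xed\xa0\x80`.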
Entities with more than 8 digits are interpreted as numeric entities. According to the spec, they should be treated as literal text.
Currently, during parsing of entities, the `int codepoint` is subject to integer overflow, which is undefined behavior in C (yes, I know this is insane, but when you lie down with C, you get up with UB). A sufficiently smart compiler could optimize away the `if (cp < codepoint)` test because negative values are impossible. This issue would be mitigated somewhat by using a maximum of 8 digits, but an 8-digit hexadecimal entity large enough to exceed `INT_MAX` would still provoke it. My recommendation is to use `uint32_t` and bail when the number of digits exceeds 8.
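As a rough sketch of that recommendation, something like the following could work. The function name `parse_numeric_entity`, its signature, and the `MAX_ENTITY_DIGITS` constant are assumptions made for illustration, not cmark's actual code. It assumes `src` points just past the `&`, accumulates into a `uint32_t`, and rejects a ninth digit so the input falls back to literal text.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the 8-digit limit cited in this issue. */
#define MAX_ENTITY_DIGITS 8

/* Hypothetical helper (not cmark's API). `src` points just past the '&'.
 * Returns the number of bytes consumed, including the ';', or 0 if the
 * input is not a numeric entity and should be treated as literal text. */
static size_t parse_numeric_entity(const unsigned char *src, size_t size,
                                   uint32_t *out)
{
    size_t i = 1;
    size_t digits = 0;
    uint32_t cp = 0;
    int hex = 0;

    /* The shortest valid form is "#9;", which is exactly 3 bytes. */
    if (size < 3 || src[0] != '#')
        return 0;

    if (src[1] == 'x' || src[1] == 'X') {
        hex = 1;
        i = 2;
    }

    for (; i < size; i++) {
        unsigned char c = src[i];
        uint32_t d;

        if (c >= '0' && c <= '9')
            d = (uint32_t)(c - '0');
        else if (hex && c >= 'a' && c <= 'f')
            d = (uint32_t)(c - 'a') + 10;
        else if (hex && c >= 'A' && c <= 'F')
            d = (uint32_t)(c - 'A') + 10;
        else
            break;

        /* Bail on a ninth digit: the sequence is literal text. */
        if (++digits > MAX_ENTITY_DIGITS)
            return 0;

        /* At most 8 digits, so the value is at most 0xFFFFFFFF and the
         * unsigned accumulator cannot wrap. */
        cp = cp * (hex ? 16u : 10u) + d;
    }

    if (digits == 0 || i >= size || src[i] != ';')
        return 0;

    *out = cp;
    return i + 1;
}
```

Because the digit count is capped before the accumulation, no overflow check on the value is needed, and there is nothing for the compiler to optimize away. The `size < 3` guard also accepts the single-digit case described in the first point.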