Numeric character references: Should HTML spec be followed for codes mapping to control characters #765
Comments
Well, first of all, I don’t think the user should do that. No need to encode an en-dash in markdown. And if you do, use CM doesn’t follow all of HTML’s character reference stuff. It doesn’t allow I’d personally be open to doing the same as HTML here. It would mean that every markdown parser has to embed a list though: https://github.com/wooorm/character-reference-invalid/blob/main/index.js. Is it really an improvement to the world when markdown adds workarounds for Windows 1252?
Well, the spec says to do this for invalid code points. But 0096 isn't an invalid code point, is it?
These code points are the surrogates; they can't be represented by themselves in Unicode text or in any of the Unicode transformation formats (e.g. you can't serialize or deserialize a surrogate to/from a valid UTF-8 byte stream). So specifying one as a character reference makes no sense, and given that CommonMark never errors, the only sensible thing to do here is to replace them with the Unicode replacement character. HTML does the same. When I opened #369 I already noticed that this bit of the spec may benefit from some clarification as to which code points exactly you are allowed to specify via numeric character references. Regarding the issue at hand, I don't have a strong opinion, but I suspect the HTML definition is driven by some kind of "no existing document should be broken" constraint, and it would feel slightly odd to replicate that behaviour in CommonMark. Also, I would find it strange to do it for numeric character references but not for literal (raw) occurrences of such characters. But then if we do this for the literals as well, we might lose the ability to directly embed some Unicode text-based formats/sources in code fences. I tend to find it a good property that CommonMark lets in any of the oddest Unicode scalar values (except U+0000), at least in those blocks.
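To make the surrogate point concrete, here is a minimal sketch, assuming the rule is "replace invalid code points and U+0000 with U+FFFD". The function name `decodeNumericReference` is hypothetical and not taken from any CommonMark implementation.

```javascript
// Hypothetical helper: decode the numeric value of a character reference,
// assuming surrogates, out-of-range values and U+0000 become U+FFFD.
function decodeNumericReference(codePoint) {
  const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
  const isOutOfRange = codePoint > 0x10ffff;
  if (codePoint === 0 || isSurrogate || isOutOfRange) {
    return "\uFFFD"; // REPLACEMENT CHARACTER
  }
  return String.fromCodePoint(codePoint);
}

console.log(decodeNumericReference(0xd800) === "\uFFFD"); // true
console.log(decodeNumericReference(0x96).codePointAt(0).toString(16)); // 96
```

Under this reading 0x96 passes through unchanged, because a C1 control character is still a valid Unicode scalar value.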
I'm not too well-versed on this--and I wouldn't use the term "invalid" myself--but I feel/understand that Unicode keeps this space "empty" for compatibility reasons with other encodings, and that HTML only assigns things to it because they'd be "invalid" otherwise.
Basically these "characters" are there because Unicode, to fulfill its duty, decided to project every other existing character encoding into it. In particular the code points I'm not sure if any format actually uses these codes with their purported semantics though. So one thing that CommonMark could do is to replace all the C0 and C1 control characters (the Unicode general category Cc) except TAB, CR and LF with the Unicode replacement character. Because But then sometimes they are not. For example XML is rather liberal, or JSON while it mandates escaping of the So if most
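A rough sketch of that alternative, assuming it would be applied to already-decoded text: every character of general category Cc (U+0000 to U+001F, U+007F, U+0080 to U+009F) except TAB, LF and CR becomes U+FFFD. The name `replaceControlCharacters` is hypothetical.

```javascript
// Hypothetical helper: replace C0 and C1 controls (and DEL) with U+FFFD,
// keeping TAB (U+0009), LF (U+000A) and CR (U+000D).
function replaceControlCharacters(text) {
  const controls = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g;
  return text.replace(controls, "\uFFFD");
}

console.log(replaceControlCharacters("a\u0096b") === "a\uFFFDb"); // true
console.log(replaceControlCharacters("a\tb\r\n") === "a\tb\r\n"); // true
```

Note that this treats the literal character and the decoded reference identically, which is the consistency concern raised in this thread.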
Code blocks won’t be affected, as this issue is about numeric character references.
Depends on whether you also apply the transform on these code points (whether replacement character or surprising HTML behaviour) to the literal (raw) characters that bear the same code point. This should be done in my opinion; otherwise the character escape system becomes surprising and idiosyncratic for no good reason. (That being said, personally I'm rather convinced that the right answer to the title of this issue is: no.)
Hmm, I guess that could be done, but to me seems like it would be a different issue? There’s no such text for that yet in the markdown spec or in HTML: document.body.textContent = 'a\u0096b'; document.body.textContent.codePointAt(1).toString(16) // 96
I don't see a different issue. If you are going to solve this issue by deciding to follow the HTML character mapping only on numeric character references, it's good to also reflect on what your character escape system has effectively become for end users (an inconsistent mess if you ask me).
Ah yes, scratch that part (I wasn't reading the code properly; I do know what surrogates are). GitHub does replace I agree that adding the table adds some complexity and I'm not sure it's worth it for supporting some legacy way of specifying some characters. (I think I'll close the commonmark-java issue as "won't fix". Can always reopen in case the spec changes.)
Oh interesting, there was just a ping on my 5-year-old #614, which is very much related.
The spec says this:
But note that commonmark.js doesn't actually follow that. Take &#150; for example. Decimal 150 is hex 96, which would map to U+0096, the SPA control character. However, it is converted to an en-dash (–); see dingus.
The reason for this is that commonmark.js uses the entities package, which implements the HTML spec. The HTML spec has a replacement table for certain characters, see parsing section:
(cmark behaves differently yet again: for certain code points, it uses a replacement character instead.)
Question: Should the spec prescribe the same mapping as the HTML spec? Or should it mention it and make it optional whether an implementation uses the replacement table or not?
(Originally raised here: commonmark/commonmark-java#307 (comment))
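To illustrate the difference in question, here is a small sketch comparing a literal decoding of the code point with an HTML-style decoding. The table below is only an excerpt of the HTML spec's replacement table (the Windows-1252 legacy mappings), and the names `decodeLiteral` and `decodeHtmlStyle` are hypothetical; this is not the code of commonmark.js, entities, or cmark.

```javascript
// Excerpt only: a few entries of the HTML numeric character reference
// replacement table. See the HTML spec for the full list.
const HTML_C1_REPLACEMENTS_EXCERPT = new Map([
  [0x80, 0x20ac], // EURO SIGN
  [0x85, 0x2026], // HORIZONTAL ELLIPSIS
  [0x91, 0x2018], // LEFT SINGLE QUOTATION MARK
  [0x92, 0x2019], // RIGHT SINGLE QUOTATION MARK
  [0x93, 0x201c], // LEFT DOUBLE QUOTATION MARK
  [0x94, 0x201d], // RIGHT DOUBLE QUOTATION MARK
  [0x95, 0x2022], // BULLET
  [0x96, 0x2013], // EN DASH
  [0x97, 0x2014], // EM DASH
  [0x99, 0x2122], // TRADE MARK SIGN
]);

// Literal reading: the reference denotes exactly the given code point.
function decodeLiteral(codePoint) {
  return String.fromCodePoint(codePoint);
}

// HTML-style reading: consult the replacement table first.
function decodeHtmlStyle(codePoint) {
  const mapped = HTML_C1_REPLACEMENTS_EXCERPT.get(codePoint);
  return String.fromCodePoint(mapped === undefined ? codePoint : mapped);
}

console.log(decodeLiteral(150).codePointAt(0).toString(16)); // 96
console.log(decodeHtmlStyle(150).codePointAt(0).toString(16)); // 2013
```

Making the table optional would mean both outputs are conforming for the same input, which is the trade-off the question is about.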