(XML) Unicode letters should be recognized in element and attribute names #3256

Closed
martin-honnen opened this issue Jun 24, 2021 · 6 comments · Fixed by #3529
Labels
bug, help welcome (could use help from community), language

Comments

@martin-honnen
Contributor

Describe the issue
XML allows Unicode letters in element and attribute names, but the xml.js language mode uses a regular expression that only checks for the ASCII letters A-Z.
As a result, XML with non-ASCII letters in element or attribute names is not highlighted. For example, in <categoría>producto</categoría> the Spanish word categoría is a well-formed XML element name, yet it is not recognized as such by the regular expression const TAG_NAME_RE = regex.concat(/[A-Z_]/, regex.optional(/[A-Z0-9_.-]*:/), /[A-Z0-9_.-]*/); in https://github.com/highlightjs/highlight.js/blob/main/src/languages/xml.js#L12

Which language seems to have the issue?
XML from https://github.com/highlightjs/highlight.js/blob/main/src/languages/xml.js

Are you using highlight or highlightAuto?

highlight

Sample Code to Reproduce

console.log(hljs.highlight(`
<root>
  <categoría>test</categoría>
  <category>test</category>
</root>`, {language: 'xml'}).value)

Expected behavior

The output for the XML markup <categoría>test</categoría> currently is &lt;categoría&gt;test&lt;/categoría&gt; while it should be <span class="hljs-tag">&lt;<span class="hljs-name">categoría</span>&gt;</span>test<span class="hljs-tag">&lt;/<span class="hljs-name">categoría</span>&gt;</span>.

Additional context

The XML spec defines the allowed characters at https://www.w3.org/TR/xml/#NT-NameStartChar and https://www.w3.org/TR/xml/#NT-NameChar. I think it should be possible to fix the regular expressions used in xml.js, either by using the character ranges given in the XML spec or, if the Unicode support in JavaScript regular expressions is used, by using e.g. \p{Letter} instead of A-Z.

@martin-honnen added the bug, help welcome, and language labels Jun 24, 2021
@martin-honnen
Contributor Author

I am not sure whether the whole range of Unicode letters, or the exact range used in the XML spec, is expressible in the four-digit \uDDDD format that JavaScript supports, but a much broader range than the current ASCII A-Z could be implemented with e.g.

nameStartChar = /[A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]/g

I think.
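
One way to sanity-check ranges like the ones above is to anchor the character class and test a few names against it. The following is only a sketch under stated assumptions, not the code in xml.js: the names `NAME_START`, `NAME_CHAR`, `NAME_RE`, and `isXmlName` are hypothetical, the extra NameChar ranges are copied from the XML spec productions linked earlier, and both the colon and the supplementary range #x10000-#xEFFFF are deliberately left out.

```javascript
// BMP-only approximation of the XML NameStartChar production.
// Assumption: ":" and #x10000-#xEFFFF are omitted, since the latter
// cannot be written as a single four-digit \uDDDD escape.
const NAME_START =
  'A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF\\u0370-\\u037D' +
  '\\u037F-\\u1FFF\\u200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF' +
  '\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD';

// NameChar additionally allows "-", ".", digits, U+00B7,
// U+0300-U+036F and U+203F-U+2040.
const NAME_CHAR = NAME_START + '\\-.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040';

// One start character followed by any number of name characters.
const NAME_RE = new RegExp('^[' + NAME_START + '][' + NAME_CHAR + ']*$');

// Hypothetical helper: true if the whole string looks like an XML name.
function isXmlName(name) {
  return NAME_RE.test(name);
}
```

With this sketch, `isXmlName('categoría')` and `isXmlName('category')` are both true, while `isXmlName('1abc')` is false because a digit is not allowed as the first character.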

@joshgoebel
Member

Could we just use \p{L} for this now instead of that huge list?

https://stackoverflow.com/a/51413159/12430243
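
For illustration, a minimal sketch of what the \p{L} approach could look like. It uses plain RegExp literals rather than the regex.concat helpers from xml.js, and the names `ASCII_TAG_NAME` and `UNICODE_TAG_NAME` are hypothetical. Two caveats: Unicode property escapes only work when the pattern carries the `u` flag, and \p{L} is not the same set as the XML NameStartChar production, so this approximates the spec rather than matching it exactly.

```javascript
// ASCII-only shape, modelled on the TAG_NAME_RE quoted in the issue.
// Assumption: anchors and the "i" flag are added just for this demo.
const ASCII_TAG_NAME = /^[A-Z_](?:[A-Z0-9_.-]*:)?[A-Z0-9_.-]*$/i;

// Same shape with Unicode letters in place of A-Z.
// \p{L} requires the "u" flag; without it the escape is not recognized.
const UNICODE_TAG_NAME = /^[\p{L}_](?:[\p{L}0-9_.-]*:)?[\p{L}0-9_.-]*$/u;

// Hypothetical sample names: plain, prefixed, and non-ASCII.
const samples = ['category', 'xsl:template', 'categoría'];

for (const name of samples) {
  console.log(name, ASCII_TAG_NAME.test(name), UNICODE_TAG_NAME.test(name));
}
```

Both patterns accept `category` and `xsl:template`; only the \p{L} variant accepts `categoría`.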

@joshgoebel
Member

If so, perhaps let's reopen #3257 with that approach instead...

@martin-honnen
Contributor Author

I will hopefully be able to look into this next week.

@joshgoebel
Member

Ping. Still interested in pursuing this?

@martin-honnen
Contributor Author

I will give it a try during this week (i.e. until Sunday, May 1st).
