(JRuby) Nokogiri munges unicode characters that require more than 2 bytes #1113
I am facing this problem as well under JRuby 1.7.13 with any UTF-8 character above plane 0. The example string contains a plane 0 character, a plane 1 character, and a plane 2 character.
Yields:
In MRI 2.1.1:
mkristian added a commit to mkristian/nokogiri that referenced this issue on Feb 21, 2015:
this is partial fix for sparklemotion#1113 to NOT use character entities when the encoding of the document can encode the data. Sponsored by Lookout Inc.
mkristian added a commit to mkristian/nokogiri that referenced this issue on Feb 21, 2015:
the new version of nekohtml brought a few regressions. this commit fixes but two error warning ones. it avoids to autocomplete the tbody tag around tr tags of a table. the check of unknown html did change upstream and got adjusted. fixes sparklemotion#1113 Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue on Dec 18, 2015:
this is partial fix for #1113 to NOT use character entities when the encoding of the document can encode the data. Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue on Dec 18, 2015:
the new version of nekohtml brought a few regressions. this commit fixes but two error warning ones. it avoids to autocomplete the tbody tag around tr tags of a table. the check of unknown html did change upstream and got adjusted. fixes #1113 Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue on Dec 20, 2015:
this is partial fix for #1113 to NOT use character entities when the encoding of the document can encode the data. Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue on Dec 20, 2015:
the new version of nekohtml brought a few regressions. this commit fixes but two error warning ones. it avoids to autocomplete the tbody tag around tr tags of a table. the check of unknown html did change upstream and got adjusted. fixes #1113 Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue on Jan 2, 2016:
this is partial fix for #1113 to NOT use character entities when the encoding of the document can encode the data. Sponsored by Lookout Inc.
This should be in v1.6.8 when it drops.
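The commit messages above describe the partial fix as not using character entities when the document's encoding can represent the data. A minimal sketch of that decision in plain Ruby is given here for illustration. `serialize_text` is a hypothetical helper, not Nokogiri's actual serializer, and it deliberately ignores escaping of `&`, `<`, and `>`.

```ruby
# Hypothetical helper for illustration only; this is NOT Nokogiri's code.
# Keeps a character as-is when the target encoding can represent it,
# and falls back to a numeric character reference otherwise.
def serialize_text(text, encoding_name)
  target = Encoding.find(encoding_name)
  text.each_char.map { |ch|
    begin
      ch.encode(target) # raises if the target encoding cannot represent ch
      ch                # representable: emit the raw character
    rescue Encoding::UndefinedConversionError
      format("&#x%X;", ch.ord) # not representable: emit an entity
    end
  }.join
end

emoji = "\u{1F600}" # a plane 1 character, chosen here as an example

puts serialize_text("a#{emoji}", "UTF-8")    # emoji kept as a raw character
puts serialize_text("a#{emoji}", "US-ASCII") # emoji replaced by an entity
```

Under this sketch, a UTF-8 document never needs an entity for an emoji or a plane 2 ideograph, since UTF-8 can encode every Unicode code point.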
Whether you use character entities or literal UTF-8 characters, somewhere along the line they get cast to 2-byte characters under JRuby. This affects emoji, Traditional Chinese, and characters in numerous other planes.
Yields on JRuby:
Under MRI, you get the correct values:
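For reference, the distinction between planes and encoded sizes mentioned in this report can be checked in plain Ruby, without Nokogiri. The sample characters below are assumptions chosen for illustration (one per plane); they are not the reporter's original test string, and the helper names are hypothetical.

```ruby
# Sample characters, one per Unicode plane. These are illustrative choices,
# not the string used in the original report.
SAMPLES = {
  "\u00E9"    => "plane 0 (BMP), e with acute accent",
  "\u{1F600}" => "plane 1, emoji",
  "\u{20000}" => "plane 2, CJK Extension B ideograph",
}

# The plane number is the code point divided by 0x10000.
def plane_of(char)
  char.ord >> 16
end

# Number of bytes the character occupies in UTF-8.
def utf8_bytes(char)
  char.encode("UTF-8").bytesize
end

# Number of 16-bit code units in UTF-16 (2 means a surrogate pair).
def utf16_units(char)
  char.encode("UTF-16BE").bytesize / 2
end

SAMPLES.each do |char, label|
  printf("U+%04X  %-36s plane=%d  utf8_bytes=%d  utf16_units=%d\n",
         char.ord, label, plane_of(char), utf8_bytes(char), utf16_units(char))
end
```

Characters above plane 0 take four bytes in UTF-8 and two code units in UTF-16, which makes them a useful test case for any code path that assumes one fixed-width unit per character.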