(JRuby) Nokogiri munges unicode characters that require more than 2 bytes #1113

Closed
BoboFraggins opened this issue May 30, 2014 · 2 comments

@BoboFraggins

If you use character entities or straight UTF-8 characters for code points above U+FFFF, somewhere along the line they get cast to 2-byte (16-bit) characters under JRuby. This affects emoji, some traditional Chinese characters, and anything else outside the Basic Multilingual Plane.

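require 'nokogiri'

doc = Nokogiri::HTML::Document.new

# With an ASCII document encoding, U+1F340 has to be serialized as an entity.
doc.encoding = 'US-ASCII'
puts doc.fragment('<p>&#x1f340;</p>')
puts doc.fragment('<p>&#127808;</p>')

# With UTF-8, the character can be written out directly.
doc.encoding = 'UTF-8'
puts doc.fragment('<p>&#x1f340;</p>')
puts doc.fragment('<p>&#127808;</p>')
puts doc.fragment('<p>🍀</p>')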

Yields on JRuby:

<p>&#xf340;</p>
<p>&#xf340;</p>
<p></p>
<p></p>
<p>&#xd83c;&#xdf40;</p>

Under MRI, you get the correct values:

<p>&#127808;</p>
<p>&#127808;</p>
<p>🍀</p>
<p>🍀</p>
<p>🍀</p>
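
For anyone trying to read the JRuby output above: the two wrong forms look like what you would get from handling U+1F340 as 16-bit values. This is only a sketch of the arithmetic in plain Ruby (no Nokogiri involved), not a statement about where in the JRuby code path it happens. The variable names are just for illustration.

```ruby
# U+1F340 FOUR LEAF CLOVER, the code point used in the report above.
codepoint = 0x1F340

# Keeping only the low 16 bits reproduces the mangled entity &#xf340;.
truncated = codepoint & 0xFFFF

# Encoding to UTF-16 yields the surrogate pair seen as &#xd83c;&#xdf40;.
surrogates = [codepoint].pack('U').encode('UTF-16BE').unpack('n*')

puts format('&#x%x;', truncated)
puts surrogates.map { |unit| format('&#x%x;', unit) }.join
```

The first line printed matches the US-ASCII results and the second matches the last UTF-8 result.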
@papile

papile commented Aug 25, 2014

I am facing this problem as well under JRuby 1.7.13 with any UTF-8 character above plane 0. The example string contains a plane 0 character, a plane 1 character, and a plane 2 character.

 require 'nokogiri'

 # One character each from plane 0, plane 1, and plane 2.
 xml = '<frag>ὡ 𐄣 𢂁</frag>'
 frag = Nokogiri::XML(xml, nil, 'UTF-8', Nokogiri::XML::ParseOptions::STRICT)
 puts xml.valid_encoding?
 puts xml
 puts frag.to_xml

Yields:

 true
 <frag>ὡ 𐄣 𢂁</frag>
 <?xml version="1.0" encoding="UTF-8"?>
 <frag>ὡ &#xd800;&#xdd23; &#xd848;&#xdc81;</frag>

In MRI 2.1.1:

 true
 <frag>ὡ 𐄣 𢂁</frag>
 <?xml version="1.0" encoding="UTF-8"?>
 <frag>ὡ 𐄣 𢂁</frag>
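
Until a fix is released, one possible stopgap is to post-process the serialized string and fold each surrogate-pair entity back into the real character. The sketch below is an assumption on my part, not part of Nokogiri: `fix_surrogate_entities` and `SURROGATE_PAIR` are hypothetical names, and it only handles the paired form shown in the JRuby output above. The truncated form (`&#xf340;`) has already lost its high bits and cannot be recovered this way.

```ruby
# Matches a high-surrogate entity (D800..DBFF) directly followed by a
# low-surrogate entity (DC00..DFFF), in hexadecimal entity notation.
SURROGATE_PAIR = /&#x(d[89ab][0-9a-f]{2});&#x(d[c-f][0-9a-f]{2});/i

# Hypothetical helper: replaces each surrogate-pair entity with the
# supplementary-plane character it stands for.
def fix_surrogate_entities(xml)
  xml.gsub(SURROGATE_PAIR) do
    high = Regexp.last_match(1).hex
    low  = Regexp.last_match(2).hex
    codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    [codepoint].pack('U')
  end
end

puts fix_surrogate_entities('<frag>&#xd800;&#xdd23; &#xd848;&#xdc81;</frag>')
```

Applied to the JRuby output above, this should turn the two entity pairs back into the plane 1 and plane 2 characters from the input.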

mkristian added a commit to mkristian/nokogiri that referenced this issue Feb 21, 2015
this is a partial fix for sparklemotion#1113 to NOT use character entities when the encoding
of the document can encode the data.

Sponsored by Lookout Inc.
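
The rule described in that commit message (only fall back to a character entity when the document encoding cannot represent the character) can be sketched in plain Ruby as below. This is an illustration of the idea only, not the code from the commit, which lives in Nokogiri's Java implementation; `escape_if_needed` is a hypothetical name.

```ruby
# Hypothetical helper: keeps each character as-is when the target encoding
# can represent it, and otherwise writes a decimal character entity built
# from the full code point (not a 16-bit code unit).
def escape_if_needed(text, encoding)
  text.each_char.map do |char|
    begin
      char.encode(encoding)
      char
    rescue Encoding::UndefinedConversionError
      format('&#%d;', char.ord)
    end
  end.join
end

puts escape_if_needed("<p>\u{1F340}</p>", 'US-ASCII')
puts escape_if_needed("<p>\u{1F340}</p>", 'UTF-8')
```

With US-ASCII this prints the `&#127808;` entity, and with UTF-8 it prints the character itself, which is the behaviour reported for MRI at the top of this issue.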
mkristian added a commit to mkristian/nokogiri that referenced this issue Feb 21, 2015
the new version of nekohtml brought a few regressions. this commit fixes
all but two error warning ones.

it avoids autocompleting the tbody tag around tr tags of a table. the check
for unknown html changed upstream and was adjusted.

fixes sparklemotion#1113

Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue Dec 18, 2015
this is a partial fix for #1113 to NOT use character entities when the encoding
of the document can encode the data.

Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue Dec 18, 2015
the new version of nekohtml brought a few regressions. this commit fixes
all but two error warning ones.

it avoids autocompleting the tbody tag around tr tags of a table. the check
for unknown html changed upstream and was adjusted.

fixes #1113

Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue Dec 20, 2015
this is a partial fix for #1113 to NOT use character entities when the encoding
of the document can encode the data.

Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue Dec 20, 2015
the new version of nekohtml brought a few regressions. this commit fixes
all but two error warning ones.

it avoids autocompleting the tbody tag around tr tags of a table. the check
for unknown html changed upstream and was adjusted.

fixes #1113

Sponsored by Lookout Inc.
jvshahid pushed a commit that referenced this issue Jan 2, 2016
this is a partial fix for #1113 to NOT use character entities when the encoding
of the document can encode the data.

Sponsored by Lookout Inc.
jvshahid added a commit that referenced this issue Jan 2, 2016
@flavorjones
Member

This should be in v1.6.8 when it drops.
