Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double quotes URL-escaped in attribute when using JRuby #1382

Closed
donv opened this issue Nov 24, 2015 · 12 comments
Closed

Double quotes URL-escaped in attribute when using JRuby #1382

donv opened this issue Nov 24, 2015 · 12 comments

Comments

@donv
Copy link

donv commented Nov 24, 2015

Hi!

When using JRuby to parse, modify, and output an HTML document with double quotes in an attribute, the double quote is URL-escaped as %22. I believe this is wrong behaviour. The double quote should be left alone as it is when using MRI.

$ rvm use jruby-1.7.22
Using ~/.rvm/gems/jruby-1.7.22
$ irb
jruby-1.7.22 :001 > require 'nokogiri'
 => true 
jruby-1.7.22 :002 > puts Nokogiri::HTML(%q{<tag attr='"' />}).to_html
<html><head></head><body><tag attr="%22"></tag></body></html>
 => nil 
$ rvm use jruby-9.0.4.0
Using ~/.rvm/gems/jruby-9.0.4.0
$ irb
jruby-9.0.4.0 :001 > require 'nokogiri'
 => true 
jruby-9.0.4.0 :002 > puts Nokogiri::HTML(%q{<tag attr='"' />}).to_html
<html><head></head><body><tag attr="%22"></tag></body></html>
 => nil
$ rvm use 2.2.2
Using ~/.rvm/gems/ruby-2.2.2
$ irb
2.2.2 :001 > require 'nokogiri'
 => true 
2.2.2 :002 > puts Nokogiri::HTML(%q{<tag attr='"' />}).to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><tag attr='"'></tag></body></html>
 => nil
@flavorjones
Copy link
Member

@donv Thanks for asking this question. This may simply be a difference in how the parsers in Nokogiri work (Xerces on JRuby, libxml2 on MRI), and if so there's not much we can do.

Summoning @jvshahid to validate my thinking.

@donv
Copy link
Author

donv commented Nov 25, 2015

Hi @flavorjones !

Thanks for the reply!

As far as I can see this is not just a cosmetic difference in handling. The resulting document is changed, and is no longer equivalent. In my case it results in a broken web page.

If it is all handled by an upstream library, we should file an issue in their tracker. Which project would that be? Any easy way I can test that library without Nokogiri?

@jvshahid
Copy link
Member

Thanks @donv for reporting this issue. I'll have some time the next few days to take a look at this issue. Hopefully it's an easy fix.

@donv
Copy link
Author

donv commented Nov 25, 2015

Thanks @jvshahid ! Much appreciated!

@donv
Copy link
Author

donv commented Dec 1, 2015

Hi @jvshahid !

I tried a Xerces example using JRuby 9.0.4.0 without Nokogiri. It seems Xerces is included in JRuby somehow: https://gist.github.com/donv/ed8b296ccfc99f927f71

It does not URL-escape the attribute in any way. Maybe this helps. 😄

@donv
Copy link
Author

donv commented Dec 10, 2015

Hi @jvshahid !

Have you had time to look at this?

@jvshahid
Copy link
Member

Unfortunately I haven't. I am currently working on 1.6.8 issues and prs (in particular #1251). I'll take a look at this issue once all 1.6.8 issues are fixed. Thanks for your patience.

@donv
Copy link
Author

donv commented Jan 2, 2016

Hi @flavorjones @jvshahid !

Happy new year! How is the 1.6.8 release coming along? Going well, I hope. Hang in there!

@donv
Copy link
Author

donv commented Jun 5, 2016

Hi @flavorjones @jvshahid !

How are you doing? I am just checking in to say that this issue is still a problem. I see 1.6.8 has had some progress. I wish you a happy release! 😄

@donv
Copy link
Author

donv commented Jul 20, 2016

Hi @flavorjones @jvshahid !

Have you had time to look at this issue? My PR succeeds for JRuby, but fails for MRI. That is weird since the manual test in IRB works with all versions of MRI.

@flavorjones
Copy link
Member

Hi @donv,

Apologies for the lack of response. I think we both deprioritized looking at this because we don't totally understand the problem you're trying to solve, nor why you believe this is a problem.

The three representations of a quote (", &quot;, and %22) are all equivalent. Why is it important to you that you get a specific one of the three in your output? I can't find in the XML Spec much about how these characters should be rendered.

Have you looked at Node#canonicalize, though? Canonicalization ("C14N") is intended to solve this exact problem, and in fact this code:

puts Nokogiri::VERSION_INFO.to_yaml
node = Nokogiri::HTML(%q{<tag attr='"%22&quot;' />}).at_css("tag")
puts node.to_html
puts node.canonicalize

renders the attribute identically on both JRuby (xerces) and MRI (libxml2):

---
warnings: []
nokogiri: 1.7.0.1
ruby:
  version: 2.4.0
  platform: x86_64-linux
  description: ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4"
  libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.4
  loaded: 2.9.4
<tag attr='"%22"'></tag>
<tag attr="&quot;%22&quot;"></tag>

and

---
warnings: []
nokogiri: 1.7.0
ruby:
  version: 2.3.1
  platform: java
  description: jruby 9.1.6.0 (2.3.1) 2016-11-09 0150a76 OpenJDK 64-Bit Server VM 25.111-b14 on 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14 [linux-x86_64]
  engine: jruby
  jruby: 9.1.6.0
xerces: Xerces-J 2.11.0
nekohtml: NekoHTML 1.9.21
<tag attr="%22%22%22"></tag>
<tag attr="&quot;%22&quot;"></tag>

I hope this helps?

We historically have not accepted PRs that try to work around these smaller differences in how xerces and libxml2 format their output. Better to let the parser be the parser, in our opinion. There are simply too many differences in output to take care of (e.g., "should an html doc always have a head node?", "how should whitespace be handled?"), and so we've identified EXACTLY equal output as an anti-goal.

But again, try #canonicalize and let us know if that helps you get where you need to be.

@donv
Copy link
Author

donv commented Jan 15, 2017

Thanks. Using #canonicalize seems to solve my problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants