Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced character stack with string buffer #1676

Conversation

andrew-aladev
Copy link
Contributor

I've found an inconsistency between c and java version of html and xml sax parsers.

Let it be an html:

<p>
  text 1
  <span>text 2</span>
  text 3
</p>

We can expect that our sax parser will receive the following information:

start_element "p"
characters "text 1"
start_element "span"
characters "text 2"
end_element "span"
characters "text 3"
end_element "p"

c version works perfect but java does not. I've found a characterStack in ext/java/nokogiri/internals/NokogiriHandler.java that is actually a bug. I couldn't understand the reason of using stack for characters.

I was thinking for a long time about this stack and I've found the only reason for it: we will receive a list of character strings anyway when xml or html syntax is broken at the end of the document. But nokogiri has absolutely another methods of handling broken syntax. I think nokogiri should not keep this stack in the code.

So I've removed characters stack and added simple string buffer. The fix itself is very small.

Than I've added a special document to the test helper in order to test the strict order of parsed items. I've implemented 2 tests with regular text and whitespace for html and xml sax parsers.

Text nodes are regular text and comments for html parser. Xml test received a bonus: cdata blocks.

@andrew-aladev
Copy link
Contributor Author

I will fix ruby-1.9.3 tomorrow.

@knu knu requested a review from yokolet September 29, 2017 00:54
@flavorjones
Copy link
Member

Hi @andrew-aladev, thanks for submitting this. I love the additional test coverage!

@knu and I have asked a couple of the core contributors who know the jruby implementation a bit better than us to take a look.

And yes, if you avoid the %i syntax all tests will pass on jruby 1.7. I've increased PR test coverage in Concourse in commit be56b1e to include jruby 1.7 in response.

@flavorjones
Copy link
Member

Also pinging @kares for an opinion.

@kares
Copy link
Contributor

kares commented Sep 29, 2017

looking good, just get CI 💚 and its perfect :)
the stack was introduced due order (if I recall right) but tests verify its preserved, one minor thing to consider would be whether we need StringBuffer's synchronization ... (or could use a StringBuilder)

@andrew-aladev
Copy link
Contributor Author

I've replaced StringBuffer with StringBuilder. JRuby 1.7 and MRI ruby 1.9.3 seems to work good.

@flavorjones flavorjones merged commit 7eb8cf0 into sparklemotion:master Sep 29, 2017
@flavorjones
Copy link
Member

Merged! Thank you for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants