Replaced character stack with string buffer #1676

andrew-aladev · 2017-09-28T21:28:48Z

I've found an inconsistency between the C and Java versions of the HTML and XML SAX parsers.

Consider this HTML:

<p>
  text 1
  <span>text 2</span>
  text 3
</p>

We can expect our SAX parser to receive the following events:

start_element "p"
characters "text 1"
start_element "span"
characters "text 2"
end_element "span"
characters "text 3"
end_element "p"
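
As a rough illustration of that expected order, here is a minimal sketch that uses the JDK's built-in SAX parser (not Nokogiri's handler) on the same markup, treated as XML. The class name `SaxOrderDemo`, the `RecordingHandler` helper and the trimming of whitespace are assumptions made for this example only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxOrderDemo {
    // Illustrative handler (not part of Nokogiri): records events in arrival order.
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        // Report pending text, if any, before the next element event.
        // Trimming is a simplification for this demo only.
        private void flushText() {
            String chunk = text.toString().trim();
            if (!chunk.isEmpty()) {
                events.add("characters \"" + chunk + "\"");
            }
            text.setLength(0);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            flushText();
            events.add("start_element \"" + qName + "\"");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            events.add("end_element \"" + qName + "\"");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may split one text run across several calls, so accumulate.
            text.append(ch, start, length);
        }
    }

    static List<String> parse(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<p>\n  text 1\n  <span>text 2</span>\n  text 3\n</p>";
        for (String event : parse(doc)) {
            System.out.println(event);
        }
    }
}
```

Running `main` should print the seven events in the same order as the list above.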

The C version works perfectly, but the Java version does not. I've found a characterStack in ext/java/nokogiri/internals/NokogiriHandler.java which is actually a bug. I couldn't understand the reason for using a stack for characters.

I thought about this stack for a long time and found only one possible reason for it: when the XML or HTML syntax is broken at the end of the document, we will still receive a list of character strings. But Nokogiri has entirely different methods of handling broken syntax, so I think it should not keep this stack in the code.

So I've removed the character stack and added a simple string buffer. The fix itself is very small.
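
To show why the buffering strategy affects ordering, here is a toy model in plain Java. It is not the code from NokogiriHandler.java: `StackSink` is only a guess at how a per-element stack of buffers can reorder text, and `BufferSink` is a simplified single-buffer variant. All names in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CharacterBufferingDemo {
    interface Sink {
        void startElement(String name);
        void characters(String chunk);
        void endElement(String name);
        List<String> events();
    }

    // Hypothetical stack-based model: one buffer per open element,
    // reported only when that element closes.
    static class StackSink implements Sink {
        private final List<String> events = new ArrayList<>();
        private final Deque<StringBuilder> stack = new ArrayDeque<>();

        public void startElement(String name) {
            stack.push(new StringBuilder());
            events.add("start_element " + name);
        }
        public void characters(String chunk) { stack.peek().append(chunk); }
        public void endElement(String name) {
            StringBuilder text = stack.pop();
            if (text.length() > 0) events.add("characters " + text);
            events.add("end_element " + name);
        }
        public List<String> events() { return events; }
    }

    // Single-buffer model: pending text is flushed at every element boundary.
    static class BufferSink implements Sink {
        private final List<String> events = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();

        private void flush() {
            if (buffer.length() > 0) events.add("characters " + buffer);
            buffer.setLength(0);
        }
        public void startElement(String name) { flush(); events.add("start_element " + name); }
        public void characters(String chunk) { buffer.append(chunk); }
        public void endElement(String name) { flush(); events.add("end_element " + name); }
        public List<String> events() { return events; }
    }

    // Feeds the event stream of the sample document into a sink.
    static List<String> feed(Sink sink) {
        sink.startElement("p");
        sink.characters("text 1");
        sink.startElement("span");
        sink.characters("text 2");
        sink.endElement("span");
        sink.characters("text 3");
        sink.endElement("p");
        return sink.events();
    }

    public static void main(String[] args) {
        System.out.println(feed(new StackSink()));
        System.out.println(feed(new BufferSink()));
    }
}
```

In this model the stack variant reports "text 1" and "text 3" together, after the span has closed, while the single buffer keeps text in document order.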

Then I've added a special document to the test helper in order to test the strict order of parsed items. I've implemented 2 tests with regular text and whitespace for the HTML and XML SAX parsers.

Text nodes are regular text and comments for the HTML parser. The XML test gets a bonus: CDATA blocks.

andrew-aladev · 2017-09-28T22:34:55Z

I will fix ruby-1.9.3 tomorrow.

flavorjones · 2017-09-29T04:13:46Z

Hi @andrew-aladev, thanks for submitting this. I love the additional test coverage!

@knu and I have asked a couple of the core contributors who know the jruby implementation a bit better than us to take a look.

And yes, if you avoid the %i syntax all tests will pass on jruby 1.7. I've increased PR test coverage in Concourse in commit be56b1e to include jruby 1.7 in response.

flavorjones · 2017-09-29T04:17:46Z

Also pinging @kares for an opinion.

kares · 2017-09-29T05:50:54Z

Looking good, just get CI 💚 and it's perfect :)
The stack was introduced because of ordering (if I recall right), but the tests verify it's preserved. One minor thing to consider would be whether we need StringBuffer's synchronization ... (or could use a StringBuilder).

andrew-aladev · 2017-09-29T10:08:49Z

I've replaced StringBuffer with StringBuilder. JRuby 1.7 and MRI Ruby 1.9.3 seem to work fine.

flavorjones · 2017-09-29T11:38:35Z

Merged! Thank you for your contribution!

andrew-aladev added 3 commits September 28, 2017 23:43

replaced character stack with string buffer, fixed characters population mechanism

ac7060c

added helper document that can test a strict order of items produced by parsers

91d8e3e

added tests for html and xml sax parsers that will verify the order of parsed text items

41c6faf

knu requested a review from yokolet September 29, 2017 00:54

flavorjones requested a review from jvshahid September 29, 2017 03:52

flavorjones added the platform/jruby label Sep 29, 2017

andrew-aladev added 2 commits September 29, 2017 10:07

replaced %i with regular array of symbols to fit ruby-1.9.3

66cc8ee

replaced StringBuffer with StringBuilder, because we don't need synchronization in handler

c3914e4

flavorjones merged commit 7eb8cf0 into sparklemotion:master Sep 29, 2017

flavorjones mentioned this pull request Jul 2, 2024

Java version of sax parser doesn't care about the order of text nodes #1561

Closed
