S3 object/bucket XML parsing issues 2.0.34 #764
@DanielRedOak I'm really interested in determining the cause and providing a fix. Would you be able to provide a complete XML document for the list object response that causes the failure? I would like to reproduce this locally, and am currently unable to. I'd like to get this patched ASAP.
Probably. I'll need to sanitize things first; I'll see about doing that shortly.
I was able to produce a failure myself. No need to send the output. No clue what the cause is yet, but I'll update as I find out more.
OK, was just about to ask for an email. Let me know if you need anything else!
I found the root cause of the issue. It appears that Nokogiri may send multiple text events to the SAX handler for a single element. For example, given the following XML element: <LastModified>2012-12-06T18:33:01.000Z</LastModified> It will typically trigger a single text event with the full value, but it can instead yield the value in multiple fragments. The fix is to have the stack frame collect all of the yielded text values and then have them parsed when the result is accessed, joining the parts first. This explains why other string values might be truncated (as only the last fragment would be returned). This appears to only affect the Nokogiri backend, but I'm going to test the others. I need to add a test case that triggers this and then post a fix. I hope to have this patched this afternoon. Thank you for reporting the issue!
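A minimal sketch of the buffering approach described above: a stack frame that collects every text event and joins the parts only when the value is read. The names here (`Frame`, `set_text`, `value`) are illustrative, not the SDK's actual API.

```ruby
# Hypothetical sketch: buffer text fragments from a SAX parser and
# join them lazily, so a value split across events is reassembled.
class Frame
  def initialize
    @text_parts = []
  end

  # A SAX backend may call this several times for one element.
  def set_text(value)
    @text_parts << value
  end

  # Join the buffered fragments only when the result is accessed.
  def value
    @text_parts.join
  end
end

frame = Frame.new
# Nokogiri may split "2012-12-06T18:33:01.000Z" across events:
frame.set_text('2012-12-06T18:33:01.000')
frame.set_text('Z')
frame.value # => "2012-12-06T18:33:01.000Z"
```

Joining lazily (rather than type-casting on each event) is what prevents the "only the last fragment survives" truncation described above.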
The fix was pretty straightforward. In the SAX parser, when an element had a text value and that element was a known scalar type (e.g. string, timestamp, integer, boolean, etc.), it was being type-cast in response to each text event. Now the text values are collected into a list and joined when the element's value is accessed. I added the ability to inject a parsing engine, and the tests now exercise a dummy parser engine that splits the sample text into individual characters, emitting each as its own text event. This initially broke all of the tests; once the related change was made, they went green again. I'm happy with the change and am inclined to ship this later today. Please feel free to share any feedback you have. Also, if you have a chance to check out the master branch to test the fix, that would be great.
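A sketch of the test strategy described above: a dummy engine that emits each character of the sample text as its own text event, which only passes if the handler reassembles fragments correctly. The class and method names here are hypothetical stand-ins for the SDK's internals.

```ruby
# Hypothetical handler that accumulates text events (as the fixed
# stack frame does) and joins them when the result is read.
class CollectingHandler
  def initialize
    @parts = []
  end

  def text(value)
    @parts << value
  end

  def result
    @parts.join
  end
end

# Hypothetical dummy engine: emits every character of the sample as a
# separate text event, mimicking a SAX backend that fragments text.
class CharSplitEngine
  def parse(sample, handler)
    sample.each_char { |ch| handler.text(ch) }
  end
end

handler = CollectingHandler.new
CharSplitEngine.new.parse('2012-12-06T18:33:01.000Z', handler)
handler.result # => "2012-12-06T18:33:01.000Z"
```

Injecting the engine makes this worst-case fragmentation reproducible in tests without depending on any particular XML backend's behavior.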
This is available now in version v2.0.36. |
Just tested again with a very large data set and it seemed to work correctly now. Thanks for knocking this one out quickly and with a detailed explanation! |
Seeing some bad parsing after upgrading from 2.0.33 to 2.0.34 when listing larger buckets. Reproduced with jruby 1.7.16.1 (1.9.3p392), which is what Logstash uses, where I am ultimately consuming this code.
Error:
Here is some debugging I've added to aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/stack.rb to show what it's receiving.
Output:
Added some debugging in aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser.rb to validate the XML. Everything looks correct there, so it's somewhere down the line within the parsing.
The error only surfaces when parsing the time fields, since there is validation on that field, but I see other inconsistencies as well: the key field seems to be chopped into multiple pieces, causing further issues.
Not sure what's going on at this point. It only seems to happen when listing a large bucket (which worked up until 2.0.34), and that is when the parsing fails.