-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add stream projection API #171
Conversation
event buffers are fixed size but point into bigger buffers
This is insane, and I LOVE it. I'm currently pretty sick and would like to give this a much better review when my head clears in a few days. But I'm loving what I've seen so far! In a perfect world I would have moved yajl-ruby on top of the latest version of yajl itself before we did this, but the bundled version is already somewhat diverged from upstream anyway. yajl 2.0 brings features I added in the bundled version here (of yajl 1.x) with a couple of exceptions. It's been a while since I last looked at upgrading. But I certainly wouldn't consider it a blocker against this amazingness. I'll get back to this soon, I promise! <3 |
@brianmario Just wanted to check in - I have a project using this branch of yajl-ruby, and it'd be really useful to have this in master or (especially!) in a new tagged release! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a pass, things are looking pretty good to me! I'd love to possibly get a review from @tenderlove and/or @charliesome for the C api usage (in case there are any gotchas I'm unaware of, specifically around the usage of RB_GC_GUARD
).
Also, this is making me think about JSON Schema. Which, while a seemingly much heavier-weight, seemingly similar solution - I'm curious of the advantages of disadvantages to what we have here?
|
||
while (1) { | ||
if (parser->offset >= RSTRING_LEN(parser->buffer)) { | ||
//printf("reading offset %d size %ld\n", parser->offset, RSTRING_LEN(parser->buffer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind cleaning this up? :)
|
||
yajl_tok token; | ||
if (pop == 0) { | ||
//printf("peeking %p %ld %d\n", RSTRING_PTR(parser->buffer), RSTRING_LEN(parser->buffer), parser->offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
if (pop == 0) { | ||
//printf("peeking %p %ld %d\n", RSTRING_PTR(parser->buffer), RSTRING_LEN(parser->buffer), parser->offset); | ||
token = yajl_lex_peek(parser->lexer, (const unsigned char *)RSTRING_PTR(parser->buffer), RSTRING_LEN(parser->buffer), parser->offset); | ||
//printf("peeked event %d\n", token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
return event; | ||
} | ||
|
||
//printf("popping\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
|
||
//printf("popping\n"); | ||
token = yajl_lex_lex(parser->lexer, (const unsigned char *)RSTRING_PTR(parser->buffer), RSTRING_LEN(parser->buffer), &parser->offset, (const unsigned char **)&event.buf, &event.len); | ||
//printf("popped event %d\n", token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
# Given the start of an array or object, read until the closing event. | ||
# Object structures can nest and this is considered. | ||
# | ||
# Returns nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with using *
here instead of #
.
val = rb_float_new(strtod(buf, NULL)); | ||
} else { | ||
val = rb_cstr2inum(buf, 10); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure rb_float_new
and rb_cstr2inum
can't ever raise, otherwise we'll leak buf
.
} | ||
|
||
case yajl_tok_string_with_escapes:; { | ||
//printf("decoding string with escapes\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More cleanup plz :)
x.compare! | ||
} | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to the top-level benchmark dir? Just so we don't take up this time during the tests (not that it's a huge deal...)
] | ||
}.to_json | ||
|
||
puts json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this puts
here?
//printf("reading offset %d size %ld\n", parser->offset, RSTRING_LEN(parser->buffer)); | ||
|
||
// Refill the buffer | ||
rb_funcall(parser->stream, intern_io_read, 2, INT2FIX(RSTRING_LEN(parser->buffer)), parser->buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need an rb_protect
in case the read call raises an exception? I'm not sure how this function fits in to the rest of the program yet, but we might need to clean up memory or something if there's an error.
|
||
// nil schema means reify the subtree from here on | ||
// otherwise if the schema has a key for this we want it | ||
int interesting = (schema == Qnil || rb_funcall(schema, rb_intern("key?"), 1, key) == Qtrue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we support schemas that aren't hashes? If not, I think we should use rb_hash_has_key
here.
val = rb_yajl_projector_build_simple_value(parser, value_event); | ||
} | ||
|
||
rb_hash_aset(hsh, key, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to call rb_str_freeze
on the key here. Otherwise rb_hash_aset
will dup and freeze the key.
This is amazing! I think we can get better performance by doing the key freeze I mentioned earlier. Another idea for more performance: it seems we allocate a Ruby string even if we aren't interested in the key (we have to generate the string so we can call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I spotted a small leak that we should fix
.lexer = yajl_lex_alloc(&allocFuncs, 0, 1), | ||
}; | ||
|
||
VALUE result = rb_yajl_projector_filter_subtree(&parser, schema, yajl_event_stream_next(&parser, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rb_yajl_projector_filter_subtree
can raise an exception. In which case the lexer won't get freed. We should move this around a little to fix the leak.
Going to go ahead and merge this so we can iterate on it without needing to deal with forks. |
This ports the pure Ruby stream projection API I built previously http://github.com/keithduncan/json-projection to the yajl C API. Using the lexer we can project the stream of JSON tokens and only allocate the Ruby objects we actually care about. This is much more performant than the pure Ruby version.
Comparing performance
Taking a sample GitHub push event webhook
push.json
, the intended use case here's how it performs comparing speed and space.For my sample
push.json
which is 10kBYajl::Projector
is the fastest and allocates the least memory.Now taking
push-big.json
which is a webhook I artificially inflated not having one of our giant pushes to hand, real world performance still to be determined:I can’t profile
JsonProjection::Projector
for memory because the Ruby process dies 😱 meanwhileYajl::Projector
is 1.6x faster thanJSON.parse
and allocates 11.89x less memory.