-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support String Coercion in Raw JSON Reader #3736
Conversation
I'm not really happy with the Loose configuration I implemented because the string conversion to bool doesn't make much sense. And bool doesn't follow javascript either, where it considers an empty list to be false. Maybe a best implementation for this is to make bools and numbers as they used to be, and consider the loose behaviour to just parse everything to string, because in that case we don't lose any information. The good thing about this implementation is that if the end-user wants to get their hands dirty, they can implement their custom converter and get it tuned to their needs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written I don't think TapeConverter
can be implemented outside the crate as Tape
is not public. I happen to think this is a good thing as I would rather what are effectively implementation details not be exposed outside of the crate.
I wonder if we could rework the generics slightly so that TapeConverter
is a type-erased implementation detail of the various ArrayDecoder
, instead of plumbing it the whole way through?
I do also wonder if we really need generics for this, or if a simple strict
flag is all that is needed....
arrow-json/src/raw/tape.rs
Outdated
@@ -87,14 +92,15 @@ impl Display for TapeElement { | |||
/// | |||
/// [simdjson]: https://github.com/simdjson/simdjson/blob/master/doc/tape.md | |||
#[derive(Debug)] | |||
pub struct Tape<'a> { | |||
pub struct Tape<'a, C: TapeConverter> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make for a cleaner interface if the converter were instead given the tape instead of Tape
being parameterised on TapeConverter
. This would allow ArrayDecoder
and by extension RawDecoder
/ RawReader
to not need a generic
arrow-json/src/raw/converter.rs
Outdated
TapeElement::True => Ok("true".into()), | ||
TapeElement::False => Ok("false".into()), | ||
TapeElement::Number(idx) => Ok(tape.get_string(idx).into()), | ||
TapeElement::String(idx) => Ok(format!("\"{}\"", tape.get_string(idx)).into()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this will correctly escape newlines or quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Do you think using serde_json is a good approach to solve this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be completely honest I struggle to understand the use-case, the performance will be bad and it seems like a rather strange API... This sort of schema coercion I'm not sure really belongs in the reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case is that we don't always control the json format we're reading from, but we can control the Schema.
And it's the behaviour given by infer_json_schema:
let schema = Schema::new(vec![
Field::new("c1", DataType::Utf8, true),
Field::new("c2", DataType::Utf8, true),
]);
let inferred_schema = infer_json_schema_from_iterator(
vec![
Ok(serde_json::json!({"c1": 1, "c2": 1})),
Ok(serde_json::json!({"c1": "a", "c2": 0})),
Ok(serde_json::json!({"c1": true, "c2": "ok"})),
]
.into_iter(),
)
.unwrap();
assert_eq!(inferred_schema, schema);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's the behaviour given by infer_json_schema:
infer_json_schema coerces primitives, I'm not aware of it coercing objects or lists to strings? If it is, that is not intentional
The use case is that we don't always control the json format we're reading from, but we can control the Schema.
Does each file have a consistent schema though, i.e. does a given column have a mix of objects and primitives in the same file? The assumption of arrow typically is that schema evolution occurs at the file granularity, and can therefore be handled at a higher level...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, no, it's just for primitives, and only to strings, if I'm not mistaken. Let me see what it coerces between number and bool.
I probably just got creative there and asked why not? But thinking about it, what I really need is to be able to read json with the schema that was inferred before, and use it to save to Parquet.
Maybe we don't even need a flag for it, shouldn't this be the default behaviour since infer_json_schema does it by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, confirmed that if you mismatch numbers and bools, it will also consider them to be strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since infer_json_schema does it by default
I think having an option that disables strict enforcement along the lines of the schema inference makes sense 👍, I do think it should still be optional as some workflows care about schema consistency and would rather fail loud 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I'll keep it very simple. Just a bool in a struct for now.
arrow-json/src/raw/converter.rs
Outdated
} | ||
} | ||
|
||
fn obj_to_json_string<'a, C: TapeConverter>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think this behaviour is a little bit strange, I think it should be possible to opt-in to coercing primitives, e.g. string <-> bool, separately from opting into this potentially more surprising behaviour?
I was thinking about this when implementing too. I tried to make it a Box before but it didn't like that to_primitive is generic. So it couldn't be object safe. I was going to offer another PR with just the flag. Instead I'll rework this one and squash everything into one commit at the end. I agree with you that leaking implementation details makes it harder to evolve without breaking changes. I can create a few different flags that the user can opt-in:
Then the user can provide these flags using an array, and by default none of them are enabled. I'll also make sure it can parse \n and " from a string, maybe the best option is to use serde_json to do that. |
I would recommend a struct of booleans, as opposed to an array, given these will need to be checked on the "hot path". I would also recommend keeping it simple, with a single |
77213bd
to
59cd225
Compare
@tustvold, thank you so much for all the guidance. I have updated this PR with the suggested changes. |
…sense than the Tape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you. I will double-check the benchmarks tomorrow prior to getting this in
I've confirmed this does not seem to have a discernible impact on the benchmarks 🎉 Thank you for sticking with this |
Benchmark runs are scheduled for baseline = 25da74c and contender = e33dbe2. e33dbe2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
None. This is a feature request.
Rationale for this change
infer_json_schema
coerces primitive values into string, and it couldn't work when reading the same json with the given schema.What changes are included in this PR?
Add flag to make
RawReader
behave the same asinfer_json_schema
.Are there any user-facing changes?
Yes, the user can opt-in a
coerce_primitive
flag.