-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ability to parse float from number with leading decimal #831
Conversation
Codecov Report
@@ Coverage Diff @@
## master #831 +/- ##
==========================================
- Coverage 82.45% 82.31% -0.15%
==========================================
Files 168 168
Lines 48175 48092 -83
==========================================
- Hits 39723 39587 -136
- Misses 8452 8505 +53
Continue to review full report at Codecov.
|
@@ -1341,6 +1341,7 @@ mod tests { | |||
assert_eq!(infer_field_schema("\"123\""), DataType::Utf8); | |||
assert_eq!(infer_field_schema("10"), DataType::Int64); | |||
assert_eq!(infer_field_schema("10.2"), DataType::Float64); | |||
assert_eq!(infer_field_schema(".2"), DataType::Float64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also try to cover cases like '2.'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, your suggesting that should also be inferred and parsed as a valid float as well right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brianrackle can you please rebase this PR (it should then pass CI) and let us know if you will be able to add the test that @jimexist suggests?
I am starting to prepare for the 6.1 release (I plan to build the release candidate late next week) and would like to get this in if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a look at the JSON parser in more detail and running some tests to see how the JSON parser deals with a case like '2.'. Reason being is that I think '2.' is a little less clear how to handle because that value can be represented by an Int, so the question is does the presence of a decimal automatically infer Float or should there be some laziness to upgrading from Int to Float and only done when necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it isn't clear what to do in this case, filed #929 to track this as a follow on work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brackle-lattice - here is a PR to your branch to pickup changes from master as well as add a test showing what happens with 2.
(it is parsed as Utf8
). brianrackle#1 (github shows the full diff which is somewhat silly, as it is a merge commit and then a 2 line change). I think what that change this PR will be ready to go
If you don't have time to merge it I can also create a new PR with the changes too to get it into arrow-rs
The tests were failing with
I have triggered the tests to re-run so hopefully they will pass on the next run |
…org/core/usize/constant.MAX.html and making consistent with other usages
Filed #838 for the CI failures |
filed #838 for nightly test failures (there is a PR up so hopefully once that is merged, we can rebase this branch and have it pass too) |
It looks like workflows are still awaiting approval |
Workflows started |
The integration test failure is unrelated. @brianrackle can you clarify your plans for #831 (comment) ? |
@@ -1341,6 +1341,7 @@ mod tests { | |||
assert_eq!(infer_field_schema("\"123\""), DataType::Utf8); | |||
assert_eq!(infer_field_schema("10"), DataType::Int64); | |||
assert_eq!(infer_field_schema("10.2"), DataType::Float64); | |||
assert_eq!(infer_field_schema(".2"), DataType::Float64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brackle-lattice - here is a PR to your branch to pickup changes from master as well as add a test showing what happens with 2.
(it is parsed as Utf8
). brianrackle#1 (github shows the full diff which is somewhat silly, as it is a merge commit and then a 2 line change). I think what that change this PR will be ready to go
If you don't have time to merge it I can also create a new PR with the changes too to get it into arrow-rs
Update for JSON schema inference
@alamb thanks, I believe I have the PR ready. Thanks for helping me get this change in. I will follow up on the other ticket with more info when I get a chance to look into it. |
* Adding ability to parse float from number with leading decimal * Fixing deprecated std::usize::MAX constant per https://doc.rust-lang.org/core/usize/constant.MAX.html and making consistent with other usages * Add test case for 2. and issue link Co-authored-by: Andrew Lamb <[email protected]>
…#962) * Adding ability to parse float from number with leading decimal * Fixing deprecated std::usize::MAX constant per https://doc.rust-lang.org/core/usize/constant.MAX.html and making consistent with other usages * Add test case for 2. and issue link Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Brian Rackle <[email protected]>
Which issue does this PR close?
Closes #830
Rationale for this change
The issue this fixes results in real-world data sets being incorrectly inferred as strings when they are floats.
What changes are included in this PR?
Modifying the regular expression to make the digit proceeding the decimal optional
Are there any user-facing changes?
CSV inferred schemas which include floating-point data doesn't need a proceeding zero for floats between 0 and 1