-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support read decimal data from csv reader if user provide the schema with decimal data type #941
Conversation
Codecov Report
@@ Coverage Diff @@
## master #941 +/- ##
==========================================
+ Coverage 82.36% 82.40% +0.03%
==========================================
Files 168 168
Lines 48587 48728 +141
==========================================
+ Hits 40020 40153 +133
- Misses 8567 8575 +8
Continue to review full report at Codecov.
|
@alamb PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @liukun4515 -- this is looking great.
I had a question on the handling of decimals when the input doesn't have the same number of values as the declared scale
arrow/src/csv/reader.rs
Outdated
@@ -513,6 +514,9 @@ fn parse( | |||
let field = &fields[i]; | |||
match field.data_type() { | |||
DataType::Boolean => build_boolean_array(line_number, rows, i), | |||
DataType::Decimal(p, v) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using DataType::Decimal(precision, scale)
rather than DataType::Decimal(p, v)
would be more consistent with the rest of the codebase
arrow/src/csv/reader.rs
Outdated
precision: usize, | ||
scale: usize, | ||
) -> Result<ArrayRef> { | ||
let mut decimal_builder = DecimalBuilder::new(line_number, precision, scale); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first argument to DecimalBuilder
is the capacity (aka the expected size of the array). So I think it would be more performant to use rows.size()
rather than line_number
here
https://docs.rs/arrow/6.1.0/arrow/array/struct.DecimalBuilder.html#method.new
arrow/src/csv/reader.rs
Outdated
Ok(Arc::new(decimal_builder.finish())) | ||
} | ||
|
||
// parse the string format decimal value to i128 format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised that we don't already have a &str
--> i128 parser, but I also could not find one.
arrow/src/csv/reader.rs
Outdated
// each byte is digit、'-' or '.' | ||
let bytes = s.as_bytes(); | ||
let mut negitive = false; | ||
let mut result: i128 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could probably use str::parse
here (after first removing all '.'
) rather than a custom parsing loop
For example:
let i: i128 = "123".parse().unwrap();
println!("i is {}", i);
prints "123"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a simple way to convert a string to i128, but we may meet some corner case.
For example, we want to convert the value to decimal(10,2).
The string value (12.0000000000000000000000000000000000....000) (too may 'zero' after the decimal), if we remove the '.' and convert the value to i128, we may meet the overflow error.
assert_eq!("53.002666", lat.value_as_string(1)); | ||
assert_eq!("52.412811", lat.value_as_string(2)); | ||
assert_eq!("51.481583", lat.value_as_string(3)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add some tests where the full precision is not supplied? For example, try parsing
57.65
(only 2 places after the decimal)
I think this implementation may end up producing a decimal of 0.005765
rather than 57.65000
Also, we should test
56.3838748284
(or something else that has more data after the decimal than the declared scale)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added other test cases to ensure right for above case.
thanks for your comments, i also have the same question. |
Hi @alamb , I have checked the mysql and pg database.
In the MYSQL:
From the two database, we can know that the number of value after the decimal must match the scale parameter. The second is about precision, if the input string value is out of bounds for the declared precision, what should we do to handle this? This is the result for PG
This is the result for mysql
In the PG and mysql, if the input string value is out of bounds for the precision, we will get some error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @liukun4515 --- this is now looking good to me. I had one suggestion for a test as well as some style comments, but I think we could also merge this PR as is and fix them as a follow on PR
let mut base = 1; | ||
|
||
// handle the value after the '.' and meet the scale | ||
let delimiter_position = s.find('.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what will happen if there are two '.'
in the string - like 123.456.789
( I would expect a parsing error in this case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this value 123.123.13
will get an error, because of the regex static ref PARSE_DECIMAL_RE: Regex = Regex::new(r"^-?(\d+\.?\d*|\d*\.?\d+)$").unwrap();
.
I will add more test case and change the code style in the next pull request.
@@ -1503,6 +1586,54 @@ mod tests { | |||
} | |||
} | |||
|
|||
#[test] | |||
fn test_parse_decimal_with_parameter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let result = parse_decimal_with_parameter(s, 20, 3); | ||
assert_eq!(i, result.unwrap()) | ||
} | ||
let can_not_parse_tests = ["123,123", "."]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend a test with two .
in it as well, such as 123.123.123
let mut negative = false; | ||
let mut result: i128 = 0; | ||
|
||
while offset > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that might make this code faster (and more idomatic rust) would be to avoid the bytes[offset - 1]
call (which does a bounds check).
It might look something like this (untested):
bytes[0..offset].iter().rev()
.try_for_each(|b| {
match b {
b'-' => negative = true,
b'.' => {},
...
}
})?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion!!!
I'm a rust learner, some rust code style are relatively new to me.
I think we can fix it in the next pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 yeah it took me a while to get into the Rust / functional style
@alamb maybe you merge this pr first. |
…with decimal data type (#941) * support decimal data type for csv reader * format code and fix lint check * fix the clippy error * enchance the parse csv to decimal and add more test
…with decimal data type (#941) (#974) * support decimal data type for csv reader * format code and fix lint check * fix the clippy error * enchance the parse csv to decimal and add more test Co-authored-by: Kun Liu <[email protected]>
Which issue does this PR close?
Closes #926
Rationale for this change
parse decimal data type
What changes are included in this PR?
Are there any user-facing changes?
No