Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to parse float from number with leading decimal #831

Merged
merged 6 commits into from
Nov 9, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions arrow/src/csv/reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ use crate::record_batch::RecordBatch;
use csv_crate::{ByteRecord, StringRecord};

lazy_static! {
static ref DECIMAL_RE: Regex = Regex::new(r"^-?(\d+\.\d+)$").unwrap();
static ref DECIMAL_RE: Regex = Regex::new(r"^-?(\d*\.\d+)$").unwrap();
static ref INTEGER_RE: Regex = Regex::new(r"^-?(\d+)$").unwrap();
static ref BOOLEAN_RE: Regex = RegexBuilder::new(r"^(true)$|^(false)$")
.case_insensitive(true)
Expand Down Expand Up @@ -271,7 +271,7 @@ pub fn infer_schema_from_files(
has_header: bool,
) -> Result<Schema> {
let mut schemas = vec![];
let mut records_to_read = max_read_records.unwrap_or(std::usize::MAX);
let mut records_to_read = max_read_records.unwrap_or(usize::MAX);

for fname in files.iter() {
let (schema, records_read) = infer_file_schema(
Expand Down Expand Up @@ -1342,6 +1342,9 @@ mod tests {
assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
assert_eq!(infer_field_schema("10"), DataType::Int64);
assert_eq!(infer_field_schema("10.2"), DataType::Float64);
assert_eq!(infer_field_schema(".2"), DataType::Float64);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also try to cover cases like '2.'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, your suggesting that should also be inferred and parsed as a valid float as well right?

Copy link
Contributor

@alamb alamb Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brianrackle can you please rebase this PR (it should then pass CI) and let us know if you will be able to add the test that @jimexist suggests?

I am starting to prepare for the 6.1 release (I plan to build the release candidate late next week) and would like to get this in if possible

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look at the JSON parser in more detail and running some tests to see how the JSON parser deals with a case like '2.'. Reason being is that I think '2.' is a little less clear how to handle because that value can be represented by an Int, so the question is does the presence of a decimal automatically infer Float or should there be some laziness to upgrading from Int to Float and only done when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it isn't clear what to do in this case, filed #929 to track this as a follow on work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brackle-lattice - here is a PR to your branch to pickup changes from master as well as add a test showing what happens with 2. (it is parsed as Utf8). brianrackle#1 (github shows the full diff which is somewhat silly, as it is a merge commit and then a 2 line change). I think what that change this PR will be ready to go

If you don't have time to merge it I can also create a new PR with the changes too to get it into arrow-rs

// Should be parsed as Float or Int. See https://github.com/apache/arrow-rs/issues/929
assert_eq!(infer_field_schema("2."), DataType::Utf8);
assert_eq!(infer_field_schema("true"), DataType::Boolean);
assert_eq!(infer_field_schema("false"), DataType::Boolean);
assert_eq!(infer_field_schema("2020-11-08"), DataType::Date32);
Expand Down