Infer timestamps from CSV files #3209
Minor nit to remove unnecessary capture groups
arrow-csv/src/reader.rs
Outdated
  static ref DATETIME_RE: Regex =
- Regex::new(r"^\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d$").unwrap();
+ Regex::new(r"^\d{4}-\d\d-\d\d(T|\s)\d\d:\d\d:\d\d(.\d{1,9})?$").unwrap();
- Regex::new(r"^\d{4}-\d\d-\d\d(T|\s)\d\d:\d\d:\d\d(.\d{1,9})?$").unwrap();
+ Regex::new(r"^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(.\d{1,9})?$").unwrap();
Or at the very least a non-capturing group. I also think it should probably be a literal space instead of \s, as I don't think things like \n or \t would parse correctly.
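To make the separator point concrete, here is a dependency-free sketch that mirrors the two patterns by hand rather than calling the regex crate. The names has_datetime_shape, strict_sep and loose_sep are hypothetical and not part of arrow-csv; the check accepts ASCII digits only and ignores the optional fractional seconds, both of which are simplifications.

```rust
/// Hypothetical helper (not part of arrow-csv): returns true when `s` has the
/// shape `YYYY-MM-DD<sep>HH:MM:SS`, with `sep_ok` deciding which separator
/// characters are accepted between the date and the time.
fn has_datetime_shape(s: &str, sep_ok: fn(char) -> bool) -> bool {
    let chars: Vec<char> = s.chars().collect();
    // 10 date characters + 1 separator + 8 time characters.
    if chars.len() != 19 {
        return false;
    }
    chars.iter().enumerate().all(|(i, &c)| match i {
        4 | 7 => c == '-',
        10 => sep_ok(c),
        13 | 16 => c == ':',
        // Simplification: ASCII digits only.
        _ => c.is_ascii_digit(),
    })
}

/// Mirrors the suggested `[T ]`: a literal `T` or a single space.
fn strict_sep(c: char) -> bool {
    c == 'T' || c == ' '
}

/// Mirrors `(T|\s)`: `T` or any whitespace, including tab and newline.
fn loose_sep(c: char) -> bool {
    c == 'T' || c.is_whitespace()
}
```

Under these assumptions, has_datetime_shape("2020-03-19\t00:00:00", loose_sep) is true while the same call with strict_sep is false, so the looser separator would let a tab-separated value be inferred as a datetime even though the downstream parser may not accept it.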
  } else if datetime_re.is_match(string) {
  DataType::Date64
- } else if DATE_RE.is_match(string) {
+ } else if DATE32_RE.is_match(string) {
FWIW I created #3211 to track using a RegexSet here, as the current code is rather wasteful. Perhaps something for a follow-on PR?
Ah didn't know about RegexSet, I can give it a shot in this PR a bit later
Edit: scratch that, might as well tackle as a separate PR instead
arrow-csv/src/reader.rs
Outdated
- static ref DATE_RE: Regex = Regex::new(r"^\d{4}-\d\d-\d\d$").unwrap();
+ static ref DATE32_RE: Regex = Regex::new(r"^\d{4}-\d\d-\d\d$").unwrap();
+ static ref DATE64_RE: Regex =
+ Regex::new(r"^\d{4}-\d\d-\d\d(T|\s)\d\d:\d\d:\d\d$").unwrap();
- Regex::new(r"^\d{4}-\d\d-\d\d(T|\s)\d\d:\d\d:\d\d$").unwrap();
+ Regex::new(r"^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d$").unwrap();
Benchmark runs are scheduled for baseline = 5d84746 and contender = 64b466e. 64b466e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #1060
Rationale for this change
Like PyArrow, be able to infer timestamps from CSV files (with a bit more flexibility in the default format used when inferring from values).
What changes are included in this PR?
Are there any user-facing changes?