parquet-read: add support to read parquet data from stdin #2482
let parquet_reader =
    SerializedFileReader::new(file).expect("Failed to create reader");
let parquet_reader: Box<dyn FileReader> = if filename == "-" {
    let mut buf = Vec::new();
It may surprise people who aren't familiar with parquet that this will buffer the entire file in memory, as opposed to streaming it. Perhaps we could add a note to the help text to alert people?
As @tustvold suggests, I think it is inevitable that reading from stdin requires buffering (because the parquet footer is at the end of the file and we need to get past all the data to get to the footer).
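To make the buffering point concrete, here is a minimal sketch using only the standard library. It assumes the standard Parquet layout, in which a file ends with a 4-byte little-endian footer length followed by the `PAR1` magic. `read_all` and `footer_range` are hypothetical helper names for illustration, not functions from this PR or from the parquet crate.

```rust
use std::io::{self, Read};
use std::ops::Range;

/// Magic bytes that open and close every Parquet file.
const MAGIC: &[u8] = b"PAR1";

/// Drain a non-seekable stream (such as stdin) fully into memory.
/// Hypothetical helper: this is the buffering step discussed above.
fn read_all<R: Read>(mut input: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    input.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Locate the footer metadata bytes inside a fully buffered file.
/// Returns `None` if the buffer is too short, lacks the trailing magic,
/// or declares a footer that does not fit between the two magic markers.
fn footer_range(buf: &[u8]) -> Option<Range<usize>> {
    // Smallest possible file: leading magic + 4-byte length + trailing magic.
    if buf.len() < 12 {
        return None;
    }
    let (rest, magic) = buf.split_at(buf.len() - 4);
    if magic != MAGIC {
        return None;
    }
    let (body, len_bytes) = rest.split_at(rest.len() - 4);
    let mut len = [0u8; 4];
    len.copy_from_slice(len_bytes);
    let footer_len = u32::from_le_bytes(len) as usize;
    let start = body.len().checked_sub(footer_len)?;
    // The footer must not overlap the leading magic.
    if start < 4 {
        return None;
    }
    Some(start..body.len())
}
```

Because the footer offset is computed from the last eight bytes, nothing can be decoded until the stream has been consumed to the end. Calling `read_all(std::io::stdin().lock())` and then `footer_range(&buf)` would mirror the order of operations a stdin-based reader has to follow.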
Thanks @nvartolomei
I'm going to merge this as is so that this can make the release
Benchmark runs are scheduled for baseline = a835ba0 and contender = e60eef3. e60eef3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Rationale for this change
Reading from stdin makes it handy to debug parquet output without having to write it to disk first.