-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rust] Added support for creating empty dataset #1016
Conversation
@@ -1070,8 +1071,9 @@ mod tests { | |||
let test_dir = tempdir().unwrap(); | |||
let test_uri = test_dir.path().to_str().unwrap(); | |||
let mut reader: Box<dyn RecordBatchReader> = Box::new(RecordBatchBuffer::empty()); | |||
let result = Dataset::write(&mut reader, test_uri, None).await; | |||
assert!(matches!(result.unwrap_err(), Error::EmptyDataset { .. })); | |||
let result = Dataset::write(&mut reader, test_uri, None).await.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also test that count_rows()
and to_batch
, and add new rows
work?
Also make sure they work in Python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Lei,
I've tested count_rows() and adding new rows. I couldn't find anything regarding to_batch(). Could you please point me in the right direction regarding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, maybe just try to_table()
should be sufficient.
I was asking that the Dataset scan still works (i.e., returning empty table)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. I've added a test in python package where we check to_table() on empty dataset see the data is empty but schema is in place, then append a new record and verify the new state of the new dataset as well.
rust/src/dataset.rs
Outdated
@@ -309,7 +309,8 @@ impl Dataset { | |||
return Err(Error::from(batch.as_ref().unwrap_err())); | |||
} | |||
} else { | |||
return Err(Error::EmptyDataset); | |||
warn!("Dataset is empty, proceeding with empty schema"); | |||
schema = Schema::try_from(&ArrowSchema::empty())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When creating an empty dataset, users should provide a schema.
I'd imagine that the semantics will be very similar to SQL CREATE TABLE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed. Up until now we were extracting schema from the records but now that records might be absent in the case of empty dataset, schema must be provided or error will be thrown. This is what mostly this latest commit covers.
@@ -309,7 +309,8 @@ impl Dataset { | |||
return Err(Error::from(batch.as_ref().unwrap_err())); | |||
} | |||
} else { | |||
return Err(Error::EmptyDataset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to remove this EmptyDataset
as error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, indeed. It's gone in latest version.
@trueutkarsh Lemme know if you need anything else. We'd love to have this PR to get in soon. Thanks again for contribution. |
c90d804
to
cea60ac
Compare
Hi @eddyxu @wjones127, RecordBatchBuffer implemented trait schema (RecordBatchReader) by extracting schema from the batches it stored. Now that we have to support empty datasets with schema as well, we need to make sure that RecordBatchBuffer must store the schema for that case. Hence I added a new optional field in RecordBatchBuffer of type Option. Now whenever schema function is called upon dataset, if dataset has records, schema would be extracted from them or else the optional schema would be resolved and returned. Please let me know if you find something unclear in the code or any thing else you'd like me cover which I missed. Some doctests have been failing to which I made no changes so I would inspect what's going on there. Thanks |
646a045
to
2973a7f
Compare
rust/src/arrow/record_batch.rs
Outdated
#[derive(Debug)] | ||
pub struct RecordBatchBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we even want to keep this. It seems like all the uses can be handled either with Vec<RecordBatch>
or RecordBatchIterator<Vec<RecordBatch>::Iter>
. What do we think of removing it?
https://docs.rs/arrow-array/42.0.0/arrow_array/struct.RecordBatchIterator.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RecordBatchBuffer provides the ability to extend the record batches. RecordBatchIterator being an iterator (or wrapping an iterator) will only provide the read only view to the batches hence cannot do that. This functionality is used while writing batches limited by max rows per group in (dataset.rs::402, fragment.rs::84).
Vec provides that functionality but we cannot store schema for empty vector then.
Hence I think RecordBatchBuffer is best of both worlds so removing it is not feasible unless we change the implementation of fragment.rs::create and dataset.rs::write.
Please correct me if I misinterpreted anything.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But are there any place where we are adding batches at the same time as we are iterating?
If we are adding batches, I think we can just use Vec<RecordBatch>
.
If we are iterating, we could just use RecordBatchIterator<>
.
Is there a place where we need both at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you're right, thanks for pointing this out. After skimming through the critical places it doesn't seem like we need both functionality at the same time. I'll start making the changes accordingly.
…f, added support for it and made changes to all files in rust package. Added empty dataset operation test in python package
…BatchIterator's implementation. Fixed tests and README as well.
2973a7f
to
8265780
Compare
rust/src/dataset.rs
Outdated
use crate::arrow::*; | ||
// use crate::arrow::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. Just one comment.
Closes #954
Currently the test only checks schema properties. Are there more scenarios/checks that I should cover in the test ?
I can't figure out what else you can do with empty dataset since you can't change schema dynamically or add new values because of it.
Please let me know your thoughts over this.