Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rust] Added support for creating empty dataset #1016

Merged
merged 7 commits into from
Jul 1, 2023

Conversation

trueutkarsh
Copy link
Contributor

Closes #954

Currently the test only checks schema properties. Are there more scenarios/checks that I should cover in the test ?
I can't figure out what else you can do with empty dataset since you can't change schema dynamically or add new values because of it.

Please let me know your thoughts over this.

@@ -1070,8 +1071,9 @@ mod tests {
let test_dir = tempdir().unwrap();
let test_uri = test_dir.path().to_str().unwrap();
let mut reader: Box<dyn RecordBatchReader> = Box::new(RecordBatchBuffer::empty());
let result = Dataset::write(&mut reader, test_uri, None).await;
assert!(matches!(result.unwrap_err(), Error::EmptyDataset { .. }));
let result = Dataset::write(&mut reader, test_uri, None).await.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also test that count_rows() and to_batch, and add new rows work?
Also make sure they work in Python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lei,
I've tested count_rows() and adding new rows. I couldn't find anything regarding to_batch(). Could you please point me in the right direction regarding it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, maybe just try to_table() should be sufficient.

I was asking that the Dataset scan still works (i.e., returning empty table)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I've added a test in python package where we check to_table() on empty dataset see the data is empty but schema is in place, then append a new record and verify the new state of the new dataset as well.

@@ -309,7 +309,8 @@ impl Dataset {
return Err(Error::from(batch.as_ref().unwrap_err()));
}
} else {
return Err(Error::EmptyDataset);
warn!("Dataset is empty, proceeding with empty schema");
schema = Schema::try_from(&ArrowSchema::empty())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When creating an empty dataset, users should provide a schema.

I'd imagine that the semantics will be very similar to SQL CREATE TABLE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed. Up until now we were extracting schema from the records but now that records might be absent in the case of empty dataset, schema must be provided or error will be thrown. This is what mostly this latest commit covers.

@@ -309,7 +309,8 @@ impl Dataset {
return Err(Error::from(batch.as_ref().unwrap_err()));
}
} else {
return Err(Error::EmptyDataset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to remove this EmptyDataset as error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, indeed. It's gone in latest version.

@eddyxu eddyxu requested a review from wjones127 June 26, 2023 16:57
@eddyxu
Copy link
Contributor

eddyxu commented Jun 27, 2023

@trueutkarsh Lemme know if you need anything else. We'd love to have this PR to get in soon.

Thanks again for contribution.

@trueutkarsh
Copy link
Contributor Author

Hi @eddyxu @wjones127,
Let me briefly explain what I've tried to do this time.

RecordBatchBuffer implemented trait schema (RecordBatchReader) by extracting schema from the batches it stored. Now that we have to support empty datasets with schema as well, we need to make sure that RecordBatchBuffer must store the schema for that case. Hence I added a new optional field in RecordBatchBuffer of type Option. Now whenever schema function is called upon dataset, if dataset has records, schema would be extracted from them or else the optional schema would be resolved and returned.
This resulted in minor changes in constructing empty RecordBatchBuffer or with batches where in first case you must specify schema and in latter you can optionally skip but good practice to specify.

Please let me know if you find something unclear in the code or any thing else you'd like me cover which I missed.

Some doctests have been failing to which I made no changes so I would inspect what's going on there.

Thanks

Comment on lines 26 to 27
#[derive(Debug)]
pub struct RecordBatchBuffer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we even want to keep this. It seems like all the uses can be handled either with Vec<RecordBatch> or RecordBatchIterator<Vec<RecordBatch>::Iter>. What do we think of removing it?

https://docs.rs/arrow-array/42.0.0/arrow_array/struct.RecordBatchIterator.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecordBatchBuffer provides the ability to extend the record batches. RecordBatchIterator being an iterator (or wrapping an iterator) will only provide the read only view to the batches hence cannot do that. This functionality is used while writing batches limited by max rows per group in (dataset.rs::402, fragment.rs::84).
Vec provides that functionality but we cannot store schema for empty vector then.
Hence I think RecordBatchBuffer is best of both worlds so removing it is not feasible unless we change the implementation of fragment.rs::create and dataset.rs::write.
Please correct me if I misinterpreted anything.
Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But are there any place where we are adding batches at the same time as we are iterating?

If we are adding batches, I think we can just use Vec<RecordBatch>.
If we are iterating, we could just use RecordBatchIterator<>.

Is there a place where we need both at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you're right, thanks for pointing this out. After skimming through the critical places it doesn't seem like we need both functionality at the same time. I'll start making the changes accordingly.

…f, added support for it and made changes to all files in rust package. Added empty dataset operation test in python package
…BatchIterator's implementation. Fixed tests and README as well.
use crate::arrow::*;
// use crate::arrow::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Just one comment.

@wjones127 wjones127 requested a review from eddyxu July 1, 2023 19:44
@changhiskhan changhiskhan merged commit 7583ec0 into lancedb:main Jul 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support creating an empty table
4 participants