[rust] Added support for creating empty dataset #1016

trueutkarsh · 2023-06-26T16:46:35Z

Closes #954

Currently the test only checks schema properties. Are there more scenarios or checks that I should cover in the test?
I can't figure out what else you can do with an empty dataset, since you can't change the schema dynamically or add new values because of that.

Please let me know your thoughts on this.

eddyxu · 2023-06-26T16:51:11Z

rust/src/dataset.rs

@@ -1070,8 +1071,9 @@ mod tests {
        let test_dir = tempdir().unwrap();
        let test_uri = test_dir.path().to_str().unwrap();
        let mut reader: Box<dyn RecordBatchReader> = Box::new(RecordBatchBuffer::empty());
-        let result = Dataset::write(&mut reader, test_uri, None).await;
-        assert!(matches!(result.unwrap_err(), Error::EmptyDataset { .. }));
+        let result = Dataset::write(&mut reader, test_uri, None).await.unwrap();


Could you also test that count_rows(), to_batch, and adding new rows work?
Please also make sure they work in Python.

trueutkarsh · 2023-06-27

Hi Lei,
I've tested count_rows() and adding new rows. I couldn't find anything regarding to_batch(). Could you please point me in the right direction on that?

eddyxu · 2023-06-28

My bad, maybe just trying to_table() would be sufficient.

I was asking whether the Dataset scan still works (i.e., returns an empty table).

trueutkarsh · 2023-06-28

Thanks for clarifying. I've added a test in the Python package that calls to_table() on an empty dataset and checks that the data is empty but the schema is in place. It then appends a new record and verifies the new state of the dataset as well.
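The checks requested in this thread (row count, scan, then append on an empty dataset) can be sketched as follows. This is a std-only illustration under stated assumptions: `ToyDataset` and its methods are hypothetical stand-ins and do not reproduce the lance `Dataset` API or the actual test added in this PR.

```rust
/// Hypothetical in-memory stand-in for a dataset (not the lance API).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDataset {
    /// Column names act as the schema in this sketch.
    schema: Vec<String>,
    /// Each row holds one i64 value per column.
    rows: Vec<Vec<i64>>,
}

impl ToyDataset {
    /// Create a dataset that has a schema but no rows.
    pub fn create_empty(columns: &[&str]) -> Self {
        Self {
            schema: columns.iter().map(|c| c.to_string()).collect(),
            rows: Vec::new(),
        }
    }

    /// Number of rows currently stored.
    pub fn count_rows(&self) -> usize {
        self.rows.len()
    }

    /// A scan returns the schema together with all rows, even when there are none.
    pub fn scan(&self) -> (&[String], &[Vec<i64>]) {
        (self.schema.as_slice(), self.rows.as_slice())
    }

    /// Append one row; the row must have one value per column.
    pub fn append(&mut self, row: Vec<i64>) -> Result<(), String> {
        if row.len() != self.schema.len() {
            return Err(format!(
                "expected {} values, got {}",
                self.schema.len(),
                row.len()
            ));
        }
        self.rows.push(row);
        Ok(())
    }
}
```

In this sketch, a freshly created dataset reports zero rows, its scan still exposes the schema, and a subsequent append succeeds as long as the row matches the declared columns.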

eddyxu · 2023-06-26T16:51:13Z

rust/src/dataset.rs

@@ -309,7 +309,8 @@ impl Dataset {
                return Err(Error::from(batch.as_ref().unwrap_err()));
            }
        } else {
-            return Err(Error::EmptyDataset);
+            warn!("Dataset is empty, proceeding with empty schema");
+            schema = Schema::try_from(&ArrowSchema::empty())?;


When creating an empty dataset, users should provide a schema.

I'd imagine that the semantics will be very similar to SQL CREATE TABLE.

trueutkarsh · 2023-06-27

Yes indeed. Until now we extracted the schema from the records, but since records may be absent for an empty dataset, a schema must now be provided or an error is returned. This is mostly what the latest commit covers.
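The rule discussed in this thread (infer the schema from the records when there are any, otherwise require one up front, much like SQL CREATE TABLE) can be sketched as follows. This is a minimal std-only illustration: `Schema`, `Batch`, `WriteError`, and `resolve_schema` are hypothetical names, not the types or functions in the lance crate.

```rust
/// Hypothetical stand-in for a schema: just a list of field names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical stand-in for a record batch that carries its own schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub schema: Schema,
    pub num_rows: usize,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum WriteError {
    /// No records were given and no schema was declared.
    MissingSchema,
}

/// Pick the schema to write with:
/// - if there are records, take the schema of the first batch;
/// - if there are none, fall back to the declared schema;
/// - if neither is available, return an error.
pub fn resolve_schema(
    batches: &[Batch],
    declared: Option<&Schema>,
) -> Result<Schema, WriteError> {
    match (batches.first(), declared) {
        (Some(first), _) => Ok(first.schema.clone()),
        (None, Some(schema)) => Ok(schema.clone()),
        (None, None) => Err(WriteError::MissingSchema),
    }
}
```

The sketch deliberately does not check that a declared schema agrees with the schema of the records; whether the real write path validates that is not stated in the thread.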

eddyxu · 2023-06-26T16:56:03Z

rust/src/dataset.rs

@@ -309,7 +309,8 @@ impl Dataset {
                return Err(Error::from(batch.as_ref().unwrap_err()));
            }
        } else {
-            return Err(Error::EmptyDataset);


We might be able to remove the EmptyDataset error entirely.

trueutkarsh · 2023-06-27

Yeah, indeed. It's gone in the latest version.

eddyxu · 2023-06-27T15:42:22Z

@trueutkarsh Lemme know if you need anything else. We'd love to get this PR in soon.

Thanks again for the contribution.

trueutkarsh · 2023-06-27T23:55:25Z

Hi @eddyxu @wjones127,
Let me briefly explain what I've tried to do this time.

RecordBatchBuffer implemented the schema method of the RecordBatchReader trait by extracting the schema from the batches it stored. Now that we also have to support empty datasets that carry a schema, RecordBatchBuffer must store the schema itself for that case. Hence I added a new optional field of type Option to RecordBatchBuffer. Now, whenever the schema function is called, the schema is extracted from the records if there are any; otherwise the optional schema is resolved and returned.
This resulted in minor changes to how a RecordBatchBuffer is constructed: an empty one must specify a schema, while one built from batches may skip it, although specifying it is good practice.

Please let me know if you find anything unclear in the code, or if there is anything else you'd like me to cover that I missed.

Some doctests that I made no changes to have been failing, so I will look into what's going on there.

Thanks
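The intermediate design described in this comment (a buffer that prefers the schema of its batches and falls back to a stored optional schema) can be sketched as below. This is an assumption-laden, std-only illustration: `Schema`, `Batch`, and `BatchBuffer` are hypothetical stand-ins, not the actual RecordBatchBuffer, which the thread later replaces altogether.

```rust
/// Hypothetical stand-in for a schema: just a list of field names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical stand-in for a record batch that carries its own schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub schema: Schema,
    pub num_rows: usize,
}

/// Hypothetical buffer: a growable list of batches plus an optional schema.
#[derive(Debug)]
pub struct BatchBuffer {
    batches: Vec<Batch>,
    schema: Option<Schema>,
}

impl BatchBuffer {
    /// An empty buffer has no batches to infer from, so a schema is required.
    pub fn empty(schema: Schema) -> Self {
        Self { batches: Vec::new(), schema: Some(schema) }
    }

    /// A buffer built from batches may omit the schema.
    pub fn new(batches: Vec<Batch>, schema: Option<Schema>) -> Self {
        Self { batches, schema }
    }

    /// Prefer the schema of the first stored batch; otherwise use the stored one.
    pub fn schema(&self) -> Option<&Schema> {
        self.batches
            .first()
            .map(|batch| &batch.schema)
            .or(self.schema.as_ref())
    }

    /// Add more batches to the buffer.
    pub fn extend(&mut self, more: impl IntoIterator<Item = Batch>) {
        self.batches.extend(more);
    }
}
```

Note that `schema()` in this sketch can still return `None` when a buffer is built from an empty `Vec` without a schema, which is one reason specifying the schema explicitly is the safer habit.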

wjones127 · 2023-06-28T18:03:23Z

rust/src/arrow/record_batch.rs

 #[derive(Debug)]
 pub struct RecordBatchBuffer {


I'm wondering if we even want to keep this. It seems like all the uses can be handled either with Vec<RecordBatch> or RecordBatchIterator<Vec<RecordBatch>::Iter>. What do we think of removing it?

https://docs.rs/arrow-array/42.0.0/arrow_array/struct.RecordBatchIterator.html

trueutkarsh · 2023-06-28

RecordBatchBuffer provides the ability to extend the set of record batches. RecordBatchIterator, being an iterator (or a wrapper around one), only provides a read-only view of the batches, so it cannot do that. This functionality is used when writing batches limited by max rows per group (dataset.rs::402, fragment.rs::84).
Vec provides that functionality, but then we cannot store a schema for an empty vector.
Hence I think RecordBatchBuffer is the best of both worlds, so removing it is not feasible unless we change the implementation of fragment.rs::create and dataset.rs::write.
Please correct me if I misinterpreted anything.
Thanks

wjones127 · 2023-06-28

But is there any place where we are adding batches at the same time as we are iterating?

If we are adding batches, I think we can just use Vec<RecordBatch>.
If we are iterating, we could just use RecordBatchIterator<>.

Is there a place where we need both at the same time?

trueutkarsh · 2023-06-28

Well, you're right, thanks for pointing this out. After skimming through the critical places, it doesn't seem like we need both pieces of functionality at the same time. I'll start making the changes accordingly.
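The separation agreed on here (accumulate with a plain Vec, then read through an iterator) can be sketched as follows. This is a std-only illustration under stated assumptions: `Batch`, `group_batches`, and `total_rows` are hypothetical names, and the grouping rule is a guess at what "limited by max rows per group" means, not the logic in dataset.rs or fragment.rs.

```rust
/// Hypothetical stand-in for a record batch; only the row count matters here.
#[derive(Debug)]
pub struct Batch {
    pub num_rows: usize,
}

/// Accumulation phase: collect incoming batches into groups whose total row
/// count stays at or below `max_rows_per_group`. Only a `Vec` is needed here.
/// A single batch larger than the limit gets a group of its own; this sketch
/// does not split batches.
pub fn group_batches(
    input: impl IntoIterator<Item = Batch>,
    max_rows_per_group: usize,
) -> Vec<Vec<Batch>> {
    let mut groups = Vec::new();
    let mut current: Vec<Batch> = Vec::new();
    let mut current_rows = 0;
    for batch in input {
        // Close the current group if adding this batch would exceed the limit.
        if !current.is_empty() && current_rows + batch.num_rows > max_rows_per_group {
            groups.push(std::mem::take(&mut current));
            current_rows = 0;
        }
        current_rows += batch.num_rows;
        current.push(batch);
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}

/// Read phase: a finished group is only iterated, never extended.
pub fn total_rows<'a>(group: impl IntoIterator<Item = &'a Batch>) -> usize {
    group.into_iter().map(|batch| batch.num_rows).sum()
}
```

Because the two phases never overlap, no type has to support both extending and iterating at once, which is the observation that makes a dedicated buffer type unnecessary in this thread.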

wjones127 · 2023-07-01T19:37:40Z

rust/src/dataset.rs

-use crate::arrow::*;
+// use crate::arrow::*;


Can we remove this?

wjones127

Looks pretty good. Just one comment.

eddyxu reviewed Jun 26, 2023

eddyxu requested changes Jun 26, 2023

eddyxu reviewed Jun 26, 2023

eddyxu requested a review from wjones127 June 26, 2023 16:57

trueutkarsh force-pushed the empty_dataset branch from c90d804 to cea60ac Compare June 27, 2023 23:15

trueutkarsh force-pushed the empty_dataset branch from 646a045 to 2973a7f Compare June 28, 2023 06:56

wjones127 reviewed Jun 28, 2023

trueutkarsh added 5 commits June 29, 2023 16:49

Added support for creating empty dataset

50c0ca7

Changed as per comments - RecordBatchBuffer now has optional schemaref, added support for it and made changes to all files in rust package. Added empty dataset operation test in python package

75e460f

fmt missed files

2ab92c1

Fixed lint isort/clippy/ruff error

8a8b40d

Removed RecordBatchBuffer as module and replaced it with Arrow RecordBatchIterator's implementation. Fixed tests and README as well.

8265780

trueutkarsh force-pushed the empty_dataset branch from 2973a7f to 8265780 Compare June 29, 2023 11:23

Removed unused error

b680ae0

wjones127 reviewed Jul 1, 2023

wjones127 reviewed Jul 1, 2023

wjones127 mentioned this pull request Jul 1, 2023

feat: upgrade arrow, make write Send #1033

Merged

cleanup

5be9b60

wjones127 approved these changes Jul 1, 2023

wjones127 requested a review from eddyxu July 1, 2023 19:44

eddyxu approved these changes Jul 1, 2023

changhiskhan merged commit 7583ec0 into lancedb:main Jul 1, 2023
