feat: support merge by row_id, row_addr #3254

chenkovsky · 2024-12-16T14:06:21Z

No description provided.

wjones127

Thanks for working on this @chenkovsky. I would like to see a few improvements to the unit tests, and then this is ready to go.

wjones127 · 2024-12-16T18:54:49Z

rust/lance/src/dataset.rs

+        let test_dir = tempdir().unwrap();
+        let test_uri = test_dir.path().to_str().unwrap();


If we aren't testing anything about the files, let's use an in-memory dataset instead.

Suggested change

-        let test_dir = tempdir().unwrap();
-        let test_uri = test_dir.path().to_str().unwrap();

wjones127 · 2024-12-16T18:55:29Z

rust/lance/src/dataset.rs

+        Dataset::write(data, test_uri, Some(write_params.clone()))
+            .await
+            .unwrap();
+
+        let mut dataset = Dataset::open(test_uri).await.unwrap();


If you re-use the dataset instance from write(), you can just use an in-memory dataset:

Suggested change

-        Dataset::write(data, test_uri, Some(write_params.clone()))
-            .await
-            .unwrap();
-
-        let mut dataset = Dataset::open(test_uri).await.unwrap();
+        let dataset = Dataset::write(data, "memory://", Some(write_params.clone()))
+            .await
+            .unwrap();

wjones127 · 2024-12-16T19:05:03Z

rust/lance/src/dataset.rs

+        let new_batch =
+            RecordBatch::try_new(new_schema.clone(), vec![row_ids.clone(), row_ids.clone()])
+                .unwrap();
+        let new_data = RecordBatchIterator::new(vec![Ok(new_batch)], new_schema.clone());
+        dataset.merge(new_data, ROW_ID, "rowid").await.unwrap();
+        dataset.validate().await.unwrap();


I'd like us to assert a few more things in this test:

1. The dataset has the expected final schema: key, value, new_value.

2. The values are what we expect. For this, you should avoid using the same values in each column. Otherwise, the test could pass even if there is a bug that uses the wrong column's values. Right now, you use row_ids.clone() for both rowid and new_value.

3. The merge works even if you shuffle the data. I would recommend using take_record_batch() to reorder the new_batch so the row ids are out of order.
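The checks requested above can be illustrated with a small std-only model. This is only a sketch, not Lance's implementation: `merge_by_row_id` and `take` are hypothetical helpers standing in for `Dataset::merge` and an Arrow take kernel, and columns are modelled as plain slices rather than Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical model of a left merge keyed on row id: for each existing
/// row id, look up the matching new value (`None` if there is no match).
fn merge_by_row_id(
    existing_ids: &[u64],
    new_ids: &[u64],
    new_values: &[i32],
) -> Vec<Option<i32>> {
    let lookup: HashMap<u64, i32> = new_ids
        .iter()
        .copied()
        .zip(new_values.iter().copied())
        .collect();
    existing_ids.iter().map(|id| lookup.get(id).copied()).collect()
}

/// Hypothetical helper that reorders a column by index, similar in spirit
/// to a take kernel. Panics if an index is out of bounds.
fn take<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i]).collect()
}
```

In a test built on this model, the new values (say 100, 101, 102, 103) would differ from the row ids (0, 1, 2, 3), so a bug that returned the join column instead of `new_value` would fail the assertion. Applying `take` with the same shuffled indices to both the ids and the values checks that the result does not depend on input order.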

wjones127 · 2024-12-16T19:05:28Z

rust/lance/src/dataset.rs

+        // This test also tests "null filling" when merging (e.g. when keys do not match
+        // we need to insert nulls)


Where is the null filling? It seems like you are providing every row id, unless I am missing something.


Sorry, I copied and modified another test.
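For reference, a test that really exercised null filling would supply new data for only some of the row ids and assert that the unmatched rows come back null. A minimal std-only sketch under that assumption, where `left_merge` is a hypothetical stand-in for the merge (not the Lance API) and `None` models an Arrow null:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a left merge: every existing row is kept, and
/// rows whose id has no match in `new_rows` get `None` (modelling a null).
fn left_merge(existing_ids: &[u64], new_rows: &[(u64, &str)]) -> Vec<(u64, Option<String>)> {
    let lookup: HashMap<u64, &str> = new_rows.iter().copied().collect();
    existing_ids
        .iter()
        .map(|&id| (id, lookup.get(&id).map(|v| v.to_string())))
        .collect()
}
```

Since the test in this PR provides every row id, no row would take the `None` branch, which is why the copied comment does not apply.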

wjones127 · 2024-12-16T19:05:45Z

rust/lance/src/dataset.rs

+    #[rstest]
+    #[tokio::test]
+    async fn test_merge_on_row_addr(
+        #[values(LanceFileVersion::Legacy, LanceFileVersion::Stable)]
+        data_storage_version: LanceFileVersion,
+        #[values(false, true)] use_stable_row_id: bool,


Same comments from the row id test apply here.

codecov-commenter · 2024-12-17T00:57:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.88%. Comparing base (83b8efd) to head (b216aa0).
Report is 7 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3254      +/-   ##
==========================================
+ Coverage   78.47%   78.88%   +0.41%     
==========================================
  Files         245      246       +1     
  Lines       85088    86568    +1480     
  Branches    85088    86568    +1480     
==========================================
+ Hits        66772    68292    +1520     
+ Misses      15501    15450      -51     
- Partials     2815     2826      +11

Flag	Coverage Δ
unittests	`78.88% <100.00%> (+0.41%)`	⬆️


wjones127

Looks better. Have a few suggestions to clean up the tests. Then I think this is ready to be merged.

wjones127 · 2024-12-17T17:08:03Z

rust/lance/src/dataset.rs

+        let result = dataset
+            .scan()
+            .try_into_stream()
+            .await
+            .unwrap()
+            .try_collect::<Vec<_>>()
+            .await
+            .unwrap();


You can use .try_into_batch() to collect the results into one batch immediately. You will also need to remove the for loop below.

Suggested change

-        let result = dataset
-            .scan()
-            .try_into_stream()
-            .await
-            .unwrap()
-            .try_collect::<Vec<_>>()
-            .await
-            .unwrap();
+        let result = dataset
+            .scan()
+            .try_into_batch()
+            .await
+            .unwrap();
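To show why the later for loop becomes unnecessary: collecting a stream yields a list of batches that must still be iterated, whereas a single concatenated batch can be asserted on directly. A std-only sketch, where `collect_batches` and `into_single_batch` are hypothetical stand-ins (not the Lance scanner API) and a batch is modelled as a `Vec<i32>`:

```rust
/// Collect fallible batches into a list; the caller must still loop over them.
/// Stops at the first error.
fn collect_batches(batches: Vec<Result<Vec<i32>, String>>) -> Result<Vec<Vec<i32>>, String> {
    batches.into_iter().collect()
}

/// Collect fallible batches and concatenate them into one batch, so a test
/// can assert on the whole result at once.
fn into_single_batch(batches: Vec<Result<Vec<i32>, String>>) -> Result<Vec<i32>, String> {
    Ok(collect_batches(batches)?.concat())
}
```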

wjones127 · 2024-12-17T17:10:03Z

rust/lance/src/dataset.rs

+            let key = batch
+                .column_by_name("key")
+                .unwrap()
+                .as_any()
+                .downcast_ref::<arrow_array::Int32Array>()
+                .unwrap();


For tests, you can use RecordBatch["<column>"] to access a column by name. Also, .as_primitive() is typically shorter. Both of these are mostly suitable for tests because they can panic.

Suggested change

-            let key = batch
-                .column_by_name("key")
-                .unwrap()
-                .as_any()
-                .downcast_ref::<arrow_array::Int32Array>()
-                .unwrap();
+            let key = batch["key"].as_primitive::<Int32Type>();
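The trade-off between the two access styles can be sketched with the standard library alone. The `Batch` type below is a hypothetical toy (not Arrow's `RecordBatch`): `try_column` mirrors the fallible lookup-then-downcast chain, and `column` mirrors the short form that panics, which is acceptable in tests but usually not in library code.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Toy batch mapping a column name to a type-erased column
/// (hypothetical, for illustration only).
struct Batch {
    columns: HashMap<String, Box<dyn Any>>,
}

impl Batch {
    /// Fallible access: `None` if the column is missing or has another type.
    fn try_column<T: 'static>(&self, name: &str) -> Option<&Vec<T>> {
        self.columns.get(name)?.downcast_ref::<Vec<T>>()
    }

    /// Short form for tests: panics if the column is missing or mistyped.
    fn column<T: 'static>(&self, name: &str) -> &Vec<T> {
        self.try_column(name)
            .unwrap_or_else(|| panic!("column {name:?} missing or of unexpected type"))
    }
}
```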

feat: merge by row_id, row_addr

3f8071f

github-actions bot added the enhancement New feature or request label Dec 16, 2024

chenkovsky changed the title ~~feat: merge by row_id, row_addr~~ feat: support merge by row_id, row_addr Dec 16, 2024

chenkovsky mentioned this pull request Dec 16, 2024

_rowaddr and _rowid not exposed for merge? #3251

Open

broccoliSpicy requested a review from wjones127 December 16, 2024 15:42

wjones127 requested changes Dec 16, 2024


update test

0ce8ac1

wjones127 requested changes Dec 17, 2024


update test

b216aa0

wjones127 approved these changes Dec 18, 2024


wjones127 merged commit 95f98b3 into lancedb:main Dec 18, 2024
26 checks passed
