

support append commit #982

Closed
wants to merge 1 commit

Conversation

LiWeiJie (Contributor)

[rust] allow append commit in dataset
[python] fix the missing mode in the dataset commit interface


@wjones127 (Contributor) left a comment


It seems like we need better validation when making these commits. We can't have duplicate fragment ids, because that breaks the uniqueness of row ids.
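To make the concern concrete, here is a toy sketch (not Lance's actual implementation; the 32-bit shift and the helper names are assumptions for illustration): if row ids are derived from the fragment id plus a row offset, two fragments sharing an id produce colliding row ids, so a commit path could reject duplicate ids up front.

```python
# Illustration only: a toy model of fragment-id-based row addressing.
# The shift width and function names are assumptions, not Lance's API.

def row_id(fragment_id: int, offset: int) -> int:
    """Toy row id: high bits hold the fragment id, low bits the row offset."""
    return (fragment_id << 32) | offset

def validate_unique_fragment_ids(fragment_ids) -> None:
    """Reject a commit whose fragment list repeats an id."""
    seen = set()
    for fid in fragment_ids:
        if fid in seen:
            raise ValueError(f"duplicate fragment id: {fid}")
        seen.add(fid)

# Two distinct fragments that share id 0 collide on their first row:
assert row_id(0, 0) == row_id(0, 0)
validate_unique_fragment_ids([0, 1, 2])  # passes; [0, 1, 0] would raise
```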

    let schema = if object_store.exists(&latest_manifest).await? {
        let dataset = Self::open(base_uri).await?;
        version = dataset.version().version + 1;

        if matches!(mode, WriteMode::Append) {
            // Append mode: inherit indices from previous version.
            indices = dataset.load_indices().await?;
            dataset_fragments = dataset.fragments().iter().map(|f| f.clone()).collect();

This will create duplicate paths in the current write append code path. The old fragments are already added here:

lance/rust/src/dataset.rs, lines 355 to 358 in d3e8153:

    let mut fragments: Vec<Fragment> = if matches!(params.mode, WriteMode::Append) {
        dataset
            .as_ref()
            .map_or(vec![], |d| d.manifest.fragments.as_ref().clone())

This is a breaking change in behavior of a public API, so I think I'd rather not make this change here if we can avoid it. But if we do make this change, we need to adjust the other code path as well, and add tests to make sure we aren't creating duplicate fragment entries.


BTW, how are you generating fragment ids to make them unique? We might want to add validation to this function to make sure we aren't creating duplicate IDs.
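One simple id-assignment scheme, sketched below (the helper name is hypothetical, not part of Lance), is to continue numbering after the largest id already in the manifest, which keeps ids unique as long as every writer goes through the same path:

```python
# Hypothetical helper: assign ids to new fragments by continuing
# after the largest id already present in the manifest.

def assign_fragment_ids(existing_ids, num_new: int) -> list[int]:
    """Return num_new fresh ids, starting just past the current maximum."""
    next_id = max(existing_ids, default=-1) + 1
    return list(range(next_id, next_id + num_new))

# With fragments 0..2 already committed, three new fragments get 3, 4, 5.
```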

@@ -603,7 +603,7 @@ def _commit(
     base_uri = str(base_uri)
     if not isinstance(new_schema, pa.Schema):
         raise TypeError(f"schema must be pyarrow.Schema, got {type(new_schema)}")
-    _Dataset.commit(base_uri, new_schema, fragments)
+    _Dataset.commit(base_uri, new_schema, fragments, mode)

+1 to this change.

Comment on lines +1151 to +1156:

    let fragments: Vec<Fragment> = dataset.fragments().iter().map(|f| f.clone()).collect();

    let new_dataset =
        Dataset::commit(test_uri, dataset.schema(), &fragments, WriteMode::Append)
            .await
            .unwrap();

I don't think this should be allowed because the added fragments have the same ids as the old ones.

@wjones127 (Contributor)

Perhaps what we want instead is a new struct to represent uncommitted fragments that doesn't have an id:

struct NewFragment {
    /// Files within the fragment.
    pub files: Vec<DataFile>,

    /// Optional file with deleted row ids.
    pub deletion_file: Option<DeletionFile>,
}

And then some function to append those to the log, adding new ids as appropriate:

impl Dataset {
    async fn append_new_fragments(dataset_uri: &str, new_fragments: &[NewFragment]) -> Result<Self> {
        ...
    }
}

That way we can handle the assignment of new fragment ids inside the function.
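The intended id handling can be sketched in Python terms (names and dict layout are hypothetical illustrations, not the eventual API): merge uncommitted, id-less fragments into the manifest, assigning each a fresh id past the current maximum.

```python
# Hypothetical sketch of the append logic described above.
# Fragments are modeled as plain dicts; this is not Lance's data model.

def append_new_fragments(manifest_fragments: list[dict], new_fragments: list[dict]) -> list[dict]:
    """Merge uncommitted (id-less) fragments into a manifest, assigning fresh ids."""
    next_id = max((f["id"] for f in manifest_fragments), default=-1) + 1
    appended = []
    for nf in new_fragments:
        appended.append({"id": next_id, **nf})
        next_id += 1
    return manifest_fragments + appended

manifest = [{"id": 0, "files": ["a.lance"]}, {"id": 1, "files": ["b.lance"]}]
merged = append_new_fragments(manifest, [{"files": ["c.lance"]}])
# The new fragment receives id 2; existing entries are untouched.
```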

@eddyxu what do you think of that?

@wjones127 (Contributor)

@LiWeiJie closing this since we've implemented support in _commit for Append and several other operations in #1193

You can see an example of append here:

table = pa.Table.from_pydict({"a": range(100), "b": range(100)})
base_dir = tmp_path / "test"
lance.write_dataset(table, base_dir)
fragment = lance.fragment.LanceFragment.create(base_dir, table)
append = lance.LanceOperation.Append([fragment])
with pytest.raises(OSError):
    # Must specify read version
    dataset = lance.LanceDataset._commit(base_dir, append)
dataset = lance.LanceDataset._commit(base_dir, append, read_version=1)
tbl = dataset.to_table()
expected = pa.Table.from_pydict(
    {
        "a": list(range(100)) + list(range(100)),
        "b": list(range(100)) + list(range(100)),
    }
)
assert tbl == expected

@wjones127 wjones127 closed this Sep 18, 2023