feat: add file compaction #1095
Conversation
rust/src/dataset/scanner.rs
Outdated
@@ -667,6 +667,12 @@ impl Stream for DatasetRecordBatchStream {
}
}

impl From<DatasetRecordBatchStream> for SendableRecordBatchStream {
TBH, I'm not entirely clear why we have our own wrapper for SendableRecordBatchStream. Seems like it would be much less of a headache to just use the DataFusion type and make sure we provide a conversion trait for the errors.
Let's just use DataFusion if feasible.
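A minimal sketch of what that could look like, assuming we convert Lance's error into `DataFusionError::External` and wrap the inner stream with DataFusion's `RecordBatchStreamAdapter` (the helper name `into_sendable` is illustrative, not part of this PR):

```rust
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
use datafusion::physical_plan::SendableRecordBatchStream;
use futures::{Stream, StreamExt};

/// Wrap any stream of `Result<RecordBatch, E>` into DataFusion's
/// `SendableRecordBatchStream` by mapping the error type.
fn into_sendable<E>(
    schema: SchemaRef,
    inner: impl Stream<Item = Result<RecordBatch, E>> + Send + 'static,
) -> SendableRecordBatchStream
where
    E: std::error::Error + Send + Sync + 'static,
{
    let mapped = inner.map(|res| res.map_err(|e| DataFusionError::External(Box::new(e))));
    Box::pin(RecordBatchStreamAdapter::new(schema, mapped))
}
```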
Force-pushed from a215a34 to e86d2d7
rust/src/dataset/optimize.rs
Outdated
/// This method tries to preserve the insertion order of rows in the dataset.
///
/// If no compaction is needed, this method will not make a new version of the table.
pub async fn compact_files(
I'm concerned that this is global-level planning + action (compaction). It might not be feasible for a large dataset. Should we split the plan and the action into two phases? A rough sketch of a possible split is below.
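To make the question concrete, here is a hypothetical sketch of such a split, reusing the `CompactionOptions` struct from this PR (none of the other names are the actual API): planning only reads metadata, each task is an independent rewrite that could run elsewhere, and a final commit combines the results.

```rust
// Hypothetical multi-step API; names and signatures are illustrative only.

/// A self-contained unit of work: rewrite one group of fragments.
pub struct CompactionTask {
    /// Ids of the fragments this task will rewrite.
    pub fragments: Vec<u64>,
}

/// Result of executing one task: the fragments that replace the inputs.
pub struct RewriteResult {
    pub old_fragments: Vec<u64>,
    pub new_fragments: Vec<u64>,
}

/// Step 1: inspect metadata only and decide what to rewrite.
pub fn plan_compaction(options: &CompactionOptions) -> Vec<CompactionTask> {
    // ... group small / heavily-deleted fragments into tasks ...
    unimplemented!()
}

/// Step 2: execute one task. This is the expensive, data-moving part and
/// could run on a separate worker.
pub async fn execute_task(task: CompactionTask) -> RewriteResult {
    unimplemented!()
}

/// Step 3: gather all results and commit a single new version.
pub async fn commit_compaction(results: Vec<RewriteResult>) {
    unimplemented!()
}
```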
rust/src/dataset/optimize.rs
Outdated
.buffer_unordered(options.num_concurrent_jobs);

// Prepare this so we can assign ids to the new fragments.
let mut current_fragment_id = dataset
Do you need to lock the dataset to prevent other writers from obtaining these fragment ids?
If there is a concurrent writer, one of them will fail. Eventually, we'll implement retries, at which point the losing writer will need to recompute the fragment ids starting at the new max.
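A rough sketch of that retry loop, just to illustrate the idea; `fragments()`, `assign_new_fragment_ids`, `try_commit`, `Error::CommitConflict`, and `checkout_latest` are all hypothetical names here, not the existing API:

```rust
// Illustrative optimistic-concurrency loop: if the commit loses the race,
// refresh to the latest version and recompute fragment ids from the new max.
async fn commit_with_retry(dataset: &mut Dataset) -> Result<(), Error> {
    loop {
        let max_id = dataset.fragments().iter().map(|f| f.id).max().unwrap_or(0);
        let new_fragments = assign_new_fragment_ids(max_id + 1);
        match try_commit(dataset, &new_fragments).await {
            Ok(()) => return Ok(()),
            // Another writer won the race; reload and retry with fresh ids.
            Err(Error::CommitConflict) => dataset.checkout_latest().await?,
            Err(e) => return Err(e),
        }
    }
}
```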
rust/src/dataset/optimize.rs
Outdated
// TODO: replace this with from_previous
let mut manifest = Manifest::new(dataset.schema(), Arc::new(final_fragments));

manifest.version = dataset
Can we abstract this away into a new commit() operation?
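For example, a helper along these lines (hypothetical signature; the body is only sketched) so callers don't build the manifest by hand:

```rust
// Hypothetical commit() helper: derive the next manifest from the current
// dataset, bump the version, and write it atomically.
async fn commit(dataset: &Dataset, final_fragments: Vec<Fragment>) -> Result<Dataset, Error> {
    let mut manifest = Manifest::new(dataset.schema(), Arc::new(final_fragments));
    manifest.version = dataset.version().version + 1;
    // ... write the manifest, handle the conflict case, return the new handle ...
    unimplemented!()
}
```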
rust/src/dataset/optimize.rs
Outdated
}

impl CompactionPlan {
    fn with_capacity(n: usize) -> Self {
What does capacity mean here (as a user of CompactionPlan)?
Do we want to make a progressive plan, i.e. only compact up to a certain number of fragments each time?
This is private right now. with_capacity is there mostly for performance reasons, since we know we'll have the full fragment list stored there.
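In other words, it only pre-allocates because the final number of fragments is known up front. A sketch (field names are illustrative, based on the doc comment later in this file):

```rust
// Sketch: `with_capacity` just pre-allocates the fragment list so it never
// reallocates while the plan is being built; `n` is the total fragment count.
struct CompactionPlan {
    fragments: Vec<Fragment>,
    to_compact: Vec<std::ops::Range<usize>>,
    to_keep: Vec<std::ops::Range<usize>>,
}

impl CompactionPlan {
    fn with_capacity(n: usize) -> Self {
        Self {
            fragments: Vec::with_capacity(n),
            to_compact: Vec::new(),
            to_keep: Vec::new(),
        }
    }
}
```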
python/python/lance/dataset.py
Outdated
    materialize_deletions_threshold=materialize_deletions_threshold,
    num_concurrent_jobs=num_concurrent_jobs,
)
return _compact_files(self._dataset._ds, opts)
Can we make this two phases (plan + (potentially distributed) execution)? I can see many cases where we need to be able to run this in a distributed fashion. A rough sketch of the driver flow is below.
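Reusing the hypothetical plan / execute / commit split sketched earlier in this thread (illustrative only; in practice each task would be serialized and shipped to a worker rather than awaited locally):

```rust
// Illustrative driver flow: plan locally, execute tasks (ideally on workers),
// then commit everything as one new version.
async fn distributed_compaction(options: &CompactionOptions) {
    // 1. Plan on a single node; this only reads metadata.
    let tasks = plan_compaction(options);

    // 2. Execute each task. Here they are just awaited concurrently, but the
    //    same tasks could be serialized and run on Spark/Ray workers.
    let results: Vec<RewriteResult> =
        futures::future::join_all(tasks.into_iter().map(execute_task)).await;

    // 3. Commit all rewrite results in a single new dataset version.
    commit_compaction(results).await;
}
```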
python/src/dataset.rs
Outdated
@@ -562,3 +562,83 @@ pub(crate) fn get_write_params(options: &PyDict) -> PyResult<Option<WriteParams>
};
Ok(params)
}

pub mod optimize {
Just separate this into another file under /dataset/?
rust/src/dataset/optimize.rs
Outdated
/// Options to be passed to [compact_files].
#[derive(Debug, Clone)]
pub struct CompactionOptions {
    /// Target number of rows per file. Defaults to 1 million.
One thing I've wanted to do for a long time is use column stats to find a better encoding when doing compaction. For example, the Procella paper describes a two-phase write that determines the optimal perf/cost trade-off of encodings and rewrites the data in the background.
Yeah I agree we should use stats to determine encoding, but IMO we should do that at the page level. So as we write pages, we first collect stats, then pass that to the encoder. It chooses the encoding based on the stats (all values the same => Constant encoding, distinct count small => dictionary encoding, etc.). That's part of the motivation for variable encodings.
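A tiny sketch of that page-level decision (encoding names and thresholds are illustrative, not the actual format):

```rust
// Illustrative stats-driven encoding choice made per page at write time.
enum PageEncoding {
    Plain,
    Constant,
    Dictionary,
}

struct PageStats {
    num_values: usize,
    distinct_count: usize,
}

fn choose_encoding(stats: &PageStats) -> PageEncoding {
    if stats.distinct_count <= 1 {
        // All values identical => store a single value.
        PageEncoding::Constant
    } else if stats.distinct_count * 10 < stats.num_values {
        // Few distinct values relative to the page size => dictionary-encode.
        PageEncoding::Dictionary
    } else {
        PageEncoding::Plain
    }
}
```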
That's a good point. With that in mind, it means that we need to put page encoding metadata with each page?
Also are there chances that we want to merge/split pages?
> Also are there chances that we want to merge/split pages?

This should do that automatically, since it reads the data in as Arrow, and then streams that into the writer.

> With that in mind, it means that we need to put page encoding metadata with each page?

Yup. That's part of the variable encodings design.
rust/src/dataset/optimize.rs
Outdated
/// to compact is represented as a range of indices in the `fragments` list. We
/// also track the ranges of fragments that should be kept as-is, to make it
/// easier to build the final list of fragments.
struct CompactionPlan {
If we consider "rewrite for better encoding", this is not just a compaction plan, right? It is more like an Optimize(Storage)Plan.
Force-pushed from e86d2d7 to a443248
Force-pushed from aaef7bb to a93a5eb
Note: need to refactor for this: #1127 (comment)
Force-pushed from a93a5eb to 789a31d
Force-pushed from 789a31d to 50e8068
Nothing but minor thoughts. This looks great. I wonder if we might have other "plan then execute" distributed tasks in the future (e.g. building an index) and this serves as a good template.
When files are rewritten, the original row ids are invalidated. This means the
affected files are no longer part of any ANN index if they were before. Because
of this, it's recommended to rewrite files before re-building indices.
Suggested change:
- of this, it's recommended to rewrite files before re-building indices.
+ of this, it's recommended to rebuild indices after rewriting files.
I'm not sure if you are trying to say "after you rewrite, you need to make sure to rebuild indices" or "if you're going to build indices, you should probably rewrite first so you don't lose the index data later", but I think the former is more essential to communicate (though both could also be said if you want).
I guess I say it this way because I don't think you have to rebuild the index after compaction. If the part that was compacted was data that wasn't indexed in the first place (because it was recently appended), you don't necessarily have to rebuild the index. But it is always a waste to build indices and then do compaction, if you are planning on doing both.
materialize_deletions_threshold: float, default 0.1
    The fraction of original rows that are soft deleted in a fragment
    before the fragment is a candidate for compaction.
Hmm, I would think a simple limit (e.g. 100 soft-deleted rows) would be fine here. Why go with a proportion?
I can't say I have a great justification without any benchmarks. I've heard engineers on similar projects say 10% was the threshold at which they saw deterioration in scan performance, so I guess there's a little cargo-culting here :)
I could set up a benchmark parameterized by scale and proportion deleted to see how scan performance is affected, and then see whether a fixed value or a proportional value allows for the most broadly applicable default.
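For reference, the proportional check is essentially this (a sketch; parameter names follow the Python option above):

```rust
// Sketch: a fragment is a candidate for rewriting when the *fraction* of
// soft-deleted rows crosses the threshold, rather than an absolute count.
fn should_materialize_deletions(
    deleted_rows: u64,
    physical_rows: u64,
    materialize_deletions_threshold: f64, // e.g. 0.1 == 10%
) -> bool {
    physical_rows > 0
        && (deleted_rows as f64) / (physical_rows as f64) >= materialize_deletions_threshold
}
```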
Force-pushed from da29586 to 73ed44d
wip: start outlining compaction impl
wip: implement compaction
complete impl
docs: add good rust docs
feat: expose compact_files in python
fix: handle deletions better
docs: rewrite for final api
pr feedback
wip: refactor for tasks
feat: add distributed compaction API
test distributed
wip: add python distributed bindings
Python api and docs
migrate to transaction api format
Apply suggestions from code review
Co-authored-by: Weston Pace <[email protected]>
make more flexible
Force-pushed from 73ed44d to 6b14c40
Closes #934