feat: added a `commit` method on `TypeTracerReport` to identify touched buffers in the Dask DAG-building pass #3043

jpivarski · 2024-03-08T22:15:17Z

This is to avoid having to metadata-compute a large script twice, once to build the Dask DAG and determine which node is the compute output, and then again to identify the touched buffers. The goal is to identify touched buffers in the first pass, simultaneously with DAG-building.

The current algorithm for identifying touched data is purely by side-effects: dask-awkward plays the entire subgraph that ends with the compute node, and TypeTracerArray buffers put their labels into the TypeTracerReport if they are touched. This has to be done in a second pass because parts of the DAG might not be needed to compute the compute node. While the DAG is being built, before compute is called, we don't know which nodes will be dependencies for compute and which ones won't.

To do this in one pass, we need to annotate every Dask graph node with the set of labels that it would contribute if it is part of the subgraph that leads to the compute node. Once that subgraph is known, we can do a union over the sets of labels associated with the relevant Dask nodes.

To annotate each node independently with a data-touching implementation that relies on side-effects, each Dask operation (a dak.* call that results in one graph node) needs the TypeTracerReport to be empty at the start of the operation, get filled during the operation, stash the result (set of labels for that operation only), and reset the TypeTracerReport to an empty state for the next operation.

This is accomplished with the new TypeTracerReport.commit method in this PR. The TypeTracerReport now keeps a mapping of Dask node_id → set of labels and commit fills this database with one node, then resets its own set. (The mapping from buffers to their report does not need to be changed, since the report clears itself.)

The potential downsides to this approach are (1) copying the working set of labels to a per-node set of labels is a lot of set copies, which could be time-expensive, and (2) storing sets of labels for every node in the graph could be memory-expensive.

To alleviate both of these,

the working set (originally just a Python set of strings) was replaced with a FillableByteSet, which is a dict mapping expected labels to indexes and an array of booleans for each index
the per-node sets are each an ImmutableBitSet, which is the same dict (referenced, not copied) and a bit-packed version of that array.

Maybe I'm being too cautious with memory and we'd rather store per-node sets by referencing the byte array instead of converting it into a bit array. (Maybe the Python object overhead drowns out the factor of 8 in the array itself. It depends on the number of possible labels.) In that case, we'd move the byte array to the per-node data and create a new working set array, rather than zeroing it out, as in this PR. It's a knob we can tune to optimize later.

More performance tuning is available in the data_touched_in method, which converts each ImmutableBitSet into a Python set for the union; if label indexes are the same (true in many cases), it can be a bitwise or.

To use this, the TypeTracerReports associated with dask_awkward.Array._meta objects would have to be collected from the inputs of each Dask operation and propagated to the outputs (infectiously: a report in any input goes to all of the outputs), and the commit method has to be called on every report at the end of the Dask operation.

This PR only adds the method and doesn't break things. To see a performance benefit, it has to be used by dask-awkward.

jpivarski · 2024-03-08T22:15:32Z

@lgray @martindurant

jpivarski · 2024-03-08T22:16:53Z

Oh, yeah, and we also lose one feature: it used to be the case that TypeTracerReport would maintain the order of labels in data_touched. Now it's a set whose order can't be guaranteed. That probably isn't important, but I'm mentioning it just in case.

jpivarski · 2024-03-20T19:19:54Z

This PR, by itself, doesn't actually change things. When dask-awkward uses it, it will make the single-pass optimization possible. So I'll merge it as soon as tests pass.

jpivarski added 2 commits March 8, 2024 13:01

Report.data_touched can no longer be ordered (that's okay, right?)

23b4008

implemented TypeTracerReports as byte/bit-sets with a 'commit' method

52b3d07

jpivarski requested a review from agoose77 March 8, 2024 22:15

jpivarski temporarily deployed to docs March 8, 2024 22:25 — with GitHub Actions Inactive

refactor: small changes

a5bcd59

agoose77 temporarily deployed to docs March 11, 2024 09:43 — with GitHub Actions Inactive

Merge branch 'main' into jpivarski/TypeTracerReport-commit-method

36b98cb

jpivarski enabled auto-merge (squash) March 20, 2024 19:19

jpivarski temporarily deployed to docs March 20, 2024 20:26 — with GitHub Actions Inactive

jpivarski and others added 2 commits March 21, 2024 18:12

pylint was useful: '__len__' should return 'int', not a NumPy integer

f8418f8

Merge branch 'main' into jpivarski/TypeTracerReport-commit-method

61f3ac1

jpivarski deployed to docs March 21, 2024 23:26 — with GitHub Actions View deployment

jpivarski merged commit bf18460 into main Mar 21, 2024
38 checks passed

jpivarski deleted the jpivarski/TypeTracerReport-commit-method branch March 21, 2024 23:31

jpivarski mentioned this pull request Mar 25, 2024

wip: fix axis=0 concatenation dask-contrib/dask-awkward#457

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: added a `commit` method on `TypeTracerReport` to identify touched buffers in the Dask DAG-building pass #3043

feat: added a `commit` method on `TypeTracerReport` to identify touched buffers in the Dask DAG-building pass #3043

jpivarski commented Mar 8, 2024

jpivarski commented Mar 8, 2024

jpivarski commented Mar 8, 2024

jpivarski commented Mar 20, 2024

feat: added a commit method on TypeTracerReport to identify touched buffers in the Dask DAG-building pass #3043

feat: added a commit method on TypeTracerReport to identify touched buffers in the Dask DAG-building pass #3043

Conversation

jpivarski commented Mar 8, 2024

jpivarski commented Mar 8, 2024

jpivarski commented Mar 8, 2024

jpivarski commented Mar 20, 2024

feat: added a `commit` method on `TypeTracerReport` to identify touched buffers in the Dask DAG-building pass #3043

feat: added a `commit` method on `TypeTracerReport` to identify touched buffers in the Dask DAG-building pass #3043