feat: added a commit
method on TypeTracerReport
to identify touched buffers in the Dask DAG-building pass
#3043
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is to avoid having to metadata-compute a large script twice, once to build the Dask DAG and determine which node is the
compute
output, and then again to identify the touched buffers. The goal is to identify touched buffers in the first pass, simultaneously with DAG-building.The current algorithm for identifying touched data is purely by side-effects: dask-awkward plays the entire subgraph that ends with the
compute
node, andTypeTracerArray
buffers put their labels into theTypeTracerReport
if they are touched. This has to be done in a second pass because parts of the DAG might not be needed to compute thecompute
node. While the DAG is being built, beforecompute
is called, we don't know which nodes will be dependencies forcompute
and which ones won't.To do this in one pass, we need to annotate every Dask graph node with the set of labels that it would contribute if it is part of the subgraph that leads to the
compute
node. Once that subgraph is known, we can do a union over the sets of labels associated with the relevant Dask nodes.To annotate each node independently with a data-touching implementation that relies on side-effects, each Dask operation (a
dak.*
call that results in one graph node) needs theTypeTracerReport
to be empty at the start of the operation, get filled during the operation, stash the result (set of labels for that operation only), and reset theTypeTracerReport
to an empty state for the next operation.This is accomplished with the new
TypeTracerReport.commit
method in this PR. TheTypeTracerReport
now keeps a mapping of Dasknode_id
→ set of labels andcommit
fills this database with one node, then resets its own set. (The mapping from buffers to their report does not need to be changed, since the report clears itself.)The potential downsides to this approach are (1) copying the working set of labels to a per-node set of labels is a lot of set copies, which could be time-expensive, and (2) storing sets of labels for every node in the graph could be memory-expensive.
To alleviate both of these,
FillableByteSet
, which is a dict mapping expected labels to indexes and an array of booleans for each indexImmutableBitSet
, which is the same dict (referenced, not copied) and a bit-packed version of that array.Maybe I'm being too cautious with memory and we'd rather store per-node sets by referencing the byte array instead of converting it into a bit array. (Maybe the Python object overhead drowns out the factor of 8 in the array itself. It depends on the number of possible labels.) In that case, we'd move the byte array to the per-node data and create a new working set array, rather than zeroing it out, as in this PR. It's a knob we can tune to optimize later.
More performance tuning is available in the
data_touched_in
method, which converts eachImmutableBitSet
into a Pythonset
for the union; if label indexes are the same (true in many cases), it can be a bitwise or.To use this, the
TypeTracerReports
associated withdask_awkward.Array._meta
objects would have to be collected from the inputs of each Dask operation and propagated to the outputs (infectiously: a report in any input goes to all of the outputs), and thecommit
method has to be called on every report at the end of the Dask operation.This PR only adds the method and doesn't break things. To see a performance benefit, it has to be used by dask-awkward.