Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added a commit method on TypeTracerReport to identify touched buffers in the Dask DAG-building pass #3043

Merged
merged 6 commits into from
Mar 21, 2024

Conversation

jpivarski
Copy link
Member

This is to avoid having to metadata-compute a large script twice, once to build the Dask DAG and determine which node is the compute output, and then again to identify the touched buffers. The goal is to identify touched buffers in the first pass, simultaneously with DAG-building.

The current algorithm for identifying touched data is purely by side-effects: dask-awkward plays the entire subgraph that ends with the compute node, and TypeTracerArray buffers put their labels into the TypeTracerReport if they are touched. This has to be done in a second pass because parts of the DAG might not be needed to compute the compute node. While the DAG is being built, before compute is called, we don't know which nodes will be dependencies for compute and which ones won't.

To do this in one pass, we need to annotate every Dask graph node with the set of labels that it would contribute if it is part of the subgraph that leads to the compute node. Once that subgraph is known, we can do a union over the sets of labels associated with the relevant Dask nodes.

To annotate each node independently with a data-touching implementation that relies on side-effects, each Dask operation (a dak.* call that results in one graph node) needs the TypeTracerReport to be empty at the start of the operation, get filled during the operation, stash the result (set of labels for that operation only), and reset the TypeTracerReport to an empty state for the next operation.

This is accomplished with the new TypeTracerReport.commit method in this PR. The TypeTracerReport now keeps a mapping of Dask node_id → set of labels and commit fills this database with one node, then resets its own set. (The mapping from buffers to their report does not need to be changed, since the report clears itself.)

The potential downsides to this approach are (1) copying the working set of labels to a per-node set of labels is a lot of set copies, which could be time-expensive, and (2) storing sets of labels for every node in the graph could be memory-expensive.

To alleviate both of these,

  • the working set (originally just a Python set of strings) was replaced with a FillableByteSet, which is a dict mapping expected labels to indexes and an array of booleans for each index
  • the per-node sets are each an ImmutableBitSet, which is the same dict (referenced, not copied) and a bit-packed version of that array.

Maybe I'm being too cautious with memory and we'd rather store per-node sets by referencing the byte array instead of converting it into a bit array. (Maybe the Python object overhead drowns out the factor of 8 in the array itself. It depends on the number of possible labels.) In that case, we'd move the byte array to the per-node data and create a new working set array, rather than zeroing it out, as in this PR. It's a knob we can tune to optimize later.

More performance tuning is available in the data_touched_in method, which converts each ImmutableBitSet into a Python set for the union; if label indexes are the same (true in many cases), it can be a bitwise or.

To use this, the TypeTracerReports associated with dask_awkward.Array._meta objects would have to be collected from the inputs of each Dask operation and propagated to the outputs (infectiously: a report in any input goes to all of the outputs), and the commit method has to be called on every report at the end of the Dask operation.

This PR only adds the method and doesn't break things. To see a performance benefit, it has to be used by dask-awkward.

@jpivarski jpivarski requested a review from agoose77 March 8, 2024 22:15
@jpivarski
Copy link
Member Author

@lgray @martindurant

@jpivarski
Copy link
Member Author

Oh, yeah, and we also lose one feature: it used to be the case that TypeTracerReport would maintain the order of labels in data_touched. Now it's a set whose order can't be guaranteed. That probably isn't important, but I'm mentioning it just in case.

@jpivarski
Copy link
Member Author

This PR, by itself, doesn't actually change things. When dask-awkward uses it, it will make the single-pass optimization possible. So I'll merge it as soon as tests pass.

@jpivarski jpivarski merged commit bf18460 into main Mar 21, 2024
38 checks passed
@jpivarski jpivarski deleted the jpivarski/TypeTracerReport-commit-method branch March 21, 2024 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants