Benchmark refactoring: tidy data and multi-node capability via --scheduler-file #940
Conversation
Removes the async from the Cluster and Client definitions (every future was immediately waited on, so there is no need for this complication). Factors out the monolithic run-and-produce-data function into pieces, and builds a tidy dataframe (one row per experimental condition) for saving.
Adds an option to connect to a cluster specified in a file (via --scheduler-file), as well as creating a tidy data table for later visualisation/analysis.
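A minimal sketch of what the scheduler-file connection looks like with dask.distributed (the function name and the local fallback here are illustrative, not the PR's exact code):

```python
from distributed import Client


def get_client(scheduler_file=None):
    """Connect to an existing cluster, or spin up a local one."""
    if scheduler_file is not None:
        # The file is written by ``dask-scheduler --scheduler-file <path>``
        # and contains the scheduler's address, so the benchmark can attach
        # to a cluster spanning multiple nodes.
        return Client(scheduler_file=scheduler_file)
    # Otherwise fall back to a single-node cluster (illustrative default).
    from dask_cuda import LocalCUDACluster

    return Client(LocalCUDACluster())
```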
This will conflict with #937 and #938, so I should coordinate with @pentschev.
Codecov Report
```
@@             Coverage Diff              @@
##           branch-22.08    #940   +/-  ##
=============================================
  Coverage          0.00%   0.00%
=============================================
  Files                22      22
  Lines              3149    3260   +111
=============================================
- Misses             3149    3260   +111
```
Continue to review full report at Codecov.
Let's think about the data we want to store as part of a benchmark. At the moment, I store some aggregate data, and then summary statistics for point-to-point messaging between workers. This really blows up the storage of the data when using JSON files (a disadvantage of the tidy-data format, since much information is repeated). For example, on 256 GPUs, a single run's benchmark data in this format consumes 40MB, and the majority of that is bumf. The in-memory representation is 18MB (which is still blown up).

I think the p2p data is best stored just as a dense matrix: at single precision this would be (ngpus**2)*4 bytes (or 256KB in this case), multiplied by the number of different statistical measures we want (probably fewer than 10). The remaining overall summary statistics will then be a few KB, which is much more workable. Other things that one should probably capture:
Following on from #938, the minimum information one should probably gather is:
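To make the arithmetic above concrete, a sketch of the dense-matrix storage (the particular set of statistics is illustrative):

```python
import numpy as np

ngpus = 256
# Illustrative choice of summary statistics; fewer than 10 in practice.
measures = ["bandwidth_mean", "bandwidth_median", "latency_mean"]

# One dense (ngpus x ngpus) float32 matrix per measure; entry [i, j]
# summarises point-to-point messages from worker i to worker j.
p2p = {m: np.zeros((ngpus, ngpus), dtype=np.float32) for m in measures}

per_matrix = ngpus**2 * 4            # 262144 bytes, i.e. 256KB
total = per_matrix * len(measures)   # a few hundred KB, not 40MB
```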
The metrics and the means to achieve reasonable results sound sensible to me; I'm +1 on the whole idea. The only comment I have is: I'm assuming you're already considering this, but just for the sake of completeness, perhaps we would like to make this easily extendable, so that we don't need to know all the metrics we may want in advance, and can add them as we encounter the need for them.
Simplifies new benchmark creation and makes sure that data aggregation is scalable for all benchmarks to large worker counts. The idea is that a benchmark now needs to define three functions (sketched below):
- bench_once (running the benchmark a single time)
- pretty_print_results (to produce human-readable output)
- create_tidy_results (to produce computer-readable output)
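A sketch of a benchmark module under this scheme (the exact signatures in dask-cuda may differ; the argument names and the placeholder workload are assumptions):

```python
import time

import pandas as pd


def bench_once(client, args):
    # Run the benchmark a single time; a placeholder workload here.
    start = time.monotonic()
    client.submit(sum, range(10_000)).result()
    return {"duration": time.monotonic() - start}


def pretty_print_results(args, results):
    # Human-readable output on stdout.
    for r in results:
        print(f"duration: {r['duration']:.3f}s")


def create_tidy_results(args, results):
    # Computer-readable output: one row per run, with the experimental
    # conditions repeated on every row.
    return pd.DataFrame(
        dict(r, n_workers=args.n_workers, protocol=args.protocol)
        for r in results
    )
```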
This is now hopefully easier; one must adapt the
OAOO (once and only once) for forkserver setup
I think this is ready for another look.
Overall this all looks really well-organized and more complete, certainly a big leap from the old implementation, thanks @wence-! I left a few suggestions.
Rather than randomly sorting the workers on a host (by port number), sort by host and then by device id (if it is set).
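A sketch of such a sort key (the worker-info fields here are assumptions about what the scheduler reports, not the PR's code):

```python
# Hypothetical worker records, as the scheduler might report them.
worker_infos = [
    {"address": "tcp://node1:43210", "gpu_device_id": 1},
    {"address": "tcp://node1:43211", "gpu_device_id": 0},
    {"address": "tcp://node0:40000", "gpu_device_id": None},
]


def worker_sort_key(info):
    # Sort by host first, then by GPU device id when one is set, so the
    # ordering is deterministic rather than depending on ephemeral ports.
    host, _, port = info["address"].rpartition(":")
    device = info.get("gpu_device_id")
    return (host, device if device is not None else -1, int(port))


workers = sorted(worker_infos, key=worker_sort_key)
```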
Looks very nice now, just one minor missing change required and then we're done.
LGTM now, but I think we'll need the fixes for cuda-python<11.7.1 to be merged before we can merge this. Thanks @wence- for the nice work in making things much more professional-looking!
Once more, thanks @wence-!
@gpucibot merge
This is a move towards using the benchmarks for regular profiling on more than one node.
That requires two substantive changes:
- the ability to connect to an existing cluster via --scheduler-file (alternatively, dask-mpi could be used I think, but I haven't done so)
- tidy data output (one row per experimental condition) for later visualisation/analysis

I've refactored the benchmarks into common infrastructure, which simplifies new benchmark creation.
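For illustration, the tidy ("one row per experimental condition") layout looks roughly like this (column names and values are made up for the example):

```python
import pandas as pd

# Every row is one observation; the experimental conditions (worker
# count, protocol) are ordinary columns, so the frame can be grouped
# and filtered directly for visualisation/analysis.
df = pd.DataFrame(
    {
        "n_workers": [8, 8, 16, 16],
        "protocol": ["tcp", "ucx", "tcp", "ucx"],
        "duration_s": [12.1, 7.3, 6.8, 4.0],
    }
)
print(df.groupby("protocol")["duration_s"].mean())
```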