Benchmark refactoring: tidy data and multi-node capability via --scheduler-file #940
Conversation
Removes the async from the Cluster and Client definitions (every future was immediately waited on, so there is no need for this complication). Factors out the monolithic run-and-produce-data function into pieces, and builds a tidy dataframe (one row per experimental condition) for saving.
Adds an option to connect to a cluster specified in a file (via --scheduler-file), as well as creating a tidy data table for later visualisation/analysis.
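A minimal sketch of what the scheduler-file connection looks like with dask.distributed (the function name and the local fallback here are illustrative, not the PR's exact code):

```python
from distributed import Client


def get_client(scheduler_file=None):
    """Connect to an existing cluster, or spin up a local one."""
    if scheduler_file is not None:
        # The file is written by ``dask-scheduler --scheduler-file <path>``
        # and contains the scheduler's address, so the benchmark can attach
        # to a cluster spanning multiple nodes.
        return Client(scheduler_file=scheduler_file)
    # Otherwise fall back to a single-node cluster (illustrative default).
    from dask_cuda import LocalCUDACluster

    return Client(LocalCUDACluster())
```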
This will conflict with #937 and #938, so I should coordinate with @pentschev.
Codecov Report
```
@@             Coverage Diff              @@
##           branch-22.08    #940   +/-  ##
=============================================
  Coverage          0.00%   0.00%
=============================================
  Files                22      22
  Lines              3149    3260   +111
=============================================
- Misses             3149    3260   +111
```
Continue to review full report at Codecov.
Let's think about the data we want to store as part of a benchmark. At the moment, I store some aggregate data, and then summary statistics for point-to-point messaging between workers. This really blows up the storage of the data when using JSON files (a disadvantage of the tidy-data format, since much information is repeated). For example, on 256 GPUs, a single run's benchmark data in this format consumes 40MB, and the majority of that is bumf. The in-memory representation is 18MB (which is still blown up).

I think the p2p data is best stored just as a dense matrix: at single precision this would be (ngpus**2)*4 bytes (or 256KB in this case), multiplied by the number of different statistical measures we want (probably fewer than 10). The remaining overall summary statistics will then be a few KB, which is much more workable. Other things that one should probably capture:
Following on from #938, the minimum information one should probably gather is:
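To make the arithmetic above concrete, a sketch of the dense-matrix storage (the particular set of statistics is illustrative):

```python
import numpy as np

ngpus = 256
# Illustrative choice of summary statistics; fewer than 10 in practice.
measures = ["bandwidth_mean", "bandwidth_median", "latency_mean"]

# One dense (ngpus x ngpus) float32 matrix per measure; entry [i, j]
# summarises point-to-point messages from worker i to worker j.
p2p = {m: np.zeros((ngpus, ngpus), dtype=np.float32) for m in measures}

per_matrix = ngpus**2 * 4            # 262144 bytes, i.e. 256KB
total = per_matrix * len(measures)   # a few hundred KB, not 40MB
```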
The metrics and the means to achieve reasonable results sound sensible to me; I'm +1 on the whole idea. The only comment I have is: I'm assuming you're already considering this, but just for the sake of completeness, perhaps we would like to make this easily extendable, so that we don't need to know all the metrics we may want in advance, and can add them as we encounter the need for them.
Simplifies new benchmark creation and makes sure that data aggregation is scalable for all benchmarks to large worker counts. The idea is that a benchmark now needs to define three functions (sketched below):
- bench_once (running the benchmark a single time)
- pretty_print_results (to produce human-readable output)
- create_tidy_results (to produce computer-readable output)
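A sketch of a benchmark module under this scheme (the exact signatures in dask-cuda may differ; the argument names and the placeholder workload are assumptions):

```python
import time

import pandas as pd


def bench_once(client, args):
    # Run the benchmark a single time; a placeholder workload here.
    start = time.monotonic()
    client.submit(sum, range(10_000)).result()
    return {"duration": time.monotonic() - start}


def pretty_print_results(args, results):
    # Human-readable output on stdout.
    for r in results:
        print(f"duration: {r['duration']:.3f}s")


def create_tidy_results(args, results):
    # Computer-readable output: one row per run, with the experimental
    # conditions repeated on every row.
    return pd.DataFrame(
        dict(r, n_workers=args.n_workers, protocol=args.protocol)
        for r in results
    )
```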
This is now hopefully easier; one must adapt the
OAOO (once and only once) for forkserver setup
I think this is ready for another look.
Overall this all looks really well-organized and more complete, certainly a big leap from the old implementation, thanks @wence-! I left a few suggestions.
Rather than randomly sorting the workers on a host (by port number), sort by host and then by device id (if it is set).
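A sketch of such a sort key (the worker-info fields here are assumptions about what the scheduler reports, not the PR's code):

```python
# Hypothetical worker records, as the scheduler might report them.
worker_infos = [
    {"address": "tcp://node1:43210", "gpu_device_id": 1},
    {"address": "tcp://node1:43211", "gpu_device_id": 0},
    {"address": "tcp://node0:40000", "gpu_device_id": None},
]


def worker_sort_key(info):
    # Sort by host first, then by GPU device id when one is set, so the
    # ordering is deterministic rather than depending on ephemeral ports.
    host, _, port = info["address"].rpartition(":")
    device = info.get("gpu_device_id")
    return (host, device if device is not None else -1, int(port))


workers = sorted(worker_infos, key=worker_sort_key)
```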
Looks very nice now, just one minor missing change required and then we're done.
LGTM now, but I think we'll need the fixes for cuda-python<11.7.1 to be merged before we can merge this. Thanks @wence- for the nice work in making things much more professional-looking!
Once more, thanks @wence-!
@gpucibot merge
This is a move towards using the benchmarks for regular profiling on more than one node.
That requires two substantive changes:
- the ability to connect to an existing cluster via --scheduler-file (alternatively, dask-mpi could be used I think, but I haven't done so)
- tidy data output (one row per experimental condition) for later visualisation/analysis

I've refactored the benchmarks into common infrastructure, which simplifies new benchmark creation.
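For illustration, the tidy ("one row per experimental condition") layout looks roughly like this (column names and values are made up for the example):

```python
import pandas as pd

# Every row is one observation; the experimental conditions (worker
# count, protocol) are ordinary columns, so the frame can be grouped
# and filtered directly for visualisation/analysis.
df = pd.DataFrame(
    {
        "n_workers": [8, 8, 16, 16],
        "protocol": ["tcp", "ucx", "tcp", "ucx"],
        "duration_s": [12.1, 7.3, 6.8, 4.0],
    }
)
print(df.groupby("protocol")["duration_s"].mean())
```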