min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged #3373
Conversation
LGTM, mostly minor style comments :)
Last wave of nits!
LGTM!
min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged (mosaicml#3373)
* implemented min_max flag
* fixed string parsing
* refactoring compute_system_metrics for all_reduce
* keep track of rank within dict
* added compute_min_max
* added flag for both min_max and all_logging
* corrected min_max call with model_device
* removing total bytes (always going to be constant)
* handled no gpu case in min_max flag
* removed unnecessary imports, patched unit tests
* fixed assert statement for with gpu case, world size 1
* cast min_rank and max_rank as int to guarantee them working as indices
* fixed indent issue from fixing font
* made docs more concise and readable
* fixing unexpected unindent
* fixing unit test device
* modifying device to equal model_device.type
* reverting to device=model_device
* setting device in unit test = 'gpu'
* setting device = 'cuda' in unit testing
* reverting to next(state.model.parameters()).device
* removed torch as a dependency for unit testing
* cleaned up UI to be consistent + removed calling next to obtain device
---------
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Charles Tang <[email protected]>
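The commit list above describes computing min/max values across ranks via all_reduce while keeping track of which rank produced each value. A minimal sketch of that pattern follows; it is illustrative, not the actual Composer implementation, and the helper name `compute_min_max` is borrowed from the commit messages.

```python
# Sketch of the min/max-across-ranks pattern described in the commits above.
# Assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist

def compute_min_max(value: float, device: torch.device) -> dict:
    """Return the min/max of `value` across ranks plus the ranks that produced them."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Gather every rank's value into a single tensor indexed by rank.
    all_values = torch.zeros(world_size, device=device)
    all_values[rank] = value
    dist.all_reduce(all_values, op=dist.ReduceOp.SUM)

    # Cast the ranks to int so they can be used as indices (per the commit note).
    min_rank = int(torch.argmin(all_values).item())
    max_rank = int(torch.argmax(all_values).item())
    return {
        'min': all_values[min_rank].item(),
        'min_rank': min_rank,
        'max': all_values[max_rank].item(),
        'max_rank': max_rank,
    }
```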
What does this PR do?
Added logging for GPU power and removed all GPU metrics already covered by other callbacks (used, free, and total memory).
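The power reading itself is presumably obtained through NVIDIA's management library; the snippet below is a minimal sketch using pynvml and is an assumption about the mechanism, not a quote of the callback's code.

```python
# Sketch: reading GPU power draw with pynvml (nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)                   # GPU 0 on this node
power_watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # API returns milliwatts
print(f'GPU 0 power draw: {power_watts:.1f} W')
pynvml.nvmlShutdown()
```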
The following GPU metrics pertaining to Straggler Detection are logged:
Added clearer documentation for the SystemMetricsMonitor class and removed the user-provided boolean parameter for whether the model uses GPUs (gpu_available).
Added a boolean flag for users to select whether to log only the min/max values for the GPU metrics listed above along with their corresponding ranks, or to log all values for all ranks (log_all_data, defaults to False).
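A hedged usage sketch of the flag described above; the callback and flag names come from this PR's description, while the model and dataloader are placeholders.

```python
# Attach the monitor as a Trainer callback.
from composer import Trainer
from composer.callbacks import SystemMetricsMonitor

trainer = Trainer(
    model=model,                        # your ComposerModel (placeholder)
    train_dataloader=train_dataloader,  # placeholder dataloader
    max_duration='1ep',
    callbacks=[
        # log_all_data=False (default): only min/max across ranks, plus the
        # ranks that produced them, are logged for each GPU metric.
        # log_all_data=True: per-rank values are logged for every metric.
        SystemMetricsMonitor(log_all_data=False),
    ],
)
trainer.fit()
```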
Before submitting
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)