introduce gradient update handler to the base estimator #16900
Conversation
@leezu What is the rationale behind marking this PR for 1.6? This looks like it adds new functionality.
Thank you for the improvement! Two concerns.
Also, could you point to an example that requires a custom gradient handler (gradient clipping or aggregation)?
@ptrendx The estimator API is currently experimental and not yet up to more complex real-world use cases. @liuzh91 is currently converting GluonNLP training scripts to use the Estimator API and ran into several remaining shortcomings of the API. Do you think backporting the fix would be reasonable, or do you have concerns? When do you plan to tag a release candidate? This extends the fixes committed since November 2019: https://github.com/apache/incubator-mxnet/commits/master/python/mxnet/gluon/contrib/estimator.
Thank you for the review. For the gradient update example, one use case for gradient accumulation arises when training a Transformer (https://github.com/dmlc/gluon-nlp/blob/master/scripts/machine_translation/train_transformer.py#L320). Because the Transformer network has too many parameters to train with large batches, we compute gradients for a small batch of examples during each iteration and apply the accumulated gradients to the weight parameters only periodically.
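A minimal sketch of this pattern in plain Gluon, assuming a toy network and synthetic data (none of the names below come from the linked script): gradients are accumulated over several small batches via grad_req='add', and the optimizer step is applied only periodically.

```python
import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(2)
net.initialize()
# 'add' makes backward() accumulate into .grad instead of overwriting it.
net.collect_params().setattr('grad_req', 'add')

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})

# Synthetic data just to make the sketch runnable.
X = mx.nd.random.uniform(shape=(64, 8))
y = mx.nd.random.randint(0, 2, shape=(64,)).astype('float32')
data_loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=8)

accumulate = 4  # apply one update every 4 small batches
for i, (data, label) in enumerate(data_loader):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    if (i + 1) % accumulate == 0:
        # Step with the total number of examples seen since the last update.
        trainer.step(accumulate * data.shape[0])
        # With grad_req='add' the accumulated gradients must be reset manually.
        for p in net.collect_params().values():
            p.zero_grad()
```

Without a dedicated gradient update handler, the trainer.step() call above is hard-wired into the training loop; moving it into a handler is what lets users change when updates happen.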
Codecov Report
@@ Coverage Diff @@
## master #16900 +/- ##
==========================================
- Coverage 67.06% 66.95% -0.12%
==========================================
Files 271 264 -7
Lines 30173 29804 -369
Branches 4477 4400 -77
==========================================
- Hits 20237 19954 -283
+ Misses 8648 8563 -85
+ Partials 1288 1287 -1
Continue to review full report at Codecov.
- Does this require the user to specify grad_req = 'add' for gradient accumulation? When is the gradient reset to zeros?
- GradientUpdateHandler is now added to the list of default handlers. If users explicitly create an estimator with event_handlers=[metric, logging], will GradientUpdateHandler still be included in the estimator and perform updates correctly? (See the sketch below for the scenario in question.)
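A minimal sketch of the scenario behind the second question, assuming the experimental Estimator API of this era; the handler names and argument names below are assumptions and may not match the exact signatures.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator
from mxnet.gluon.contrib.estimator.event_handler import MetricHandler, LoggingHandler

net = gluon.nn.Dense(2)
net.initialize()
loss = gluon.loss.SoftmaxCrossEntropyLoss()
est = Estimator(net=net, loss=loss, metrics=mx.metric.Accuracy())

# Toy data so the snippet is self-contained.
X = mx.nd.random.uniform(shape=(32, 8))
y = mx.nd.random.randint(0, 2, shape=(32,)).astype('float32')
train_loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=8)

# Only metric and logging handlers are passed explicitly. The question is
# whether the default GradientUpdateHandler is still attached so that
# trainer.step() keeps being called during fit().
metric = MetricHandler(train_metrics=est.train_metrics)
logging = LoggingHandler(train_metrics=est.train_metrics)
est.fit(train_data=train_loader, epochs=1, event_handlers=[metric, logging])
```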
force-pushed from 5878451 to e5acca5
force-pushed from e5acca5 to 058c3a4
LGTM, re-triggered CI
* introduce gradient update handler to the base estimator
* Modify the gradient update handler to include the batch size
* Remove unrelated gradient update handler.
* Modify gradient update handler to take the current batch size.
* Remove white space to avoid the sanity check failure
* add small tweak to the handler code
* Modify the documentation of priority parameter of relevant handlers.
* small modification on the documentation.
* Add small modification on the documentation.
* Remove unnecessary list check
* Fix ndarray indexing bug (#16895)
* Fix indexing bug
* More test cases
* Add test from 16647
* [Gluon] Update contrib.Estimator LoggingHandler to support logging per batch interval (#16922)
* Update LoggingHandler to support logging per interval
* Fix the constant variable issue in the logging handler
* Remove the constant variable hack in the logging handler.
* 1) replace LOG_PER_BATCH with LOG_PER_INTERVAL 2) add test case
* Improve the test script for LoggingHandler
* small fix on the test script
* logging handler test case bug fix
* remove parameter verbose from LoggingHandler
* move log_interval to the first argument
* resolve unittest mistakes
* Add micro averaging strategy to pearsonr metric (#16878). Strategy to be used for aggregating across mini-batches. "macro": average the pearsonr scores for each batch. "micro": compute a single pearsonr score across all batches.
* [Bugfix] [Numpy] Add `kAddTo` and kNullOp to Transpose (#16979)
* update Check for repeated axes enable addto to transpose fix fix fix fix remove unused ndim Update pseudo2DTranspose_op-inl.cuh Update pseudo2DTranspose_op-inl.cuh Update pseudo2DTranspose_op-inl.cuh fix Update pseudo2DTranspose_op-inl.cuh try to fix Update pseudo2DTranspose_op-inl.cuh Update pseudo2DTranspose_op-inl.cuh Update pseudo2DTranspose_op-inl.cuh fix Update np_matrix_op.cc Update test_numpy_op.py update test case fix implementation fix bug update fix bug Update pseudo2DTranspose_op-inl.cuh fix fix Update test_numpy_op.py
* Fix bug
* fix docstring
* try to address comment
* no need to change this line
* Fix bug
* address comments
* address comment
* introduce gradient update handler to the base estimator (#16900)
* introduce gradient update handler to the base estimator
* Modify the gradient update handler to include the batch size
* Remove unrelated gradient update handler.
* Modify gradient update handler to take the current batch size.
* Remove white space to avoid the sanity check failure
* add small tweak to the handler code
* Modify the documentation of priority parameter of relevant handlers.
* small modification on the documentation.
* Add small modification on the documentation.
* Remove unnecessary list check
Description
This change adds a default gradient update handler to the base estimator class. The purpose of this change is to decouple the forward/backward computation from the gradient update step. In some use cases, such as gradient accumulation, gradient updates are not executed on every training batch. See the issue description in #16869.
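A minimal sketch of what such a handler can look like, assuming the event-handler base classes in mxnet.gluon.contrib.estimator.event_handler and a fit loop that passes the current batch via keyword arguments; the actual GradientUpdateHandler added by this PR may differ in its details.

```python
from mxnet.gluon.contrib.estimator.event_handler import BatchEnd

class SketchGradientUpdateHandler(BatchEnd):
    """Applies trainer.step() at the end of every training batch.

    Pulling the optimizer step into its own handler decouples parameter
    updates from the forward/backward pass, so a subclass can, for example,
    step only every N batches for gradient accumulation.
    """
    def __init__(self, priority=-2000):
        # A low priority value is assumed here so this handler runs before
        # metric and logging handlers that read the updated state.
        self.priority = priority

    def batch_end(self, estimator, *args, **kwargs):
        # Assumes the training loop passes the current batch as a keyword
        # argument; its first element is the data array, whose leading
        # dimension gives the number of examples used to rescale gradients.
        batch = kwargs['batch']
        batch_size = batch[0].shape[0]
        estimator.trainer.step(batch_size)
```

A gradient-accumulation variant could override batch_end to call estimator.trainer.step only every N batches and reset the gradients afterwards, which is exactly the kind of customization the decoupling enables.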