MXNet already has experimental AMP (Automatic Mixed Precision) support, exposed in the mxnet.contrib package. It automatically casts models to either float16 or bfloat16. This RFC covers moving it into core and making it a first-class feature, as well as further development.
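For reference, a minimal sketch of the current contrib AMP workflow with a Gluon model (float16 shown; passing target_dtype='bfloat16' selects BF16). The model, shapes, and hyperparameters are placeholders, not part of the RFC:

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

amp.init()  # patches operators, so it has to run before the model is built

ctx = mx.gpu(0)
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
amp.init_trainer(trainer)  # enables dynamic loss scaling for this trainer

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.ones((4, 8), ctx=ctx)
label = mx.nd.zeros((4,), ctx=ctx)

with autograd.record():
    loss = loss_fn(net(data), label)
    # the loss scale is applied inside this context manager
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(data.shape[0])
```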
Here's a rough task breakdown for the initial move:
Ensure AMP works with numpy ops, i.e., every op appears in one of the casting lists - done in AMP support for Numpy ops #19036 (a short sketch of these lists follows after this list)
API change: make the loss scale public (Make loss scale public in AMP #17507) - done in AMP support for Numpy ops #19036
Transparent / lazy AMP initialization? (Got "kFlag == type_flag_: TBlob.get_with_shape: data type do not match specified type. Expected: 0 v.s. given 2" when training with amp. #18902 (comment)) - a warning is now emitted when amp.init() is called and a model already exists, added in AMP support for Numpy ops #19036
Cannot load trainer with AMP (Cannot load trainer with AMP #16858) - fixed in Get rid of monkey patching in LossScaler overflow handling #18959
CUDA crash (IMA, illegal memory access) in amp_multicast on some models (e.g. Yolo3) - fixed in Fix possible IMA in amp_multicast fusion #19318
Actually moving the code around and updating the import paths
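For context on the casting lists mentioned above: the contrib API already lets a user pin individual operators to a precision when calling amp.init(). The keyword names below follow my reading of the contrib interface and are meant as an illustration, not a spec:

```python
from mxnet.contrib import amp

# Every operator has to land in one of AMP's casting lists (run in the target
# dtype, force float32, cast to the widest type, ...). The init() overrides
# below pin individual ops explicitly; keyword names are assumed from the
# contrib API and may differ in detail.
amp.init(
    target_dtype='float16',
    target_precision_ops=['FullyConnected'],  # additionally run these in float16
    fp32_ops=['softmax'],                     # keep these in float32
)
```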
Post move:
Layout optimization - upstream the feature that already exists in the NVIDIA NGC container; it improves convolution performance by automatically converting between the NCHW and NHWC layouts (see the layout sketch after this list).
Explore alternatives to monkey-patching the front-end ops (AMP for mx2 #18697)
Add a way for the user to turn AMP off, and to control AMP settings via a context manager (a possible shape of such an API is sketched at the end).
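To make the layout item concrete, here is a small sketch of the difference the optimization would automate. Shapes are arbitrary, and the NHWC path assumes a CUDA GPU with cuDNN, where MXNet's NHWC convolution kernels are available:

```python
import mxnet as mx
from mxnet.gluon import nn

ctx = mx.gpu(0)

# The same convolution in both layouts. Tensor Core kernels prefer NHWC with
# float16 inputs; the proposed pass would insert the layout conversions
# automatically instead of asking the user to restructure the model.
conv_nchw = nn.Conv2D(channels=64, kernel_size=3, layout='NCHW')
conv_nhwc = nn.Conv2D(channels=64, kernel_size=3, layout='NHWC')
for conv in (conv_nchw, conv_nhwc):
    conv.cast('float16')
    conv.initialize(ctx=ctx)

x = mx.nd.ones((8, 3, 224, 224), dtype='float16', ctx=ctx)
y_nchw = conv_nchw(x)
y_nhwc = conv_nhwc(mx.nd.transpose(x, axes=(0, 2, 3, 1)))  # NCHW -> NHWC
```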
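And a purely hypothetical illustration of the context-manager control proposed in the last item; none of these names exist in MXNet today, the snippet only shows the kind of scoped on/off switch the RFC asks for:

```python
import contextlib

_AMP_ENABLED = True  # stand-in for whatever global state AMP would keep

@contextlib.contextmanager
def amp_enabled(enabled):
    """Hypothetical scoped switch: enable/disable AMP casting within a block."""
    global _AMP_ENABLED
    previous, _AMP_ENABLED = _AMP_ENABLED, enabled
    try:
        yield
    finally:
        _AMP_ENABLED = previous

# e.g. compute a float32 reference output with AMP temporarily turned off
with amp_enabled(False):
    assert _AMP_ENABLED is False
assert _AMP_ENABLED is True
```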