Add LR Scheduler #4694
Conversation
rayankrish commented on Aug 3, 2020
- Note: should be merged after the train step PR (Add train_step #4677)
- Implements the LR scheduler (a conceptual sketch of a warmup schedule is shown below)
- Created a separate pull request with isolated commits for this task
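For readers unfamiliar with the feature, here is a minimal, self-contained sketch of what a linear-warmup learning-rate schedule computes per step. The class and method names are illustrative assumptions only, not the exact API this PR adds to ORTTrainer:

```python
# Hypothetical, simplified linear-warmup schedule; LinearWarmupLRScheduler and
# get_lr are illustrative names, not necessarily the API introduced by this PR.
class LinearWarmupLRScheduler:
    def __init__(self, total_steps, warmup=0.5, initial_lr=1.0):
        self.total_steps = total_steps                 # e.g. max_train_step in the test below
        self.warmup_steps = int(warmup * total_steps)  # fraction of steps spent warming up
        self.initial_lr = initial_lr

    def get_lr(self, step):
        # Ramp the LR up linearly during warmup, then decay it linearly to 0.
        if self.warmup_steps > 0 and step < self.warmup_steps:
            scale = (step + 1) / self.warmup_steps
        else:
            remaining = max(self.total_steps - self.warmup_steps, 1)
            scale = max(0.0, (self.total_steps - step) / remaining)
        return self.initial_lr * scale


scheduler = LinearWarmupLRScheduler(total_steps=10, warmup=0.5, initial_lr=1.0)
print([round(scheduler.get_lr(s), 2) for s in range(10)])
```

With total_steps=10 and warmup=0.5 this prints [0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 0.8, 0.6, 0.4, 0.2], i.e. the learning rate ramps up for the first half of training and decays afterwards.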
orttraining/orttraining/test/python/orttraining_test_orttrainer_frontend.py:

max_train_step = 1
warmup = 0.5
initial_lr = 1
optim_config = optim.SGDConfig() if not lr_scheduler else optim.SGDConfig(lr=initial_lr)
@thiagocrepaldi when lr=1 is set but no lr_scheduler is given, the train step crashes with an input error. This is a temporary fix, but it's unclear what the problem is.
Can you repro this in the current API/example?
Yes. I reproduced the error in two ways.
- Pass a learning rate of 1 at each train step
- Pass a learning rate scheduler to ORTTrainer that always returns 1
Error:
RuntimeError: Error in execution: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIlEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)
At orttraining/orttraining/python/training/optim/config.py, around line 47, there is this:
assert (isinstance(defaults['lr'], float) or isinstance(defaults['lr'], int)) and defaults['lr'] >= 0, "lr must be a positive number"
Now that we create an internal IODescription with torch.float32, it is clear that we should change it to
assert isinstance(defaults['lr'], float) and defaults['lr'] >= 0, "lr must be a positive number"
to prevent this issue.
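To make the dtype mismatch concrete, here is a small illustration in plain NumPy, outside the ORTTrainer code path (the array names are just for the example), of why an integer learning rate surfaces as an int64 feed where float32 is expected:

```python
import numpy as np

# When a Python int such as lr=1 is turned into a tensor feed without an explicit
# cast, it typically becomes int64 on 64-bit platforms (PrimitiveDataType<long> in
# the error above), while the learning-rate graph input is described as float32.
lr_from_int = np.array(1)                      # dtype: int64 on most 64-bit platforms
lr_from_float = np.array(1.0)                  # dtype: float64, still not float32
lr_expected = np.array(1.0, dtype=np.float32)  # dtype: float32, matches the input description

print(lr_from_int.dtype, lr_from_float.dtype, lr_expected.dtype)
```

Restricting lr to float, as the assert above suggests, keeps the int64 feed from ever being built.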
LGTM. Maybe @liqunfu has some feedback.
Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: Thiago Crepaldi <[email protected]>
* Add ORTTrainerOptions class for the new pytorch frontend (#4382): Add ORTTrainerOptions class and some placeholders
* Add _ORTTrainerModelDesc to perform validation for model description (#4416)
* Add Loss Scaler classes to the new frontend (#4306)
* Add TrainStepInfo used on the new frontend API (#4256)
* Add Optimizer classes to the new frontend (#4280)
* Add LRScheduler implementation (#4357)
* Add basic ORTTrainer API (#4435): This PR presents the public API for ORTTrainer for the short-term development. It also validates and saves input parameters, which will be used in the next stages, such as building the ONNX model, post-processing the model, and configuring the training session
* Add opset_version into ORTTrainerOptions and change type of ORTTrainer.loss_fn (#4592)
* Update ModelDescription and minor fix on ORTTrainer ctor (#4605): This PR keeps the public API intact but changes how the model description is stored on the backend. Currently, users create a dict with two lists of tuples: one list called 'inputs', where each tuple has the format tuple(name, shape), and another called 'outputs', where each tuple can be either tuple(name, shape) or tuple(name, shape, is_loss). With this PR, when this dict is passed in to ORTTrainer it is fully validated as usual, but the tuples are internally replaced by namedtuples, and all output tuples take the tuple(name, shape, is_loss) format instead of is_loss being optionally present. In addition to that normalization of the internal representation (which eases coding), two internal methods were created to replace a namedtuple(name, shape) with namedtuple(name, shape, dtype) or namedtuple(name, shape, is_loss, dtype), depending on whether the tuple is an input or an output. This is necessary because ORTTrainer finds out the data types of each input/output during model export to ONNX. Finally, a minor fix was done on ORTTrainer: it could initialize ORTTrainerOptions incorrectly when options=None
* Rename input name for test
* Add ONNX Model Export to New Frontend (#4612)
* Create training session + minor improvements (#4668)
* Save ONNX model in file (#4671)
* Add eval step (#4674)
* Add train_step (#4677)
* Add LR Scheduler (#4694)
* Add deterministic compute tests (#4716)
* Add legacy vs experimental ORTTrainer accuracy comparison (#4727)
* Add Mixed precision/LossScaler + several fixes (#4739): In addition to the mixed precision/loss scaler code, this PR includes:
  * Fix CUDA training
  * Add optimization_step into TrainStepInfo class
  * Refactor LRScheduler to use optimization_step instead of step
  * Update several default values at ORTTrainerOptions
  * Add initial Gradient Accumulation support (untested)
  * Fix ONNX model post processing
  * Refactor unit tests
* Add ONNX BERT example + minor fixes (#4757): Fix training issue when passing ONNX file into ORTTrainer
* Add Dynamic Shape support (#4758)
* Update DeepSpeed Zero Stage option to a separate option group (#4772)
* Add support to fetches (#4777)
* Add Gradient Accumulation Steps support (#4793)
* Fix Dynamic Axes feature and add unit test (#4795)
* Add frozen weights test (#4807)
* Move new pytorch front-end to 'experimental' namespace (#4814)
* Fix build

Co-authored-by: Rayan-Krishnan <[email protected]>
Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <[email protected]>