diff --git a/docs/source/index.rst b/docs/source/index.rst
index dcf2ff30e9c5..86ad55d1709b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -9,7 +9,7 @@ NVIDIA NeMo User Guide
    starthere/intro
    starthere/tutorials
    starthere/best-practices
-
+   starthere/migration-guide
 
 .. toctree::
    :maxdepth: 2
diff --git a/docs/source/starthere/migration-guide.rst b/docs/source/starthere/migration-guide.rst
new file mode 100644
index 000000000000..7869fd30524f
--- /dev/null
+++ b/docs/source/starthere/migration-guide.rst
@@ -0,0 +1,70 @@
+Migration Guide to Lightning 2.0
+================================
+
+.. # define a hard line break for html
+.. |br| raw:: html
+
+    <br/>
+
+.. _dummy_header:
+
+* Replace ``trainer.strategy=null`` with ``trainer.strategy=auto``, since
+  `lightning 2.0 doesn't have None strategy `_.
+..
+* If ``resume_from_checkpoint`` is used as a trainer flag, remove it and pass the checkpoint path to the
+  `Trainer.fit(ckpt_path="...") method `_ instead.
+..
+* Set ``trainer.strategy = "ddp_find_unused_parameters_true"`` if there are unused parameters in your model, since lightning 2.0 sets ``find_unused_parameters`` to ``False`` by default.
+  Reference: `NeMo PR 6433 `_.
+  More details about this change: `lightning PR 16611 `_.
+..
+* If you use the Trainer flag ``replace_sampler_ddp``, replace it with
+  `use_distributed_sampler `_.
+..
+* If you use ``CheckpointConnector``, replace it with `_CheckpointConnector `_.
+..
+* To set or get ``ckpt_path``, use ``trainer.ckpt_path`` directly instead of calling the protected API via ``trainer._checkpoint_connector._ckpt_path``
+  or using ``trainer._checkpoint_connector.resume_from_checkpoint_fit_path``.
+..
+* In ``from pytorch_lightning.utilities.cloud_io import load``, change ``load`` to ``_load``.
+..
+* If you use ``from pytorch_lightning.plugins.precision.native_amp import NativeMixedPrecisionPlugin``, replace it with
+  `from pytorch_lightning.plugins.precision import MixedPrecisionPlugin `_.
+..
+* Lightning 2.0 adds ``'16-mixed'`` and ``'bf16-mixed'`` as the precision values for fp16 mixed precision and bf16 mixed precision, respectively.
+  For backward compatibility, ``16`` or ``'16'`` and ``'bf16'`` also enable mixed precision and are equivalent to ``'16-mixed'`` and ``'bf16-mixed'``,
+  respectively. However, lightning recommends using ``'16-mixed'`` and ``'bf16-mixed'`` to make it less ambiguous. Because of this, ``MegatronHalfPrecisionPlugin``'s
+  parent class from lightning, the ``MixedPrecisionPlugin`` class, expects the precision arg to be ``'16-mixed'`` or ``'bf16-mixed'``. As a result, it's required to
+  pass ``'16-mixed'`` or ``'bf16-mixed'`` to ``MixedPrecisionPlugin`` whenever the precision passed is any of ``[16, '16', '16-mixed']`` or ``['bf16', 'bf16-mixed']``.
+  This is handled as shown here: `NeMo upgrade to lightning 2.0 PR `_
+  and here: `MixedPrecisionPlugin `_. Also, ``'32-true'``
+  is added as a precision value for pure fp32, alongside the existing ``32`` and ``'32'``. This can be taken into account as shown in the `NeMo upgrade to lightning 2.0 PR `_.
+..
+* Lightning 2.0 renames the epoch end hooks ``training_epoch_end``, ``validation_epoch_end``, and ``test_epoch_end`` to ``on_train_epoch_end``,
+  ``on_validation_epoch_end``, and ``on_test_epoch_end``. The renamed hooks do not accept an ``outputs`` argument; instead, the outputs need to be stored in an
+  instance variable of the model class, to which each step's output is manually appended. More detailed examples implementing
+  this can be found in the migration guide of `lightning's PR 16520 `_. An example from NeMo
+  can be found `here `_.
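+
+  As an illustration of the renamed hooks, here is a minimal, self-contained sketch. The module,
+  layer sizes, and attribute names below are only for illustration and are not taken from NeMo:
+
+  .. code-block:: python
+
+      import torch
+      import pytorch_lightning as pl
+
+
+      class ToyModel(pl.LightningModule):
+          def __init__(self):
+              super().__init__()
+              self.layer = torch.nn.Linear(4, 1)
+              # Lightning 2.0: step outputs have to be collected manually on the module.
+              self.validation_step_outputs = []
+
+          def _loss(self, batch):
+              pred = self.layer(batch)
+              return torch.nn.functional.mse_loss(pred, torch.zeros_like(pred))
+
+          def training_step(self, batch, batch_idx):
+              return self._loss(batch)
+
+          def validation_step(self, batch, batch_idx):
+              loss = self._loss(batch)
+              self.validation_step_outputs.append(loss)
+              return loss
+
+          # Renamed from "validation_epoch_end(self, outputs)"; the hook no longer receives "outputs".
+          def on_validation_epoch_end(self):
+              avg_loss = torch.stack(self.validation_step_outputs).mean()
+              self.log("val_loss", avg_loss)
+              self.validation_step_outputs.clear()  # free memory for the next epoch
+
+          def configure_optimizers(self):
+              return torch.optim.Adam(self.parameters(), lr=1e-3)
+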
+..
+* Lightning 2.0 does not currently support multiple dataloaders for validation and testing when ``dataloader_iter`` is used. Support for this will be added back in an
+  upcoming release. If ``dataloader_iter`` is being used and your config passes multiple files to ``validation_ds.file_names`` or ``test_ds.file_names``, please use just one file
+  until this issue is fixed in pytorch lightning.
+..
+* With lightning 2.0, ``limit_val_batches`` and ``num_sanity_val_steps`` must be a multiple of the number of microbatches when
+  using ``dataloader_iter`` (this applies only to Megatron files that use ``dataloader_iter``), for all pretraining files (not downstream tasks like finetuning).
+  This is handled internally in NeMo and does not require anything to be done by the user. However, if you are a NeMo developer and are
+  building a new model for pretraining that uses ``dataloader_iter`` instead of ``batch`` in its ``validation_step`` method, please make sure to call
+  ``self._reconfigure_val_batches()`` in the ``build_train_valid_test_datasets`` method of your model.
+..
+* If the model is wrapped with ``LightningDistributedModule`` in the ``configure_ddp`` method, replace it with ``_LightningModuleWrapperBase``,
+  as done here: `NeMo upgrade to lightning 2.0 PR `_.
+..
+* If you use ``pre_configure_ddp()`` in your DDP, remove it, as it is no longer required. See the
+  `NeMo upgrade to lightning 2.0 PR `_.
+..
+* If any of your tests use the CPU as the device, make sure to pass it explicitly to the trainer, e.g.
+  ``trainer = pl.Trainer(max_epochs=1, accelerator='cpu')``, since the default value in PTL >= 2.0 is ``auto``, which picks CUDA when available
+  (see the consolidated sketch at the end of this guide).
+..
+* If you use ``from pytorch_lightning.loops import TrainingEpochLoop``, replace ``TrainingEpochLoop`` with ``_TrainingEpochLoop``.
+..
+* If you use ``trainer.fit_loop.max_steps``, replace it with ``trainer.fit_loop.epoch_loop.max_steps``.
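+
+The consolidated sketch below shows how several of the trainer-level changes above fit together. It is
+not taken from NeMo; ``model`` is assumed to be a ``LightningModule`` defined elsewhere, and the
+checkpoint path is a placeholder:
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+
+    trainer = pl.Trainer(
+        max_epochs=1,
+        accelerator="gpu",
+        devices=1,
+        strategy="auto",               # was strategy=null; use "ddp_find_unused_parameters_true" if the model has unused parameters
+        use_distributed_sampler=True,  # replaces the removed replace_sampler_ddp Trainer flag
+        precision="16-mixed",          # preferred over the still-accepted 16 / '16'
+    )
+
+    # resume_from_checkpoint is no longer a Trainer flag; pass the path to fit() instead.
+    trainer.fit(model, ckpt_path="/path/to/last.ckpt")
+
+    # Read the checkpoint path from the public attribute instead of the protected connector.
+    print(trainer.ckpt_path)
+
+    # CPU-only tests must request the CPU explicitly, since the 2.0 default accelerator is "auto".
+    cpu_trainer = pl.Trainer(max_epochs=1, accelerator="cpu")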