This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Retiarii] Rewrite trainer with PyTorch Lightning #3359

Merged (32 commits, Feb 14, 2021)

Commits
d30dbd3  Add playground (ultmaster, Feb 2, 2021)
a35448d  Record init parameters in blackbox (ultmaster, Feb 2, 2021)
1e3bd4b  Merge branch 'master' of https://github.com/microsoft/nni into retiar… (ultmaster, Feb 2, 2021)
74a691a  Serialization and tests (ultmaster, Feb 2, 2021)
5bb0134  Cleanup (ultmaster, Feb 2, 2021)
5f5f4c7  Add functional training (ultmaster, Feb 3, 2021)
cd849b2  Move experiment (ultmaster, Feb 3, 2021)
b15c0dd  Merge branch 'master' of https://github.com/microsoft/nni into retiar… (ultmaster, Feb 3, 2021)
6258bd0  Add basic lightning components (ultmaster, Feb 3, 2021)
4895abd  Support in execution engine (ultmaster, Feb 3, 2021)
b6978aa  Update test (ultmaster, Feb 3, 2021)
d44f038  Add classification and regression (ultmaster, Feb 4, 2021)
374584d  Finish end-to-end testing (ultmaster, Feb 4, 2021)
36b9352  Add functional training tests (ultmaster, Feb 4, 2021)
d591f44  Add API reference (ultmaster, Feb 4, 2021)
8d1c85e  Add documentation and fix lint (ultmaster, Feb 4, 2021)
d8bebe0  Fix documentation and dependencies (ultmaster, Feb 4, 2021)
fc07e48  Fix pipeline dependency (ultmaster, Feb 4, 2021)
271f57b  Uncomment test cases in lightning trainer (ultmaster, Feb 4, 2021)
4d24da3  Minor fixes (ultmaster, Feb 4, 2021)
5d8eec0  Fix unittests (ultmaster, Feb 4, 2021)
304ef02  Fix test_mutator (ultmaster, Feb 4, 2021)
d9532e2  Fix test interference (ultmaster, Feb 4, 2021)
357fec7  Fix lint (ultmaster, Feb 4, 2021)
f91af19  Update trainer interface (ultmaster, Feb 9, 2021)
87fddd3  Update documentation (ultmaster, Feb 9, 2021)
a1c8bbb  Rename lightning -> trainer (ultmaster, Feb 9, 2021)
17c2a9d  Merge branch 'master' into retiarii/trainer (ultmaster, Feb 12, 2021)
cad574b  Add Pytorch-lightning as legacy dependency (ultmaster, Feb 14, 2021)
deba408  Update Pytorch-lightning version (ultmaster, Feb 14, 2021)
ca1bdc0  Fix pytorch-lightning version on legacy env (ultmaster, Feb 14, 2021)
a9549cf  Add reason for skipif (ultmaster, Feb 14, 2021)
4 changes: 1 addition & 3 deletions dependencies/recommended.txt
@@ -2,13 +2,11 @@

-f https://download.pytorch.org/whl/torch_stable.html
tensorflow

# PyTorch 1.7 has compatibility issue with model compression.
# Check for MacOS because this file is used on all platforms.
torch == 1.6.0+cpu ; sys_platform != "darwin"
torch == 1.6.0 ; sys_platform == "darwin"
torchvision == 0.7.0+cpu ; sys_platform != "darwin"
torchvision == 0.7.0 ; sys_platform == "darwin"
pytorch-lightning
onnx
peewee
graphviz
5 changes: 5 additions & 0 deletions dependencies/recommended_legacy.txt
@@ -2,6 +2,11 @@
tensorflow == 1.15.4
torch == 1.5.1+cpu
torchvision == 0.6.1+cpu

# This will install pytorch-lightning 0.8.x, and unit tests won't work.
# The latest version conflicts with tensorboard and tensorflow 1.x.
pytorch-lightning

keras == 2.1.6
onnx
peewee
14 changes: 10 additions & 4 deletions docs/en_US/NAS/retiarii/ApiReference.rst
@@ -42,10 +42,16 @@ Graph Mutation APIs
Trainers
--------

.. autoclass:: nni.retiarii.trainer.pytorch.PyTorchImageClassificationTrainer
.. autoclass:: nni.retiarii.trainer.FunctionalTrainer
:members:

.. autoclass:: nni.retiarii.trainer.pytorch.PyTorchMultiModelTrainer
.. autoclass:: nni.retiarii.trainer.pytorch.lightning.LightningModule
:members:

.. autoclass:: nni.retiarii.trainer.pytorch.lightning.Classification
:members:

.. autoclass:: nni.retiarii.trainer.pytorch.lightning.Regression
:members:

Oneshot Trainers
@@ -75,8 +81,8 @@ Strategies
Retiarii Experiments
--------------------

.. autoclass:: nni.retiarii.experiment.RetiariiExperiment
.. autoclass:: nni.retiarii.experiment.pytorch.RetiariiExperiment
:members:

.. autoclass:: nni.retiarii.experiment.RetiariiExeConfig
.. autoclass:: nni.retiarii.experiment.pytorch.RetiariiExeConfig
:members:
27 changes: 16 additions & 11 deletions docs/en_US/NAS/retiarii/Tutorial.rst
@@ -149,7 +149,7 @@ Create a Trainer and Exploration Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Classic search approach:**
In this approach, trainer is for training each explored model, while strategy is for sampling the models. Both trainer and strategy are required to explore the model space.
In this approach, the trainer trains each explored model, while the strategy samples the models. Both a trainer and a strategy are required to explore the model space. We recommend using PyTorch-Lightning to write the full training process.

**Oneshot (weight-sharing) search approach:**
In this approach, users only need a oneshot trainer, because this trainer takes charge of both search and training.
@@ -163,10 +163,10 @@ In the following table, we listed the available trainers and strategies.
* - Trainer
- Strategy
- Oneshot Trainer
* - PyTorchImageClassificationTrainer
* - Classification
- TPEStrategy
- DartsTrainer
* - PyTorchMultiModelTrainer
* - Regression
- RandomStrategy
- EnasTrainer
* -
@@ -182,15 +182,20 @@ Here is a simple example of using trainer and strategy.

.. code-block:: python

trainer = PyTorchImageClassificationTrainer(base_model,
dataset_cls="MNIST",
dataset_kwargs={"root": "data/mnist", "download": True},
dataloader_kwargs={"batch_size": 32},
optimizer_kwargs={"lr": 1e-3},
trainer_kwargs={"max_epochs": 1})
simple_startegy = RandomStrategy()
import nni.retiarii.trainer.pytorch.lightning as pl
from nni.retiarii import blackbox
from torchvision import transforms

Users can refer to `this document <./WriteTrainer.rst>`__ for how to write a new trainer, and refer to `this document <./WriteStrategy.rst>`__ for how to write a new strategy.
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = blackbox(MNIST, root='data/mnist', train=True, download=True, transform=transform)
[Review thread on the ``blackbox(MNIST, ...)`` line]
Contributor: blackbox is strange here, because I don't understand why it creates train_dataset.
Contributor: o... MNIST is a class name?
Contributor (author): Yes. This is limited by serialization.
Contributor: Is it possible we also wrap the dataset class by default? When users want to define their own dataset class, they decorate this class with, for example, @register_dataset.
Contributor: Another option, for this PR, is renaming blackbox to make_serializable.

test_dataset = blackbox(MNIST, root='data/mnist', train=False, download=True, transform=transform)
lightning = pl.Classification(train_dataloader=pl.DataLoader(train_dataset, batch_size=100),
val_dataloaders=pl.DataLoader(test_dataset, batch_size=100),
max_epochs=10)

.. Note:: For NNI to capture the dataset and dataloader and distribute them across different runs, please wrap your dataset with ``blackbox`` and use ``pl.DataLoader`` instead of ``torch.utils.data.DataLoader``. See the ``blackbox_module`` section below for details.

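The trainer above only describes how a single model is evaluated; to run a search it is combined with a strategy and an experiment. The following is a minimal sketch under the assumption that ``base_model`` and ``mutators`` are the model space defined earlier in this tutorial; the strategy import path and the experiment configuration fields are illustrative and may differ slightly across NNI versions.

.. code-block:: python

   from nni.retiarii.experiment.pytorch import RetiariiExperiment, RetiariiExeConfig
   from nni.retiarii.strategy import RandomStrategy  # assumed import path

   simple_strategy = RandomStrategy()

   # `lightning` is the pl.Classification object created above.
   exp = RetiariiExperiment(base_model, lightning, mutators, simple_strategy)

   exp_config = RetiariiExeConfig('local')  # assumed local training service
   exp_config.experiment_name = 'mnist_search'
   exp_config.trial_concurrency = 2
   exp_config.max_trial_number = 10

   exp.run(exp_config, 8081)  # the port number is arbitrary
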
Users can refer to the `API reference <./ApiReference.rst>`__ for detailed usage of trainers, to `write a trainer <./WriteTrainer.rst>`__ for how to write a new trainer, and to `this document <./WriteStrategy.rst>`__ for how to write a new strategy.

Set up an Experiment
^^^^^^^^^^^^^^^^^^^^
117 changes: 86 additions & 31 deletions docs/en_US/NAS/retiarii/WriteTrainer.rst
@@ -3,59 +3,114 @@ Customize A New Trainer

Trainers are necessary to evaluate the performance of new explored models. In NAS scenario, this further divides into two use cases:

1. **Classic trainers**: trainers that are used to train and evaluate one single model.
1. **Single-arch trainers**: trainers that are used to train and evaluate one single model.
2. **One-shot trainers**: trainers that handle training and searching simultaneously, from an end-to-end perspective.

Classic trainers
----------------
Single-arch trainers
--------------------

All classic trainers need to inherit ``nni.retiarii.trainer.BaseTrainer``, implement the ``fit`` method, and be decorated with ``@register_trainer`` if they are intended to be used together with Retiarii. The decorator serializes the trainer and its arguments to fit the requirements of NNI.
With PyTorch-Lightning
^^^^^^^^^^^^^^^^^^^^^^

The init function of trainer should take model as its first argument, and the rest of the arguments should be named (``*args`` and ``**kwargs`` may not work as expected) and JSON serializable. This means, currently, passing a complex object like ``torchvision.datasets.ImageNet()`` is not supported. Trainer should use NNI standard API to communicate with tuning algorithms. This includes ``nni.report_intermediate_result`` for periodical metrics and ``nni.report_final_result`` for final metrics.
It's recommended to write training code in PyTorch-Lightning style, that is, to write a LightningModule that defines all elements needed for training (e.g., loss function, optimizer) and to define a trainer that takes (optional) dataloaders to execute the training. Before that, please read the `PyTorch-Lightning documentation <https://pytorch-lightning.readthedocs.io/>`__ to learn the basic concepts and components it provides.

In practice, a new training module in NNI should inherit ``nni.retiarii.trainer.pytorch.lightning.LightningModule``, which has a ``set_model`` method that will be called after ``__init__`` to save the candidate model (generated by the strategy) as ``self.model``. The rest of the process (like ``training_step``) should be the same as writing any other lightning module. Trainers should also communicate with strategies via two API calls (``nni.report_intermediate_result`` for periodical metrics and ``nni.report_final_result`` for final metrics), added in ``on_validation_epoch_end`` and ``teardown`` respectively.
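
For intuition, the following is a highly simplified sketch (not the actual NNI source) of what this base class adds on top of the standard ``pytorch_lightning.LightningModule``:

.. code-block:: python

   import pytorch_lightning

   class LightningModule(pytorch_lightning.LightningModule):
       # Simplified sketch of nni.retiarii.trainer.pytorch.lightning.LightningModule.

       def set_model(self, model):
           # Called by the framework after __init__ with the candidate model
           # generated by the strategy, so training_step etc. can use self.model.
           self.model = model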

An example is as follows:

.. code-block:: python

from nni.retiarii import register_trainer
from nni.retiarii.trainer import BaseTrainer
from nni.retiarii.trainer.pytorch.lightning import LightningModule # please import this one

@register_trainer
class MnistTrainer(BaseTrainer):
def __init__(self, model, optimizer_class_name='SGD', learning_rate=0.1):
@blackbox_module
[Review thread]
Contributor: blackbox_module used on the trainer is not that clear; let's discuss the name then.

class AutoEncoder(LightningModule):
[Review thread]
Contributor: Since we do not define the model in LightningModule, if we still use the name LightningModule, it may be misleading. We could use a more understandable name, for example, BaseTrainer, TrainingModule, etc.

def __init__(self):
[Review thread]
Contributor: Where is the model configured?
Contributor (author): set_model.

super().__init__()
self.model = model
self.criterion = nn.CrossEntropyLoss()
self.train_dataset = MNIST(train=True)
self.valid_dataset = MNIST(train=False)
self.optimizer = getattr(torch.optim, optimizer_class_name)(lr=learning_rate)

def validate():
pass

def fit(self) -> None:
for i in range(10): # number of epochs:
for x, y in DataLoader(self.dataset):
self.optimizer.zero_grad()
pred = self.model(x)
loss = self.criterion(pred, y)
loss.backward()
self.optimizer.step()
acc = self.validate() # get validation accuracy
nni.report_final_result(acc)
self.decoder = nn.Sequential(
nn.Linear(3, 64),
nn.ReLU(),
nn.Linear(64, 28*28)
)

def forward(self, x):
embedding = self.model(x) # let's search for encoder
return embedding

def training_step(self, batch, batch_idx):
# training_step defined the train loop.
# It is independent of forward
x, y = batch
x = x.view(x.size(0), -1)
z = self.model(x) # model is the one that is searched for
x_hat = self.decoder(z)
loss = F.mse_loss(x_hat, x)
# Logging to TensorBoard by default
self.log('train_loss', loss)
return loss

def validation_step(self, batch, batch_idx):
x, y = batch
x = x.view(x.size(0), -1)
z = self.model(x)
x_hat = self.decoder(z)
loss = F.mse_loss(x_hat, x)
self.log('val_loss', loss)

def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
return optimizer

def on_validation_epoch_end(self):
nni.report_intermediate_result(self.trainer.callback_metrics['val_loss'].item())

def teardown(self, stage):
if stage == 'fit':
nni.report_final_result(self.trainer.callback_metrics['val_loss'].item())

Then, users need to wrap everything (including LightningModule, trainer and dataloaders) into a ``Lightning`` object, and pass this object into a Retiarii experiment.

.. code-block:: python

import nni.retiarii.trainer.pytorch.lightning as pl
from nni.retiarii.experiment.pytorch import RetiariiExperiment

lightning = pl.Lightning(AutoEncoder(),
pl.Trainer(max_epochs=10),
train_dataloader=pl.DataLoader(train_dataset, batch_size=100),
val_dataloaders=pl.DataLoader(test_dataset, batch_size=100))
experiment = RetiariiExperiment(base_model, lightning, mutators, strategy)

With FunctionalTrainer
^^^^^^^^^^^^^^^^^^^^^^

There is another way to customize a new trainer with functional APIs, which provides more flexibility. Users only need to write a fit function that wraps everything. This function takes one positional argument (model) and possible keyword arguments. In this way, users get everything under their control, but expose less information to the framework and thus fewer opportunities for possible optimization. An example is as follows:
[Review thread]
Contributor: We should consistently use the word "trainer".

.. code-block:: python

from nni.retiarii.trainer import FunctionalTrainer
from nni.retiarii.experiment.pytorch import RetiariiExperiment

def fit(model, dataloader):
train(model, dataloader)
acc = test(model, dataloader)
nni.report_final_result(acc)

trainer = FunctionalTrainer(fit, dataloader=DataLoader(foo, bar))
experiment = RetiariiExperiment(base_model, trainer, mutators, strategy)


One-shot trainers
-----------------

One-shot trainers should inherit ``nni.retiarii.trainer.BaseOneShotTrainer``, which is basically the same as ``BaseTrainer``, but with one extra method, ``export()``, which is expected to return the best searched architecture.
One-shot trainers should inherit ``nni.retiarii.trainer.BaseOneShotTrainer``, and need to implement ``fit()`` (used to conduct the fitting and searching process) and the ``export()`` method (used to return the best searched architecture).

Writing a one-shot trainer is very different from writing classic trainers. First of all, there are no more restrictions on init method arguments; any Python arguments are acceptable. Secondly, the model fed into one-shot trainers might be a model with Retiarii-specific modules, such as LayerChoice and InputChoice. Such a model cannot directly forward-propagate, and trainers need to decide how to handle those modules.

A typical example is DartsTrainer, where learnable parameters are used to combine multiple choices in LayerChoice. Retiarii provides easy-to-use utility functions for module replacement, namely ``replace_layer_choice`` and ``replace_input_choice``. A simplified example is as follows:

.. code-block:: python

from nni.retiarii.trainer import BaseOneShotTrainer
from nni.retiarii.trainer.pytorch import BaseOneShotTrainer
from nni.retiarii.trainer.pytorch.utils import replace_layer_choice, replace_input_choice


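Since the rest of that example is collapsed in this diff view, here is a hedged sketch of what such a one-shot trainer could look like. It assumes that ``replace_layer_choice(model, init_fn, modules_list)`` replaces every ``LayerChoice`` in ``model`` with ``init_fn(layer_choice)`` and appends ``(name, new_module)`` pairs to ``modules_list``; the exact signature and the class names used below (``WeightedLayerChoice``, ``MyOneShotTrainer``) are illustrative, not part of the NNI API.

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   from nni.retiarii.trainer.pytorch import BaseOneShotTrainer
   from nni.retiarii.trainer.pytorch.utils import replace_layer_choice


   class WeightedLayerChoice(nn.Module):
       # Replaces a LayerChoice: runs all candidate ops and mixes them with
       # learnable weights (architecture parameters), DARTS-style.
       def __init__(self, layer_choice):
           super().__init__()
           self.op_choices = nn.ModuleDict(layer_choice.named_children())
           self.alpha = nn.Parameter(torch.randn(len(self.op_choices)) * 1e-3)

       def forward(self, *args, **kwargs):
           outputs = torch.stack([op(*args, **kwargs) for op in self.op_choices.values()])
           weights = F.softmax(self.alpha, -1).view([-1] + [1] * (outputs.dim() - 1))
           return (outputs * weights).sum(0)


   class MyOneShotTrainer(BaseOneShotTrainer):
       def __init__(self, model, dataloader, num_epochs=10):
           self.model = model
           self.dataloader = dataloader
           self.num_epochs = num_epochs
           self.nas_modules = []
           # Swap every LayerChoice in the supermodel for the mixed op above.
           replace_layer_choice(self.model, WeightedLayerChoice, self.nas_modules)
           self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.01)

       def fit(self):
           # Jointly trains weights and architecture parameters; a real trainer
           # (e.g., DartsTrainer) alternates the two on separate data splits.
           for _ in range(self.num_epochs):
               for x, y in self.dataloader:
                   self.optimizer.zero_grad()
                   loss = F.cross_entropy(self.model(x), y)
                   loss.backward()
                   self.optimizer.step()

       def export(self):
           # Return the highest-weighted candidate for each replaced LayerChoice.
           result = {}
           for name, module in self.nas_modules:
               chosen = module.alpha.argmax().item()
               result[name] = list(module.op_choices.keys())[chosen]
           return result
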
2 changes: 1 addition & 1 deletion nni/retiarii/__init__.py
@@ -2,4 +2,4 @@
from .graph import *
from .execution import *
from .mutator import *
from .utils import blackbox, blackbox_module, register_trainer
from .utils import blackbox, blackbox_module, json_dump, json_dumps, json_load, json_loads, register_trainer
30 changes: 13 additions & 17 deletions nni/retiarii/execution/base.py
@@ -2,31 +2,29 @@
import os
import random
import string
from typing import Dict, Any, List
from typing import Dict, List

from .interface import AbstractExecutionEngine, AbstractGraphListener
from .. import codegen, utils
from ..graph import Model, ModelStatus, MetricData
from ..graph import Model, ModelStatus, MetricData, TrainingConfig
from ..integration_api import send_trial, receive_trial_parameters, get_advisor

_logger = logging.getLogger(__name__)

class BaseGraphData:
def __init__(self, model_script: str, training_module: str, training_kwargs: Dict[str, Any]) -> None:
def __init__(self, model_script: str, training_config: TrainingConfig) -> None:
self.model_script = model_script
self.training_module = training_module
self.training_kwargs = training_kwargs
self.training_config = training_config

def dump(self) -> dict:
return {
'model_script': self.model_script,
'training_module': self.training_module,
'training_kwargs': self.training_kwargs
'training_config': self.training_config
}

@staticmethod
def load(data):
return BaseGraphData(data['model_script'], data['training_module'], data['training_kwargs'])
def load(data) -> 'BaseGraphData':
return BaseGraphData(data['model_script'], data['training_config'])


class BaseExecutionEngine(AbstractExecutionEngine):
@@ -57,8 +55,7 @@ def __init__(self) -> None:

def submit_models(self, *models: Model) -> None:
for model in models:
data = BaseGraphData(codegen.model_to_pytorch_script(model),
model.training_config.module, model.training_config.kwargs)
data = BaseGraphData(codegen.model_to_pytorch_script(model), model.training_config)
self._running_models[send_trial(data.dump())] = model

def register_graph_listener(self, listener: AbstractGraphListener) -> None:
@@ -105,11 +102,10 @@ def trial_execute_graph(cls) -> None:
"""
graph_data = BaseGraphData.load(receive_trial_parameters())
random_str = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(6))
file_name = f'_generated_model_{random_str}.py'
file_name = f'_generated_model/{random_str}.py'
os.makedirs(os.path.dirname(file_name), exist_ok=True)
with open(file_name, 'w') as f:
f.write(graph_data.model_script)
trainer_cls = utils.import_(graph_data.training_module)
model_cls = utils.import_(f'_generated_model_{random_str}._model')
trainer_instance = trainer_cls(model=model_cls(), **graph_data.training_kwargs)
trainer_instance.fit()
os.remove(file_name)
model_cls = utils.import_(f'_generated_model.{random_str}._model')
graph_data.training_config._execute(model_cls)
os.remove(file_name)
2 changes: 1 addition & 1 deletion nni/retiarii/execution/cgo_engine.py
@@ -44,7 +44,7 @@ def submit_models(self, *models: List[Model]) -> None:
phy_models_and_placements = self._assemble(logical)
for model, placement, grouped_models in phy_models_and_placements:
data = BaseGraphData(codegen.model_to_pytorch_script(model, placement=placement),
model.training_config.module, model.training_config.kwargs)
model.training_config)
for m in grouped_models:
self._original_models[m.model_id] = m
self._original_model_to_multi_model[m.model_id] = model