diff --git a/.drone.yml b/.drone.yml index 472861852cae7..91ccba28a1175 100644 --- a/.drone.yml +++ b/.drone.yml @@ -39,6 +39,14 @@ steps: # when Image has defined CUDa version we can switch to this package spec "nvidia-dali-cuda${CUDA_VERSION%%.*}0" - pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100 --upgrade-strategy only-if-needed - pip list + # todo: remove unzip install after new nigtly docker is created + - apt-get update -qq + - apt-get install -y --no-install-recommends unzip + # get legacy checkpoints + - wget https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip -P legacy/ + - unzip -o legacy/checkpoints.zip -d legacy/ + - ls -l legacy/checkpoints/ + # testing... - python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=25 # --flake8 # Running special tests - sh tests/special_tests.sh diff --git a/.github/BECOMING_A_CORE_CONTRIBUTOR.md b/.github/BECOMING_A_CORE_CONTRIBUTOR.md index 3fa357ef062ca..828f45aedbecc 100644 --- a/.github/BECOMING_A_CORE_CONTRIBUTOR.md +++ b/.github/BECOMING_A_CORE_CONTRIBUTOR.md @@ -1,14 +1,14 @@ # How to become a core contributor -Thanks for your interest in joining the Lightning team! We’re a rapidly growing project which is poised to become the go-to framework for DL researchers! -We're currently recruiting for a team of 5 core maintainers. +Thanks for your interest in joining the Lightning team! We’re a rapidly growing project which is poised to become the go-to framework for DL researchers! +We're currently recruiting for a team of 5 core maintainers. As a core maintainer you will have a strong say in the direction of the project. Big changes will require a majority of maintainers to agree. -### Code of conduct +### Code of conduct First and foremost, you'll be evaluated against [these core values](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md). Any code we commit or feature we add needs to align with those core values. -### The bar for joining the team +### The bar for joining the team Lightning is being used to solve really hard problems at the top AI labs in the world. As such, the bar for adding team members is extremely high. Candidates must have solid engineering skills, have a good eye for user experience, and must be a power user of Lightning and PyTorch. With that said, the Lightning team will be diverse and a reflection of an inclusive AI community. You don't have to be an engineer to contribute! Scientists with great usability intuition and PyTorch ninja skills are welcomed! @@ -36,10 +36,10 @@ Pleasant/helpful tone. - Code is NOT overly engineered or hard to read - Ask yourself, could a non-engineer understand what’s happening here? - Make sure new tests are written -- Is this NECESSARY for Lightning? There are some PRs which are just purely about adding engineering complexity which have no place in Lightning. +- Is this NECESSARY for Lightning? There are some PRs which are just purely about adding engineering complexity which have no place in Lightning. Guidance - Some other PRs are for people who are wanting to get involved and add something unnecessary. We do want their help though! So don’t approve the PR, but direct them to a Github issue that they might be interested in helping with instead! -- To be considered for core contributor, please review 10 PRs and help the authors land it on master. 
Once you've finished the review, ping me +- To be considered for core contributor, please review 10 PRs and help the authors land it on master. Once you've finished the review, ping me for a sanity check. At the end of 10 PRs if your PR reviews are inline with expectations described above, then you can merge PRs on your own going forward, otherwise we'll do a few more until we're both comfortable :) @@ -47,15 +47,15 @@ otherwise we'll do a few more until we're both comfortable :) There are some big decisions which the project must make. For these I expect core contributors to have something meaningful to add if it’s their area of expertise. #### Diversity -Lightning should reflect the broader community it serves. As such we should have scientists/researchers from -different fields contributing! +Lightning should reflect the broader community it serves. As such we should have scientists/researchers from +different fields contributing! The first 5 core contributors will fit this profile. Thus if you overlap strongly with experiences and expertise as someone else on the team, you might have to wait until the next set of contributors are added. #### Summary: Requirements to apply The goal is to be inline with expectations for solving issues by the last one so you can do them on your own. If not, I might ask you to solve a few more specific ones. -- Solve 10+ Github issues. +- Solve 10+ Github issues. - Create 5+ meaningful PRs which solves some reported issue - bug, - Perform 10+ PR reviews from other contributors. diff --git a/.github/ISSUE_TEMPLATE/documentation.md b/.github/ISSUE_TEMPLATE/documentation.md index 2b249089657c8..e78df92a18bab 100644 --- a/.github/ISSUE_TEMPLATE/documentation.md +++ b/.github/ISSUE_TEMPLATE/documentation.md @@ -12,7 +12,7 @@ assignees: '' For typos and doc fixes, please go ahead and: 1. Create an issue. -2. Fix the typo. +2. Fix the typo. 3. Submit a PR. Thanks! diff --git a/.github/ISSUE_TEMPLATE/how-to-question.md b/.github/ISSUE_TEMPLATE/how-to-question.md index 2a307e18de5c7..786244d2f5e74 100644 --- a/.github/ISSUE_TEMPLATE/how-to-question.md +++ b/.github/ISSUE_TEMPLATE/how-to-question.md @@ -9,10 +9,10 @@ assignees: '' ## ❓ Questions and Help -### Before asking: +### Before asking: 1. Try to find answers to your questions in [the Lightning Forum!](https://forums.pytorchlightning.ai/) -2. Search for similar [issues](https://github.com/PyTorchLightning/pytorch-lightning/issues). -3. Search the [docs](https://pytorch-lightning.readthedocs.io/en/latest/). +2. Search for similar [issues](https://github.com/PyTorchLightning/pytorch-lightning/issues). +3. Search the [docs](https://pytorch-lightning.readthedocs.io/en/latest/). @@ -20,7 +20,7 @@ assignees: '' #### Code - + #### What have you tried? 
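For reference, the checkpoint-fetching steps this patch adds (wget/unzip in `.drone.yml` above, and `urllib.request.urlretrieve` in `ci_test-full.yml` below, since the patch notes wget does not work on Windows) can also be done with the Python standard library alone. A minimal sketch, not part of the patch itself (the `legacy/` target directory comes from the patch; the rest is illustrative):

```python
# Stdlib-only download + extract of the legacy checkpoint archive,
# mirroring `wget ... -P legacy/` followed by `unzip -o ... -d legacy/`.
import os
import zipfile
from urllib.request import urlretrieve

URL = "https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip"
TARGET_DIR = "legacy"  # assumed target directory, as used in the CI steps

os.makedirs(TARGET_DIR, exist_ok=True)
archive_path = os.path.join(TARGET_DIR, "checkpoints.zip")
urlretrieve(URL, archive_path)  # download the archive

with zipfile.ZipFile(archive_path) as zf:
    zf.extractall(TARGET_DIR)  # overwrite-extract, like `unzip -o`

# rough equivalent of `ls -l legacy/checkpoints/`
print(sorted(os.listdir(os.path.join(TARGET_DIR, "checkpoints"))))
```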
diff --git a/.github/prepare-nightly_pkg-name.py b/.github/prepare-nightly_pkg-name.py deleted file mode 100644 index b85f6049ac140..0000000000000 --- a/.github/prepare-nightly_pkg-name.py +++ /dev/null @@ -1,12 +0,0 @@ -import os -import re - -PATH_ROOT = os.path.abspath(os.path.dirname(os.path.dirname(__file__))) - -PATH_SETUP = os.path.join(PATH_ROOT, 'setup.py') -print(f"rename package '{PATH_SETUP}'") -with open(PATH_SETUP, 'r') as fp: - setup = fp.read() -setup = re.sub(r'name=[\'"]pytorch-lightning[\'"]', 'name="pytorch-lightning-nightly"', setup) -with open(PATH_SETUP, 'w') as fp: - fp.write(setup) diff --git a/.github/workflows/ci_dockers.yml b/.github/workflows/ci_dockers.yml index 47b839a54d04d..9bab59bff4dff 100644 --- a/.github/workflows/ci_dockers.yml +++ b/.github/workflows/ci_dockers.yml @@ -2,11 +2,21 @@ name: CI build Docker # https://www.docker.com/blog/first-docker-github-action-is-here # https://github.com/docker/build-push-action # see: https://help.github.com/en/actions/reference/events-that-trigger-workflows -on: # Trigger the workflow on push or pull request, but only for the master branch +on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] # include release branches like release/1.0.x pull_request: branches: [master, "release/*"] + paths: + - "dockers/**" + - "!dockers/README.md" + - "requirements/*.txt" + - "environment.yml" + - "requirements.txt" + - ".github/workflows/ci_dockers.yml" + - ".github/workflows/events-nightly.yml" + - ".github/workflows/release-docker.yml" + - "setup.py" jobs: build-PL: @@ -55,7 +65,6 @@ jobs: build-args: | PYTHON_VERSION=${{ matrix.python_version }} XLA_VERSION=${{ matrix.xla_version }} - cache-from: pytorchlightning/pytorch_lightning:base-xla-py${{ matrix.python_version }}-torch${{ matrix.xla_version }} file: dockers/base-xla/Dockerfile push: false timeout-minutes: 50 @@ -96,7 +105,6 @@ jobs: PYTHON_VERSION=${{ matrix.python_version }} PYTORCH_VERSION=${{ matrix.pytorch_version }} CUDA_VERSION=${{ steps.extend.outputs.CUDA }} - cache-from: pytorchlightning/pytorch_lightning:base-cuda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} file: dockers/base-cuda/Dockerfile push: false timeout-minutes: 50 @@ -139,7 +147,6 @@ jobs: PYTORCH_VERSION=${{ matrix.pytorch_version }} PYTORCH_CHANNEL=${{ steps.extend.outputs.CHANNEL }} CUDA_VERSION=${{ steps.extend.outputs.CUDA }} - cache-from: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} file: dockers/base-conda/Dockerfile push: false timeout-minutes: 50 diff --git a/.github/workflows/ci_pkg-install.yml b/.github/workflows/ci_pkg-install.yml index 52b3974e1e4c6..54c9f5c007c82 100644 --- a/.github/workflows/ci_pkg-install.yml +++ b/.github/workflows/ci_pkg-install.yml @@ -3,7 +3,7 @@ name: Install pkg # see: https://help.github.com/en/actions/reference/events-that-trigger-workflows on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] @@ -27,13 +27,13 @@ jobs: - name: Prepare env run: | - pip install check-manifest "twine>=3.2" + pip install check-manifest "twine==3.2" setuptools wheel - name: Create package run: | check-manifest # python setup.py check --metadata 
--strict - python setup.py sdist + python setup.py sdist bdist_wheel - name: Check package run: | @@ -46,12 +46,18 @@ jobs: # this is just a hotfix because of Win cannot install it directly pip install -r requirements.txt --find-links https://download.pytorch.org/whl/cpu/torch_stable.html - - name: Install package + - name: Install | Uninstall package - archive + run: | + # install as archive + pip install dist/*.tar.gz + cd .. + python -c "import pytorch_lightning as pl ; print(pl.__version__)" + pip uninstall -y pytorch-lightning + + - name: Install | Uninstall package - wheel run: | - # pip install virtualenv - # virtualenv vEnv --system-site-packages - # source vEnv/bin/activate - pip install dist/* - cd .. & python -c "import pytorch_lightning as pl ; print(pl.__version__)" - # deactivate - # rm -rf vEnv + # install as wheel + pip install dist/*.whl + cd .. + python -c "import pytorch_lightning as pl ; print(pl.__version__)" + pip uninstall -y pytorch-lightning \ No newline at end of file diff --git a/.github/workflows/ci_test-base.yml b/.github/workflows/ci_test-base.yml index d1ef75db942e8..ed8a2e30949b7 100644 --- a/.github/workflows/ci_test-base.yml +++ b/.github/workflows/ci_test-base.yml @@ -3,7 +3,7 @@ name: CI basic testing # see: https://help.github.com/en/actions/reference/events-that-trigger-workflows on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] diff --git a/.github/workflows/ci_test-conda.yml b/.github/workflows/ci_test-conda.yml index 15797ff59e981..6dab106471c50 100644 --- a/.github/workflows/ci_test-conda.yml +++ b/.github/workflows/ci_test-conda.yml @@ -3,7 +3,7 @@ name: PyTorch & Conda # see: https://help.github.com/en/actions/reference/events-that-trigger-workflows on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] @@ -34,10 +34,21 @@ jobs: # todo this probably does not work with docker images, rather cache dockers uses: actions/cache@v2 with: - path: Datasets # This path is specific to Ubuntu - # Look to see if there is a cache hit for the corresponding requirements file + path: Datasets key: pl-dataset + - name: Pull checkpoints from S3 + # todo: consider adding coma caching, but ATM all models have less then 100KB + run: | + # todo: remove unzip install after new nigtly docker is created + apt-get update -qq + apt-get install -y --no-install-recommends unzip + # enter legacy and update checkpoints from S3 + cd legacy + curl https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip --output checkpoints.zip + unzip -o checkpoints.zip + ls -l checkpoints/ + - name: Tests run: | # NOTE: run coverage on tests does not propagare faler status for Win, https://github.com/nedbat/coveragepy/issues/1003 diff --git a/.github/workflows/ci_test-full.yml b/.github/workflows/ci_test-full.yml index c42b9732a8d5a..300a0748dcda3 100644 --- a/.github/workflows/ci_test-full.yml +++ b/.github/workflows/ci_test-full.yml @@ -3,7 +3,7 @@ name: CI complete testing # see: https://help.github.com/en/actions/reference/events-that-trigger-workflows on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release 
branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] @@ -87,6 +87,16 @@ jobs: restore-keys: | ${{ runner.os }}-pip-py${{ matrix.python-version }}-${{ matrix.requires }}- + - name: Pull checkpoints from S3 + # todo: consider adding some caching, but ATM all models have less then 100KB + run: | + cd legacy + # wget is simpler but does not work on Windows + python -c "from urllib.request import urlretrieve ; urlretrieve('https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip', 'checkpoints.zip')" + ls -l . + unzip -o checkpoints.zip + ls -l checkpoints/ + - name: Install dependencies env: # MAKEFLAGS: "-j2" @@ -119,8 +129,7 @@ jobs: - name: Cache datasets uses: actions/cache@v2 with: - path: Datasets # This path is specific to Ubuntu - # Look to see if there is a cache hit for the corresponding requirements file + path: Datasets key: pl-dataset - name: Tests diff --git a/.github/workflows/ci_test-tpu.yml b/.github/workflows/ci_test-tpu.yml index ec2a976ea98e5..b1abcfe123201 100644 --- a/.github/workflows/ci_test-tpu.yml +++ b/.github/workflows/ci_test-tpu.yml @@ -2,7 +2,7 @@ name: TPU tests on: push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] # TODO: temporal disable TPU testing until we find way how to pass credentials to forked PRs # pull_request: # branches: diff --git a/.github/workflows/code-formatting.yml b/.github/workflows/code-formatting.yml index b4a2e5c76c207..c76901f39db81 100644 --- a/.github/workflows/code-formatting.yml +++ b/.github/workflows/code-formatting.yml @@ -2,7 +2,7 @@ name: "Check Code Format" on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] diff --git a/.github/workflows/docs-checks.yml b/.github/workflows/docs-checks.yml index 3f6b35ba7b7cb..1857ebc8dabea 100644 --- a/.github/workflows/docs-checks.yml +++ b/.github/workflows/docs-checks.yml @@ -3,7 +3,7 @@ name: "Docs check" on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] pull_request: branches: [master, "release/*"] @@ -109,4 +109,3 @@ jobs: path: docs/build/html/ # Use always() to always run this step to publish test results when there are test failures if: success() - diff --git a/.github/workflows/nightly.yml b/.github/workflows/events-nightly.yml similarity index 72% rename from .github/workflows/nightly.yml rename to .github/workflows/events-nightly.yml index 61a553d545732..71227308cd4ee 100644 --- a/.github/workflows/nightly.yml +++ b/.github/workflows/events-nightly.yml @@ -1,49 +1,48 @@ name: Nightly events # https://jasonet.co/posts/scheduled-actions/ +# https://github.community/t/distinct-job-for-each-schedule/17811/2 on: schedule: - # At the end of every day - - cron: "0 0 * * *" + - cron: "0 0 * * *" # At the end of every day # based on https://github.com/pypa/gh-action-pypi-publish jobs: - pypi-release: runs-on: ubuntu-20.04 steps: - # does nightly releases from feature branch - - uses: actions/checkout@v2 - with: - ref: release/1.2-dev - - uses: actions/setup-python@v2 - with: - python-version: 3.7 - - - name: Install dependencies - run: >- - python -m pip install --user --upgrade setuptools wheel - - - name: Build 
packages - run: | - python .github/prepare-nightly_version.py - python setup.py sdist bdist_wheel - ls -lh dist/ - - - name: Delay releasing - uses: juliangruber/sleep-action@v1 - with: - time: 5m + # does nightly releases from feature branch + - uses: actions/checkout@v2 + with: + ref: release/1.2-dev + - uses: actions/setup-python@v2 + with: + python-version: 3.7 + + - name: Install dependencies + run: >- + python -m pip install --user --upgrade setuptools wheel + + - name: Build packages + run: | + python .github/prepare-nightly_version.py + python setup.py sdist bdist_wheel + ls -lh dist/ + + - name: Delay releasing + uses: juliangruber/sleep-action@v1 + with: + time: 5m # We do this, since failures on test.pypi aren't that bad - - name: Publish to Test PyPI - uses: pypa/gh-action-pypi-publish@v1.4.1 - with: - user: __token__ - password: ${{ secrets.test_pypi_password }} - repository_url: https://test.pypi.org/legacy/ - verbose: true + - name: Publish to Test PyPI + uses: pypa/gh-action-pypi-publish@v1.4.1 + with: + user: __token__ + password: ${{ secrets.test_pypi_password }} + repository_url: https://test.pypi.org/legacy/ + verbose: true docker-XLA: runs-on: ubuntu-20.04 @@ -51,7 +50,7 @@ jobs: fail-fast: false matrix: python_version: [3.6, 3.7] - xla_version: [1.6, 1.7] # todo: , "nightly" + xla_version: [1.6, 1.7] # todo: , "nightly" steps: - name: Checkout uses: actions/checkout@v2 @@ -72,8 +71,6 @@ jobs: build-args: | PYTHON_VERSION=${{ matrix.python_version }} XLA_VERSION=${{ matrix.xla_version }} - cache-from: pytorchlightning/pytorch_lightning:base-xla-py${{ matrix.python_version }}-torch${{ matrix.xla_version }} - cache-to: type=inline file: dockers/base-xla/Dockerfile push: true tags: pytorchlightning/pytorch_lightning:base-xla-py${{ matrix.python_version }}-torch${{ matrix.xla_version }} @@ -118,8 +115,6 @@ jobs: PYTHON_VERSION=${{ matrix.python_version }} PYTORCH_VERSION=${{ matrix.pytorch_version }} CUDA_VERSION=${{ steps.extend.outputs.CUDA }} - cache-from: pytorchlightning/pytorch_lightning:base-cuda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} - cache-to: type=inline file: dockers/base-cuda/Dockerfile push: true tags: pytorchlightning/pytorch_lightning:base-cuda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} @@ -134,8 +129,6 @@ jobs: PYTORCH_VERSION=${{ matrix.pytorch_version }} PYTORCH_CHANNEL=${{ steps.extend.outputs.CHANNEL }} CUDA_VERSION=${{ steps.extend.outputs.CUDA }} - cache-from: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} - cache-to: type=inline file: dockers/base-conda/Dockerfile push: true tags: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python_version }}-torch${{ matrix.pytorch_version }} diff --git a/.github/workflows/events-ocasional.yml b/.github/workflows/events-ocasional.yml new file mode 100644 index 0000000000000..a6cd43a8371ea --- /dev/null +++ b/.github/workflows/events-ocasional.yml @@ -0,0 +1,26 @@ +name: Ocasional events + +on: + push: + branches: [master, "release/*"] + pull_request_target: {} + +jobs: + + # autoupdate is a GitHub Action that auto-updates pull requests branches whenever changes land on their destination branch. 
+ # see: https://github.com/marketplace/actions/auto-update + pr-auto-update: + name: Auto-update PR + runs-on: ubuntu-18.04 + steps: + - uses: docker://chinthakagodawita/autoupdate-action:v1 + # todo: this shall be resolved with https://github.com/chinthakagodawita/autoupdate/issues/100 + continue-on-error: true + env: + GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}" + DRY_RUN: "false" + PR_FILTER: "labelled" + PR_LABELS: "0:] Ready-To-Go,has conflicts" + MERGE_MSG: "Branch was auto-updated." + RETRY_COUNT: "3" + RETRY_SLEEP: "500" diff --git a/.github/workflows/events-recurrent.yml b/.github/workflows/events-recurrent.yml new file mode 100644 index 0000000000000..6b9382e29901e --- /dev/null +++ b/.github/workflows/events-recurrent.yml @@ -0,0 +1,20 @@ +name: Recurrent events + +on: + push: {} + +jobs: + + # This label will then be managed by this action. + # It will be added to PRs with merge conflicts and removed from PRs without conflicts. + # https://github.com/mschilde/auto-label-merge-conflicts + pr-label-conflicts: + name: Label PR conflits + runs-on: ubuntu-20.04 + steps: + - uses: mschilde/auto-label-merge-conflicts@v2.0 + with: + CONFLICT_LABEL_NAME: "has conflicts" + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + MAX_RETRIES: 3 + WAIT_MS: 5000 diff --git a/.github/workflows/release-docker.yml b/.github/workflows/release-docker.yml index ab8dbed24288e..f285794cbc33b 100644 --- a/.github/workflows/release-docker.yml +++ b/.github/workflows/release-docker.yml @@ -3,7 +3,7 @@ name: Publish Docker Releases # https://github.com/docker/build-push-action on: push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] release: types: [created] diff --git a/.github/workflows/release-pypi.yml b/.github/workflows/release-pypi.yml index 3cc3157ffbf89..9b2bc0699eeb6 100644 --- a/.github/workflows/release-pypi.yml +++ b/.github/workflows/release-pypi.yml @@ -3,9 +3,9 @@ name: PyPI Release # https://help.github.com/en/actions/reference/events-that-trigger-workflows on: # Trigger the workflow on push or pull request, but only for the master branch push: - branches: [master, "release/*"] # include release branches like release/1.0.x + branches: [master, "release/*"] release: - types: [created, "release/*"] + types: [created] jobs: @@ -61,3 +61,51 @@ jobs: with: user: __token__ password: ${{ secrets.pypi_password }} + + # Note: This uses an internal pip API and may not always work + # https://github.com/actions/cache/blob/master/examples.md#multiple-oss-in-a-workflow + - name: Cache pip + uses: actions/cache@v2 + with: + path: ~/.cache/pip + key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }} + restore-keys: ${{ runner.os }}-pip- + + - name: Install dependencies + run: | + pip install -r requirements.txt --find-links https://download.pytorch.org/whl/cpu/torch_stable.html --quiet + pip install virtualenv + pip install awscli + + - name: Configure AWS credentials + uses: aws-actions/configure-aws-credentials@v1 + with: + aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} + aws-secret-access-key: ${{ secrets.AWS_SECRET_KEY_ID }} + aws-region: us-east-1 + + - name: Pull files from S3 + run: | + aws s3 cp --recursive s3://pl-public-data/legacy/checkpoints/ legacy/checkpoints/ # --acl public-read + ls -l legacy/checkpoints/ + + - name: Generate checkpoint + if: startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release' + run: | + virtualenv vEnv --system-site-packages + source vEnv/bin/activate + pip 
install dist/* + + pl_ver=$(python -c "import pytorch_lightning as pl ; print(pl.__version__)" 2>&1) + # generate checkpoint to this version + bash legacy/generate_checkpoints.sh $pl_ver + + deactivate + rm -rf vEnv + + - name: Push files to S3 + run: | + aws s3 sync legacy/checkpoints/ s3://pl-public-data/legacy/checkpoints/ + cd legacy + zip -r checkpoints.zip checkpoints + aws s3 cp checkpoints.zip s3://pl-public-data/legacy/ --acl public-read diff --git a/.gitignore b/.gitignore index 237dbef370a2a..65ff649c4341a 100644 --- a/.gitignore +++ b/.gitignore @@ -27,6 +27,7 @@ timit_data/ # C extensions *.so +# PyCharm .idea/ # Distribution / packaging @@ -126,11 +127,14 @@ ENV/ # mypy .mypy_cache/ +# pytest +.pytest_cache/ # data .data/ Datasets/ mnist/ +legacy/checkpoints/ # pl tests ml-runs/ diff --git a/CHANGELOG.md b/CHANGELOG.md index 3fed6f84a8938..3f915022be24c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -105,6 +105,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Removed deprecated `TrainResult` ([#5323](https://github.com/PyTorchLightning/pytorch-lightning/pull/5323)) + + +- Removed deprecated `EvalResult` ([#5633](https://github.com/PyTorchLightning/pytorch-lightning/pull/5633)) ### Fixed @@ -123,7 +126,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Fixed loading yaml ([#5619](https://github.com/PyTorchLightning/pytorch-lightning/pull/5619)) -## [1.1.3rc] - 2020-12-29 +## [1.1.4] - YYYY-MM-DD + +### Added + +- Add automatic optimization property setter to lightning module ([#5169](https://github.com/PyTorchLightning/pytorch-lightning/pull/5169)) + +### Changed + +- Changed deprecated `enable_pl_optimizer=True` ([#5244](https://github.com/PyTorchLightning/pytorch-lightning/pull/5244)) + +### Fixed + +- Fixed `transfer_batch_to_device` for DDP with `len(devices_ids) == 1` ([#5195](https://github.com/PyTorchLightning/pytorch-lightning/pull/5195)) +- Logging only on `not should_accumulate()` during training ([#5417](https://github.com/PyTorchLightning/pytorch-lightning/pull/5417)) +- Resolve interpolation bug with Hydra ([#5406](https://github.com/PyTorchLightning/pytorch-lightning/pull/5406)) +- Check environ before selecting a seed to prevent warning message ([#4743](https://github.com/PyTorchLightning/pytorch-lightning/pull/4743)) + + +## [1.1.3] - 2021-01-05 ### Added diff --git a/LICENSE b/LICENSE index b9181e1a6e5d8..2e66bec2e791c 100644 --- a/LICENSE +++ b/LICENSE @@ -186,7 +186,7 @@ same "printed page" as the copyright notice for easier identification within third-party archives. - Copyright 2018-2020 William Falcon + Copyright 2018-2021 William Falcon Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/MANIFEST.in b/MANIFEST.in index 8db3912027d6d..95672548f724c 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -69,4 +69,4 @@ prune temp* prune test* prune benchmark* prune dockers - +prune legacy diff --git a/README.md b/README.md index 29abbd5aa5784..402b494414e82 100644 --- a/README.md +++ b/README.md @@ -224,7 +224,8 @@ with tempfile.NamedTemporaryFile(suffix='.onnx', delete=False) as tmpfile: ```python class LitAutoEncoder(pl.LightningModule): def training_step(self, batch, batch_idx, opt_idx): - (opt_a, opt_b) = self.optimizers() + # access your optimizers with use_pl_optimizer=False. Default is True + (opt_a, opt_b) = self.optimizers(use_pl_optimizer=True) loss_a = ... 
self.manual_backward(loss_a, opt_a) diff --git a/benchmarks/test_sharded_parity.py b/benchmarks/test_sharded_parity.py index 05fde8e11523a..01975493590e9 100644 --- a/benchmarks/test_sharded_parity.py +++ b/benchmarks/test_sharded_parity.py @@ -175,7 +175,8 @@ def train_dataloader(self): class SeedTrainLoaderManualModel(SeedTrainLoaderModel): def training_step(self, batch, batch_idx, optimizer_idx): # manual - (opt_a, opt_b) = self.optimizers() + # access your optimizers with use_pl_optimizer=False. Default is True + (opt_a, opt_b) = self.optimizers(use_pl_optimizer=True) loss_1 = self.step(batch) self.manual_backward(loss_1, opt_a) diff --git a/dockers/base-conda/Dockerfile b/dockers/base-conda/Dockerfile index 72bfb0b244351..27ac96f96efcc 100644 --- a/dockers/base-conda/Dockerfile +++ b/dockers/base-conda/Dockerfile @@ -39,7 +39,9 @@ RUN apt-get update -qq && \ build-essential \ cmake \ git \ + wget \ curl \ + unzip \ ca-certificates \ && \ @@ -74,16 +76,16 @@ ENV CONDA_ENV=lightning COPY environment.yml environment.yml # conda init -RUN conda create -y --name $CONDA_ENV cudatoolkit=${CUDA_VERSION} && \ +RUN conda create -y --name $CONDA_ENV python=${PYTHON_VERSION} pytorch=${PYTORCH_VERSION} cudatoolkit=${CUDA_VERSION} -c ${PYTORCH_CHANNEL} && \ conda init bash && \ # NOTE: this requires that the channel is presented in the yaml before packages - # replace channel to nigtly if needed, fix PT version and remove Horovod as it will be installe later + # replace channel to nigtly if needed, fix PT version and remove Horovod as it will be installed later python -c "fname = 'environment.yml' ; req = open(fname).read().replace('pytorch', '${PYTORCH_CHANNEL}', 1) ; open(fname, 'w').write(req)" && \ - python -c "import re ; fname = 'environment.yml' ; req = re.sub(r'python[>=]+[\d\.]+', 'python=${PYTHON_VERSION}', open(fname).read()) ; open(fname, 'w').write(req)" && \ - python -c "import re ; fname = 'environment.yml' ; req = re.sub(r'torch[>=]+[\d\.]+', 'torch=${PYTORCH_VERSION}', open(fname).read()) ; open(fname, 'w').write(req)" && \ + python -c "import re ; fname = 'environment.yml' ; req = re.sub(r'- python[>=]+[\d\.]+', '# - python=${PYTHON_VERSION}', open(fname).read()) ; open(fname, 'w').write(req)" && \ + python -c "import re ; fname = 'environment.yml' ; req = re.sub(r'- pytorch[>=]+[\d\.]+', '# - pytorch=${PYTORCH_VERSION}', open(fname).read()) ; open(fname, 'w').write(req)" && \ python -c "fname = 'environment.yml' ; req = open(fname).readlines() ; open(fname, 'w').writelines([ln for ln in req if 'horovod' not in ln])" && \ cat environment.yml && \ - conda env update --file environment.yml && \ + conda env update --name $CONDA_ENV --file environment.yml && \ conda clean -ya && \ rm environment.yml diff --git a/dockers/base-cuda/Dockerfile b/dockers/base-cuda/Dockerfile index bde54b8da7fd1..d84cba8b4cd00 100644 --- a/dockers/base-cuda/Dockerfile +++ b/dockers/base-cuda/Dockerfile @@ -44,6 +44,8 @@ RUN apt-get update -qq && \ cmake \ git \ wget \ + curl \ + unzip \ ca-certificates \ software-properties-common \ && \ diff --git a/dockers/tpu-tests/Dockerfile b/dockers/tpu-tests/Dockerfile index 4e83bdcb0d798..93d6244121891 100644 --- a/dockers/tpu-tests/Dockerfile +++ b/dockers/tpu-tests/Dockerfile @@ -23,6 +23,12 @@ MAINTAINER PyTorchLightning COPY ./ ./pytorch-lightning/ +# Pull the legacy checkpoints +RUN cd pytorch-lightning && \ + wget https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip -P legacy/ && \ + unzip -o legacy/checkpoints.zip -d legacy/ && \ + ls -l 
legacy/checkpoints/ + # If using this image for tests, intall more dependencies and don"t delete the source code where the tests live. RUN \ # Install pytorch-lightning at the current PR, plus dependencies. diff --git a/docs/.build_docs.sh b/docs/.build_docs.sh index 2b57c47953675..6cf6eab2fd398 100644 --- a/docs/.build_docs.sh +++ b/docs/.build_docs.sh @@ -1,3 +1,3 @@ rm -rf source/generated make clean -make html --debug --jobs 2 SPHINXOPTS="-W" \ No newline at end of file +make html --debug --jobs 2 SPHINXOPTS="-W" diff --git a/docs/Makefile b/docs/Makefile index 69fe55ecfa9aa..ba501f6f5b1bf 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -16,4 +16,4 @@ help: # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). %: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) \ No newline at end of file + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/source/_static/main.css b/docs/source/_static/main.css index 7441b775a4be5..82aa8b338ad39 100644 --- a/docs/source/_static/main.css +++ b/docs/source/_static/main.css @@ -1,3 +1,3 @@ col { width: 50% !important; -} \ No newline at end of file +} diff --git a/docs/source/asr_nlp_tts.rst b/docs/source/asr_nlp_tts.rst index a5f1ac59bf696..49bed0a981a6e 100644 --- a/docs/source/asr_nlp_tts.rst +++ b/docs/source/asr_nlp_tts.rst @@ -10,16 +10,16 @@ These are amazing ecosystems to help with Automatic Speech Recognition (ASR), Na NeMo **** -`NVIDIA NeMo `_ is a toolkit for building new State-of-the-Art -Conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), -Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of -prebuilt modules that include everything needed to train on your data. -Every module can easily be customized, extended, and composed to create new Conversational AI +`NVIDIA NeMo `_ is a toolkit for building new State-of-the-Art +Conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), +Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of +prebuilt modules that include everything needed to train on your data. +Every module can easily be customized, extended, and composed to create new Conversational AI model architectures. -Conversational AI architectures are typically very large and require a lot of data and compute -for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node -mixed-precision training. +Conversational AI architectures are typically very large and require a lot of data and compute +for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node +mixed-precision training. .. note:: Every NeMo model is a LightningModule that comes equipped with all supporting infrastructure for training and reproducibility. @@ -31,7 +31,7 @@ NeMo Models NeMo Models contain everything needed to train and reproduce state of the art Conversational AI research and applications, including: -- neural network architectures +- neural network architectures - datasets/data loaders - data preprocessing/postprocessing - data augmentors @@ -83,7 +83,7 @@ To install from a local clone of NeMo: ./reinstall.sh # from cloned NeMo's git root -For Docker users, the NeMo container is available on +For Docker users, the NeMo container is available on `NGC `_. .. 
code-block:: bash @@ -97,7 +97,7 @@ For Docker users, the NeMo container is available on Experiment Manager ------------------ -NeMo's Experiment Manager leverages PyTorch Lightning for model checkpointing, +NeMo's Experiment Manager leverages PyTorch Lightning for model checkpointing, TensorBoard Logging, and Weights and Biases logging. The Experiment Manager is included by default in all NeMo example scripts. @@ -126,11 +126,11 @@ Optionally launch Tensorboard to view training results in ./nemo_experiments (by Automatic Speech Recognition (ASR) ================================== -Everything needed to train Convolutional ASR models is included with NeMo. -NeMo supports multiple Speech Recognition architectures, including Jasper and QuartzNet. -`NeMo Speech Models `_ -can be trained from scratch on custom datasets or -fine-tuned using pre-trained checkpoints trained on thousands of hours of audio +Everything needed to train Convolutional ASR models is included with NeMo. +NeMo supports multiple Speech Recognition architectures, including Jasper and QuartzNet. +`NeMo Speech Models `_ +can be trained from scratch on custom datasets or +fine-tuned using pre-trained checkpoints trained on thousands of hours of audio that can be restored for immediate use. Some typical ASR tasks are included with NeMo: @@ -141,7 +141,7 @@ Some typical ASR tasks are included with NeMo: - `Voice Activity Detection `_ - `Speaker Recognition `_ -See this `asr notebook `_ +See this `asr notebook `_ for a full tutorial on doing ASR with NeMo, PyTorch Lightning, and Hydra. Specify ASR Model Configurations with YAML File @@ -149,7 +149,7 @@ Specify ASR Model Configurations with YAML File NeMo Models and the PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra. -See this `asr config `_ +See this `asr config `_ for the entire speech to text .yaml file. .. code-block:: yaml @@ -198,7 +198,7 @@ Developing ASR Model From Scratch trainer.fit(asr_model) -Hydra makes every aspect of the NeMo model, +Hydra makes every aspect of the NeMo model, including the PyTorch Lightning Trainer, customizable from the command line. .. code-block:: bash @@ -259,7 +259,7 @@ with PyTorch Lightning since every NeMo model is a Lightning Module. log_probs = self.decoder(encoder_output=encoded) greedy_predictions = log_probs.argmax(dim=-1, keepdim=False) return log_probs, encoded_len, greedy_predictions - + # PTL-specific methods def training_step(self, batch, batch_nb): audio_signal, audio_signal_len, transcript, transcript_len = batch @@ -281,7 +281,7 @@ Neural Types in NeMo ASR ------------------------ NeMo Models and Neural Modules come with Neural Type checking. -Neural type checking is extremely useful when combining many different neural +Neural type checking is extremely useful when combining many different neural network architectures for a production-grade application. .. code-block:: python @@ -311,12 +311,12 @@ Natural Language Processing (NLP) ================================= Everything needed to finetune BERT-like language models for NLP tasks is included with NeMo. -`NeMo NLP Models `_ -include `HuggingFace Transformers `_ -and `NVIDIA Megatron-LM `_ BERT and Bio-Megatron models. +`NeMo NLP Models `_ +include `HuggingFace Transformers `_ +and `NVIDIA Megatron-LM `_ BERT and Bio-Megatron models. NeMo can also be used for pretraining BERT-based language models from HuggingFace. 
-Any of the HuggingFace encoders or Megatron-LM encoders can easily be used for the NLP tasks +Any of the HuggingFace encoders or Megatron-LM encoders can easily be used for the NLP tasks that are included with NeMo: - `Glue Benchmark (All tasks) `_ @@ -339,7 +339,7 @@ for a full tutorial on doing NER with NeMo, PyTorch Lightning, and Hydra. Specify NER Model Configurations with YAML File ----------------------------------------------- -.. note:: NeMo Models and the PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra. +.. note:: NeMo Models and the PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra. See this `token classification config `_ for the entire NER (token classification) .yaml file. @@ -368,7 +368,7 @@ for the entire NER (token classification) .yaml file. pretrained_model_name: bert-base-uncased lm_checkpoint: null ... - # the classifier for the downstream task + # the classifier for the downstream task head: num_fc_layers: 2 fc_dropout: 0.5 @@ -435,12 +435,12 @@ Hydra makes every aspect of the NeMo model, including the PyTorch Lightning Trai Tokenizers ---------- -Tokenization is the process of converting natural language text into integer arrays +Tokenization is the process of converting natural language text into integer arrays which can be used for machine learning. -For NLP tasks, tokenization is an essential part of data preprocessing. -NeMo supports all BERT-like model tokenizers from +For NLP tasks, tokenization is an essential part of data preprocessing. +NeMo supports all BERT-like model tokenizers from `HuggingFace's AutoTokenizer `_ -and also supports `Google's SentencePieceTokenizer `_ +and also supports `Google's SentencePieceTokenizer `_ which can be trained on custom data. To see the list of supported tokenizers: @@ -451,18 +451,18 @@ To see the list of supported tokenizers: nemo_nlp.modules.get_tokenizer_list() -See this `tokenizer notebook `_ +See this `tokenizer notebook `_ for a full tutorial on using tokenizers in NeMo. Language Models --------------- -Language models are used to extract information from (tokenized) text. +Language models are used to extract information from (tokenized) text. Much of the state-of-the-art in natural language processing is achieved -by fine-tuning pretrained language models on the downstream task. +by fine-tuning pretrained language models on the downstream task. -With NeMo, you can either `pretrain `_ -a BERT model on your data or use a pretrained language model from `HuggingFace Transformers `_ +With NeMo, you can either `pretrain `_ +a BERT model on your data or use a pretrained language model from `HuggingFace Transformers `_ or `NVIDIA Megatron-LM `_. To see the list of language models available in NeMo: @@ -483,11 +483,11 @@ for a full tutorial on using pretrained language models in NeMo. Using a Pre-trained NER Model ----------------------------- -NeMo has pre-trained NER models that can be used +NeMo has pre-trained NER models that can be used to get started with Token Classification right away. -Models are automatically downloaded from NGC, +Models are automatically downloaded from NGC, cached locally to disk, -and loaded into GPU memory using the `.from_pretrained` method. +and loaded into GPU memory using the `.from_pretrained` method. .. code-block:: python @@ -511,7 +511,7 @@ and loaded into GPU memory using the `.from_pretrained` method. 
NeMo NER Model Under the Hood ----------------------------- -Any aspect of NLP training or model architecture design can easily be customized with PyTorch Lightning +Any aspect of NLP training or model architecture design can easily be customized with PyTorch Lightning since every NeMo model is a Lightning Module. .. code-block:: python @@ -546,8 +546,8 @@ since every NeMo model is a Lightning Module. Neural Types in NeMo NLP ------------------------ -NeMo Models and Neural Modules come with Neural Type checking. -Neural type checking is extremely useful when combining many different neural network architectures +NeMo Models and Neural Modules come with Neural Type checking. +Neural type checking is extremely useful when combining many different neural network architectures for a production-grade application. .. code-block:: python @@ -565,11 +565,11 @@ for a production-grade application. Text-To-Speech (TTS) ==================== -Everything needed to train TTS models and generate audio is included with NeMo. -`NeMo TTS Models `_ +Everything needed to train TTS models and generate audio is included with NeMo. +`NeMo TTS Models `_ can be trained from scratch on your own data or pretrained models can be downloaded -automatically. NeMo currently supports a two step inference procedure. -First, a model is used to generate a mel spectrogram from text. +automatically. NeMo currently supports a two step inference procedure. +First, a model is used to generate a mel spectrogram from text. Second, a model is used to generate audio from a mel spectrogram. Mel Spectrogram Generators: @@ -647,10 +647,10 @@ Hydra makes every aspect of the NeMo model, including the PyTorch Lightning Trai Using State-Of-The-Art Pre-trained TTS Model -------------------------------------------- -Generate speech using models trained on `LJSpeech `, +Generate speech using models trained on `LJSpeech `, around 24 hours of single speaker data. -See this `TTS notebook `_ +See this `TTS notebook `_ for a full tutorial on generating speech with NeMo, PyTorch Lightning, and Hydra. .. code-block:: python @@ -673,7 +673,7 @@ for a full tutorial on generating speech with NeMo, PyTorch Lightning, and Hydra if isinstance(audio, torch.Tensor): audio = audio.to('cpu').numpy() return spectrogram, audio - + text_to_generate = input("Input what you want the model to say: ") spec, audio = infer(spec_gen, vocoder, text_to_generate) @@ -763,8 +763,8 @@ be customized with PyTorch Lightning since every NeMo model is a LightningModule Neural Types in NeMo TTS ------------------------ -NeMo Models and Neural Modules come with Neural Type checking. -Neural type checking is extremely useful when combining many different neural network architectures +NeMo Models and Neural Modules come with Neural Type checking. +Neural type checking is extremely useful when combining many different neural network architectures for a production-grade application. .. code-block:: python @@ -793,7 +793,7 @@ Learn More - Visit the `NVIDIA NeMo Developer Website `_ - Read the `NVIDIA NeMo PyTorch Blog `_ - Download pre-trained `ASR `_, `NLP `_, and `TTS `_ models on `NVIDIA NGC `_ to quickly get started with NeMo. -- Become an expert on Building Conversational AI applications with our `tutorials `_, and `example scripts `_, +- Become an expert on Building Conversational AI applications with our `tutorials `_, and `example scripts `_, - See our `developer guide `_ for more information on core NeMo concepts, ASR/NLP/TTS collections, and the NeMo API. .. 
note:: NeMo tutorial notebooks can be run on `Google Colab `_. diff --git a/docs/source/cloud_training.rst b/docs/source/cloud_training.rst index 9fef417da7442..127bee6478dfd 100644 --- a/docs/source/cloud_training.rst +++ b/docs/source/cloud_training.rst @@ -26,4 +26,4 @@ using over 20+ distributions, lists, etc. Of course, you can also configure all can be dynamically assembled at runtime. -.. hint:: Grid supports the search strategy of your choice! (and much more than just sweeps) \ No newline at end of file +.. hint:: Grid supports the search strategy of your choice! (and much more than just sweeps) diff --git a/docs/source/datamodules.rst b/docs/source/datamodules.rst index 2589ac605ee11..bc79d7dc3d6ea 100644 --- a/docs/source/datamodules.rst +++ b/docs/source/datamodules.rst @@ -129,7 +129,7 @@ Here's a more realistic, complex DataModule that shows how much more reusable th # self.dims is returned when you call dm.size() # Setting default dims here because we know them. - # Could optionally be assigned dynamically in dm.setup() + # Could optionally be assigned dynamically in dm.setup() self.dims = (1, 28, 28) def prepare_data(self): diff --git a/docs/source/governance.rst b/docs/source/governance.rst index 74d24e306d3f9..22fba33771c0a 100644 --- a/docs/source/governance.rst +++ b/docs/source/governance.rst @@ -25,3 +25,4 @@ Core Maintainers - Jeff Yang (`ydcjeff `_) - Roger Shieh (`s-rog `_) - Carlos Mocholí (`carmocca `_) +- Ananth Subramaniam (`ananthsub `_) diff --git a/docs/source/introduction_guide.rst b/docs/source/introduction_guide.rst index a2a8340b34b4f..3a6cc7f2b6631 100644 --- a/docs/source/introduction_guide.rst +++ b/docs/source/introduction_guide.rst @@ -1051,7 +1051,7 @@ would be the particular system and how it's trained (ie: A GAN or VAE or GPT). out = decoder(features, x) loss = perceptual_loss(x1, x2, x) + CE(out, x) - + In Lightning, this code is organized into a :ref:`lightning_module`. Engineering code @@ -1071,7 +1071,7 @@ over GPUs, 16-bit precision, etc. This is normally code that is THE SAME across download_data() dist.barrier() - + In Lightning, this code is abstracted out by the :ref:`trainer`. Non-essential code @@ -1090,7 +1090,7 @@ This is code that helps the research but isn't relevant to the research code. So z = Q.rsample() generated = decoder(z) self.experiment.log('images', generated) - + In Lightning this code is organized into :ref:`callbacks`. Data code diff --git a/docs/source/loggers.rst b/docs/source/loggers.rst index b74fe292b251b..08b3b1e997555 100644 --- a/docs/source/loggers.rst +++ b/docs/source/loggers.rst @@ -9,7 +9,7 @@ Loggers ******* -Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...). TensorBoard is used by default, +Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...). TensorBoard is used by default, but you can pass to the :class:`~pytorch_lightning.trainer.trainer.Trainer` any combination of the following loggers. .. note:: @@ -247,7 +247,7 @@ Lightning supports the use of multiple loggers, just pass a list to the logger1 = TensorBoardLogger('tb_logs', name='my_model') logger2 = TestTubeLogger('tb_logs', name='my_model') trainer = Trainer(logger=[logger1, logger2]) - + The loggers are available as a list anywhere except ``__init__`` in your :class:`~pytorch_lightning.core.lightning.LightningModule`. 
diff --git a/docs/source/lr_finder.rst b/docs/source/lr_finder.rst index fbeb1f5fd959d..a5c3b312f30fc 100755 --- a/docs/source/lr_finder.rst +++ b/docs/source/lr_finder.rst @@ -2,7 +2,7 @@ from pytorch_lightning.trainer.trainer import Trainer from pytorch_lightning.core.lightning import LightningModule - + .. _lr_finder: Learning Rate Finder @@ -22,14 +22,14 @@ for both better performance and faster convergence. Even optimizers such as choices. To reduce the amount of guesswork concerning choosing a good initial learning -rate, a `learning rate finder` can be used. As described in this `paper `_ -a learning rate finder does a small run where the learning rate is increased -after each processed batch and the corresponding loss is logged. The result of +rate, a `learning rate finder` can be used. As described in this `paper `_ +a learning rate finder does a small run where the learning rate is increased +after each processed batch and the corresponding loss is logged. The result of this is a `lr` vs. `loss` plot that can be used as guidance for choosing a optimal -initial lr. +initial lr. -.. warning:: - For the moment, this feature only works with models having a single optimizer. +.. warning:: + For the moment, this feature only works with models having a single optimizer. LR Finder support for DDP is not implemented yet, it is coming soon. ---------- @@ -52,7 +52,7 @@ which can be accessed via ``self.learning_rate`` or ``self.lr``. def configure_optimizers(self): return Adam(self.parameters(), lr=(self.lr or self.learning_rate)) - + model = LitModel() # finds learning rate automatically @@ -81,26 +81,26 @@ method of the trainer. A typical example of this would look like model = MyModelClass(hparams) trainer = Trainer() - + # Run learning rate finder lr_finder = trainer.tuner.lr_find(model) - + # Results can be found in lr_finder.results - + # Plot with fig = lr_finder.plot(suggest=True) fig.show() - + # Pick point based on plot, or get suggestion new_lr = lr_finder.suggestion() - + # update hparams of the model model.hparams.lr = new_lr # Fit model trainer.fit(model) - + The figure produced by ``lr_finder.plot()`` should look something like the figure below. It is recommended to not pick the learning rate that achieves the lowest loss, but instead something in the middle of the sharpest downward slope (red point). diff --git a/docs/source/multi_gpu.rst b/docs/source/multi_gpu.rst index b822d25d6b94e..4a67460057abd 100644 --- a/docs/source/multi_gpu.rst +++ b/docs/source/multi_gpu.rst @@ -654,7 +654,7 @@ To use Sharded Training, you need to first install FairScale using the command b .. code-block:: bash - pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip + pip install fairscale .. code-block:: python @@ -681,7 +681,7 @@ Reference: https://arxiv.org/abs/1811.06965 .. note:: DDPSequentialPlugin is currently supported only for Pytorch 1.6. -To get started, install FairScale using the command below. +To get started, install FairScale using the command below. We install a specific branch which contains PyTorch related fixes for Sequential Parallelism. .. code-block:: bash diff --git a/docs/source/new-project.rst b/docs/source/new-project.rst index 6586be4141328..21c8693d12152 100644 --- a/docs/source/new-project.rst +++ b/docs/source/new-project.rst @@ -133,7 +133,7 @@ Examples of systems are: - `DQN `_ - `GAN `_ - `Image classifier `_ -- Seq2seq +- Seq2seq - `SimCLR `_ - `VAE `_ @@ -196,7 +196,7 @@ First, define the data however you want. 
Lightning just needs a :class:`~torch.u dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) train_loader = DataLoader(dataset) - + Next, init the :ref:`lightning_module` and the PyTorch Lightning :class:`~pytorch_lightning.trainer.Trainer`, then call fit with both the data and model. @@ -269,7 +269,8 @@ Now you own the train loop! .. code-block:: python def training_step(self, batch, batch_idx, opt_idx): - (opt_a, opt_b, opt_c) = self.optimizers() + # access your optimizers with use_pl_optimizer=False. Default is True + (opt_a, opt_b, opt_c) = self.optimizers(use_pl_optimizer=True) loss_a = self.generator(batch[0]) @@ -401,7 +402,7 @@ It's trivial to use CPUs, GPUs or TPUs in Lightning. There's **NO NEED** to chan # train on 1 GPU trainer = pl.Trainer(gpus=1) - + .. code-block:: python # train on multiple GPUs across nodes (32 gpus here) @@ -409,7 +410,7 @@ It's trivial to use CPUs, GPUs or TPUs in Lightning. There's **NO NEED** to chan gpus=4, num_nodes=8 ) - + .. code-block:: python # train on gpu 1, 3, 5 (3 gpus total) @@ -437,7 +438,7 @@ Without changing a SINGLE line of your code, you can now do the following with t limit_train_batches=0.5, val_check_interval=0.25 ) - + ----------- Checkpoints @@ -717,7 +718,7 @@ Lightning has many tools for debugging. Here is an example of just a few of them .. testcode:: - # Automatically overfit the sane batch of your model for a sanity test + # Automatically overfit the sane batch of your model for a sanity test trainer = Trainer(overfit_batches=1) .. testcode:: @@ -727,7 +728,7 @@ Lightning has many tools for debugging. Here is an example of just a few of them trainer = Trainer(fast_dev_run=True) .. testcode:: - + # train only 20% of an epoch trainer = Trainer(limit_train_batches=0.2) @@ -737,10 +738,10 @@ Lightning has many tools for debugging. Here is an example of just a few of them trainer = Trainer(val_check_interval=0.25) .. testcode:: - + # Profile your code to find speed/memory bottlenecks Trainer(profiler=True) - + --------------- ******************** diff --git a/docs/source/optimizers.rst b/docs/source/optimizers.rst index 2680c01e4c7ec..588bdefb367e3 100644 --- a/docs/source/optimizers.rst +++ b/docs/source/optimizers.rst @@ -28,8 +28,15 @@ to manually manage the optimization process. To do so, do the following: .. code-block:: python def training_step(self, batch, batch_idx, optimizer_idx): - # ignore optimizer_idx - (opt_g, opt_d) = self.optimizers() + + # 1. ignore optimizer_idx + # 2. `use_pl_optimizer=True` means `opt_g` and `opt_d` will be of type `LightingOptimizer` + # `LightingOptimizer` simply wrapped your optimizer and behave the same way ! + # When calling `optimizer.step`, `LightingOptimizer` will just handle TPU, AMP, accumulate_grad_batches, etc ... for you. + + # access your optimizers with `use_pl_optimizer=False` or `optimizer.optimizer` when using use_pl_optimizer=True + # use_pl_optimizer=True is the default + (opt_g, opt_d) = self.optimizers(use_pl_optimizer=True) # do anything you want loss_a = ... @@ -242,19 +249,29 @@ Here we add a learning-rate warm up # update params optimizer.step(closure=closure) -The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. +.. note:: The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. It handles TPUs, AMP, accumulate_grad_batches, zero_grad, and much more ... + +.. 
testcode:: + + # function hook in LightningModule + def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_idx, closure, on_tpu=False, using_native_amp=False, using_lbfgs=False): + optimizer.step(closure=closure) + +.. note:: To access your wrapped Optimizer from ``LightningOptimizer``, do as follow. .. testcode:: - from pytorch_lightning.core.optimizer import LightningOptimizer - # function hook in LightningModule def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_idx, closure, on_tpu=False, using_native_amp=False, using_lbfgs=False): - if not isinstance(optimizer, LightningOptimizer): - # wraps into LightingOptimizer only for running step - optimizer = LightningOptimizer.to_lightning_optimizer(optimizer, self.trainer) + + # `optimizer is a ``LightningOptimizer`` wrapping the optimizer. + # To access it, do as follow: + optimizer = optimizer.optimizer + + # run step. However, it won't work on TPU, AMP, etc... optimizer.step(closure=closure) + ---------- Using the closure functions for optimization diff --git a/docs/source/sequences.rst b/docs/source/sequences.rst index 93fefad0d0e35..759a671cc42ef 100644 --- a/docs/source/sequences.rst +++ b/docs/source/sequences.rst @@ -2,7 +2,7 @@ from torch.utils.data import IterableDataset from pytorch_lightning.trainer.trainer import Trainer - + .. _sequences: Sequential Data diff --git a/docs/source/slurm.rst b/docs/source/slurm.rst index d9cb508df10af..fbf718b7d2c15 100644 --- a/docs/source/slurm.rst +++ b/docs/source/slurm.rst @@ -1,7 +1,7 @@ .. testsetup:: * from pytorch_lightning.trainer.trainer import Trainer - + .. _slurm: Computing cluster (SLURM) diff --git a/docs/source/test_set.rst b/docs/source/test_set.rst index 9fe9640aa723b..d9e989a4182f3 100644 --- a/docs/source/test_set.rst +++ b/docs/source/test_set.rst @@ -41,7 +41,7 @@ You can run the test set on multiple models using the same trainer instance. model1 = LitModel() model2 = GANModel() - + trainer = Trainer() trainer.test(model1) trainer.test(model2) @@ -87,7 +87,7 @@ is not available at the time your model was declared. You can either pass in a single dataloader or a list of them. This optional named parameter can be used in conjunction with any of the above use cases. Additionally, -you can also pass in an :ref:`datamodules` that have overridden the +you can also pass in an :ref:`datamodules` that have overridden the :ref:`datamodule-test-dataloader-label` method. .. code-block:: python @@ -102,6 +102,3 @@ you can also pass in an :ref:`datamodules` that have overridden the # test (pass in datamodule) trainer.test(datamodule=dm) - - - diff --git a/docs/source/trainer.rst b/docs/source/trainer.rst index ecbe241f9f9d4..e55adf0e43174 100644 --- a/docs/source/trainer.rst +++ b/docs/source/trainer.rst @@ -335,7 +335,8 @@ optimizer behavior Example:: def training_step(self, batch, batch_idx): - opt = self.optimizers() + # access your optimizers with use_pl_optimizer=False. Default is True + opt = self.optimizers(use_pl_optimizer=True) loss = ... self.manual_backward(loss, opt) @@ -350,7 +351,8 @@ In the multi-optimizer case, ignore the optimizer_idx flag and use the optimizer Example:: def training_step(self, batch, batch_idx, optimizer_idx): - (opt_a, opt_b) = self.optimizers() + # access your optimizers with use_pl_optimizer=False. Default is True + (opt_a, opt_b) = self.optimizers(use_pl_optimizer=True) gen_loss = ... 
self.manual_backward(gen_loss, opt_a) diff --git a/docs/source/training_tricks.rst b/docs/source/training_tricks.rst index 10ee668a97fa8..d7230a1fd687a 100644 --- a/docs/source/training_tricks.rst +++ b/docs/source/training_tricks.rst @@ -130,4 +130,4 @@ Sequential Model Parallelism with Checkpointing PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_. Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially. -For more information, refer to :ref:`sequential-parallelism`. \ No newline at end of file +For more information, refer to :ref:`sequential-parallelism`. diff --git a/docs/source/transfer_learning.rst b/docs/source/transfer_learning.rst index e35220764cf04..72d16a9f2bf11 100644 --- a/docs/source/transfer_learning.rst +++ b/docs/source/transfer_learning.rst @@ -1,7 +1,7 @@ .. testsetup:: * from pytorch_lightning.core.lightning import LightningModule - + Transfer Learning ----------------- diff --git a/docs/source/weights_loading.rst b/docs/source/weights_loading.rst index 77570260fe58c..f9a4cbd132349 100644 --- a/docs/source/weights_loading.rst +++ b/docs/source/weights_loading.rst @@ -92,7 +92,7 @@ You can also control more advanced options, like `save_top_k`, to save the best ) trainer = Trainer(callbacks=[checkpoint_callback]) - + You can retrieve the checkpoint after training by calling .. code-block:: python diff --git a/legacy/README.md b/legacy/README.md new file mode 100644 index 0000000000000..3ce6d15f65568 --- /dev/null +++ b/legacy/README.md @@ -0,0 +1,17 @@ +# Maintaining back-compatibility with come legacy versions + +The aim of this section is set some baselines and workflows/guidelines for maintaining back compatibility with some legacies version of PL + +At this moment we focus on ability running old checkpoints, so the flow here is to create a checkpoint with every release and store it in our public AWS storage and so each CI testing will pull this archive and test loading and resuming training with this model. 
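The trainer.rst and optimizers.rst hunks above all describe the same manual-optimization pattern: grab the wrapped optimizers via `self.optimizers(use_pl_optimizer=True)`, call `self.manual_backward`, then step and zero the grads yourself. Below is a minimal, self-contained sketch of that pattern under the post-PR API; `ManualGAN`, its layers and the toy losses are purely illustrative and not taken from the diff.

```python
import torch
import pytorch_lightning as pl


class ManualGAN(pl.LightningModule):
    """Illustrative only: two linear layers stand in for a real generator/discriminator."""

    def __init__(self):
        super().__init__()
        self.generator = torch.nn.Linear(32, 32)
        self.discriminator = torch.nn.Linear(32, 1)
        # the setter added in this PR lets the module opt out of automatic optimization
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx, optimizer_idx):
        # ignore optimizer_idx; use_pl_optimizer=True (the default) returns LightningOptimizer
        # wrappers, so .step() also takes care of AMP, TPU and accumulate_grad_batches
        (opt_g, opt_d) = self.optimizers(use_pl_optimizer=True)

        g_loss = self.discriminator(self.generator(batch)).mean()
        self.manual_backward(g_loss, opt_g)
        opt_g.step()
        opt_g.zero_grad()

        d_loss = -self.discriminator(batch.detach()).mean()
        self.manual_backward(d_loss, opt_d)
        opt_d.step()
        opt_d.zero_grad()

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.generator.parameters(), lr=1e-3),
            torch.optim.Adam(self.discriminator.parameters(), lr=1e-3),
        )
```

With `use_pl_optimizer=False` the same code would step the raw `torch.optim` optimizers instead, and none of the precision or accumulation handling would be applied for you.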
+ +If you want to pull all saved version-checkpoints for local testing/development, call +```bash +wget https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip +unzip -o checkpoints.zip +``` + +To back populate collection with past version you can use following bash: +```bash +bash generate_checkpoints.sh 1.0.2 1.0.3 1.0.4 +zip -r checkpoints.zip checkpoints/ +``` diff --git a/tests/trainer/logging_process/__init__.py b/legacy/checkpoints/.gitkeep similarity index 100% rename from tests/trainer/logging_process/__init__.py rename to legacy/checkpoints/.gitkeep diff --git a/legacy/generate_checkpoints.sh b/legacy/generate_checkpoints.sh new file mode 100644 index 0000000000000..7726c5b097c5c --- /dev/null +++ b/legacy/generate_checkpoints.sh @@ -0,0 +1,41 @@ +#!/bin/bash +# Sample call: +# bash generate_checkpoints.sh 1.0.2 1.0.3 1.0.4 + +LEGACY_PATH="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )" + +echo $LEGACY_PATH +# install some PT version here so it does not need to reinstalled for each env +pip install virtualenv "torch==1.5" --quiet --no-cache-dir + +ENV_PATH="$LEGACY_PATH/vEnv" + +# iterate over all arguments assuming that each argument is version +for ver in "$@" +do + echo "processing version: $ver" + # mkdir "$LEGACY_PATH/$ver" + + # create local env + echo $ENV_PATH + virtualenv $ENV_PATH --system-site-packages + # activate and install PL version + source "$ENV_PATH/bin/activate" + # there are problem to load ckpt in older versions since they are saved the newer versions + pip install "pytorch_lightning==$ver" "torch==1.3" --quiet --no-cache-dir + + python --version + pip --version + pip list | grep torch + + python "$LEGACY_PATH/zero_training.py" + cp "$LEGACY_PATH/zero_training.py" ${LEGACY_PATH}/checkpoints/${ver} + + mv ${LEGACY_PATH}/checkpoints/${ver}/lightning_logs/version_0/checkpoints/*.ckpt ${LEGACY_PATH}/checkpoints/${ver}/ + rm -rf ${LEGACY_PATH}/checkpoints/${ver}/lightning_logs + + deactivate + # clear env + rm -rf $ENV_PATH + +done diff --git a/legacy/zero_training.py b/legacy/zero_training.py new file mode 100644 index 0000000000000..0115df4143460 --- /dev/null +++ b/legacy/zero_training.py @@ -0,0 +1,93 @@ +# Copyright The PyTorch Lightning team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
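+# Run by legacy/generate_checkpoints.sh once per released version: it fits the tiny
+# DummyModel below for a few epochs and leaves the resulting checkpoint under
+# legacy/checkpoints/<pytorch_lightning.__version__>/ for the back-compatibility tests.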
+import os + +import torch +from torch.utils.data import Dataset + +import pytorch_lightning as pl + +PATH_LEGACY = os.path.dirname(__file__) + + +class RandomDataset(Dataset): + def __init__(self, size, length: int = 100): + self.len = length + self.data = torch.randn(length, size) + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return self.len + + +class DummyModel(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layer = torch.nn.Linear(32, 2) + + def forward(self, x): + return self.layer(x) + + def _loss(self, batch, prediction): + # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls + return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction)) + + def _step(self, batch, batch_idx): + output = self.layer(batch) + loss = self._loss(batch, output) + # return {'loss': loss} # used for PL<1.0 + return loss # used for PL >= 1.0 + + def training_step(self, batch, batch_idx): + return self._step(batch, batch_idx) + + def validation_step(self, batch, batch_idx): + self._step(batch, batch_idx) + + def test_step(self, batch, batch_idx): + self._step(batch, batch_idx) + + def configure_optimizers(self): + optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1) + lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1) + return [optimizer], [lr_scheduler] + + def train_dataloader(self): + return torch.utils.data.DataLoader(RandomDataset(32, 64)) + + def val_dataloader(self): + return torch.utils.data.DataLoader(RandomDataset(32, 64)) + + def test_dataloader(self): + return torch.utils.data.DataLoader(RandomDataset(32, 64)) + + +def main_train(dir_path, max_epochs: int = 5): + + trainer = pl.Trainer( + default_root_dir=dir_path, + checkpoint_callback=True, + max_epochs=max_epochs, + ) + + model = DummyModel() + trainer.fit(model) + + +if __name__ == '__main__': + path_dir = os.path.join(PATH_LEGACY, 'checkpoints', str(pl.__version__)) + main_train(path_dir) diff --git a/pl_examples/README.md b/pl_examples/README.md index 936f1cc3df0cf..a1cb856eb1e33 100644 --- a/pl_examples/README.md +++ b/pl_examples/README.md @@ -1,4 +1,4 @@ -# Examples +# Examples Our most robust examples showing all sorts of implementations can be found in our sister library [PyTorch-Lightning-Bolts](https://pytorch-lightning-bolts.readthedocs.io/en/latest/convolutional.html#gpt-2). @@ -14,6 +14,6 @@ In this folder we add 3 simple examples: --- ## Domain examples -This folder contains older examples. You should instead use the examples -in [PyTorch-Lightning-Bolts](https://pytorch-lightning-bolts.readthedocs.io/en/latest/convolutional.html#gpt-2) +This folder contains older examples. You should instead use the examples +in [PyTorch-Lightning-Bolts](https://pytorch-lightning-bolts.readthedocs.io/en/latest/convolutional.html#gpt-2) for advanced use cases. diff --git a/pl_examples/basic_examples/README.md b/pl_examples/basic_examples/README.md index 18ae204396290..199c453566c6f 100644 --- a/pl_examples/basic_examples/README.md +++ b/pl_examples/basic_examples/README.md @@ -1,5 +1,5 @@ -## Basic Examples -Use these examples to test how lightning works. +## Basic Examples +Use these examples to test how lightning works. #### MNIST Trains MNIST where the model is defined inside the LightningModule. @@ -36,7 +36,7 @@ python image_classifier.py --gpus 2 python image_classifier.py --gpus 2 --distributed_backend 'dp' ``` ---- +--- #### Autoencoder Showing the power of a system... 
arbitrarily complex training loops ```bash @@ -49,23 +49,23 @@ python autoencoder.py --gpus 2 # dataparallel python autoencoder.py --gpus 2 --distributed_backend 'dp' ``` ---- -# Multi-node example +--- +# Multi-node example This demo launches a job using 2 GPUs on 2 different nodes (4 GPUs total). To run this demo do the following: -1. Log into the jumphost node of your SLURM-managed cluster. -2. Create a conda environment with Lightning and a GPU PyTorch version. -3. Choose a script to submit +1. Log into the jumphost node of your SLURM-managed cluster. +2. Create a conda environment with Lightning and a GPU PyTorch version. +3. Choose a script to submit -#### DDP +#### DDP Submit this job to run with DistributedDataParallel (2 nodes, 2 gpus each) ```bash sbatch submit_ddp_job.sh YourEnv ``` -#### DDP2 +#### DDP2 Submit this job to run with a different implementation of DistributedDataParallel. In this version, each node acts like DataParallel but syncs across nodes like DDP. ```bash diff --git a/pytorch_lightning/__init__.py b/pytorch_lightning/__init__.py index 890db586b2084..5f115ef98fbb1 100644 --- a/pytorch_lightning/__init__.py +++ b/pytorch_lightning/__init__.py @@ -1,10 +1,15 @@ """Root package info.""" +import logging as python_logging +import os +import time + +_this_year = time.strftime("%Y") __version__ = '1.2.0dev' __author__ = 'William Falcon et al.' __author_email__ = 'waf2107@columbia.edu' __license__ = 'Apache-2.0' -__copyright__ = 'Copyright (c) 2018-2020, %s.' % __author__ +__copyright__ = f'Copyright (c) 2018-{_this_year}, {__author__}.' __homepage__ = 'https://github.com/PyTorchLightning/pytorch-lightning' # this has to be simple string, see: https://github.com/pypa/twine/issues/522 __docs__ = ( @@ -33,9 +38,6 @@ - https://pytorch-lightning.readthedocs.io/en/stable """ -import logging as python_logging -import os - _logger = python_logging.getLogger("lightning") _logger.addHandler(python_logging.StreamHandler()) _logger.setLevel(python_logging.INFO) diff --git a/pytorch_lightning/accelerators/cpu_accelerator.py b/pytorch_lightning/accelerators/cpu_accelerator.py index 7c80a4a30d223..3c2eac7dbb7ad 100644 --- a/pytorch_lightning/accelerators/cpu_accelerator.py +++ b/pytorch_lightning/accelerators/cpu_accelerator.py @@ -48,8 +48,6 @@ def setup(self, model): # allow for lr schedulers as well self.setup_optimizers(model) - self.trainer.convert_to_lightning_optimizers() - self.trainer.model = model def train(self): diff --git a/pytorch_lightning/accelerators/ddp2_accelerator.py b/pytorch_lightning/accelerators/ddp2_accelerator.py index a5e8d720ce186..0448bf8628d2c 100644 --- a/pytorch_lightning/accelerators/ddp2_accelerator.py +++ b/pytorch_lightning/accelerators/ddp2_accelerator.py @@ -189,8 +189,6 @@ def ddp_train(self, process_idx, mp_queue, model): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # device ids change depending on the DDP setup device_ids = self.get_device_ids() @@ -210,6 +208,7 @@ def ddp_train(self, process_idx, mp_queue, model): def configure_ddp( self, model: LightningModule, device_ids: List[int] ) -> DistributedDataParallel: + self.ddp_plugin.device_ids = device_ids model = self.ddp_plugin.configure_ddp(model, device_ids) return model diff --git a/pytorch_lightning/accelerators/ddp_accelerator.py b/pytorch_lightning/accelerators/ddp_accelerator.py index 56f6eaa2223a3..04599e4c6d32f 100644 --- a/pytorch_lightning/accelerators/ddp_accelerator.py +++ 
b/pytorch_lightning/accelerators/ddp_accelerator.py @@ -292,8 +292,6 @@ def ddp_train(self, process_idx, model): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # device ids change depending on the DDP setup device_ids = self.get_device_ids() @@ -315,6 +313,7 @@ def ddp_train(self, process_idx, model): def configure_ddp( self, model: LightningModule, device_ids: List[int] ) -> DistributedDataParallel: + self.ddp_plugin.device_ids = device_ids model = self.ddp_plugin.configure_ddp(model, device_ids) return model diff --git a/pytorch_lightning/accelerators/ddp_cpu_spawn_accelerator.py b/pytorch_lightning/accelerators/ddp_cpu_spawn_accelerator.py index b15b9e8062257..2820763a61307 100644 --- a/pytorch_lightning/accelerators/ddp_cpu_spawn_accelerator.py +++ b/pytorch_lightning/accelerators/ddp_cpu_spawn_accelerator.py @@ -148,8 +148,6 @@ def ddp_train(self, process_idx, mp_queue, model): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # DDP spawn already spawned off each process... no need to do anything device_ids = self.get_device_ids() @@ -239,6 +237,7 @@ def transfer_distrib_spawn_state_on_fit_end(self, model, mp_queue, results): def configure_ddp( self, model: LightningModule, device_ids: List[int] ) -> DistributedDataParallel: + self.ddp_plugin.device_ids = device_ids model = self.ddp_plugin.configure_ddp(model, device_ids) return model diff --git a/pytorch_lightning/accelerators/ddp_hpc_accelerator.py b/pytorch_lightning/accelerators/ddp_hpc_accelerator.py index cf6aad9999223..ad953da6d1b23 100644 --- a/pytorch_lightning/accelerators/ddp_hpc_accelerator.py +++ b/pytorch_lightning/accelerators/ddp_hpc_accelerator.py @@ -14,8 +14,8 @@ from typing import Any, List, Optional, Union import torch -import torch.distributed as torch_distrib import torch.distributed as dist +import torch.distributed as torch_distrib from torch.nn.parallel import DistributedDataParallel from pytorch_lightning import _logger as log @@ -177,8 +177,6 @@ def ddp_train(self, process_idx, model): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # device ids change depending on the DDP setup device_ids = self.get_device_ids() @@ -199,6 +197,7 @@ def ddp_train(self, process_idx, model): def configure_ddp( self, model: LightningModule, device_ids: List[int] ) -> DistributedDataParallel: + self.ddp_plugin.device_ids = device_ids model = self.ddp_plugin.configure_ddp(model, device_ids) return model diff --git a/pytorch_lightning/accelerators/ddp_spawn_accelerator.py b/pytorch_lightning/accelerators/ddp_spawn_accelerator.py index e23943e9262f8..2ff5fa0cc01b6 100644 --- a/pytorch_lightning/accelerators/ddp_spawn_accelerator.py +++ b/pytorch_lightning/accelerators/ddp_spawn_accelerator.py @@ -163,8 +163,6 @@ def ddp_train(self, process_idx, mp_queue, model, is_master: bool = False, proc_ # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # device ids change depending on the DDP setup device_ids = self.get_device_ids() @@ -271,6 +269,7 @@ def transfer_distrib_spawn_state_on_fit_end(self, model, mp_queue, results): def configure_ddp( self, model: LightningModule, device_ids: List[int] ) -> DistributedDataParallel: + self.ddp_plugin.device_ids = device_ids model = self.ddp_plugin.configure_ddp(model, device_ids) return model diff --git 
a/pytorch_lightning/accelerators/dp_accelerator.py b/pytorch_lightning/accelerators/dp_accelerator.py index 847d156d4f11d..8eb1b199a6b09 100644 --- a/pytorch_lightning/accelerators/dp_accelerator.py +++ b/pytorch_lightning/accelerators/dp_accelerator.py @@ -64,8 +64,6 @@ def setup(self, model): if self.trainer.amp_backend: model = self.__init_half_precision(model) - self.trainer.convert_to_lightning_optimizers() - self.trainer.model = model def __init_torch_data_parallel(self, model): diff --git a/pytorch_lightning/accelerators/gpu_accelerator.py b/pytorch_lightning/accelerators/gpu_accelerator.py index 2fe3b26679f5c..62486f04a5581 100644 --- a/pytorch_lightning/accelerators/gpu_accelerator.py +++ b/pytorch_lightning/accelerators/gpu_accelerator.py @@ -53,8 +53,6 @@ def setup(self, model): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - self.trainer.model = model def train(self): diff --git a/pytorch_lightning/accelerators/horovod_accelerator.py b/pytorch_lightning/accelerators/horovod_accelerator.py index 150be86210866..57f39125c62c2 100644 --- a/pytorch_lightning/accelerators/horovod_accelerator.py +++ b/pytorch_lightning/accelerators/horovod_accelerator.py @@ -90,8 +90,6 @@ def _filter_named_parameters(model, optimizer): # 16-bit model = self.trainer.precision_connector.connect(model) - self.trainer.convert_to_lightning_optimizers() - # Update logger rank info from Horovod to avoid race conditions from different ranks # creating directories / writing files in the same locations. self.trainer.global_rank = hvd.rank() diff --git a/pytorch_lightning/accelerators/tpu_accelerator.py b/pytorch_lightning/accelerators/tpu_accelerator.py index 66fc236a2a775..4b626a67e0533 100644 --- a/pytorch_lightning/accelerators/tpu_accelerator.py +++ b/pytorch_lightning/accelerators/tpu_accelerator.py @@ -229,8 +229,6 @@ def __setup_tpu_training(self, model: LightningModule, trainer): f' global rank: {trainer.tpu_global_core_rank}' f' with XLA_USE_BF16={os.environ.get("XLA_USE_BF16")}') - self.trainer.convert_to_lightning_optimizers() - def backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs): # do backward pass if self.trainer.train_loop.automatic_optimization: diff --git a/pytorch_lightning/core/hooks.py b/pytorch_lightning/core/hooks.py index e6e29ce9ea858..f44103348701e 100644 --- a/pytorch_lightning/core/hooks.py +++ b/pytorch_lightning/core/hooks.py @@ -559,9 +559,9 @@ def transfer_batch_to_device(self, batch, device) any other device than the one passed in as argument (unless you know what you are doing). Note: - This hook only runs on single GPU training (no data-parallel). If you need multi-GPU support - for your custom batch objects, you need to define your custom - :class:`~torch.nn.parallel.DistributedDataParallel` and + This hook only runs on single GPU training and DDP (no data-parallel). + If you need multi-GPU support for your custom batch objects in ``dp`` or ``ddp2``, + you need to define your custom :class:`~torch.nn.parallel.DistributedDataParallel` or override :meth:`~pytorch_lightning.core.lightning.LightningModule.configure_ddp`. 
See Also: diff --git a/pytorch_lightning/core/lightning.py b/pytorch_lightning/core/lightning.py index 5e8407f79a13d..ce01f8ff28461 100644 --- a/pytorch_lightning/core/lightning.py +++ b/pytorch_lightning/core/lightning.py @@ -102,9 +102,13 @@ def __init__(self, *args, **kwargs): self._running_manual_backward = False self._current_hook_fx_name = None self._current_dataloader_idx = None + self._automatic_optimization: bool = True - def optimizers(self): - opts = self.trainer.optimizers + def optimizers(self, use_pl_optimizer: bool = True) -> Union[Optimizer, List[Optimizer], List[LightningOptimizer]]: + if use_pl_optimizer: + opts = list(self.trainer.lightning_optimizers.values()) + else: + opts = self.trainer.optimizers # single optimizer if isinstance(opts, list) and len(opts) == 1 and isinstance(opts[0], Optimizer): @@ -151,7 +155,11 @@ def automatic_optimization(self) -> bool: """ If False you are responsible for calling .backward, .step, zero_grad. """ - return True + return self._automatic_optimization + + @automatic_optimization.setter + def automatic_optimization(self, automatic_optimization: bool) -> None: + self._automatic_optimization = automatic_optimization def print(self, *args, **kwargs) -> None: r""" @@ -620,14 +628,14 @@ def validation_step(self, *args, **kwargs): for val_batch in val_data: out = validation_step(val_batch) val_outs.append(out) - validation_epoch_end(val_outs) + validation_epoch_end(val_outs) Args: batch (:class:`~torch.Tensor` | (:class:`~torch.Tensor`, ...) | [:class:`~torch.Tensor`, ...]): The output of your :class:`~torch.utils.data.DataLoader`. A tensor, tuple or list. batch_idx (int): The index of this batch dataloader_idx (int): The index of the dataloader that produced this batch - (only if multiple val datasets used) + (only if multiple val dataloaders used) Return: Any of. @@ -675,11 +683,11 @@ def validation_step(self, batch, batch_idx): # log the outputs! self.log_dict({'val_loss': loss, 'val_acc': val_acc}) - If you pass in multiple val datasets, validation_step will have an additional argument. + If you pass in multiple val dataloaders, :meth:`validation_step` will have an additional argument. .. code-block:: python - # CASE 2: multiple validation datasets + # CASE 2: multiple validation dataloaders def validation_step(self, batch, batch_idx, dataloader_idx): # dataloader_idx tells you which dataset this is. @@ -811,7 +819,7 @@ def test_step(self, *args, **kwargs): The output of your :class:`~torch.utils.data.DataLoader`. A tensor, tuple or list. batch_idx (int): The index of this batch. dataloader_idx (int): The index of the dataloader that produced this batch - (only if multiple test datasets used). + (only if multiple test dataloaders used). Return: Any of. @@ -850,17 +858,16 @@ def test_step(self, batch, batch_idx): # log the outputs! self.log_dict({'test_loss': loss, 'test_acc': test_acc}) - If you pass in multiple validation datasets, :meth:`test_step` will have an additional - argument. + If you pass in multiple test dataloaders, :meth:`test_step` will have an additional argument. .. code-block:: python - # CASE 2: multiple test datasets + # CASE 2: multiple test dataloaders def test_step(self, batch, batch_idx, dataloader_idx): # dataloader_idx tells you which dataset this is. Note: - If you don't need to validate you don't need to implement this method. + If you don't need to test you don't need to implement this method. 
Note: When the :meth:`test_step` is called, the model has been put in eval mode and diff --git a/pytorch_lightning/core/optimizer.py b/pytorch_lightning/core/optimizer.py index acba35d9ae0ac..c8dd3f34e2c43 100644 --- a/pytorch_lightning/core/optimizer.py +++ b/pytorch_lightning/core/optimizer.py @@ -17,7 +17,7 @@ from torch.optim.optimizer import Optimizer -from pytorch_lightning.utilities import _TPU_AVAILABLE, DeviceType +from pytorch_lightning.utilities import _TPU_AVAILABLE, DeviceType, AMPType from pytorch_lightning.utilities.exceptions import MisconfigurationException if _TPU_AVAILABLE: @@ -63,6 +63,10 @@ def __init__(self, self._accumulate_grad_batches = accumulate_grad_batches self._optimizer_idx = None + @property + def optimizer(self): + return self._optimizer + @property def defaults(self): return self._optimizer.defaults @@ -103,9 +107,13 @@ def _on_trainer_init(self, trainer): break @classmethod - def to_lightning_optimizer(cls, optimizer, trainer): - optimizer = cls(optimizer) - optimizer._on_trainer_init(trainer) + def _to_lightning_optimizer(cls, optimizer, trainer, opt_idx): + # apex overrides .step function and need to be wrapped on each step + if trainer.amp_backend == AMPType.APEX: + optimizer = cls(optimizer) + optimizer._on_trainer_init(trainer) + else: + optimizer = trainer.lightning_optimizers[opt_idx] return optimizer def _accumulated_batches_reached(self): @@ -147,7 +155,7 @@ def __optimizer_step(self, *args, closure: Optional[Callable] = None, profiler_n **kwargs ) - trainer.train_loop.on_before_zero_grad(self) + trainer.train_loop.on_before_zero_grad(optimizer) model.optimizer_zero_grad( trainer.current_epoch, diff --git a/pytorch_lightning/core/saving.py b/pytorch_lightning/core/saving.py index 1761fc0135f3f..9180ab489cdf5 100644 --- a/pytorch_lightning/core/saving.py +++ b/pytorch_lightning/core/saving.py @@ -17,16 +17,19 @@ import inspect import os from argparse import Namespace -from typing import Union, Dict, Any, Optional, Callable, MutableMapping, IO +from copy import deepcopy +from functools import partial +from typing import Any, Callable, Dict, IO, MutableMapping, Optional, Union from warnings import warn import torch import yaml from pytorch_lightning import _logger as log -from pytorch_lightning.utilities import rank_zero_warn, AttributeDict, _OMEGACONF_AVAILABLE -from pytorch_lightning.utilities.cloud_io import load as pl_load +from pytorch_lightning.utilities import AttributeDict, rank_zero_warn, _OMEGACONF_AVAILABLE +from pytorch_lightning.utilities.apply_func import apply_to_collection from pytorch_lightning.utilities.cloud_io import get_filesystem +from pytorch_lightning.utilities.cloud_io import load as pl_load from pytorch_lightning.utilities.parsing import parse_class_init_keys PRIMITIVE_TYPES = (bool, int, float, str) @@ -34,6 +37,9 @@ if _OMEGACONF_AVAILABLE: from omegaconf import OmegaConf + from omegaconf.dictconfig import DictConfig + from omegaconf.errors import UnsupportedValueType, ValidationError + # the older shall be on the top CHECKPOINT_PAST_HPARAMS_KEYS = ( @@ -321,9 +327,14 @@ def save_hparams_to_tags_csv(tags_csv: str, hparams: Union[dict, Namespace]) -> writer.writerow({"key": k, "value": v}) -def load_hparams_from_yaml(config_yaml: str) -> Dict[str, Any]: +def load_hparams_from_yaml(config_yaml: str, use_omegaconf: bool = True) -> Dict[str, Any]: """Load hparams from a file. 
+ Args: + config_yaml: Path to config yaml file + use_omegaconf: If both `OMEGACONF_AVAILABLE` and `use_omegaconf` are True, + the hparams will be converted to `DictConfig` if possible + >>> hparams = Namespace(batch_size=32, learning_rate=0.001, data_root='./any/path/here') >>> path_yaml = './testing-hparams.yaml' >>> save_hparams_to_yaml(path_yaml, hparams) @@ -338,9 +349,15 @@ def load_hparams_from_yaml(config_yaml: str) -> Dict[str, Any]: return {} with fs.open(config_yaml, "r") as fp: - tags = yaml.load(fp, Loader=yaml.UnsafeLoader) + hparams = yaml.load(fp, Loader=yaml.UnsafeLoader) - return tags + if _OMEGACONF_AVAILABLE: + if use_omegaconf: + try: + return OmegaConf.create(hparams) + except (UnsupportedValueType, ValidationError): + pass + return hparams def save_hparams_to_yaml(config_yaml, hparams: Union[dict, Namespace]) -> None: @@ -361,15 +378,16 @@ def save_hparams_to_yaml(config_yaml, hparams: Union[dict, Namespace]) -> None: # saving with OmegaConf objects if _OMEGACONF_AVAILABLE: - if OmegaConf.is_config(hparams): - with fs.open(config_yaml, "w", encoding="utf-8") as fp: - OmegaConf.save(hparams, fp, resolve=True) - return - for v in hparams.values(): - if OmegaConf.is_config(v): - with fs.open(config_yaml, "w", encoding="utf-8") as fp: - OmegaConf.save(OmegaConf.create(hparams), fp, resolve=True) + # deepcopy: hparams from user shouldn't be resolved + hparams = deepcopy(hparams) + to_container = partial(OmegaConf.to_container, resolve=True) + hparams = apply_to_collection(hparams, DictConfig, to_container) + with fs.open(config_yaml, "w", encoding="utf-8") as fp: + try: + OmegaConf.save(hparams, fp) return + except (UnsupportedValueType, ValidationError): + pass if not isinstance(hparams, dict): raise TypeError("hparams must be dictionary") diff --git a/pytorch_lightning/plugins/ddp_plugin.py b/pytorch_lightning/plugins/ddp_plugin.py index 5557e3269f08b..ad9fb1cc3b58f 100644 --- a/pytorch_lightning/plugins/ddp_plugin.py +++ b/pytorch_lightning/plugins/ddp_plugin.py @@ -110,16 +110,15 @@ def init_ddp_connection( torch_backend, rank=global_rank, world_size=world_size ) + @property + def is_running_single_process_per_device(self) -> bool: + # objects do not need to be scattered in single process per device, move objects upfront to device + # This property is used in ``self.on_before_forward`` function. + return self.device_ids is not None and len(self.device_ids) == 1 + def on_before_forward(self, model: LightningModule, *args): """ - Override to handle custom input to device logic. For DDP, no logic is required as this is handled internally - within the DDP wrapper. - - Example:: - - def on_before_forward(self, model, *args): - batch, batch_idx = args - return batch.to(model.device) + Override to handle custom edge case. Args: args: Inputs to the model. @@ -128,6 +127,8 @@ def on_before_forward(self, model, *args): Returns: args moved to correct device if needed. 
""" + if self.is_running_single_process_per_device: + args = model.transfer_batch_to_device(args, model.device) return args def optimizer_state(self, optimizer: Optimizer) -> dict: diff --git a/pytorch_lightning/plugins/ddp_sequential_plugin.py b/pytorch_lightning/plugins/ddp_sequential_plugin.py index f2250c90edb17..f8dcecd1e546d 100644 --- a/pytorch_lightning/plugins/ddp_sequential_plugin.py +++ b/pytorch_lightning/plugins/ddp_sequential_plugin.py @@ -15,12 +15,12 @@ from typing import Any, List, Optional import torch -import torch.distributed as torch_distrib from torch import nn +import torch.distributed as torch_distrib from torch.nn.parallel import DistributedDataParallel -from pytorch_lightning import LightningModule from pytorch_lightning import _logger as log +from pytorch_lightning import LightningModule from pytorch_lightning.plugins.rpc_plugin import RPCPlugin from pytorch_lightning.utilities import _FAIRSCALE_PIPE_AVAILABLE, rank_zero_only from pytorch_lightning.utilities.exceptions import MisconfigurationException @@ -383,7 +383,6 @@ def register_optimizers(ctx, model): model.trainer.optimizers = optimizers model.trainer.lr_schedulers = lr_schedulers model.trainer.optimizer_frequencies = optimizer_frequencies - model.trainer.convert_to_lightning_optimizers() def run_optimizer(ctx, model): diff --git a/pytorch_lightning/plugins/native_amp.py b/pytorch_lightning/plugins/native_amp.py index 4df5d128476a4..9df1ba3262afa 100644 --- a/pytorch_lightning/plugins/native_amp.py +++ b/pytorch_lightning/plugins/native_amp.py @@ -16,6 +16,7 @@ import torch from torch.optim import Optimizer +from pytorch_lightning.core.optimizer import LightningOptimizer from pytorch_lightning.plugins.precision_plugin import PrecisionPlugin @@ -52,7 +53,10 @@ def backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs): # unscale gradient to allow analyze within `on_after_backward` if not self.trainer.train_loop.should_accumulate() and automatic_optimization: - self.trainer.scaler.unscale_(optimizer) + if isinstance(optimizer, LightningOptimizer): + self.trainer.scaler.unscale_(optimizer.optimizer) + else: + self.trainer.scaler.unscale_(optimizer) return closure_loss diff --git a/pytorch_lightning/plugins/sharded_plugin.py b/pytorch_lightning/plugins/sharded_plugin.py index ec1500ca7abf4..53439ebc2a3df 100644 --- a/pytorch_lightning/plugins/sharded_plugin.py +++ b/pytorch_lightning/plugins/sharded_plugin.py @@ -42,9 +42,6 @@ def optimizer_state(self, optimizer: 'OSS') -> Optional[dict]: optimizer.consolidate_state_dict() return self._optim_state_dict(optimizer) - def on_before_forward(self, model: LightningModule, *args): - return model.transfer_batch_to_device(args, model.trainer.root_gpu) - def _check_fairscale(self): if not _FAIRSCALE_AVAILABLE: raise MisconfigurationException( @@ -66,7 +63,7 @@ def _reinit_with_fairscale_oss(self, trainer): optimizers = trainer.optimizers for x, optimizer in enumerate(optimizers): if is_lightning_optimizer(optimizer): - optimizer = optimizer._optimizer + optimizer = optimizer.optimizer if not isinstance(optimizer, OSS): optim_class = type(optimizer) zero_optimizer = OSS( @@ -76,7 +73,6 @@ def _reinit_with_fairscale_oss(self, trainer): ) optimizers[x] = zero_optimizer del optimizer - trainer.convert_to_lightning_optimizers() def get_model_from_plugin( self, diff --git a/pytorch_lightning/trainer/configuration_validator.py b/pytorch_lightning/trainer/configuration_validator.py index 80d4c8952a1f3..12aa27279aee4 100644 --- 
a/pytorch_lightning/trainer/configuration_validator.py +++ b/pytorch_lightning/trainer/configuration_validator.py @@ -72,17 +72,7 @@ def __verify_train_loop_configuration(self, model): trainer.overriden_optimizer_step = is_overridden('optimizer_step', model) trainer.overriden_optimizer_zero_grad = is_overridden('optimizer_zero_grad', model) - - enable_pl_optimizer = trainer._enable_pl_optimizer automatic_optimization = trainer.train_loop.automatic_optimization - if trainer.overriden_optimizer_step and not enable_pl_optimizer and automatic_optimization: - rank_zero_warn( - "When overriding `LightningModule` optimizer_step with" - " `Trainer(..., enable_pl_optimizer=False, ...)`," - " we won't be calling `.zero_grad` we can't assume when you call your `optimizer.step()`." - " For Lightning to take care of it, please use `Trainer(enable_pl_optimizer=True)`." - ) - going_to_accumulate_grad_batches = trainer.accumulation_scheduler.going_to_accumulate_grad_batches() has_overriden_optimization_functions = trainer.overriden_optimizer_step or trainer.overriden_optimizer_zero_grad @@ -93,13 +83,6 @@ def __verify_train_loop_configuration(self, model): ' It ensures optimizer_step or optimizer_zero_grad are called on every batch.' ) - if (enable_pl_optimizer) and trainer.overriden_optimizer_zero_grad and not automatic_optimization: - raise MisconfigurationException( - 'When overriding `LightningModule` optimizer_zero_grad' - ' and preserving model property `automatic_optimization` as True with' - ' `Trainer(enable_pl_optimizer=True, ...) is not supported' - ) - def __verify_eval_loop_configuration(self, model, eval_loop_name): step_name = f'{eval_loop_name}_step' diff --git a/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py b/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py index 8a410d5e8e3e2..47355c8d097ad 100644 --- a/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py +++ b/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py @@ -205,7 +205,7 @@ def cache_training_step_metrics(self, opt_closure_result): self._logged_metrics.update(logged_metrics_tmp) self.cached_results.legacy_batch_log_metrics.update(logged_metrics_tmp) - def log_metrics(self, metrics, grad_norm_dic, step=None, log_train_step_metrics=False): + def log_metrics(self, metrics, grad_norm_dic, step=None): """Logs the metric dict passed in. 
If `step` parameter is None and `step` key is presented is metrics, uses metrics["step"] as a step @@ -234,11 +234,8 @@ def log_metrics(self, metrics, grad_norm_dic, step=None, log_train_step_metrics= elif step is None: # added metrics by Lightning for convenience - if log_train_step_metrics: - step = self.trainer.total_batch_idx - else: - scalar_metrics['epoch'] = self.trainer.current_epoch - step = self.trainer.global_step + scalar_metrics['epoch'] = self.trainer.current_epoch + step = self.trainer.global_step # log actual metrics if self.trainer.logger is not None: @@ -622,6 +619,8 @@ def __gather_result_across_time_and_optimizers(self, epoch_output): return gathered_epoch_outputs def log_train_step_metrics(self, batch_output): + if self.trainer.train_loop.should_accumulate() and self.trainer.train_loop.automatic_optimization: + return _, batch_log_metrics = self.cached_results.update_logger_connector() # when metrics should be logged if self.should_update_logs or self.trainer.fast_dev_run is True: @@ -630,5 +629,5 @@ def log_train_step_metrics(self, batch_output): if grad_norm_dic is None: grad_norm_dic = {} if len(batch_log_metrics) > 0 or len(grad_norm_dic) > 0: - self.log_metrics(batch_log_metrics, grad_norm_dic, log_train_step_metrics=True) + self.log_metrics(batch_log_metrics, grad_norm_dic) self._callback_metrics.update(batch_log_metrics) diff --git a/pytorch_lightning/trainer/connectors/optimizer_connector.py b/pytorch_lightning/trainer/connectors/optimizer_connector.py index 8c352c8e5ffeb..8b23203e42bc3 100644 --- a/pytorch_lightning/trainer/connectors/optimizer_connector.py +++ b/pytorch_lightning/trainer/connectors/optimizer_connector.py @@ -20,7 +20,11 @@ def __init__(self, trainer): self.trainer = trainer def on_trainer_init(self, enable_pl_optimizer): - self.trainer._enable_pl_optimizer = enable_pl_optimizer + if enable_pl_optimizer is not None: + rank_zero_warn( + "Trainer argument `enable_pl_optimizer` is deprecated in v1.1.3. 
It will be removed in v1.3.0", + DeprecationWarning + ) self.trainer.lr_schedulers = [] self.trainer.optimizers = [] self.trainer.optimizer_frequencies = [] diff --git a/pytorch_lightning/trainer/connectors/precision_connector.py b/pytorch_lightning/trainer/connectors/precision_connector.py index 78f1635fb7f4d..4633e328cb3fa 100644 --- a/pytorch_lightning/trainer/connectors/precision_connector.py +++ b/pytorch_lightning/trainer/connectors/precision_connector.py @@ -67,7 +67,6 @@ def _setup_amp_backend(self, amp_type: str): self.trainer.amp_backend = AMPType.APEX self.backend = ApexPlugin(self.trainer) log.warn("LightningOptimizer doesn't support Apex") - self.trainer._enable_pl_optimizer = False if not self.trainer.amp_backend: raise ModuleNotFoundError( diff --git a/pytorch_lightning/trainer/optimizers.py b/pytorch_lightning/trainer/optimizers.py index 919042516ad50..2aaed17e9818c 100644 --- a/pytorch_lightning/trainer/optimizers.py +++ b/pytorch_lightning/trainer/optimizers.py @@ -88,8 +88,10 @@ def _convert_to_lightning_optimizer(trainer, optimizer): optimizer._on_trainer_init(trainer) return optimizer - if self._enable_pl_optimizer: - self.optimizers = [_convert_to_lightning_optimizer(self, opt) for opt in self.optimizers] + self._lightning_optimizers = { + opt_idx: _convert_to_lightning_optimizer(self, opt) + for opt_idx, opt in enumerate(self.optimizers) + } def configure_schedulers(self, schedulers: list, monitor: Optional[str] = None): # Convert each scheduler into dict structure with relevant information diff --git a/pytorch_lightning/trainer/properties.py b/pytorch_lightning/trainer/properties.py index c32b24458c297..eb8e47ce93195 100644 --- a/pytorch_lightning/trainer/properties.py +++ b/pytorch_lightning/trainer/properties.py @@ -20,7 +20,6 @@ from pytorch_lightning.accelerators.accelerator import Accelerator from pytorch_lightning.callbacks import Callback, EarlyStopping, ModelCheckpoint, ProgressBarBase from pytorch_lightning.core.lightning import LightningModule -from pytorch_lightning.core.optimizer import is_lightning_optimizer from pytorch_lightning.loggers.base import LightningLoggerBase from pytorch_lightning.loggers.tensorboard import TensorBoardLogger from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector @@ -66,6 +65,7 @@ class TrainerProperties(ABC): callbacks: List[Callback] num_nodes: int num_processes: int + _lightning_optimizers = None @property def log_dir(self): @@ -267,16 +267,17 @@ def save_checkpoint(self, filepath, weights_only: bool = False): def get_model(self): return self.model_connector.get_model() + @property + def lightning_optimizers(self): + if self._lightning_optimizers is None: + self.convert_to_lightning_optimizers() + return self._lightning_optimizers + def __getstate__(self): - # unwrap optimizer - self.optimizers = [opt._optimizer if is_lightning_optimizer(opt) else opt for opt in self.optimizers] + # remove lightning_optimizers + self._lightning_optimizers = None return self.__dict__ - def __setstate__(self, d): - self.__dict__ = d - # wrap optimizers in enable_pl_optimzer is True - self.convert_to_lightning_optimizers() - @property def require_distributed_sampler(self): if self.accelerator_backend is not None: diff --git a/pytorch_lightning/trainer/trainer.py b/pytorch_lightning/trainer/trainer.py index 02977b1c1df65..f1499edb10db5 100644 --- a/pytorch_lightning/trainer/trainer.py +++ b/pytorch_lightning/trainer/trainer.py @@ -134,7 +134,7 @@ def __init__( distributed_backend: Optional[str] = None, 
automatic_optimization: Optional[bool] = None, move_metrics_to_cpu: bool = False, - enable_pl_optimizer: bool = False, + enable_pl_optimizer: bool = None, # todo: remove in v1.3 multiple_trainloader_mode: str = 'max_size_cycle', ): r""" @@ -285,7 +285,8 @@ def __init__( enable_pl_optimizer: If True, each optimizer will be wrapped by `pytorch_lightning.core.optimizer.LightningOptimizer`. It allows Lightning to - handle AMP, TPU, accumulated_gradients, etc.. + handle AMP, TPU, accumulated_gradients, etc. + .. warning:: Currently deprecated and it will be removed in v1.3 multiple_trainloader_mode: How to loop over the datasets when there are multiple train loaders. In 'max_size_cycle' mode, the trainer ends one epoch when the largest dataset is traversed, diff --git a/pytorch_lightning/trainer/training_loop.py b/pytorch_lightning/trainer/training_loop.py index 8c5cfce75ab81..e4ae2a717e8d5 100644 --- a/pytorch_lightning/trainer/training_loop.py +++ b/pytorch_lightning/trainer/training_loop.py @@ -21,6 +21,7 @@ from pytorch_lightning.callbacks import ModelCheckpoint from pytorch_lightning.core.lightning import LightningModule from pytorch_lightning.core.memory import ModelSummary +from pytorch_lightning.core.optimizer import LightningOptimizer from pytorch_lightning.core.step_result import Result from pytorch_lightning.trainer.states import TrainerState from pytorch_lightning.trainer.supporters import Accumulator, TensorRunningAccum @@ -499,6 +500,9 @@ def optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_ 'native PyTorch amp and lbfgs are not compatible.' ' To request, please file a Github issue in PyTorch and tag @mcarilli') + # wraps into LightingOptimizer only for running step + optimizer = LightningOptimizer._to_lightning_optimizer(optimizer, self.trainer, opt_idx) + # model hook model_ref.optimizer_step( self.trainer.current_epoch, diff --git a/pytorch_lightning/utilities/__init__.py b/pytorch_lightning/utilities/__init__.py index 0a5ed04eb72a3..bf6069230f115 100644 --- a/pytorch_lightning/utilities/__init__.py +++ b/pytorch_lightning/utilities/__init__.py @@ -31,6 +31,7 @@ _GROUP_AVAILABLE, _HOROVOD_AVAILABLE, _HYDRA_AVAILABLE, + _HYDRA_EXPERIMENTAL_AVAILABLE, _module_available, _NATIVE_AMP_AVAILABLE, _OMEGACONF_AVAILABLE, diff --git a/pytorch_lightning/utilities/apply_func.py b/pytorch_lightning/utilities/apply_func.py index cbbf84809b400..0fc23d4f80e1f 100644 --- a/pytorch_lightning/utilities/apply_func.py +++ b/pytorch_lightning/utilities/apply_func.py @@ -79,12 +79,14 @@ def apply_to_collection(data: Any, dtype: Union[type, tuple], function: Callable return function(data, *args, **kwargs) # Recursively apply to collection items - elif isinstance(data, Mapping): + if isinstance(data, Mapping): return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()}) - elif isinstance(data, tuple) and hasattr(data, '_fields'): # named tuple + + if isinstance(data, tuple) and hasattr(data, '_fields'): # named tuple return elem_type(*(apply_to_collection(d, dtype, function, *args, **kwargs) for d in data)) - elif isinstance(data, Sequence) and not isinstance(data, str): + + if isinstance(data, Sequence) and not isinstance(data, str): return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data]) # data is neither of dtype, nor a collection diff --git a/pytorch_lightning/utilities/imports.py b/pytorch_lightning/utilities/imports.py index acdebfbf239e4..7a65d32cb3ff1 100644 --- 
a/pytorch_lightning/utilities/imports.py +++ b/pytorch_lightning/utilities/imports.py @@ -45,6 +45,7 @@ def _module_available(module_path: str) -> bool: _NATIVE_AMP_AVAILABLE = _module_available("torch.cuda.amp") and hasattr(torch.cuda.amp, "autocast") _OMEGACONF_AVAILABLE = _module_available("omegaconf") _HYDRA_AVAILABLE = _module_available("hydra") +_HYDRA_EXPERIMENTAL_AVAILABLE = _module_available("hydra.experimental") _HOROVOD_AVAILABLE = _module_available("horovod.torch") _TORCHTEXT_AVAILABLE = _module_available("torchtext") _XLA_AVAILABLE = _module_available("torch_xla") diff --git a/pytorch_lightning/utilities/parsing.py b/pytorch_lightning/utilities/parsing.py index 521dd5200521a..e631f8715b806 100644 --- a/pytorch_lightning/utilities/parsing.py +++ b/pytorch_lightning/utilities/parsing.py @@ -115,7 +115,6 @@ def get_init_args(frame) -> dict: self_var, args_var, kwargs_var = parse_class_init_keys(cls) filtered_vars = [n for n in (self_var, args_var, kwargs_var) if n] exclude_argnames = (*filtered_vars, '__class__', 'frame', 'frame_args') - # only collect variables that appear in the signature local_args = {k: local_vars[k] for k in init_parameters.keys()} local_args.update(local_args.get(kwargs_var, {})) diff --git a/pytorch_lightning/utilities/seed.py b/pytorch_lightning/utilities/seed.py index 1ce782f967ebb..353112c1866b6 100644 --- a/pytorch_lightning/utilities/seed.py +++ b/pytorch_lightning/utilities/seed.py @@ -22,6 +22,7 @@ import torch from pytorch_lightning import _logger as log +from pytorch_lightning.utilities import rank_zero_warn def seed_everything(seed: Optional[int] = None) -> int: @@ -41,18 +42,17 @@ def seed_everything(seed: Optional[int] = None) -> int: try: if seed is None: - seed = os.environ.get("PL_GLOBAL_SEED", _select_seed_randomly(min_seed_value, max_seed_value)) + seed = os.environ.get("PL_GLOBAL_SEED") seed = int(seed) except (TypeError, ValueError): seed = _select_seed_randomly(min_seed_value, max_seed_value) + rank_zero_warn(f"No correct seed found, seed set to {seed}") - if (seed > max_seed_value) or (seed < min_seed_value): - log.warning( - f"{seed} is not in bounds, \ - numpy accepts from {min_seed_value} to {max_seed_value}" - ) + if not (min_seed_value <= seed <= max_seed_value): + rank_zero_warn(f"{seed} is not in bounds, numpy accepts from {min_seed_value} to {max_seed_value}") seed = _select_seed_randomly(min_seed_value, max_seed_value) + log.info(f"Global seed set to {seed}") os.environ["PL_GLOBAL_SEED"] = str(seed) random.seed(seed) np.random.seed(seed) @@ -62,6 +62,4 @@ def seed_everything(seed: Optional[int] = None) -> int: def _select_seed_randomly(min_seed_value: int = 0, max_seed_value: int = 255) -> int: - seed = random.randint(min_seed_value, max_seed_value) - log.warning(f"No correct seed found, seed set to {seed}") - return seed + return random.randint(min_seed_value, max_seed_value) diff --git a/requirements/devel.txt b/requirements/devel.txt index a8c5293c8c7db..dcf66495ee46f 100644 --- a/requirements/devel.txt +++ b/requirements/devel.txt @@ -8,4 +8,4 @@ -r ./test.txt # install all extra dependencies for running examples --r ./examples.txt \ No newline at end of file +-r ./examples.txt diff --git a/requirements/examples.txt b/requirements/examples.txt index c6ee7bff0dff5..83ceafe3c2934 100644 --- a/requirements/examples.txt +++ b/requirements/examples.txt @@ -1,2 +1,2 @@ torchvision>=0.5 -gym>=0.17.0 \ No newline at end of file +gym>=0.17.0 diff --git a/requirements/loggers.txt b/requirements/loggers.txt index 
3ec7b25db4643..001210855871d 100644 --- a/requirements/loggers.txt +++ b/requirements/loggers.txt @@ -3,4 +3,4 @@ neptune-client>=0.4.109 comet-ml>=3.1.12 mlflow>=1.0.0 test_tube>=0.7.5 -wandb>=0.8.21 \ No newline at end of file +wandb>=0.8.21 diff --git a/requirements/test.txt b/requirements/test.txt index 4da33ac9ed3ab..80ef988e70eeb 100644 --- a/requirements/test.txt +++ b/requirements/test.txt @@ -6,7 +6,7 @@ pytest>=5.0 flake8>=3.6 flake8-black check-manifest -twine==1.13.0 +twine==3.2 # scipy>=0.13.3 scikit-learn>=0.22.2 scikit-image>=0.17.2 diff --git a/setup.py b/setup.py index 961540fb961ec..2993c96c23526 100755 --- a/setup.py +++ b/setup.py @@ -69,7 +69,7 @@ url=pytorch_lightning.__homepage__, download_url='https://github.com/PyTorchLightning/pytorch-lightning', license=pytorch_lightning.__license__, - packages=find_packages(exclude=['tests', 'tests/*', 'benchmarks']), + packages=find_packages(exclude=['tests', 'tests/*', 'benchmarks', 'legacy', 'legacy/*']), long_description=_load_readme_description(PATH_ROOT), long_description_content_type='text/markdown', diff --git a/tests/README.md b/tests/README.md index 8ef006c4d879a..7b857a1901fd7 100644 --- a/tests/README.md +++ b/tests/README.md @@ -33,8 +33,8 @@ The GPU machine must have: 3. [Horovod with NCCL](https://horovod.readthedocs.io/en/stable/gpus_include.html) support: `HOROVOD_GPU_OPERATIONS=NCCL pip install horovod` -## Running Coverage -Make sure to run coverage on a GPU machine with at least 2 GPUs and NVIDIA apex installed. +## Running Coverage +Make sure to run coverage on a GPU machine with at least 2 GPUs and NVIDIA apex installed. ```bash cd pytorch-lightning diff --git a/tests/__init__.py b/tests/__init__.py index e0ec83a2efbca..57feda6280c38 100644 --- a/tests/__init__.py +++ b/tests/__init__.py @@ -18,6 +18,8 @@ _TEST_ROOT = os.path.dirname(__file__) _PROJECT_ROOT = os.path.dirname(_TEST_ROOT) _TEMP_PATH = os.path.join(_PROJECT_ROOT, 'test_temp') +DATASETS_PATH = os.path.join(_PROJECT_ROOT, 'Datasets') +LEGACY_PATH = os.path.join(_PROJECT_ROOT, 'legacy') # todo: this setting `PYTHONPATH` may not be used by other evns like Conda for import packages if _PROJECT_ROOT not in os.getenv('PYTHONPATH', ""): diff --git a/tests/callbacks/test_callbacks.py b/tests/callbacks/test_callbacks.py index 53d6f80d9d7bf..b12d0c2884106 100644 --- a/tests/callbacks/test_callbacks.py +++ b/tests/callbacks/test_callbacks.py @@ -33,8 +33,6 @@ def test_trainer_callback_system(torch_save): limit_train_batches=3, limit_test_batches=2, progress_bar_refresh_rate=0, - # todo: enabled since internally we wrap the model for optimizer step, this should be fixed - enable_pl_optimizer=True ) # no call yet diff --git a/tests/checkpointing/test_legacy_checkpoints.py b/tests/checkpointing/test_legacy_checkpoints.py new file mode 100644 index 0000000000000..42623cb4df1ec --- /dev/null +++ b/tests/checkpointing/test_legacy_checkpoints.py @@ -0,0 +1,72 @@ +# Copyright The PyTorch Lightning team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
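+# Assumes the legacy checkpoints are already present locally under legacy/checkpoints/,
+# e.g. after downloading and unzipping the public checkpoints.zip archive as described
+# in legacy/README.md; each parametrized version below is then loaded with
+# DummyModel.load_from_checkpoint and fitted again to make sure training still runs.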
+import glob +import os +import sys + +import pytest + +from pytorch_lightning import Trainer +from tests import LEGACY_PATH + +LEGACY_CHECKPOINTS_PATH = os.path.join(LEGACY_PATH, 'checkpoints') +CHECKPOINT_EXTENSION = ".ckpt" + + +# todo: add more legacy checkpoints - for < v0.8 +@pytest.mark.parametrize("pl_version", [ + # "0.8.1", + "0.8.3", + "0.8.4", + # "0.8.5", # this version has problem with loading on PT<=1.4 as it seems to be archive + # "0.9.0", # this version has problem with loading on PT<=1.4 as it seems to be archive + "0.10.0", + "1.0.0", + "1.0.1", + "1.0.2", + "1.0.3", + "1.0.4", + "1.0.5", + "1.0.6", + "1.0.7", + "1.0.8", + "1.1.0", + "1.1.1", + "1.1.2", + "1.1.3", +]) +def test_resume_legacy_checkpoints(tmpdir, pl_version): + path_dir = os.path.join(LEGACY_CHECKPOINTS_PATH, pl_version) + + # todo: make this as mock, so it is cleaner... + orig_sys_paths = list(sys.path) + sys.path.insert(0, path_dir) + from zero_training import DummyModel + + path_ckpts = sorted(glob.glob(os.path.join(path_dir, f'*{CHECKPOINT_EXTENSION}'))) + assert path_ckpts, 'No checkpoints found in folder "%s"' % path_dir + path_ckpt = path_ckpts[-1] + + model = DummyModel.load_from_checkpoint(path_ckpt) + trainer = Trainer(default_root_dir=tmpdir, max_epochs=6) + result = trainer.fit(model) + assert result + + # todo + # model = DummyModel() + # trainer = Trainer(default_root_dir=tmpdir, max_epochs=1, resume_from_checkpoint=path_ckpt) + # result = trainer.fit(model) + # assert result + + sys.path = orig_sys_paths diff --git a/tests/checkpointing/test_model_checkpoint.py b/tests/checkpointing/test_model_checkpoint.py index 9e757488b255d..34bc657bae595 100644 --- a/tests/checkpointing/test_model_checkpoint.py +++ b/tests/checkpointing/test_model_checkpoint.py @@ -561,8 +561,7 @@ def validation_epoch_end(self, outputs): @mock.patch.dict(os.environ, {"PL_DEV_DEBUG": "1"}) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_checkpoint_repeated_strategy(enable_pl_optimizer, tmpdir): +def test_checkpoint_repeated_strategy(tmpdir): """ This test validates that the checkpoint can be called when provided to callbacks list """ @@ -582,7 +581,6 @@ def validation_step(self, batch, batch_idx): limit_val_batches=2, limit_test_batches=2, callbacks=[checkpoint_callback], - enable_pl_optimizer=enable_pl_optimizer, weights_summary=None, progress_bar_refresh_rate=0, ) @@ -599,7 +597,6 @@ def validation_step(self, batch, batch_idx): limit_val_batches=2, limit_test_batches=2, resume_from_checkpoint=checkpoint_callback.best_model_path, - enable_pl_optimizer=enable_pl_optimizer, weights_summary=None, progress_bar_refresh_rate=0, ) @@ -610,8 +607,7 @@ def validation_step(self, batch, batch_idx): @mock.patch.dict(os.environ, {"PL_DEV_DEBUG": "1"}) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_checkpoint_repeated_strategy_extended(enable_pl_optimizer, tmpdir): +def test_checkpoint_repeated_strategy_extended(tmpdir): """ This test validates checkpoint can be called several times without increasing internally its global step if nothing run. 
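The legacy-checkpoint test above leaves resuming via `resume_from_checkpoint` as a todo. The sketch below shows roughly what that path could look like when run locally; the version string and relative paths are illustrative, and the `sys.path` insertion mirrors what the test itself does to import the frozen `DummyModel`.

```python
import glob
import os
import sys

from pytorch_lightning import Trainer

version = "1.1.3"  # illustrative; any folder present under legacy/checkpoints/ works
path_dir = os.path.join("legacy", "checkpoints", version)

# generate_checkpoints.sh copies the frozen zero_training.py next to each checkpoint
sys.path.insert(0, path_dir)
from zero_training import DummyModel  # noqa: E402

ckpt = sorted(glob.glob(os.path.join(path_dir, "*.ckpt")))[-1]

# train past the epoch stored in the checkpoint so that resuming actually does some work
trainer = Trainer(max_epochs=6, resume_from_checkpoint=ckpt)
trainer.fit(DummyModel())
```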
@@ -656,7 +652,6 @@ def assert_checkpoint_log_dir(idx): limit_train_batches=limit_train_batches, limit_val_batches=3, limit_test_batches=4, - enable_pl_optimizer=enable_pl_optimizer, callbacks=[checkpoint_cb], ) trainer = pl.Trainer(**trainer_config) diff --git a/tests/checkpointing/test_torch_saving.py b/tests/checkpointing/test_torch_saving.py index a15d425f5a0e7..ca3afd4e9e5e2 100644 --- a/tests/checkpointing/test_torch_saving.py +++ b/tests/checkpointing/test_torch_saving.py @@ -18,19 +18,16 @@ import torch from pytorch_lightning import Trainer -from pytorch_lightning.core.optimizer import LightningOptimizer from tests.base import BoringModel -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_model_torch_save(tmpdir, enable_pl_optimizer): +def test_model_torch_save(tmpdir): """Test to ensure torch save does not fail for model and trainer.""" model = BoringModel() num_epochs = 1 trainer = Trainer( default_root_dir=tmpdir, max_epochs=num_epochs, - enable_pl_optimizer=enable_pl_optimizer, ) temp_path = os.path.join(tmpdir, 'temp.pt') trainer.fit(model) @@ -39,8 +36,6 @@ def test_model_torch_save(tmpdir, enable_pl_optimizer): torch.save(trainer.model, temp_path) torch.save(trainer, temp_path) trainer = torch.load(temp_path) - is_lightning_optimizer = isinstance(trainer.optimizers[0], LightningOptimizer) - assert is_lightning_optimizer if enable_pl_optimizer else not is_lightning_optimizer @pytest.mark.skipif(platform.system() == "Windows", reason="Distributed training is not supported on Windows") diff --git a/tests/core/test_lightning_module.py b/tests/core/test_lightning_module.py index 9d45310a1de54..9cea8cf28c07f 100644 --- a/tests/core/test_lightning_module.py +++ b/tests/core/test_lightning_module.py @@ -41,8 +41,7 @@ def optimizer_step(self, *_, **__): trainer.fit(model) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_automatic_optimization_num_calls(enable_pl_optimizer, tmpdir): +def test_automatic_optimization_num_calls(tmpdir): with patch("torch.optim.SGD.step") as sgd_step, \ patch("torch.optim.SGD.zero_grad") as sgd_zero_grad, \ @@ -71,16 +70,12 @@ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, if batch_idx % 2 == 0: assert isinstance(optimizer, SGD) optimizer.step(closure=optimizer_closure) - if not enable_pl_optimizer: - optimizer.zero_grad() # update discriminator opt every 4 steps if optimizer_idx == 1: if batch_idx % 4 == 0: assert isinstance(optimizer, Adam) optimizer.step(closure=optimizer_closure) - if not enable_pl_optimizer: - optimizer.zero_grad() model = TestModel() model.training_epoch_end = None @@ -91,7 +86,6 @@ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, limit_train_batches=8, limit_val_batches=1, accumulate_grad_batches=1, - enable_pl_optimizer=enable_pl_optimizer ) trainer.fit(model) @@ -102,8 +96,7 @@ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, assert adam_zero_grad.call_count == 2 -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_params_groups_and_state_are_accessible(enable_pl_optimizer, tmpdir): +def test_params_groups_and_state_are_accessible(tmpdir): class TestModel(BoringModel): @@ -136,7 +129,6 @@ def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_idx, clos limit_train_batches=8, limit_val_batches=1, accumulate_grad_batches=1, - enable_pl_optimizer=enable_pl_optimizer ) trainer.fit(model) diff --git a/tests/core/test_lightning_optimizer.py b/tests/core/test_lightning_optimizer.py index 
30330729d14ba..d32ef1ab69ab4 100644 --- a/tests/core/test_lightning_optimizer.py +++ b/tests/core/test_lightning_optimizer.py @@ -40,13 +40,12 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) groups = "{'dampening': 0, 'initial_lr': 0.1, 'lr': 0.01, 'momentum': 0, 'nesterov': False, 'weight_decay': 0}" expected = f"LightningSGD(groups=[{groups}])" - assert trainer.optimizers[0].__repr__() == expected + assert trainer._lightning_optimizers[0].__repr__() == expected def test_lightning_optimizer_from_user(tmpdir): @@ -68,13 +67,12 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) groups = "{'amsgrad': False, 'betas': (0.9, 0.999), 'eps': 1e-08, 'initial_lr': 0.1, 'lr': 0.01, 'weight_decay': 0}" expected = f"LightningAdam(groups=[{groups}])" - assert trainer.optimizers[0].__repr__() == expected + assert trainer._lightning_optimizers[0].__repr__() == expected @patch("torch.optim.Adam.step", autospec=True) @@ -122,7 +120,6 @@ def automatic_optimization(self) -> bool: limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -176,7 +173,6 @@ def automatic_optimization(self) -> bool: max_epochs=1, weights_summary=None, accumulate_grad_batches=2, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -259,7 +255,6 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -312,7 +307,6 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -372,7 +366,6 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -425,7 +418,6 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, - enable_pl_optimizer=True, ) trainer.fit(model) diff --git a/tests/deprecated_api/test_remove_1-3.py b/tests/deprecated_api/test_remove_1-3.py index 3deb4e219fcee..ff442f192c887 100644 --- a/tests/deprecated_api/test_remove_1-3.py +++ b/tests/deprecated_api/test_remove_1-3.py @@ -134,3 +134,8 @@ def test_v1_3_0_trainer_cli_profiler(cli_args, expected_parsed_arg, expected_pro assert getattr(args, "profiler") == expected_parsed_arg trainer = Trainer.from_argparse_args(args) assert isinstance(trainer.profiler, expected_profiler) + + +def test_trainer_enable_pl_optimizer(tmpdir): + with pytest.deprecated_call(match='will be removed in v1.3'): + Trainer(enable_pl_optimizer=True) diff --git a/tests/loggers/test_all.py b/tests/loggers/test_all.py index d250f8a12d85e..780c37b3c19e0 100644 --- a/tests/loggers/test_all.py +++ b/tests/loggers/test_all.py @@ -142,7 +142,7 @@ def log_metrics(self, metrics, step): if logger_class == TensorBoardLogger: expected = [ (0, ['hp_metric']), - (0, ['train_some_val']), + (0, ['epoch', 'train_some_val']), (0, ['early_stop_on', 'epoch', 'val_loss']), (0, ['hp_metric']), (1, ['epoch', 'test_loss']) @@ -150,7 +150,7 @@ def log_metrics(self, metrics, step): assert log_metric_names == expected else: expected = [ - (0, ['train_some_val']), + (0, ['epoch', 'train_some_val']), (0, ['early_stop_on', 'epoch', 'val_loss']), (1, ['epoch', 'test_loss']) ] diff --git a/tests/loggers/test_tensorboard.py b/tests/loggers/test_tensorboard.py index b9ee43ee71578..f10a491ac696b 100644 --- 
a/tests/loggers/test_tensorboard.py +++ b/tests/loggers/test_tensorboard.py @@ -213,8 +213,11 @@ def test_tensorboard_with_accummulated_gradients(mock_log_metrics, expected, tmp Tests to ensure that tensorboard log properly when accumulated_gradients > 1 """ class TestModel(BoringModel): - _count = 0 - _indexes = [] + + def __init__(self): + super().__init__() + self._count = 0 + self._indexes = [] def training_step(self, batch, batch_idx): output = self.layer(batch) @@ -222,10 +225,10 @@ def training_step(self, batch, batch_idx): self.log('count', self._count, on_step=True, on_epoch=True) self.log('loss', loss, on_step=True, on_epoch=True) - if self.trainer.logger_connector.should_update_logs: - self._indexes.append(self._count) + if not self.trainer.train_loop.should_accumulate(): + if self.trainer.logger_connector.should_update_logs: + self._indexes.append(self.trainer.global_step) - self._count += 1 return loss def validation_step(self, batch, batch_idx): @@ -245,14 +248,13 @@ def configure_optimizers(self): logger_0 = TensorBoardLogger(tmpdir, default_hp_metric=False) - accumulate_grad_batches = 2 trainer = Trainer( default_root_dir=tmpdir, limit_train_batches=12, - limit_val_batches=12, + limit_val_batches=0, max_epochs=3, gpus=0, - accumulate_grad_batches=accumulate_grad_batches, + accumulate_grad_batches=2, logger=[logger_0], log_every_n_steps=3, ) @@ -260,5 +262,6 @@ def configure_optimizers(self): mock_count_epochs = [m[2]["step"] for m in mock_log_metrics.mock_calls if "count_epoch" in m[2]["metrics"]] assert mock_count_epochs == expected + mock_count_steps = [m[2]["step"] for m in mock_log_metrics.mock_calls if "count_step" in m[2]["metrics"]] assert model._indexes == mock_count_steps diff --git a/tests/models/conf/config.yaml b/tests/models/conf/config.yaml new file mode 100644 index 0000000000000..faf751c24f6cb --- /dev/null +++ b/tests/models/conf/config.yaml @@ -0,0 +1,17 @@ +# Copyright The PyTorch Lightning team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
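One note on the `TestModel` change above (general Python behaviour, not anything Lightning-specific): moving `_count` and `_indexes` from class level into `__init__` matters because a mutable attribute defined on the class body is shared by every instance. A minimal sketch with hypothetical class names:

class Shared:
    _indexes = []            # a single list object shared by all instances


class PerInstance:
    def __init__(self):
        self._indexes = []   # a fresh list per instance


a, b = Shared(), Shared()
a._indexes.append(1)
assert b._indexes == [1]     # state leaks between instances (and between test runs)

c, d = PerInstance(), PerInstance()
c._indexes.append(1)
assert d._indexes == []      # isolated, which is what the updated test relies on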
+defaults: + - training: default + +log: ${training.log} diff --git a/tests/models/conf/training/default.yaml b/tests/models/conf/training/default.yaml new file mode 100644 index 0000000000000..2c35b22365420 --- /dev/null +++ b/tests/models/conf/training/default.yaml @@ -0,0 +1,2 @@ +# @package training +log: "Something" diff --git a/tests/models/test_amp.py b/tests/models/test_amp.py index 55d32cc662701..c9f6ea05ad2b8 100644 --- a/tests/models/test_amp.py +++ b/tests/models/test_amp.py @@ -144,8 +144,7 @@ def test_amp_gpu_ddp_slurm_managed(tmpdir): assert trainer.slurm_connector.resolve_root_node_address('abc[23-24, 45-40, 40]') == 'abc23' -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_cpu_model_with_amp(enable_pl_optimizer, tmpdir): +def test_cpu_model_with_amp(tmpdir): """Make sure model trains on CPU.""" trainer_options = dict( default_root_dir=tmpdir, @@ -154,7 +153,6 @@ def test_cpu_model_with_amp(enable_pl_optimizer, tmpdir): limit_train_batches=0.4, limit_val_batches=0.4, precision=16, - enable_pl_optimizer=enable_pl_optimizer, ) model = EvalModelTemplate() diff --git a/tests/models/test_cpu.py b/tests/models/test_cpu.py index c39e84c21dfa3..5f69c55a086e6 100644 --- a/tests/models/test_cpu.py +++ b/tests/models/test_cpu.py @@ -23,11 +23,10 @@ from pytorch_lightning import Trainer from pytorch_lightning.callbacks import Callback, EarlyStopping, ModelCheckpoint from pytorch_lightning.trainer.states import TrainerState -from tests.base import BoringModel +from tests.base import BoringModel, EvalModelTemplate -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_cpu_slurm_save_load(enable_pl_optimizer, tmpdir): +def test_cpu_slurm_save_load(tmpdir): """Verify model save/load/checkpoint on CPU.""" model = BoringModel() @@ -43,7 +42,6 @@ def test_cpu_slurm_save_load(enable_pl_optimizer, tmpdir): limit_train_batches=0.2, limit_val_batches=0.2, callbacks=[ModelCheckpoint(dirpath=tmpdir)], - enable_pl_optimizer=enable_pl_optimizer, ) trainer.fit(model) real_global_step = trainer.global_step @@ -89,7 +87,6 @@ def on_train_epoch_start(self, trainer, model): default_root_dir=tmpdir, max_epochs=1, logger=logger, - enable_pl_optimizer=enable_pl_optimizer, callbacks=[_StartCallback(), ModelCheckpoint(dirpath=tmpdir)], ) # by calling fit again, we trigger training, loading weights from the cluster @@ -97,8 +94,7 @@ def on_train_epoch_start(self, trainer, model): trainer.fit(model) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_early_stopping_cpu_model(enable_pl_optimizer, tmpdir): +def test_early_stopping_cpu_model(tmpdir): class ModelTrainVal(BoringModel): def validation_epoch_end(self, outputs) -> None: val_loss = torch.stack([x["x"] for x in outputs]).mean() @@ -111,7 +107,6 @@ def validation_epoch_end(self, outputs) -> None: gradient_clip_val=1.0, overfit_batches=0.20, track_grad_norm=2, - enable_pl_optimizer=enable_pl_optimizer, progress_bar_refresh_rate=0, accumulate_grad_batches=2, limit_train_batches=0.1, @@ -131,8 +126,7 @@ def validation_epoch_end(self, outputs) -> None: @pytest.mark.skipif((platform.system() == "Darwin" and LooseVersion(torch.__version__) < LooseVersion("1.3.0")), reason="Distributed training is not supported on MacOS before Torch 1.3.0") -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_multi_cpu_model_ddp(enable_pl_optimizer, tmpdir): +def test_multi_cpu_model_ddp(tmpdir): """Make sure DDP works.""" tutils.set_random_master_port() @@ -145,7 +139,6 @@ def 
test_multi_cpu_model_ddp(enable_pl_optimizer, tmpdir): gpus=None, num_processes=2, accelerator='ddp_cpu', - enable_pl_optimizer=enable_pl_optimizer, ) model = BoringModel() @@ -300,6 +293,25 @@ def test_cpu_model(tmpdir): progress_bar_refresh_rate=0, max_epochs=1, limit_train_batches=0.4, + limit_val_batches=0.4 + ) + + model = EvalModelTemplate() + + tpipes.run_model_test(trainer_options, model, on_gpu=False) + + +def test_all_features_cpu_model(tmpdir): + """Test each of the trainer options.""" + trainer_options = dict( + default_root_dir=tmpdir, + gradient_clip_val=1.0, + overfit_batches=0.20, + track_grad_norm=2, + progress_bar_refresh_rate=0, + accumulate_grad_batches=2, + max_epochs=1, + limit_train_batches=0.4, limit_val_batches=0.4, ) diff --git a/tests/models/test_hooks.py b/tests/models/test_hooks.py index 1f25d46f82944..a25a8181e763a 100644 --- a/tests/models/test_hooks.py +++ b/tests/models/test_hooks.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import inspect +import os from unittest.mock import MagicMock import pytest @@ -20,7 +21,7 @@ from pytorch_lightning import Trainer from pytorch_lightning.accelerators.gpu_accelerator import GPUAccelerator from pytorch_lightning.trainer.states import TrainerState -from tests.base import BoringModel, EvalModelTemplate +from tests.base import BoringModel, EvalModelTemplate, RandomDataset @pytest.mark.parametrize('max_steps', [1, 2, 3]) @@ -125,6 +126,49 @@ def transfer_batch_to_device(self, data, device): assert batch_gpu.samples.device == batch_gpu.targets.device == expected +@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine") +@pytest.mark.skipif(not os.getenv("PL_RUNNING_SPECIAL_TESTS", '0') == '1', + reason="test should be run outside of pytest") +def test_transfer_batch_hook_ddp(tmpdir): + """ + Test custom data are properly moved to the right device using ddp + """ + + class CustomBatch: + + def __init__(self, data): + self.samples = data[0] + + def to(self, device, **kwargs): + self.samples = self.samples.to(device, **kwargs) + return self + + def collate_fn(batch): + return CustomBatch(batch) + + class TestModel(BoringModel): + def training_step(self, batch, batch_idx): + assert batch.samples.device == self.device + assert isinstance(batch_idx, int) + + def train_dataloader(self): + return torch.utils.data.DataLoader(RandomDataset(32, 64), collate_fn=collate_fn) + + model = TestModel() + model.validation_step = None + model.training_epoch_end = None + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=2, + limit_val_batches=0, + max_epochs=1, + weights_summary=None, + accelerator="ddp", + gpus=2, + ) + trainer.fit(model) + + @pytest.mark.parametrize( 'max_epochs,batch_idx_', [(2, 5), (3, 8), (4, 12)] diff --git a/tests/models/test_horovod.py b/tests/models/test_horovod.py index 7ac7cd235f392..752ce4b60d42f 100644 --- a/tests/models/test_horovod.py +++ b/tests/models/test_horovod.py @@ -69,8 +69,7 @@ def _run_horovod(trainer_options, on_gpu=False): @pytest.mark.skipif(platform.system() == "Windows", reason="Horovod is not supported on Windows") -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_horovod_cpu(enable_pl_optimizer, tmpdir): +def test_horovod_cpu(tmpdir): """Test Horovod running multi-process on CPU.""" trainer_options = dict( default_root_dir=str(tmpdir), @@ -82,14 +81,12 @@ def test_horovod_cpu(enable_pl_optimizer, tmpdir): limit_val_batches=0.2, 
accelerator='horovod', deterministic=True, - enable_pl_optimizer=enable_pl_optimizer, ) _run_horovod(trainer_options) @pytest.mark.skipif(platform.system() == "Windows", reason="Horovod is not supported on Windows") -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_horovod_cpu_implicit(enable_pl_optimizer, tmpdir): +def test_horovod_cpu_implicit(tmpdir): """Test Horovod without specifying a backend, inferring from env set by `horovodrun`.""" trainer_options = dict( default_root_dir=str(tmpdir), @@ -100,7 +97,6 @@ def test_horovod_cpu_implicit(enable_pl_optimizer, tmpdir): limit_train_batches=0.4, limit_val_batches=0.2, deterministic=True, - enable_pl_optimizer=enable_pl_optimizer, ) _run_horovod(trainer_options) @@ -206,8 +202,7 @@ def validation_step(self, batch, *args, **kwargs): @pytest.mark.skipif(platform.system() == "Windows", reason="Horovod is not supported on Windows") -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_horovod_multi_optimizer(enable_pl_optimizer, tmpdir): +def test_horovod_multi_optimizer(tmpdir): model = BasicGAN(**EvalModelTemplate.get_default_hparams()) # fit model @@ -219,7 +214,6 @@ def test_horovod_multi_optimizer(enable_pl_optimizer, tmpdir): limit_val_batches=0.2, deterministic=True, accelerator='horovod', - enable_pl_optimizer=enable_pl_optimizer, ) trainer.fit(model) assert trainer.state == TrainerState.FINISHED, f"Training failed with {trainer.state}" @@ -241,8 +235,7 @@ def get_optimizer_params(optimizer): @pytest.mark.skipif(not _HOROVOD_AVAILABLE, reason="Horovod is unavailable") @pytest.mark.skipif(platform.system() == "Windows", reason="Horovod is not supported on Windows") -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_result_reduce_horovod(enable_pl_optimizer, tmpdir): +def test_result_reduce_horovod(tmpdir): """Make sure result logging works with Horovod. 
This test mirrors tests/core/test_results.py::_ddp_test_fn @@ -282,7 +275,6 @@ def training_epoch_end(self, outputs) -> None: max_epochs=1, log_every_n_steps=1, weights_summary=None, - enable_pl_optimizer=enable_pl_optimizer, ) trainer.fit(model) diff --git a/tests/models/test_hparams.py b/tests/models/test_hparams.py index 7081d450ee256..df69089a83d9d 100644 --- a/tests/models/test_hparams.py +++ b/tests/models/test_hparams.py @@ -25,10 +25,14 @@ from torch.utils.data import DataLoader from pytorch_lightning import LightningModule, Trainer +from pytorch_lightning.callbacks import ModelCheckpoint from pytorch_lightning.core.saving import load_hparams_from_yaml, save_hparams_to_yaml -from pytorch_lightning.utilities import AttributeDict, is_picklable +from pytorch_lightning.utilities import _HYDRA_EXPERIMENTAL_AVAILABLE, AttributeDict, is_picklable from tests.base import BoringModel, EvalModelTemplate, TrialMNIST +if _HYDRA_EXPERIMENTAL_AVAILABLE: + from hydra.experimental import compose, initialize + class SaveHparamsModel(BoringModel): """ Tests that a model can take an object """ @@ -483,13 +487,13 @@ def test_hparams_save_yaml(tmpdir): path_yaml = os.path.join(tmpdir, 'testing-hparams.yaml') save_hparams_to_yaml(path_yaml, hparams) - assert load_hparams_from_yaml(path_yaml) == hparams + assert load_hparams_from_yaml(path_yaml, use_omegaconf=False) == hparams save_hparams_to_yaml(path_yaml, Namespace(**hparams)) - assert load_hparams_from_yaml(path_yaml) == hparams + assert load_hparams_from_yaml(path_yaml, use_omegaconf=False) == hparams save_hparams_to_yaml(path_yaml, AttributeDict(hparams)) - assert load_hparams_from_yaml(path_yaml) == hparams + assert load_hparams_from_yaml(path_yaml, use_omegaconf=False) == hparams save_hparams_to_yaml(path_yaml, OmegaConf.create(hparams)) assert load_hparams_from_yaml(path_yaml) == hparams @@ -636,3 +640,46 @@ def test_model_with_fsspec_as_parameter(tmpdir): ) trainer.fit(model) trainer.test() + + +@pytest.mark.skipif(not _HYDRA_EXPERIMENTAL_AVAILABLE, reason="Hydra experimental is not available") +def test_model_save_hyper_parameters_interpolation_with_hydra(tmpdir): + """ + This test relies on configuration saved under tests/models/conf/config.yaml + """ + + class TestHydraModel(BoringModel): + + def __init__(self, args_0, args_1, args_2, kwarg_1=None): + self.save_hyperparameters() + self.test_hparams() + config_file = f"{tmpdir}/hparams.yaml" + save_hparams_to_yaml(config_file, self.hparams) + self.hparams = load_hparams_from_yaml(config_file) + self.test_hparams() + super().__init__() + + def test_hparams(self): + assert self.hparams.args_0.log == "Something" + assert self.hparams.args_1['cfg'].log == "Something" + assert self.hparams.args_2[0].log == "Something" + assert self.hparams.kwarg_1['cfg'][0].log == "Something" + + with initialize(config_path="conf"): + args_0 = compose(config_name="config") + args_1 = {"cfg": compose(config_name="config")} + args_2 = [compose(config_name="config")] + kwarg_1 = {"cfg": [compose(config_name="config")]} + model = TestHydraModel(args_0, args_1, args_2, kwarg_1=kwarg_1) + epochs = 2 + checkpoint_callback = ModelCheckpoint(monitor=None, dirpath=tmpdir, save_top_k=-1) + trainer = Trainer( + default_root_dir=tmpdir, + callbacks=[checkpoint_callback], + limit_train_batches=10, + limit_val_batches=10, + max_epochs=epochs, + logger=False, + ) + trainer.fit(model) + _ = TestHydraModel.load_from_checkpoint(checkpoint_callback.best_model_path) diff --git a/tests/models/test_restore.py 
b/tests/models/test_restore.py index e29afe8e66e55..26df7fd348cc8 100644 --- a/tests/models/test_restore.py +++ b/tests/models/test_restore.py @@ -51,8 +51,7 @@ def on_train_end(self, trainer, pl_module): self._check_properties(trainer, pl_module) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_model_properties_resume_from_checkpoint(enable_pl_optimizer, tmpdir): +def test_model_properties_resume_from_checkpoint(tmpdir): """ Test that properties like `current_epoch` and `global_step` in model and trainer are always the same. """ model = EvalModelTemplate() @@ -61,7 +60,6 @@ def test_model_properties_resume_from_checkpoint(enable_pl_optimizer, tmpdir): default_root_dir=tmpdir, max_epochs=1, logger=False, - enable_pl_optimizer=enable_pl_optimizer, callbacks=[checkpoint_callback, ModelTrainerPropertyParity()], # this performs the assertions ) trainer = Trainer(**trainer_args) @@ -98,8 +96,7 @@ def on_train_start(self, trainer, pl_module): self.callbacks = deepcopy(trainer.callbacks) -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_callbacks_state_resume_from_checkpoint(enable_pl_optimizer, tmpdir): +def test_callbacks_state_resume_from_checkpoint(tmpdir): """ Test that resuming from a checkpoint restores callbacks that persist state. """ model = EvalModelTemplate() callback_capture = CaptureCallbacksBeforeTraining() @@ -110,7 +107,6 @@ def get_trainer_args(): default_root_dir=tmpdir, max_steps=1, logger=False, - enable_pl_optimizer=enable_pl_optimizer, callbacks=[ checkpoint, callback_capture, @@ -137,11 +133,10 @@ def get_trainer_args(): assert before.best_model_score == after.best_model_score -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_callbacks_references_resume_from_checkpoint(enable_pl_optimizer, tmpdir): +def test_callbacks_references_resume_from_checkpoint(tmpdir): """ Test that resuming from a checkpoint sets references as expected. 
""" model = EvalModelTemplate() - args = {'default_root_dir': tmpdir, 'max_steps': 1, 'logger': False, "enable_pl_optimizer": enable_pl_optimizer} + args = {'default_root_dir': tmpdir, 'max_steps': 1, 'logger': False} # initial training checkpoint = ModelCheckpoint(dirpath=tmpdir, monitor="early_stop_on", save_last=True) diff --git a/tests/special_tests.sh b/tests/special_tests.sh index 8650be6fd4682..70dd10ce3d60e 100644 --- a/tests/special_tests.sh +++ b/tests/special_tests.sh @@ -22,5 +22,6 @@ python ${DEFAULTS} tests/plugins/test_ddp_sequential_plugin.py::test_ddp_sequent python ${DEFAULTS} tests/plugins/test_ddp_sequential_plugin.py::test_ddp_sequential_plugin_ddp_rpc_automatic python ${DEFAULTS} tests/utilities/test_all_gather_grad.py::test_all_gather_collection # python ${DEFAULTS} tests/plugins/test_ddp_sequential_plugin.py::test_ddp_sequential_plugin_ddp_rpc_with_wrong_balance -python ${DEFAULTS} tests/trainer/logging_process/test_train_loop_logging_1_0.py::test_logging_sync_dist_true_ddp +python ${DEFAULTS} tests/trainer/logging_/test_train_loop_logging_1_0.py::test_logging_sync_dist_true_ddp python ${DEFAULTS} tests/trainer/test_trainer.py::test_pytorch_profiler_trainer_ddp +python ${DEFAULTS} tests/models/test_hooks.py::test_transfer_batch_hook_ddp diff --git a/tests/trainer/logging_/__init__.py b/tests/trainer/logging_/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/tests/trainer/logging_process/test_distributed_logging.py b/tests/trainer/logging_/test_distributed_logging.py similarity index 100% rename from tests/trainer/logging_process/test_distributed_logging.py rename to tests/trainer/logging_/test_distributed_logging.py diff --git a/tests/trainer/logging_process/test_eval_loop_logging_1_0.py b/tests/trainer/logging_/test_eval_loop_logging_1_0.py similarity index 100% rename from tests/trainer/logging_process/test_eval_loop_logging_1_0.py rename to tests/trainer/logging_/test_eval_loop_logging_1_0.py diff --git a/tests/trainer/logging_process/test_logger_connector.py b/tests/trainer/logging_/test_logger_connector.py similarity index 100% rename from tests/trainer/logging_process/test_logger_connector.py rename to tests/trainer/logging_/test_logger_connector.py diff --git a/tests/trainer/logging_process/test_train_loop_logging_1_0.py b/tests/trainer/logging_/test_train_loop_logging_1_0.py similarity index 100% rename from tests/trainer/logging_process/test_train_loop_logging_1_0.py rename to tests/trainer/logging_/test_train_loop_logging_1_0.py diff --git a/tests/trainer/optimization/test_manual_optimization.py b/tests/trainer/optimization/test_manual_optimization.py index a1caba0457980..cc0324befdf24 100644 --- a/tests/trainer/optimization/test_manual_optimization.py +++ b/tests/trainer/optimization/test_manual_optimization.py @@ -23,7 +23,6 @@ from pytorch_lightning import seed_everything, Trainer from pytorch_lightning.utilities import _APEX_AVAILABLE -from pytorch_lightning.utilities.exceptions import MisconfigurationException from tests.base.boring_model import BoringModel @@ -33,6 +32,11 @@ def test_multiple_optimizers_manual(tmpdir): Tests that only training_step can be used """ class TestModel(BoringModel): + + def __init__(self): + super().__init__() + self.automatic_optimization = False + def training_step(self, batch, batch_idx, optimizer_idx): # manual (opt_a, opt_b) = self.optimizers() @@ -69,10 +73,6 @@ def configure_optimizers(self): optimizer_2 = torch.optim.SGD(self.layer.parameters(), lr=0.1) return optimizer, optimizer_2 - 
@property - def automatic_optimization(self) -> bool: - return False - model = TestModel() model.val_dataloader = None @@ -456,7 +456,6 @@ def test_manual_optimization_and_return_tensor(tmpdir): amp_backend='native', accelerator="ddp_spawn", gpus=2, - enable_pl_optimizer=True ) trainer.fit(model) @@ -575,7 +574,6 @@ def automatic_optimization(self) -> bool: amp_backend='native', accumulate_grad_batches=4, gpus=1, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -650,7 +648,6 @@ def automatic_optimization(self) -> bool: precision=16, amp_backend='native', gpus=1, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -732,7 +729,6 @@ def automatic_optimization(self) -> bool: limit_val_batches=2, max_epochs=1, log_every_n_steps=1, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -797,7 +793,6 @@ def automatic_optimization(self) -> bool: max_epochs=1, log_every_n_steps=1, accumulate_grad_batches=2, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -853,7 +848,6 @@ def automatic_optimization(self) -> bool: max_epochs=1, log_every_n_steps=1, accumulate_grad_batches=2, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -931,7 +925,6 @@ def automatic_optimization(self) -> bool: max_epochs=1, log_every_n_steps=1, accumulate_grad_batches=2, - enable_pl_optimizer=True, ) trainer.fit(model) @@ -1040,7 +1033,6 @@ def automatic_optimization(self) -> bool: max_epochs=1, log_every_n_steps=1, accumulate_grad_batches=2, - enable_pl_optimizer=True, gpus=2, accelerator="ddp", ) @@ -1051,35 +1043,3 @@ def automatic_optimization(self) -> bool: expected_calls = [call(closure=ANY, optim='adam')] * 2 mock_adam_step.assert_has_calls(expected_calls) - - -def test_step_with_misconfiguraiton_error_when_overriding_optimizer_zero_grad(tmpdir): - """ - Tests that `optimizer_zero_grad` in manual_optimization triggers a MisconfigurationException - """ - try: - class TestModel(BoringModel): - - def optimizer_zero_grad(self, *_): - pass - - @property - def automatic_optimization(self) -> bool: - return False - - model = TestModel() - model.val_dataloader = None - model.training_epoch_end = None - - limit_train_batches = 8 - Trainer( - default_root_dir=tmpdir, - limit_train_batches=limit_train_batches, - limit_val_batches=2, - max_epochs=1, - log_every_n_steps=1, - accumulate_grad_batches=2, - enable_pl_optimizer=True, - ) - except MisconfigurationException as ex: - assert "`Trainer(enable_pl_optimizer=True, ...) 
is not supported" in str(ex) diff --git a/tests/trainer/optimization/test_optimizers.py b/tests/trainer/optimization/test_optimizers.py index 8e4e7abf9c25a..dacdc988488ed 100644 --- a/tests/trainer/optimization/test_optimizers.py +++ b/tests/trainer/optimization/test_optimizers.py @@ -181,9 +181,8 @@ def test_reducelronplateau_scheduling(tmpdir): ), 'lr scheduler was not correctly converted to dict' -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) -def test_optimizer_return_options(enable_pl_optimizer): - trainer = Trainer(enable_pl_optimizer=enable_pl_optimizer) +def test_optimizer_return_options(): + trainer = Trainer() model = EvalModelTemplate() # single optimizer diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py index b9723878adad5..ac124b71db3a4 100644 --- a/tests/trainer/test_trainer.py +++ b/tests/trainer/test_trainer.py @@ -504,16 +504,13 @@ def test_model_checkpoint_only_weights(tmpdir): def test_model_freeze_unfreeze(): - model = EvalModelTemplate() - model.freeze() model.unfreeze() -@pytest.mark.parametrize("enable_pl_optimizer", [False, True]) @pytest.mark.parametrize("url_ckpt", [True, False]) -def test_resume_from_checkpoint_epoch_restored(monkeypatch, tmpdir, tmpdir_server, url_ckpt, enable_pl_optimizer): +def test_resume_from_checkpoint_epoch_restored(monkeypatch, tmpdir, tmpdir_server, url_ckpt): """Verify resuming from checkpoint runs the right number of epochs""" # set $TORCH_HOME, which determines torch hub's cache path, to tmpdir monkeypatch.setenv("TORCH_HOME", tmpdir) @@ -541,7 +538,6 @@ def on_load_checkpoint(self, _): callbacks=[ModelCheckpoint(dirpath=tmpdir, monitor='early_stop_on', save_top_k=-1)], default_root_dir=tmpdir, val_check_interval=1.0, - enable_pl_optimizer=enable_pl_optimizer, progress_bar_refresh_rate=0, logger=False, weights_summary=None, diff --git a/tests/utilities/test_seed.py b/tests/utilities/test_seed.py new file mode 100644 index 0000000000000..74c6674eec793 --- /dev/null +++ b/tests/utilities/test_seed.py @@ -0,0 +1,55 @@ +import os +from unittest import mock + +import pytest + +import pytorch_lightning.utilities.seed as seed_utils + + +@mock.patch.dict(os.environ, {}, clear=True) +def test_seed_stays_same_with_multiple_seed_everything_calls(): + """ + Ensure that after the initial seed everything, + the seed stays the same for the same run. 
+ """ + with pytest.warns(UserWarning, match="No correct seed found"): + seed_utils.seed_everything() + initial_seed = os.environ.get("PL_GLOBAL_SEED") + + with pytest.warns(None) as record: + seed_utils.seed_everything() + assert not record # does not warn + seed = os.environ.get("PL_GLOBAL_SEED") + + assert initial_seed == seed + + +@mock.patch.dict(os.environ, {"PL_GLOBAL_SEED": "2020"}, clear=True) +def test_correct_seed_with_environment_variable(): + """ + Ensure that the PL_GLOBAL_SEED environment is read + """ + assert seed_utils.seed_everything() == 2020 + + +@mock.patch.dict(os.environ, {"PL_GLOBAL_SEED": "invalid"}, clear=True) +@mock.patch.object(seed_utils, attribute='_select_seed_randomly', new=lambda *_: 123) +def test_invalid_seed(): + """ + Ensure that we still fix the seed even if an invalid seed is given + """ + with pytest.warns(UserWarning, match="No correct seed found"): + seed = seed_utils.seed_everything() + assert seed == 123 + + +@mock.patch.dict(os.environ, {}, clear=True) +@mock.patch.object(seed_utils, attribute='_select_seed_randomly', new=lambda *_: 123) +@pytest.mark.parametrize("seed", (10e9, -10e9)) +def test_out_of_bounds_seed(seed): + """ + Ensure that we still fix the seed even if an out-of-bounds seed is given + """ + with pytest.warns(UserWarning, match="is not in bounds"): + actual = seed_utils.seed_everything(seed) + assert actual == 123