Add 4GPU unit test (pytorch#82)
For now this literally just runs `NGPU=4 ./run_llama_train.sh` but I
verified at least it catches problems.

As a follow-up, we should integrate mgpu test infra from pytorch and set
up actual unit tests to run in this job.

We should probably also keep testing the run_llama_train.sh script, and
add other combinations of 2D parallelism to ensure they all keep
working.

<img width="2120" alt="image"
src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">
wconstab authored Feb 24, 2024
1 parent d5de78c commit 2fe3152
Showing 2 changed files with 45 additions and 2 deletions.
43 changes: 43 additions & 0 deletions .github/workflows/unit_test_4gpu.yaml
@@ -0,0 +1,43 @@
+name: 4 GPU Unit Test
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  unit_tests_4gpu:
+    runs-on: linux.g5.12xlarge.nvidia.gpu
+    strategy:
+      matrix:
+        python-version: ['3.10']
+    steps:
+    - name: Check out repo
+      uses: actions/checkout@v3
+    - name: Setup conda env
+      uses: conda-incubator/setup-miniconda@v2
+      with:
+        auto-update-conda: true
+        miniconda-version: "latest"
+        activate-environment: test
+        python-version: ${{ matrix.python-version }}
+    - name: Update pip
+      run: python -m pip install --upgrade pip
+    - name: Install dependencies
+      run: |
+        pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
+        python -m pip install -r requirements.txt
+        python -m pip install -r dev-requirements.txt
+        python -m pip install -e .
+    - name: Run NGPU=4 ./run_llama_train.sh
+      run: NGPU=4 ./run_llama_train.sh
+    - name: Upload Coverage to Codecov
+      uses: codecov/codecov-action@v3
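
The `group:` expression in the `concurrency` block of this workflow can be hard to read at a glance. The sketch below mimics it in plain Python, purely as an illustration: the function name `concurrency_group` and the sample ref and run-number values are hypothetical, and the real evaluation is done by GitHub Actions, not by this code. It relies on the fact that `a && b || c` in a GitHub Actions expression acts like a ternary when `b` is truthy, which holds here because `github.run_number` is never zero.

```python
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    """Illustrative stand-in for the workflow's `group:` expression."""
    # On main, the suffix is the run number, so every push gets its own
    # group and `cancel-in-progress` never cancels an earlier main run.
    # Elsewhere (e.g. pull requests), the suffix is the ref, so a newer
    # run on the same ref lands in the same group and cancels the older one.
    suffix = str(run_number) if ref == "refs/heads/main" else ref
    return f"unit-test{workflow}-{suffix}"


if __name__ == "__main__":
    # Sample values below are made up for demonstration.
    print(concurrency_group("4 GPU Unit Test", "refs/heads/main", 17))
    print(concurrency_group("4 GPU Unit Test", "refs/pull/82/merge", 18))
```

Under these assumptions, two pushes to the same pull request share one group, while two pushes to `main` do not, which is why only superseded pull-request runs are cancelled.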
@@ -1,4 +1,4 @@
-name: Unit Test
+name: CPU Unit Test
 
 on:
   push:
@@ -14,7 +14,7 @@ defaults:
     shell: bash -l -eo pipefail {0}
 
 jobs:
-  unit_tests:
+  cpu_unit_tests:
     runs-on: ubuntu-latest
     strategy:
       matrix: