[Feat] Improve TPU CI #6078

Merged
merged 60 commits into from
Jul 19, 2021
d65d61d
i
tchaton Feb 19, 2021
7795539
i
tchaton Feb 19, 2021
e378d06
i
tchaton Feb 19, 2021
91edd05
i
tchaton Feb 19, 2021
82547af
i
tchaton Feb 19, 2021
f2bc33a
i
tchaton Feb 19, 2021
dec21fb
i
tchaton Feb 19, 2021
2b681bb
i
tchaton Feb 19, 2021
918f8e4
i
tchaton Feb 19, 2021
6f0a23d
i
tchaton Feb 19, 2021
e9550b9
i
tchaton Feb 19, 2021
323769b
i
tchaton Feb 19, 2021
abf79bc
i
tchaton Feb 19, 2021
53c0189
i
tchaton Feb 19, 2021
b1c062d
i
tchaton Feb 19, 2021
6f3a88f
i
tchaton Feb 19, 2021
3ab63bb
i
tchaton Feb 19, 2021
15b5afa
i
tchaton Feb 19, 2021
305b869
i
tchaton Feb 19, 2021
aea7a39
i
tchaton Feb 19, 2021
cdfcfc1
i
tchaton Feb 19, 2021
eab350f
i
tchaton Feb 19, 2021
84f5634
i
tchaton Feb 19, 2021
9c35f09
i
tchaton Feb 19, 2021
c11090e
i
tchaton Feb 19, 2021
c293389
i
tchaton Feb 19, 2021
9c6966e
i
tchaton Feb 19, 2021
1f42ddd
i
tchaton Feb 22, 2021
e089706
i
tchaton Feb 22, 2021
9f9767c
i
tchaton Feb 22, 2021
b0552ad
i
tchaton Feb 22, 2021
b539dad
i
tchaton Feb 22, 2021
5c78781
i
tchaton Feb 22, 2021
3e16c6e
i
tchaton Feb 22, 2021
f99e228
i
tchaton Feb 22, 2021
d9ecb4e
update
tchaton Feb 22, 2021
623be9d
update ci
tchaton Feb 22, 2021
d208bcc
i
tchaton Feb 22, 2021
e99ee67
i
tchaton Feb 22, 2021
9d44a06
i
tchaton Mar 4, 2021
5bc87b0
i
tchaton Mar 4, 2021
3d386d3
Merge branch 'master' into improve_tpu_ci
tchaton Mar 4, 2021
e794525
Merge branch 'master' into improve_tpu_ci
Borda Mar 31, 2021
2ab98c5
Merge branch 'master' into improve_tpu_ci
kaushikb11 Jul 19, 2021
78125bf
Remove unnecessary stuff
kaushikb11 Jul 19, 2021
13f6428
Update deprecated gcloud setup cli
kaushikb11 Jul 19, 2021
41f3f38
Add Jsonnet steps
kaushikb11 Jul 19, 2021
250842b
Remove build & push docker
kaushikb11 Jul 19, 2021
df45061
Add Jsonnet steps to the flow
kaushikb11 Jul 19, 2021
6f39acb
Update deploy cluster step
kaushikb11 Jul 19, 2021
437164a
Add set pr number & commit sha step
kaushikb11 Jul 19, 2021
8e2fef6
Add set pr number & commit sha step to flow
kaushikb11 Jul 19, 2021
094547b
Fix syntax
kaushikb11 Jul 19, 2021
483507d
Update deploy cluster step
kaushikb11 Jul 19, 2021
d361e8e
Fix indent
kaushikb11 Jul 19, 2021
4939889
Update jsonnet
kaushikb11 Jul 19, 2021
4a6d33d
Add pr no & commit sha env
kaushikb11 Jul 19, 2021
a5e6f52
Update jsonnet logic
kaushikb11 Jul 19, 2021
a144b0d
Add workflow steps
kaushikb11 Jul 19, 2021
c596123
Update variable
kaushikb11 Jul 19, 2021
44 changes: 30 additions & 14 deletions .circleci/config.yml
@@ -5,6 +5,19 @@ orbs:
   go: circleci/[email protected]
   codecov: codecov/[email protected]

+# Workflow Steps:
+# 1. Checkout
+# 2. Install Go
+# 3. Checkout ml-testing-accelerators
+# 4. GCP GKE install
+# 5. Update Kubeconfig with credentials
+# 6. Install jsonnet
+# 7. Update jsonnet
+# 8. Deploy the job on the kubernetes cluster
+# 9. Statistics
+# 10. Upload coverage results
+# 11. Upload coverage to Codecov
+
 references:

   make_docs: &make_docs
@@ -33,25 +46,28 @@ references:
         git checkout stable
         cd ..

-  build_push_docker: &build_push_docker
+  install_jsonnet: &install_jsonnet
     run:
+      name: Install jsonnet
+      command: |
+        go get github.com/google/go-jsonnet/cmd/jsonnet
+
+  update_jsonnet: &update_jsonnet
+    run:
-      name: Build and push Docker image
-      environment:
-        - PYTHON_VER: 3.7
+      name: Update jsonnet
       command: |
-        gcloud --quiet auth configure-docker
-        #cd dockers/tpu-tests
-        docker build --tag "$GCR_IMAGE_PATH:$CIRCLE_WORKFLOW_JOB_ID" -f ./dockers/tpu-tests/Dockerfile --build-arg "PYTHON_VERSION=$PYTHON_VER" --build-arg "PYTORCH_VERSION=$XLA_VER" .
-        docker push "$GCR_IMAGE_PATH:$CIRCLE_WORKFLOW_JOB_ID"
+        export PR_NUMBER=$(git ls-remote origin "pull/*/head" | grep -F -f <(git rev-parse HEAD) | awk -F'/' '{print $3}')
+        export SHA=$(git rev-parse --short HEAD)
+        python -c "fname = 'dockers/tpu-tests/tpu_test_cases.jsonnet' ; data = open(fname).read().replace('{PYTORCH_VERSION}', '$XLA_VER')
+        data = data.replace('{PYTHON_VERSION}', '$PYTHON_VER').replace('{PR_NUMBER}', '$PR_NUMBER').replace('{SHA}', '$SHA') ; open(fname, 'w').write(data)"
+        cat dockers/tpu-tests/tpu_test_cases.jsonnet

   deploy_cluster: &deploy_cluster
     run:
       name: Deploy the job on the kubernetes cluster
       command: |
-        go get github.com/google/go-jsonnet/cmd/jsonnet
         export PATH=$PATH:$HOME/go/bin
-        python -c "fname = 'dockers/tpu-tests/tpu_test_cases.jsonnet' ; fff = open(fname).read().replace('pytorch-VERSION', 'pytorch-$XLA_VER') ; open(fname, 'w').write(fff)"
-        job_name=$(jsonnet -J ml-testing-accelerators/ dockers/tpu-tests/tpu_test_cases.jsonnet --ext-str image=$GCR_IMAGE_PATH --ext-str image-tag=$CIRCLE_WORKFLOW_JOB_ID | kubectl create -f -)
+        job_name=$(jsonnet -J ml-testing-accelerators/ dockers/tpu-tests/tpu_test_cases.jsonnet | kubectl create -f -) && \
         job_name=${job_name#job.batch/}
         job_name=${job_name% created}
         echo "Waiting on kubernetes job: $job_name"
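As a reading aid for the `deploy_cluster` step: the two parameter expansions strip the `job.batch/` prefix and the ` created` suffix from the line that `kubectl create -f -` prints. A minimal Python sketch of the same trimming, assuming that output shape; the function name and the sample job name are hypothetical and not part of this PR.

```python
def parse_job_name(kubectl_output: str) -> str:
    """Extract the job name from a line such as 'job.batch/<name> created'."""
    # $(...) in the shell drops trailing newlines; strip() is a slightly
    # broader stand-in for that behaviour.
    name = kubectl_output.strip()
    # Mirrors ${job_name#job.batch/}: remove the prefix only if it is present.
    if name.startswith("job.batch/"):
        name = name[len("job.batch/"):]
    # Mirrors ${job_name% created}: remove the suffix only if it is present.
    if name.endswith(" created"):
        name = name[: -len(" created")]
    return name


# Hypothetical job name, for illustration only.
print(parse_job_name("job.batch/pl-tpu-tests-abc12 created"))
```

Like the shell expansions, the sketch leaves unexpected output unchanged rather than failing, so a malformed line would surface later when the job is looked up by name.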
@@ -72,7 +88,6 @@ references:
         # First portion is the test logs. Print these to Github Action stdout.
         cat xx00 && \
         echo "Done with log retrieval attempt." && \
-        gcloud container images delete "$GCR_IMAGE_PATH:$CIRCLE_WORKFLOW_JOB_ID" --force-delete-tags && \
         exit $status_code

   stats: &stats
@@ -92,6 +107,7 @@ jobs:
       - image: circleci/python:3.7
     environment:
       - XLA_VER: 1.8
+      - PYTHON_VER: 3.7
       - MAX_CHECKS: 240
       - CHECK_SPEEP: 5
     steps:
@@ -102,8 +118,8 @@ jobs:
       - gcp-gke/update-kubeconfig-with-credentials:
           cluster: $GKE_CLUSTER
           perform-login: true
-      - setup_remote_docker
-      - *build_push_docker
+      - *install_jsonnet
+      - *update_jsonnet
       - *deploy_cluster
       - *stats
       - codecov/upload:
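The `update_jsonnet` step does two things in one shell command: it looks up the PR number whose head ref matches the current commit, then fills the `{...}` placeholders in `tpu_test_cases.jsonnet`. The following is a minimal Python sketch of that logic, not the code this PR runs. The function names and sample SHAs are hypothetical, and it assumes each `git ls-remote origin "pull/*/head"` line has the form `<sha><TAB>refs/pull/<number>/head`.

```python
from typing import Dict, Optional


def pr_number_for_commit(ls_remote_output: str, commit_sha: str) -> Optional[str]:
    """Return the PR number whose head ref points at commit_sha, if any."""
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        sha, ref = parts
        # The shell version uses grep -F, a substring match; this sketch
        # compares the full SHA exactly.
        if sha == commit_sha:
            # 'refs/pull/6078/head' -> ['refs', 'pull', '6078', 'head']
            return ref.split("/")[2]
    return None


def fill_placeholders(template: str, values: Dict[str, str]) -> str:
    """Replace each '{KEY}' token with its value; other braces are untouched."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template


# Hypothetical ls-remote listing, for illustration only.
listing = "aaa111\trefs/pull/1/head\nbbb222\trefs/pull/6078/head\n"
print(pr_number_for_commit(listing, "bbb222"))

line = "imageTag: 'base-xla-py{PYTHON_VERSION}-torch{PYTORCH_VERSION}',"
print(fill_placeholders(line, {"PYTHON_VERSION": "3.7", "PYTORCH_VERSION": "1.8"}))
```

Plain `str.replace` is used rather than `str.format` because Jsonnet source is full of literal braces, which `format` would try to interpret as fields.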
17 changes: 14 additions & 3 deletions dockers/tpu-tests/tpu_test_cases.jsonnet
@@ -10,17 +10,28 @@ local tputests = base.BaseTest {

   timeout: 900, # 15 minutes, in seconds.

-  image: std.extVar('image'),
-  imageTag: std.extVar('image-tag'),
+  image: 'pytorchlightning/pytorch_lightning',
+  imageTag: 'base-xla-py{PYTHON_VERSION}-torch{PYTORCH_VERSION}',

   tpuSettings+: {
-    softwareVersion: 'pytorch-VERSION',
+    softwareVersion: 'pytorch-{PYTORCH_VERSION}',
   },
   accelerator: tpus.v3_8,

   command: utils.scriptCommand(
     |||
+      source ~/.bashrc
+      conda activate lightning
+      mkdir -p /home/runner/work/pytorch-lightning && cd /home/runner/work/pytorch-lightning
+      git clone https://github.com/PyTorchLightning/pytorch-lightning.git
       cd pytorch-lightning
+      echo $PWD
+      git ls-remote --refs origin
+      git fetch origin "refs/pull/{PR_NUMBER}/head:pr/{PR_NUMBER}" && git checkout "pr/{PR_NUMBER}"
+      git checkout {SHA}
+      pip install -e .
+      echo $KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS
+      export XRT_TPU_CONFIG="tpu_worker;0;${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}"
       coverage run --source=pytorch_lightning -m pytest -v --capture=no \
           tests/profiler/test_xla_profiler.py \
           pytorch_lightning/utilities/xla_device.py \
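In the test command, `${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}` drops the first seven characters of the endpoint before building `XRT_TPU_CONFIG`. The sketch below assumes those seven characters are a `grpc://` scheme, which the diff does not state. The function name and sample address are hypothetical.

```python
def xrt_tpu_config(endpoint: str, scheme: str = "grpc://") -> str:
    """Build an XRT_TPU_CONFIG value from a TPU endpoint string."""
    # ${VAR:7} removes seven characters unconditionally; len("grpc://") == 7.
    # This sketch removes the scheme only when it is actually present.
    if endpoint.startswith(scheme):
        address = endpoint[len(scheme):]
    else:
        address = endpoint
    return f"tpu_worker;0;{address}"


# Hypothetical endpoint, for illustration only.
print(xrt_tpu_config("grpc://10.0.0.2:8470"))
```

Checking for the prefix makes the intent explicit, whereas the fixed offset in the shell version would silently truncate an endpoint that came without a scheme.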