You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
First off, thanks for adding GCP support to cml-runner.
While playing around with it, I noticed that my compute engine instances weren't being shutdown/terminated (i.e., VM powered off) or deleted. I experimented with --idle-timeout and --single, yet neither made a difference. The instance stays alive. This is unexpected given the following sentence from the docs on self-hosted runners:
After the job runs, the instance automatically shuts down.
And indeed, that's what I've observed with AWS instances. Those seem to terminate correctly.
Here's my Github workflow for extra context:
name: 'Train-in-the-cloud-GCP'
on:
workflow_dispatch:
jobs:
deploy-runner:
runs-on: [ubuntu-latest]
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: 'Deploy runner on GCP'
shell: bash
env:
REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
# Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
# `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
# latter causes problems for terraform.
GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
run: |
cml-runner \
--cloud gcp \
--cloud-region europe-west1-b \
--cloud-type=n1-standard-1 \
--labels=cml-runner
model-training:
needs: deploy-runner
runs-on: [self-hosted, cml-runner]
container: docker://dvcorg/cml-py3:latest
steps:
- uses: actions/checkout@v2
- name: 'Train my dummy model'
env:
REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
run: |
echo "Training a super awesome model"
sleep 5
echo "Training complete"
Hope there's a way to ensure the same auto-shutdown behavior on GCP. As it is, the risk of getting smacked with an expensive bill for an idle GPU is just too real =p
The text was updated successfully, but these errors were encountered:
a3lem
changed the title
GCP compute instances persist beyond idle timeout interval
GCP compute instances are not shutdown after idle timeout interval
Jul 18, 2021
a3lem
changed the title
GCP compute instances are not shutdown after idle timeout interval
GCP compute instances are not shutdown after idle timeout
Jul 18, 2021
See #649 (espcially #649 (comment)), I've experienced similar on AWS. I'm not sure I 100% trust this workaround though as I just found a GPU instance that had been running fro 12 hours which incorporated this hack.
First off, thanks for adding GCP support to cml-runner.
While playing around with it, I noticed that my compute engine instances weren't being shutdown/terminated (i.e., VM powered off) or deleted. I experimented with
--idle-timeout
and--single
, yet neither made a difference. The instance stays alive. This is unexpected given the following sentence from the docs on self-hosted runners:And indeed, that's what I've observed with AWS instances. Those seem to terminate correctly.
Here's my Github workflow for extra context:
Hope there's a way to ensure the same auto-shutdown behavior on GCP. As it is, the risk of getting smacked with an expensive bill for an idle GPU is just too real =p
The text was updated successfully, but these errors were encountered: