GCP compute instances are not shutdown after idle timeout #661

Closed
a3lem opened this issue Jul 18, 2021 · 2 comments · Fixed by #653
Labels
cml-runner Subcommand p0-critical Max priority (ASAP)

Comments

@a3lem
a3lem commented Jul 18, 2021

First off, thanks for adding GCP support to cml-runner.

While playing around with it, I noticed that my Compute Engine instances weren't being shut down/terminated (i.e., VM powered off) or deleted. I experimented with --idle-timeout and --single, yet neither made a difference: the instance stays alive. This is unexpected given the following sentence from the docs on self-hosted runners:

After the job runs, the instance automatically shuts down.

And indeed, that's what I've observed with AWS instances. Those seem to terminate correctly.

Here's my GitHub workflow for extra context:

name: 'Train-in-the-cloud-GCP'
on: 
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
          # latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
          --cloud gcp \
          --cloud-region europe-west1-b \
          --cloud-type=n1-standard-1 \
          --labels=cml-runner
          
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"

Hope there's a way to ensure the same auto-shutdown behavior on GCP. As it is, the risk of getting smacked with an expensive bill for an idle GPU is just too real =p
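
In the meantime, a manual check along these lines can confirm whether an instance is still alive after the job finishes. This is only a minimal sketch, assuming the gcloud CLI is installed and authenticated against the same project; the helper names `list_running_instances` and `delete_command` and the instance name in the example are hypothetical, not part of cml-runner.

```shell
#!/usr/bin/env bash
# Sketch: find Compute Engine instances that are still running and prepare
# a delete command for one of them. Assumes gcloud is installed and
# authenticated; the function names below are hypothetical helpers.

# Print every running instance in the active project, with its zone.
list_running_instances() {
  gcloud compute instances list \
    --filter="status=RUNNING" \
    --format="table(name,zone.basename(),status)"
}

# Build (but do not run) the delete command for one instance,
# so it can be reviewed before being executed by hand.
delete_command() {
  local name="$1" zone="$2"
  printf 'gcloud compute instances delete %s --zone=%s --quiet\n' "$name" "$zone"
}

# Example: print the command for a hypothetical leftover instance.
delete_command "my-leftover-instance" "europe-west1-b"
```

Calling `list_running_instances` shows what is still running; `delete_command` only prints the command, so nothing is deleted until it is pasted and run deliberately.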

@a3lem a3lem changed the title GCP compute instances persist beyond idle timeout interval GCP compute instances are not shutdown after idle timeout interval Jul 18, 2021
@a3lem a3lem changed the title GCP compute instances are not shutdown after idle timeout interval GCP compute instances are not shutdown after idle timeout Jul 18, 2021
@ivyleavedtoadflax
ivyleavedtoadflax commented Jul 19, 2021

See #649 (especially #649 (comment)); I've experienced something similar on AWS. I'm not sure I 100% trust this workaround, though, as I just found a GPU instance that had been running for 12 hours even though it incorporated this hack.

@DavidGOrtega
Contributor

related to #649

3 participants