GCP compute instances are not shutdown after idle timeout #661

a3lem · 2021-07-18T14:38:19Z

First off, thanks for adding GCP support to cml-runner.

While playing around with it, I noticed that my Compute Engine instances weren't being shut down/terminated (i.e., VM powered off) or deleted. I experimented with --idle-timeout and --single, yet neither made a difference: the instance stays alive. This is unexpected given the following sentence from the docs on self-hosted runners:

After the job runs, the instance automatically shuts down.

And indeed, that's what I've observed with AWS instances. Those seem to terminate correctly.

Here's my GitHub workflow for extra context:

name: 'Train-in-the-cloud-GCP'
on: 
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
          # latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
          --cloud gcp \
          --cloud-region europe-west1-b \
          --cloud-type=n1-standard-1 \
          --labels=cml-runner
          
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"
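
For reference, the variant of the deploy step with the two flags mentioned above would look roughly like the fragment below. This is only a sketch: the timeout value of 60 seconds is an illustrative assumption, not the exact value used in the experiments, and the rest of the step is unchanged from the workflow above.

```yaml
        # Assumed values for illustration: --idle-timeout is given in seconds,
        # and --single asks the runner to exit after a single job.
        run: |
          cml-runner \
          --cloud gcp \
          --cloud-region europe-west1-b \
          --cloud-type=n1-standard-1 \
          --labels=cml-runner \
          --idle-timeout=60 \
          --single
```

With either flag, the expectation is that the instance is powered off or deleted once the runner exits, which is not what happens on GCP.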

Hope there's a way to ensure the same auto-shutdown behavior on GCP. As it is, the risk of getting smacked with an expensive bill for an idle GPU is just too real =p


ivyleavedtoadflax · 2021-07-19T05:32:42Z

See #649 (especially #649 (comment)); I've experienced something similar on AWS. I'm not sure I 100% trust this workaround, though, as I just found a GPU instance that incorporated this hack and had been running for 12 hours.

DavidGOrtega · 2021-07-19T12:10:59Z

related to #649

a3lem changed the title ~~GCP compute instances persist beyond idle timeout interval~~ GCP compute instances are not shutdown after idle timeout interval Jul 18, 2021

a3lem changed the title ~~GCP compute instances are not shutdown after idle timeout interval~~ GCP compute instances are not shutdown after idle timeout Jul 18, 2021

DavidGOrtega self-assigned this Jul 19, 2021

DavidGOrtega added cml-runner Subcommand p0-critical Max priority (ASAP) labels Jul 19, 2021

DavidGOrtega mentioned this issue Jul 19, 2021

Cloud runner terminate improvements #653

Merged

DavidGOrtega closed this as completed in #653 Jul 20, 2021

0x2b3bfa0 mentioned this issue Jul 21, 2021

GOOGLE_APPLICATION_CREDENTIALS doesn't get propagated into instances iterative/terraform-provider-iterative#165

Closed

a3lem mentioned this issue Jul 28, 2021

GCP cloud runner not terminating #678

Closed
