Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP VM instance does not terminate after idle timeout #1386

Open
hopeai opened this issue Jun 9, 2023 · 4 comments
Open

GCP VM instance does not terminate after idle timeout #1386

hopeai opened this issue Jun 9, 2023 · 4 comments
Labels
bug Something isn't working cloud-gcp Google Cloud cml-runner Subcommand

Comments

@hopeai
Copy link

hopeai commented Jun 9, 2023

Related to issue #834

Facing this issue with cml v0.19.0

I'm starting GCP instance using the following command inside a docker container which uses iterativeai/cml:latest image in a buildkite pipeline. I also share GOOGLE_APPLICATION_CREDENTIALS_DATA as environment variable, it starts and register self-hosted runner without any problem.

I configured gcloud with:

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project PROJECT
gcloud -q auth configure-docker

run cml runner launch with

cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4	 \
--name="$CML_RUNNER_NAME" \
--labels=gcp-snapshot-creation \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/REPO.git \
--cloud-hdd-size=60 \
--idle-timeout=120

However, after idle-timeout passes GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:

-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded

I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.

@hopeai hopeai added the epic Collection of sub-issues label Jun 9, 2023
@dacbd dacbd added bug Something isn't working cml-runner Subcommand cloud-gcp Google Cloud and removed epic Collection of sub-issues labels Jun 9, 2023
@dacbd
Copy link
Contributor

dacbd commented Jun 14, 2023

@hopeai Can you share more of the workflow that you are running even if it's mostly redacted? I going to try and reproduce the behavior you are seeing again.

@hopeai
Copy link
Author

hopeai commented Jun 14, 2023

Sure @dacbd,

I'm using the following Buildkite pipeline to manually launch a gpu server when I need.

agents:
  queue: "my-buildkite-agent"
steps:
  - block: ":warning: Provision ec2 instance"
  - command: "buildkite-agent pipeline upload .buildkite/cml_launch_runner.yml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"

  - wait

  - command: "buildkite-agent pipeline upload .buildkite/trigger_my_buildkite_pipeline.yaml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"

The two pipelines which previous pipeline uploads are as follows:

  • .buildkite/cml_launch_runner.yml
agents:
  queue: "my-buildkite-agent"

steps:
  - label: ":gcloud: setup gcp creds"
    command: "./scripts/configure_gcloud.sh"

  - wait

  - label: ":gcloud: launch GCP Snapshot Creation Instance"
    command:
      - "./scripts/cml_launch_runner.sh"
    plugins:
      - docker#v5.3.0:
          image: "iterativeai/cml:latest"
          environment:
            - "BUILDKITE_GITHUB_USERNAME"
            - "BUILDKITE_GITHUB_TOKEN"
            - "BUILDKITE_BRANCH"
            - "BUILDKITE_BUILD_NUMBER"
            - "BUILDKITE_COMMIT"
            - "GOOGLE_APPLICATION_CREDENTIALS_DATA"
  • .buildkite/trigger_my_buildkite_pipeline.yaml
agents:
  queue: "my-buildkite-agent"

steps:
  - label: ":github: Trigger Github Actions"
    command:
      - "./scripts/trigger_github_workflow.sh"

The bash scripts which are used in the pipelines are as follows:

  • ./scripts/configure_gcloud.sh
 #!/bin/bash

set -eu -o pipefail

echo ":gcloud: Configure gcloud..."

export GOOGLE_APPLICATION_CREDENTIALS_DATA=$BUILDKITE_GS_APPLICATION_CREDENTIALS_JSON

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project MY-PROJECT
gcloud -q auth configure-docker
gcloud -v

echo "Done!"
  • ./scripts/cml_launch_runner.sh
#!/bin/bash

set -eu -o pipefail

# Build a unique instance name
CML_RUNNER_NAME="some-name"

# Provision an GCP EC2 g4dn.xlage instance
cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4	 \
--name="$CML_RUNNER_NAME" \
--labels=gpu-cml \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/myrepo.git \
--cloud-hdd-size=60 \
--idle-timeout=120
  • ./scripts/trigger_github_workflow.sh
#!/bin/bash


curl -L \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $BUILDKITE_GITHUB_TOKEN" \
-H "X-GitHub-Api-Version: 2022-11-28" \
--url https://api.github.com/repos/intenseye/myrepo/actions/workflows/my_workflow.yml/dispatches \
-d '{"ref":"'"$BUILDKITE_BRANCH"'"}'

By running these pipelines everything works as expected and cml starts a gpu server on GCP and this self-hosted runner accepts my GitHub workflow jobs and jobs finish successfully. However, after idle-timeout GCP VM is not terminated. I checked the logs in self-hosted runner and shared here. I think this could help you to reproduce this behavior.

Related to issue #834

Facing this issue with cml v0.19.0

I'm starting GCP instance using the following command inside a docker container which uses iterativeai/cml:latest image in a buildkite pipeline. I also share GOOGLE_APPLICATION_CREDENTIALS_DATA as environment variable, it starts and register self-hosted runner without any problem.

I configured gcloud with:

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project PROJECT
gcloud -q auth configure-docker

run cml runner launch with

cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4	 \
--name="$CML_RUNNER_NAME" \
--labels=gcp-snapshot-creation \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/REPO.git \
--cloud-hdd-size=60 \
--idle-timeout=120

However, after idle-timeout passes GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:

-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded

I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.

@dacbd
Copy link
Contributor

dacbd commented Jun 14, 2023

The total runtime Is close to one hour which makes me think that your gcloud -q auth activate-service-accouth method of authorization is generating an Oauth token which has expired by the time CML tries to shutdown.

I believe if you run the following query in GCP's "log explorer" you will NOT see a delete request (because it will be unauthorised from the expired credentials)

protoPayload.authenticationInfo.principalEmail="your_service_account_name@your_project_name.iam.gserviceaccount.com"
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.delete"

Can you try authenticating your github actions via the following:

      - name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/auth@v0'
        with:
          credentials_json: ${{ secrets.GCP_CML_RUNNER_KEY }}

where secrets.GCP_CML_RUNNER_KEY is the contents of the json key. ( I would remove the white space after and before the jsons opening/closing {}'s so that github actions doesn't try to obsfcate single characters like { or })

you can also configure docker in a similar fashion:

      - name: Login to GAR
        uses: docker/login-action@v2
        with:
          registry: <location>-docker.pkg.dev
          username: _json_key
          password: ${{ secrets.GAR_JSON_KEY }}

@hopeai
Copy link
Author

hopeai commented Jun 19, 2023

Thanks a lot @dacbd for pointing this out. Let me try this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cloud-gcp Google Cloud cml-runner Subcommand
Projects
None yet
Development

No branches or pull requests

2 participants