e2e tests on CI - actually await k8s resources to be ready before starting tests #1997

joeyorlando · 2023-05-23T21:38:05Z

Occasionally, the Playwright global setup step (which authenticates w/ the Grafana API + configures the plugin) would fail, causing the CI job to fail instantly (Playwright doesn't retry global setup if it fails).

My current hypothesis is that the oncall-engine and oncall-celery pods aren't actually ready in these cases, because of the way the jupyterhub/action-k8s-await-workloads action awaits k8s workloads.

By using the kubectl rollout status deployment/<deployment-name> --timeout=300s instead, we can be sure that these pods are actually ready to receive traffic before we start the tests.

❯ kubectl rollout status --help
Show the status of the rollout.

 By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for
the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout
status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled
over by another revision, use --revision=N where N is the revision you need to watch for.
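As a rough sketch of how a CI step could wait on several deployments with this command: the function below is illustrative, not the exact step from this PR. The deployment names in the example call are assumptions based on the pods mentioned above (the real names depend on the Helm release), and the KUBECTL variable is a hypothetical hook so the function can be exercised without a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: wait for each named deployment to finish rolling out.
# KUBECTL is a hypothetical override (defaults to the real kubectl binary).
KUBECTL="${KUBECTL:-kubectl}"

# await_deployments TIMEOUT NAME [NAME...]
await_deployments() {
  local timeout="$1"; shift
  local name
  for name in "$@"; do
    echo "Waiting for deployment/${name} (timeout ${timeout})..."
    # 'rollout status' exits non-zero if the rollout does not complete
    # within the timeout; stop at the first deployment that fails.
    if ! "$KUBECTL" rollout status "deployment/${name}" --timeout="${timeout}"; then
      echo "deployment/${name} did not become ready" >&2
      return 1
    fi
  done
}

# Example call (needs cluster access; deployment names are assumed):
# await_deployments 300s oncall-engine oncall-celery
```

Because the function returns non-zero on the first failed rollout, a workflow step that calls it would fail before the e2e tests start instead of running them against pods that are not ready.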

Lastly, even with this change, the POST /api/internal/v1/plugin/sync endpoint sometimes returns HTTP 500 (example logs from a failed CI job: https://github.com/grafana/oncall/actions/runs/5062712137/jobs/9088529416#step:19:2536). To handle this, let's set up the Playwright global setup to retry up to 3 times.
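The retry itself lives in the Playwright global setup, which isn't shown in this conversation. The same idea can be sketched as a generic shell wrapper that retries a flaky command a fixed number of times. This is a swapped-in illustration, not the code from this PR; the retry function, its arguments, and the ONCALL_API_URL variable in the example call are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: retry a command up to MAX_ATTEMPTS times, pausing between tries.
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1
  while true; do
    # Run the command; success ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example call (ONCALL_API_URL is a hypothetical variable; a real request
# to this endpoint would also need authentication headers):
# retry 3 5 curl --fail --silent -X POST "${ONCALL_API_URL}/api/internal/v1/plugin/sync"
```

Note that the first argument counts total attempts, so `retry 3 ...` runs the command at most three times.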

joeyorlando · 2023-05-23T23:00:00Z

.github/workflows/linting-and-tests.yml

@@ -361,7 +364,7 @@ jobs:
            --set oncall.twilio.authToken="${{ secrets.TWILIO_AUTH_TOKEN }}" \
            --set oncall.twilio.phoneNumber="\"${{ secrets.TWILIO_PHONE_NUMBER }}"\" \
            --set oncall.twilio.verifySid="${{ secrets.TWILIO_VERIFY_SID }}" \
-            --set grafana.replicas=3 \
+            --set grafana.replicas=1 \


Can't reliably use > 1 grafana replica when using SQLite as the grafana database
bitnami/charts#10905

joeyorlando · 2023-05-23T23:34:01Z

.github/workflows/linting-and-tests.yml

@@ -287,6 +287,9 @@ jobs:
      - name: Checkout
        uses: actions/checkout@v3

+      - name: Collect Workflow Telemetry
+        uses: runforesight/workflow-telemetry-action@v1


This is useful for seeing how much CPU/memory the various steps in the workflow are using. It attaches an artifact with these metrics to the build.


wip

bd94da5

joeyorlando added the pr:no changelog and pr:no public docs labels May 23, 2023

joeyorlando requested a review from a team May 23, 2023 21:38

add retries to global setup

77175ab

joeyorlando requested a review from a team May 23, 2023 21:44

wip

6b0d938

joeyorlando removed the request for review from a team May 23, 2023 22:27

await grafana deployment as well

2f1d049

joeyorlando changed the title ~~make e2e tests global setup more reliable~~ e2e tests on CI - actually await k8s resources to be ready before starting tests May 23, 2023

joeyorlando added 4 commits May 23, 2023 18:40

only use 1 grafana container

d0d698f

add note

27f4412

retry global setup if it fails

df7d387

add more notes

dfe6040


joeyorlando added 5 commits May 23, 2023 19:09

try 3x parrallelizing e2e tests

8e11e51

try using 5 engine replicas

925ffd8

wip

e825a70

Trigger Build

4e719a1

Trigger Build

b76c6c9


configure workflow telemtry action

dd63d97

grafana deleted a comment from github-actions bot May 23, 2023

joeyorlando merged commit eefe7be into dev May 24, 2023

joeyorlando deleted the jorlando/fix-flaky-e2e-test-global-setup branch May 24, 2023 00:20
