e2e tests on CI - actually await k8s resources to be ready before starting tests #1997

Merged: joeyorlando merged 14 commits into dev from jorlando/fix-flaky-e2e-test-global-setup on May 24, 2023

@joeyorlando joeyorlando commented May 23, 2023

Occasionally, the Playwright global setup step (which authenticates with the Grafana API and configures the plugin) would fail, causing the CI job to fail immediately (Playwright doesn't retry global setup if it fails).

My current hypothesis is that the oncall-engine and oncall-celery pods aren't actually ready in these cases, because of the way the jupyterhub/action-k8s-await-workloads action awaits k8s workloads:

Screenshot 2023-05-23 at 18 24 36

By using the kubectl rollout status deployment/<deployment-name> --timeout=300s command instead, we can be sure that these pods are actually ready to receive traffic before we start the tests.

❯ kubectl rollout status --help
Show the status of the rollout.

 By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for
the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout
status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled
over by another revision, use --revision=N where N is the revision you need to watch for.
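
As a rough sketch of how the rollout check could be wired into a CI step, the function below waits for each Deployment in turn and stops at the first one that fails to roll out in time. The function name `wait_for_deployments` is hypothetical, and the Deployment names in the usage note are assumptions based on the pod names mentioned above; the actual workflow step in this PR may look different.

```bash
#!/usr/bin/env bash
# Sketch: block until each named Deployment has finished rolling out.
# The timeout and the Deployment names are supplied by the caller;
# nothing here is taken from the PR's actual workflow file.

wait_for_deployments() {
  local timeout="$1"
  shift
  local name
  for name in "$@"; do
    echo "Waiting for deployment/${name} (timeout ${timeout})"
    # `kubectl rollout status` exits non-zero when the rollout does not
    # complete within the timeout, so stop at the first failure.
    if ! kubectl rollout status "deployment/${name}" --timeout="${timeout}"; then
      echo "deployment/${name} did not become ready" >&2
      return 1
    fi
  done
}
```

Assuming the Deployments are named after the pods, a CI step might call it as `wait_for_deployments 300s oncall-engine oncall-celery` before starting the tests. Waiting on each Deployment separately keeps the log output clear about which workload was slow.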

Lastly, even with this change, the POST /api/internal/v1/plugin/sync endpoint sometimes returns HTTP 500 (example logs from a failed CI job: https://github.com/grafana/oncall/actions/runs/5062712137/jobs/9088529416#step:19:2536). To handle this case, let's set up the Playwright global setup to retry 3 times.
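
The retry itself lives in the Playwright global setup, whose code is not shown on this page. The sketch below is not that code; it only illustrates the same bounded-retry idea in shell. The `retry` function, its arguments, and the reading of "3 times" as three attempts in total are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch: run a command up to max_attempts times, stopping at the first success.
# Hypothetical helper for illustration; not part of the PR.

retry() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command exactly as given; success ends the loop.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}
```

For example, `retry 3 5 some_flaky_command` (a placeholder command) would make at most three attempts with a five-second pause between them, and return non-zero only if all three fail.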

@joeyorlando joeyorlando added the pr:no changelog and pr:no public docs labels May 23, 2023
@joeyorlando joeyorlando requested a review from a team May 23, 2023 21:38
@joeyorlando joeyorlando requested a review from a team May 23, 2023 21:44
@joeyorlando joeyorlando removed the request for review from a team May 23, 2023 22:27
@joeyorlando joeyorlando changed the title from "make e2e tests global setup more reliable" to "e2e tests on CI - actually await k8s resources to be ready before starting tests" May 23, 2023
@@ -361,7 +364,7 @@ jobs:
  --set oncall.twilio.authToken="${{ secrets.TWILIO_AUTH_TOKEN }}" \
  --set oncall.twilio.phoneNumber="\"${{ secrets.TWILIO_PHONE_NUMBER }}"\" \
  --set oncall.twilio.verifySid="${{ secrets.TWILIO_VERIFY_SID }}" \
- --set grafana.replicas=3 \
+ --set grafana.replicas=1 \
joeyorlando (Contributor, Author) commented:

Can't reliably use more than one Grafana replica when SQLite is the Grafana database (see bitnami/charts#10905).

@@ -287,6 +287,9 @@ jobs:
  - name: Checkout
    uses: actions/checkout@v3

+ - name: Collect Workflow Telemetry
+   uses: runforesight/workflow-telemetry-action@v1
joeyorlando (Contributor, Author) commented:

This is useful for seeing how much CPU and memory the various steps in the workflow use. It attaches an artifact to the build that looks something like this:
Screenshot 2023-05-23 at 19 33 13
Screenshot 2023-05-23 at 19 33 20

@grafana grafana deleted 10 comments from github-actions bot May 23, 2023
@joeyorlando joeyorlando merged commit eefe7be into dev May 24, 2023
@joeyorlando joeyorlando deleted the jorlando/fix-flaky-e2e-test-global-setup branch May 24, 2023 00:20
brojd pushed a commit that referenced this pull request Sep 18, 2024
…rting tests (#1997)
