-
Notifications
You must be signed in to change notification settings - Fork 899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the kubeflow-m2m-oidc-configurator a CronJob #2667
Make the kubeflow-m2m-oidc-configurator a CronJob #2667
Conversation
@juliusvonkohout or @kimwnasptd can we restart the tests? Both of them failed because of a non-related issue:
|
@KRomanov, i restarted the tests. If they fail again we might have to increase the timeouts in this PR. |
b98a24d
to
4abca40
Compare
@juliusvonkohout this is super weird. I limited the CronJob with |
I restarted the tests. yeah our CICD is a bit problematic at the moment. If we can specify more resources in this public repository yes, otherwise we have to increase the timeouts. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories |
@juliusvonkohout maybe the issue is with CICD resource sharing? If the memory and cpu is shared between multiple workflows, it may be problematic. I see one of the failing tests completed with success. Can you restart the last test workflow? Also, is this something I could do myself, for example with the github bot with commands in comment? |
...nts/configure-self-signed-kubernetes-oidc-issuer/cronjob.kubeflow-m2m-oidc-configurator.yaml
Outdated
Show resolved
Hide resolved
name: kubeflow-m2m-oidc-configurator | ||
namespace: istio-system | ||
spec: | ||
schedule: '* * * * *' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHould we not go with every 5 minutes instead of every minute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change to every 5 minutes. There is also configuration for not adding more jobs until the last one is completed and from the latest log from cicd workflows shows that there is no more than 1 job created at a time.
defaultMode: 0777 | ||
items: | ||
- key: script.sh | ||
path: script.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure that script.sh is idempotent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, well it doesn't verify if the JWKS is present and after all is always performing the patch so this might be an improvement. I think the JWKS value should be also compared and only patched if different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made changes so the script will first check for the JWKS present in RequestAuthentication and only patch if not equal to the desired JWKS.
I did restart and it failed again. In the KFP repository that was possible with /retest or /retest-failed or so. Probably something i can investigate in the next weeks when i am less busy. |
@juliusvonkohout maybe we could add verbosity to the logs in CICD GH Workflows? We currently know that the pods aren't ready but what is the actual reason? DockerHub pull rate limits? Not enough resources? Failing Pod? |
Yes, lets do that in a separate PR with @codablock as well. |
The tests in #2696 were successful so i reran the test and hope that the CICD is happy now. If not please rebase the PR against the master branch. |
https://github.com/kubeflow/manifests/actions/runs/8891109875 here is the successful test. |
So we need a rebase and step by step debugging with minimal changes. |
/hold |
/retest |
4abca40
to
0a707ba
Compare
0a707ba
to
e791d53
Compare
Signed-off-by: Krzysztof Romanowski <[email protected]> Signed-off-by: Krzysztof Romanowski <[email protected]>
Signed-off-by: Krzysztof Romanowski <[email protected]>
It was tested with self-hosted runner using custom dockerconfig credentials for debugging. Signed-off-by: Krzysztof Romanowski <[email protected]>
…tor.yaml Signed-off-by: Krzysztof Romanowski <[email protected]>
Signed-off-by: Krzysztof Romanowski <[email protected]>
4eb270b
to
8d54066
Compare
@diegolovison this is the PR we've discussed on the Manifests WG Call. |
@kimwnasptd @rimolive please also review |
/lgtm |
/hold |
Huh, the CICD failed again with:
https://github.com/kubeflow/manifests/actions/runs/9502404627/job/26190294755?pr=2667 But, only 1/3 m2m cicd workflows failed... I'll have another look and try to pinpoint why this is happening. I guess this should be rather a small change. |
Going to wait for the feedback |
aa466e7
to
6972652
Compare
Signed-off-by: Krzysztof Romanowski <[email protected]>
6972652
to
c830a7a
Compare
I made changes to the script so it will patch and then verify if the patch with jwks persisted. If it's not persisted, the Pod will finish with failure. The CronJob is configured to restart failed Job Pod 3 times. If this also fails, the @diegolovison , @juliusvonkohout , @kimwnasptd , please review. |
@diegolovison can you test now? @rimolive we have to merge this for rc.2 |
/lgtm There was no feedback for a week and we need it in the next RC @rimolive |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: juliusvonkohout The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
Which issue is resolved by this Pull Request:
Resolves #2646
Description of your changes:
Changing the Job to CronJob improves the robustness of the setup in case if the JWKS will change or the user accidentally overwrote the requestauthentication.
Checklist:
kind
and onvcluster
.