CML Kubernetes self-hosted runner is registered to GitHub but the workflow never continues #1415
Comments
@ludelafo I think the issue is that the executing command The two solutions I see are:
Hi @dacbd, thank you for your input. I'll investigate further on my side to see if I can fix the issue. What puzzles me is that I remember having the same setup previously, and it worked out of the box. I'll get back to you if I find something.
Hello @dacbd, after a few months working on other projects, I'm back on CML/MLOps principles. After updating all packages to check whether this issue is resolved, my team and I are still having trouble using CML with Kubernetes and GitHub Actions. To help identify the problem, I created a minimal reproducible example that you can find here: https://github.com/swiss-ai-center/cml-kubernetes-github-actions-runner-minimal-reproducible-example. It contains all the steps to reproduce the issue, along with open questions for further investigation. Three of us have looked into this issue without finding a solution. I'll tag the others (@rmarquis, @leonardcser) so they can join the conversation if necessary. We are highly motivated to help Iterative fix this issue, so please let us know how we can help! Thanks in advance,
@ludelafo I'm sorry, I don't have much capacity to help you, and I'm not sure how busy @0x2b3bfa0 is. A few things I would recommend: inspect the cluster to make sure the pod is even being created, and go into your GCP logs explorer and inspect the API calls/activity to make sure nothing is being denied or missing. CML generates an SSH key that is used for the instance. You can run the command locally using your own SSH key (there should be a few examples in the docs), then try to SSH into it yourself and inspect the contents for errors. (CML does its readiness check via SSH.)
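The first check suggested above (is the pod even being created?) can be scripted. This is a minimal Python sketch, assuming `kubectl` is installed and already configured for the cluster in question. The function names and the `namespace` parameter are illustrative and not part of CML.

```python
import json
import subprocess


def summarize_pods(pod_list: dict) -> list[tuple[str, str, int]]:
    """Return (name, phase, total restarts) for each pod in `kubectl get pods -o json` output."""
    summary = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unnamed>")
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # containerStatuses is absent until the containers have been scheduled.
        restarts = sum(c.get("restartCount", 0) for c in status.get("containerStatuses", []))
        summary.append((name, phase, restarts))
    return summary


def fetch_pods(namespace: str = "default") -> dict:
    """Run kubectl and parse its JSON output (assumes kubectl is on PATH and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example usage against a live cluster:
#   for name, phase, restarts in summarize_pods(fetch_pods()):
#       print(f"{name}\t{phase}\trestarts={restarts}")
```

A runner pod that never appears, stays in `Pending`, or keeps restarting would point to a problem on the cluster side rather than on the GitHub side.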
I’m encountering a similar issue with CML while setting up a self-hosted runner for GitHub on an Azure Kubernetes cluster. Despite the runner being created and registering successfully with GitHub, the workflow hangs with the log message:
Interestingly, I have an operational setup using AKS 1.23 with the same code, but I encounter the error when trying to execute the pipeline via GitHub Actions in an AKS 1.27 or 1.28 environment. I also see different pod logs on AKS 1.23 than on AKS 1.27 or 1.28:
AKS 1.23
AKS 1.27 or 1.28
Any guidance or assistance you can provide would be greatly appreciated. We are keen to resolve this and are open to any further investigation or adjustments needed.
Thanks in advance,
Neemias
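To compare pod logs from two cluster versions, as described above, it can help to diff them after masking timestamps. This is a minimal Python sketch using the standard `difflib` module. The timestamp pattern (ISO 8601) and the `aks-old`/`aks-new` labels are assumptions, and the real log format may need a different pattern.

```python
import difflib
import re

# Assumed timestamp shape, e.g. 2024-01-31T12:00:00.123Z; adjust to the real log format.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")


def normalize(log_text: str) -> list[str]:
    """Replace timestamps with a placeholder so they do not show up as differences."""
    return [TIMESTAMP.sub("<ts>", line) for line in log_text.splitlines()]


def diff_logs(old_log: str, new_log: str) -> list[str]:
    """Return a unified diff of two pod logs after timestamp normalization."""
    return list(
        difflib.unified_diff(
            normalize(old_log),
            normalize(new_log),
            fromfile="aks-old",
            tofile="aks-new",
            lineterm="",
        )
    )
```

The first line that appears only in the newer cluster's log is usually a good starting point for further investigation.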
Thank you very much for your insights, I'm glad I'm not the only person to have these issues. My suspicions were the same as yours. I do think it's a difference in compatibility between versions of Kubernetes that I haven't been able to test on my own: Google Cloud automatically updates Kubernetes clusters and I wasn't able to go back to an old enough version to validate this point. Fortunately, thanks to your feedback, I think Iterative now has a new lead. I do not work on this project anymore, but my colleagues might be able to help you if needed. Looking forward to seeing any improvements on this!
@neemias-carvalho-movti We didn't find a solution to this issue on our side. Good to know that the Kubernetes version seems to be responsible here. For our use case, we eventually side-stepped the problem by using two runners: a standard GitHub action that instantiates the k8s cluster, and a second one that trains the model with GPU support and generates CML reports. We don't use CML to instantiate k8s.
We will not use CML for retraining on Kubernetes as this is seemingly not working anymore on newer Kubernetes releases. See iterative/cml#1415
Hi CML team,
I'm facing an issue with CML when creating a self-hosted runner for GitHub on a Google Cloud Kubernetes cluster.
The runner is created and seems to register to GitHub. However, the workflow never continues and hangs on
I'm using the following steps to create the runner:
- repo scope.
- CML_PAT.
- GCP_SERVICE_ACCOUNT_KEY.

Here are some logs that might help you:
Logs of the runner just after the start
Logs of the runner after some time
Logs of the GitHub workflow
I was able to check if the runner was successfully able to register to GitHub by running the following command (from the GitHub API documentation):
Output of the cURL command
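The same registration check can be scripted against the GitHub REST API endpoint `GET /repos/{owner}/{repo}/actions/runners`. This is a minimal Python sketch, not the exact command used above. The function names are illustrative, and `eligible_runners` adds one assumption of its own: a job is only picked up by a runner that is online, idle, and carries every label listed in the job's `runs-on`.

```python
import json
import urllib.request


def fetch_runners(owner: str, repo: str, token: str) -> list[dict]:
    """List self-hosted runners for a repository (first page of results only)."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runners",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("runners", [])


def eligible_runners(runners: list[dict], runs_on: list[str]) -> list[str]:
    """Names of runners that are online, idle, and carry every label in runs_on."""
    wanted = set(runs_on)
    names = []
    for runner in runners:
        labels = {label["name"] for label in runner.get("labels", [])}
        if runner.get("status") == "online" and not runner.get("busy", False) and wanted <= labels:
            names.append(runner["name"])
    return names


# Example usage (owner, repo, token and labels are placeholders):
#   runners = fetch_runners("my-org", "my-repo", token)
#   print(eligible_runners(runners, ["self-hosted", "my-runner-label"]))
```

If the runner shows up as online but is not eligible for the job's labels, the workflow would wait indefinitely even though registration succeeded. That is only one possible explanation and is not confirmed in this thread.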
You can find a repository with the code used to reproduce this issue here.
I created two workflows to test the runner:
- workflow-from-actions.yml using CML official GitHub Actions
- workflow-from-sources.yml using CML and TPI from sources

You can find the execution of the two workflows here and here.
I did try all sorts of things to try to make it work, but I was not able to find a solution. I tried to:
- repo scope
- permissions to the GitHub workflow file
- 0.18.x)
- --cloud-image="iterativeai/cml:0-dvc3-base1-gpu", --tpi-version="= 0.11.18" and --cml-version="0.19.0" arguments to set older versions of CML and TPI

Please let me know if I can be of any help and thank you!