Training Operator V2 Installation - Certificate error #2404

Open
Sharathmk99 opened this issue Jan 25, 2025 · 3 comments

Comments

@Sharathmk99

What happened?

I tried to install Training Operator V2 using:

kustomize build overlays/standalone | k apply --server-side -f -

But I see a few errors in the operator logs:

{"level":"info","ts":"2025-01-25T00:33:26.11952301Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:198","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"info","ts":"2025-01-25T00:33:26.129701605Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:151","msg":"Starting manager"}
{"level":"info","ts":"2025-01-25T00:33:26.129896477Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:159","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":"2025-01-25T00:33:26.129986683Z","caller":"manager/server.go:83","msg":"starting server","name":"health probe","addr":"0.0.0.0:8081"}
{"level":"info","ts":"2025-01-25T00:33:26.230660358Z","logger":"cert-rotation","caller":"rotator/rotator.go:283","msg":"starting cert rotator controller"}
{"level":"info","ts":"2025-01-25T00:33:26.230850649Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2025-01-25T00:33:26.231048199Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2025-01-25T00:33:26.231094307Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"cert-rotator"}
{"level":"info","ts":"2025-01-25T00:33:26.335062878Z","logger":"cert-rotation","caller":"rotator/rotator.go:327","msg":"refreshing CA and server certs"}
{"level":"info","ts":"2025-01-25T00:33:26.336526061Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"cert-rotator","worker count":1}
{"level":"info","ts":"2025-01-25T00:33:26.336964719Z","logger":"cert-rotation","caller":"rotator/rotator.go:327","msg":"refreshing CA and server certs"}
{"level":"info","ts":"2025-01-25T00:33:26.649546565Z","logger":"cert-rotation","caller":"rotator/rotator.go:333","msg":"server certs refreshed"}
{"level":"error","ts":"2025-01-25T00:33:26.828035288Z","logger":"cert-rotation","caller":"rotator/rotator.go:329","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"training-operator-v2-webhook-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:329\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:357\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:772\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"}
{"level":"info","ts":"2025-01-25T00:33:26.84476085Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"error","ts":"2025-01-25T00:33:26.844879706Z","logger":"cert-rotation","caller":"rotator/rotator.go:786","msg":"secret is not well-formed, cannot update webhook configurations","error":"Cert secret is not well-formed, missing ca.crt","errorVerbose":"Cert secret is not well-formed, missing ca.crt\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.buildArtifactsFromSecret\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:508\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:784\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:786\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"}
{"level":"info","ts":"2025-01-25T00:33:26.846220958Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-25T00:33:26.846686859Z","logger":"cert-rotation","caller":"rotator/rotator.go:834","msg":"Ensuring CA cert","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
{"level":"info","ts":"2025-01-25T00:33:26.855632829Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-25T00:33:26.855880309Z","logger":"cert-rotation","caller":"rotator/rotator.go:834","msg":"Ensuring CA cert","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
{"level":"info","ts":"2025-01-25T00:33:28.444279918Z","logger":"cert-rotation","caller":"rotator/rotator.go:873","msg":"certs are ready in /tmp/k8s-webhook-server/serving-certs"}
{"level":"info","ts":"2025-01-25T00:33:28.444380977Z","logger":"cert-rotation","caller":"rotator/rotator.go:893","msg":"CA certs are injected to webhooks"}
{"level":"info","ts":"2025-01-25T00:33:28.444446367Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:161","msg":"Certs ready"}
{"level":"info","ts":"2025-01-25T00:33:28.447769906Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime"}
{"level":"info","ts":"2025-01-25T00:33:28.447915533Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-25T00:33:28.448102942Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v2alpha1.TrainJob"}
{"level":"info","ts":"2025-01-25T00:33:28.448094826Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:191","msg":"Starting webhook server"}
{"level":"info","ts":"2025-01-25T00:33:28.448189893Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v1alpha2.JobSet"}
{"level":"info","ts":"2025-01-25T00:33:28.448203562Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:108","msg":"disabling http/2"}
{"level":"info","ts":"2025-01-25T00:33:28.448132635Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-25T00:33:28.448216867Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob"}
{"level":"info","ts":"2025-01-25T00:33:28.448361527Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime"}
{"level":"info","ts":"2025-01-25T00:33:28.44852866Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-25T00:33:28.448658051Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-25T00:33:28.448757722Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob"}
{"level":"info","ts":"2025-01-25T00:33:28.448844606Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-25T00:33:28.448899689Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:161","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2025-01-25T00:33:28.448986981Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-25T00:33:28.449144418Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2025-01-25T00:33:28.449230561Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:115","msg":"Starting certificate watcher"}
{"level":"info","ts":"2025-01-25T00:33:28.549636012Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","worker count":1}
2025/01/25 00:34:40 http: TLS handshake error from 10.244.1.0:15324: EOF
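
Since the operator logs are one JSON object per line, the error entries can be pulled out of the noise with a few lines of Python. This is only a sketch: `error_entries` is a hypothetical helper name, and the sample lines are shortened copies of the entries above (the `caller` and `stacktrace` fields are dropped for brevity).

```python
import json


def error_entries(log_text):
    """Return the parsed JSON log entries whose level is "error"."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            # Plain-text lines such as "http: TLS handshake error ..." are skipped.
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Truncated or wrapped lines are skipped rather than guessed at.
            continue
        if entry.get("level") == "error":
            entries.append(entry)
    return entries


# Shortened sample taken from the log above (assumption: fields trimmed).
sample = "\n".join([
    '{"level":"info","ts":"2025-01-25T00:33:26.649546565Z","logger":"cert-rotation","msg":"server certs refreshed"}',
    '{"level":"error","ts":"2025-01-25T00:33:26.828035288Z","logger":"cert-rotation","msg":"could not refresh CA and server certs"}',
    "2025/01/25 00:34:40 http: TLS handshake error from 10.244.1.0:15324: EOF",
])

for entry in error_entries(sample):
    print(entry["ts"], entry["logger"], entry["msg"])
```

The same function can be fed the output of `kubectl logs` saved to a file, which makes it easier to compare the error set between operator restarts.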

When I run the example MNIST script, I get the error below:

kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '45641c53-8ad1-424b-b95e-53d4b4e83619', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '82e3d690-7d2e-4ad6-a7f5-510ca551c18b', 'X-Kubernetes-Pf-Prioritylevel-Uid': '82e78391-8ebb-4d00-a3e1-2453f78087a0', 'Date': 'Sat, 25 Jan 2025 00:34:40 GMT', 'Content-Length': '629'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": context deadline exceeded","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": context deadline exceeded"}]},"code":500}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/abc/repo/training-operator/example/mnist.py", line 321, in <module>
    job_name = client.train(
  File "/Users/abc/repo/training-operator/example/.venv/lib/python3.10/site-packages/kubeflow/training/api/training_client.py", line 256, in train
    raise RuntimeError(
RuntimeError: Failed to create TrainJob: default/da2981fcab1b
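
The useful part of this exception is buried in the JSON response body. A rough sketch of how to extract the webhook name, the URL the API server tried to call, and the underlying cause is shown here. `webhook_failure` is a hypothetical helper, and the regular expression assumes the message keeps the wording seen in this issue.

```python
import json
import re

# Assumption: the message format matches the one in the response body above.
PATTERN = re.compile(
    r'failed calling webhook "([^"]+)": failed to call webhook: Post "([^"]+)": (.+)$'
)


def webhook_failure(body):
    """Extract webhook name, URL and cause from a Kubernetes Status body."""
    status = json.loads(body)
    match = PATTERN.search(status.get("message", ""))
    if match is None:
        return None
    name, url, cause = match.groups()
    return {"webhook": name, "url": url, "cause": cause}


# Trimmed copy of the response body from the traceback above.
sample_body = json.dumps({
    "kind": "Status",
    "status": "Failure",
    "message": (
        'Internal error occurred: failed calling webhook "validator.trainjob.kubeflow.org": '
        'failed to call webhook: Post "https://training-operator-v2.kubeflow-system.svc:443/'
        'validate-kubeflow-org-v2alpha1-trainjob?timeout=10s": context deadline exceeded'
    ),
    "code": 500,
})

print(webhook_failure(sample_body))
```

With the `kubernetes` Python client, the string to pass in is the `body` attribute of the caught `ApiException`.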

What did you expect to happen?

The example script should run successfully.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

Thank you for testing the Kubeflow Trainer V2 @Sharathmk99!
Could you please check that your cluster doesn't have the training-operator-v2-webhook-cert secret before deploying the V2 manifests:

$ kubectl delete -k v2/overlays/standalone --server-side
$ kubectl apply -k v2/overlays/standalone --server-side

It looks like the cert-rotator can't update the Secret.
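
For context, the "the object has been modified; please apply your changes to the latest version and try again" message is what the API server returns when an update carries a stale resourceVersion, i.e. something else wrote the object between the read and the write. The usual remedy is to re-read and retry, and the stack trace in the logs does pass through `wait.ExponentialBackoff`. The sketch below simulates that pattern in memory. It is not cert-controller's code; `FakeStore`, `Conflict` and `update_with_retry` are hypothetical names used only for illustration.

```python
class Conflict(Exception):
    """Stands in for the conflict error returned on a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for the API server holding a single object."""

    def __init__(self, data):
        self.version = 1
        self.data = dict(data)

    def get(self):
        return self.version, dict(self.data)

    def update(self, seen_version, new_data):
        # Reject the write if the caller read an older version of the object.
        if seen_version != self.version:
            raise Conflict("the object has been modified")
        self.data = dict(new_data)
        self.version += 1


def update_with_retry(store, mutate, attempts=5):
    """Re-read the object and re-apply the change whenever a conflict occurs."""
    for attempt in range(1, attempts + 1):
        version, data = store.get()
        mutate(data)
        try:
            store.update(version, data)
            return attempt
        except Conflict:
            continue
    raise Conflict(f"still conflicting after {attempts} attempts")


store = FakeStore({"tls.crt": "old"})
calls = {"n": 0}


def mutate(data):
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate a second writer updating the object between our read and write.
        version, other = store.get()
        other["owner"] = "other-writer"
        store.update(version, other)
    data["tls.crt"] = "new"


attempts_used = update_with_retry(store, mutate)
print(attempts_used, store.data)
```

If the real rotator behaves like this, a single conflict during startup would be transient, which is why it is worth checking whether the error keeps repeating or appears only once.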

@Sharathmk99
Author

Sharathmk99 commented Jan 27, 2025

Thank you @andreyvelich for the quick response.

I made sure the secret was not present in the cluster:

k get secret -n kubeflow-system training-operator-v2-webhook-cert
Error from server (NotFound): namespaces "kubeflow-system" not found

Then I applied the manifests using:

kustomize build overlays/standalone | k apply --server-side -f -

I see the output below:

namespace/jobset-system serverside-applied
namespace/kubeflow-system serverside-applied
customresourcedefinition.apiextensions.k8s.io/clustertrainingruntimes.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/jobsets.jobset.x-k8s.io serverside-applied
customresourcedefinition.apiextensions.k8s.io/trainingruntimes.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/trainjobs.kubeflow.org serverside-applied
serviceaccount/jobset-controller-manager serverside-applied
serviceaccount/training-operator-v2 serverside-applied
role.rbac.authorization.k8s.io/jobset-leader-election-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-manager-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role serverside-applied
clusterrole.rbac.authorization.k8s.io/training-operator-v2 serverside-applied
rolebinding.rbac.authorization.k8s.io/jobset-leader-election-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-manager-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-proxy-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/training-operator-v2 serverside-applied
secret/jobset-webhook-server-cert serverside-applied
secret/training-operator-v2-webhook-cert serverside-applied
service/jobset-controller-manager-metrics-service serverside-applied
service/jobset-webhook-service serverside-applied
service/training-operator-v2 serverside-applied
deployment.apps/jobset-controller-manager serverside-applied
deployment.apps/training-operator-v2 serverside-applied
clustertrainingruntime.kubeflow.org/torch-distributed serverside-applied
mutatingwebhookconfiguration.admissionregistration.k8s.io/jobset-mutating-webhook-configuration serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/jobset-validating-webhook-configuration serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/validator.training-operator-v2.kubeflow.org serverside-applied

I see that a new secret was created:

k get secret -n kubeflow-system training-operator-v2-webhook-cert
NAME                                TYPE     DATA   AGE
training-operator-v2-webhook-cert   Opaque   4      20s
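
The `DATA 4` column only says that four keys exist, not which ones. Because the earlier log complained about a missing `ca.crt`, it may be worth checking the key names explicitly. This is a minimal sketch that works on the parsed output of `kubectl get secret ... -o json`. The key list in `REQUIRED_KEYS` is an assumption (only `ca.crt` is named in the logs), and `missing_cert_keys` is a hypothetical helper.

```python
# Assumption: the rotator expects these four keys; only "ca.crt" is
# confirmed by the error message in the operator logs.
REQUIRED_KEYS = ("ca.crt", "ca.key", "tls.crt", "tls.key")


def missing_cert_keys(secret):
    """Return the expected keys that are absent or empty in a Secret dict."""
    data = secret.get("data") or {}
    return [key for key in REQUIRED_KEYS if not data.get(key)]


# Placeholder Secret, shaped like the JSON that kubectl prints.
secret = {
    "kind": "Secret",
    "metadata": {"name": "training-operator-v2-webhook-cert"},
    "data": {"tls.crt": "LS0t", "tls.key": "LS0t"},
}

print(missing_cert_keys(secret))
```

An empty list means every expected key is present and non-empty; it says nothing about whether the certificates themselves are valid.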

I still see some errors related to CA rotation:

{"level":"info","ts":"2025-01-27T11:59:42.354951862Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:198","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"info","ts":"2025-01-27T11:59:42.364251253Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:151","msg":"Starting manager"}
{"level":"info","ts":"2025-01-27T11:59:42.364434858Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:159","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":"2025-01-27T11:59:42.364650638Z","caller":"manager/server.go:83","msg":"starting server","name":"health probe","addr":"0.0.0.0:8081"}
{"level":"info","ts":"2025-01-27T11:59:42.465549503Z","logger":"cert-rotation","caller":"rotator/rotator.go:283","msg":"starting cert rotator controller"}
{"level":"info","ts":"2025-01-27T11:59:42.465677031Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2025-01-27T11:59:42.46579984Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2025-01-27T11:59:42.465844423Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"cert-rotator"}
{"level":"info","ts":"2025-01-27T11:59:42.570820478Z","logger":"cert-rotation","caller":"rotator/rotator.go:327","msg":"refreshing CA and server certs"}
{"level":"info","ts":"2025-01-27T11:59:42.573314806Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"cert-rotator","worker count":1}
{"level":"info","ts":"2025-01-27T11:59:42.57365322Z","logger":"cert-rotation","caller":"rotator/rotator.go:327","msg":"refreshing CA and server certs"}
{"level":"info","ts":"2025-01-27T11:59:43.01433926Z","logger":"cert-rotation","caller":"rotator/rotator.go:333","msg":"server certs refreshed"}
{"level":"info","ts":"2025-01-27T11:59:43.015894357Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-27T11:59:43.01641181Z","logger":"cert-rotation","caller":"rotator/rotator.go:834","msg":"Ensuring CA cert","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
{"level":"info","ts":"2025-01-27T11:59:43.025815832Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-27T11:59:43.026321461Z","logger":"cert-rotation","caller":"rotator/rotator.go:834","msg":"Ensuring CA cert","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
{"level":"error","ts":"2025-01-27T11:59:43.136008903Z","logger":"cert-rotation","caller":"rotator/rotator.go:329","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"training-operator-v2-webhook-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:329\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:357\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:285\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}
{"level":"info","ts":"2025-01-27T11:59:43.15691219Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-27T11:59:44.671107519Z","logger":"cert-rotation","caller":"rotator/rotator.go:873","msg":"certs are ready in /tmp/k8s-webhook-server/serving-certs"}
{"level":"info","ts":"2025-01-27T11:59:44.671214632Z","logger":"cert-rotation","caller":"rotator/rotator.go:893","msg":"CA certs are injected to webhooks"}
{"level":"info","ts":"2025-01-27T11:59:44.671328673Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:161","msg":"Certs ready"}
{"level":"info","ts":"2025-01-27T11:59:44.67504124Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime"}
{"level":"info","ts":"2025-01-27T11:59:44.675245471Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-27T11:59:44.675372007Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v2alpha1.TrainJob"}
{"level":"info","ts":"2025-01-27T11:59:44.675477155Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v1alpha2.JobSet"}
{"level":"info","ts":"2025-01-27T11:59:44.675495568Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-27T11:59:44.675520476Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob"}
{"level":"info","ts":"2025-01-27T11:59:44.675443881Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:191","msg":"Starting webhook server"}
{"level":"info","ts":"2025-01-27T11:59:44.675695675Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime"}
{"level":"info","ts":"2025-01-27T11:59:44.675668912Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:108","msg":"disabling http/2"}
{"level":"info","ts":"2025-01-27T11:59:44.675863964Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-27T11:59:44.676211252Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-27T11:59:44.676468219Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob"}
{"level":"info","ts":"2025-01-27T11:59:44.676567804Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:161","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2025-01-27T11:59:44.676635151Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-27T11:59:44.67683655Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2025-01-27T11:59:44.676878013Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-27T11:59:44.676927885Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:115","msg":"Starting certificate watcher"}
{"level":"info","ts":"2025-01-27T11:59:44.776492005Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","worker count":1}

When I try running:

python mnist.py --num-workers 4 --worker-resources "nvidia.com/gpu" 1 --worker-resource cpu 4 --worker-resources memory 16Gi --epochs 100 --batch-size 100 --lr 1e-1 --lr-period 25 --lr-gamma 0.7

I still get an error:

Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e4e98273-e267-4ca4-942c-cae655519bec', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '82e3d690-7d2e-4ad6-a7f5-510ca551c18b', 'X-Kubernetes-Pf-Prioritylevel-Uid': '82e78391-8ebb-4d00-a3e1-2453f78087a0', 'Date': 'Mon, 27 Jan 2025 12:01:41 GMT', 'Content-Length': '641'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": net/http: TLS handshake timeout","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": net/http: TLS handshake timeout"}]},"code":500}

In the operator logs I see:

2025/01/27 12:01:41 http: TLS handshake error from 10.244.0.0:2869: EOF
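
One thing that can be ruled out with data already in the cluster is a mismatch between the CA in the Secret and the `caBundle` injected into the webhook configuration. A handshake timeout or EOF does not by itself point at this, so treat the following as a diagnostic sketch only. It compares the two values as returned by `kubectl get ... -o json`, where both fields are base64-encoded. `webhooks_with_stale_ca` is a hypothetical helper, and `example.stale.webhook` is a made-up entry for the demonstration.

```python
import base64


def decode(value):
    """Decode a base64 field from `kubectl get -o json`; empty bytes if unset."""
    return base64.b64decode(value).strip() if value else b""


def webhooks_with_stale_ca(secret, webhook_config):
    """Return names of webhooks whose caBundle differs from the Secret's ca.crt."""
    ca = decode((secret.get("data") or {}).get("ca.crt"))
    stale = []
    for hook in webhook_config.get("webhooks") or []:
        bundle = decode((hook.get("clientConfig") or {}).get("caBundle"))
        if not ca or bundle != ca:
            stale.append(hook.get("name", "<unnamed>"))
    return stale


def b64(text):
    return base64.b64encode(text.encode()).decode()


# Placeholder objects; the real ones come from kubectl.
secret = {"data": {"ca.crt": b64("CA-PEM-A\n")}}
config = {"webhooks": [
    {"name": "validator.trainjob.kubeflow.org",
     "clientConfig": {"caBundle": b64("CA-PEM-A\n")}},
    {"name": "example.stale.webhook",  # hypothetical entry with a different CA
     "clientConfig": {"caBundle": b64("CA-PEM-B\n")}},
]}

print(webhooks_with_stale_ca(secret, config))
```

If the list is empty, the CA material is consistent and the problem is more likely on the network path between the API server and the webhook Service.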

I restarted the operator, and I no longer see any errors related to CA rotation:

{"level":"info","ts":"2025-01-27T12:02:27.70945057Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:198","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"info","ts":"2025-01-27T12:02:27.718117238Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:151","msg":"Starting manager"}
{"level":"info","ts":"2025-01-27T12:02:27.718245891Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:159","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":"2025-01-27T12:02:27.718431507Z","caller":"manager/server.go:83","msg":"starting server","name":"health probe","addr":"0.0.0.0:8081"}
{"level":"info","ts":"2025-01-27T12:02:27.819224705Z","logger":"cert-rotation","caller":"rotator/rotator.go:283","msg":"starting cert rotator controller"}
{"level":"info","ts":"2025-01-27T12:02:27.819388108Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2025-01-27T12:02:27.819478411Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2025-01-27T12:02:27.819501946Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"cert-rotator"}
{"level":"info","ts":"2025-01-27T12:02:27.924595919Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-27T12:02:27.924771375Z","logger":"cert-rotation","caller":"rotator/rotator.go:873","msg":"certs are ready in /tmp/k8s-webhook-server/serving-certs"}
{"level":"info","ts":"2025-01-27T12:02:27.925954323Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"cert-rotator","worker count":1}
{"level":"info","ts":"2025-01-27T12:02:27.927567201Z","logger":"cert-rotation","caller":"rotator/rotator.go:354","msg":"no cert refresh needed"}
{"level":"info","ts":"2025-01-27T12:02:27.928038677Z","logger":"cert-rotation","caller":"rotator/rotator.go:834","msg":"Ensuring CA cert","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration","name":"validator.training-operator-v2.kubeflow.org","gvk":"admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
{"level":"info","ts":"2025-01-27T12:02:29.211325214Z","logger":"cert-rotation","caller":"rotator/rotator.go:893","msg":"CA certs are injected to webhooks"}
{"level":"info","ts":"2025-01-27T12:02:29.211438566Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:161","msg":"Certs ready"}
{"level":"info","ts":"2025-01-27T12:02:29.215470292Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime"}
{"level":"info","ts":"2025-01-27T12:02:29.215746678Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=ClusterTrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-27T12:02:29.215853383Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:191","msg":"Starting webhook server"}
{"level":"info","ts":"2025-01-27T12:02:29.215871389Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v2alpha1.TrainJob"}
{"level":"info","ts":"2025-01-27T12:02:29.215931396Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:108","msg":"disabling http/2"}
{"level":"info","ts":"2025-01-27T12:02:29.215991577Z","caller":"controller/controller.go:175","msg":"Starting EventSource","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","source":"kind source: *v1alpha2.JobSet"}
{"level":"info","ts":"2025-01-27T12:02:29.216048341Z","caller":"controller/controller.go:183","msg":"Starting Controller","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob"}
{"level":"info","ts":"2025-01-27T12:02:29.216072176Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-clustertrainingruntime"}
{"level":"info","ts":"2025-01-27T12:02:29.216273695Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime"}
{"level":"info","ts":"2025-01-27T12:02:29.216446544Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainingRuntime","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-27T12:02:29.21669279Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainingruntime"}
{"level":"info","ts":"2025-01-27T12:02:29.216865098Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob"}
{"level":"info","ts":"2025-01-27T12:02:29.216876805Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:161","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2025-01-27T12:02:29.217027668Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","GVK":"kubeflow.org/v2alpha1, Kind=TrainJob","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-27T12:02:29.217173468Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2025-01-27T12:02:29.2173382Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v2alpha1-trainjob"}
{"level":"info","ts":"2025-01-27T12:02:29.217339246Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:115","msg":"Starting certificate watcher"}
{"level":"info","ts":"2025-01-27T12:02:29.317102086Z","caller":"controller/controller.go:217","msg":"Starting workers","controller":"trainjob","controllerGroup":"kubeflow.org","controllerKind":"TrainJob","worker count":1}

I still see the same error:

2025/01/27 12:03:26 http: TLS handshake error from 10.244.4.0:34677: EOF
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '7c5b5814-7a0c-4895-9a57-ff83b01e5bb8', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '82e3d690-7d2e-4ad6-a7f5-510ca551c18b', 'X-Kubernetes-Pf-Prioritylevel-Uid': '82e78391-8ebb-4d00-a3e1-2453f78087a0', 'Date': 'Mon, 27 Jan 2025 12:03:26 GMT', 'Content-Length': '641'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": net/http: TLS handshake timeout","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"validator.trainjob.kubeflow.org\": failed to call webhook: Post \"https://training-operator-v2.kubeflow-system.svc:443/validate-kubeflow-org-v2alpha1-trainjob?timeout=10s\": net/http: TLS handshake timeout"}]},"code":500}

I'm pretty sure this is related to the certificate, but I'm not sure how to fix it.

@andreyvelich
Member

Do you have any specific configuration in your cluster?
Does it accept self-signed certificates, @Sharathmk99?
cc @Electronic-Waste @tenzen-y
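
One way to narrow this down is to check where the connection to the webhook Service stalls, since the error above is a TLS handshake timeout rather than a certificate verification failure. Below is a minimal diagnostic sketch using only the Python standard library. The helper name `probe_tls` and its return labels are hypothetical and not part of the training operator. Certificate verification is deliberately disabled so that a self-signed certificate does not mask a network-level problem.

```python
import socket
import ssl


def probe_tls(host, port, timeout=10.0):
    """Return a short label describing where a TLS connection attempt stops.

    Hypothetical helper for debugging only; not part of any Kubeflow API.
    """
    try:
        # Step 1: plain TCP connect. A failure here points at Service/DNS/
        # network policy rather than at certificates.
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"tcp_failed: {exc}"

    # Step 2: TLS handshake. Verification is turned off on purpose so that
    # a self-signed serving certificate does not cause a failure here.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return f"handshake_ok: {tls.version()}"
    except socket.timeout:
        # TCP connected, but the server never answered the ClientHello.
        raw.close()
        return "handshake_timeout"
    except (ssl.SSLError, OSError) as exc:
        raw.close()
        return f"handshake_failed: {exc}"


if __name__ == "__main__":
    # Host and port are taken from the webhook URL in the error message;
    # run this from a pod inside the cluster so the Service name resolves.
    print(probe_tls("training-operator-v2.kubeflow-system.svc", 443))
```

If this reports `handshake_timeout` from inside the cluster as well, the problem is more likely in the network path between the API server and the operator pod (for example a CNI or proxy setting) than in the generated certificate itself. That is an assumption to verify, not a confirmed diagnosis.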
