Race condition in container image CI when the commit has multiple tags #350
Thanks for the report, although this issue should have been opened on clastix/kamaji, since it's clearly Kamaji-scoped.
@Heiko-san I think you're building the image on your own using a private repository and referring to a custom tag. The migration feature expects 1:1 parity with the public registry, as you can see here: Line 251 in e34fc18
Please also take note of Lines 163 to 169 in e34fc18.
For such use cases I suggest using the --migrate-image CLI argument (a sketch follows below). Please let me know if it works for you.
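A possible way to wire that flag in via Helm, assuming the kamaji chart exposes an extraArgs list for the manager container (verify against the chart's values.yaml for your version; both the key and the tag below are illustrative):
# values-patch.yaml (hypothetical snippet; the extraArgs key is assumed):
# extraArgs:
#   - --migrate-image=clastix/kamaji:v0.3.3
helm upgrade kamaji clastix/kamaji -n kamaji-system -f values-patch.yaml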
No, we don't build images on our own here.
@Heiko-san the issue is that we don't have any published image with that prefix: https://hub.docker.com/r/clastix/kamaji/tags?page=1&name=helm
As explained here, the resulting tag is extracted from Line 6 in e34fc18
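For illustration, a minimal sketch of how a commit carrying multiple tags can trip a version computation based on git describe; the exact Makefile recipe isn't shown in this thread, so the tag names and the leading v are assumptions inferred from the broken image reference reported above:
# Reproduce the ambiguity: two tags on the same commit.
git init /tmp/tag-race && cd /tmp/tag-race
git -c user.name=ci -c user.email=ci@example.com commit --allow-empty -m "release commit"
git tag v0.3.3         # application release tag
git tag helm-v0.12.4   # Helm chart release tag on the same commit
# With several tags pointing at HEAD, which one `git describe --tags`
# returns is effectively arbitrary from the CI's point of view:
git describe --tags    # may print helm-v0.12.4 instead of v0.3.3
# Prefixing the result with "v" (assumed) then yields the broken reference:
echo "clastix/kamaji:v$(git describe --tags)"  # clastix/kamaji:vhelm-v0.12.4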
I'm running Kamaji with this version:
I have the following datastores:
I can start the migration of the following TCP:
After switching the Datastore, this is the watch output of the TCP:
With that said, I think something is "broken" in your environment, and I don't have enough details to understand what exactly it is.
This is what I've done to reproduce the issue:
kg datastores.kamaji.clastix.io
#NAME DRIVER AGE
#default etcd 7d
kg tcp -A
#NAMESPACE NAME VERSION STATUS CONTROL-PLANE ENDPOINT KUBECONFIG DATASTORE AGE
#bcp-cluster-test1 bcp-cluster-test1 v1.25.7 Ready 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig default 3h55m
# this is the datastore created for issue 43 testing
k create ns kamaji-etcd-3
kubens kamaji-etcd-3
k apply -f secret.yaml
helm install kamaji-etcd-3 charts/kamaji-etcd -n kamaji-etcd-3 -f values.yaml
# when it's ready:
kg datastores.kamaji.clastix.io
#NAME DRIVER AGE
#default etcd 7d1h
#kamaji-etcd-3 etcd 35m
replicas: 3
serviceAccount:
create: true
name: ""
image:
repository: quay.io/coreos/etcd
tag: ""
pullPolicy: IfNotPresent
peerApiPort: 2380
clientPort: 2379
metricsPort: 2381
livenessProbe: {}
extraArgs: []
autoCompactionMode: periodic
autoCompactionRetention: 5m
snapshotCount: "10000"
persistenVolumeClaim:
size: 5Gi
storageClassName: ""
accessModes:
- ReadWriteOnce
customAnnotations: {}
defragmentation:
backup:
enabled: true
all: false
snapshotNamePrefix: kamaji-etcd-3
snapshotDateFormat: $(date +%Y%m%d-%H%M)
s3:
url: https://s3.our.domain
bucket: kamaji/kamaji-etcd-3/
accessKey:
valueFrom:
secretKeyRef:
key: access_key
name: minio-key
secretKey:
valueFrom:
secretKeyRef:
key: secret_key
name: minio-key
image:
repository: minio/mc
tag: "RELEASE.2022-11-07T23-47-39Z"
pullPolicy: IfNotPresent
podLabels:
application: kamaji-etcd-3
podAnnotations: {}
securityContext:
allowPrivilegeEscalation: false
priorityClassName: system-cluster-critical
resources:
limits: {}
requests: {}
nodeSelector:
kubernetes.io/os: linux
tolerations: []
affinity: {}
topologySpreadConstraints: []
datastore:
enabled: true
serviceMonitor:
enabled: false
namespace: ''
labels: {}
annotations: {}
matchLabels: {}
targetLabels: []
serviceAccount:
name: etcd
namespace: etcd-system
endpoint:
interval: "15s"
scrapeTimeout: ""
metricRelabelings: []
relabelings: []
alerts:
enabled: false
namespace: ''
labels: {}
annotations: {}
rules: []
kubens bcp-cluster-test1
kubectl patch --type merge tcp bcp-cluster-test1 -p '{"spec": {"dataStore": "kamaji-etcd-3"}}'
kubectl get tcp -w
bcp-cluster-test1 v1.25.7 Ready 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig default 5h30m
bcp-cluster-test1 v1.25.7 Ready 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig default 5h30m
bcp-cluster-test1 v1.25.7 Migrating 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig default 5h30m
bcp-cluster-test1 v1.25.7 Migrating 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig default 5h30m
# stuck here ...
kg -n kamaji-system pod
#NAME READY STATUS RESTARTS AGE
#capi-kamaji-controller-manager-77df58dcdd-q7p4s 1/1 Running 4 (119m ago) 23h
#etcd-0 1/1 Running 0 45h
#etcd-1 1/1 Running 0 44h
#etcd-2 1/1 Running 0 44h
#kamaji-6fb5f95bb7-j4smm 1/1 Running 3 (119m ago) 24h
#migrate-bcp-cluster-test1-bcp-cluster-test1-6m7gb 0/1 ImagePullBackOff 0 71s
kg -n kamaji-system pod migrate-bcp-cluster-test1-bcp-cluster-test1-6m7gb -o yaml | yq -r '.spec.containers[0].image'
#clastix/kamaji:vhelm-v0.12.4
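To double-check that the computed tag has no counterpart on the public registry, a quick query against Docker Hub's v2 API can confirm it (the endpoint shape below is the standard public one, used here only for illustration):
# Expect a "not found" style response for the broken tag:
curl -s "https://hub.docker.com/v2/repositories/clastix/kamaji/tags/vhelm-v0.12.4/"
# ...whereas the real release tag resolves:
curl -s "https://hub.docker.com/v2/repositories/clastix/kamaji/tags/v0.3.3/"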
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.11.0 \
--set installCRDs=true
helm repo add clastix https://clastix.github.io/charts
helm repo update
helm install kamaji clastix/kamaji -n kamaji-system --create-namespace
Still unable to replicate, @Heiko-san. May I ask you to share Kamaji's first log lines?
Also, please share the image running as the Kamaji manager:
kubectl -n kamaji-system get pods -l app.kubernetes.io/instance=kamaji -l app.kubernetes.io/component=controller-manager -o jsonpath='{.items[0].spec.containers[0].image}'
#clastix/kamaji:v0.3.3
kubectl -n kamaji-system logs deployments/kamaji --tail=-1|head -n6
#2023/08/25 09:05:44 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
#2023-08-25T09:05:45Z INFO setup Kamaji version helm-v0.12.4 e34fc18
#2023-08-25T09:05:45Z INFO setup Build from: https://github.com/clastix/kamaji
#2023-08-25T09:05:45Z INFO setup Build date: 2023-08-08T09:03:13
#2023-08-25T09:05:45Z INFO setup Go Version: go1.19.12
#2023-08-25T09:05:45Z INFO setup Go OS/Arch: linux/amd64
I guess this is the source of the problem (?)
We may also look at the setup together via Teams if it helps.
@Heiko-san thanks, I noticed a bug in the container CI automation: the version it computes differs from the one computed through the Makefile. Compare kamaji/.github/workflows/docker-ci.yml Line 22 in 2b638fe
with Line 6 in 2b638fe.
I'll try to re-release the container image for v0.3.3, which was hit by this race condition, as well as a hotfix for this CI automation.
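A possible direction for such a hotfix, sketched here under the assumption that the workflow derives the version from git describe (the actual change shipped in the workflow isn't shown in this thread): select explicitly among the tags pointing at HEAD instead of taking whichever one git describe returns.
# Deterministically pick the application release tag (vX.Y.Z) even when
# the commit also carries a helm-vX.Y.Z chart tag:
VERSION=$(git tag --points-at HEAD | grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n1)
echo "clastix/kamaji:${VERSION}"   # clastix/kamaji:v0.3.3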
@Heiko-san the rebuilt v0.3.3 image has been released on Docker Hub, and the version has been computed correctly this time.
Thanks for the detailed bug report: this was a subtle bug to catch!
Hi @prometherion,
I was not sure whether a restart would suffice (or whether we had to reinstall), since you wrote it's a race condition. The version seems to be correct even after multiple restarts; however, the image resolved for the migration pod still has an issue:
Same after we helm-uninstalled kamaji and reinstalled it fresh:
kubectl -n kamaji-system logs deployments/kamaji --tail=-1|head -n6
#2023/08/28 11:56:27 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
#2023-08-28T11:56:28Z INFO setup Kamaji version v0.3.3 e34fc18
#2023-08-28T11:56:28Z INFO setup Build from: https://github.com/clastix/kamaji
#2023-08-28T11:56:28Z INFO setup Build date: 2023-08-08T09:03:13
#2023-08-28T11:56:28Z INFO setup Go Version: go1.19.12
#2023-08-28T11:56:28Z INFO setup Go OS/Arch: linux/amd64
and:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2023-08-28T12:07:17Z"
generateName: migrate-bcp-cluster-test1-bcp-cluster-test1-
labels:
controller-uid: eb5eda20-74f1-4ec6-84e6-d33fee0990ba
job-name: migrate-bcp-cluster-test1-bcp-cluster-test1
name: migrate-bcp-cluster-test1-bcp-cluster-test1-kjgmk
namespace: kamaji-system
resourceVersion: "5403260"
uid: 4e0d5534-7bef-4b9f-b390-c372a9049a92
spec:
containers:
- args:
- migrate
- --tenant-control-plane=bcp-cluster-test1/bcp-cluster-test1
- --target-datastore=kamaji-etcd-4
command:
- /kamaji
image: clastix/kamaji:vv0.3.3
imagePullPolicy: IfNotPresent
#...
Line 159 in 973392b
We should return the version with no additional v prefix there. I'll address this for the v0.3.4 release. In the meanwhile you can solve your issue without fixing the image reference manually: just specify the --migrate-image CLI argument. Thanks for giving this a try!
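For clarity, a minimal sketch of the doubled-prefix bug, assuming the migrate job builds its image reference by prepending a literal v to the reported version (inferred from Line 159 above and the vv0.3.3 tag observed in the pod spec):
# Once the version string already carries its leading "v", prepending
# another one produces the unpullable reference seen above:
VERSION="v0.3.3"
echo "clastix/kamaji:v${VERSION}"   # clastix/kamaji:vv0.3.3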
Hmm, with the flag set as follows:
containers:
- args:
- manager
- --health-probe-bind-address=:8081
- --leader-elect
- --metrics-bind-address=:8080
- --tmp-directory=/tmp/kamaji
- --datastore=default
- --migrate-image=clastix/kamaji:v0.3.3
The migration pod actually starts and completes:
kg pod
#NAME READY STATUS RESTARTS AGE
#capi-kamaji-controller-manager-77df58dcdd-q7p4s 1/1 Running 4 (3d3h ago) 4d1h
#etcd-0 1/1 Running 0 61m
#etcd-1 1/1 Running 0 61m
#etcd-2 1/1 Running 0 61m
#kamaji-cfb8859cf-942bc 1/1 Running 0 8m8s
#migrate-bcp-cluster-test1-bcp-cluster-test1-tw4p4 0/1 Completed 0 6m43s
k logs migrate-bcp-cluster-test1-bcp-cluster-test1-tw4p4
#2023/08/28 12:51:11 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
#2023-08-28T12:51:11Z INFO generating the controller-runtime client
#I0828 12:51:12.465679 1 request.go:690] Waited for 1.037141623s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cilium.io/v2alpha1?timeout=32s
#2023-08-28T12:51:12Z INFO retrieving the TenantControlPlane
#2023-08-28T12:51:12Z INFO retrieving the TenantControlPlane used DataStore
#2023-08-28T12:51:12Z INFO retrieving the target DataStore
#2023-08-28T12:51:12Z INFO generating the origin storage connection
#2023-08-28T12:51:12Z INFO generating the target storage connection
#2023-08-28T12:51:12Z INFO migration from origin to target started
#2023-08-28T12:51:13Z INFO migration completed
However, the migration doesn't seem to have finished successfully:
kg -n bcp-cluster-test1 tcp
#NAME VERSION STATUS CONTROL-PLANE ENDPOINT KUBECONFIG DATASTORE AGE
#bcp-cluster-test1 v1.25.7 Migrating 10.123.1.31:6443 bcp-cluster-test1-admin-kubeconfig kamaji-etcd-2 46m
The kamaji logs are required to understand what's going on there.
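Something along these lines (reusing the commands already shown in this thread) should surface the relevant entries:
# Tail the manager logs and filter for the stuck tenant control plane:
kubectl -n kamaji-system logs deployments/kamaji --tail=200 | grep -i bcp-cluster-test1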
clusterctl describe cluster bcp-cluster-test2
#NAME READY SEVERITY REASON SINCE MESSAGE
#Cluster/bcp-cluster-test2 True 2m30s
#├─ClusterInfrastructure - VSphereCluster/bcp-cluster-test2 True 2m59s
#├─ControlPlane - KamajiControlPlane/bcp-cluster-test2
#└─Workers
# └─MachineDeployment/bcp-cluster-test2-md-0 True 65s
# └─Machine/bcp-cluster-test2-md-0-7b697b8758xxwdc5-czjch True 101s
kg tcp
#NAME VERSION STATUS CONTROL-PLANE ENDPOINT KUBECONFIG DATASTORE AGE
#bcp-cluster-test2 v1.25.7 Ready 10.123.1.32:6443 bcp-cluster-test2-admin-kubeconfig kamaji-etcd-2 3m49s
kubectl patch --type merge tcp bcp-cluster-test2 -p '{"spec": {"dataStore": "kamaji-etcd-4"}}'
(last lines get repeated from here)
kg pod -n kamaji-system
#NAME READY STATUS RESTARTS AGE
#capi-kamaji-controller-manager-77df58dcdd-q7p4s 1/1 Running 4 (3d4h ago) 4d1h
#etcd-0 1/1 Running 0 80m
#etcd-1 1/1 Running 0 80m
#etcd-2 1/1 Running 0 80m
#kamaji-cfb8859cf-942bc 1/1 Running 0 27m
#migrate-bcp-cluster-test1-bcp-cluster-test1-tw4p4 0/1 Completed 0 26m
#migrate-bcp-cluster-test2-bcp-cluster-test2-4klkd 0/1 Completed 0 3m49s
kg tcp
#NAME VERSION STATUS CONTROL-PLANE ENDPOINT KUBECONFIG DATASTORE AGE
#bcp-cluster-test2 v1.25.7 Migrating 10.123.1.32:6443 bcp-cluster-test2-admin-kubeconfig kamaji-etcd-2 11m
clusterctl describe cluster bcp-cluster-test2
#NAME READY SEVERITY REASON SINCE MESSAGE
#Cluster/bcp-cluster-test2 False Info WaitingForControlPlane 4m35s
#├─ClusterInfrastructure - VSphereCluster/bcp-cluster-test2 True 11m
#├─ControlPlane - KamajiControlPlane/bcp-cluster-test2
#└─Workers
# └─MachineDeployment/bcp-cluster-test2-md-0 True 9m27s
# └─Machine/bcp-cluster-test2-md-0-7b697b8758xxwdc5-czjch True 10m
Ok, now I get it. The issue here is that you're operating through Cluster API, which means the source of truth for the Control Plane is the KamajiControlPlane (KCP). You don't have to start the migration by editing the TCP definition, but rather the KCP.
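A hedged example of triggering the migration from the Cluster API side; the resource kind matches this thread, but the exact spec field name on the KamajiControlPlane is an assumption (check the KCP CRD shipped with your provider version):
# Patch the KamajiControlPlane instead of the TenantControlPlane
# (the dataStore field name is assumed to mirror the TCP's spec):
kubectl patch --type merge kamajicontrolplane bcp-cluster-test2 \
  -p '{"spec": {"dataStore": "kamaji-etcd-4"}}'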
Ah thanks!!! Now it worked, by editing the KamajiControlPlane. (We still have …
One last thing (I don't know if this is really an issue or maybe even desired behavior, but I wanted to let you know): before the first migration
kg node
#NAME STATUS ROLES AGE VERSION
#bcp-cluster-test2-md-0-7b697b8758xbj99n-tqrtz Ready <none> 3m34s v1.25.7
after migration & node rotation
kg node
#NAME STATUS ROLES AGE VERSION
#bcp-cluster-test2-md-0-7b697b8758xbj99n-k26r4 Ready <none> 2m39s v1.25.7
after migrating back
kg node
#NAME STATUS ROLES AGE VERSION
#bcp-cluster-test2-md-0-7b697b8758xbj99n-k26r4 Ready <none> 9m32s v1.25.7
#bcp-cluster-test2-md-0-7b697b8758xbj99n-tqrtz NotReady <none> 30m v1.25.7
Don't wanna be an RTFM dude, but 🙃 https://kamaji.clastix.io/guides/datastore-migration/#migrate-data
Hi,
We created an additional datastore of type etcd using your Helm chart clastix/kamaji-etcd.
We then tried to migrate a cluster from the old to the new datastore (both of type etcd), using this tutorial: https://kamaji.clastix.io/guides/datastore-migration/
This did not work as expected; the migration job gets an ImagePullBackOff:
It seems the calculation of the image version has an error:
kamaji:vhelm-v0.12.4
It looks like it tries to find the version (here v0.3.3) by looking into the Helm chart (here v0.12.4), but the result isn't the correct string.
We use kamaji v0.3.3 together with cluster-api v1.5.0, infrastructure-vsphere v1.7.0, and your kamaji control-plane provider control-plane-kamaji v0.3.0.