CRDB-45670: helm: automate the statefulset update involving new PVCs
pritesh-lahoti committed Jan 16, 2025
1 parent af33dad commit 71e62ee
Showing 3 changed files with 69 additions and 38 deletions.
22 changes: 3 additions & 19 deletions build/templates/README.md
@@ -203,26 +203,10 @@ $ helm upgrade my-release cockroachdb/cockroachdb \

 Kubernetes will carry out a safe [rolling upgrade](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets) of your CockroachDB nodes one-by-one.

-However, the upgrade will fail if it involves adding new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.). In such cases, kindly repeat the following steps for each pod:
-1. Delete the statefulset
-```shell
-$ kubectl delete sts my-release-cockroachdb --cascade=orphan
-```
-The statefulset name can be found by running `kubectl get sts`. Note the `--cascade=orphan` flag used to prevent the deletion of pods.
-
-2. Delete the pod
-```shell
-$ kubectl delete pod my-release-cockroachdb-<pod_number>
-```
-
-3. Upgrade Helm chart
-```shell
-$ helm upgrade my-release cockroachdb/cockroachdb
-```
-Kindly update the values.yaml file or provide the necessary flags to the `helm upgrade` command. This step will recreate the pod with the new PVCs.
+However, the upgrade will fail if it involves adding new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.).
+In such cases, kindly run the `scripts/upgrade_with_new_pvc.sh` script to upgrade the cluster.

-Note that the above steps need to be repeated for each pod in the CockroachDB cluster. This will ensure that the cluster is upgraded without any downtime.
-Given the manual process involved, it is likely to cause network churn as cockroachdb will try to rebalance data across the other nodes. We are working on an automated solution to handle this scenario.
+`./scripts/upgrade_with_new_pvc.sh -h` can be used for generating help on how to run the script.

 Monitor the cluster's pods until all have been successfully restarted:

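As a rough illustration of what the README change implies, the sketch below lists the pods that the script replaces one at a time. The `pods_to_replace` helper is hypothetical and not part of the repository; it only mirrors the `<sts_name>-<ordinal>` naming that the script uses when deleting pods, and the values are the examples from the script's help text.

```shell
# Hypothetical helper (not part of the repository): print the pod names that
# the script deletes and recreates, one per iteration, in ordinal order.
pods_to_replace() {
  local sts_name="$1" num_replicas="$2" i=0
  while [ "$i" -lt "$num_replicas" ]; do
    echo "${sts_name}-${i}"
    i=$((i + 1))
  done
}

# Values below are the examples from the script's help text.
pods_to_replace my-release-cockroachdb 3
```

The matching upgrade would then be started from the repository root with `./scripts/upgrade_with_new_pvc.sh my-release 15.0.0 default my-release-cockroachdb 3`, which is the example given in the script's help output.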
22 changes: 3 additions & 19 deletions cockroachdb/README.md
@@ -204,26 +204,10 @@ $ helm upgrade my-release cockroachdb/cockroachdb \

 Kubernetes will carry out a safe [rolling upgrade](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets) of your CockroachDB nodes one-by-one.

-However, the upgrade will fail if it involves adding new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.). In such cases, kindly repeat the following steps for each pod:
-1. Delete the statefulset
-```shell
-$ kubectl delete sts my-release-cockroachdb --cascade=orphan
-```
-The statefulset name can be found by running `kubectl get sts`. Note the `--cascade=orphan` flag used to prevent the deletion of pods.
-
-2. Delete the pod
-```shell
-$ kubectl delete pod my-release-cockroachdb-<pod_number>
-```
-
-3. Upgrade Helm chart
-```shell
-$ helm upgrade my-release cockroachdb/cockroachdb
-```
-Kindly update the values.yaml file or provide the necessary flags to the `helm upgrade` command. This step will recreate the pod with the new PVCs.
+However, the upgrade will fail if it involves adding new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.).
+In such cases, kindly run the `scripts/upgrade_with_new_pvc.sh` script to upgrade the cluster.

-Note that the above steps need to be repeated for each pod in the CockroachDB cluster. This will ensure that the cluster is upgraded without any downtime.
-Given the manual process involved, it is likely to cause network churn as cockroachdb will try to rebalance data across the other nodes. We are working on an automated solution to handle this scenario.
+`./scripts/upgrade_with_new_pvc.sh -h` can be used for generating help on how to run the script.

 Monitor the cluster's pods until all have been successfully restarted:

63 changes: 63 additions & 0 deletions scripts/upgrade_with_new_pvc.sh
@@ -0,0 +1,63 @@
#!/bin/bash

Help()
{
  # Display Help
  echo "This script performs Helm upgrade involving new PVCs. Kindly run it from the root of the repository."
  echo
  echo "usage: ./scripts/upgrade_with_new_pvc.sh <release_name> <chart_version> <namespace> <sts_name> <num_replicas> [kubeconfig]"
  echo
  echo "options:"
  echo "release_name: Helm release name, e.g. my-release"
  echo "chart_version: Helm chart version to upgrade to, e.g. 15.0.0"
  echo "namespace: Kubernetes namespace, e.g. default"
  echo "sts_name: Statefulset name, e.g. my-release-cockroachdb"
  echo "num_replicas: Number of replicas in the statefulset, e.g. 3"
  echo "kubeconfig (optional): Path to the kubeconfig file. Default is $HOME/.kube/config."
  echo
  echo "example: ./scripts/upgrade_with_new_pvc.sh my-release 15.0.0 default my-release-cockroachdb 3"
  echo
}

while getopts ":h" option; do
  case $option in
    h) # display Help
      Help
      exit 0;;
    \?) # incorrect option: report it on stderr and exit with a non-zero status
      echo "Error: Invalid option -$OPTARG" >&2
      exit 1;;
  esac
done

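# Refuse to run with missing arguments; empty values would otherwise reach kubectl and helm.
if [ "$#" -lt 5 ]; then
  echo "Error: expected at least 5 arguments, got $#" >&2
  Help
  exit 1
fi

release_name=$1
chart_version=$2
namespace=$3
sts_name=$4
num_replicas=$5
kubeconfig=${6:-$HOME/.kube/config}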

# For each replica, do the following:
# 1. Delete the statefulset
# 2. Delete the pod replica
# 3. Upgrade the Helm chart

for i in $(seq 0 $((num_replicas-1))); do
  echo "========== Iteration $((i+1)) =========="

  echo "$((i+1)). Deleting sts"
  kubectl --kubeconfig="$kubeconfig" -n "$namespace" delete statefulset "$sts_name" --cascade=orphan --wait=true

  echo "$((i+1)). Deleting replica"
  kubectl --kubeconfig="$kubeconfig" -n "$namespace" delete pod "$sts_name-$i" --wait=true

  echo "$((i+1)). Upgrading Helm"
  # The "--wait" flag ensures the deleted pod replica and STS are up and running.
  # However, at times, the STS fails to understand that all replicas are running and the upgrade is stuck.
  # The "--timeout 1m" helps with short-circuiting the upgrade process. Even if the upgrade does time out, it is
  # harmless and the last upgrade process will be successful once all the pod replicas have been updated.
  helm upgrade "$release_name" ./cockroachdb --kubeconfig="$kubeconfig" --namespace "$namespace" --version "$chart_version" --wait --timeout 1m --debug

  echo "Iteration $((i+1)) complete. Kindly validate the changes before proceeding."
  read -r -p "Press enter to continue ..."
done
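
The script leaves validation between iterations to the operator. One possible check, sketched below, is to wait until the recreated pod reports Ready before pressing enter. The `wait_for_pod` function, the `KUBECTL` override and the 300s timeout are assumptions made for illustration and are not part of this commit. A Ready pod does not by itself confirm that the new PVCs are bound, and the sketch uses the current kubeconfig context rather than the script's `kubeconfig` argument.

```shell
# Sketch: block until the replaced pod reports Ready.
# KUBECTL is a hypothetical override so the command can be previewed
# with KUBECTL=echo instead of being sent to a cluster.
wait_for_pod() {
  local namespace="$1" sts_name="$2" ordinal="$3"
  ${KUBECTL:-kubectl} wait --namespace "$namespace" --for=condition=Ready \
    "pod/${sts_name}-${ordinal}" --timeout=300s
}

# Preview the command for the first replica without contacting a cluster.
KUBECTL=echo wait_for_pod default my-release-cockroachdb 0
```

Dropping the `KUBECTL=echo` prefix runs the real `kubectl wait`, which exits with a non-zero status if the pod is not Ready within the timeout.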
