CRDB-45670: helm: automate the statefulset update involving new PVCs
pritesh-lahoti committed Jan 16, 2025
1 parent af33dad commit 6444bc3
Showing 3 changed files with 73 additions and 38 deletions.
22 changes: 3 additions & 19 deletions build/templates/README.md
@@ -203,26 +203,10 @@ $ helm upgrade my-release cockroachdb/cockroachdb \

Kubernetes will carry out a safe [rolling upgrade](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets) of your CockroachDB nodes one-by-one.

However, the upgrade will fail if it involves adding a new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.). In such cases, repeat the following steps for each pod:
1. Delete the statefulset
```shell
$ kubectl delete sts my-release-cockroachdb --cascade=orphan
```
The statefulset name can be found by running `kubectl get sts`. Note that the `--cascade=orphan` flag prevents the deletion of the pods.

2. Delete the pod
```shell
$ kubectl delete pod my-release-cockroachdb-<pod_number>
```

3. Upgrade Helm chart
```shell
$ helm upgrade my-release cockroachdb/cockroachdb
```
Update the values.yaml file or provide the necessary flags to the `helm upgrade` command. This step recreates the pod with the new PVCs.
However, the upgrade will fail if it involves adding a new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.).
In such cases, run the `scripts/upgrade_with_new_pvc.sh` script to upgrade the cluster.

Note that the above steps need to be repeated for each pod in the CockroachDB cluster. This ensures that the cluster is upgraded without any downtime.
Given the manual process involved, it is likely to cause network churn as CockroachDB tries to rebalance data across the other nodes. We are working on an automated solution to handle this scenario.
Run `./scripts/upgrade_with_new_pvc.sh -h` for help on how to run the script.
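For reference, an invocation using the sample values from the script's built-in help might look like the following (substitute your own release name, chart, chart version, values file, namespace, statefulset name, and replica count):
```shell
$ ./scripts/upgrade_with_new_pvc.sh my-release cockroachdb/cockroachdb 15.0.0 ./cockroachdb/values.yaml default my-release-cockroachdb 3
```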

Monitor the cluster's pods until all have been successfully restarted:

22 changes: 3 additions & 19 deletions cockroachdb/README.md
@@ -204,26 +204,10 @@ $ helm upgrade my-release cockroachdb/cockroachdb \

Kubernetes will carry out a safe [rolling upgrade](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets) of your CockroachDB nodes one-by-one.

However, the upgrade will fail if it involves adding a new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.). In such cases, repeat the following steps for each pod:
1. Delete the statefulset
```shell
$ kubectl delete sts my-release-cockroachdb --cascade=orphan
```
The statefulset name can be found by running `kubectl get sts`. Note that the `--cascade=orphan` flag prevents the deletion of the pods.

2. Delete the pod
```shell
$ kubectl delete pod my-release-cockroachdb-<pod_number>
```

3. Upgrade Helm chart
```shell
$ helm upgrade my-release cockroachdb/cockroachdb
```
Update the values.yaml file or provide the necessary flags to the `helm upgrade` command. This step recreates the pod with the new PVCs.
However, the upgrade will fail if it involves adding a new Persistent Volume Claim (PVC) to the existing pods (e.g. enabling WAL Failover, pushing logs to a separate volume, etc.).
In such cases, run the `scripts/upgrade_with_new_pvc.sh` script to upgrade the cluster.

Note that the above steps need to be repeated for each pod in the CockroachDB cluster. This ensures that the cluster is upgraded without any downtime.
Given the manual process involved, it is likely to cause network churn as CockroachDB tries to rebalance data across the other nodes. We are working on an automated solution to handle this scenario.
Run `./scripts/upgrade_with_new_pvc.sh -h` for help on how to run the script.
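For reference, an invocation using the sample values from the script's built-in help might look like the following (substitute your own release name, chart, chart version, values file, namespace, statefulset name, and replica count):
```shell
$ ./scripts/upgrade_with_new_pvc.sh my-release cockroachdb/cockroachdb 15.0.0 ./cockroachdb/values.yaml default my-release-cockroachdb 3
```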

Monitor the cluster's pods until all have been successfully restarted:

67 changes: 67 additions & 0 deletions scripts/upgrade_with_new_pvc.sh
@@ -0,0 +1,67 @@
#!/bin/bash

Help()
{
# Display Help
echo "This script performs Helm upgrade involving new PVCs. Kindly run it from the root of the repository."
echo
echo "usage: ./scripts/upgrade_with_new_pvc.sh <release_name> <chart_version> <namespace> <sts_name> <num_replicas> [kubeconfig]"
echo
echo "options:"
echo "release_name: Helm release name, e.g. my-release"
echo "chart: Helm chart to use (either referenced locally, or to the Helm repository), e.g. cockroachdb/cockroachdb"
echo "chart_version: Helm chart version to upgrade to, e.g. 15.0.0"
echo "values_file: Path to the values file, e.g. ./cockroachdb/values.yaml"
echo "namespace: Kubernetes namespace, e.g. default"
echo "sts_name: Statefulset name, e.g. my-release-cockroachdb"
echo "num_replicas: Number of replicas in the statefulset, e.g. 3"
echo "kubeconfig (optional): Path to the kubeconfig file. Default is $HOME/.kube/config."
echo
echo "example: ./scripts/upgrade_with_new_pvc.sh my-release cockroachdb/cockroachdb 15.0.0 ./cockroachdb/values.yaml default my-release-cockroachdb 3"
echo
}

while getopts ":h" option; do
case $option in
h) # display Help
Help
exit;;
\?) # incorrect option
echo "Error: Invalid option"
exit;;
esac
done

release_name=$1
chart=$2
chart_version=$3
values_file=$4
namespace=$5
sts_name=$6
num_replicas=$7
kubeconfig=${8:-$HOME/.kube/config}

# For each replica, do the following:
# 1. Delete the statefulset
# 2. Delete the pod replica
# 3. Upgrade the Helm chart

for i in $(seq 0 $((num_replicas-1))); do
echo "========== Iteration $((i+1)) =========="

echo "$((i+1)). Deleting sts"
kubectl --kubeconfig="$kubeconfig" -n "$namespace" delete statefulset "$sts_name" --cascade=orphan --wait=true

echo "$((i+1)). Deleting replica"
kubectl --kubeconfig="$kubeconfig" -n "$namespace" delete pod "$sts_name-$i" --wait=true

echo "$((i+1)). Upgrading Helm"
# The "--wait" flag ensures the deleted pod replica and STS are up and running.
# However, at times, the STS fails to understand that all replicas are running and the upgrade is stuck.
# The "--timeout 1m" helps with short-circuiting the upgrade process. Even if the upgrade does time out, it is
# harmless and the last upgrade process will be successful once all the pods replicas have been updated.
helm upgrade -f "$values_file" "$release_name" "$chart" --kubeconfig="$kubeconfig" --namespace "$namespace" --version "$chart_version" --wait --timeout 1m --debug

echo "Iteration $((i+1)) complete. Kindly validate the changes before proceeding."
read -p "Press enter to continue ..."
done
