# Upgrading Rook Ceph

## Contents

- [Introduction](#introduction)
- [Steps](#steps)
  - [Step 1: Ensure `AUTOSCALE` is set to `on`](#step-1-ensure-autoscale-is-set-to-on)
  - [Step 2: Watch](#step-2-watch)
    - [Step 2.1: Ceph status](#step-21-ceph-status)
    - [Step 2.2: Pods in rook namespace](#step-22-pods-in-rook-namespace)
    - [Step 2.3: Rook version update](#step-23-rook-version-update)
    - [Step 2.4: Ceph version update](#step-24-ceph-version-update)
    - [Step 2.5: Events in rook namespace](#step-25-events-in-rook-namespace)
  - [Step 3: Dashboards](#step-3-dashboards)
    - [Step 3.1: Ceph](#step-31-ceph)
    - [Step 3.2: Grafana](#step-32-grafana)
  - [Step 4: Make a note of existing image versions](#step-4-make-a-note-of-existing-image-versions)
  - [Step 5: Perform updates](#step-5-perform-updates)
  - [Step 6: Verify that the CSI images are updated](#step-6-verify-that-the-csi-images-are-updated)
  - [Step 7: Final checks](#step-7-final-checks)
- [Additional resources](#additional-resources)

## Introduction

[Rook Ceph](../configuration-reference/components/rook.md) is one of the storage providers of
Lokomotive. With a distributed system as complex as Ceph, the upgrade process is not trivial. This
document lists the steps for performing the upgrade and explains how to monitor the process.

## Steps

The following steps are inspired by the [Rook](https://rook.io/docs/rook/master/ceph-upgrade.html) upgrade docs.

### Step 1: Ensure `AUTOSCALE` is set to `on`

Start a shell in the toolbox pod as specified in [this
doc](rook-ceph-storage.md#enable-and-access-toolbox) and run the following command:

```console
# ceph osd pool autoscale-status | grep replicapool
POOL         SIZE  TARGET SIZE  RATE  RAW CAPACITY  RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
replicapool    0                 3.0         3241G  0.0000                                 1.0      32              on
```

Ensure that the `AUTOSCALE` column says `on` and not `warn`. If it says `warn`, run the command
below to enable pool autoscaling. Autoscaling is required so that the placement groups scale up as
the amount of data in the cluster increases.

```bash
ceph osd pool set replicapool pg_autoscale_mode on
```

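After enabling it, you can re-run the status check to confirm the change took effect. A quick way to
isolate the `AUTOSCALE` column (the last field of the output above) is:

```console
# ceph osd pool autoscale-status | grep replicapool | awk '{print $NF}'
on
```
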
### Step 2: Watch

Watch events, updates and pods.

#### Step 2.1: Ceph status

Leave the following running in the toolbox pod:

```bash
watch ceph status
```

Ensure that the output says that `health:` is `HEALTH_OK`, and that the rest of the output matches
what is explained in the [rook upgrade
docs](https://rook.io/docs/rook/master/ceph-upgrade.html#status-output).

> **IMPORTANT**: Don't proceed further if the output is anything other than `HEALTH_OK`. During the
> ongoing upgrade and after completion, make sure that the output stays in the `HEALTH_OK` state. If
> the cluster is around 60% full, the output can sometimes turn into `HEALTH_WARN`.

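For reference, the first lines of `ceph status` on a healthy cluster look roughly like the snippet
below; `<cluster-id>` stands in for your cluster's UUID, and the daemon sections that follow will
differ per cluster:

```console
# ceph status
  cluster:
    id:     <cluster-id>
    health: HEALTH_OK
  ...
```
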
#### Step 2.2: Pods in rook namespace

In another terminal window, keep an eye on the `STATUS` field of the following output from the
`rook` namespace. Make sure that the pods restart promptly and don't go into the `CrashLoopBackOff`
state. Leave the following command running:

```bash
watch kubectl -n rook get pods -o wide
```

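If the pod list is long, it can help to additionally watch only the pods that are not in the
`Running` phase. A minimal sketch using a field selector (note that `Succeeded` pods from completed
jobs will show up here too):

```bash
watch kubectl -n rook get pods --field-selector=status.phase!=Running -o wide
```
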
#### Step 2.3: Rook version update

Run the following commands in a new terminal window to keep an eye on the Rook version as it is
updated across all the sub-components:

```bash
watch --exec kubectl -n rook get deployments -l rook_cluster=rook -o \
jsonpath='{range .items[*]}{.metadata.name}{" \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{" \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
```

```bash
watch --exec kubectl -n rook get jobs -o \
jsonpath='{range .items[*]}{.metadata.name}{" \tsucceeded: "}{.status.succeeded}{" \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
```

You should see the `rook-version` label slowly change to `v1.4.6`.

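This part of the upgrade is complete once every deployment reports the same version. One way to
check for stragglers is to reduce the labels to a unique list; when only a single line is printed,
all deployments have converged:

```bash
kubectl -n rook get deployments -l rook_cluster=rook -o \
jsonpath='{range .items[*]}{"rook-version="}{.metadata.labels.rook-version}{"\n"}{end}' | sort -u
```
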
#### Step 2.4: Ceph version update

Run the following command in a new terminal window to keep an eye on the Ceph version update as the
new pods come up:

```bash
watch --exec kubectl -n rook get deployments -l rook_cluster=rook -o \
jsonpath='{range .items[*]}{.metadata.name}{" \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{" \tceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}'
```

You should see the `ceph-version` label slowly change to `15.2.5`.

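You can also ask Ceph itself which versions its daemons are running. From the toolbox pod, `ceph
versions` prints a per-daemon breakdown; once the upgrade has finished, only the new version should
be listed:

```bash
ceph versions
```
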
#### Step 2.5: Events in rook namespace

In a new terminal window, leave the following command running to keep track of the events happening
in the `rook` namespace. Keep an eye on the `TYPE` column of the output, especially for events that
are not of type `Normal`:

```bash
kubectl -n rook get events -w
```

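To surface only the problematic events, you can filter on the event type with a field selector, for
example:

```bash
kubectl -n rook get events --field-selector type=Warning -w
```
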
### Step 3: Dashboards

Monitor various dashboards.

#### Step 3.1: Ceph

Open the Ceph dashboard in a browser window. Follow [these
docs](rook-ceph-storage.md#access-the-ceph-dashboard) to access the dashboard.

> **NOTE**: Accessing the dashboard can be a hassle, because while the components are upgrading you
> may lose access to it multiple times.

#### Step 3.2: Grafana

Gain access to the Grafana dashboard as instructed
[here](monitoring-with-prometheus-operator.md#access-grafana) and keep an eye on the dashboard named
`Ceph - Cluster`.

> **NOTE**: The data in the Grafana dashboard will always lag behind the `watch ceph status` output
> running inside the toolbox pod.

### Step 4: Make a note of existing image versions

Make a note of the images of the pods in the `rook` namespace:

```bash
kubectl -n rook get pod -o \
jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}'
```

After the upgrade is complete, you can compare this output against the output of the same command to
verify that the workloads now run the updated images.

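To make the later comparison easier, you can also save the output to a file; the path here is just
an example:

```bash
kubectl -n rook get pod -o \
jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}' \
> /tmp/rook-images-before.txt
```
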
### Step 5: Perform updates

With all the monitoring in place, start the update process by executing the following commands:

```bash
kubectl apply -f https://raw.githubusercontent.com/kinvolk/lokomotive/master/assets/charts/components/rook/templates/resources.yaml
lokoctl component apply rook rook-ceph
```

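If you want to block until the operator itself has rolled out before turning to the other
components, something along these lines should work. The deployment name `rook-ceph-operator` is the
Rook default and may differ on your cluster:

```bash
# Wait until the operator deployment finishes rolling out.
kubectl -n rook rollout status deployment/rook-ceph-operator
```
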
### Step 6: Verify that the CSI images are updated

Verify that the images were updated by comparing the following output with the output from [Step
4](#step-4-make-a-note-of-existing-image-versions):

```bash
kubectl -n rook get pod -o \
jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}'
```

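If you saved the pre-upgrade output to a file in Step 4, a diff makes the changed image tags easy to
spot. Keep in mind that pod name suffixes change across restarts, so expect some noise beyond the
image columns:

```bash
kubectl -n rook get pod -o \
jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}' \
> /tmp/rook-images-after.txt
diff /tmp/rook-images-before.txt /tmp/rook-images-after.txt
```
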
### Step 7: Final checks

Once everything is up to date, run the following command in the toolbox pod to verify that all the
OSDs are in the `up` state:

```bash
ceph osd status
```

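For a one-line summary, `ceph osd stat` reports the total number of OSDs along with how many are
`up` and `in`; both counts should match the total:

```bash
ceph osd stat
```
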
## Additional resources

- [Rook upgrade docs](https://rook.io/docs/rook/v1.4/ceph-upgrade.html)
- [General troubleshooting](https://rook.io/docs/rook/v1.5/common-issues.html)
- [Ceph troubleshooting](https://rook.io/docs/rook/v1.4/ceph-common-issues.html)