Merge pull request #1 from fibbs/hasupport_docs
add comments to values.yaml, documentation about HA to README.md.gotmpl
aeciopires authored Nov 18, 2024
2 parents 309d77f + 021b310 commit b3e0bf1
Showing 2 changed files with 210 additions and 144 deletions.
22 changes: 21 additions & 1 deletion charts/zabbix/README.md.gotmpl
@@ -142,9 +142,14 @@ helm uninstall zabbix -n monitoring

# Breaking changes of this helm chart

## Version 6.0.0

* New implementation of native Zabbix Server High Availability (see below)
* No breaking changes in values.yaml, but you might nevertheless want to review the new `zabbixServer.zabbixServerHA` section of your values.yaml

## Version 5.0.0

* PostgreSQL 16.x and Zabbix 7.x will be used.
* Adjusted `extraEnv` to support environment variables sourced from ConfigMaps and Secrets (see the sketch below). More info: #93
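
As a rough illustration, `extraEnv` entries follow the Kubernetes container `env` format, so values from ConfigMaps and Secrets can be referenced roughly like the sketch below; the variable and resource names are illustrative, and the exact placement of `extraEnv` in values.yaml should be checked against the chart:

```yaml
extraEnv:
  - name: ZBX_EXAMPLE_SETTING          # illustrative variable name
    valueFrom:
      configMapKeyRef:
        name: zabbix-extra-config      # illustrative ConfigMap name
        key: exampleKey
  - name: ZBX_EXAMPLE_SECRET
    valueFrom:
      secretKeyRef:
        name: zabbix-extra-secret      # illustrative Secret name
        key: exampleKey
```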

## Version 4.0.0
@@ -282,6 +287,21 @@ A database is required for Zabbix to work; in this Helm chart, we're using PostgreSQL as the database.
> We use a plain PostgreSQL database by default WITHOUT persistence. If you want persistence or
would like to use TimescaleDB instead, check the comments in the ``values.yaml`` file.

# Support of native Zabbix Server High Availability

Since version 6.0, Zabbix has its own implementation of [High Availability](https://www.zabbix.com/documentation/current/en/manual/concepts/server/ha), which is a simple approach to realize a hot-standby high availability setup with Zabbix Server. This feature applies only to the Zabbix Server component, not to Zabbix Proxy, Webdriver, Web Frontend, etc. In a Zabbix monitoring environment, by design, there can only be one central active Zabbix Server taking over the responsibility of storing data in the database, calculating triggers, sending alerts, and so on. The native High Availability concept does not change that; it just implements a way to have additional Zabbix Server processes on "standby", "jumping in" as soon as the active one no longer reports its availability (by updating a table in the database). As such, Zabbix Server High Availability works well together with (and, to be an entirely highly available setup, somewhat requires) an equally highly available database setup. High availability of the PostgreSQL database is not covered by this Helm chart, but can rather easily be achieved by using one of the well-known PostgreSQL database operators [PGO](https://github.com/CrunchyData/postgres-operator) and [CNPG](https://cloudnative-pg.io), both of which are supported for use with this Helm chart.
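
As a rough illustration, a minimal CNPG `Cluster` manifest like the following sketch could provide a highly available PostgreSQL backend; all names and sizes are illustrative, and connecting the chart to this database is configured separately in values.yaml:

```yaml
# Hypothetical sketch: a 3-instance CNPG PostgreSQL cluster. Assumes the
# CloudNativePG operator is already installed; names and sizes are illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: zabbix-db
  namespace: monitoring
spec:
  instances: 3        # one primary plus two replicas for hot standby
  storage:
    size: 10Gi
```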

For the HA feature, which was not designed for use in Kubernetes, to work in K8S, there were some challenges to overcome, primarily the fact that Zabbix Server does not allow upgrading or initializing the database schema while running with HA mode enabled. Zabbix's intended procedure is to turn HA mode off, perform the major release upgrade, and turn HA mode back on, which does not align with Kubernetes concepts. Besides that, some additional circumstances led us to the following implementation:

* a section has been added to values.yaml that generally switches "Zabbix Server HA" on or off (see the sketch after this list). If turned off, the Zabbix Server deployment will always be started with 1 replica and without the `ZBX_HANODENAME` environment variable. This is an easy-to-use setup with no additional job pods, but it is not possible to simply scale up Zabbix Server pods from here
* when `.Values.zabbixServer.zabbixServerHA.enabled` is set to `true`, a Kubernetes Job, marked as a Helm `post-install,post-upgrade` hook, is deployed together with a Role, a RoleBinding and a ServiceAccount, allowing this job pod to perform some changes via the Kubernetes API. The job runs after each installation and upgrade, scales down Zabbix Server pods if needed, manages database entries for active HA and non-HA server nodes connected to the database, etc. Additionally, this job figures out whether a migration from a non-HA setup to an HA-enabled one has taken place, and handles the necessary actions (scaling down pods, deleting entries in the database) accordingly
* the sidecar containers running together with the Zabbix Server pods have been updated to prevent Zabbix Server pods from starting not only when the database is unavailable, but also when the schema version of the database is not yet the correct one, adding an additional layer of protection against crashing pods
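
A minimal sketch of enabling the feature in values.yaml; only the `zabbixServer.zabbixServerHA.enabled` key is referenced above, and any other keys shown are hypothetical and should be checked against the chart's values.yaml:

```yaml
zabbixServer:
  zabbixServerHA:
    enabled: true    # confirmed key: switches the native HA integration on/off
  # replicaCount: 2  # hypothetical key name for the number of server pods;
  #                  # check the chart's values.yaml for the actual name
```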

Additionally, in order to make it possible to use **active checks** and **active Zabbix proxies** with a Zabbix Server setup that has High Availability enabled, an **HA labels sidecar** has been introduced. It continuously monitors the number of running Zabbix Server processes in the Zabbix Server pod to figure out whether the pod is the "active" or a "standby" Zabbix Server node, and updates HA-related labels on the pod accordingly.
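
These labels make it possible, for example, to route traffic to the currently active node only. Below is a sketch of a Service doing so, assuming hypothetical label keys and values written by the sidecar:

```yaml
# Hypothetical sketch: a Service selecting only the currently active server pod.
# The label keys/values are illustrative, not the chart's actual labels.
apiVersion: v1
kind: Service
metadata:
  name: zabbix-server-active
  namespace: monitoring
spec:
  selector:
    app: zabbix-server            # illustrative pod label
    zabbix.com/ha-role: active    # illustrative HA label written by the sidecar
  ports:
    - name: trapper
      port: 10051                 # Zabbix Server trapper port
      targetPort: 10051
```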

The reason to implement it this way, and not by probing the port number (which was my initial approach), is that probing the Zabbix Server port makes it generate a log message stating that a connection without a proper payload has been initiated towards the Zabbix Server.
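
To illustrate the idea (not the chart's actual implementation), a sidecar could periodically count the server processes and label the pod, roughly like this sketch; it assumes `shareProcessNamespace: true` on the pod so the sidecar can see the server's processes, RBAC allowing pod labeling, and all names are illustrative:

```yaml
# Hypothetical sketch of a process-counting HA labels sidecar container.
- name: ha-labels-sidecar
  image: bitnami/kubectl:latest    # illustrative image providing kubectl
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  command:
    - /bin/sh
    - -c
    - |
      while true; do
        # an active Zabbix Server forks many worker processes, a standby only one
        if [ "$(pgrep -c zabbix_server)" -gt 1 ]; then role=active; else role=standby; fi
        kubectl label pod "$POD_NAME" zabbix.com/ha-role="$role" --overwrite
        sleep 10
      done
```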


# Thanks

> **About the new home of helm chart**
