Monitoring #62

Open · intarga wants to merge 28 commits into trunk from monitoring
Conversation

@intarga (Member) commented Feb 5, 2025

TODO list:

This PR includes:

  • Metrics instrumentation of ingestion
  • node_exporter and postgres_exporter deployment
  • Improvements and bugfixes to related parts of our ansible playbooks
  • Tracing (logging) instrumentation of ingestion

Related changes outside this repo (see PRs linked in TODO list):

Things to note:

  • Unfortunately the Grafana instance IT maintains is read-only, meaning we can't easily edit or experiment with our dashboard. Making changes to the dashboard requires:

    • Exporting the dashboard JSON
    • Setting up your own local Grafana
    • Importing the dashboard
    • Making your changes
    • Exporting again
    • Opening a PR on the department's repo replacing the dashboard file with your newly exported JSON
    • Waiting 2 minutes for their yaml lint pipeline to run (which doesn't even check the things that should be checked in the dashboard)
    • Merging the PR
    • Waiting 15-20 minutes for the deployment to go live

    In my opinion this is an unacceptable edit loop for a visualisation dashboard. I'll leave some feedback for the people at IT who maintain it, but if they aren't receptive, we may eventually want to set up our own.

  • I wanted to include metrics/dashboards using pg_stat_statements, which would give us more detailed insights into query performance. Unfortunately, the stat_statements collector in the latest version of postgres_exporter is broken on Postgres 17. There's already a fix for it in their main branch, but as we don't know when the next release will be, I suggest we ignore this for now and come back to it when the new version is live. The configuration is already done on our end, so it would just require running a playbook to update postgres_exporter and making visualisations with the new metrics.

  • I was originally intending to turn on ingestion to test and refine the dashboard (Manuel turned it off in relation to his migration work), but now, especially as we don't have stat_statements yet, I think this can wait and be bundled into my upcoming benchmarking work, as that will stress the dashboard anyway.

@intarga force-pushed the monitoring branch 4 times, most recently from 3f2df96 to 8c8dce7 on February 12, 2025 at 12:58
@intarga force-pushed the monitoring branch 2 times, most recently from f9cc4ec to a02184b on February 18, 2025 at 16:07
The previous version had a footgun: rules removed from the ansible vars would not be removed by ansible from ostack. This version also allows our vars to match the structure expected by the ostack collection.
These both need to always run, because GitHub doesn't allow you to put conditions on checks required for branch protection.
@intarga marked this pull request as ready for review on February 20, 2025 at 13:48
@Lun4m (Collaborator) commented Feb 21, 2025

  • Waiting 15-20 minutes for the deployment to go live

In my opinion this is an unacceptable edit loop for a visualisation dashboard. I'll leave some feedback for the people at IT who maintain it, but if they aren't receptive, we may eventually want to set up our own.

20 minutes for changing some graphs sounds rough, it's probably faster to do it by hand with the GUI.

I wanted to include metrics/dashboards using pg_stat_statements, which would give us more detailed insights into query performance. Unfortunately, the stat_statements collector in the latest version of postgres_exporter is broken on Postgres 17. There's already a fix for it in their main branch, but as we don't know when the next release will be, I suggest we ignore this for now and come back to it when the new version is live. The configuration is already done on our end, so it would just require running a playbook to update postgres_exporter and making visualisations with the new metrics.

It seems like they are merging it soon (today even 🎉 )

I was originally intending to turn on ingestion to test and refine the dashboard (Manuel turned it off in relation to his migration work), but now, especially as we don't have stat_statements yet, I think this can wait and be bundled into my upcoming benchmarking work, as that will stress the dashboard anyway.

That makes sense, plus we probably also need to merge the other two open draft PRs before that.

@Lun4m (Collaborator) left a comment


Nice, it's so satisfying to see that the dashboard is running!

ansible.builtin.replace:
  dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
  regexp: '(data_directory\s=\s)\S+'
  replace: "\\1'{{ pg_dir }}'"
become: true
when: not pg_data_directory_set

Is this necessary? Isn't the task idempotent?

Comment on lines 85 to +88
regexp: '#(listen_addresses\s=\s)\S+'
replace: "\\1'*'"
become: true
when: not pg_listen_addresses_set

Here too, but in this case it probably should be '#?(listen_addresses\s=\s)\S+' since it's commented out by default

ansible.builtin.replace:
  dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
  regexp: '(data_directory\s=\s)\S+'
  replace: "\\1'{{ pg_dir }}'"
become: true
when: not pg_data_directory_set
notify:
  - Rsync postgres directory to ssd mount

Does this need to run after the change in postgresql.conf? We are syncing /var/lib/postgresql, not /etc/postgresql?

Comment on lines +111 to +115
notify: Restart postgres service

# Make sure any configuration changes are applied before moving on
- name: Flush handlers
  ansible.builtin.meta: flush_handlers

Do we need to restart after every change? We are simply editing text files? Same question for repmgr.yml and create_primary.yml (since postgresql_set should be saving the changes to a config file that is loaded when postgres is re/started).

Comment on lines +16 to +35
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true

- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded

- name: Configure repmgr
  ansible.builtin.import_tasks: configure/repmgr.yml
  when: not pg_is_replica

I guess I didn't test if running the repmgr stuff only on the primary would work 😅 But then why can't we simply move the Configure repmgr task inside the Create primary block, instead of using the pg_is_replica variable (it seems like it should do the same thing as pg_primary_ip)?
Aaaah, now I get it, this will run for both nodes the first time, and only for the primary on successive runs!
But I was just looking again at the repmgr docs:

On the standby, do not create a PostgreSQL instance (i.e. do not execute initdb or any database creation scripts provided by packages), but do ensure the destination data directory (and any other directories which you want PostgreSQL to use) exist and are owned by the postgres system user. Permissions must be set to 0700 (drwx------).

😮‍💨 maybe we don't even need to run the Create repmgr user and Create repmgr database on the standby

- name: Change postgres data directory
# we use replace here instead of postgresql_set because postgres is not running

Yes, but most importantly because the data_directory and listen_address cannot be set while postgres is running (I think?)

Comment on lines +16 to +31
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true

- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded
@Lun4m (Collaborator) commented Feb 21, 2025


Two suggestions:

  1. Add default here so we don't need to have it in configure.yml
Suggested change
 - name: Set replication status fact
   ansible.builtin.set_fact:
-    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
-  when: pg_stat_wal_receiver is succeeded
+    pg_is_replica: "{{ (pg_stat_wal_receiver.rowcount | default(0)) == 1 }}"
  2. Can we simply stat the repmgr.conf file instead?
Suggested change
-# Needed to disable certain tasks, as replicas are read-only and can't run them
-- name: Discover replication status
-  community.postgresql.postgresql_query:
-    db: "lard"
-    login_host: "localhost"
-    login_user: "lard_user"
-    login_password: "{{ pg_lard_password }}"
-    query: SELECT * FROM pg_stat_wal_receiver
-  register: pg_stat_wal_receiver
-  # This will fail on first run, as the lard user and db are not yet created. In this case
-  # the following task will not run, leaving is_replica as false
-  ignore_errors: true
-- name: Set replication status fact
-  ansible.builtin.set_fact:
-    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
-  when: pg_stat_wal_receiver is succeeded
+- name: Stat replication conf file
+  ansible.builtin.stat:
+    path: /etc/repmgr.conf
+  register: replication_conf
+- name: Set replication status fact
+  ansible.builtin.set_fact:
+    replication_is_setup: "{{ replication_conf.stat.exists }}"

Comment on lines -152 to -158

# make sure these changes take effect
- name: Restart postgres service
  ansible.builtin.systemd_service:
    name: postgresql
    state: restarted
  become: true

If we decide to keep the handlers, do we need to flush them here?

Comment on lines +444 to +452
let method = req.method().clone();

let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();

let labels = [
("method", method.to_string()),

Suggested change
-let method = req.method().clone();
+let method = req.method().to_string();
 let response = next.run(req).await;
 let latency = start.elapsed().as_secs_f64();
 let status = response.status().as_u16().to_string();
 let labels = [
-    ("method", method.to_string()),
+    ("method", method),

Comment on lines +91 to +97
let _ = metrics::histogram!("http_requests_duration_seconds");
let _ = metrics::counter!("kldata_messages_received");
let _ = metrics::counter!("kldata_failures");
let _ = metrics::counter!("kafka_messages_received");
let _ = metrics::counter!("kafka_failures");
let _ = metrics::counter!("scalar_datapoints");
let _ = metrics::counter!("nonscalar_datapoints");
@Lun4m (Collaborator) commented Feb 21, 2025


Should we have a metrics struct that registers these metrics as fields and is passed into the ingestor and kafka tasks (and further down where it's needed)? So we don't have to worry about typos?
Edit: do they have to be defined in main? Otherwise we could have local metric variables/structs for each sub-task
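
For illustration, a minimal sketch of the struct idea, assuming the handle-returning macros of the metrics crate that the PR already uses; the IngestionMetrics name and field layout are hypothetical:

use metrics::{counter, histogram, Counter, Histogram};

// Hypothetical container for the metric handles: built once, then passed to the
// ingestor and kafka tasks so each metric name is spelled out in exactly one place.
pub struct IngestionMetrics {
    pub request_duration: Histogram,
    pub kldata_messages_received: Counter,
    pub kldata_failures: Counter,
    pub kafka_messages_received: Counter,
    pub kafka_failures: Counter,
    pub scalar_datapoints: Counter,
    pub nonscalar_datapoints: Counter,
}

impl IngestionMetrics {
    pub fn new() -> Self {
        Self {
            request_duration: histogram!("http_requests_duration_seconds"),
            kldata_messages_received: counter!("kldata_messages_received"),
            kldata_failures: counter!("kldata_failures"),
            kafka_messages_received: counter!("kafka_messages_received"),
            kafka_failures: counter!("kafka_failures"),
            scalar_datapoints: counter!("scalar_datapoints"),
            nonscalar_datapoints: counter!("nonscalar_datapoints"),
        }
    }
}

// Usage: metrics.kafka_failures.increment(1) instead of repeating the string.

One caveat: the handles would have to be created after the exporter/recorder is installed (otherwise they end up as permanent no-ops), so new() would still need to be called from main or wherever the recorder is set up.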

@Lun4m linked an issue Feb 21, 2025 that may be closed by this pull request
.expect("could not commit offset in consumer"); // ensure we keep offset
}
Err(e) => {
metrics::counter!("kafka_failures").increment(1);

Idea: count the specific types of kafka failures separately ("kafka_failure:this", "kafka_failure:that", ...), and possibly also keep the current counter ("kafka_failures") as a summary, so you look at that first and only go on to the graph for the specific type when there's activity there. Pro: easier pinpointing of the problem. Con: dashboard clutter.
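
A hedged sketch of that idea, using a label on a single counter rather than colon-separated metric names (colons in metric names are conventionally reserved for recording rules in Prometheus); the kafka_failures_by_stage name and the "stage" values are assumptions, not taken from this PR:

// Hypothetical helper: bump a per-type series and the existing summary counter together.
fn record_kafka_failure(stage: &'static str) {
    // One labelled series per failure type makes pinpointing easier...
    metrics::counter!("kafka_failures_by_stage", "stage" => stage).increment(1);
    // ...while the unlabelled summary stays as the thing to glance at first on the dashboard.
    metrics::counter!("kafka_failures").increment(1);
}

In the error arms this would replace the bare kafka_failures increments, e.g. record_kafka_failure("consume") around consume_messageset and record_kafka_failure("parse") where message parsing fails.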

}
}
if let Err(e) = consumer.consume_messageset(msgset) {
eprintln!("{}", e);
metrics::counter!("kafka_failures").increment(1);
error!("{}", e);

Should "{}" be replaced with e.g. "consumer.consume_messageset(msgset): {}" for easier identification of where the error occurred? (ditto for other "anonymous" occurrences of error!).


let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();

I suggest renaming latency to e.g. duration to reflect that the value includes the time spent processing the request, rather than just the time from initiation to onset.

.expect("Failed to set up metrics exporter");

// Register metrics so they're guaranteed to show in exporter output
let _ = metrics::histogram!("http_requests_duration_seconds");
@jo-asplin-met-no (Collaborator) commented Feb 24, 2025


I suggest declaring metric names as string constants and replacing existing occurrences with those constants. This reduces the risk of accidental typos slipping through compilation undetected.
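
A minimal sketch of the constants approach; the constant identifiers and the register_metrics name are hypothetical, while the string values are metric names already used in this PR:

// Single source of truth for metric names: a typo at a use site now fails to compile
// instead of silently creating a separate metric.
pub const HTTP_REQUESTS_DURATION_SECONDS: &str = "http_requests_duration_seconds";
pub const KAFKA_FAILURES: &str = "kafka_failures";

fn register_metrics() {
    // Registration and increment sites share the same identifiers.
    let _ = metrics::histogram!(HTTP_REQUESTS_DURATION_SECONDS);
    let _ = metrics::counter!(KAFKA_FAILURES);
}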


Successfully merging this pull request may close these issues: Monitoring
3 participants