Monitoring #62

Open · intarga wants to merge 28 commits into trunk from monitoring
Conversation

@intarga (Member) commented Feb 5, 2025

TODO list:

This PR includes:

  • Metrics instrumentation of ingestion
  • node_exporter and postgres_exporter deployment
  • Improvements and bugfixes to related parts of our ansible playbooks
  • Tracing (logging) instrumentation of ingestion

Related changes outside this repo (see PRs linked in TODO list):

Things to note:

  • Unfortunately the Grafana instance IT maintains is read-only, meaning we can't easily edit or experiment with our dashboard. Making changes to the dashboard requires:

    • Exporting the dashboard JSON
    • Setting up your own local Grafana
    • Importing the dashboard
    • Making your changes
    • Exporting again
    • Opening a PR on the department's repo replacing the dashboard file with your newly exported JSON
    • Waiting 2 minutes for their yaml lint pipeline to run (which doesn't even check the things that should be checked in the dashboard)
    • Merging the PR
    • Waiting 15-20 minutes for the deployment to go live

    In my opinion this is an unacceptable edit loop for a visualisation dashboard. I'll leave some feedback for the people at IT who maintain it, but if they aren't receptive, we may eventually want to set up our own.

  • I wanted to include metrics/dashboards using pg_stat_statements, which would give us more detailed insights into query performance. Unfortunately, the stat_statements collector in the latest version of postgres_exporter is broken on Postgres 17. There's already a fix for it in their main branch, but as we don't know when the next release will be, I suggest we ignore this for now and come back to it when the new version is live. The configuration is already done on our end, so it would just require running a playbook to update postgres_exporter and making visualisations with the new metrics.

  • I was originally intending to turn on ingestion to test and refine the dashboard (Manuel turned it off in relation to his migration work), but now, especially as we don't have stat_statements yet, I think this can wait and be bundled into my upcoming benchmarking work, as that will stress the dashboard anyway.

@intarga force-pushed the monitoring branch 4 times, most recently from 3f2df96 to 8c8dce7 on February 12, 2025 at 12:58
@intarga force-pushed the monitoring branch 2 times, most recently from f9cc4ec to a02184b on February 18, 2025 at 16:07
The previous version had a footgun: rules removed from the ansible vars would not be removed by ansible from ostack. This version also allows our vars to match the structure expected by the ostack collection.
These both need to always run, because GitHub doesn't allow you to put conditions on checks required for branch protection.
@intarga marked this pull request as ready for review on February 20, 2025 at 13:48
@Lun4m (Collaborator) commented Feb 21, 2025

  • Waiting 15-20 minutes for the deployment to go live

In my opinion this is an unacceptable edit loop for a visualisation dashboard. I'll leave some feedback for the people at IT who maintain it, but if they aren't receptive, we may eventually want to set up our own.

20 minutes for changing some graphs sounds rough, it's probably faster to do it by hand with the GUI.

I wanted to include metrics/dashboards using pg_stat_statements, which would give us more detailed insights into query performance. Unfortunately, the stat_statements collector in the latest version of postgres_exporter is broken on Postgres 17. There's already a fix for it in their main branch, but as we don't know when the next release will be, I suggest we ignore this for now and come back to it when the new version is live. The configuration is already done on our end, so it would just require running a playbook to update postgres_exporter and making visualisations with the new metrics.

It seems like they are merging it soon (today even 🎉 )

I was originally intending to turn on ingestion to test and refine the dashboard (Manuel turned it off in relation to his migration work), but now, especially as we don't have stat_statements yet, I think this can wait and be bundled into my upcoming benchmarking work, as that will stress the dashboard anyway.

That makes sense, plus we probably also need to merge the other two open draft PRs before that.

@Lun4m (Collaborator) left a comment


Nice, it's so satisfying to see that the dashboard is running!

ansible.builtin.replace:
  dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
  regexp: '(data_directory\s=\s)\S+'
  replace: "\\1'{{ pg_dir }}'"
become: true
when: not pg_data_directory_set

Is this necessary? Isn't the task idempotent?

Comment on lines 85 to +88
regexp: '#(listen_addresses\s=\s)\S+'
replace: "\\1'*'"
become: true
when: not pg_listen_addresses_set

Here too, but in this case it probably should be '#?(listen_addresses\s=\s)\S+' since it's commented out by default

ansible.builtin.replace:
  dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
  regexp: '(data_directory\s=\s)\S+'
  replace: "\\1'{{ pg_dir }}'"
become: true
when: not pg_data_directory_set
notify:
  - Rsync postgres directory to ssd mount

Does this need to run after the change in postgresql.conf? We are syncing /var/lib/postgresql, not /etc/postgresql?

Comment on lines +111 to +115
notify: Restart postgres service

# Make sure any configuration changes are applied before moving on
- name: Flush handlers
  ansible.builtin.meta: flush_handlers

Do we need to restart after every change? We are simply editing text files? Same question for repmgr.yml and create_primary.yml (since postgresql_set should be saving the changes to a config file that is loaded when postgres is re/started).

Comment on lines +16 to +35
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true

- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded

- name: Configure repmgr
  ansible.builtin.import_tasks: configure/repmgr.yml
  when: not pg_is_replica

I guess I didn't test if running the repmgr stuff only on the primary would work 😅 But then why can't we simply move the Configure repmgr task inside the Create primary block, instead of using the pg_is_replica variable (it seems like it should do the same thing as pg_primary_ip)?
Aaaah, now I get it, this will run for both nodes the first time, and only for the primary on successive runs!
But I was just looking again at the repmgr docs:

On the standby, do not create a PostgreSQL instance (i.e. do not execute initdb or any database creation scripts provided by packages), but do ensure the destination data directory (and any other directories which you want PostgreSQL to use) exist and are owned by the postgres system user. Permissions must be set to 0700 (drwx------).

😮‍💨 maybe we don't even need to run the Create repmgr user and Create repmgr database on the standby

- name: Change postgres data directory
# we use replace here instead of postgresql_set because postgres is not running

Yes, but most importantly because the data_directory and listen_address cannot be set while postgres is running (I think?)

Comment on lines +16 to +31
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true

- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded
@Lun4m (Collaborator) commented Feb 21, 2025


Two suggestions:

  1. Add default here so we don't need to have it in configure.yml
Suggested change
 - name: Set replication status fact
   ansible.builtin.set_fact:
-    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
-  when: pg_stat_wal_receiver is succeeded
+    pg_is_replica: "{{ (pg_stat_wal_receiver.rowcount | default(0)) == 1 }}"
  2. Can we simply stat the repmgr.conf file instead?
Suggested change
-# Needed to disable certain tasks, as replicas are read-only and can't run them
-- name: Discover replication status
-  community.postgresql.postgresql_query:
-    db: "lard"
-    login_host: "localhost"
-    login_user: "lard_user"
-    login_password: "{{ pg_lard_password }}"
-    query: SELECT * FROM pg_stat_wal_receiver
-  register: pg_stat_wal_receiver
-  # This will fail on first run, as the lard user and db are not yet created. In this case
-  # the following task will not run, leaving is_replica as false
-  ignore_errors: true
-- name: Set replication status fact
-  ansible.builtin.set_fact:
-    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
-  when: pg_stat_wal_receiver is succeeded
+- name: Stat replication conf file
+  ansible.builtin.stat:
+    path: /etc/repmgr.conf
+  register: replication_conf
+- name: Set replication status fact
+  ansible.builtin.set_fact:
+    replication_is_setup: "{{ replication_conf.stat.exists }}"

Comment on lines -152 to -158

# make sure these changes take effect
- name: Restart postgres service
  ansible.builtin.systemd_service:
    name: postgresql
    state: restarted
  become: true

If we decide to keep the handlers, do we need to flush them here?

Comment on lines +444 to +452
let method = req.method().clone();

let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();

let labels = [
("method", method.to_string()),

Suggested change
-let method = req.method().clone();
+let method = req.method().to_string();
 let response = next.run(req).await;
 let latency = start.elapsed().as_secs_f64();
 let status = response.status().as_u16().to_string();
 let labels = [
-    ("method", method.to_string()),
+    ("method", method),

Comment on lines +91 to +97
let _ = metrics::histogram!("http_requests_duration_seconds");
let _ = metrics::counter!("kldata_messages_received");
let _ = metrics::counter!("kldata_failures");
let _ = metrics::counter!("kafka_messages_received");
let _ = metrics::counter!("kafka_failures");
let _ = metrics::counter!("scalar_datapoints");
let _ = metrics::counter!("nonscalar_datapoints");
@Lun4m (Collaborator) commented Feb 21, 2025


Should we have a metrics struct that registers these metrics as fields and is passed into the ingestor and kafka tasks (and further down where it's needed)? So we don't have to worry about typos?
Edit: do they have to be defined in main? Otherwise we could have local metric variables/structs for each sub-task
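
For illustration, a minimal sketch of the struct idea, assuming the handle-returning macros of the metrics crate that the PR already uses; the IngestionMetrics name and field layout are hypothetical:

use metrics::{counter, histogram, Counter, Histogram};

// Hypothetical container for the metric handles: built once, then passed to the
// ingestor and kafka tasks so each metric name is spelled out in exactly one place.
pub struct IngestionMetrics {
    pub request_duration: Histogram,
    pub kldata_messages_received: Counter,
    pub kldata_failures: Counter,
    pub kafka_messages_received: Counter,
    pub kafka_failures: Counter,
    pub scalar_datapoints: Counter,
    pub nonscalar_datapoints: Counter,
}

impl IngestionMetrics {
    pub fn new() -> Self {
        Self {
            request_duration: histogram!("http_requests_duration_seconds"),
            kldata_messages_received: counter!("kldata_messages_received"),
            kldata_failures: counter!("kldata_failures"),
            kafka_messages_received: counter!("kafka_messages_received"),
            kafka_failures: counter!("kafka_failures"),
            scalar_datapoints: counter!("scalar_datapoints"),
            nonscalar_datapoints: counter!("nonscalar_datapoints"),
        }
    }
}

// Usage: metrics.kafka_failures.increment(1) instead of repeating the string.

One caveat: the handles would have to be created after the exporter/recorder is installed (otherwise they end up as permanent no-ops), so new() would still need to be called from main or wherever the recorder is set up.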

@Lun4m linked an issue Feb 21, 2025 that may be closed by this pull request
.expect("could not commit offset in consumer"); // ensure we keep offset
}
Err(e) => {
metrics::counter!("kafka_failures").increment(1);

Idea: count the specific types of kafka failures separately ("kafka_failure:this", "kafka_failure:that", ...), and possibly also keep the current counter ("kafka_failures") as a summary, so you look at that first and only go on to the graph for the specific type when there's activity there. Pro: easier pinpointing of the problem. Con: dashboard clutter.
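
A hedged sketch of that idea, using a label on a single counter rather than colon-separated metric names (colons in metric names are conventionally reserved for recording rules in Prometheus); the kafka_failures_by_stage name and the "stage" values are assumptions, not taken from this PR:

// Hypothetical helper: bump a per-type series and the existing summary counter together.
fn record_kafka_failure(stage: &'static str) {
    // One labelled series per failure type makes pinpointing easier...
    metrics::counter!("kafka_failures_by_stage", "stage" => stage).increment(1);
    // ...while the unlabelled summary stays as the thing to glance at first on the dashboard.
    metrics::counter!("kafka_failures").increment(1);
}

In the error arms this would replace the bare kafka_failures increments, e.g. record_kafka_failure("consume") around consume_messageset and record_kafka_failure("parse") where message parsing fails.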

}
}
if let Err(e) = consumer.consume_messageset(msgset) {
eprintln!("{}", e);
metrics::counter!("kafka_failures").increment(1);
error!("{}", e);

Should "{}" be replaced with e.g. "consumer.consume_messageset(msgset): {}" for easier identification of where the error occurred? (ditto for other "anonymous" occurrences of error!).


let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();

I suggest renaming latency to e.g. duration to reflect that the value includes the time spent processing the request, rather than just the time from initiation to onset.

.expect("Failed to set up metrics exporter");

// Register metrics so they're guaranteed to show in exporter output
let _ = metrics::histogram!("http_requests_duration_seconds");
@jo-asplin-met-no (Collaborator) commented Feb 24, 2025


I suggest declaring metric names as string constants and replacing existing occurrences with those constants. This reduces the risk of accidental typos slipping through compilation undetected.
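
A minimal sketch of the constants approach; the constant identifiers and the register_metrics name are hypothetical, while the string values are metric names already used in this PR:

// Single source of truth for metric names: a typo at a use site now fails to compile
// instead of silently creating a separate metric.
pub const HTTP_REQUESTS_DURATION_SECONDS: &str = "http_requests_duration_seconds";
pub const KAFKA_FAILURES: &str = "kafka_failures";

fn register_metrics() {
    // Registration and increment sites share the same identifiers.
    let _ = metrics::histogram!(HTTP_REQUESTS_DURATION_SECONDS);
    let _ = metrics::counter!(KAFKA_FAILURES);
}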


Successfully merging this pull request may close these issues: Monitoring
3 participants