Monitoring #62
base: trunk
Conversation
The previous version had a footgun: rules removed from the Ansible vars would not be removed from ostack by Ansible. This version also allows our vars to match the structure expected by the ostack collection.
These both need to always run, because GitHub doesn't let you put conditions on checks that are required to pass for branch protection.
20 minutes for changing some graphs sounds rough; it's probably faster to do it by hand in the GUI.
It seems like they are merging it soon (today even 🎉 )
That makes sense, plus we probably also need to merge the other two open draft PRs before that.
Nice, it's so satisfying to see that the dashboard is running!
  ansible.builtin.replace:
    dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
    regexp: '(data_directory\s=\s)\S+'
    replace: "\\1'{{ pg_dir }}'"
  become: true
  when: not pg_data_directory_set
Is this necessary? Isn't the task idempotent?
    regexp: '#(listen_addresses\s=\s)\S+'
    replace: "\\1'*'"
  become: true
  when: not pg_listen_addresses_set
Here too, but in this case it should probably be '#?(listen_addresses\s=\s)\S+', since listen_addresses is commented out by default.
  ansible.builtin.replace:
    dest: /etc/postgresql/{{ pg_version }}/main/postgresql.conf
    regexp: '(data_directory\s=\s)\S+'
    replace: "\\1'{{ pg_dir }}'"
  become: true
  when: not pg_data_directory_set
  notify:
    - Rsync postgres directory to ssd mount
Does this need to run after the change in postgresql.conf? We are syncing /var/lib/postgresql, not /etc/postgresql.
  notify: Restart postgres service

# Make sure any configuration changes are applied before moving on
- name: Flush handlers
  ansible.builtin.meta: flush_handlers
Do we need to restart after every change? We are simply editing text files. Same question for repmgr.yml and create_primary.yml (since postgresql_set should be saving the changes to a config file that is loaded when postgres is re/started).
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true
- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded

- name: Configure repmgr
  ansible.builtin.import_tasks: configure/repmgr.yml
  when: not pg_is_replica
I guess I didn't test if running the repmgr stuff only on the primary would work 😅 But then why can't we simply move the Configure repmgr task inside the Create primary block, instead of using the pg_is_replica variable (it seems like it should do the same thing as pg_primary_ip)?
Aaaah, now I get it: this will run for both the first time, and only for the primary on successive runs!
But I was just looking again at the repmgr docs:
On the standby, do not create a PostgreSQL instance (i.e. do not execute initdb or any database creation scripts provided by packages), but do ensure the destination data directory (and any other directories which you want PostgreSQL to use) exist and are owned by the postgres system user. Permissions must be set to 0700 (drwx------).
😮💨 maybe we don't even need to run the Create repmgr user and Create repmgr database tasks on the standby.
- name: Change postgres data directory
  # we use replace here instead of postgresql_set because postgres is not running
Yes, but most importantly because data_directory and listen_addresses cannot be set while postgres is running (I think?).
# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true
- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ pg_stat_wal_receiver.rowcount == 1 }}"
  when: pg_stat_wal_receiver is succeeded
Two suggestions:

- Add a default here so we don't need to have it in configure.yml:
Suggested change:

# Needed to disable certain tasks, as replicas are read-only and can't run them
- name: Discover replication status
  community.postgresql.postgresql_query:
    db: "lard"
    login_host: "localhost"
    login_user: "lard_user"
    login_password: "{{ pg_lard_password }}"
    query: SELECT * FROM pg_stat_wal_receiver
  register: pg_stat_wal_receiver
  # This will fail on first run, as the lard user and db are not yet created. In this case
  # the following task will not run, leaving is_replica as false
  ignore_errors: true
- name: Set replication status fact
  ansible.builtin.set_fact:
    pg_is_replica: "{{ (pg_stat_wal_receiver.rowcount | default(0)) == 1 }}"
- Can we simply stat the repmgr.conf file instead?
Suggested change:

- name: Stat replication conf file
  ansible.builtin.stat:
    path: /etc/repmgr.conf
  register: replication_conf
- name: Set replication status fact
  ansible.builtin.set_fact:
    replication_is_setup: "{{ replication_conf.stat.exists }}"
# make sure these changes take effect
- name: Restart postgres service
  ansible.builtin.systemd_service:
    name: postgresql
    state: restarted
  become: true
If we decide to keep the handlers, do we need to flush them here?
let method = req.method().clone();

let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();

let labels = [
    ("method", method.to_string()),
Suggested change:

let method = req.method().to_string();

let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();

let labels = [
    ("method", method),
let _ = metrics::histogram!("http_requests_duration_seconds");
let _ = metrics::counter!("kldata_messages_received");
let _ = metrics::counter!("kldata_failures");
let _ = metrics::counter!("kafka_messages_received");
let _ = metrics::counter!("kafka_failures");
let _ = metrics::counter!("scalar_datapoints");
let _ = metrics::counter!("nonscalar_datapoints");
Should we have a metrics struct that registers these metrics as fields and is passed into the ingestor and kafka tasks (and further down where it's needed), so we don't have to worry about typos?
Edit: do they have to be defined in main? Otherwise we could have local metric variables/structs for each sub-task.
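A minimal sketch of what that could look like, assuming we keep using the `metrics` crate's handle-returning macros (which the existing `counter!(...).increment(1)` calls suggest); the struct and field names here are made up:

```rust
use metrics::{counter, histogram, Counter, Histogram};

// Hypothetical container for pre-registered metric handles. The handles are
// cheap to clone, so each task can own a copy, and the metric names live in
// exactly one place.
#[derive(Clone)]
pub struct IngestMetrics {
    pub request_duration: Histogram,
    pub kldata_received: Counter,
    pub kldata_failures: Counter,
    pub kafka_received: Counter,
    pub kafka_failures: Counter,
}

impl IngestMetrics {
    pub fn new() -> Self {
        Self {
            request_duration: histogram!("http_requests_duration_seconds"),
            kldata_received: counter!("kldata_messages_received"),
            kldata_failures: counter!("kldata_failures"),
            kafka_received: counter!("kafka_messages_received"),
            kafka_failures: counter!("kafka_failures"),
        }
    }
}

// Usage in a task that was handed a clone:
// metrics.kafka_failures.increment(1);
// metrics.request_duration.record(duration);
```

As far as I understand, handles created before the recorder is installed end up as no-ops, so the struct would need to be built after the exporter setup (currently in main) and then passed down or cloned into the sub-tasks.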
.expect("could not commit offset in consumer"); // ensure we keep offset | ||
} | ||
Err(e) => { | ||
metrics::counter!("kafka_failures").increment(1); |
Idea: count the specific types of kafka failures separately ("kafka_failure:this", "kafka_failure:that", ...), and possibly also keep the current counter ("kafka_failures") as a summary so you can look at that first, and only when there's activity there you go on to look at the graph for the specific type. Pro: easier pinpointing of problem. Con: dashboard clutter.
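A rough sketch of that, assuming the `metrics` macros' label syntax; the metric name, label key, and failure kinds below are placeholders:

```rust
// Keep the existing summary counter, and add a per-kind breakdown via a label.
fn record_kafka_failure(kind: &'static str) {
    metrics::counter!("kafka_failures").increment(1); // summary, as today
    metrics::counter!("kafka_failures_by_kind", "kind" => kind).increment(1);
}

// record_kafka_failure("consume_messageset");
// record_kafka_failure("commit_offset");
```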
    }
}
if let Err(e) = consumer.consume_messageset(msgset) {
    eprintln!("{}", e);
    metrics::counter!("kafka_failures").increment(1);
    error!("{}", e);
Should "{}" be replaced with e.g. "consumer.consume_messageset(msgset): {}" for easier identification of where the error occurred? (ditto for other "anonymous" occurrences of error!).
let response = next.run(req).await;

let latency = start.elapsed().as_secs_f64();
I suggest renaming latency to e.g. duration, to reflect that the value includes the time spent processing the request rather than just the time from request initiation to response onset.
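i.e. something like this in the middleware, assuming the value is recorded further down with the labels built here:

```rust
// Full time spent handling the request, not just time to the first response byte
let duration = start.elapsed().as_secs_f64();

// ... `labels` built as before ...

metrics::histogram!("http_requests_duration_seconds", &labels).record(duration);
```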
.expect("Failed to set up metrics exporter"); | ||
|
||
// Register metrics so they're guaranteed to show in exporter output | ||
let _ = metrics::histogram!("http_requests_duration_seconds"); |
I suggest declaring metric names as string constants and replacing existing occurrences with those constants. This reduces the risk of accidental typos slipping through compilation undetected.
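A sketch of that; the constant names and module placement are illustrative:

```rust
// Single source of truth for metric names: a typo at a call site becomes a
// compile error (unknown constant) instead of silently creating a new series.
pub mod metric_names {
    pub const HTTP_REQUESTS_DURATION_SECONDS: &str = "http_requests_duration_seconds";
    pub const KLDATA_MESSAGES_RECEIVED: &str = "kldata_messages_received";
    pub const KLDATA_FAILURES: &str = "kldata_failures";
    pub const KAFKA_MESSAGES_RECEIVED: &str = "kafka_messages_received";
    pub const KAFKA_FAILURES: &str = "kafka_failures";
    pub const SCALAR_DATAPOINTS: &str = "scalar_datapoints";
    pub const NONSCALAR_DATAPOINTS: &str = "nonscalar_datapoints";
}

// Registration and use both go through the constants:
// let _ = metrics::counter!(metric_names::KAFKA_FAILURES);
// metrics::counter!(metric_names::KAFKA_FAILURES).increment(1);
```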
TODO list:

- [ ] stat_statements in postgres config
- [ ] make dashboard visualisations that use stat_statements

This PR includes:

Related changes outside this repo (see PRs linked in TODO list):

Things to note:

Unfortunately the Grafana instance IT maintains is read-only, meaning we can't easily edit or experiment with our dashboard. Making changes to the dashboard requires:

- …
- a change to the departments repo replacing the dashboard file with your newly exported JSON

In my opinion this is an unacceptable edit loop for a visualisation dashboard. I'll leave some feedback for the people at IT who maintain it, but if they aren't receptive, we may eventually want to set up our own.
I wanted to include metrics/dashboards using pg_stat_statements, which would give us more detailed insights into query performance. Unfortunately the stat_statements collector in the latest version of postgres_exporter is broken on Postgres 17. There's already a fix for it in their main branch, but as we don't know when the next release will be, I suggest we ignore this for now and come back to it when the new version is live. The configuration is already done on our end, so it would just require running a playbook to update postgres_exporter, and making visualisations with the new metrics.

I was originally intending to turn on ingestion to test and refine the dashboard (Manuel turned it off in relation to his migration work), but now, especially as we don't have stat_statements yet, I think this can wait and be bundled into my upcoming benchmarking work, as that will stress the dashboard anyway.