
Support ingesting pg_settings for dbm users #14577

Merged · 13 commits merged into master on May 23, 2023
Conversation

@jmeunier28 (Contributor) commented May 16, 2023

What does this PR do?

This feature adds a new job, database-metadata, which requires dbm. The job has two main features:

  1. Sends cloud_metadata to the DBM backend for internal resource creation (important for unified host tagging)
  2. Queries `pg_settings` and forwards the results to the backend, to be displayed in the DBM UI
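The second feature can be sketched as a small transform from `pg_settings` rows into the event payload sent to the backend. This is a hedged sketch only: the function and field names (`build_settings_payload`, `kind`, `rows`) are illustrative, not the PR's actual schema.

```python
# Minimal sketch of shaping pg_settings rows into one DBM event payload.
# Field names ("kind", "rows", etc.) are illustrative, not the PR's schema.

def build_settings_payload(hostname, version, rows):
    """Shape (name, setting) tuples from pg_settings into one event dict."""
    return {
        "host": hostname,
        "kind": "pg_settings",
        "postgres_version": version,
        "rows": [{"name": name, "setting": setting} for name, setting in rows],
    }

payload = build_settings_payload(
    "db-host-1",
    "14.2",
    [("shared_buffers", "16384"), ("max_connections", "100")],
)
```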

An example of the settings in the DBM UI (screenshot taken May 19, 2023, omitted here):

Motivation

There are a number of relevant use cases for this, for example:

  1. An engineer investigating a degraded database might want to check certain settings
  2. A change to certain settings can cause any number of issues for the database
  3. Not all default settings are conducive to certain workloads; we can use this and other metrics to provide insights to users about how to tune and tweak them

@github-actions

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

codecov bot commented May 16, 2023

Codecov Report

Merging #14577 (cd238ce) into master (8bbccee) will increase coverage by 0.26%.
The diff coverage is 25.00%.

Flag Coverage Δ
activemq_xml 82.31% <ø> (ø)
ambari 85.75% <ø> (ø)
apache 95.08% <ø> (ø)
arangodb 98.23% <ø> (ø)
argocd 88.43% <ø> (ø)
avi_vantage 92.54% <ø> (ø)
azure_iot_edge 82.08% <ø> (ø)
boundary 100.00% <ø> (ø)
btrfs 82.91% <ø> (ø)
cacti 87.90% <ø> (ø)
calico 83.54% <ø> (ø)
cert_manager 77.41% <ø> (ø)
cilium 75.46% <ø> (ø)
citrix_hypervisor 87.50% <ø> (ø)
cloud_foundry_api 95.98% <ø> (+0.12%) ⬆️
cloudera 99.82% <ø> (ø)
cockroachdb 91.90% <ø> (ø)
consul 91.65% <ø> (ø)
coredns 94.57% <ø> (ø)
couch 95.19% <ø> (+0.24%) ⬆️
datadog_checks_base 89.61% <25.00%> (+0.32%) ⬆️
datadog_checks_dev 82.75% <ø> (+0.07%) ⬆️
datadog_checks_downloader 81.65% <ø> (ø)
datadog_cluster_agent 90.19% <ø> (ø)
ddev 99.18% <ø> (+0.51%) ⬆️
directory 96.42% <ø> (ø)
disk 89.16% <ø> (-2.15%) ⬇️
dns_check 93.90% <ø> (ø)
druid 97.70% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
envoy 94.18% <ø> (ø)
etcd 93.98% <ø> (ø)
external_dns 89.28% <ø> (ø)
foundationdb 78.50% <ø> (ø)
gitlab_runner 91.94% <ø> (ø)
harbor 80.04% <ø> (ø)
http_check 95.82% <ø> (+2.15%) ⬆️
ibm_ace 91.79% <ø> (ø)
ibm_db2 95.10% <ø> (ø)
ibm_i 81.95% <ø> (ø)
ibm_mq 91.24% <ø> (ø)
istio 77.43% <ø> (+0.55%) ⬆️
kube_dns 95.97% <ø> (ø)
kube_metrics_server 94.87% <ø> (ø)
kube_proxy 96.80% <ø> (ø)
kubelet 90.74% <ø> (ø)
kubernetes_state 89.18% <ø> (ø)
lighttpd 83.64% <ø> (ø)
linkerd 85.14% <ø> (+1.14%) ⬆️
mapr 82.70% <ø> (ø)
marathon 83.12% <ø> (ø)
mcache 93.50% <ø> (ø)
mesos_master 89.75% <ø> (ø)
mesos_slave 93.63% <ø> (ø)
nagios 89.01% <ø> (ø)
network 94.01% <ø> (+1.10%) ⬆️
nginx 95.24% <ø> (+0.54%) ⬆️
openstack 51.45% <ø> (ø)
pgbouncer 91.33% <ø> (ø)
php_fpm 90.25% <ø> (+0.84%) ⬆️
postfix 88.04% <ø> (ø)
powerdns_recursor 96.65% <ø> (ø)
process 85.42% <ø> (+0.28%) ⬆️
proxysql 98.97% <ø> (ø)
pulsar 100.00% <ø> (ø)
redisdb 87.50% <ø> (ø)
rethinkdb 97.93% <ø> (ø)
riak 99.22% <ø> (ø)
singlestore 90.81% <ø> (ø)
snowflake 96.47% <ø> (ø)
squid 100.00% <ø> (ø)
ssh_check 91.58% <ø> (ø)
statsd 87.36% <ø> (+1.05%) ⬆️
supervisord 92.30% <ø> (ø)
system_core 90.90% <ø> (ø)
tcp_check 91.58% <ø> (+1.34%) ⬆️
teamcity 88.35% <ø> (+2.87%) ⬆️
teradata 94.24% <ø> (ø)
tls 92.07% <ø> (+0.82%) ⬆️
tokumx 58.40% <ø> (ø)
traffic_server 96.13% <ø> (ø)
twemproxy 79.45% <ø> (ø)
twistlock 79.62% <ø> (ø)
varnish 84.39% <ø> (+0.26%) ⬆️
win32_event_log 86.54% <ø> (+0.41%) ⬆️
windows_performance_counters 98.36% <ø> (ø)
windows_service 98.00% <ø> (ø)
wmi_check 92.91% <ø> (ø)
yarn 89.50% <ø> (ø)
zk 82.62% <ø> (+1.11%) ⬆️

Flags with carried forward coverage won't be shown.

github-actions bot commented May 16, 2023

Test Results

958 files, 958 suites, 6h 34m 30s ⏱️
5 210 tests: 5 134 ✔️ passed, 69 💤 skipped, 5 failed, 2 🔥 errored
21 320 runs: 17 679 ✔️ passed, 3 634 💤 skipped, 5 failed, 2 🔥 errored

For more details on these failures and errors, see this check.

Results for commit cd238ce.

♻️ This comment has been updated with latest results.

@github-actions
The validations job has failed; please review the Files changed tab for possible suggestions to resolve. (Repeated on subsequent pushes.)

```python
DEFAULT_COLLECTION_INTERVAL = 30

PG_SETTINGS_QUERY = """
SELECT name, setting, category, short_desc FROM pg_settings
```
@justiniso (Contributor) commented May 18, 2023:
For productionizing: how expensive is this query? Hopefully not much, but my concern is that if the agent restarts frequently, the 30m interval will effectively become a 15s interval between restarts. So is the 30m an actual performance optimization?

(As noted in chat, only name and setting are necessary)

@jmeunier28 (Author) replied:

I think this data is stored in the system catalogs, which requires a disk read the first time, but after that it should (in theory) be served from memory.

Seems OK to run this query once every 15s. I only made the interval so long because these settings aren't changed very often.

Contributor replied:

OK, let's just review the explain plan and latency before merge, but otherwise sounds good to me.

```python
from datadog_checks.base.utils.tracking import tracked_method

# default collection interval in minutes
DEFAULT_COLLECTION_INTERVAL = 30
```
A contributor commented:

I know this is not frequently updating data, but 30 minutes is so long. Imagine looking at a 1-hour graph and having no idea where the change actually happened. 10m is the minimum I would accept for any data we send unless the query is very expensive; I think 5m would also be reasonable.

@jmeunier28 (Author) replied:

yeah @joelmarcotte mentioned the same thing, we can decrease it to 5m

```python
def report_postgres_settings(self):
    rows = self._collect_postgres_settings()
    if rows:
        event = {
```
A contributor commented:

Highly, highly recommend including the Postgres version in this payload, since settings constants hardcoded on the backend/frontend will differ across versions.

@jmeunier28 (Author) replied:

ah yes i meant to do that, thanks for the reminder
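The reviewer's point can be illustrated with a small sketch: tag each settings event with the server version so the backend can interpret version-specific settings. The function and field names here (`attach_version`, `version_major`) are hypothetical, not the PR's actual code.

```python
import re

def attach_version(event, version_string):
    """Record the raw version string plus the parsed major version,
    since settings and their defaults differ across major releases.
    Illustrative only -- names do not match the check's real payload."""
    event = dict(event)  # don't mutate the caller's payload
    event["version"] = version_string
    match = re.match(r"PostgreSQL (\d+)", version_string)
    event["version_major"] = int(match.group(1)) if match else None
    return event

e = attach_version({"kind": "pg_settings"}, "PostgreSQL 14.2 on x86_64-pc-linux-gnu")
```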

@ghost ghost added the documentation label May 18, 2023
@github-actions
The validations job has failed; please review the Files changed tab for possible suggestions to resolve. (Repeated on subsequent pushes.)

@jmeunier28 jmeunier28 changed the title WIP: send pg_settings to dbm backend Support ingesting pg_settings for dbm users May 18, 2023
@jmeunier28 jmeunier28 marked this pull request as ready for review May 19, 2023 17:21
@jmeunier28 jmeunier28 requested review from a team as code owners May 19, 2023 17:21
@@ -427,6 +427,22 @@ files:

```yaml
    value:
      type: number
      example: 3500
- name: database_metadata
```
A contributor commented:

Why is this nested? What does "metadata" mean to a customer?

@jmeunier28 (Author) replied:

I was thinking we'd have one config section for settings & schema, but this job will also emit resources. Do you have a better idea of how these settings should look? I'm not attached to this at all.

A contributor commented:

I honestly find the term “database_metadata” confusing as an external configuration API. Am I alone? Why not just a top-level

```yaml
collect_settings:
  enabled: true
  collection_interval: 600
```

which follows the spec for query metrics and samples. https://xkcd.com/927/

```yaml
      example: true
- name: settings_collection_interval
  description: |
    Set the database settings collection interval (in minutes). Each collection involves a single query to
```
A contributor commented:

> in minutes

None of the other collection intervals are in minutes; this seems confusing to customers.

@jmeunier28 (Author) replied:

yeah, I did minutes because writing 300 for the default felt weird, but I can change it to seconds

```python
from datadog_checks.base.utils.tracking import tracked_method

# default collection interval in minutes
DEFAULT_COLLECTION_INTERVAL = 5
```
A contributor commented:

Small Python nit: adding a unit suffix (`_ms`, `_seconds`, `_min`) is appreciated if we're not standardizing on seconds (but I suggest aligning on seconds).
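The suffix convention from this nit can be sketched as follows. A minimal sketch with assumed names (`DEFAULT_SETTINGS_COLLECTION_INTERVAL_MIN`, `resolve_interval_s`); the check's real constants may differ.

```python
# Sketch of the unit-suffix convention: bake the unit into the name so
# readers never have to guess whether a value means seconds or minutes.
DEFAULT_SETTINGS_COLLECTION_INTERVAL_MIN = 5
DEFAULT_SETTINGS_COLLECTION_INTERVAL_S = DEFAULT_SETTINGS_COLLECTION_INTERVAL_MIN * 60

def resolve_interval_s(configured_s=None):
    """Prefer an explicit user-configured interval (already in seconds)."""
    return configured_s if configured_s is not None else DEFAULT_SETTINGS_COLLECTION_INTERVAL_S
```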

@jmeunier28 jmeunier28 force-pushed the jmeunier/hackathon branch from 77c2b91 to 2c7845c Compare May 19, 2023 22:24
@jmeunier28 jmeunier28 force-pushed the jmeunier/hackathon branch from 170ab4d to 1c7ee7c Compare May 22, 2023 14:48

```python
)

# by default, send resources every 5 minutes
self.collection_interval = min(300, self.pg_settings_collection_interval)
```
A contributor commented:

What's the purpose of min here? Do we not trust the configuration?

@jmeunier28 (Author) replied:

This sets how often the check will run: if the user wants to collect settings more often than the default cadence we have for sending resources, then we should honor that.
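The author's explanation can be sketched in isolation: the shared job loop ticks at the faster of the resource-send default (300 s) and the user's settings interval, so a lower user setting is still honored. Names here are illustrative, not the check's actual attributes.

```python
# Illustration of the min() rationale: without it, a user-configured
# settings interval faster than 300 s could never fire, because the
# loop itself would only wake every 300 s.
RESOURCE_SEND_INTERVAL_S = 300  # default cadence for sending resources

def job_loop_interval_s(pg_settings_interval_s):
    return min(RESOURCE_SEND_INTERVAL_S, pg_settings_interval_s)
```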

Comment on lines 433 to 445

```yaml
- name: collect_settings
  description: |
    Enable collection of pg_settings. Requires `dbm: true`.
  value:
    type: boolean
    example: false
- name: settings_collection_interval
  description: |
    Set the database settings collection interval (in seconds). Each collection involves a single query to
    `pg_settings`.
  value:
    type: number
    example: 600
```
A contributor commented:

Let's make these hidden to avoid early configuration and customers copying configs and messing themselves up later down the road.

@jmeunier28 (Author) replied:

> Let's make these hidden to avoid early configuration and customers copying configs and messing themselves up later down the road.

if we do this #14577 (comment), why do you assume we are going to change this? Also, it's an alpha feature, so it seems fair that we could break the config at some point if we want to?

Comment on lines +98 to +100

```python
elapsed_s = time.time() - self._time_since_last_settings_query
if elapsed_s >= self.pg_settings_collection_interval and self._collect_pg_settings_enabled:
    self._pg_settings_cached = self._collect_postgres_settings()
```
A contributor commented:

Nice!
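The interval-gated caching being praised here can be made self-contained with an injectable clock, which lets the gating be tested without a database. This is a sketch; the class and attribute names are illustrative, not the check's actual ones.

```python
import time

class SettingsCache:
    """Re-query pg_settings only once the interval has elapsed; otherwise
    serve the cached rows. `collect` and `clock` are injectable so the
    gating logic is testable without a real database connection."""

    def __init__(self, interval_s, collect, clock=time.time):
        self.interval_s = interval_s
        self.collect = collect
        self.clock = clock
        self.cached = None
        self.last_query_at = float("-inf")  # force a collect on first call

    def get(self):
        now = self.clock()
        if now - self.last_query_at >= self.interval_s:
            self.cached = self.collect()
            self.last_query_at = now
        return self.cached
```

With a fake clock, the first call collects, calls inside the interval serve the cache, and a call past the interval re-collects.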

@@ -716,6 +718,7 @@ def check(self, _):

```python
if self._config.dbm_enabled:
    self.statement_metrics.run_job_loop(tags)
    self.statement_samples.run_job_loop(tags)
    self.metadata_samples.run_job_loop(tags)
```
A contributor commented:

Not saying anything new, but we are adding yet another open connection from the Datadog agent for this job loop. Let's make sure this is the last one, and that this class is used for all additional data not collected as part of the main collection interval.

@alexandre-normand (Contributor) left a review:

thanks

@jmeunier28 jmeunier28 merged commit 16f75ab into master May 23, 2023
@jmeunier28 jmeunier28 deleted the jmeunier/hackathon branch May 23, 2023 20:28
github-actions bot pushed a commit that referenced this pull request May 23, 2023
Send pg_settings to dbm backend — commit 16f75ab
Co-authored-by: Alex Normand <[email protected]>