Remove monotonic count from ignored types in no duplicate assertion #9463

Merged
merged 2 commits into from
Jun 18, 2021

Conversation

npezzotti
Contributor

@npezzotti npezzotti commented Jun 2, 2021

What does this PR do?

  • Updates the aggregator's no-duplicates assertion to check for duplicate monotonic count types
  • Explores options for failing the aggregator when duplicate metrics are submitted for metric types that should have one context per run, such as rate

Motivation

  • AI-1553
  • Submitting duplicate rates or monotonic counts (when new sample is less than previous) can lead to unexpected values or the agent resetting the counter
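
To illustrate the motivation, here is a toy model of a delta-based counter. It is not the Agent's implementation: the class name, the reset rule, and the sample values are all assumptions made for this sketch. It only shows how a stray second submission with a lower value can skew the total.

```python
class ToyMonotonicCount:
    """Toy model only; not the Agent's actual implementation."""

    def __init__(self):
        self.previous = None  # last raw sample seen for this context
        self.total = 0        # sum of the positive deltas

    def sample(self, value):
        if self.previous is not None and value >= self.previous:
            self.total += value - self.previous
        # Assumption: a lower value is read as a counter reset, so nothing is added.
        self.previous = value


def total_for(samples):
    counter = ToyMonotonicCount()
    for value in samples:
        counter.sample(value)
    return counter.total


clean = total_for([100, 150, 170])            # one submission per run
duplicated = total_for([100, 150, 120, 170])  # 120 is a stray second submission
```

Under these assumptions the clean sequence totals 70 while the duplicated one totals 100, because the stray 120 moves the baseline backwards before the next sample arrives.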

Additional Notes

Ideas explored to fail aggregator when this occurs:

  1. Initial idea for catching the issue at metric submission:
    NON_DUPLICATE_TYPES = [RATE, MONOTONIC_COUNT]
    ...
    def submit_metric(self, check, check_id, mtype, name, value, tags, hostname, flush_first_value):
        stub = MetricStub(name, mtype, value, tags, hostname, None)
        if mtype in self.NON_DUPLICATE_TYPES and self.is_duplicate_context(stub):
            raise Exception(
                "'{}' of type '{}' submitted twice with same context.".format(
                    name, AggregatorStub.METRIC_ENUM_MAP_REV[mtype]
                )
            )

        if not self.ignore_metric(name):
            self._metrics[name].append(stub)

    def is_duplicate_context(self, new_metric):
        def format_stub(stub):
            return stub.name, stub.type, str(sorted(stub.tags)), stub.hostname

        # compare against every stored stub, not only the first one
        return any(format_stub(context) == format_stub(new_metric) for context in self.metrics(new_metric.name))
  2. Two ways to catch duplicates when calling assert_metric in tests:
    def assert_metric(
        self, name, value=None, tags=None, count=None, at_least=1, hostname=None, metric_type=None, device=None
    ):
        ...
        candidates = []
        ...
        # enforce a count of 1 for candidates with a non-duplicate type (when a count=0 is not used)
        if count != 0 and candidates and all(m.type in self.NON_DUPLICATE_TYPES for m in candidates):
            count = 1
        ...
        # set a condition that catches duplicates with a non-duplicate type
        if value is not None and candidates and all(self.is_aggregate(m.type) for m in candidates):
            got = sum(m.value for m in candidates)
            msg = "Expected count value for '{}': {}, got {}".format(name, value, got)
            condition = value == got
        elif count != 0 and candidates and all(m.type in self.NON_DUPLICATE_TYPES for m in candidates):
            msg = "Duplicate '{}' of non-duplicate type".format(name)
            condition = len(candidates) == 1
        elif count is not None:
            msg = "Needed exactly {} candidates for '{}', got {}".format(count, name, len(candidates))
            condition = len(candidates) == count
        else:
            msg = "Needed at least {} candidates for '{}', got {}".format(at_least, name, len(candidates))
            condition = len(candidates) >= at_least
        self._assert(condition, msg=msg, expected_stub=expected_metric, submitted_elements=self._metrics)    
        ...          
  • Not exactly sure how to tell the aggregator when the check method is called twice in integration tests. Maybe this can be asserted at the check level in tests through assert_no_duplicate_*?
  • The dd_agent_check fixture populates the aggregator with JSON check output in replay_check_run, which would result in one rate value despite duplicates (when called with rate)
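
For reference, a self-contained sketch of the submission-time idea from the notes above. `TinyAggregator`, the string type constants, and the trimmed `MetricStub` are hypothetical stand-ins for this example, not the real stub's API. The duplicate check compares the new sample against every stored stub with the same name.

```python
from collections import defaultdict, namedtuple

# Hypothetical, trimmed-down stub; the real MetricStub has more fields.
MetricStub = namedtuple("MetricStub", "name type value tags hostname")


class TinyAggregator:
    """Illustrative stand-in for the test aggregator, not the real class."""

    GAUGE, RATE, MONOTONIC_COUNT = "gauge", "rate", "monotonic_count"
    NON_DUPLICATE_TYPES = (RATE, MONOTONIC_COUNT)

    def __init__(self):
        self._metrics = defaultdict(list)

    @staticmethod
    def _context(stub):
        # A context is name + type + tags (order-insensitive) + hostname.
        return stub.name, stub.type, tuple(sorted(stub.tags)), stub.hostname

    def submit_metric(self, mtype, name, value, tags=(), hostname=""):
        stub = MetricStub(name, mtype, value, list(tags), hostname)
        if mtype in self.NON_DUPLICATE_TYPES and any(
            self._context(seen) == self._context(stub) for seen in self._metrics[name]
        ):
            raise ValueError(
                "'{}' of type '{}' submitted twice with same context.".format(name, mtype)
            )
        self._metrics[name].append(stub)
```

In this sketch, a second rate with the same name, tags, and hostname raises, while repeated gauges or a rate with different tags are accepted.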

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

@codecov

codecov bot commented Jun 2, 2021

Codecov Report

Merging #9463 (ad8802a) into master (d34d3f8) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Flag Coverage Δ
active_directory 100.00% <ø> (ø)
activemq_xml 82.31% <ø> (ø)
aerospike 86.92% <ø> (ø)
airflow 89.94% <ø> (ø)
amazon_msk 87.82% <ø> (ø)
ambari 86.98% <ø> (ø)
apache 94.90% <ø> (ø)
aspdotnet 93.87% <ø> (ø)
azure_iot_edge 82.01% <ø> (ø)
btrfs 82.91% <ø> (ø)
cacti 84.01% <ø> (ø)
cassandra_nodetool 94.19% <ø> (ø)
ceph 91.04% <ø> (ø)
cilium 85.84% <ø> (+1.88%) ⬆️
cisco_aci 95.88% <ø> (ø)
clickhouse 96.95% <ø> (ø)
cloud_foundry_api 95.98% <ø> (+0.12%) ⬆️
cockroachdb 97.18% <ø> (ø)
consul 93.84% <ø> (ø)
coredns 96.36% <ø> (ø)
couch 94.81% <ø> (+0.74%) ⬆️
couchbase 81.45% <ø> (ø)
crio 100.00% <ø> (ø)
datadog_checks_base 89.56% <100.00%> (+0.35%) ⬆️
datadog_checks_dev 80.62% <ø> (-0.02%) ⬇️
datadog_checks_downloader 80.40% <ø> (ø)
directory 94.70% <ø> (ø)
disk 91.00% <ø> (-0.51%) ⬇️
dns_check 94.44% <ø> (ø)
dotnetclr 100.00% <ø> (ø)
druid 97.70% <ø> (ø)
ecs_fargate 77.39% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
elastic 88.54% <ø> (ø)
envoy 93.68% <ø> (+0.26%) ⬆️
etcd 93.09% <ø> (ø)
exchange_server 100.00% <ø> (ø)
external_dns 100.00% <ø> (ø)
fluentd 94.77% <ø> (ø)
gearmand 77.27% <ø> (+1.29%) ⬆️
gitlab 89.94% <ø> (ø)
gitlab_runner 90.32% <ø> (ø)
glusterfs 80.09% <ø> (+0.92%) ⬆️
go_expvar 91.95% <ø> (ø)
gunicorn 94.29% <ø> (+0.76%) ⬆️
haproxy 95.22% <ø> (+0.17%) ⬆️
harbor 91.58% <ø> (ø)
hazelcast 92.39% <ø> (ø)
hdfs_datanode 90.00% <ø> (ø)
hdfs_namenode 87.94% <ø> (ø)
http_check 89.96% <ø> (+1.82%) ⬆️
ibm_db2 93.87% <ø> (ø)
ibm_mq 89.99% <ø> (+1.45%) ⬆️
ibm_was 97.44% <ø> (ø)
iis 92.41% <ø> (ø)
istio 93.18% <ø> (+0.56%) ⬆️
kafka_consumer 81.15% <ø> (ø)
kong 93.33% <ø> (ø)
kube_apiserver_metrics 97.35% <ø> (ø)
kube_controller_manager 97.05% <ø> (ø)
kube_dns 98.85% <ø> (ø)
kube_metrics_server 100.00% <ø> (ø)
kube_proxy 100.00% <ø> (ø)
kube_scheduler 98.07% <ø> (ø)
kubelet 89.47% <ø> (ø)
kubernetes_state 89.69% <ø> (+0.03%) ⬆️
kyototycoon 85.96% <ø> (ø)
lighttpd 83.64% <ø> (ø)
linkerd 87.05% <ø> (+1.17%) ⬆️
linux_proc_extras 96.22% <ø> (ø)
mapr 84.97% <ø> (ø)
mapreduce 82.27% <ø> (+0.45%) ⬆️
marathon 83.12% <ø> (ø)
marklogic 95.32% <ø> (ø)
mcache 93.39% <ø> (ø)
mesos_master 92.20% <ø> (ø)
mesos_slave 93.60% <ø> (ø)
mongo 94.74% <ø> (+0.28%) ⬆️
mysql 85.18% <ø> (+0.25%) ⬆️
nagios 89.53% <ø> (ø)
network 77.76% <ø> (+1.00%) ⬆️
nfsstat 95.20% <ø> (ø)
nginx 95.11% <ø> (+0.93%) ⬆️
nginx_ingress_controller 98.30% <ø> (ø)
openldap 96.33% <ø> (ø)
openmetrics 97.14% <ø> (ø)
openstack 51.30% <ø> (ø)
openstack_controller 90.59% <ø> (ø)
oracle 93.61% <ø> (+0.63%) ⬆️
pdh_check 97.77% <ø> (ø)
pgbouncer 91.50% <ø> (ø)
php_fpm 89.95% <ø> (+0.43%) ⬆️
postfix 88.04% <ø> (ø)
postgres 92.20% <ø> (+0.78%) ⬆️
powerdns_recursor 95.93% <ø> (ø)
process 85.12% <ø> (+0.28%) ⬆️
prometheus 94.17% <ø> (ø)
proxysql 99.62% <ø> (ø)
rabbitmq 93.73% <ø> (ø)
redisdb 86.87% <ø> (ø)
rethinkdb 97.93% <ø> (ø)
riak 99.22% <ø> (ø)
riakcs 93.61% <ø> (ø)
sap_hana 93.04% <ø> (ø)
scylla 100.00% <ø> (ø)
snmp 91.57% <ø> (+0.04%) ⬆️
snowflake 94.44% <ø> (+0.58%) ⬆️
sonarqube 95.69% <ø> (ø)
spark 93.64% <ø> (ø)
sqlserver 81.71% <ø> (ø)
squid 100.00% <ø> (ø)
ssh_check 91.08% <ø> (ø)
statsd 87.36% <ø> (+1.05%) ⬆️
supervisord 92.18% <ø> (ø)
system_core 91.04% <ø> (ø)
system_swap 98.30% <ø> (ø)
tcp_check 86.53% <ø> (ø)
teamcity 80.00% <ø> (ø)
tls 97.04% <ø> (+0.87%) ⬆️
tokumx 58.40% <ø> (?)
twemproxy 78.33% <ø> (ø)
twistlock 80.74% <ø> (ø)
varnish 84.57% <ø> (+0.24%) ⬆️
vault 94.80% <ø> (+0.54%) ⬆️
vertica 92.27% <ø> (ø)
voltdb 96.81% <ø> (ø)
vsphere 89.74% <ø> (+0.05%) ⬆️
win32_event_log 86.03% <ø> (+0.28%) ⬆️
windows_service 95.83% <ø> (ø)
wmi_check 92.91% <ø> (ø)
yarn 90.30% <ø> (ø)
zk 85.17% <ø> (+0.75%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...s_base/tests/stubs/test_aggregator_no_duplicate.py 100.00% <ø> (ø)
...hecks_base/datadog_checks/base/stubs/aggregator.py 66.92% <100.00%> (ø)
...checks_dev/datadog_checks/dev/tooling/constants.py 90.90% <0.00%> (-2.64%) ⬇️
disk/datadog_checks/disk/disk.py 79.05% <0.00%> (-1.36%) ⬇️
datadog_checks_base/tests/test_prometheus.py 99.48% <0.00%> (-0.26%) ⬇️
datadog_checks_base/tests/test_openmetrics.py 97.47% <0.00%> (-0.23%) ⬇️
ibm_mq/tests/test_ibm_mq_int.py 100.00% <0.00%> (ø)
kubernetes_state/tests/test_kubernetes_state.py 97.33% <0.00%> (ø)
sqlserver/datadog_checks/sqlserver/connection.py 76.62% <0.00%> (ø)
...g_checks/win32_event_log/legacy/win32_event_log.py 69.76% <0.00%> (ø)
... and 71 more

Contributor

@coignetp coignetp left a comment

Thanks a lot for the investigation! This solution is the best option in my opinion. We should use this assert_no_duplicate_metrics method in more integrations, though.

@@ -75,8 +75,6 @@ def test_assert_no_duplicate_message(aggregator):
[
dict(type='count', name='metric.count', value=1, tags=['aa'], hostname='1'),
dict(type='count', name='metric.count', value=1, tags=['aa'], hostname='1'),
dict(type='monotonic_count', name='metric.monotonic_count', value=1, tags=['aa'], hostname='1'),
dict(type='monotonic_count', name='metric.monotonic_count', value=1, tags=['aa'], hostname='1'),

Let's add a rate case in a duplicate metric section above

@@ -418,7 +418,7 @@ def assert_no_duplicate_metrics(self):
- hostname
"""
# metric types that intended to be called multiple times are ignored
ignored_types = [self.COUNT, self.MONOTONIC_COUNT, self.COUNTER]

Double submission of a monotonic_count is allowed and has a defined behaviour, but it is an easy source of errors. Since no integration submits a monotonic_count twice on purpose, let's raise an error in these cases.
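
A minimal sketch of what removing monotonic_count from the ignored types changes. The helper `assert_no_duplicates` and the string type names are assumptions for illustration and do not reproduce the real `assert_no_duplicate_metrics`. The sample data mirrors the dict shape used in the test above.

```python
from collections import Counter


def assert_no_duplicates(metrics, ignored_types):
    """Fail if two metrics share type, name, tags, and hostname (hypothetical helper)."""
    contexts = Counter(
        (m["type"], m["name"], tuple(sorted(m["tags"])), m["hostname"])
        for m in metrics
        if m["type"] not in ignored_types
    )
    duplicates = [context for context, seen in contexts.items() if seen > 1]
    assert not duplicates, "Duplicate metrics found: {}".format(duplicates)


submitted = [
    dict(type="monotonic_count", name="metric.monotonic_count", value=1, tags=["aa"], hostname="1"),
    dict(type="monotonic_count", name="metric.monotonic_count", value=1, tags=["aa"], hostname="1"),
]

# Ignoring monotonic_count: the duplicate goes unnoticed.
assert_no_duplicates(submitted, ignored_types=["count", "monotonic_count", "counter"])

# No longer ignoring it: the same data now fails the assertion.
try:
    assert_no_duplicates(submitted, ignored_types=["count", "counter"])
    caught = False
except AssertionError:
    caught = True
```

With monotonic_count left out of the ignored list, the duplicated pair is reported, which is the behaviour the review comment asks for.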

@coignetp coignetp merged commit 6dfc81f into master Jun 18, 2021
@coignetp coignetp deleted the nathan/aggregator-test-multiple-rate branch June 18, 2021 13:58
github-actions bot pushed a commit that referenced this pull request Jun 18, 2021
…9463)

* removed monotonic count from ignored types in no-duplicate assertion

* added test cases for duplicate rate/monotonic count metric 6dfc81f
alexandre-normand pushed a commit that referenced this pull request Jun 24, 2021
…9463)

* removed monotonic count from ignored types in no-duplicate assertion

* added test cases for duplicate rate/monotonic count metric