Remove monotonic count from ignored types in no duplicate assertion #9463

npezzotti · 2021-06-02T23:07:22Z

What does this PR do?

Updates the aggregator's no-duplicates assertion to check for duplicate monotonic count types
Explores options for failing the aggregator when duplicate metrics are submitted for metric types that should have one context per run, such as rate

Motivation

AI-1553
Submitting duplicate rates or monotonic counts (when the new sample is less than the previous one) can lead to unexpected values or to the agent resetting the counter
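
To illustrate why a duplicate monotonic count is risky, here is a minimal sketch of a monotonic counter. It assumes a simplified model in which only non-negative deltas between consecutive samples are summed and a lower sample is treated as a reset. The function name `monotonic_count_flush` and the sample values are hypothetical, and this is not the Agent's actual implementation.

```python
def monotonic_count_flush(samples):
    """Sum the non-negative deltas between consecutive samples.

    Simplified model (an assumption, not the Agent's code): a sample lower
    than the previous one is treated as a counter reset, so its delta is dropped.
    """
    total = 0
    previous = None
    for sample in samples:
        if previous is not None and sample >= previous:
            total += sample - previous
        previous = sample
    return total


# One submission per run: the counter grows 100 -> 130 -> 170.
single = monotonic_count_flush([100, 130, 170])

# A second counter (5, 8, 12) submitted under the same context interleaves
# with the first, so every other sample looks like a reset.
duplicated = monotonic_count_flush([100, 5, 130, 8, 170, 12])

print(single, duplicated)
```

Under this model the two real counters grew by 70 and 7, yet the interleaved stream reports a much larger total, which is the kind of unexpected value the no-duplicate assertion is meant to surface in tests.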

Additional Notes

Ideas explored to fail aggregator when this occurs:

Initial idea for catching the issue at metric submission:

    NON_DUPLICATE_TYPES = [RATE, MONOTONIC_COUNT]
    ...
    def submit_metric(self, check, check_id, mtype, name, value, tags, hostname, flush_first_value):
        # reject a second submission of the same context for types that expect one sample per run
        if mtype in self.NON_DUPLICATE_TYPES and self.is_duplicate_context(MetricStub(name, mtype, value, tags, hostname, None)):
            raise Exception("'{}' of type '{}' submitted twice with same context.".format(name, AggregatorStub.METRIC_ENUM_MAP_REV[mtype]))

        if not self.ignore_metric(name):
            self._metrics[name].append(MetricStub(name, mtype, value, tags, hostname, None))

    def is_duplicate_context(self, new_metric):
        def format_stub(stub):
            return stub.name, stub.type, str(sorted(stub.tags)), stub.hostname

        # compare against every stored stub, not only the first one
        new_context = format_stub(new_metric)
        return any(format_stub(stub) == new_context for stub in self.metrics(new_metric.name))
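
The submission-time idea can also be tried outside the stub. The following is a self-contained sketch, not the AggregatorStub itself: `TinyAggregator`, `Metric`, `context_of`, and the string type constants are hypothetical stand-ins introduced only for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-ins for the stub's metric type constants.
RATE, MONOTONIC_COUNT, GAUGE = "rate", "monotonic_count", "gauge"
NON_DUPLICATE_TYPES = {RATE, MONOTONIC_COUNT}

Metric = namedtuple("Metric", "name type value tags hostname")


def context_of(metric):
    # A context is everything except the value; tag order does not matter.
    return metric.name, metric.type, tuple(sorted(metric.tags)), metric.hostname


class TinyAggregator:
    def __init__(self):
        self._metrics = defaultdict(list)

    def submit_metric(self, mtype, name, value, tags=(), hostname=""):
        new = Metric(name, mtype, value, tuple(tags), hostname)
        # Only types that expect one sample per context per run are guarded.
        if mtype in NON_DUPLICATE_TYPES and any(
            context_of(stored) == context_of(new) for stored in self._metrics[name]
        ):
            raise ValueError(
                "'{}' of type '{}' submitted twice with same context.".format(name, mtype)
            )
        self._metrics[name].append(new)
```

With this sketch, submitting a rate twice with the same name, tags, and hostname raises, while the same rate with different tags or a repeated gauge is accepted.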

Two ways to catch duplicates when calling assert_metric in tests:

    def assert_metric(
        self, name, value=None, tags=None, count=None, at_least=1, hostname=None, metric_type=None, device=None
    ):
        ...
        candidates = []
        ...
        # enforce a count of 1 for candidates with a non-duplicate type (when a count=0 is not used)
        if count != 0 and candidates and all(m.type in self.NON_DUPLICATE_TYPES for m in candidates):
            count = 1
        ...
        # set a condition that catches duplicates with a non-duplicate type
        if value is not None and candidates and all(self.is_aggregate(m.type) for m in candidates):
            got = sum(m.value for m in candidates)
            msg = "Expected count value for '{}': {}, got {}".format(name, value, got)
            condition = value == got
        elif count != 0 and candidates and all(m.type in self.NON_DUPLICATE_TYPES for m in candidates):
            msg = "Duplicate '{}' of non-duplicate type".format(name)
            condition = len(candidates) == 1
        elif count is not None:
            msg = "Needed exactly {} candidates for '{}', got {}".format(count, name, len(candidates))
            condition = len(candidates) == count
        else:
            msg = "Needed at least {} candidates for '{}', got {}".format(at_least, name, len(candidates))
            condition = len(candidates) >= at_least
        self._assert(condition, msg=msg, expected_stub=expected_metric, submitted_elements=self._metrics)    
        ...

Not sure how to tell the aggregator when the check method is called twice in integration tests. Maybe this can be asserted at the check level in tests through assert_no_duplicate_*?
The dd_agent_check fixture populates the aggregator with JSON check output in replay_check_run, which would result in one rate value despite duplicates (when called with rate)

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
PR title must be written as a CHANGELOG entry (see why)
Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
PR must have changelog/ and integration/ labels attached

codecov · 2021-06-02T23:11:51Z

Codecov Report

Merging #9463 (ad8802a) into master (d34d3f8) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Flag	Coverage Δ
active_directory	`100.00% <ø> (ø)`
activemq_xml	`82.31% <ø> (ø)`
aerospike	`86.92% <ø> (ø)`
airflow	`89.94% <ø> (ø)`
amazon_msk	`87.82% <ø> (ø)`
ambari	`86.98% <ø> (ø)`
apache	`94.90% <ø> (ø)`
aspdotnet	`93.87% <ø> (ø)`
azure_iot_edge	`82.01% <ø> (ø)`
btrfs	`82.91% <ø> (ø)`
cacti	`84.01% <ø> (ø)`
cassandra_nodetool	`94.19% <ø> (ø)`
ceph	`91.04% <ø> (ø)`
cilium	`85.84% <ø> (+1.88%)`	⬆️
cisco_aci	`95.88% <ø> (ø)`
clickhouse	`96.95% <ø> (ø)`
cloud_foundry_api	`95.98% <ø> (+0.12%)`	⬆️
cockroachdb	`97.18% <ø> (ø)`
consul	`93.84% <ø> (ø)`
coredns	`96.36% <ø> (ø)`
couch	`94.81% <ø> (+0.74%)`	⬆️
couchbase	`81.45% <ø> (ø)`
crio	`100.00% <ø> (ø)`
datadog_checks_base	`89.56% <100.00%> (+0.35%)`	⬆️
datadog_checks_dev	`80.62% <ø> (-0.02%)`	⬇️
datadog_checks_downloader	`80.40% <ø> (ø)`
directory	`94.70% <ø> (ø)`
disk	`91.00% <ø> (-0.51%)`	⬇️
dns_check	`94.44% <ø> (ø)`
dotnetclr	`100.00% <ø> (ø)`
druid	`97.70% <ø> (ø)`
ecs_fargate	`77.39% <ø> (ø)`
eks_fargate	`94.05% <ø> (ø)`
elastic	`88.54% <ø> (ø)`
envoy	`93.68% <ø> (+0.26%)`	⬆️
etcd	`93.09% <ø> (ø)`
exchange_server	`100.00% <ø> (ø)`
external_dns	`100.00% <ø> (ø)`
fluentd	`94.77% <ø> (ø)`
gearmand	`77.27% <ø> (+1.29%)`	⬆️
gitlab	`89.94% <ø> (ø)`
gitlab_runner	`90.32% <ø> (ø)`
glusterfs	`80.09% <ø> (+0.92%)`	⬆️
go_expvar	`91.95% <ø> (ø)`
gunicorn	`94.29% <ø> (+0.76%)`	⬆️
haproxy	`95.22% <ø> (+0.17%)`	⬆️
harbor	`91.58% <ø> (ø)`
hazelcast	`92.39% <ø> (ø)`
hdfs_datanode	`90.00% <ø> (ø)`
hdfs_namenode	`87.94% <ø> (ø)`
http_check	`89.96% <ø> (+1.82%)`	⬆️
ibm_db2	`93.87% <ø> (ø)`
ibm_mq	`89.99% <ø> (+1.45%)`	⬆️
ibm_was	`97.44% <ø> (ø)`
iis	`92.41% <ø> (ø)`
istio	`93.18% <ø> (+0.56%)`	⬆️
kafka_consumer	`81.15% <ø> (ø)`
kong	`93.33% <ø> (ø)`
kube_apiserver_metrics	`97.35% <ø> (ø)`
kube_controller_manager	`97.05% <ø> (ø)`
kube_dns	`98.85% <ø> (ø)`
kube_metrics_server	`100.00% <ø> (ø)`
kube_proxy	`100.00% <ø> (ø)`
kube_scheduler	`98.07% <ø> (ø)`
kubelet	`89.47% <ø> (ø)`
kubernetes_state	`89.69% <ø> (+0.03%)`	⬆️
kyototycoon	`85.96% <ø> (ø)`
lighttpd	`83.64% <ø> (ø)`
linkerd	`87.05% <ø> (+1.17%)`	⬆️
linux_proc_extras	`96.22% <ø> (ø)`
mapr	`84.97% <ø> (ø)`
mapreduce	`82.27% <ø> (+0.45%)`	⬆️
marathon	`83.12% <ø> (ø)`
marklogic	`95.32% <ø> (ø)`
mcache	`93.39% <ø> (ø)`
mesos_master	`92.20% <ø> (ø)`
mesos_slave	`93.60% <ø> (ø)`
mongo	`94.74% <ø> (+0.28%)`	⬆️
mysql	`85.18% <ø> (+0.25%)`	⬆️
nagios	`89.53% <ø> (ø)`
network	`77.76% <ø> (+1.00%)`	⬆️
nfsstat	`95.20% <ø> (ø)`
nginx	`95.11% <ø> (+0.93%)`	⬆️
nginx_ingress_controller	`98.30% <ø> (ø)`
openldap	`96.33% <ø> (ø)`
openmetrics	`97.14% <ø> (ø)`
openstack	`51.30% <ø> (ø)`
openstack_controller	`90.59% <ø> (ø)`
oracle	`93.61% <ø> (+0.63%)`	⬆️
pdh_check	`97.77% <ø> (ø)`
pgbouncer	`91.50% <ø> (ø)`
php_fpm	`89.95% <ø> (+0.43%)`	⬆️
postfix	`88.04% <ø> (ø)`
postgres	`92.20% <ø> (+0.78%)`	⬆️
powerdns_recursor	`95.93% <ø> (ø)`
process	`85.12% <ø> (+0.28%)`	⬆️
prometheus	`94.17% <ø> (ø)`
proxysql	`99.62% <ø> (ø)`
rabbitmq	`93.73% <ø> (ø)`
redisdb	`86.87% <ø> (ø)`
rethinkdb	`97.93% <ø> (ø)`
riak	`99.22% <ø> (ø)`
riakcs	`93.61% <ø> (ø)`
sap_hana	`93.04% <ø> (ø)`
scylla	`100.00% <ø> (ø)`
snmp	`91.57% <ø> (+0.04%)`	⬆️
snowflake	`94.44% <ø> (+0.58%)`	⬆️
sonarqube	`95.69% <ø> (ø)`
spark	`93.64% <ø> (ø)`
sqlserver	`81.71% <ø> (ø)`
squid	`100.00% <ø> (ø)`
ssh_check	`91.08% <ø> (ø)`
statsd	`87.36% <ø> (+1.05%)`	⬆️
supervisord	`92.18% <ø> (ø)`
system_core	`91.04% <ø> (ø)`
system_swap	`98.30% <ø> (ø)`
tcp_check	`86.53% <ø> (ø)`
teamcity	`80.00% <ø> (ø)`
tls	`97.04% <ø> (+0.87%)`	⬆️
tokumx	`58.40% <ø> (?)`
twemproxy	`78.33% <ø> (ø)`
twistlock	`80.74% <ø> (ø)`
varnish	`84.57% <ø> (+0.24%)`	⬆️
vault	`94.80% <ø> (+0.54%)`	⬆️
vertica	`92.27% <ø> (ø)`
voltdb	`96.81% <ø> (ø)`
vsphere	`89.74% <ø> (+0.05%)`	⬆️
win32_event_log	`86.03% <ø> (+0.28%)`	⬆️
windows_service	`95.83% <ø> (ø)`
wmi_check	`92.91% <ø> (ø)`
yarn	`90.30% <ø> (ø)`
zk	`85.17% <ø> (+0.75%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...s_base/tests/stubs/test_aggregator_no_duplicate.py	`100.00% <ø> (ø)`
...hecks_base/datadog_checks/base/stubs/aggregator.py	`66.92% <100.00%> (ø)`
...checks_dev/datadog_checks/dev/tooling/constants.py	`90.90% <0.00%> (-2.64%)`	⬇️
disk/datadog_checks/disk/disk.py	`79.05% <0.00%> (-1.36%)`	⬇️
datadog_checks_base/tests/test_prometheus.py	`99.48% <0.00%> (-0.26%)`	⬇️
datadog_checks_base/tests/test_openmetrics.py	`97.47% <0.00%> (-0.23%)`	⬇️
ibm_mq/tests/test_ibm_mq_int.py	`100.00% <0.00%> (ø)`
kubernetes_state/tests/test_kubernetes_state.py	`97.33% <0.00%> (ø)`
sqlserver/datadog_checks/sqlserver/connection.py	`76.62% <0.00%> (ø)`
...g_checks/win32_event_log/legacy/win32_event_log.py	`69.76% <0.00%> (ø)`
... and 71 more

coignetp

Thanks a lot for the investigation! This solution is the best option in my opinion. We should use this assert_no_duplicate_metrics method in more integrations, though.

coignetp · 2021-06-04T08:02:32Z

datadog_checks_base/tests/stubs/test_aggregator_no_duplicate.py

@@ -75,8 +75,6 @@ def test_assert_no_duplicate_message(aggregator):
            [
                dict(type='count', name='metric.count', value=1, tags=['aa'], hostname='1'),
                dict(type='count', name='metric.count', value=1, tags=['aa'], hostname='1'),
-                dict(type='monotonic_count', name='metric.monotonic_count', value=1, tags=['aa'], hostname='1'),
-                dict(type='monotonic_count', name='metric.monotonic_count', value=1, tags=['aa'], hostname='1'),


Let's add a rate case to the duplicate metric section above.

coignetp · 2021-06-09T13:03:05Z

datadog_checks_base/datadog_checks/base/stubs/aggregator.py

@@ -418,7 +418,7 @@ def assert_no_duplicate_metrics(self):
        - hostname
        """
        # metric types that intended to be called multiple times are ignored
-        ignored_types = [self.COUNT, self.MONOTONIC_COUNT, self.COUNTER]


Double submission of a monotonic_count is allowed and has a defined behaviour; however, it is an easy source of errors. Since no integration submits a monotonic_count twice on purpose, let's raise an error in these cases.
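
As a rough illustration of the effect of this change, the sketch below counts contexts (type, name, sorted tags, hostname) and skips the ignored types. `find_duplicate_contexts` and the plain-dict metrics are hypothetical, and the ignored set without monotonic_count is inferred from the PR title rather than copied from the merged code.

```python
from collections import Counter

# Assumed post-change ignored set: monotonic_count is no longer listed.
IGNORED_TYPES = frozenset({"count", "counter"})


def find_duplicate_contexts(metrics, ignored_types=IGNORED_TYPES):
    """Return {context: occurrences} for contexts submitted more than once."""
    contexts = Counter(
        (m["type"], m["name"], tuple(sorted(m["tags"])), m["hostname"])
        for m in metrics
        if m["type"] not in ignored_types
    )
    return {ctx: n for ctx, n in contexts.items() if n > 1}


metrics = [
    dict(type="count", name="metric.count", value=1, tags=["aa"], hostname="1"),
    dict(type="count", name="metric.count", value=1, tags=["aa"], hostname="1"),
    dict(type="monotonic_count", name="metric.monotonic_count", value=1, tags=["aa"], hostname="1"),
    dict(type="monotonic_count", name="metric.monotonic_count", value=1, tags=["aa"], hostname="1"),
]

# Duplicate counts are still tolerated; the duplicate monotonic_count is reported.
duplicates = find_duplicate_contexts(metrics)
```

Passing an ignored set that still contains "monotonic_count" reproduces the earlier behaviour, where the same input reports no duplicates.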

…9463) * removed monotonic count from ignored types in no-duplicate assertion * added test cases for duplicate rate/monotonic count metric 6dfc81f

removed monotonic count from ignored types in no-duplicate assertion

9afd01c

npezzotti requested review from a team as code owners June 2, 2021 23:07

ghost added the integration/datadog_checks_base label Jun 2, 2021

npezzotti added the changelog/Changed label Jun 2, 2021

coignetp requested changes Jun 4, 2021

added test cases for duplicate rate/monotonic count metric

ad8802a

coignetp approved these changes Jun 9, 2021

coignetp merged commit 6dfc81f into master Jun 18, 2021

coignetp deleted the nathan/aggregator-test-multiple-rate branch June 18, 2021 13:58
