diagnostics: add datadriven testing to diagnostics output to detect regressions/changes #134450

Closed
dhartunian opened this issue Nov 6, 2024 · 2 comments · Fixed by #139062
Labels
O-postmortem Originated from a Postmortem action item. P-1 Issues/test failures with a fix SLA of 1 month T-observability

Comments

dhartunian (Collaborator) commented Nov 6, 2024

Today, it's not possible to easily detect changes in the diagnostic output for a cluster.

What's been tricky here is that the output of diagnostics is non-deterministic. We can probably filter the output lightly to make a deterministic payload for testing. The parts that need filtering will likely be the SQL and schema stats data, which can contain arbitrary numbers.
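As a rough illustration of the "filter lightly" idea, here is a minimal Go sketch that replaces every number in a JSON payload with a placeholder so two runs can be compared against a golden file. It assumes the diagnostics payload can be treated as JSON; the `scrub` and `normalize` helpers and the `NUM` placeholder are hypothetical names chosen for this example, not part of the CockroachDB codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scrub walks a decoded JSON value and replaces every number with a fixed
// placeholder, so runs that differ only in counters compare equal.
// (Hypothetical helper, for illustration only.)
func scrub(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			out[k] = scrub(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = scrub(val)
		}
		return out
	case float64: // encoding/json decodes all JSON numbers as float64
		return "NUM"
	default:
		return v
	}
}

// normalize returns a deterministic rendering of a JSON payload: numbers are
// scrubbed and encoding/json emits map keys in sorted order.
func normalize(payload []byte) (string, error) {
	var v any
	if err := json.Unmarshal(payload, &v); err != nil {
		return "", err
	}
	out, err := json.MarshalIndent(scrub(v), "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Two made-up payloads that differ only in numeric values and key order.
	a, _ := normalize([]byte(`{"table":"t","rows":17,"stats":[1,2]}`))
	b, _ := normalize([]byte(`{"stats":[9,8],"rows":99,"table":"t"}`))
	fmt.Println(a == b)
}
```

The normalized string could then be checked against an expected-output file, so any structural change in the report shows up as a diff while the arbitrary numbers do not.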

Jira issue: CRDB-44083

@dhartunian dhartunian added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-postmortem Originated from a Postmortem action item. P-1 Issues/test failures with a fix SLA of 1 month T-observability labels Nov 6, 2024
blathers-crl bot commented Nov 6, 2024

Hi @dhartunian, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@dhartunian dhartunian removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 6, 2024
@exalate-issue-sync exalate-issue-sync bot added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 6, 2024
blathers-crl bot commented Nov 6, 2024

Hi @exalate-issue-sync[bot], please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@exalate-issue-sync exalate-issue-sync bot removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 7, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 14, 2025
This change adds test coverage to the diagnostic reporter that's
meant to catch situations where schema or statement scrubbing is
accidentally turned off.

In the course of adding tests for SQL Stats it was discovered that the
diagnostics reporter would include statements that were in internal
applications (`$ internal` prefix) so a change was made to omit those
from the reports.

Resolves: cockroachdb#134450

Release note: None
craig bot pushed a commit that referenced this issue Jan 17, 2025
139062: server,sql: increase redaction coverage of diagnostics tests r=angles-n-daemons a=dhartunian

This change adds test coverage to the diagnostic reporter that's meant to catch situations where schema or statement scrubbing is accidentally turned off.

In the course of adding tests for SQL Stats it was discovered that the diagnostics reporter would include statements that were in internal applications (`$ internal` prefix) so a change was made to omit those from the reports.
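The omission described above amounts to a prefix check on the application name. The following Go sketch shows that idea under simplifying assumptions: `stmtStat` and `omitInternal` are hypothetical names for this example and do not reflect the actual reporter code; only the `$ internal` prefix comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// internalAppPrefix is the application-name prefix named in the commit message.
const internalAppPrefix = "$ internal"

// stmtStat is a hypothetical, simplified stand-in for one statement
// statistics entry in a diagnostics report.
type stmtStat struct {
	App   string
	Query string
}

// omitInternal returns only the entries whose application name does not
// start with the internal prefix.
func omitInternal(stats []stmtStat) []stmtStat {
	out := make([]stmtStat, 0, len(stats))
	for _, s := range stats {
		if strings.HasPrefix(s.App, internalAppPrefix) {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Made-up entries: one user application, one internal application.
	stats := []stmtStat{
		{App: "myapp", Query: "SELECT _"},
		{App: "$ internal-example", Query: "UPSERT _"},
	}
	for _, s := range omitInternal(stats) {
		fmt.Println(s.App, s.Query)
	}
}
```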

Resolves: #134450

Release note: None

139066: db-console: update rac v2 overload dashboard charts r=sumeerbhola a=kvoli

Update the db console overload dashboard to:

- remove metrics associated with v1 replication admission control
- rename metrics associated with v2 replication admission control to
  remove the version reference
- add a chart containing the per-node send queue size in bytes

<details><summary>Screenshots</summary>
<p>

![image](https://github.com/user-attachments/assets/5ce5b9eb-4f87-4a4b-a6a5-185c688f199e)
![image](https://github.com/user-attachments/assets/faea8862-0f90-415c-8ce1-0ece9b40f988)
![image](https://github.com/user-attachments/assets/9667f41b-607c-4b17-b3c4-dceba6e77ccb)

</p>
</details> 

Resolves: #128039
Release note (ui change): The overload dashboard on DB Console now shows
only the v2 replication admission control metrics, where previously it
displayed both v1 and v2 metrics. Additionally, the aggregate size of
queued replication entries is now shown.

139171: sql: use parsed statements for persistedsqlstats r=fqazi a=fqazi

Previously, we would re-parse the SQL statements used to upsert statement and txn stats on every execution. To address this, this patch parses these statements once and uses ExecParsed to reduce CPU usage. It also adds a simple benchmark for this code path, which shows a small delta of about 1%.
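The parse-once pattern can be sketched generically in Go as follows. Everything here is a toy stand-in: `parse`, `execParsed`, `parsedStmt`, and the `stats` table are hypothetical and are not the CockroachDB parser or its `ExecParsed` API. The sketch only shows that the statement text is parsed a single time however often it is executed.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parsedStmt is a hypothetical stand-in for a parsed SQL statement.
type parsedStmt struct{ tokens []string }

// parseCalls counts how often the (notionally expensive) parse step runs.
var parseCalls int

// parse is a toy parser that only splits the text on whitespace.
func parse(sql string) parsedStmt {
	parseCalls++
	return parsedStmt{tokens: strings.Fields(sql)}
}

// execParsed is a hypothetical executor that accepts an already-parsed
// statement, so no parsing happens on the execution path.
func execParsed(p parsedStmt, args ...any) int {
	return len(p.tokens) + len(args)
}

const upsertSQL = "UPSERT INTO stats VALUES ($1, $2)"

var (
	upsertOnce   sync.Once
	upsertParsed parsedStmt
)

// flush executes the upsert n times but parses the statement text at most
// once for the lifetime of the process.
func flush(n int) {
	upsertOnce.Do(func() { upsertParsed = parse(upsertSQL) })
	for i := 0; i < n; i++ {
		execParsed(upsertParsed, i, i*2)
	}
}

func main() {
	flush(100)
	flush(100)
	fmt.Println("parse calls:", parseCalls)
}
```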

Before:
BenchmarkSQLStatsFlush          100        1415926687 ns/op        319339313 B/op   2302002 allocs/op
After:
BenchmarkSQLStatsFlush          100        1396673170 ns/op        319003310 B/op   2298192 allocs/op
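To make the size of the delta concrete, the relative reductions can be computed directly from the two benchmark lines above. This is a small Go sketch; `pctReduction` is a helper introduced only for this example. On these numbers, the ns/op reduction comes out at roughly 1.4%, with much smaller changes in bytes and allocations per op.

```go
package main

import "fmt"

// pctReduction returns how much smaller after is than before, in percent.
func pctReduction(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Values copied from the Before/After benchmark output.
	fmt.Printf("ns/op:     %.2f%%\n", pctReduction(1415926687, 1396673170))
	fmt.Printf("B/op:      %.2f%%\n", pctReduction(319339313, 319003310))
	fmt.Printf("allocs/op: %.2f%%\n", pctReduction(2302002, 2298192))
}
```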

Fixes: #134583
Release note: None

139273: roachtest: collect qps metrics over longer window in gracefuldrain test r=arulajmani a=arulajmani

The gracefuldrain test was modernized in cf30717. Prior to that commit, QPS metrics were collected over a 10s interval, whereas the modernization refactor changed this to 1 second intervals. Looking at a few recent test failures, I see QPS metrics above the failure threshold, which makes me suspect that this 1s interval is causing the sorts of inaccuracies MeasureQPS warns against. Also see #133020 (comment).

One thing that doesn't line up is the timeline of this test's failures and cf30717. Still, this patch changes the metric's interval back to 10s.
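The effect of the measurement window can be illustrated with a toy calculation. The Go sketch below is not how MeasureQPS is implemented; `windowedQPS`, `maxOf`, and the sample counter values are hypothetical. It only shows that a brief burst looks far larger when the rate is averaged over 1s than over 10s.

```go
package main

import "fmt"

// windowedQPS takes cumulative query counts sampled once per second and
// returns the average rate over each consecutive window of windowSecs seconds.
func windowedQPS(cumulative []int, windowSecs int) []float64 {
	var rates []float64
	for i := windowSecs; i < len(cumulative); i += windowSecs {
		delta := cumulative[i] - cumulative[i-windowSecs]
		rates = append(rates, float64(delta)/float64(windowSecs))
	}
	return rates
}

// maxOf returns the largest value in xs, or 0 for an empty slice.
func maxOf(xs []float64) float64 {
	var m float64
	for _, x := range xs {
		if x > m {
			m = x
		}
	}
	return m
}

func main() {
	// Made-up counter: a mostly idle node with a single one-second burst of
	// 50 queries, sampled at t = 0s through t = 10s.
	cumulative := []int{0, 0, 0, 0, 50, 50, 50, 50, 50, 50, 50}
	fmt.Println("peak QPS, 1s windows: ", maxOf(windowedQPS(cumulative, 1)))
	fmt.Println("peak QPS, 10s windows:", maxOf(windowedQPS(cumulative, 10)))
}
```

With this sample data the 1s windows report a peak of 50 QPS while the single 10s window reports 5 QPS, so a failure threshold between those two values would trip only under the shorter interval.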

References #133020

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
@craig craig bot closed this as completed in f876eb6 Jan 17, 2025