diagnostics: add datadriven testing to diagnostics output to detect regressions/changes #134450
Labels
O-postmortem
Originated from a Postmortem action item.
P-1
Issues/test failures with a fix SLA of 1 month
T-observability
Comments
Hi @dhartunian, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Hi @exalate-issue-sync[bot], please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 14, 2025
This change adds test coverage to the diagnostic reporter that's meant to catch situations where schema or statement scrubbing is accidentally turned off. In the course of adding tests for SQL Stats it was discovered that the diagnostics reporter would include statements that were in internal applications (`$ internal` prefix) so a change was made to omit those from the reports. Resolves: cockroachdb#134450 Release note: None
craig bot pushed a commit that referenced this issue Jan 17, 2025
139062: server,sql: increase redaction coverage of diagnostics tests r=angles-n-daemons a=dhartunian

This change adds test coverage to the diagnostic reporter that's meant to catch situations where schema or statement scrubbing is accidentally turned off. In the course of adding tests for SQL Stats it was discovered that the diagnostics reporter would include statements that were in internal applications (`$ internal` prefix) so a change was made to omit those from the reports.

Resolves: #134450

Release note: None

139066: db-console: update rac v2 overload dashboard charts r=sumeerbhola a=kvoli

Update the db console overload dashboard to:
- remove metrics associated with v1 replication admission control
- rename metrics associated with v2 replication admission control to remove the version reference
- add a chart containing the per-node send queue size in bytes

<details><summary>Screenshots</summary>
<p>

![image](https://github.com/user-attachments/assets/5ce5b9eb-4f87-4a4b-a6a5-185c688f199e)
![image](https://github.com/user-attachments/assets/faea8862-0f90-415c-8ce1-0ece9b40f988)
![image](https://github.com/user-attachments/assets/9667f41b-607c-4b17-b3c4-dceba6e77ccb)

</p>
</details>

Resolves: #128039

Release note (ui change): The overload dashboard on DB Console now shows only the v2 replication admission control metrics, where previously it displayed both v1 and v2 metrics. Additionally, the aggregate size of queued replication entries is now shown.

139171: sql: use parsed statements for persistedsqlstats r=fqazi a=fqazi

Previously, we would re-parse SQL statements used to upsert statement and txn stats. To address this, this patch parses these statements once and uses ExecParsed to reduce CPU usage. This patch also adds a simple benchmark for this code path, which shows a small 1% delta.

Before: BenchmarkSQLStatsFlush 100 1415926687 ns/op 319339313 B/op 2302002 allocs/op
After: BenchmarkSQLStatsFlush 100 1396673170 ns/op 319003310 B/op 2298192 allocs/op

Fixes: #134583

Release note: None

139273: roachtest: collect qps metrics over longer window in gracefuldrain test r=arulajmani a=arulajmani

The gracefuldrain test was modernized in cf30717. Prior to that commit, QPS metrics were collected over a 10s interval, whereas the modernization refactor changed this to 1 second intervals. Looking at a few recent test failures, I see QPS metrics above the failure threshold, which makes me suspect that this 1s interval is causing the sorts of inaccuracies MeasureQPS warns against. Also see #133020 (comment).

One thing that doesn't line up is the timeline of this test's failures and cf30717. Still, this patch changes the metric's interval back to 10s.

References #133020

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Today, there is no easy way to detect changes in the diagnostic output for a cluster.
The tricky part is that the diagnostics output is non-deterministic. We can probably filter the output lightly to produce a deterministic payload for testing. The filtering will likely need to cover the SQL and schema stats data, which can contain arbitrary numbers.
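One way to picture the light filtering is the sketch below. It assumes the diagnostics payload can be rendered as JSON, which this issue does not state, and it replaces every numeric value with a fixed placeholder so that two runs differing only in counters produce identical text suitable for a golden-file comparison. The function names `scrubNumbers` and `normalizeDiagnostics`, the `NUM` placeholder, and the sample field names are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// scrubNumbers walks a decoded JSON value and replaces every number
// with the placeholder string "NUM". Strings, booleans, and nulls are
// returned unchanged.
func scrubNumbers(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			out[k] = scrubNumbers(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = scrubNumbers(val)
		}
		return out
	case json.Number:
		return "NUM"
	default:
		return v
	}
}

// normalizeDiagnostics parses a JSON payload (an assumed format) and
// returns an indented rendering with all numbers scrubbed. encoding/json
// writes map keys in sorted order, so the key order is stable.
func normalizeDiagnostics(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // decode numbers as json.Number rather than float64
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	out, err := json.MarshalIndent(scrubNumbers(v), "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Two hypothetical payloads that differ only in their numbers.
	runA := []byte(`{"sqlstats": [{"query": "SELECT _", "count": 12}], "uptime": 301}`)
	runB := []byte(`{"uptime": 987, "sqlstats": [{"query": "SELECT _", "count": 40}]}`)

	a, err := normalizeDiagnostics(runA)
	if err != nil {
		panic(err)
	}
	b, err := normalizeDiagnostics(runB)
	if err != nil {
		panic(err)
	}
	fmt.Println(a)
	fmt.Println("identical after scrubbing:", a == b)
}
```

Scrubbing every number is deliberately coarse. A real test would probably want to keep numbers that are expected to be stable and scrub only the stats fields, so that regressions in those stable values are still caught.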
Jira issue: CRDB-44083