changefeedccl: fail changefeed when server.child_metrics.enabled cluster setting is false and metrics label config used #75682

Closed
amruss opened this issue Jan 29, 2022 · 3 comments · Fixed by #94948
Assignees
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-quick-win Likely to be a quick win for someone experienced. T-cdc

Comments

@amruss
Contributor

amruss commented Jan 29, 2022

See: https://cockroachlabs.atlassian.net/wiki/spaces/CDC/pages/2398552506/22.1+Metrics+Labels+Acceptance+Testing for reproducibility

When creating a changefeed with the CREATE CHANGEFEED ... WITH metrics_label=X option, the user must set the cluster setting server.child_metrics.enabled=true for the feature to work. If they do not set this cluster setting, we still allow them to create the changefeed, but the metrics label is silently not applied. We should instead fail changefeed creation in this case, with an error message similar to the one shown when COCKROACH_EXPERIMENTAL_ENABLE_PER_CHANGEFEED_METRICS=true is not set:

image

Jira issue: CRDB-12779

Epic CRDB-13931

@amruss amruss added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels Jan 29, 2022
@blathers-crl

blathers-crl bot commented Jan 29, 2022

cc @cockroachdb/cdc

@amruss
Contributor Author

amruss commented Feb 2, 2022

We decided to surface a warning instead, since we want users to be able to toggle this cluster setting to control whether they receive the scopes.

@amruss amruss added the E-quick-win Likely to be a quick win for someone experienced. label Feb 2, 2022
@miretskiy
Contributor

@samiskin -- close this issue? or is it being worked on?

samiskin added a commit to samiskin/cockroach that referenced this issue Jan 9, 2023
Resolves cockroachdb#75682

Surfaces a notice of
```
server.child_metrics.enabled is set to false, metrics will only be published to
the '<scope>' label when it is set to true
```
when the child_metrics setting isn't enabled during changefeed creation.

Release note (enterprise change): Changefeeds created/altered with a
metrics_label set while server.child_metrics.enabled is false will now provide
the user a notice upon creation.

craig bot pushed a commit that referenced this issue Jan 17, 2023
94239: loqrecovery: use captured meta range content for LOQ plans r=erikgrinaker a=aliher1911

Note: only the last commit belongs to this PR. Will update description once #93157 lands.

Previously, the loss of quorum recovery planner used local replica info collected from all nodes to find the source of truth for replicas that lost quorum.
With the online approach, local info snapshots are not atomic. This could cause the planner to fail if available replicas are caught in different states on different nodes.
This commit adds an alternative planning approach for when range metadata is available. Instead of fixing individual replicas that can't make progress, it uses descriptors from the metadata to find ranges that can't make progress and updates their replicas to recover from loss of quorum.
This commit also adds a replica collection stage as part of the make-plan command itself. To invoke collection from a cluster instead of using files, one needs to provide --host and other standard cluster connection flags (--cert-dir, --insecure, etc.) as appropriate.
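
As a rough illustration of the descriptor-based idea described above, the following is a minimal sketch and not the planner's actual code. The types `replica` and `rangeDesc` and the functions `hasQuorum` and `rangesWithoutQuorum` are hypothetical names invented for this example, and it assumes every replica in a descriptor is a voter, which is a simplification.

```go
package main

import "fmt"

// replica is a hypothetical, simplified replica descriptor (assumption:
// every replica is treated as a voter).
type replica struct {
	NodeID, StoreID, ReplicaID int
}

// rangeDesc is a hypothetical, simplified range descriptor as it might be
// read from range metadata.
type rangeDesc struct {
	RangeID  int
	Replicas []replica
}

// hasQuorum reports whether a strict majority of the range's replicas
// live on nodes that are still available.
func hasQuorum(d rangeDesc, live map[int]bool) bool {
	alive := 0
	for _, r := range d.Replicas {
		if live[r.NodeID] {
			alive++
		}
	}
	return alive > len(d.Replicas)/2
}

// rangesWithoutQuorum returns the IDs of ranges that cannot make progress
// given the set of live nodes.
func rangesWithoutQuorum(descs []rangeDesc, live map[int]bool) []int {
	var lost []int
	for _, d := range descs {
		if !hasQuorum(d, live) {
			lost = append(lost, d.RangeID)
		}
	}
	return lost
}

func main() {
	// Nodes n1, n2 and n3 survive; n4 and n5 are gone.
	live := map[int]bool{1: true, 2: true, 3: true}
	descs := []rangeDesc{
		// Mirrors r4 in the example output: one live replica out of three.
		{RangeID: 4, Replicas: []replica{{2, 2, 3}, {5, 5, 4}, {4, 4, 2}}},
		// Made-up healthy range for contrast.
		{RangeID: 1, Replicas: []replica{{1, 1, 1}, {2, 2, 2}, {3, 3, 3}}},
	}
	fmt.Println(rangesWithoutQuorum(descs, live)) // prints [4]
}
```

Working from descriptors means the decision is made per range from a single view of the metadata, rather than by reconciling per-node replica snapshots that may disagree.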

Example command output for a local cluster with 3 out of 5 nodes surviving looks like:
```
~/tmp ❯❯❯ cockroach debug recover make-plan --insecure --host 127.0.0.1:26257 >recover-plan.json
Nodes scanned:           3
Total replicas analyzed: 228
Ranges without quorum:   15
Discarded live replicas: 0

Proposed changes:
  range r4:/System/tsd updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r80:/Table/106/1 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r87:/Table/106/1/"paris"/"\xcc\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x00(" updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):3,(n4,s4):2].
  range r88:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r105:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica (n3,s3):3 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n5,s5):1,(n4,s4):2].
  range r98:/Table/107/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\x03" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r95:/Table/107/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x06" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r125:/Table/107/1/"washington dc"/"DDDDDDD\x00\x80\x00\x00\x00\x00\x00\x00\x04" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r115:/Table/108/1/"boston"/"8Q\xeb\x85\x1e\xb8B\x00\x80\x00\x00\x00\x00\x00\x00n" updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r104:/Table/108/1/"new york"/"\x1c(\xf5\u008f\\I\x00\x80\x00\x00\x00\x00\x00\x007" updating replica (n2,s2):2 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):3].
  range r102:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica (n3,s3):2 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n4,s4):4,(n5,s5):3].
  range r126:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica (n3,s3):2 to (n3,s3):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):1,(n5,s5):3].
  range r86:/Table/108/3 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n4,s4):3,(n5,s5):2].
  range r59:/Table/109/1 updating replica (n2,s2):3 to (n2,s2):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].
  range r65:/Table/111/1 updating replica (n3,s3):3 to (n3,s3):15. Discarding available replicas: [], discarding dead replicas: [(n5,s5):4,(n4,s4):2].

Discovered dead nodes would be marked as decommissioned:
  n4, n5


Proceed with plan creation [y/N] y
Plan created.
To stage recovery application in half-online mode invoke:

'cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true <plan file>'

Alternatively distribute plan to below nodes and invoke 'debug recover apply-plan --store=<store-dir> <plan file>' on:
- node n2, store(s) s2
- node n1, store(s) s1
- node n3, store(s) s3
```

Release note: None

Fixes: #93038
Fixes: #93046

94948: changefeedccl: give notice when metrics_label set without child_metrics r=samiskin a=samiskin

Resolves #75682

Surfaces a notice of
```
server.child_metrics.enabled is set to false, metrics will only be published to the '<scope>' label when it is set to true
```
when the child_metrics setting isn't enabled during changefeed creation.

Release note (enterprise change): Changefeeds created/altered with a metrics_label set while server.child_metrics.enabled is false will now provide the user a notice upon creation.
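
The behavior described here can be sketched as below. This is an illustrative approximation and not the code from the PR: `changefeedOptions` and `metricsLabelNotice` are hypothetical names, and the real implementation reads the cluster setting and emits the notice through CockroachDB's own machinery. The point of the sketch is that creation is never failed; a notice is returned only when a label is set while the setting is off.

```go
package main

import "fmt"

// changefeedOptions is a hypothetical stand-in for the options parsed from
// CREATE CHANGEFEED ... WITH metrics_label=X.
type changefeedOptions struct {
	MetricsLabel string // empty when no metrics_label was given
}

// metricsLabelNotice returns the notice to show the user, or "" when none
// is needed. It never returns an error, so changefeed creation proceeds
// regardless of the cluster setting.
func metricsLabelNotice(opts changefeedOptions, childMetricsEnabled bool) string {
	if opts.MetricsLabel == "" || childMetricsEnabled {
		return ""
	}
	return fmt.Sprintf(
		"server.child_metrics.enabled is set to false, metrics will only be published to the '%s' label when it is set to true",
		opts.MetricsLabel,
	)
}

func main() {
	// "orders" is an arbitrary example label.
	opts := changefeedOptions{MetricsLabel: "orders"}
	if notice := metricsLabelNotice(opts, false); notice != "" {
		fmt.Println("NOTICE:", notice)
	}
}
```

Returning a notice rather than an error keeps the setting toggleable after the fact: the changefeed keeps its label, and metrics start appearing under it once server.child_metrics.enabled is turned on.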

95009: tree: fix panic when encoding tuple r=rafiss a=rafiss

fixes #95008

This adds a bounds check to avoid a panic.

Release note (bug fix): Fixed a crash that could happen when formatting a tuple with an unknown type.

95294: sql: make pg_description aware of builtin function descriptions r=rafiss,msirek a=knz

Epic: CRDB-23454
Fixes #95292.
Needed for #88061. 

First commit from #95289.

This also extends the completion rules to properly handle
functions in multiple namespaces.

Release note (bug fix): `pg_catalog.pg_description` and `pg_catalog.obj_description()` are now able to retrieve the descriptive help for built-in functions.

95356: server: remove unused migrationExecutor r=ajwerner a=ajwerner

This is no longer referenced since #91627.

Epic: none

Release note: None

Co-authored-by: Oleg Afanasyev
Co-authored-by: Shiranka Miskin
Co-authored-by: Rafi Shamim
Co-authored-by: Raphael 'kena' Poss
Co-authored-by: Andrew Werner
@craig craig bot closed this as completed in f5a5b4b Jan 17, 2023