changefeedccl: error decoding key during TestChangefeedNemeses #137125
Comments
Hi @wenyihu6, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
cc @cockroachdb/cdc |
Added a 24.2 label, but once we can repro this we should test if this issue occurs in earlier releases. |
Decoder worked fine here:
The error happened here, when we tried decoding it using a different schema timestamp, 1737583093.061385999,2147483647, which is the previous timestamp before the backfill.
As an aside, it seems we have other failures for this test under stress, but they seem less flaky. |
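For readers unfamiliar with where a timestamp like 1737583093.061385999,2147483647 comes from, here is a minimal sketch of a "previous timestamp" computation. The `Timestamp` type below is a simplified, hypothetical stand-in for an HLC timestamp (wall time in nanoseconds plus a logical counter), and the starting value 1737583093.061386000,0 is an assumption inferred from the value quoted above, not something stated in the thread.

```go
package main

import (
	"fmt"
	"math"
)

const nanosPerSecond = 1_000_000_000

// Timestamp is a simplified, hypothetical stand-in for an HLC timestamp.
type Timestamp struct {
	WallTime int64 // nanoseconds since the Unix epoch
	Logical  int32 // logical counter that orders events within one wall time
}

// Prev returns the largest timestamp strictly smaller than t.
// When the logical part is already zero, it steps the wall time back by one
// nanosecond and sets the logical part to its maximum value.
func (t Timestamp) Prev() Timestamp {
	if t.Logical > 0 {
		return Timestamp{WallTime: t.WallTime, Logical: t.Logical - 1}
	}
	return Timestamp{WallTime: t.WallTime - 1, Logical: math.MaxInt32}
}

// String formats the timestamp as seconds.nanoseconds,logical.
func (t Timestamp) String() string {
	return fmt.Sprintf("%d.%09d,%d",
		t.WallTime/nanosPerSecond, t.WallTime%nanosPerSecond, t.Logical)
}

func main() {
	// Assumed timestamp of the backfill (hypothetical, see lead-in).
	backfillTS := Timestamp{WallTime: 1737583093061386000, Logical: 0}
	fmt.Println(backfillTS.Prev()) // 1737583093.061385999,2147483647
}
```

The logical component 2147483647 is the maximum 32-bit signed integer, which is why the quoted timestamp looks unusual: it is "one tick before" a timestamp whose logical part is zero.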
Another failure which seems pretty rare:
|
logTestChangefeedNemeses430013875.zip Failures for the one above ^ Update: filed #139653 for this |
This patch disables the declarative schema changer when TestChangefeedNemeses uses SQLSmith. Informs: cockroachdb#137125
This patch disables the declarative schema changer when TestChangefeedNemeses uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled as a workaround. This ensures nemesis testing can run in CI with SQLSmith without being skipped. Informs: cockroachdb#137125 Release note: None
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6
This patch disables the declarative schema changer when TestChangefeedNemeses uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled as a workaround. This ensures nemesis testing can run in CI with SQLSmith without being skipped.
Informs: #137125
Release note: None
139959: go.mod: bump Pebble to 1157615755bc r=RaduBerinde a=jbowens
Changes:
* [`11576157`](cockroachdb/pebble@11576157) db: add optional ValidateKey Comparer func
* [`303c8855`](cockroachdb/pebble@303c8855) metamorphic: fix time-bound filtering
* [`89cae40f`](cockroachdb/pebble@89cae40f) Update stale comment in comparer.go
* [`a4761ee8`](cockroachdb/pebble@a4761ee8) cache: readShard: use a separate mutex
* [`16466355`](cockroachdb/pebble@16466355) cockroachkvs: pull in MVCC block-property collector, filter
* [`2fa3b969`](cockroachdb/pebble@2fa3b969) internal/metamorphic: fix bugs in KeyFormat abstraction
* [`6c523bfb`](cockroachdb/pebble@6c523bfb) metamorphic: ignore rangedels for deduplicating point prefixes
Release note: none.
Epic: none.
Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
I noticed that this test passes when the declarative schema changer is disabled. |
139265: opt: remove incorrect query plans for trigram similarity filters r=normanchenn a=normanchenn
Previously, the optimizer would produce incorrect query plans for queries with trigram similarity filters when `pg_trgm.similarity_threshold == 0`, producing incorrect results. To address this, this patch adds a check to return early if `pg_trgm.similarity_threshold == 0` in trigram similarity queries on inverted indices.
Fixes: #122443
Release note (bug fix): The optimizer could produce incorrect query plans for queries using trigram similarity filters (e.g. `col % 'val'`) when `pg_trgm.similarity_threshold` was set to 0. This bug was introduced in v22.2.0 and is now fixed. Note that this issue does not affect v24.2.0+ releases when the `optimizer_use_trigram_similarity_optimization` session variable (introduced in v24.2.0) is set to its default value `true`, as it would skip this behaviour.
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6
This patch disables the declarative schema changer when TestChangefeedNemeses uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled as a workaround. This ensures nemesis testing can run in CI with SQLSmith without being skipped.
Informs: #137125
Release note: None
Co-authored-by: Norman Chen <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6 This patch disables the declarative schema changer when TestChangefeedNemeses uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled as a workaround. This ensures nemesis testing can run in CI with SQLSmith without being skipped. Informs: #137125 Release note: None Co-authored-by: Wenyi Hu <[email protected]>
After chatting with @ajwerner, it sounds like what we're doing, interpreting data as of a descriptor at a previous timestamp, is not totally sound. The rowfetcher can't see the index the data is from, because that index is in the middle of a schema change and is not visible (write-only). We need to allow it to see those indexes. The old schema changer doesn't make new indexes as often, which is why switching back to it seems to have prevented the issue. FWIW, this sounds like a real issue we should fix. |
Summary of the investigation I have done so far:
Repro steps (note that you would need to revert my fix, which disables the declarative schema changer):
An example stack trace showing where this error comes from:
```
‹0a05f2f66e8b8812430a35000000000a261e313733383031313334313731353933343030302e3030303030303030303066017866017846017816017881560178120a0898f8caddba83aa8f18›): unknown tag 110
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +(1) forced error mark
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | ‹"terminal changefeed error"›
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeedbase/*changefeedbase.terminalError::
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (2) attached stack trace
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + -- stack trace:
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:557
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [...repeated from below...]
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (3) error decoding key ‹/Table/106/110/3/0›@1738011343.009791000,0 (hex_kv: ‹0a05f2f66e8b8812430a35000000000a261e313733383031313334313731353933343030302e3030303030303030303066017866017846017816017881560178120a0898f8caddba83aa8f18›)
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (4) attached stack trace
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + -- stack trace:
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/util/encoding.PeekLength
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/util/encoding/encoding.go:2094
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/rowenc.EncDatumFromBuffer
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/rowenc/encoded_datum.go:148
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/rowenc.DecodeKeyValsUsingSpec
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/rowenc/index_encoding.go:569
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).DecodeIndexKey
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:849
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).nextKey
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:783
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).startScan
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:744
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).ConsumeKVProvider
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:736
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).decodeKV
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:582
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:538
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent.func1
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:357
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:358
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).workerLoop
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:643
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.func1
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:615
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.Group.GoCtx.func2
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/util/ctxgroup/ctxgroup.go:189
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | golang.org/x/sync/errgroup.(*Group).Go.func1
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | external/org_golang_x_sync/errgroup/errgroup.go:78
E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | runtime.goexit
```
Here is what we have found so far:
first one
second one
|
I looked into this with @asg0451 using Side-Eye. We were able to capture the relevant descriptors and errors at the point of failure. I think it doesn't occur in the legacy schema changer because of how it handles adding new columns: it doesn't create new indexes for them and just mutates the primary index in place. This causes other issues like #35738.
Fundamentally, the issue is that for the prev value of the row, the schema timestamp used is the predecessor to the row's timestamp. If you look at the code below, everything uses
cockroach/pkg/ccl/changefeedccl/cdcevent/event.go Lines 617 to 620 in 0ed4fa1
cockroach/pkg/ccl/changefeedccl/cdcevent/rowfetcher_cache.go Lines 290 to 304 in 4bf415a
Consider plumbing an index ID from
|
Describe the problem
TestChangefeedNemeses/nemeses_options=={EnableFpValidator: false,EnableSQLSmith: true} fails under stress.
More details on the changefeed statement:
Terminal error:
Logs:
logTestChangefeedNemeses2982758917.zip
Jira issue: CRDB-45398