Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: error decoding key during TestChangefeedNemeses #137125

Open
wenyihu6 opened this issue Dec 10, 2024 · 10 comments
Open

changefeedccl: error decoding key during TestChangefeedNemeses #137125

wenyihu6 opened this issue Dec 10, 2024 · 10 comments
Labels
A-cdc Change Data Capture branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc

Comments

@wenyihu6
Copy link
Contributor

wenyihu6 commented Dec 10, 2024

Describe the problem

TestChangefeedNemeses/nemeses_options=={EnableFpValidator: false,EnableSQLSmith: true} fails under stress.

stepping through state reverting with unexpected error: error decoding key ‹/Table/106/110/1/0›@1733849403.305284000,0 

More details on the changefeed statement:

I241210 16:49:28.231472 3384 12@util/log/event_log.go:39 ⋮ [T1,Vsystem,n1,job=1028190919862616065,change-aggregator] 512 ={"Timestamp":1733849368231469000,"EventType":"changefeed_emitted_bytes","Description":"‹CREATE CHANGEFEED FOR TABLE foo INTO 'pulsar://testfeed?region=testfeedRegion' WITH OPTIONS (diff, resolved, updated)›","SinkType":"pulsar","NumTables":1,"Resolved":"yes","InitialScan":"yes","JobId":1028190919862616065,"EmittedBytes":775,"EmittedMessages":5,"LoggingInterval":86400000000000,"Closing":true}
I241210 16:49:28.231516 3378 ccl/changefeedccl/changefeed_stmt.go:1307 ⋮ [T1,Vsystem,n1,job=CHANGEFEED id=1028190919862616065] 513  CHANGEFEED 1028190919862616065 shutting down (cause: context canceled)
W241210 16:49:28.231574 3378 ccl/changefeedccl/changefeed_stmt.go:1217 ⋮ [T1,Vsystem,n1,job=CHANGEFEED id=1028190919862616065] 514  job failed (context canceled)

Terminal error:

W241210 16:50:05.972643 121282 ccl/changefeedccl/changefeed_stmt.go:1217 ⋮ [T1,Vsystem,n1,job=CHANGEFEED id=1028190962947063809] 7351  job failed (error decoding key ‹/Table/106/110/1/0›@1733849403.305284000,0 (hex_kv: ‹0a05f2f66e8988124a0a3c000000000a261e313733333834393430323638363835353030302e303030303030303030308156017816017836017826017881260178160178160178120a08a0a3a39089dbf78718›): unknown tag 110)
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352  CHANGEFEED job 1028190962947063809: stepping through state reverting with unexpected error: error decoding key ‹/Table/106/110/1/0›@1733849403.305284000,0 (hex_kv: ‹0a05f2f66e8988124a0a3c000000000a261e313733333834393430323638363835353030302e303030303030303030308156017816017836017826017881260178160178160178120a08a0a3a39089dbf78718›): unknown tag 110
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +(1) forced error mark
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | ‹"terminal changefeed error"›
unexpected error: error decoding key ‹/Table/106/110/1/0›@1733849403.305284000,0 (hex_kv: ‹0a05f2f66e8988124a0a3c000000000a261e313733333834393430323638363835353030302e303030303030303030308156017816017836017826017881260178160178160178120a08a0a3a39089dbf78718›): unknown tag 110
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +(1) forced error mark
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | ‹"terminal changefeed error"›
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeedbase/*changefeedbase.terminalError::
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +Wraps: (2) attached stack trace
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  -- stack trace:
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/ccl/changefeedccl/cdcevent/event.go:557
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | [...repeated from below...]
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +Wraps: (3) error decoding key ‹/Table/106/110/1/0›@1733849403.305284000,0 (hex_kv: ‹0a05f2f66e8988124a0a3c000000000a261e313733333834393430323638363835353030302e303030303030303030308156017816017836017826017881260178160178160178120a08a0a3a39089dbf78718›)
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +Wraps: (4) attached stack trace
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  -- stack trace:
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/util/encoding.PeekLength
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/util/encoding/encoding.go:2094
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/sql/rowenc.EncDatumFromBuffer
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/sql/rowenc/encoded_datum.go:148
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/sql/rowenc.DecodeKeyValsUsingSpec
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/sql/rowenc/index_encoding.go:569
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).DecodeIndexKey
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/sql/row/fetcher.go:849
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).nextKey
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | 	pkg/sql/row/fetcher.go:783
E241210 16:50:05.972717 121282 jobs/registry.go:1593 ⋮ [T1,Vsystem,n1] 7352 +  | github.com/cockroachdb/cockroach/p

Logs:
logTestChangefeedNemeses2982758917.zip

Jira issue: CRDB-45398

@wenyihu6 wenyihu6 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 10, 2024
Copy link

blathers-crl bot commented Dec 10, 2024

Hi @wenyihu6, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 added A-cdc Change Data Capture T-cdc labels Dec 10, 2024
Copy link

blathers-crl bot commented Dec 10, 2024

cc @cockroachdb/cdc

@rharding6373 rharding6373 added the branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 label Jan 7, 2025
@rharding6373
Copy link
Collaborator

Added a 24.2 label, but once we can repro this we should test if this issue occurs in earlier releases.

@wenyihu6 wenyihu6 self-assigned this Jan 14, 2025
@wenyihu6
Copy link
Contributor Author

Decoder worked fine here:

I250122 21:58:15.613741 99423 ccl/changefeedccl/cdcevent/event.go:574 ⋮ [T1,Vsystem,n1,job=1040425529403080705,change-aggregator] 11158  decoding a current row: kv.Key: /Table/106/110/‹3›/‹0›, kv.Value: {[0 0 0 0 10 38 30 49 55 51 55 53 56 51 48 57 49 57 50 48 56 57 51 48 48 48 46 48 48 48 48 48 48 48 48 48 48 129 6 1 128 22 1 120 54 1 120 22 1 120 70 1 120 38 1 120 22 1 128 129 54 1 128 22 1 128] 1737583092.987383000,0}, schemaTS: 1737583093.061386000,0

Error happened here when we try decoding it using a different schema timestamp 1737583093.061385999,2147483647 - previous timestamp before the backfill.

I250122 21:58:15.614827 99423 ccl/changefeedccl/cdcevent/event.go:577 ⋮ [T1,Vsystem,n1,job=1040425529403080705,change-aggregator] 11163  decoding a prev row: kv.Key: /Table/106/110/‹3›/‹0›, kv.Value: {[0 0 0 0 10 38 30 49 55 51 55 53 56 51 48 57 49 57 50 48 56 57 51 48 48 48 46 48 48 48 48 48 48 48 48 48 48 129 6 1 128 22 1 120 54 1 120 22 1 120 70 1 120 38 1 120 22 1 128 129 54 1 128 22 1 128] 1737583092.987383000,0}, schemaTS: 1737583093.061385999,2147483647

Aside, it seems we have other failures for this test under stress, but they seem less flaky.

@wenyihu6
Copy link
Contributor Author

Another failure which seems pretty rare:

    nemeses_test.go:43: topic foo partition : saw new row timestamp 1737650046368613000.0000000000 after 1737650046539222000.0000000000 was resolved
    nemeses_test.go:43: topic foo partition : saw new row timestamp 1737650046368613000.0000000000 after 1737650046539222000.0000000000 was resolved

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jan 23, 2025

logTestChangefeedNemeses430013875.zip

Failures for the one above ^

Update: filed #139653 for this

wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jan 27, 2025
This patch disable declarative schema changer when
TestChangefeedNemeses uses sql smith.

Informs: cockroachdb#137125
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jan 27, 2025
This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: cockroachdb#137125
Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jan 28, 2025
This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: cockroachdb#137125
Release note: None
craig bot pushed a commit that referenced this issue Jan 28, 2025
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6

This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: #137125
Release note: None

139959: go.mod: bump Pebble to 1157615755bc r=RaduBerinde a=jbowens

Changes:

 * [`11576157`](cockroachdb/pebble@11576157) db: add optional ValidateKey Comparer func
 * [`303c8855`](cockroachdb/pebble@303c8855) metamorphic: fix time-bound filtering
 * [`89cae40f`](cockroachdb/pebble@89cae40f) Update stale comment in comparer.go
 * [`a4761ee8`](cockroachdb/pebble@a4761ee8) cache: readShard: use a separate mutex
 * [`16466355`](cockroachdb/pebble@16466355) cockroachkvs: pull in MVCC block-property collector, filter
 * [`2fa3b969`](cockroachdb/pebble@2fa3b969) internal/metamorphic: fix bugs in KeyFormat abstraction
 * [`6c523bfb`](cockroachdb/pebble@6c523bfb) metamorphic: ignore rangedels for deduplicating point prefixes

Release note: none.
Epic: none.

Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@wenyihu6
Copy link
Contributor Author

I noticed that this test passes when declarative schema changer is disabled.

craig bot pushed a commit that referenced this issue Jan 28, 2025
139265: opt: remove incorrect query plans for trigram similarity filters r=normanchenn a=normanchenn

Previously, the optimizer would produce incorrect query plans for queries with trigram similarity filters when `pg_trgm.similarity_threshold == 0`, producing incorrect results.

To address this, this patch adds a check to return early if `pg_trgm.similarity_threshold == 0` in trigram similarity queries on inverted indices.

Fixes: #122443

Release note (bug fix): The optimizer could produce incorrect query plans for queries using trigram similarity filters (e.g. `col % 'val'`) when `pg_trgm.similarity_threshold` was set to 0. This bug was introduced in v22.2.0 and is now fixed. Note that this issue does not affect v24.2.0+ releases when the `optimizer_use_trigram_similarity_optimization` session variable (introduced in v24.2.0) is set to its default value `true`, as it would skip this behaviour.

139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6

This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: #137125
Release note: None

Co-authored-by: Norman Chen <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
@wenyihu6 wenyihu6 removed their assignment Jan 28, 2025
craig bot pushed a commit that referenced this issue Jan 28, 2025
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6

This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: #137125
Release note: None

Co-authored-by: Wenyi Hu <[email protected]>
craig bot pushed a commit that referenced this issue Jan 28, 2025
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6

This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: #137125
Release note: None

Co-authored-by: Wenyi Hu <[email protected]>
craig bot pushed a commit that referenced this issue Jan 28, 2025
139914: cdctest: use legacy schema changer when sql smith is enabled r=aerfrei a=wenyihu6

This patch disables the declarative schema changer when TestChangefeedNemeses
uses SQLSmith. Enabling it causes a decoder error, so it is temporarily disabled
as a workaround. This ensures nemesis testing can run in CI with SQLSmith
without being skipped.

Informs: #137125
Release note: None

Co-authored-by: Wenyi Hu <[email protected]>
@asg0451
Copy link
Contributor

asg0451 commented Jan 29, 2025

after chatting with @ajwerner, it sounds like what we're doing interpreting data as of a descriptor at a previous timestamp is not totally sound. the rowfetcher can't see the index the data is from because it's from the middle of a schema change and not visible / write only. we need to allow it to see those indexes.

the old schema changer doesnt make new indexes for stuff as often which is why switching back to it seems to have prevented the issue.

fwiw this sounds like a real issue we should fix

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jan 29, 2025

Summary of the investigation I have done so far:

Repro steps (note that you would need to revert my fix which disables declarative schema changer):
./dev test pkg/ccl/changefeedccl --ignore-cache -v -f TestChangefeedNemeses --stress

An example stack trace of where this error comes from is

``` ‹0a05f2f66e8b8812430a35000000000a261e313733383031313334313731353933343030302e3030303030303030303066017866017846017816017881560178120a0898f8caddba83aa8f18›): unknown tag 110 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +(1) forced error mark E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | ‹"terminal changefeed error"› E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeedbase/*changefeedbase.terminalError](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeedbase/*changefeedbase.terminalError):: E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (2) attached stack trace E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + -- stack trace: E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:557 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [...repeated from below...] E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (3) error decoding key ‹/Table/106/110/3/0›@1738011343.009791000,0 (hex_kv: ‹0a05f2f66e8b8812430a35000000000a261e313733383031313334313731353933343030302e3030303030303030303066017866017846017816017881560178120a0898f8caddba83aa8f18›) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 +Wraps: (4) attached stack trace E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + -- stack trace: E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/util/encoding.PeekLength](http://github.com/cockroachdb/cockroach/pkg/util/encoding.PeekLength) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/util/encoding/encoding.go:2094 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/rowenc.EncDatumFromBuffer](http://github.com/cockroachdb/cockroach/pkg/sql/rowenc.EncDatumFromBuffer) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/rowenc/encoded_datum.go:148 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/rowenc.DecodeKeyValsUsingSpec](http://github.com/cockroachdb/cockroach/pkg/sql/rowenc.DecodeKeyValsUsingSpec) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/rowenc/index_encoding.go:569 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).DecodeIndexKey](http://github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).DecodeIndexKey) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:849 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).nextKey](http://github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).nextKey) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:783 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).startScan](http://github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).startScan) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:744 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).ConsumeKVProvider](http://github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).ConsumeKVProvider) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/sql/row/fetcher.go:736 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).decodeKV](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).decodeKV) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:582 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/cdcevent/event.go:538 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent.func1](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent.func1) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:357 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:358 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).workerLoop](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).workerLoop) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:643 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.func1](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.func1) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/ccl/changefeedccl/event_processing.go:615 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.Group.GoCtx.func2](http://github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*parallelEventConsumer).startWorkers.Group.GoCtx.func2) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | pkg/util/ctxgroup/ctxgroup.go:189 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | [golang.org/x/sync/errgroup.(*Group).Go.func1](http://golang.org/x/sync/errgroup.(*Group).Go.func1) E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | external/org_golang_x_sync/errgroup/errgroup.go:78 E250127 20:55:45.969017 88668 jobs/registry.go:1673 ⋮ [T1,Vsystem,n1,job=1041828790772727809] 13005 + | runtime.goexit ```

Here is what we have found so far:

  • I haven't been able reproduce this error with legacy schema changer which suggests that this may be declarative schema changer related.
  • This error happened after dropping a column.
  • This error happened when we try to decode a previous row for a backfill event.
  • We found this line
    rfe.Val.PrevValue = rfe.Val.Value
    to be particularly interesting. We were able to decode the same key value pair using the current schema timestamp but not at a previous schema timestamp for this backfill event.
  • The error arises from
    return 0, errors.Errorf("unknown tag %d", m)
    and
    vals[j], key, err = EncDatumFromBuffer(enc, key)
    which seems to only take in key not value
  • These are the two versions of the table descriptors: see below
first one
desc: name:"foo" id:106 version:438 modification_time:<wall_time:1738014868399778000 logical:1 > parent_id:104 unexposed_parent_schema_id:105 columns:<name:"id" id:1 type:<family: IntFamily width: 64 precision: 0 locale: "" visible_type: 0 oid: 20 time_precision_is_set: false > nullable:false hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"ts" id:2 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'0':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"test0" id:19 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"test1" id:21 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > next_column_id:32 families:<name:"primary" id:0 column_names:"id" column_names:"ts" column_names:"test0" column_names:"test1" column_names:"test2" column_ids:1 column_ids:2 column_ids:19 column_ids:21 column_ids:31 default_column_id:2 > next_family_id:1 primary_index:<name:"foo_pkey" id:108 unique:true version:4 key_column_names:"id" key_column_directions:ASC store_column_names:"ts" store_column_names:"test0" store_column_names:"test1" key_column_ids:1 store_column_ids:2 store_column_ids:19 store_column_ids:21 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:0 num_implicit_columns:0 > type:FORWARD created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" use_delete_preserving_encoding:false created_at_nanos:0 constraint_id:108 not_visible:false invisibility:0 > next_index_id:112 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"enterprisefeeduser" privileges:67108896 with_grant_option:0 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"root" version:3 > mutations:<column:<name:"test2" id:31 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > state:WRITE_ONLY direction:ADD mutation_id:1 rollback:false > mutations:<index:<name:"crdb_internal_index_110_name_placeholder" id:110 unique:true version:4 key_column_names:"id" key_column_directions:ASC store_column_names:"ts" store_column_names:"test0" store_column_names:"test1" store_column_names:"test2" key_column_ids:1 store_column_ids:2 store_column_ids:19 store_column_ids:21 store_column_ids:31 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:0 num_implicit_columns:0 > type:FORWARD created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" use_delete_preserving_encoding:false created_at_nanos:0 constraint_id:110 not_visible:false invisibility:0 > state:WRITE_ONLY direction:ADD mutation_id:1 rollback:false > mutations:<index:<name:"crdb_internal_index_111_name_placeholder" id:111 unique:true version:4 key_column_names:"id" key_column_directions:ASC store_column_names:"ts" store_column_names:"test0" store_column_names:"test1" store_column_names:"test2" key_column_ids:1 store_column_ids:2 store_column_ids:19 store_column_ids:21 store_column_ids:31 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:0 num_implicit_columns:0 > type:FORWARD created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" use_delete_preserving_encoding:true created_at_nanos:0 constraint_id:111 not_visible:false invisibility:0 > state:DELETE_ONLY direction:DROP mutation_id:1 rollback:false > next_mutation_id:1 format_version:3 state:PUBLIC offline_reason:"" view_query:"" is_materialized_view:false refresh_view_required:false declarative_schema_changer_state:<job_id:1041840433561141249 revertible:true targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<primary_index:<embedded_index:<table_id:106 index_id:108 is_unique:true is_created_explicitly:true constraint_id:108 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_name:<table_id:106 index_id:108 name:"foo_pkey" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_data:<table_id:106 index_id:108 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<table_data:<table_id:106 database_id:104 > > metadata:<> target_status:PUBLIC > targets:<element_proto:<column:<table_id:106 column_id:31 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_name:<table_id:106 column_id:31 name:"test2" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_type:<table_id:106 column_id:31 embedded_type_t:<type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > type_name:"STRING" > is_nullable:true element_creation_metadata:<in_23_1_or_later:true in_24_3_or_later:true > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_default_expression:<table_id:106 column_id:31 embedded_expr:<expr:"'x':::STRING" > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<primary_index:<embedded_index:<table_id:106 index_id:110 is_unique:true is_created_explicitly:true constraint_id:110 source_index_id:108 temporary_index_id:111 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_name:<table_id:106 index_id:110 name:"foo_pkey" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_data:<table_id:106 index_id:110 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<temporary_index:<embedded_index:<table_id:106 index_id:111 is_unique:true is_created_explicitly:true constraint_id:111 source_index_id:108 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_data:<table_id:106 index_id:111 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:31 ordinal_in_kind:3 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:31 ordinal_in_kind:3 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:WRITE_ONLY current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:WRITE_ONLY current_statuses:ABSENT current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:TRANSIENT_DELETE_ONLY current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC target_ranks:0 target_ranks:1 target_ranks:2 target_ranks:3 target_ranks:4 target_ranks:5 target_ranks:6 target_ranks:7 target_ranks:8 target_ranks:9 target_ranks:10 target_ranks:11 target_ranks:12 target_ranks:13 target_ranks:14 target_ranks:15 target_ranks:16 target_ranks:17 target_ranks:18 target_ranks:19 target_ranks:20 target_ranks:21 target_ranks:22 target_ranks:23 target_ranks:24 target_ranks:25 target_ranks:26 relevant_statements:<statement:<statement:"ALTER TABLE foo ADD COLUMN test2 STRING DEFAULT 'x'" redacted_statement:"ALTER TABLE d.public.foo ADD COLUMN test2 STRING DEFAULT \342\200\271'x'\342\200\272" statement_tag:"ALTER TABLE" > > authorization:<user_name:"root" > name_mapping:<id:106 name:"foo" columns:<key:1 value:"id" > columns:<key:2 value:"ts" > columns:<key:19 value:"test0" > columns:<key:21 value:"test1" > columns:<key:31 value:"test2" > columns:<key:4294967292 value:"crdb_internal_origin_timestamp" > columns:<key:4294967293 value:"crdb_internal_origin_id" > columns:<key:4294967294 value:"tableoid" > columns:<key:4294967295 value:"crdb_internal_mvcc_timestamp" > families:<key:0 value:"primary" > indexes:<key:110 value:"foo_pkey" > > > drop_time:0 replacement_of:<id:0 time:<> > audit_mode:DISABLED drop_job_id:0 create_query:"" create_as_of_time:<wall_time:1738014850245152000 > temporary:false partition_all_by:false exclude_data_from_backup:false next_constraint_id:112 import_start_wall_time:0 schema_locked:false import_epoch:0 import_type:IMPORT_WITH_START_TIME_ONLY replicated_pcr_version:0 next_trigger_id:0 next_policy_id:0 , familyID: 0, err: <nil>, schemaTS: 1738014868.413102000,0
at rf fetcher:  "n\x8a\x88"
sec one
desc: name:"foo" id:106 version:439 modification_time:<wall_time:1738014868413102000 logical:1 > parent_id:104 unexposed_parent_schema_id:105 columns:<name:"id" id:1 type:<family: IntFamily width: 64 precision: 0 locale: "" visible_type: 0 oid: 20 time_precision_is_set: false > nullable:false hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"ts" id:2 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'0':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"test0" id:19 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"test1" id:21 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > columns:<name:"test2" id:31 type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > nullable:true default_expr:"'x':::STRING" hidden:false inaccessible:false generated_as_identity_type:NOT_IDENTITY_COLUMN virtual:false pg_attribute_num:0 alter_column_type_in_progress:false system_column_kind:NONE > next_column_id:32 families:<name:"primary" id:0 column_names:"id" column_names:"ts" column_names:"test0" column_names:"test1" column_names:"test2" column_ids:1 column_ids:2 column_ids:19 column_ids:21 column_ids:31 default_column_id:2 > next_family_id:1 primary_index:<name:"foo_pkey" id:110 unique:true version:4 key_column_names:"id" key_column_directions:ASC store_column_names:"ts" store_column_names:"test0" store_column_names:"test1" store_column_names:"test2" key_column_ids:1 store_column_ids:2 store_column_ids:19 store_column_ids:21 store_column_ids:31 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:0 num_implicit_columns:0 > type:FORWARD created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" use_delete_preserving_encoding:false created_at_nanos:0 constraint_id:110 not_visible:false invisibility:0 > next_index_id:112 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"enterprisefeeduser" privileges:67108896 with_grant_option:0 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"root" version:3 > mutations:<index:<name:"crdb_internal_index_108_name_placeholder" id:108 unique:true version:4 key_column_names:"id" key_column_directions:ASC store_column_names:"ts" store_column_names:"test0" store_column_names:"test1" key_column_ids:1 store_column_ids:2 store_column_ids:19 store_column_ids:21 foreign_key:<table:0 index:0 name:"" validity:Validated shared_prefix_len:0 on_delete:NO_ACTION on_update:NO_ACTION match:SIMPLE > interleave:<> partitioning:<num_columns:0 num_implicit_columns:0 > type:FORWARD created_explicitly:true encoding_type:1 sharded:<is_sharded:false name:"" shard_buckets:0 > disabled:false geo_config:<> predicate:"" use_delete_preserving_encoding:false created_at_nanos:0 constraint_id:108 not_visible:false invisibility:0 > state:WRITE_ONLY direction:DROP mutation_id:1 rollback:false > next_mutation_id:1 format_version:3 state:PUBLIC offline_reason:"" view_query:"" is_materialized_view:false refresh_view_required:false declarative_schema_changer_state:<job_id:1041840433561141249 targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:108 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<primary_index:<embedded_index:<table_id:106 index_id:108 is_unique:true is_created_explicitly:true constraint_id:108 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_name:<table_id:106 index_id:108 name:"foo_pkey" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<index_data:<table_id:106 index_id:108 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:ABSENT > targets:<element_proto:<table_data:<table_id:106 database_id:104 > > metadata:<> target_status:PUBLIC > targets:<element_proto:<column:<table_id:106 column_id:31 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_name:<table_id:106 column_id:31 name:"test2" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_type:<table_id:106 column_id:31 embedded_type_t:<type:<family: StringFamily width: 0 precision: 0 locale: "" visible_type: 0 oid: 25 time_precision_is_set: false > type_name:"STRING" > is_nullable:true element_creation_metadata:<in_23_1_or_later:true in_24_3_or_later:true > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<column_default_expression:<table_id:106 column_id:31 embedded_expr:<expr:"'x':::STRING" > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<primary_index:<embedded_index:<table_id:106 index_id:110 is_unique:true is_created_explicitly:true constraint_id:110 source_index_id:108 temporary_index_id:111 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_name:<table_id:106 index_id:110 name:"foo_pkey" > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_data:<table_id:106 index_id:110 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<temporary_index:<embedded_index:<table_id:106 index_id:111 is_unique:true is_created_explicitly:true constraint_id:111 source_index_id:108 > > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:1 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:19 ordinal_in_kind:1 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:21 ordinal_in_kind:2 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_data:<table_id:106 index_id:111 > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > targets:<element_proto:<index_column:<table_id:106 index_id:110 column_id:31 ordinal_in_kind:3 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:PUBLIC > targets:<element_proto:<index_column:<table_id:106 index_id:111 column_id:31 ordinal_in_kind:3 kind:STORED > > metadata:<sub_work_id:1 source_element_id:1 > target_status:TRANSIENT_ABSENT > current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:VALIDATED current_statuses:ABSENT current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:TRANSIENT_ABSENT current_statuses:TRANSIENT_ABSENT current_statuses:TRANSIENT_ABSENT current_statuses:TRANSIENT_ABSENT current_statuses:TRANSIENT_ABSENT current_statuses:PUBLIC current_statuses:PUBLIC current_statuses:TRANSIENT_ABSENT target_ranks:0 target_ranks:1 target_ranks:2 target_ranks:3 target_ranks:4 target_ranks:5 target_ranks:6 target_ranks:7 target_ranks:8 target_ranks:9 target_ranks:10 target_ranks:11 target_ranks:12 target_ranks:13 target_ranks:14 target_ranks:15 target_ranks:16 target_ranks:17 target_ranks:18 target_ranks:19 target_ranks:20 target_ranks:21 target_ranks:22 target_ranks:23 target_ranks:24 target_ranks:25 target_ranks:26 relevant_statements:<statement:<statement:"ALTER TABLE foo ADD COLUMN test2 STRING DEFAULT 'x'" redacted_statement:"ALTER TABLE d.public.foo ADD COLUMN test2 STRING DEFAULT \342\200\271'x'\342\200\272" statement_tag:"ALTER TABLE" > > authorization:<user_name:"root" > name_mapping:<id:106 name:"foo" columns:<key:1 value:"id" > columns:<key:2 value:"ts" > columns:<key:19 value:"test0" > columns:<key:21 value:"test1" > columns:<key:31 value:"test2" > columns:<key:4294967292 value:"crdb_internal_origin_timestamp" > columns:<key:4294967293 value:"crdb_internal_origin_id" > columns:<key:4294967294 value:"tableoid" > columns:<key:4294967295 value:"crdb_internal_mvcc_timestamp" > families:<key:0 value:"primary" > indexes:<key:110 value:"foo_pkey" > > > drop_time:0 replacement_of:<id:0 time:<> > audit_mode:DISABLED drop_job_id:0 create_query:"" create_as_of_time:<wall_time:1738014850245152000 > temporary:false partition_all_by:false exclude_data_from_backup:false next_constraint_id:112 import_start_wall_time:0 schema_locked:false import_epoch:0 import_type:IMPORT_WITH_START_TIME_ONLY replicated_pcr_version:0 next_trigger_id:0 next_policy_id:0 , familyID: 0, err: <nil>, schemaTS: 1738014868.413102000,1
at rf fetcher:  /Table/2/0

@ajwerner
Copy link
Contributor

I looked into this with @asg0451 using Side-Eye. We were able to capture the relevant descriptors and errors at the point of failure.

I think it doesn't occur in the legacy schema changer because of how it handles adding new columns. Namely it doesn't create new indexes for them and just mutates the primary index in place. This causes other issues like #35738.

Fundamentally, the issue is that for the prev value of the row, the schema timestamp used is the predecessor to the row's timestamp. The rowFetcherCache looks up the appropriate version for that timestamp, but it appears to be using the primary index at that timestamp for decoding despite the fact that the row was read from an index other than the primary index at that time.

If you look at code here below, everything uses GetPrimaryIndex() but the row you're decoding comes from an index that at the timestamp you want to decode it is not the primary index. In this specific case, you'll see that the row was read from index 110 but the primary index at the time was 108

fetcher, family, err := d.rfCache.RowFetcherForColumnFamily(desc, familyID, systemColumns, keyOnly)
if err != nil {
return err
}

if keyOnly {
relevantColumns = tableDesc.GetPrimaryIndex().CollectKeyColumnIDs().Ordered()
} else {
relevantColumns, err = getRelevantColumnsForFamily(tableDesc, familyDesc)
}
if err != nil {
return nil, nil, err
}
if err := rowenc.InitIndexFetchSpec(
&spec, c.codec, tableDesc, tableDesc.GetPrimaryIndex(), relevantColumns,
); err != nil {
return nil, nil, err
}

Consider plumbing an index ID from (*rowFetcherCache).tableDescForKey to (*rowFetcherCache).RowFetcherForColumnFamily in (*eventDecoder).initForKey and then using the appropriate index to construct the fetcher (and as part of its cache key).

remaining, tableID, _, err := rowenc.DecodePartialTableIDIndexID(key)

@asg0451 asg0451 added the C-test-failure Broken test (automatically or manually discovered). label Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc
Projects
None yet
Development

No branches or pull requests

4 participants