Workload reports error “Check snap != nullptr failed: Can not find disaggregated task” when running ch #9567

Closed
Lily2025 opened this issue Oct 30, 2024 · 2 comments · Fixed by #9570
Assignees
Labels: affects-8.1 (this bug affects the 8.1.x LTS versions), component/storage, severity/major, type/bug (the issue is confirmed as a bug)

Comments

@Lily2025

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run ch q7.
2. The workload reports an error.

2. What did you expect to see? (Required)

no error

3. What did you see instead (Required)

The workload reports an error:
[2024-10-28 21:11:43] execute run failed, err execute query q7 failed Error 1105: other error for mpp stream: Code: 11004, e.displayText() = DB::Exception: Check snap != nullptr failed: Can not find disaggregated task, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:4>,executor=TableFullScan_78> (from s1006_t2737_4307_1_36486282018035081_360582), e.what() = DB::Exception,\r\n"]

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
TiFlash
Release Version: v8.4.0-alpha-80-g08535236b
Edition: Community
Git Commit Hash: 0853523
Git Branch: HEAD
UTC Build Time: 2024-10-28 11:09:41
Enable Features: jemalloc sm4(GmSSL) mem-profiling avx2 avx512 unwind thinlto hnsw.l2=skylake hnsw.cosine=skylake vec.l2=skylake vec.cos=skylake
Profile: RELWITHDEBINFO
Compiler: clang++ 17.0.6

Raft Proxy
Git Commit Hash: c3bedc86b1470ceacd15dfb19fa7bfa94e8ab49d
Git Commit Branch: HEAD
UTC Build Time: ""
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Storage Engine: tiflash
Prometheus Prefix: tiflash_proxy_
Profile: release
Enable Features: external-jemalloc portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine openssl-vendored

Lily2025 added the type/bug label Oct 30, 2024
@Lily2025
Author

/type bug
/severity major
/assign Lloyd-Pottiger

@Lloyd-Pottiger
Contributor

Lloyd-Pottiger commented Oct 30, 2024

Root cause

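The query is slow (it ran for more than 300s), so the disaggregated snapshot registered on the write node (WN) expired and was removed before the query finished fetching pages from it. The logs below show the snapshots being registered, removed 300s later, and then a FetchDisaggPages request failing about 10s after the removal.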

[2024/10/29 05:05:36.656 +08:00] [INFO] [WNDisaggSnapshotManager.h:56] ["Register Disaggregated Snapshot, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:3>,executor=TableFullScan_78>"] [thread_id=219]
[2024/10/29 05:05:36.671 +08:00] [INFO] [WNDisaggSnapshotManager.h:56] ["Register Disaggregated Snapshot, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:4>,executor=TableFullScan_78>"] [thread_id=288]
......
[2024/10/29 05:10:36.695 +08:00] [INFO] [WNDisaggSnapshotManager.cpp:61] ["Remove expired Disaggregated Snapshot, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:4>,executor=TableFullScan_78> expired_at=2024-10-28 21:10:36.671471"] [thread_id=413]
[2024/10/29 05:10:36.704 +08:00] [INFO] [WNDisaggSnapshotManager.cpp:61] ["Remove expired Disaggregated Snapshot, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:3>,executor=TableFullScan_78> expired_at=2024-10-28 21:10:36.656404"] [thread_id=413]
......
[2024/10/29 05:10:46.131 +08:00] [ERROR] [FlashService.cpp:1146] ["FetchDisaggPages meet exception: Check snap != nullptr failed: Can not find disaggregated task, task_id=DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:4>,executor=TableFullScan_78>\n\n       0x1f55969\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, int) [tiflash+32856425]\n                \tdbms/src/Common/Exception.h:46\n       0x8728a3f\tDB::FlashService::FetchDisaggPages(grpc::ServerContext*, disaggregated::FetchDisaggPagesRequest const*, grpc::ServerWriter<disaggregated::PagesPacket>*) [tiflash+141724223]\n                \tdbms/src/Flash/FlashService.cpp:1114\n       0x9bc14e1\tgrpc::internal::ServerStreamingHandler<tikvpb::Tikv::Service, disaggregated::FetchDisaggPagesRequest, disaggregated::PagesPacket>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+163321057]\n                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler.h:206\n       0x94c53bb\tgrpc::Server::SyncRequest::ContinueRunAfterInterception() [tiflash+155997115]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:433\n       0x94c5225\tgrpc::Server::SyncRequest::Run(std::__1::shared_ptr<grpc::Server::GlobalCallbacks> const&, bool) [tiflash+155996709]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:421\n       0x94d5d41\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+156065089]\n                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:36\n       0x989edd6\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+160034262]\n                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110\n  
0x7f36af496c02\tstart_thread [libc.so.6+564226]\n  0x7f36af51aed4\t__GI___clone [libc.so.6+1105620]"] [source="DisTaskId<MPP<gather_id:1, query_ts:1730149536627187623, local_query_id:5634, server_id:316, start_ts:453548320125354066,task_id:4>,executor=TableFullScan_78>"] [thread_id=445]
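
To make the timeline explicit, here is a small sketch that computes the intervals for task_id:4 from the timestamps in the log lines above. The variable names are only for illustration; the timestamps are copied from the logs (local time, +08:00).

```python
from datetime import datetime

# Timestamps copied from the log lines above for task_id:4 (all in +08:00).
FMT = "%Y/%m/%d %H:%M:%S.%f"
registered = datetime.strptime("2024/10/29 05:05:36.671", FMT)    # "Register Disaggregated Snapshot"
removed = datetime.strptime("2024/10/29 05:10:36.695", FMT)       # "Remove expired Disaggregated Snapshot"
fetch_failed = datetime.strptime("2024/10/29 05:10:46.131", FMT)  # "FetchDisaggPages meet exception"

# How long the snapshot existed on the WN before it was removed.
snapshot_lifetime = (removed - registered).total_seconds()
# How long after the removal the failing fetch request was logged.
late_by = (fetch_failed - removed).total_seconds()

print(f"snapshot lived for {snapshot_lifetime:.3f}s")
print(f"FetchDisaggPages failed {late_by:.3f}s after removal")
```

The snapshot lived for roughly 300s, which matches the default timeout, and the failing fetch was logged a little over 9s after the snapshot was removed.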

Workaround

Increasing the configuration profiles.default.disagg_task_snapshot_timeout (default: 300, in seconds) can mitigate the issue.
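
As a rough illustration of why a larger timeout helps, the sketch below models a snapshot registry with a fixed timeout. This is not TiFlash's implementation: `SnapshotRegistry`, `fetch_succeeds`, and the task name are hypothetical, and expired entries are dropped lazily on fetch for simplicity, whereas the logs show the real cleanup running on a separate thread. The fetch time of about 309.46s after registration is taken from the log timestamps for task_id:4.

```python
class SnapshotRegistry:
    """Toy stand-in (hypothetical) for a snapshot manager with a fixed timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._expires_at = {}  # task_id -> expiry time, in seconds

    def register(self, task_id, now_s):
        # A snapshot expires timeout_s seconds after it is registered.
        self._expires_at[task_id] = now_s + self.timeout_s

    def remove_expired(self, now_s):
        expired = [t for t, exp in self._expires_at.items() if exp <= now_s]
        for task_id in expired:
            del self._expires_at[task_id]

    def fetch(self, task_id, now_s):
        # Simplification: clean up lazily here instead of in a background thread.
        self.remove_expired(now_s)
        if task_id not in self._expires_at:
            raise LookupError(f"Can not find disaggregated task, task_id={task_id}")
        return task_id


def fetch_succeeds(timeout_s, fetch_at_s):
    """Register a snapshot at t=0 and try to fetch it at fetch_at_s."""
    registry = SnapshotRegistry(timeout_s)
    registry.register("task-4", now_s=0.0)
    try:
        registry.fetch("task-4", now_s=fetch_at_s)
        return True
    except LookupError:
        return False


# The failing fetch arrived about 309.46s after registration.
print(fetch_succeeds(300, 309.46))  # False: the snapshot has already expired
print(fetch_succeeds(600, 309.46))  # True: a longer timeout keeps it alive
```

In this model, any timeout longer than the time the query needs to fetch its pages avoids the lookup failure, which is why raising the setting eases the problem without removing the underlying cause.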

@JinheLin will fix this issue later.

@ti-chi-bot ti-chi-bot bot closed this as completed in 5dd3a73 Oct 31, 2024
yibin87 pushed a commit to yibin87/tiflash that referenced this issue Nov 6, 2024
…ingcap#9570)

close pingcap#9567

Signed-off-by: Lloyd-Pottiger <[email protected]>

Co-authored-by: Lloyd-Pottiger <[email protected]>
Co-authored-by: Lloyd-Pottiger <[email protected]>
ti-chi-bot bot added a commit that referenced this issue Nov 8, 2024
…9570) (#9571)

close #9567

Signed-off-by: ti-chi-bot <[email protected]>

Co-authored-by: jinhelin <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>