Release 2.25.0.0-b467: [BACKPORT 2.25.0] [#25146] DocDB: Fix shared exchange shutdown hang · yugabyte/yugabyte-db

2.25.0.0-b467
056113c

spolitov tagged this 13 Dec 18:37

Summary:
Due to an as-yet-unidentified bug, a shared exchange query may never be notified that a flush has completed.
As a result, LockablePgClientSession can wait on its latch forever, so that particular shared exchange thread can never complete.
This blocks one of the threads that execute CheckExpiredSessions.
When enough threads are stuck this way, all poller threads are blocked and CheckExpiredSessions can no longer run.
Consequently, no other pg client session can be destroyed either, because the same poller is used for that.

The following changes were introduced to address this issue:
1) Switch from a Latch to a new state in the exchange, so that the thread can finish and be joined even if the flush never responds.
2) Postpone joining exchange threads that have not finished yet, so that a hang in a single exchange thread does not block other threads.
3) Add a check to the SharedExchangeQuery destructor that a response was sent.
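
Items 1 and 2 can be illustrated with a minimal sketch. It is not the code from this commit: `ExchangeState`, `Exchange`, `Worker`, `StartWorker`, and `JoinFinished` are hypothetical names chosen for illustration. The sketch assumes the waiter blocks on a state variable that shutdown can also change, and that the joiner skips any thread that has not yet marked itself finished.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical states; the real exchange may use different ones.
enum class ExchangeState { kIdle, kWaitingForFlush, kResponded, kShutdown };

class Exchange {
 public:
  void StartRequest() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != ExchangeState::kShutdown) state_ = ExchangeState::kWaitingForFlush;
  }

  // Called when the flush completes normally.
  void NotifyFlushed() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ == ExchangeState::kWaitingForFlush) state_ = ExchangeState::kResponded;
    cond_.notify_all();
  }

  // Unlike a plain latch, shutdown also releases the waiter.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    state_ = ExchangeState::kShutdown;
    cond_.notify_all();
  }

  // Blocks while a flush is outstanding; returns the state that ended the wait.
  ExchangeState WaitForFlushOrShutdown() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return state_ != ExchangeState::kWaitingForFlush; });
    return state_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  ExchangeState state_ = ExchangeState::kIdle;
};

struct Worker {
  Exchange exchange;
  std::atomic<bool> finished{false};
  std::thread thread;
};

// Starts a worker whose flush notification has not arrived yet.
std::unique_ptr<Worker> StartWorker() {
  auto worker = std::make_unique<Worker>();
  worker->exchange.StartRequest();
  Worker* raw = worker.get();
  worker->thread = std::thread([raw] {
    raw->exchange.WaitForFlushOrShutdown();
    raw->finished.store(true);
  });
  return worker;
}

// Joins only workers that already finished and keeps the rest for a later
// pass, so one hung worker does not block the caller. Returns the number kept.
std::size_t JoinFinished(std::vector<std::unique_ptr<Worker>>& workers) {
  std::vector<std::unique_ptr<Worker>> pending;
  for (auto& worker : workers) {
    if (worker->finished.load()) {
      worker->thread.join();
    } else {
      pending.push_back(std::move(worker));
    }
  }
  workers = std::move(pending);
  return workers.size();
}
```

In this sketch a periodic cleanup pass would call `JoinFinished` and simply retry the remaining workers on the next pass, rather than blocking in `join()` on a thread that may never return.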

It is possible that there is no other bug, and that the actual issue was caused by the following deadlock.
All poller threads were trying to join exchange threads.
But the actual requests could not complete, because they use the same pool as the poller.
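
The suspected deadlock follows a general pattern: a task running on a pool waits for work that can only run on that same pool. The sketch below reproduces the pattern in isolation, using a timeout so that it terminates. `TinyPool` and `InnerCompletesWhileOuterWaits` are hypothetical names and are unrelated to the actual YugabyteDB thread pool.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool, for illustration only.
class TinyPool {
 public:
  explicit TinyPool(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i) {
      threads_.emplace_back([this] { Run(); });
    }
  }

  ~TinyPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cond_.notify_all();
    for (auto& thread : threads_) thread.join();
  }

  void Submit(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(std::move(task));
    cond_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Stopped and fully drained.
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cond_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;
  std::vector<std::thread> threads_;  // Declared last: started after other members exist.
};

// The outer task plays the poller: it occupies a pool thread while waiting for
// an inner request that needs a thread from the same pool. Returns true if the
// inner request completed within the timeout.
bool InnerCompletesWhileOuterWaits(std::size_t pool_threads,
                                   std::chrono::milliseconds timeout) {
  std::promise<bool> outer_result;
  auto outer_future = outer_result.get_future();
  TinyPool pool(pool_threads);  // Destroyed first, so tasks finish before the promise goes away.
  pool.Submit([&] {
    auto inner_done = std::make_shared<std::promise<void>>();
    auto inner_future = inner_done->get_future();
    pool.Submit([inner_done] { inner_done->set_value(); });
    outer_result.set_value(inner_future.wait_for(timeout) ==
                           std::future_status::ready);
  });
  return outer_future.get();
}
```

With a single pool thread the inner request cannot start until the outer task gives up, so the wait times out; without a timeout it would hang forever. With a spare thread the inner request runs and the wait succeeds.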
Jira: DB-14306
Original commit: 9b40a42446594bc0050c51a514946b04acd5eaf5/D40489

Test Plan: Jenkins

Reviewers: esheng

Reviewed By: esheng

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D40658
