
Fix TiFlash hang issue after #9072 #9424

Merged · 12 commits into pingcap:master · Sep 10, 2024

Conversation

@windtalker (Contributor) commented Sep 9, 2024

What problem does this PR solve?

Issue Number: close #9413

Problem Summary:

What is changed and how it works?

The root cause of the hang issue introduced by #9072 is:

  1. ExchangeSenderSinkOp eventually writes data to LooseBoundedMPMCQueue; although LooseBoundedMPMCQueue is named "LooseBounded", it actually has an upper bound
  2. For the same MPPTunnel (which holds a LooseBoundedMPMCQueue), all the ExchangeSenderSinkOps try to write data to the same LooseBoundedMPMCQueue concurrently
  3. If the current MPPTunnel is not writable (its LooseBoundedMPMCQueue is full), the ExchangeSenderSinkOp registers itself to the pipeline notify future (the tunnelSender)
  4. The pipeline notify future notifies one task each time the consumer of the LooseBoundedMPMCQueue reads a message from it
  5. In the current implementation, when an ExchangeSenderSinkOp is notified in stage 4, it is not guaranteed to write data to the LooseBoundedMPMCQueue, because
    • ExchangeSenderSinkOp only calls writer->write(block); to write the data, and inside writer->write(block) the data may be cached in the writer instead of being written to the LooseBoundedMPMCQueue
    • If the current block is empty, it calls writer->flush() to flush the cached data, and if there is no cached data, nothing is written to the LooseBoundedMPMCQueue
  6. Consider a case where there are M ExchangeSenderSinkOps, the LooseBoundedMPMCQueue has a size limit of N, M > N, and all M ExchangeSenderSinkOps try to write to a full LooseBoundedMPMCQueue at the same time. Then all M ExchangeSenderSinkOps are registered to the pipeline notify future. As described in stage 4, the pipeline notify future notifies only one task each time the consumer reads a message from the LooseBoundedMPMCQueue, so at most N ExchangeSenderSinkOps will be notified. If none of the notified N ExchangeSenderSinkOps writes data to the LooseBoundedMPMCQueue, the queue becomes empty while there are still M - N ExchangeSenderSinkOps waiting on the pipeline notify future. Since the LooseBoundedMPMCQueue is empty, those M - N ExchangeSenderSinkOps have no chance to be notified, and the whole query hangs.

There are two possible fixes:

  • For each read from the LooseBoundedMPMCQueue, trigger a notification if the current queue is empty
  • Make sure that each time an ExchangeSenderSinkOp is notified, it either writes data to the LooseBoundedMPMCQueue or tries to notify another ExchangeSenderSinkOp

The first fix would be easier, but considering that ExchangeReceiver will also use the notification mechanism after #9073, the first fix may not work in the future, so this PR uses the second fix; a sketch of it follows.
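
Below is a minimal sketch of the second fix, not the exact TiFlash code: the bool-returning doWrite/doFlush mirror the DAGResponseWriter change in this PR, while BlockLike, ResponseWriterLike, NotifySourceLike and notifyNextWaitingWriter are illustrative names assumed only for this sketch.

struct BlockLike
{
    bool has_rows = false;
};

// Assumed writer interface: mirrors the new DAGResponseWriter methods in this PR,
// which return true only when data was actually pushed into the queue.
struct ResponseWriterLike
{
    virtual bool doWrite(const BlockLike & block) = 0; // may only cache the rows
    virtual bool doFlush() = 0;                        // no-op when nothing is cached
    virtual ~ResponseWriterLike() = default;
};

// Assumed queue-side hook: wake one more writer waiting on the pipeline notify future.
struct NotifySourceLike
{
    virtual void notifyNextWaitingWriter() = 0;
    virtual ~NotifySourceLike() = default;
};

// The invariant enforced by the fix: a notified sink either pushes a packet into the
// LooseBoundedMPMCQueue, or forwards the wake-up to another waiting sink, so the
// single per-pop notification is never swallowed.
void onSinkNotified(ResponseWriterLike & writer, NotifySourceLike & notify_source, const BlockLike & block)
{
    const bool pushed = block.has_rows ? writer.doWrite(block) : writer.doFlush();
    if (!pushed)
        notify_source.notifyNextWaitingWriter();
}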

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-triage-completed labels Sep 9, 2024
@windtalker (Contributor, Author)

/cc @SeaRise

ti-chi-bot (bot) commented Sep 9, 2024

@windtalker: GitHub didn't allow me to request PR reviews from the following users: SeaRise.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @SeaRise

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SeaRise (Contributor) commented Sep 9, 2024

/cc @SeaRise

That's amazing!

@fuzhe1989 (Contributor)

have on chance to be notified

have no chance to be notified

@SeaRise (Contributor) commented Sep 9, 2024

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

@windtalker (Contributor, Author)

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

BY "even if the tunnel is not ready" you mean the tunnel is not connected or the tunnel is full?

@SeaRise (Contributor) commented Sep 9, 2024

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

BY "even if the tunnel is not ready" you mean the tunnel is not connected or the tunnel is full?

I mean the tunnel is full.

@windtalker windtalker changed the title Fix TiFlash hang issue after #9072 [DNM] Fix TiFlash hang issue after #9072 Sep 9, 2024
@SeaRise (Contributor) commented Sep 9, 2024

Considering this case:
If it's hash-partitioned, and the concurrency is 2, both operators need to flush, and the number of tunnels is 2:

  1. Operator1 and Operator2 are both waiting for a notification on tunnel[1].
  2. Operator1 is notified by tunnel[1], and since it has data to flush, it won't call triggerPipelineWriterNotify.
  3. When Operator1 flushes, it may be that partition 1 has no data and only partition 0 has data, so tunnel[1] gets no write at this time.
  4. Because nothing is written to tunnel[1], its consumer has nothing to consume, so Operator2 is never triggered, resulting in a hang.

So the query will hang, right?

@windtalker (Contributor, Author)

  1. Operator1 and Operator2 are both waiting for a notification on tunnel[1]. […]

The current PR makes sure that the operator sends data to all tunnels, even if there is no actual data to send; a sketch of this behaviour is below.
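
As a hedged illustration (PacketLike, TunnelSetLike and flushToAllTunnels are names assumed for the sketch, not the actual TiFlash API): on flush, one packet is written to every partition, including partitions with nothing buffered, so every tunnel observes a write and the notification chain on it keeps moving.

#include <cstddef>
#include <utility>
#include <vector>

struct PacketLike
{
    std::vector<int> chunks; // stand-in for the serialized chunks of an MPP packet
};

struct TunnelSetLike
{
    virtual std::size_t tunnelCount() const = 0;
    virtual void write(std::size_t partition_id, PacketLike packet) = 0;
    virtual ~TunnelSetLike() = default;
};

// Flush the per-partition buffers; partition_buffers is assumed to have one entry
// per tunnel. The write happens for every partition, even an empty one, so no
// tunnel is skipped.
void flushToAllTunnels(std::vector<PacketLike> & partition_buffers, TunnelSetLike & tunnels)
{
    for (std::size_t part = 0; part < tunnels.tunnelCount(); ++part)
    {
        tunnels.write(part, std::move(partition_buffers[part]));
        partition_buffers[part] = PacketLike{}; // reset the moved-from buffer
    }
}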

@windtalker (Contributor, Author)

Another possible fix from @SeaRise
Basic idea: currently, ExchangeSenderSinkOp first writes data to the writer's internal cache, and if the cache is full, it flushes the cache to the tunnelset. So another fix is to "only register the ExchangeSenderSinkOp to the pipeline notify future when the writer wants to flush data to the tunnelset".
Changes that need to be made:

  1. Call the tunnel's waitForWritable only when the writer wants to flush data to the tunnelset. Currently, we call the tunnelset's waitForWritable in ExchangeSenderSinkOp->prepare(); we need to move this code inside each writer's write function. The code should be something like this:
function write()
{
     // write data to cache
    if (cache is full)
    {
        if (tunnelset is writable)
            // flush data to tunnelset
        else
            // return wait_for_notify or wait_for_polling
    }
}
  2. Inside ExchangeSenderSinkOp->prepare(), it should check whether the current writer has remaining data to flush to the tunnel. The code should be something like this:
function prepare() 
{
    if (writer->hasDataToFlush())
    {
        if (tunnel is writable)
            // flush data to tunnelset
        else
            // return wait_for_notify or wait_for_polling
    }
    else
    {
        // return need_input status
    }
}
  3. Usually, a tunnelset contains more than one tunnel. Unfortunately, there is no atomic way to check whether all the tunnels are writable, so in the current code we use force_push once all tunnels independently claim they are OK to write (see the sketch after this list). In this fix, it should be reconsidered whether we still need force_push for this, or whether we can flush data to each tunnel independently for all the tunnels in the same tunnelset.
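
As a rough sketch of the force_push approach described in point 3 (TunnelLike, isWritable, forcePush and tryFlushToAllTunnels are assumed names, not the actual TiFlash interfaces): writability is checked per tunnel, and only when every tunnel claims to be writable does the flush proceed, using force_push so that a tunnel filling up between the check and the write cannot block the flush.

#include <cstddef>
#include <vector>

struct TunnelLike
{
    virtual bool isWritable() const = 0;    // per-tunnel, non-atomic check
    virtual void forcePush(int packet) = 0; // push even if the queue is momentarily over its bound
    virtual ~TunnelLike() = default;
};

// Returns false when some tunnel is not writable (the caller should wait or register
// for a notification); otherwise force-pushes one packet per tunnel. packets is
// assumed to have one entry per tunnel.
bool tryFlushToAllTunnels(const std::vector<TunnelLike *> & tunnels, const std::vector<int> & packets)
{
    for (const auto * tunnel : tunnels)
        if (!tunnel->isWritable())
            return false;
    for (std::size_t i = 0; i < tunnels.size(); ++i)
        tunnels[i]->forcePush(packets[i]);
    return true;
}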

@windtalker windtalker changed the title [DNM] Fix TiFlash hang issue after #9072 Fix TiFlash hang issue after #9072 Sep 10, 2024
block.clear();
tracked_packet->addChunk(codec_stream->getString());
codec_stream->clear();
if (block)
Contributor

Why add this condition?

Contributor Author

This was added for debugging; there is no need for it. I will remove it.

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Sep 10, 2024
Comment on lines 225 to 231
auto should_notify = false;
{
std::lock_guard lock(mu);
should_notify = status != MPMCQueueStatus::CANCELLED && !isFullWithoutLock();
}
if (should_notify)
pipe_writer_cv.notifyOne();
Contributor

how about just pipe_writer_cv.notifyOne()?

@@ -137,7 +138,7 @@ try
batch_send_min_limit,
*dag_context_ptr);
for (const auto & block : blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

Suggested change
dag_writer->doWrite(block);
dag_writer->write(block);

seems useless change

@@ -29,7 +29,15 @@ class DAGResponseWriter
DAGResponseWriter(Int64 records_per_chunk_, DAGContext & dag_context_);
/// prepared with sample block
virtual void prepare(const Block &){};
virtual void write(const Block & block) = 0;
// return true if the write actually wrote the data
virtual bool doWrite(const Block & block) = 0;
Contributor

how about moving to protected?

Contributor Author

OK

}

// return true if the flush actually flushed the data
virtual bool doFlush() = 0;
Contributor

ditto

Contributor Author

OK

@@ -352,7 +353,7 @@ class TestTiRemoteBlockInputStream : public testing::Test

// 2. encode all blocks
for (const auto & block : source_blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

seems useless change

@@ -378,7 +379,7 @@ class TestTiRemoteBlockInputStream : public testing::Test

// 2. encode all blocks
for (const auto & block : source_blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

ditto

block.clear();
tracked_packet->addChunk(codec_stream->getString());
codec_stream->clear();
if (block)
Contributor

Suggested change
if (block)
if likely (block && block.rows() > 0)

Contributor Author

No need to add this check, because write() makes sure the block has data before it pushes the block into blocks.

@SeaRise (Contributor) left a comment

others lgtm

@windtalker (Contributor, Author)

Another possible fix from @SeaRise. Basic idea: currently, ExchangeSenderSinkOp first writes data to the writer's internal cache, and if the cache is full, it flushes the cache to the tunnelset. So another fix is to "only register the ExchangeSenderSinkOp to the pipeline notify future when the writer wants to flush data to the tunnelset". (See the full proposal in the comment above.)

After some offline discussion, we decided to use this PR as a quick fix for the hang issue, and will refine the whole tunnel write process using the above idea.

@ti-chi-bot ti-chi-bot bot added the lgtm label Sep 10, 2024
ti-chi-bot (bot) commented Sep 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gengliqi, SeaRise

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Sep 10, 2024
ti-chi-bot (bot) commented Sep 10, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-09-10 10:25:10.100761417 +0000 UTC m=+351979.841185357: ☑️ agreed by gengliqi.
  • 2024-09-10 12:45:53.321848502 +0000 UTC m=+360423.062272441: ☑️ agreed by SeaRise.

@ti-chi-bot ti-chi-bot bot merged commit 1be6569 into pingcap:master Sep 10, 2024
5 checks passed
@windtalker windtalker deleted the fix_hang_issue branch December 6, 2024 00:10
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

tpch query 10 hangs after #9072