
Fix TiFlash hang issue after #9072 #9424

Merged · 12 commits into pingcap:master · Sep 10, 2024

Conversation

@windtalker (Contributor) commented Sep 9, 2024

What problem does this PR solve?

Issue Number: close #9413

Problem Summary:

What is changed and how it works?

The root cause of the hang issue introduced by #9072 is:

  1. ExchangeSenderSinkOp eventually writes data to LooseBoundedMPMCQueue; although LooseBoundedMPMCQueue is named "LooseBounded", it actually has an upper bound
  2. For the same MPPTunnel (which holds a LooseBoundedMPMCQueue), all the ExchangeSenderSinkOps try to write data to the same LooseBoundedMPMCQueue concurrently
  3. If the current MPPTunnel is not writable (its LooseBoundedMPMCQueue is full), the ExchangeSenderSinkOp registers itself to the pipeline notify future (the tunnelSender)
  4. The pipeline notify future notifies one task each time the consumer of the LooseBoundedMPMCQueue reads a message from it
  5. In the current implementation, when an ExchangeSenderSinkOp is notified in stage 4, it is not guaranteed to write data to the LooseBoundedMPMCQueue, because
    • ExchangeSenderSinkOp only calls writer->write(block); to write the data, and inside writer->write(block) the data may be cached in the writer instead of being written to the LooseBoundedMPMCQueue
    • If the current block is empty, it calls writer->flush() to flush the cached data, and if there is no cached data, nothing is written to the LooseBoundedMPMCQueue
  6. Consider a case where there are M ExchangeSenderSinkOps, the LooseBoundedMPMCQueue has a size limit of N, M > N, and all M ExchangeSenderSinkOps try to write to a full LooseBoundedMPMCQueue at the same time. Then all M ExchangeSenderSinkOps are registered to the pipeline notify future. As described in stage 4, the pipeline notify future notifies only one task each time the consumer reads a message from the LooseBoundedMPMCQueue, so at most N ExchangeSenderSinkOps will be notified. If none of the notified N ExchangeSenderSinkOps writes data to the LooseBoundedMPMCQueue, the queue becomes empty while there are still M - N ExchangeSenderSinkOps waiting on the pipeline notify future. Since the LooseBoundedMPMCQueue is empty, those M - N ExchangeSenderSinkOps have no chance to be notified, and the whole query hangs.

There are two possible fixes:

  • For each read from the LooseBoundedMPMCQueue, trigger a notification if the current queue is empty
  • Make sure that each time an ExchangeSenderSinkOp is notified, it either writes data to the LooseBoundedMPMCQueue or tries to notify another ExchangeSenderSinkOp

The first fix would be easier, but considering that ExchangeReceiver will also use the notification mechanism after #9073, the first fix may not work in the future, so this PR uses the second fix; a sketch of it follows.
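
Below is a minimal sketch of the second fix, not the exact TiFlash code: the bool-returning doWrite/doFlush mirror the DAGResponseWriter change in this PR, while BlockLike, ResponseWriterLike, NotifySourceLike and notifyNextWaitingWriter are illustrative names assumed only for this sketch.

struct BlockLike
{
    bool has_rows = false;
};

// Assumed writer interface: mirrors the new DAGResponseWriter methods in this PR,
// which return true only when data was actually pushed into the queue.
struct ResponseWriterLike
{
    virtual bool doWrite(const BlockLike & block) = 0; // may only cache the rows
    virtual bool doFlush() = 0;                        // no-op when nothing is cached
    virtual ~ResponseWriterLike() = default;
};

// Assumed queue-side hook: wake one more writer waiting on the pipeline notify future.
struct NotifySourceLike
{
    virtual void notifyNextWaitingWriter() = 0;
    virtual ~NotifySourceLike() = default;
};

// The invariant enforced by the fix: a notified sink either pushes a packet into the
// LooseBoundedMPMCQueue, or forwards the wake-up to another waiting sink, so the
// single per-pop notification is never swallowed.
void onSinkNotified(ResponseWriterLike & writer, NotifySourceLike & notify_source, const BlockLike & block)
{
    const bool pushed = block.has_rows ? writer.doWrite(block) : writer.doFlush();
    if (!pushed)
        notify_source.notifyNextWaitingWriter();
}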

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-triage-completed labels Sep 9, 2024
@windtalker (Contributor, Author)

/cc @SeaRise

ti-chi-bot (bot) commented Sep 9, 2024

@windtalker: GitHub didn't allow me to request PR reviews from the following users: SeaRise.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @SeaRise

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SeaRise (Contributor) commented Sep 9, 2024

/cc @SeaRise

That's amazing!

@fuzhe1989 (Contributor)

have on chance to be notified

have no chance to be notified

@SeaRise (Contributor) commented Sep 9, 2024

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

@windtalker (Contributor, Author)

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

BY "even if the tunnel is not ready" you mean the tunnel is not connected or the tunnel is full?

@SeaRise (Contributor) commented Sep 9, 2024

How about modifying the waitForWritable method of ExchangeWriter so that it returns WaitResult::Ready even if the tunnel is not ready, as long as the cache has not reached the limit? Then, when sink->write detects that the tunnel is not ready, it returns WaitResult::WAIT_FOR_NOTIFY. This way, when the sink is notified, it will definitely write to the mpmcqueue, preventing any hang-ups?

BY "even if the tunnel is not ready" you mean the tunnel is not connected or the tunnel is full?

I mean the tunnel is full.

@windtalker windtalker changed the title Fix TiFlash hang issue after #9072 [DNM] Fix TiFlash hang issue after #9072 Sep 9, 2024
@SeaRise (Contributor) commented Sep 9, 2024

Considering this case:
If it's hash-partitioned, and the concurrency is 2, both operators need to flush, and the number of tunnels is 2:

  1. Operator1 and Operator2 are both waiting for a notification on tunnel[1].
  2. Operator1 is notified by tunnel[1], and since it has data to flush, it won't call triggerPipelineWriterNotify.
  3. When Operator1 flushes, it may be that partition 1 has no data and only partition 0 has data, so tunnel[1] gets no write at this time.
  4. Because nothing is written to tunnel[1], its consumer has nothing to consume, so Operator2 is never triggered, resulting in a hang.

So the query will hang, right?

@windtalker (Contributor, Author)

  1. Operator1 and Operator2 are both waiting for a notification on tunnel[1]. […]

The current PR makes sure that the operator sends data to all tunnels, even if there is no actual data to send; a sketch of this behaviour is below.
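
As a hedged illustration (PacketLike, TunnelSetLike and flushToAllTunnels are names assumed for the sketch, not the actual TiFlash API): on flush, one packet is written to every partition, including partitions with nothing buffered, so every tunnel observes a write and the notification chain on it keeps moving.

#include <cstddef>
#include <utility>
#include <vector>

struct PacketLike
{
    std::vector<int> chunks; // stand-in for the serialized chunks of an MPP packet
};

struct TunnelSetLike
{
    virtual std::size_t tunnelCount() const = 0;
    virtual void write(std::size_t partition_id, PacketLike packet) = 0;
    virtual ~TunnelSetLike() = default;
};

// Flush the per-partition buffers; partition_buffers is assumed to have one entry
// per tunnel. The write happens for every partition, even an empty one, so no
// tunnel is skipped.
void flushToAllTunnels(std::vector<PacketLike> & partition_buffers, TunnelSetLike & tunnels)
{
    for (std::size_t part = 0; part < tunnels.tunnelCount(); ++part)
    {
        tunnels.write(part, std::move(partition_buffers[part]));
        partition_buffers[part] = PacketLike{}; // reset the moved-from buffer
    }
}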

@windtalker (Contributor, Author)

Another possible fix from @SeaRise
Basic idea: currently, ExchangeSenderSinkOp first writes data to the writer's internal cache, and if the cache is full, it flushes the cache to the tunnelset. So another fix is to "only register the ExchangeSenderSinkOp to the pipeline notify future when the writer wants to flush data to the tunnelset".
Changes that need to be made:

  1. Call the tunnel's waitForWritable only when the writer wants to flush data to the tunnelset. Currently, we call the tunnelset's waitForWritable in ExchangeSenderSinkOp->prepare(); we need to move this code inside each writer's write function. The code should be something like this:
function write()
{
     // write data to cache
    if (cache is full)
    {
        if (tunnelset is writable)
            // flush data to tunnelset
        else
            // return wait_for_notify or wait_for_polling
    }
}
  2. Inside ExchangeSenderSinkOp->prepare(), it should check whether the current writer has remaining data to flush to the tunnel. The code should be something like this:
function prepare() 
{
    if (writer->hasDataToFlush())
    {
        if (tunnel is writable)
            // flush data to tunnelset
        else
            // return wait_for_notify or wait_for_polling
    }
    else
    {
        // return need_input status
    }
}
  3. Usually, a tunnelset contains more than one tunnel. Unfortunately, there is no atomic way to check whether all the tunnels are writable, so in the current code we use force_push once all tunnels independently claim they are OK to write (see the sketch after this list). In this fix, it should be reconsidered whether we still need force_push for this, or whether we can flush data to each tunnel independently for all the tunnels in the same tunnelset.
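
As a rough sketch of the force_push approach described in point 3 (TunnelLike, isWritable, forcePush and tryFlushToAllTunnels are assumed names, not the actual TiFlash interfaces): writability is checked per tunnel, and only when every tunnel claims to be writable does the flush proceed, using force_push so that a tunnel filling up between the check and the write cannot block the flush.

#include <cstddef>
#include <vector>

struct TunnelLike
{
    virtual bool isWritable() const = 0;    // per-tunnel, non-atomic check
    virtual void forcePush(int packet) = 0; // push even if the queue is momentarily over its bound
    virtual ~TunnelLike() = default;
};

// Returns false when some tunnel is not writable (the caller should wait or register
// for a notification); otherwise force-pushes one packet per tunnel. packets is
// assumed to have one entry per tunnel.
bool tryFlushToAllTunnels(const std::vector<TunnelLike *> & tunnels, const std::vector<int> & packets)
{
    for (const auto * tunnel : tunnels)
        if (!tunnel->isWritable())
            return false;
    for (std::size_t i = 0; i < tunnels.size(); ++i)
        tunnels[i]->forcePush(packets[i]);
    return true;
}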

@windtalker windtalker changed the title [DNM] Fix TiFlash hang issue after #9072 Fix TiFlash hang issue after #9072 Sep 10, 2024
block.clear();
tracked_packet->addChunk(codec_stream->getString());
codec_stream->clear();
if (block)
Contributor

Why add this condition?

Contributor Author

This was added for debugging; there is no need for it. I will remove it.

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Sep 10, 2024
Comment on lines 225 to 231
auto should_notify = false;
{
std::lock_guard lock(mu);
should_notify = status != MPMCQueueStatus::CANCELLED && !isFullWithoutLock();
}
if (should_notify)
pipe_writer_cv.notifyOne();
Contributor

how about just pipe_writer_cv.notifyOne()?

@@ -137,7 +138,7 @@ try
batch_send_min_limit,
*dag_context_ptr);
for (const auto & block : blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

Suggested change
dag_writer->doWrite(block);
dag_writer->write(block);

seems useless change

@@ -29,7 +29,15 @@ class DAGResponseWriter
DAGResponseWriter(Int64 records_per_chunk_, DAGContext & dag_context_);
/// prepared with sample block
virtual void prepare(const Block &){};
virtual void write(const Block & block) = 0;
// return true if the write actually wrote the data
virtual bool doWrite(const Block & block) = 0;
Contributor

how about moving to protected?

Contributor Author

OK

}

// return true if the flush actually flushed the data
virtual bool doFlush() = 0;
Contributor

ditto

Contributor Author

OK

@@ -352,7 +353,7 @@ class TestTiRemoteBlockInputStream : public testing::Test

// 2. encode all blocks
for (const auto & block : source_blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

seems useless change

@@ -378,7 +379,7 @@ class TestTiRemoteBlockInputStream : public testing::Test

// 2. encode all blocks
for (const auto & block : source_blocks)
dag_writer->write(block);
dag_writer->doWrite(block);
Contributor

ditto

block.clear();
tracked_packet->addChunk(codec_stream->getString());
codec_stream->clear();
if (block)
Contributor

Suggested change
if (block)
if likely (block && block.rows() > 0)

Contributor Author

No need to add this check, because write() makes sure the block has data before it pushes the block into blocks.

@SeaRise (Contributor) left a comment

others lgtm

@windtalker (Contributor, Author)

Another possible fix from @SeaRise. Basic idea: currently, ExchangeSenderSinkOp first writes data to the writer's internal cache, and if the cache is full, it flushes the cache to the tunnelset. So another fix is to "only register the ExchangeSenderSinkOp to the pipeline notify future when the writer wants to flush data to the tunnelset". (See the full proposal in the comment above.)

After some offline discussion, we decided to use this PR as a quick fix for the hang issue, and will refine the whole tunnel write process using the above idea.

@ti-chi-bot ti-chi-bot bot added the lgtm label Sep 10, 2024
ti-chi-bot (bot) commented Sep 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gengliqi, SeaRise

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Sep 10, 2024
ti-chi-bot (bot) commented Sep 10, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-09-10 10:25:10.100761417 +0000 UTC m=+351979.841185357: ☑️ agreed by gengliqi.
  • 2024-09-10 12:45:53.321848502 +0000 UTC m=+360423.062272441: ☑️ agreed by SeaRise.

@ti-chi-bot ti-chi-bot bot merged commit 1be6569 into pingcap:master Sep 10, 2024
5 checks passed
@windtalker windtalker deleted the fix_hang_issue branch December 6, 2024 00:10
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

tpch query 10 hangs after #9072