Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mysql sink may be blocked by network io wait #825

Merged
merged 8 commits into from
Aug 6, 2020

Conversation

amyangfei
Copy link
Contributor

What problem does this PR solve?

In a test when TiDB cluster is abnormal, ticdc executes DMLs with some errors, but one of goroutine is hang up and blocks the whole sink

goroutine 1560236 [IO wait, 205 minutes]:
internal/poll.runtime_pollWait(0x7fa0f1b689c8, 0x72, 0xffffffffffffffff)
        runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc010079618, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
        internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc010079600, 0xc0113cc000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        internal/poll/fd_unix.go:169 +0x1cf
net.(*netFD).Read(0xc010079600, 0xc0113cc000, 0x1000, 0x1000, 0xc0254cad80, 0xc010079600, 0xc010079618)
        net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00ddea000, 0xc0113cc000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        net/net.go:184 +0x68
github.com/go-sql-driver/mysql.(*buffer).fill(0xc013bf2360, 0x4, 0xc0254cabd0, 0x69d3b3)
        github.com/go-sql-driver/[email protected]/buffer.go:90 +0x13c
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc013bf2360, 0x4, 0x22b5, 0x0, 0x0, 0x22b1, 0xc0001b4000)
        github.com/go-sql-driver/[email protected]/buffer.go:119 +0x9c
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc013bf2360, 0x22b5, 0x22b5, 0x22b5, 0x4051d9, 0xc017504a00)
        github.com/go-sql-driver/[email protected]/packets.go:31 +0x91
github.com/go-sql-driver/mysql.(*mysqlConn).readResultSetHeaderPacket(0xc013bf2360, 0xc01734c003, 0xc017502500, 0x22b0)
        github.com/go-sql-driver/[email protected]/packets.go:537 +0x2f
github.com/go-sql-driver/mysql.(*mysqlConn).exec(0xc013bf2360, 0xc017502500, 0x22b0, 0xc017330000, 0x64)
        github.com/go-sql-driver/[email protected]/connection.go:345 +0xd3
github.com/go-sql-driver/mysql.(*mysqlConn).Exec(0xc013bf2360, 0xc0156bd900, 0x340, 0xc017330000, 0x64, 0x64, 0x0, 0x0, 0x3, 0xc0254cad98)
        github.com/go-sql-driver/[email protected]/connection.go:327 +0x1a8
github.com/go-sql-driver/mysql.(*mysqlConn).ExecContext(0xc013bf2360, 0x2de6440, 0xc002706480, 0xc0156bd900, 0x340, 0xc017302000, 0x64, 0x64, 0x0, 0x0, ...)
        github.com/go-sql-driver/[email protected]/connection.go:532 +0x165
database/sql.ctxDriverExec(0x2de6440, 0xc002706480, 0x7fa0f11a9408, 0xc013bf2360, 0x0, 0x0, 0xc0156bd900, 0x340, 0xc017302000, 0x64, ...)
        database/sql/ctxutil.go:31 +0x227
database/sql.(*DB).execDC.func2()
        database/sql/sql.go:1519 +0x1db
database/sql.withLock(0x2daad80, 0xc015746280, 0xc0254cb158)
        database/sql/sql.go:3184 +0x6d
database/sql.(*DB).execDC(0xc001de0000, 0x2de6440, 0xc002706480, 0xc015746280, 0xc02446cd60, 0xc0156bd900, 0x340, 0xc01722c700, 0x64, 0x64, ...)
        database/sql/sql.go:1514 +0x4b2
database/sql.(*Tx).ExecContext(0xc028b72280, 0x2de6440, 0xc002706480, 0xc0156bd900, 0x340, 0xc01722c700, 0x64, 0x64, 0x2db0a40, 0xc02c54c480, ...)
        database/sql/sql.go:2275 +0xef
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2.2(0xbfc2b352a76b1ac5, 0x46ebd2c39e57, 0x48c0660)
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:619 +0x17c
github.com/pingcap/ticdc/cdc/sink.(*Statistics).RecordBatchExecution(0xc000ce83f0, 0xc0254cb538, 0x4e3a7a, 0xc000c0a1e0)
        github.com/pingcap/ticdc@/cdc/sink/statistics.go:86 +0x50
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2(0xfadd0b, 0x48e1358)
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:612 +0xa7
github.com/pingcap/ticdc/pkg/retry.Run.func1(0xc029496260, 0xc029496280)
        github.com/pingcap/ticdc@/pkg/retry/retry.go:31 +0x2a
github.com/cenkalti/backoff.RetryNotify(0xc0254cb858, 0x2dab6c0, 0xc029496260, 0x0, 0x0, 0x0)
        github.com/cenkalti/[email protected]+incompatible/retry.go:37 +0xb8
github.com/cenkalti/backoff.Retry(...)
        github.com/cenkalti/[email protected]+incompatible/retry.go:24
github.com/pingcap/ticdc/pkg/retry.Run(0x1dcd6500, 0x8, 0xc0254cb8e0, 0x58, 0x2663c80)
        github.com/pingcap/ticdc@/pkg/retry/retry.go:30 +0xb8
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries(0xc026f2a500, 0x2de6440, 0xc002706480, 0xc0238a9480, 0x8, 0x1a, 0x17, 0x0)

What is changed and how it works?

Use a hack code to avoid io wait in some routine blocks others to exit.
As the network io wait is blocked in kernel code, the goroutine is in a
D-state that we could not even stop it by canceling the context. So if this
scenario happens, the blocked goroutine will be leak.

Check List

Tests

  • Unit test
  • Integration test

Release note

  • Fix network io wait will block task exits

@amyangfei amyangfei added the component/sink Sink component. label Aug 5, 2020
@amyangfei
Copy link
Contributor Author

/run-integration-tests

@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2020

Codecov Report

Merging #825 into master will decrease coverage by 0.0097%.
The diff coverage is 0.0000%.

@@               Coverage Diff                @@
##             master       #825        +/-   ##
================================================
- Coverage   32.4760%   32.4663%   -0.0098%     
================================================
  Files            97         97                
  Lines         10848      10765        -83     
================================================
- Hits           3523       3495        -28     
+ Misses         6991       6934        -57     
- Partials        334        336         +2     

@amyangfei
Copy link
Contributor Author

/run-integration-tests

@amyangfei amyangfei added the status/ptal Could you please take a look? label Aug 5, 2020
@amyangfei amyangfei requested review from zier-one and overvenus August 5, 2020 13:24
for _, w := range s.workers {
w.waitAllTxnsExecuted()
}
done <- struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
done <- struct{}{}
close(done)

s.notifier.Notify()
for _, w := range s.workers {
w.waitAllTxnsExecuted()
done := make(chan struct{}, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
done := make(chan struct{}, 1)
done := make(chan struct{})

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 6, 2020
@zier-one
Copy link
Contributor

zier-one commented Aug 6, 2020

/merge

@ti-srebot
Copy link
Contributor

@leoppro Oops! This PR requires at least 2 LGTMs to merge. The current number of LGTM is 1.

@amyangfei
Copy link
Contributor Author

/run-integration-tests

@amyangfei amyangfei added this to the v4.0.5 milestone Aug 6, 2020
@liuzix
Copy link
Contributor

liuzix commented Aug 6, 2020

/lgtm

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Aug 6, 2020
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Aug 6, 2020
@amyangfei
Copy link
Contributor Author

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 6, 2020
@liuzix
Copy link
Contributor

liuzix commented Aug 6, 2020

/merge

@ti-srebot
Copy link
Contributor

/run-all-tests

@amyangfei
Copy link
Contributor Author

/merge

1 similar comment
@amyangfei
Copy link
Contributor Author

/merge

@ti-srebot
Copy link
Contributor

/run-all-tests

@ti-srebot ti-srebot merged commit e0dcfb7 into pingcap:master Aug 6, 2020
@amyangfei amyangfei deleted the fix-io-wait branch August 6, 2020 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/sink Sink component. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants