
WIP: VReplication parallel copy #8934

Closed
wants to merge 24 commits

Conversation

shlomi-noach
Contributor

Follow-up to #8056

This is an initial attempt at parallelizing the VReplication copy; the main applicability is for MoveTables or Migrate, with multiple tables involved.

Recap

As a quick recap of "The general VReplication flow" in #8056:

  • VReplication currently only ever copies one table at a time

  • Rows are read by rowstreamer using a LOCK TABLES READ + get GTID + START TRANSACTION WITH CONSISTENT SNAPSHOT + UNLOCK TABLES

    • from there on, table rows are read in an open transaction
  • vcopier invokes the above via gRPC, receives the rows, writes them to the target table, and updates copy_state

  • catchup and fast forward steps follow, applying events from the binary log

    • to save resources, VReplication does not apply changes to rows that are not yet within the copied range

Why parallelize and what's the premise

Trivially, we want to parallelize the copy to save time; we have seen mass imports of data take as long as weeks.

Parallelization can occur in two places:

  • By reading multiple tables concurrently in rowstreamer
  • By writing to multiple tables concurrently in vcopier
    • Note: parallelizing writes to a single table can also yield gains when the table does not have an AUTO_INCREMENT column. If an AUTO_INCREMENT column exists, basic tests show almost no gain with 2+ concurrent writers. We wish to focus on multi-table concurrency.

We know gRPC is a major source of overhead, so we want to avoid multiple gRPC calls. We also assume that parallel data transfer across the network is not faster than serial data transfer, provided we are able to keep the network busy/utilized.

How not to parallelize VReplication copy

We do not take the current behavior and simply run it n times in parallel. If we did that:

  • We'd have n gRPCs
  • We'd create n LOCK TABLES statements
  • We'd pull n times the same binary log events
  • There'd be n VPlayers processing those duplicate events

Proposed solution

We want a single gRPC call that parallelizes into n workers on both ends: rowstreamer and vcopier; and we want a single vplayer to process all binlog events.

gRPC

We add a VStreamRowsParallel() function, with a VStreamRowsParallelRequest message. In essence, it is similar to VStreamRows, but:

  • It provides multiple queries
  • It provides multiple lastPk values

The queries and the lastPK values correspond to each other positionally.

VStreamRowsResponse is extended to include TableName. vcopier will need this to differentiate between responses of different queries/tables.
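
For illustration only, here is roughly what the new shapes look like when expressed as Go structs; the real definitions would live in the protobufs, and the field names here are assumptions rather than the final API:

```go
// Hypothetical Go-side shape of the new gRPC messages; illustrative only.
package sketch

// VStreamRowsParallelRequest carries one query and one lastPk per table,
// where Queries[i] corresponds to LastPKs[i].
type VStreamRowsParallelRequest struct {
	Queries []string // one SELECT per table to be copied
	LastPKs [][]byte // serialized last-seen PK per query; empty means "start from the beginning"
}

// VStreamRowsResponse gains a TableName so vcopier can route each batch
// of rows to the worker handling that table.
type VStreamRowsResponse struct {
	TableName string   // new: which table these rows belong to
	Rows      [][]byte // row payloads (placeholder for the real Row type)
	Gtid      string   // GTID captured under the table lock (sent on the first response)
}
```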

rowstreamer

  • Receives a VStreamRowsParallel request with multiple queries
  • Creates plans for all queries
  • Generates a sendQuery for each query
  • Creates n+1 DB connections: one for each table/plan, plus one global connection that takes the lock
  • Runs a single LOCK TABLES t1 READ, t2 READ, t3 READ, ... query
  • Captures the GTID
  • Calls back to the streamer
    • A new goroutine is created for each table/plan
    • Each has its own connection, issuing a START TRANSACTION WITH CONSISTENT SNAPSHOT
  • Runs UNLOCK TABLES
  • Each table/plan proceeds to SELECT FROM t(i) concurrently. Each maintains its own pktsize
  • Each calls back to the "main" orchestrating mechanism to send rows. This is serialized. (A rough sketch of this flow follows the list.)
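
Here is a minimal sketch of that choreography, assuming hypothetical helpers (Conn, newConn, readLockClause, streamTable) in place of the actual rowstreamer plumbing, and the VStreamRowsResponse shape from the earlier sketch; it is meant to show the lock/snapshot ordering and the serialized send, not the real implementation:

```go
package sketch

import (
	"context"
	"sync"
)

// streamRowsParallel sketches the parallel rowstreamer flow described above.
func streamRowsParallel(ctx context.Context, tables []string, send func(*VStreamRowsResponse) error) error {
	lockConn, err := newConn(ctx)
	if err != nil {
		return err
	}
	defer lockConn.Close()

	// One lock statement covering every table in the batch.
	if err := lockConn.Exec(ctx, "LOCK TABLES "+readLockClause(tables)); err != nil {
		return err
	}
	// Capture the GTID while the lock is held.
	gtid, err := lockConn.CurrentGTID(ctx)
	if err != nil {
		return err
	}

	// One connection and one consistent snapshot per table, opened under the lock.
	conns := make([]*Conn, len(tables))
	for i := range tables {
		conn, err := newConn(ctx)
		if err != nil {
			return err
		}
		if err := conn.Exec(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
			return err
		}
		conns[i] = conn
	}

	// All snapshots are established; the lock is no longer needed.
	if err := lockConn.Exec(ctx, "UNLOCK TABLES"); err != nil {
		return err
	}

	// Stream all tables concurrently. send() is wrapped in a mutex so only one
	// goroutine writes to the gRPC stream at a time (the serialized callback).
	var mu sync.Mutex
	serializedSend := func(resp *VStreamRowsResponse) error {
		mu.Lock()
		defer mu.Unlock()
		return send(resp)
	}

	var wg sync.WaitGroup
	errs := make(chan error, len(tables))
	for i, table := range tables {
		wg.Add(1)
		go func(conn *Conn, table string) {
			defer wg.Done()
			errs <- streamTable(ctx, conn, table, gtid, serializedSend)
		}(conns[i], table)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
```

The point of the single LOCK TABLES statement is that every per-table snapshot is opened while the lock is held, so all of them correspond to the same captured GTID.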

vreplication/vcopier

Much of the logic is already implicitly supported, by virtue of the copy_state backend table. vcopier supports multiple tables in a workflow (as MoveTables supports the -all flag), and so catchup/fastforward already know how to handle the existence of multiple plans and multiple lastPks.

To simplify things, we will parallelize by running batches of n tables at a time. This can, and will, cause fragmentation: one or two of the tables will be larger than the others, and some tables will complete first, but the batch only completes when all n tables are processed.

We do it this way because it is what allows us to take a single table lock for all the tables involved, and to keep our sanity while reasoning about the GTID value.

The best approach would be to use a greedy algorithm: pick tables by size, descending. This parallelizes more tables of similar size at a time, which optimizes for less fragmentation (fragmentation == time wasted not parallelizing when we have an available slot).
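
A minimal sketch of that greedy batching, assuming we already have per-table size estimates (e.g. from information_schema); the names here are illustrative:

```go
package sketch

import "sort"

// tableInfo pairs a table name with a size estimate (e.g. a row-count estimate
// from information_schema); it only needs to be good enough for ordering.
type tableInfo struct {
	name string
	size int64
}

// greedyBatches orders tables by size, descending, and cuts them into batches
// of at most n. Each batch groups tables of roughly similar size, which reduces
// the time a copy slot sits idle waiting for a much larger table to finish.
func greedyBatches(tables []tableInfo, n int) [][]tableInfo {
	sorted := make([]tableInfo, len(tables))
	copy(sorted, tables)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].size > sorted[j].size })

	var batches [][]tableInfo
	for len(sorted) > 0 {
		k := n
		if k > len(sorted) {
			k = len(sorted)
		}
		batches = append(batches, sorted[:k])
		sorted = sorted[k:]
	}
	return batches
}
```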

vcopier needs to pick n tables at a time, compute plans for all of those tables, and invoke VStreamRowsParallel. Possibly there will already be a lastPk for some of those tables; this is trivially read from copy_state, with no significant changes other than reorganizing the data.

We will create n workers. Ideally, each worker will write to a different table (we have n tables), but it is possible that rowstreamer sends results from one table more frequently than from the others. The logic to parallelize the writes on the vcopier side is not trivial, I think. We can allow for occasional parallelization of writes to the same table if that simplifies the code.

vcopier needs to both parallelize the writes (via goroutines) and, at the same time, be able to respond to the send function with an error result. As I'm writing this, the problem grows more complex in my mind... :( One possible shape is sketched below.
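
A sketch of one possible shape, under the assumption that responses are routed by TableName to a per-table worker, and that a hypothetical applyRows helper does the actual write plus the copy_state update; this illustrates how the send callback can still return an error, and is not the actual vcopier code:

```go
package sketch

import (
	"context"
	"sync"
)

// copyWorkers gives each table its own worker goroutine and channel; the send
// callback routes a response by TableName and reports the first apply error.
type copyWorkers struct {
	mu       sync.Mutex
	firstErr error
	chans    map[string]chan *VStreamRowsResponse
	wg       sync.WaitGroup
}

func newCopyWorkers(ctx context.Context, tables []string) *copyWorkers {
	cw := &copyWorkers{chans: make(map[string]chan *VStreamRowsResponse)}
	for _, table := range tables {
		ch := make(chan *VStreamRowsResponse, 1)
		cw.chans[table] = ch
		cw.wg.Add(1)
		go func(table string, ch chan *VStreamRowsResponse) {
			defer cw.wg.Done()
			for resp := range ch {
				// applyRows is a hypothetical helper: writes rows, updates copy_state.
				if err := applyRows(ctx, table, resp); err != nil {
					cw.setErr(err)
					return
				}
			}
		}(table, ch)
	}
	return cw
}

func (cw *copyWorkers) setErr(err error) {
	cw.mu.Lock()
	defer cw.mu.Unlock()
	if cw.firstErr == nil {
		cw.firstErr = err
	}
}

// send is the callback handed to the gRPC stream: it fails fast if any worker
// has already failed, otherwise hands the response to that table's worker.
func (cw *copyWorkers) send(resp *VStreamRowsResponse) error {
	cw.mu.Lock()
	err := cw.firstErr
	cw.mu.Unlock()
	if err != nil {
		return err
	}
	cw.chans[resp.TableName] <- resp
	return nil
}

// wait closes all worker channels, waits for the workers to drain, and returns
// the first error, if any, so the whole batch can be failed and retried.
func (cw *copyWorkers) wait() error {
	for _, ch := range cw.chans {
		close(ch)
	}
	cw.wg.Wait()
	cw.mu.Lock()
	defer cw.mu.Unlock()
	return cw.firstErr
}
```

Returning an error from send gives the rowstreamer a signal to abort the stream for the whole batch rather than keep pushing rows that nobody will apply.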

Then, calls to catchup/fastforward converge again; there's only one thread to run those.


Initial PR status:

  • New gRPC and collateral interfaces are implemented
  • Parallelization is implemented on the streamer side via parallel_rowstreamer.go and single_rowstreamer.go
  • Work has begun on parallel_vcopier.go; there is an initial refactor to support multiple concurrent plans; there is no parallelization yet, and vplayer needs to be extracted/encapsulated.

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

cc @rohit-nayak-ps @sougou @deepthi ; no need for code review right now, though you're welcome to, of course.

@shlomi-noach added the Type: Enhancement and Component: VReplication labels on Oct 5, 2021
@shlomi-noach
Contributor Author

Updates:

  • Calls to the send() function in parallel_vstreamer.go are now concurrent; serialization did not make sense, because we are waiting on parallel execution and return codes from vcopier
  • vcopier now applies vstream responses concurrently. Invocations of vplayer/fastforward are serialized; catchup is outside the concurrency logic anyway, and is obviously serialized.
  • I've actually activated the parallel vcopier as the default in vreplication.go; let's see if all the tests break.

@shlomi-noach
Contributor Author

Whoa. All pre-existing tests are passing with the new logic.

I'll start crafting specialized tests.

@shlomi-noach
Contributor Author

Of course all tests passed. They weren't running the new flow after all.

@shlomi-noach
Contributor Author

Most test failures right now seem to originate, again, from the type of testing we do: our endtoend tests look for a specific sequence of queries, and parallelism has now ruined it all...

@shlomi-noach
Contributor Author

Thoughts on the design are welcome

@deepthi
Member

deepthi commented Oct 20, 2021

Any guesses / predictions as to how this might affect the memory usage on the vttablets? Both source and target.

@shlomi-noach
Contributor Author

Great question. It will add n*vstream_packet_size of buffering on the vstreamer; the default vstream_packet_size is 250000, so with n=4, for example, that's roughly an extra 1MB of buffered rows.
Probably the same amount of memory on the vcopier/target side.
I'm trying to think whether there's anything else that is obvious, but I can't find anything else.

@github-actions
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions bot added the Stale label on Jul 16, 2022
@shlomi-noach removed the Stale label on Jul 17, 2022
@github-actions
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions bot added the Stale label on Aug 24, 2022
@github-actions
Contributor

This PR was closed because it has been stale for 7 days with no activity.

@github-actions bot closed this on Aug 31, 2022