Analyzing VReplication behavior #8056
I realize this is a terribly long read. Not sure I'd survive it myself. If there are any thoughts before I pursue the suggested path, please let me know.
Thanks for the detailed explanation/comparison of the two approaches. Since I have worked only with the current VReplication algo, I don't have an intuitive feel for which algo will perform better in at-scale sharded Vitess deployments. Some comments/thoughts, more to clarify my understanding, in no particular order:
I like the terms "stream-first" and "copy-first".
The current state of #8044: it's stable and working, and supports both approaches with minimal abstraction. The difference between the two isn't that big, really, in terms of code changes.
It will be absolutely opaque to the user. According to #8044: Materialize with GROUP BY ==> copy-first. Online DDL which adds a UNIQUE constraint ==> copy-first. The rest: stream-first.
I agree that copy-first is more efficient and does not waste double-writes. Assuming we proceed with stream-first, we can also apply PK comparisons for ...
Not sure stream-first has an advantage over copy-first in this regard.
Per #8044 the state is identical to copy-first. min-PK == lastPk anyway, and max-PK can be re-evaluated if needed (in case vreplication is restarted).
Right. And I wasn't even planning on running concurrent copies of chunks of the same table; I was only considering parallelizing multiple table writes.
Summary of video discussion 2021-06-01 between @deepthi, @rohit-nayak-ps and myself

We scheduled the discussion in an attempt to get a shared understanding and consensus on how we should proceed with VReplication parallelization/optimization efforts. There are two use cases in two different trajectories:
Single table parallelization

First we concluded we must experiment a bit to see how much we can gain by parallelizing writes of a single table.

Single table optimization plausibility

I ran a hacky experiment on a ...
I noticed a major difference depending on whether or not the table has a ...
The latter is a bit discouraging.

Single table optimization implementation

This is entirely on the vreplication (target) side, irrespective of any changes in the streamer (source) side. The target side receives rows up to some max-packet-size. As @vmg noted, nothing forces us to wait for the packet to fill up before writing down rows to the target table. We can choose to flush rows more frequently. That part is easy. Somewhat more complicated is how we track each such write. Today we only track the last-PK written in _vt.copy_state. We are yet to conclude whether it's worth trying to parallelize writes for a single table.

Multi table parallelization

Let's consider the existing logic, which @rohit-nayak-ps coined copy-first. As illustrated in the original comment, this logic uses a transaction with consistent snapshot. Our options are:
1. Multiple streams

This was the approach first considered when the idea for parallel vreplication came up. Discussion points for this approach:
2. Change of logic: stream-first

#8044 is a valid (relevant tests are passing) POC where we change the logic to avoid the consistent snapshot.
Downside is that this doesn't exist yet, and is a change in paradigm (we will need to build trust). Creating a POC for this is basically implementing full functionality. We need a solid and large test case.

Summary

@deepthi suggests proceeding with a POC for (2) Change of logic: stream-first. Thoughts welcome.
Also, based on the discussion, it is likely that the consistent snapshot approach will simply not work for large VReplication-based online DDL (because we have to copy from primary->primary, and not from a replica/rdonly).
I’m not entirely up to speed with this discussion, but I wanted to chime in with some initial thoughts. Shlomi, thanks for the detailed write-up and all the careful considerations. In terms of the technical details of the gh-ost flow, I will have some follow-up questions. I'm still reasoning about the algorithm. I’m asking myself a somewhat philosophical question: are the performance benefits from this new approach worth a rewrite? My initial answer to this question is no. Here are some points on why I’m thinking this way:
===== One last side note that I’ve been thinking about while digesting this proposal. I agree that the existing implementation is complex. But I would argue that it is conceptually simpler. I found that it is straightforward to reason about and to understand why it is correct. The opposite seems true for the gh-ost flow: the implementation is simpler, but the correctness of the approach is more complex. Granted, this could be just a side effect of my familiarity with how things work today. =====
I think this is a good property of the existing VReplication. I would love to not lose this.
Thank you for your thoughts @rafael! Some comments on the technical side:
That's commendable! I think some use cases require better performance: importing an entire database from an external source is one example where we've projected a multi-week operation in a real use case with the current implementation.
The new approach is not all that different from the existing approach. If you take #8044, the change of logic is rather minor.
It will.
It will.
I am yet to provide a mathematical proof of the gh-ost logic. I'm not sure how to formalize it. But it has been tested well. Anecdotally, my first logic for online schema changes, oak-online-alter-table, on which pt-online-schema-change is based, provided no test suite, and to my understanding there is still no testing logic for pt-osc. We will build trust in the new logic by testing. We do have the luxury of already having a mechanism with a test suite, and can test the new logic by running it through the existing suite. FWIW, coming from the ...
Not sure what you mean? Did you mean the property of copying from a replica/rdonly? This will not change. Again, the switch of logic as presented in #8044 is basically it (though it does not introduce parallelism yet). There is no fundamental change to the flow. It's the same streamer, same vcopier, same vplayer. The streamer just reads rows in a different way and flushes them in a different way. The player runs a REPLACE INTO rather than an INSERT. To me the "lock/consistent snapshot/big SELECT" is intimidating. I can't imagine running an 8-hour operation that would need to maintain an open transaction with consistent snapshot on a primary. In my experience this can lead to an outage. That was my main concern embarking on the task. Assuming I'm correct about the risk on a primary (and that was my clear experience at my previous job), that means you always have to rely on a replica being handy to run some operation. I'm not sure how much of a big deal that is. For Online DDL, the design is to run/read/write on the primary. It's possible to run replica->primary, I think, but I haven't explored it yet.
Oh, if you already have use cases where it will take multiple weeks, I definitely agree it requires better performance. I'm still curious about more details of what kind of external databases these are. With my current understanding of performance, it seems these are external databases that are larger than 10 TB, and that also reshard many tables as part of this process. Something like that?? My point around performance here is that so far, such issues have been solvable without major changes. There are other issues (like the one I mentioned for Materialize tables) that are blockers and for which we don't have solutions.
Yes, that was the property. Ah great, this is not changing.
The original implementation here didn't use a consistent snapshot, and it was a short lock. So it was not held open for multiple hours. Would you have the same concern in that case? If I recall correctly, the consistent snapshot was introduced to further reduce the duration of the locks. I wonder if that optimization actually just added a reliability risk. The previous implementation was used at Slack extensively and we never ran into issues. For context, it was not for copy, but for VDiff. We used it when we migrated our legacy systems to Vitess. This is a similar use case to the one you describe of copying an external database. The main difference in our approach is that we didn't use the copy phase; we started from a backup and used vreplication to catch up. To your point, among others, performance was a concern for this use case and we avoided copying the data altogether.
If this is the case, I'm less concerned. I haven't actually looked at the PR yet; I was catching up with the discussion here. I still have to do my homework of reflecting more on the new algorithm.
I am not really familiar with the previous implementation, myself, so I have no insights on that. I'll ask around though.
I'm gonna pivot a bit and focus again on the current implementation, coined copy-first. Moreover, I've come to realize that with ... So, gonna look more closely into how we can optimize with the current logic.
I have a rough draft (in my brain) of how we might run concurrent table copy using the existing VReplication logic, on top of a single stream. It will take me a while to write it down in words, and I'm still unsure how complex it would be to implement it in code.
Some design thoughts and initial work on parallel vreplication copy: #8934
I was looking into VReplication behavior as part of ongoing improvement to VReplication for Online DDL. It led me into a better understanding of the VReplication flow, and what I believe to be bottlenecks we can solve. I did some benchmarking in a no-traffic scenario (traffic scenario soon to come) to better understand the relationship between the different components. I'd like to share my findings. This will be long, so TL;DR:
Sample PR to demonstrate potential code changes: #8044
Disclaimer: some opinions expressed here based on my personal past experience in production. I realize some may not hold true everywhere. Also, tests are failing, so obviously I haven't solved this yet.
But let's begin with a general explanation of the VReplication flow (also see Life of a Stream). I'll use an Online DDL flow, which is mostly simple: copy shared columns from source to target. Much like CopyTables.
The general VReplication flow
The current flow achieves perfect accuracy by tracking GTIDs, locking tables and taking consistent read snapshots. For simplicity, we illustrate the flow for a single-table operation. It goes like this:
- We have a source (`vstreamer`) and target (`vreplicator`)
- `vstreamer` (rowstreamer) is responsible for reading & streaming table data from source to target
- `vreplicator` has two main components:
  - `vcopier`: reads rows streamed from `vstreamer` and `INSERT`s them to the target table
  - `vplayer`: tails the binary logs and applies relevant events to the target table

These building blocks are essential. What's interesting is how they interact. We identify three phases: Copy, Catch-up, Fast-forward
Copy
- `vplayer` will only apply changes to rows already copied, or anything before those queries. So we begin with `vplayer` doing nothing.
- `rowstreamer` begins work.
- `rowstreamer` runs `lock table my_table READ` (writes blocked on the table), opens a transaction `with consistent snapshot`, and reads the current GTID. Which means: a locking & blocking operation, which gets us a GTID and a transaction that is guaranteed to read data associated with that exact GTID (see the sketch below).
- In that transaction, `rowstreamer` runs a `select <all-relevant-columns> from <my_table> order by <pk-columns>`. This is an attempt to read basically all columns, all rows from a table in a single transaction, and stream those results back.
- `rowstreamer` sends down the GTID value (to be intercepted by `vcopier`).
- `rowstreamer` reads row by row; when the data it has accumulated from the query exceeds `-vstream_packet_size`, it sends down accumulated rows, runs some housekeeping, and continues to accumulate further rows.
- Meanwhile, `vcopier` got the GTID, and takes note.
- `vcopier` now gets the first batch of rows. It writes them all at once, in a single transaction, to MySQL. In that same transaction, it updates the `_vt.copy_state` table with the identities of written rows (identified by last PK).

The flow then switches to Catchup.
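For illustration, the snapshot setup described in the bullets above amounts to something like the following SQL on the source. This is a minimal sketch: the table and column names are placeholders, and the real rowstreamer may issue the lock and the snapshot transaction on separate connections.

```sql
-- Block writes while we establish a consistent view of the table
LOCK TABLES my_table READ;

-- Open a transaction whose reads are pinned to a single point in time
-- (in practice this may happen on a second connection while the lock is held)
START TRANSACTION WITH CONSISTENT SNAPSHOT;

-- Record the GTID position that corresponds to this snapshot
SELECT @@GLOBAL.gtid_executed;

-- The lock is released once the snapshot and GTID have been captured
UNLOCK TABLES;

-- The "Grand Select": stream all relevant columns, ordered by primary key
SELECT id, col1, col2 FROM my_table ORDER BY id;
```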
Catchup
- `vplayer` processes events from the binary log. By now the binary logs have accumulated quite a few events: `insert`/`update`/`delete` on our table rows.
- In the same transaction where we apply the event, we update the `_vt.vreplication` table with the associated GTID.
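A minimal sketch of what one such catch-up step looks like on the target. The row values and GTID string are made up, and the exact `_vt.vreplication` columns used here are illustrative:

```sql
-- Apply one binlog event and record replication progress atomically
BEGIN;
-- the replicated row event for our table
UPDATE my_table SET col1 = 'new-value' WHERE id = 42;
-- track the GTID position reached, in the same transaction
UPDATE _vt.vreplication SET pos = 'MySQL56/xxxxxxxx:1-100' WHERE id = 1;
COMMIT;
```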
Side notes

If the table is small enough, we might actually suffice with the flow thus far. Remember, `vstreamer` was actually selecting all rows from the table. So if all went well, we copied all rows, we caught up with binlog events, and we're good to cut over or complete the workflow.

However, if the table is large enough, there are situations where we are not done:

- A network failure may prevent `vstreamer` from sending the rows.
- `vreplicator` has a `copyTimeout` of 1 hour. When that timeout expires, `vreplicator` sends an `io.EOF` to `vstreamer`, which aborts the operation.

In either of these scenarios, `vreplicator` has the last processed GTID and the last read PK, and can resume work by:

Fast forward
This is actually a merger between both Copy & Catchup:
- `vreplicator` asks `vstreamer` to prepare another snapshot.
- `vreplicator` tells `vstreamer`: "start with this PK".
- `vstreamer` creates a new transaction with consistent snapshot. It prepares a new query: `select <all-relevant-columns> from <my_table> where <pk-columns> > :lastPK order by <pk-columns>`. So it attempts to read all table rows, starting from the given position (rows before that position are already handled by `vcopier`).
- `vstreamer` sends down the GTID for the transaction.
- `vcopier` does not proceed to read rows from `vstreamer`. In the time it took to create the new consistent snapshot, a few more events have appeared in the binary log. Because the flow keeps strict ordering of transactions, the flow wants to first apply all transactions in the binlog up to the GTID just received from `vstreamer`.
- `vplayer` applies those rows.
- `vstreamer` reads and streams more rows; it either manages to read till the end of the table or it does not.
- `vcopier` writes rows, `vplayer` catches up with events, and so forth.

Benchmark & analysis of impact of factors
Let's identify some factors affecting the flow:

- `vstream_packet_size`: the smaller it is, the more back-and-forth gRPC traffic we will see between `vstreamer` and `vcopier`: there will be more (smaller) batches of data sent from source to target.
- `copyTimeout`: the smaller it is, the more we will interrupt the mega-query `rowstreamer` attempts to run. Theoretically, we could set that value to near infinite, in the hope of delivering the entire table in one single sweep. But as mentioned before, network failures can happen anyway.

So with a small value, we will:

- initiate more gRPC calls (from `target` to `source`: "get me more rows please")
- run more `LOCK TABLES my_table READ` (and consistent snapshot) cycles
While not tunable in the `master` branch, I've made it possible for the source query to have a `LIMIT <number-of-rows>`. What if we voluntarily close the copy cycle on the `vstreamer` side?

See these changes to support the new behavior:

https://github.com/vitessio/vitess/pull/8044/files#diff-a1cffc790e352be31a3f600180d968b6d96f3bd90acf4bfa0a49e3a66611558cR227-R332
https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R206
Spoiler: smaller reads lead to more gRPC communication. This time `vstreamer` terminates the communication. `vcopier` says "well, I want more rows" and initiates the next copy. Again, this means:

- more gRPC calls (from `target` to `source`: "get me more rows please")
- more `LOCK TABLES my_table READ` statements
To be discussed later, I've also experimented with the overhead of the consistent snapshot and of the table lock.
The below benchmarks some of these params. It is notable that I did not include `vstream_packet_size` tuning in the below: `vstream_packet_size >= 64K` seems to be good enough, with similar behavior across values. Keeping `vstream_packet_size` high is desirable. But what's interesting is how the flow is coupled with that value, to be discussed later. In the benchmarks below, `vstream_packet_size` is `640000` (640k).

The benchmark is to run an Online DDL on a large table. In Online DDL both source and target are the same vttablet, so obviously the same host, and there's no cross-host network latency. It's noteworthy that there's still a network stack involved: the source and target communicate via gRPC even though they're in the same process.
The table:
The table takes `1.4GB` on disk. So this table is quite simple, with some text content to make it somewhat fat.
For now, only consider rows With Snapshot=TRUE.
This is a subset of some more experiments I ran. The results are consistent to within 2-3 seconds across executions. What can we learn?

- The `master` branch: no `LIMIT` on the `SELECT` query, 1 hour timeout (longer than it takes to copy the table).
- `LIMIT`s are catastrophic.

Now, I began by blaming this on gRPC. However, the above is not enough information to necessarily blame gRPC. Smaller timeout == more calls to vstreamer. Smaller LIMIT == more calls to vstreamer. It makes sense to first suspect the performance of a `vstreamer` cycle. Is it the `LOCK TABLES`, perhaps? Is it the transaction `WITH CONSISTENT SNAPSHOT`? Anything else about `vstreamer`? The following indicator gives us the answer: whether With Snapshot is `TRUE` or `FALSE`.

In the first iteration of the PR, I just created a new `streamWithoutSnapshot` function:

https://github.com/vitessio/vitess/pull/8044/files#diff-efb6bf2b0113d05b4466e1c54f4abf7ccad01637cc8ad1f01ceb2bb916dd67fdR47-R64
At this time I'm not even trying to justify that it is correct. It is generally not. But, we're in a no-traffic scenario, which means there's no need to lock the table, no need for consistent snapshot. So basically, this function reads the rows the same way, without the lock & snapshot overhead.
Further development, which I'll discuss later, actually creates smaller `LIMIT` queries, but streams continuously. I'll present it shortly, but the conclusion is clear: if you `LIMIT` without creating new gRPC calls, performance remains stable. BTW, @vmg is similarly looking into decoupling the value of `vstream_packet_size` from the `INSERT`s on the vcopier side; we can batch smaller chunks of `INSERT`s from one large `vstream_packet_size`.
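To make the idea concrete, here is a sketch of what the chunked source reads look like, assuming an integer primary key `id` and a hypothetical chunk size of 10000. The point is that these queries are issued back-to-back inside one streaming cycle, rather than each one triggering a new gRPC call:

```sql
-- First chunk
SELECT id, col1, col2 FROM my_table ORDER BY id LIMIT 10000;

-- Subsequent chunks: resume from the last PK of the previous chunk,
-- still within the same streaming cycle (no new gRPC call)
SELECT id, col1, col2 FROM my_table WHERE id > :lastPK ORDER BY id LIMIT 10000;
-- ... repeated until a chunk returns fewer than 10000 rows
```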
But, but, ... What's wrong with the current flow?
It looks like a winner. BTW it is also 2 minutes faster than a `gh-ost` migration on the same table! So is there anything to fix?

A few things:

- A `SELECT <all-relevant-columns> FROM my_table ORDER BY <pk-columns>`, in a `CONSISTENT SNAPSHOT` transaction, is unsustainable. Evidently (and discussed internally) quite a few users have been using this without reporting an issue. Either that executed on a replica, or their workload is somehow different, or my claim is invalid. Or it's in some gray area in between. I just can't shake off my experience where a high `history list length` predicts an imminent outage in production.
- Running `SELECT ...` from multiple tables at once in a single transaction... A transaction requires us to serialize our queries.

So what I'm looking into now is how to break down the Grand Select into multiple smaller selects, while:
(1) is easy, right? Just use `LIMIT 10000`. But we see how that leads to terrible run times.

Other disadvantages to the current flow:
- Progress is less predictable. In `gh-ost`, there is frequent switching between rowcopy and binlog catchup, and so the progress becomes more predictable. I'll discuss the `gh-ost` logic shortly.
- The flow is wasteful for rows that get deleted meanwhile: `vcopier` will copy and write down those rows, and `vplayer` will later delete them.

Of course, let's not forget about the advantages of the current flow:

- `vplayer` only applies binlog events for rows already copied, i.e. rows whose PK is not `> :lastPK`. That is, the flow requires this behavior, but a side effect is that we only process relevant rows.

Anyway. If only for the sake of parallelism, I'm looking into eliminating the consistent snapshot and breaking down the Grand Select query into smaller, `LIMIT`ed queries.

The gh-ost flow
I'd like to explain how `gh-ost` runs migrations, without locking and without consistent snapshots or GTID tracking. `gh-ost` was written when we were not even using GTIDs, but to simplify the discussion and to make it more comparable to the current VReplication flow, I'll describe `gh-ost`'s flow as if it were using GTIDs. It's just syntactic sugar. I think we can apply the `gh-ost` flow to Vitess/VReplication, either fully or partially, to achieve parallelism.

The flow has two components: rowcopy and binlog catchup. There is no fast forward. Binlog catchup is prioritized over rowcopy. Do notice that `gh-ost` rowcopy operates on a single machine, so it runs something like `INSERT INTO... SELECT` rather than the VReplication two-step of `SELECT` into memory & `INSERT` of the read values.

The steps:
- Tail the binary log and process events on our table: is it an `INSERT`? A `DELETE`? If it's an `UPDATE`?
  - If it's a `DELETE`, we delete the row, whether it actually exists or not.
  - If it's an `INSERT`, we convert it into a `REPLACE INTO` and execute.
  - If it's an `UPDATE`, we convert it into a `REPLACE INTO` and execute. This means an `UPDATE` will actually create a new row.
- For rowcopy, we evaluate the next range of, say, `1000` rows to copy. Using `1000` as an example, we only copy between `10` and `10000` rows at a time; these limits are hard coded into `gh-ost`.
- We then run `INSERT IGNORE INTO _my_table_gho SELECT <relevant-columns> FROM my_table WHERE <pk in computed range>`.

Notice that we `INSERT IGNORE` (see the sketch below).
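A sketch of the two kinds of writes, side by side, assuming a table `my_table` with primary key `id` and a ghost table `_my_table_gho`; names, values and ranges are illustrative:

```sql
-- Rowcopy: copy the next computed chunk, yielding to any row
-- that was already written by the binlog catchup path
INSERT IGNORE INTO _my_table_gho (id, col1, col2)
SELECT id, col1, col2
FROM my_table
WHERE id BETWEEN 10001 AND 11000;

-- Binlog catchup: an INSERT or UPDATE event on the original table
-- is applied as a REPLACE INTO, which always overwrites
REPLACE INTO _my_table_gho (id, col1, col2)
VALUES (10042, 'value-from-binlog', 7);

-- A DELETE event is applied as a plain DELETE,
-- whether or not the row exists in the ghost table yet
DELETE FROM _my_table_gho WHERE id = 10042;
```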
Methodic break. We prioritize binlog events over rowcopy. This is done by making the binlog/catchup writes a `REPLACE INTO` (always succeeds and overwrites), and making rowcopy an `INSERT IGNORE` (always yields in case of conflict). Consider that `UPDATE` we encountered in the catchup phase, where we hadn't even copied a single row yet. That `UPDATE` was converted to a `REPLACE INTO`, creating a new row. Later, in the future, rowcopy will reach that row and attempt to copy it. But it will use `INSERT IGNORE`, so it will yield to the data applied by the binlog event. If no transaction ever changed that row again, then the two (the `REPLACE INTO` and the `INSERT IGNORE` for the specific row) will have had the same data anyway.

It is difficult to mathematically formalize the correctness of the algorithm, and it's something I wanted to do a while back. Maybe some day. In any case, it's the same algorithm created in `oak-online-schema-change`, copied by `pt-online-schema-change`, and continued by `fb-osc` and of course `gh-ost`. It is well tested and has been in production for a decade.

Back to the flow:
Some characteristics of the `gh-ost` flow:

- It does not use `LOCK TABLES` (only the final cut-over phase does).
- Chunk size matters less than you might expect (e.g. comparing a `50` row chunk size and a `1000` row chunk size; for some workloads it could make a difference).
- The flow relies on the table having a `UNIQUE KEY`. Because of the `INSERT IGNORE` and `REPLACE INTO` queries, when you add a new `UNIQUE KEY`, `gh-ost` will silently drop duplicate rows while copying the data or while applying binlog events to the new table. So it cannot be used to validate a new `UNIQUE` constraint.
- We prioritise binlog/catchup just because we want to avoid a log backlog and the risk of binlog events being purged, but logically we could switch any time we like.
- `gh-ost` applies binlog events for rows we haven't copied yet. We could skip those events like VReplication does, at the cost of tracking the PK for the table. It is in a sense the opposite of the VReplication wasteful scenario: VReplication is wasteful for deletes, `gh-ost` is wasteful for `INSERT`s and `UPDATE`s.

Where/how we can merge gh-ost flow into VReplication

I'll begin by again stressing: the `gh-ost` flow cannot solve the issue of adding a `UNIQUE KEY`. The resulting table is consistent, but the user might expect an error if duplicates are found.

Otherwise, the process is sound. The PR actually incorporates `gh-ost` logic now:

- `UPDATE`s remain `UPDATE`s for now
- `INSERT` is converted into `REPLACE INTO`: https://github.com/vitessio/vitess/pull/8044/files#diff-944642971422bb00ed68466f784f8b586f5e800f57e0c46002d6194c517cb1a0R531
- `vstreamer` runs multiple small `SELECT`s (with `LIMIT`),
but without terminating the streaming; the results are all concatenated and streamed back:

https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R268-R286
https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R166-R197
https://github.com/vitessio/vitess/pull/8044/files#diff-efb6bf2b0113d05b4466e1c54f4abf7ccad01637cc8ad1f01ceb2bb916dd67fdR47-R64

- ... like `gh-ost` does.

benchmark results
This works for Online DDL; as mentioned above, other tests are failing: there's still some termination condition to handle. We'll get this, but the proof of concept:
The above shows we've managed to break the source query into multiple smaller queries, and remain at exact same performance.
It is also the final proof that our bottleneck was gRPC to begin with; not the queries, not database performance.
How can the new flow help us?
- On the `vcopier` side, we could split it again, and write `1000` rows of this table and `1000` of that table, concurrently.
- With the `gh-ost` flow, we don't even need to track the PK per table; this simplifies the logic.

Different approaches?
At the cost of maintaining two different code paths, we could choose to:
I'm still unsure about aggregation logic/impact, and @sougou suggested Materialized Views are particularly sensitive to chunk size and prefer mass queries.
- ... `UNIQUE KEY` constraint

To be continued
I'll need to fix outstanding issues with the changed flow, and then run a benchmark under load. I want to say I can predict how the new flow will be so much better - but I know reality will have to prove me wrong.
Thoughts are welcome.
cc @vitessio/ps-vitess @rohit-nayak-ps @sougou @deepthi @vmg