-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Design Proposal #2229
Comments
@mch2 Thanks for sharing the detailed comparison. Could you also share the Disk IO metrics for both the approaches, as I can see there is going to be a significant trade-off with the new approach, since the same document will be replicated twice, once as part of the translog and then as part of the Segment Replication. Since we are saving on the compute and memory here, it would be good to offset that against the increased IO provisioning cost for disk which customers would incur here. Also the IO cost would be even grow higher as the size of indices grows, leading to larger segments. Merges would then create even larger (new) segment files which will need fresh replication to replica nodes adding to the network and disk IO costs. |
Thanks @getsaurabh02, Am gathering IOPS and network data and will add it asap. Made updates above to Durability section, fsyncs will still be required by default. |
The new design proposal here is a modification of the existing replication system that introduces new tradeoffs. I am curious about presenting a non-replicated index vs this proposed segment replication vs existing replication system, providing increasing thresholds of reliability with trade offs of performance and resource utilization. There could be a 4th+ options in the future using by using an external durable store and it would be useful to see where it aligns on axis of performance and reliability compared to the other existing options. |
I think it's worth explaining in this issue what the current replication mechanism is, so "copying segment files to replicas, instead of document replication that ... " Collected some questions from an IRL brainstorm:
|
Added total bw used during the test run and IOPS. |
Can we plan to isolate durability? That is the translog and IndexWriter can be in a "black box" that we can optimize for bulk indexing and provide parameters to "tune" durability based on user need? (e.g., a DurabilityPolicy(ies) that enable users to disable durability altogether if their usecase does not require it?) Follow up: that "black box" can also be tweaked for bulk indexing. Today the
+1 I love how merges are separated today, but what about the new Merge On Refresh policy just added? Have we benchmarked with this enabled? Should we consider it to be the default when using segment replication? |
can we turn on/off seg replication? instead of having this on index creation level. |
I'd like to see some discussion on the consistency changes that a user will experience with segment replication. I don't think OpenSearch gives strong transactional guarantees today with document replication, but I believe all acknowledged writes will be visible to a user regardless of whether a query hits a primary or replica. I think this will change with segment replication. |
I'd suggest just pointing to the RFC section that covers document replication.
Is this refresh coupled to the segment replicaiton process i.e. do we forcefully refresh once replication is complete? Does this mean we need to disable the on-schedule refresh on replicas?
Will every checkpoint be queued on the replica?
When you say "created as empty" does this imply a change to a "lazy" form of shard startup compared to the exisitng method of shard recovery? |
Thanks @mch2 for this well thought through design proposal. I like the approach, and it is exciting to think of the possibilities that segment replication will open up. The performance improvements in your POC are impressive.. If I understood correctly, the primary pushes a notification to replicas (via a listener) each time it refreshes, and replicas then invoke a GET checkpoint API to get the latest delta of segments. Have you considered having replicas just poll the primary periodically for new segments? __ On similar lines, segment replication, could in future, enable OpenSearch to support different number of indexing and search shards for an index. Instead of the current setup of having replicas mapped to a single primary, we could have a search shard (replica) subscribe to multiple indexing shards (primaries), fetch their segments and add them into its own directory. This would give users a true indexing and search compute separation. The replicas subscribing to a primary and periodically pulling segments from it model lends itself more naturally to such a |
This is a part of the process of merging our feature branch - feature/segment-replication - back into main by re-PRing our changes from the feature branch. GatedAutoCloseable currently wraps a subclass of RefCounted. Segment replication adds another subclass, but this also wraps RefCounted. Both subclasses have the same shutdown hook - decRef. This change makes the superclass less generic to increase code convergence. The breakdown of the plan to merge segment-replication to main is detailed in opensearch-project#2355 Segment replication design proposal - opensearch-project#2229 Signed-off-by: Kartik Ganesh <[email protected]>
I have few questions on the config part.
|
* Refactoring GatedAutoCloseable to AutoCloseableRefCounted This is a part of the process of merging our feature branch - feature/segment-replication - back into main by re-PRing our changes from the feature branch. GatedAutoCloseable currently wraps a subclass of RefCounted. Segment replication adds another subclass, but this also wraps RefCounted. Both subclasses have the same shutdown hook - decRef. This change makes the superclass less generic to increase code convergence. The breakdown of the plan to merge segment-replication to main is detailed in #2355 Segment replication design proposal - #2229 Signed-off-by: Kartik Ganesh <[email protected]> * Minor refactoring in RecoveryState This change makes two minor updates to RecoveryState - 1. The readRecoveryState API is removed because it can be replaced by an invocation of the constructor 2. The class members of the Timer inner class are changed to private, and accesses are only through the public APIs Signed-off-by: Kartik Ganesh <[email protected]> * Update RecoveryTargetTests to test Timer subclasses deterministically This change removes the use of RandomBoolean in testing the Timer classes and creates a dedicated unit test for each. The common test logic is shared via a private method. Signed-off-by: Kartik Ganesh <[email protected]> * Move the RecoveryState.Timer class to a top-level class This will eventually be reused across both replication use-cases - peer recovery and segment replication. Signed-off-by: Kartik Ganesh <[email protected]> * Further update of timer tests in RecoveryTargetTests Removes a non-deterministic code path around stopping the timer, and avoids assertThat (deprecated) Signed-off-by: Kartik Ganesh <[email protected]> * Rename to ReplicationTimer Signed-off-by: Kartik Ganesh <[email protected]> * Remove RecoveryTargetTests assert on a running timer Trying to serialize and deserialize a running Timer instance, and then checking for equality leads to flaky test failures when the ser/deser takes time. Signed-off-by: Kartik Ganesh <[email protected]>
* Refactoring GatedAutoCloseable to AutoCloseableRefCounted This is a part of the process of merging our feature branch - feature/segment-replication - back into main by re-PRing our changes from the feature branch. GatedAutoCloseable currently wraps a subclass of RefCounted. Segment replication adds another subclass, but this also wraps RefCounted. Both subclasses have the same shutdown hook - decRef. This change makes the superclass less generic to increase code convergence. The breakdown of the plan to merge segment-replication to main is detailed in #2355 Segment replication design proposal - #2229 Signed-off-by: Kartik Ganesh <[email protected]> * Minor refactoring in RecoveryState This change makes two minor updates to RecoveryState - 1. The readRecoveryState API is removed because it can be replaced by an invocation of the constructor 2. The class members of the Timer inner class are changed to private, and accesses are only through the public APIs Signed-off-by: Kartik Ganesh <[email protected]> * Update RecoveryTargetTests to test Timer subclasses deterministically This change removes the use of RandomBoolean in testing the Timer classes and creates a dedicated unit test for each. The common test logic is shared via a private method. Signed-off-by: Kartik Ganesh <[email protected]> * Move the RecoveryState.Timer class to a top-level class This will eventually be reused across both replication use-cases - peer recovery and segment replication. Signed-off-by: Kartik Ganesh <[email protected]> * Further update of timer tests in RecoveryTargetTests Removes a non-deterministic code path around stopping the timer, and avoids assertThat (deprecated) Signed-off-by: Kartik Ganesh <[email protected]> * Rename to ReplicationTimer Signed-off-by: Kartik Ganesh <[email protected]> * Remove RecoveryTargetTests assert on a running timer Trying to serialize and deserialize a running Timer instance, and then checking for equality leads to flaky test failures when the ser/deser takes time. Signed-off-by: Kartik Ganesh <[email protected]> (cherry picked from commit c7c410a)
…#3014) * Refactoring GatedAutoCloseable to AutoCloseableRefCounted This is a part of the process of merging our feature branch - feature/segment-replication - back into main by re-PRing our changes from the feature branch. GatedAutoCloseable currently wraps a subclass of RefCounted. Segment replication adds another subclass, but this also wraps RefCounted. Both subclasses have the same shutdown hook - decRef. This change makes the superclass less generic to increase code convergence. The breakdown of the plan to merge segment-replication to main is detailed in #2355 Segment replication design proposal - #2229 Signed-off-by: Kartik Ganesh <[email protected]> * Minor refactoring in RecoveryState This change makes two minor updates to RecoveryState - 1. The readRecoveryState API is removed because it can be replaced by an invocation of the constructor 2. The class members of the Timer inner class are changed to private, and accesses are only through the public APIs Signed-off-by: Kartik Ganesh <[email protected]> * Update RecoveryTargetTests to test Timer subclasses deterministically This change removes the use of RandomBoolean in testing the Timer classes and creates a dedicated unit test for each. The common test logic is shared via a private method. Signed-off-by: Kartik Ganesh <[email protected]> * Move the RecoveryState.Timer class to a top-level class This will eventually be reused across both replication use-cases - peer recovery and segment replication. Signed-off-by: Kartik Ganesh <[email protected]> * Further update of timer tests in RecoveryTargetTests Removes a non-deterministic code path around stopping the timer, and avoids assertThat (deprecated) Signed-off-by: Kartik Ganesh <[email protected]> * Rename to ReplicationTimer Signed-off-by: Kartik Ganesh <[email protected]> * Remove RecoveryTargetTests assert on a running timer Trying to serialize and deserialize a running Timer instance, and then checking for equality leads to flaky test failures when the ser/deser takes time. Signed-off-by: Kartik Ganesh <[email protected]> (cherry picked from commit c7c410a) Co-authored-by: Kartik Ganesh <[email protected]>
This document outlines a proposal for implementing segment replication without additional external dependencies. This is a WIP. Please feel free to leave comments below with any thoughts or suggestions!
Objective:
Copy Lucene’s immutable segment files to replicas instead of document replication where documents are sent to and re-indexed on all replicas. This will improve indexing throughput and lower resource utilization on replicas at the expense of increased network usage. The RFC has a good introduction to both replication strategies.
Requirements:
With the feature enabled, primary shards will be the only shards indexing into Lucene. Replica shards exist as read only copies. After a primary shard creates new segment files, either from indexing or a segment merge, they are sent to replicas where they are made searchable.
Proposal:
Segment replication will be triggered when a primary shard refreshes. A refresh on a primary shard occurs for various reasons, for example on a schedule, after a merge, or flush. During a refresh, Lucene performs an os flush that writes the latest in memory segment files to the filesystem, making them available for replicas to fetch. A listener will trigger after the primary finishes refresh and send a notification to replicas. This notification will include a ReplicationCheckpoint that includes the sequence number associated with the latest document indexed, the latest commit generation, and primary term. If the replica determines that it is behind this checkpoint, it initiates a replication process.
Read only replicas:
To make replica shards read only, we need to disable their IndexWriter and noop all operations that attempt to interact with it inside of the engine. The POC does with a config param sent to the engine to conditionally start up a writer. To make this cleaner, creating separate engine and shard implementations for replicas is currently being explored.
Replication checkpoints:
ReplicationCheckpoints will be processed serially when received by replicas. If there is an active replication event when a checkpoint is received, the replica will store and attempt to process the checkpoint after it completes. Primary shards will compute the latest metadata and segmentInfos, even if this data is ahead of the requested checkpoint. The actual checkpoint processed will be computed and returned to the replica and compared against the latest received checkpoint.
Durability:
With the initial implementation, all documents will still be sent to each shard and persisted in the transaction log. Operations will not be purged from a replica's copy of the translog until a new commit point is received. While we have the translog to replay operations, we must still guarantee the stability of the index on disk so that it can be started and operations replayed. To maintain OpenSearch's durability guarantee, segments on both primary and replica shards will be by default fsynced to disk on commit. On replicas we recognize that a commit has occurred with an increase in the commit generation sent with the ReplicationCheckpoint.
In addition to fsyncs, we need to ensure that at all times replicas have the previous commit point. It is possible for a replica to be added after a Primary has been created, so sending the latest SegmentInfos info from a primary can leave the replica in a corrupt state. To ensure each replica has a both the latest SegmentInfos and all segments referenced by the latest commit point, the primary will include both in its metadata list returned to replicas.
For users who do not require this level of durability or it is acceptable to recover from a snapshot, I propose we add an additional setting that disables fsyncs on replicas and prevents sending the additional merged away files.
Merge:
There is no special logic for merges. After a primary completes a merge, it will refresh to open up a new reader on new segments and mark the old segments for deletion. This refresh will trigger our listener and publish a new ReplicationCheckpoint to replicas.
Shard startup:
Currently when replica shards are started they go through the peer recovery service and recover from the active primary. With segment replication shards will be created as empty and a notification sent to the primary to initiate tracking on the new replica. This ensures that all future checkpoints will be sent to the replica and will initiate replication on the next refresh.
Performance:
Early performance tests show improvements with segment replication enabled. This run using OpenSearch benchmark showed a ~40-45% drop in CPU and Memory usage, a 19% drop in p99 latency and a 57% increase in p100 throughput.
Data from more test runs will be added here as we continue building out the feature and run tests with different cluster configurations.
Failover: #2212
Shard Allocation: #6210
Configuration:
Segment replication will be disabled by default and enabled with a setting during index creation. To begin with this will be a simple boolean setting.
FAQ:
This will be a lot of load on primary shards. They will have to first fetch and compute the metadata diff from every replica and then orchestrate their copy and hold that state between requests.
While the first implementation will use the primary shard as our object store and transport layer for messaging, we will extract interfaces allowing for these to be swapped out for other options.
This should be supported in future versions but not part of the initial launch of the feature.
The text was updated successfully, but these errors were encountered: