-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for incremental download of translog files #16204
base: main
Are you sure you want to change the base?
Add support for incremental download of translog files #16204
Conversation
❌ Gradle check result for cd08d89: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
cd08d89
to
f17a378
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #16204 +/- ##
============================================
- Coverage 72.10% 72.08% -0.03%
+ Complexity 64862 64858 -4
============================================
Files 5307 5308 +1
Lines 302606 302720 +114
Branches 43717 43734 +17
============================================
+ Hits 218208 218228 +20
- Misses 66541 66542 +1
- Partials 17857 17950 +93 ☔ View full report in Codecov by Sentry. |
f17a378
to
b8f56ea
Compare
❌ Gradle check result for b8f56ea: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
833b508
to
535bc28
Compare
❌ Gradle check result for 535bc28: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
535bc28
to
ce18c9b
Compare
❌ Gradle check result for ce18c9b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java
Show resolved
Hide resolved
...er/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferMetadataHandler.java
Outdated
Show resolved
Hide resolved
* | ||
* - Magic number (Uint32): A constant value that identifies the start of the footer. | ||
* - Algorithm ID (Uint32): The identifier of the checksum algorithm used. Currently, this is always 0. | ||
* - Checksum (Uint64): The checksum of the entire translog data, calculated using the specified algorithm. | ||
*/ | ||
public class TranslogFooter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious will Translog Footer only apply to remote translog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the Translog footer will be applied to both Remote and Local translog. The footer will be applied at the time of generation rollover, when the translog file becomes immutable. By instantiating TranslogCheckedContainer
for both Local and Remote translog, we will now have the checksum pre-calculated for both, which will be written into footer.
For remote translog, upon each sync, the files are transferred to remote and the generation is rolled over.
For local translog, the sync can happen independently of generation rollover. So, we will keep writing to file until the rollover happens.
server/src/main/java/org/opensearch/index/translog/TranslogFooter.java
Outdated
Show resolved
Hide resolved
ce18c9b
to
6ea2adb
Compare
❌ Gradle check result for 6ea2adb: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
translogCheckedContainer = new TranslogCheckedContainer(byteArrayOutputStream.toByteArray()); | ||
} | ||
|
||
// Enable translogCheckedContainer for remote as well as local translog. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we enabling translogCheckedContainer
for local translogs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per discussion with @Bukhtawar and @sachinpkale, for ensuring consistent view of the world, we are enabling footer for both local and remote translogs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming there would be low/minimal overhead of checksum for the local translog? Can we run a perf benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am running the OSB (opensearch benchmarks) for the same to understand the impact. But from code workflow POV, it seems to be a low overhead operation.
@@ -438,8 +438,43 @@ public TranslogReader closeIntoReader() throws IOException { | |||
synchronized (syncLock) { | |||
try (ReleasableLock toClose = writeLock.acquire()) { | |||
synchronized (this) { | |||
Long translogWithoutFooterChecksum = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need to explicitly initialise Long
with null
value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am more comfortable with overt initialisation of the variable prior and post the workflow. It find that it helps ensure that code is cleaner.
If you feel strongly about it or implicit initialisation leads to some performance benefits, we can change the same.
(translogCheckedContainer != null) ? translogCheckedContainer.getChecksum() : null | ||
translogWithoutFooterChecksum, | ||
translogWithFooterChecksum, | ||
true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it, that we are always passing hasFooter
to true
here? There can be cases when using older header and therefore, no footer right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular instance of TranslogReader
creation is from TranslogWriter.closeIntoReader
. This method is called only by current writers when they are rolling generation and creating readers out of them. Now, once this PR is merged, all the writers will have headers and therefore, whenever they try to close into reader, they will indeed have header.
Note that older header will never go through this workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. From what i understand by your comment. This below particular else code block looks redundant?
else {
// If we reach here then it means we are using older header and therefore, no footer.
// So, both translogWithoutFooterChecksum and translogWithFooterChecksum are same.
translogWithoutFooterChecksum = (translogCheckedContainer != null)
? translogCheckedContainer.getChecksum()
: null;
translogWithFooterChecksum = translogWithoutFooterChecksum;
}
In that case can we think of removing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is there for safety purpose. Imagine a scenario wherein we have an open translog partially written when the OpenSearch
process is upgraded, especially since it is under customer control. Now, when we will finally close this writer, then we will omit the footer in this case.
Now, for remote translog, we do not have a problem as the expectation is that any data which was in memory and not synced with remote will be lost. So the new footer will start anew.
But for local translog, the older partial translog will be present on disk which is of older version. So we still need to close it but not write the footer.
// When we open a reader to Translog from a path, we want to fetch the checksum | ||
// as that would be needed later on while creating the metadata map for | ||
// generation to checksum. | ||
Long translogChecksum = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long translogChecksum = null; | |
Long translogChecksum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer explicit initialisation of the variable especially since we are initialising hasFooter
with false
. It ensures that code is consistent and overt. If you feel strongly about it, we can make the change.
Long translogChecksum = null; | ||
boolean hasFooter = false; | ||
try { | ||
translogChecksum = TranslogFooter.readChecksum(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing. As readChecksum
returns null
value of checksum for an older version file that doesn't have footer. Here we are storing this translogChecksum
as null
and setting hasFooter to true
even though the file doesn't have footer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I have refactored this block to make it better streamlined. Now we will directly check-
boolean hasFooter = translogChecksum != null;
public Long getTranslogWithFooterChecksum() { | ||
return translogWithFooterChecksum; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename it better, like fullChecksum
and the other contentChecksum
* @return the checksum value from the translog footer, or `null` if the footer is not present | ||
* @throws IOException if an I/O error occurs while reading the footer | ||
*/ | ||
static Long readChecksum(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer non-static functions? Are they getting invoked from a static context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this method is invoked from static contexts in all instances.
downloadOnce
in RemoteFSTranslog
is one instance and open
in TranslogReader
is another one.
byte[] footer = TranslogFooter.write( | ||
channel, | ||
translogCheckedContainer.getChecksum(), | ||
!Boolean.TRUE.equals(remoteTranslogEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doubt: as we have already written the translog to disk, after that we are writing footer to file. Why is it that we are forcing a sync again only in case of when remoteTranslogEnabled
is false
.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So presently Remote translog does not flush the changes to disk and only keeps it in memory. During sync workflow, it would write the data to file channel in memory and then during rollover, we will upload it directly to remote store.
On the other hand, for local translog, we are flushing the changes to disk on each sync.
Now, in TranslogFooter.write
method, we are doing the same operation- For local translog, we flush to disk and for remote translog, we write the data to file channel without flushing to disk.
Presently, the download workflow for remote backed storage works in a manner which causes the download for same translog files multiple times, each time deleting all the older files before downloading them again. This causes significant wasted network bandwidth, along with the time taken for the shard to become active. This change adds support for downloading the translog files incrementally and omitting the same if they are present locally. Signed-off-by: Harsh Rawat <[email protected]>
This commit adds the unit tests applicable for the changes made to support incremental download of translog files. Primarily, the unit tests cover the changes in- - TranslogFooter to write and read the footer of Translog - RemoteFSTranslog to skip the download of locally present translog - TranslogWriter to create reader with checksum upon close - TranslogReader closeIntoReader functionality Signed-off-by: Harsh Rawat <[email protected]>
6ea2adb
to
1211703
Compare
Rebased over mainline and addressed the comments which needed any refactoring. |
❕ Gradle check result for 1211703: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
This PR is stalled because it has been open for 30 days with no activity. |
This PR is stalled because it has been open for 30 days with no activity. |
Description
Presently, the download workflow for remote backed storage works in a manner that causes the download of the same translog files multiple times, each time deleting all the older files before downloading them again. This causes significant wasted network bandwidth, along with the time taken for the shard to become active.
This change adds support for downloading the translog files incrementally and omitting the same if they are present locally.
Implementation
The key changes include-
Incremental Download Logic in RemoteFsTranslog:
Translog Footer Handling:
Generation-to-Checksum Mapping in TranslogTransferMetadata:
Testing
Manual Testing
Manually tested failover and restore workflow on a 3-node EC2 cluster.
Also, tested backward compatibility wherein initial cluster was created and indexed with older OpenSearch process. It was then switched to incremental download.
Newly added tests
RemoteFsTranslogTests
class has been updated to include new test cases for the incremental download logic:testIncrementalDownloadWithMatchingChecksum
tests the scenario where the local and remote translog files have the same checksum, and the download is skipped.testIncrementalDownloadWithDifferentChecksum
tests the scenario where the local and remote translog files have different checksums, and only the missing generation is downloaded.TranslogFooterTests
class has been added to verify the functionality of theTranslogFooter
class, including writing and reading the footer.TranslogTransferMetadataHandlerTests
class has been updated to include test cases for handling the generation-to-checksum map in theTranslogTransferMetadata
.Related Issues
One of the optimisations for #15277
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.