Use async client for delete blob or path in S3 Blob Container #16788

Merged (6 commits) on Jan 9, 2025

ashking94 (Member) commented Dec 5, 2024

Description

This PR addresses the port exhaustion issue (#16883) that causes indexing failures and partial snapshots in OpenSearch clusters under high indexing load. The problem shows up as periodic spikes in 5xx HTTP status codes during indexing and as "Cannot assign requested address" exceptions in the logs, particularly during stale segment deletion.

While an async client already exists, this PR extends its use to cover all S3 blob delete operations. The change aims to significantly reduce port exhaustion by avoiding the creation of a new socket for every delete request under high load.
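To make the idea concrete, here is a minimal sketch of routing every delete through one shared async client and chaining the batches. It is not the code from this PR: `AsyncBulkDeleter`, `deleteBatch` and `deleteAll` are hypothetical names used only for illustration, and the fake client in `main` stands in for a real S3 async client. The 1000-key batch size reflects the per-request limit of the S3 DeleteObjects API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncDeleteSketch {
    /** Hypothetical stand-in for a shared async S3 client; not the plugin's real API. */
    interface AsyncBulkDeleter {
        CompletableFuture<Void> deleteBatch(List<String> keys);
    }

    // The S3 DeleteObjects API accepts at most 1000 keys per request.
    static final int MAX_KEYS_PER_REQUEST = 1000;

    /** Splits keys into batches and chains them so that one request is in flight at a time. */
    static CompletableFuture<Void> deleteAll(AsyncBulkDeleter client, List<String> keys) {
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (int start = 0; start < keys.size(); start += MAX_KEYS_PER_REQUEST) {
            int end = Math.min(start + MAX_KEYS_PER_REQUEST, keys.size());
            // Copy the sublist so the batch does not depend on later changes to the caller's list.
            List<String> batch = new ArrayList<>(keys.subList(start, end));
            chain = chain.thenCompose(ignored -> client.deleteBatch(batch));
        }
        return chain;
    }

    public static void main(String[] args) {
        AtomicInteger requests = new AtomicInteger();
        AtomicInteger deleted = new AtomicInteger();
        // Fake client: counts requests and keys instead of talking to S3.
        AsyncBulkDeleter fake = batch -> {
            requests.incrementAndGet();
            deleted.addAndGet(batch.size());
            return CompletableFuture.completedFuture(null);
        };
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("segments/blob-" + i);
        }
        deleteAll(fake, keys).join();
        System.out.println(requests.get() + " requests, " + deleted.get() + " keys");
    }
}
```

With the fake client, 2500 keys are sent as 3 requests. The point of the sketch is only that all deletes go through the same long-lived client object, so connection reuse is left to that client instead of each delete call setting up its own.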

Key changes:

  1. S3BlobContainer.java:

    • Refactored delete operations to use the async client exclusively
    • Removed the synchronous delete methods, replacing them with async versions
    • Updated error handling and logging for async operations
      • Added the metric publisher hook for List, which was missing in one place
  2. S3AsyncService.java:

    • Create the retry policy within SocketAccess.doPrivileged to fix access issues. This also brings it in line with the sync client.
    • Removed redundant code
  3. S3RepositoryPlugin.java:

    • Close the event loop group when S3RepositoryPlugin is closed; otherwise its threads leak because of their daemon nature.
  4. BlobStoreRepository.java:

    • Removed SNAPSHOT_ASYNC_DELETION_ENABLE_SETTING as async deletion is now the default
    • Updated deleteContainer and deleteFromContainer methods to use async operations exclusively
  5. Updated test classes to reflect the changes:

    • S3BlobStoreRepositoryTests.java
    • S3RepositoryThirdPartyTests.java
    • S3BlobStoreContainerTests.java
    • S3RepositoryPluginTests.java
  6. Removed references to the now obsolete async deletion setting in ClusterSettings.java

These changes should significantly improve the handling of delete operations in high-load scenarios, preventing port exhaustion and related issues by leveraging the existing async client more extensively.
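As an illustration of the S3RepositoryPlugin.java change listed above, the sketch below shows the general pattern of shutting down a pool of daemon I/O threads when its owner is closed. It is an assumption-laden stand-in, not the plugin's code: `PluginLike` is a hypothetical class, and a plain JDK `ExecutorService` is used in place of the event loop group.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EventLoopCloseSketch {
    /** Hypothetical plugin-like owner of an I/O thread pool (stand-in for an event loop group). */
    static final class PluginLike implements Closeable {
        private final ExecutorService ioThreads = Executors.newFixedThreadPool(2, runnable -> {
            Thread t = new Thread(runnable, "sketch-io");
            // Daemon threads do not keep the JVM alive, but they linger until the pool is shut down.
            t.setDaemon(true);
            return t;
        });

        void submit(Runnable task) {
            ioThreads.execute(task);
        }

        boolean isTerminated() {
            return ioThreads.isTerminated();
        }

        @Override
        public void close() {
            // Stop accepting work, wait briefly for running tasks, then force shutdown if needed.
            ioThreads.shutdown();
            try {
                if (!ioThreads.awaitTermination(5, TimeUnit.SECONDS)) {
                    ioThreads.shutdownNow();
                }
            } catch (InterruptedException e) {
                ioThreads.shutdownNow();
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        PluginLike plugin = new PluginLike();
        plugin.submit(() -> { });
        plugin.close();
        System.out.println("terminated: " + plugin.isTerminated());
    }
}
```

Without the `close()` call, the pool's worker thread would stay alive for as long as the process runs, which is the kind of leak the change description refers to.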

Related Issues

Resolves #16883 (Port Exhaustion Causing Indexing Failures and Partial Snapshots)

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Dec 5, 2024

❌ Gradle check result for 384b63a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 1c58299: FAILURE

❌ Gradle check result for 49d893f: FAILURE

❌ Gradle check result for 81e356d: FAILURE

❌ Gradle check result for de40809: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for 1db7150: FAILURE

❌ Gradle check result for 1db7150: FAILURE

✅ Gradle check result for 1db7150: SUCCESS

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 72.18%. Comparing base (b5f651f) to head (1db7150).
Report is 27 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rg/opensearch/repositories/s3/S3BlobContainer.java | 45.45% | 6 Missing ⚠️ |
| ...org/opensearch/repositories/s3/S3AsyncService.java | 87.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16788      +/-   ##
============================================
- Coverage     72.21%   72.18%   -0.03%     
+ Complexity    65335    65273      -62     
============================================
  Files          5318     5318              
  Lines        304081   303991      -90     
  Branches      43995    43982      -13     
============================================
- Hits         219578   219425     -153     
- Misses        66541    66576      +35     
- Partials      17962    17990      +28     


github-actions bot added the bug and Storage:Snapshots labels Dec 19, 2024
ashking94 marked this pull request as ready for review December 19, 2024 09:35
ashking94 (Member, Author) commented, quoting the Codecov report above:

Trying to increase unit test coverage.

ashking94 merged commit 1d4b85f into opensearch-project:main Jan 9, 2025
62 of 64 checks passed
ashking94 added the backport 2.x label Jan 9, 2025
ashking94 deleted the async-deletion branch January 9, 2025 05:52
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 9, 2025
* Use async client for delete blob or path in S3 Blob Container

Signed-off-by: Ashish Singh <[email protected]>

* Fix UTs

Signed-off-by: Ashish Singh <[email protected]>

* Fix failures in S3BlobStoreRepositoryTests

Signed-off-by: Ashish Singh <[email protected]>

* Fix S3BlobStoreRepositoryTests

Signed-off-by: Ashish Singh <[email protected]>

* Fix failures in S3RepositoryThirdPartyTests

Signed-off-by: Ashish Singh <[email protected]>

* Fix failures in S3RepositoryPluginTests

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>
(cherry picked from commit 1d4b85f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ashking94 pushed a commit that referenced this pull request Jan 9, 2025
#16984)

* Use async client for delete blob or path in S3 Blob Container

* Fix UTs

* Fix failures in S3BlobStoreRepositoryTests

* Fix S3BlobStoreRepositoryTests

* Fix failures in S3RepositoryThirdPartyTests

* Fix failures in S3RepositoryPluginTests

---------

(cherry picked from commit 1d4b85f)

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x, bug, skip-changelog, Storage:Snapshots
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[BUG] Port Exhaustion Causing Indexing Failures and Partial Snapshots
2 participants