
Retry S3 task log fetch #14714

Merged (10 commits) on Aug 3, 2023

Conversation

@YongGang (Contributor) commented Aug 1, 2023

Description

We saw the following error when fetching task status from S3. It is caused by S3 rate limiting the query, so we should retry the operation in this case.

com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID:; S3 Extended Request ID: ; Proxy: null)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[aws-java-sdk-core-1.12.317.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[aws-java-sdk-s3-1.12.317.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[aws-java-sdk-s3-1.12.317.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372) ~[aws-java-sdk-s3-1.12.317.jar:?]
	at org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3.getObjectMetadata(ServerSideEncryptingAmazonS3.java:97) ~[?:?]
	at org.apache.druid.storage.s3.S3TaskLogs.streamTaskFile(S3TaskLogs.java:90) ~[?:?]

This change makes the task log fetch use S3Utils.retryS3Operation, the same approach already used by the task file push method.
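The retry-on-503 behavior described above can be sketched generically. The following is a minimal, self-contained illustration of the idea behind a helper like S3Utils.retryS3Operation (retry with exponential backoff when the failure is a transient "Slow Down" error); the class and exception names here are hypothetical stand-ins, not Druid's actual code.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Hypothetical stand-in for com.amazonaws.services.s3.model.AmazonS3Exception.
    static class TransientS3Exception extends RuntimeException {
        final int statusCode;

        TransientS3Exception(int statusCode, String msg) {
            super(msg);
            this.statusCode = statusCode;
        }
    }

    // Only HTTP 503 ("Slow Down") is treated as retryable in this sketch.
    static boolean isTransient(Throwable t) {
        return t instanceof TransientS3Exception && ((TransientS3Exception) t).statusCode == 503;
    }

    // Retry the operation on transient failures, doubling the backoff each time.
    static <T> T retryOperation(Callable<T> op, int maxTries) throws Exception {
        long backoffMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxTries || !isTransient(e)) {
                    throw e; // give up: retries exhausted or error is not transient
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2; // exponential backoff before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Fails twice with a 503, then succeeds on the third attempt.
        String result = retryOperation(() -> {
            if (++calls[0] < 3) {
                throw new TransientS3Exception(503, "Slow Down");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints: ok after 3 attempts
    }
}
```

A non-transient exception (or exhausting maxTries) propagates to the caller unchanged, which mirrors why the PR only wraps the fetch in the retry helper rather than swallowing errors.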

Release note

Retry S3 task log fetch


Key changed/added classes in this PR
  • make the streamTaskFile method retry on transient S3 errors in the S3TaskLogs class

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -87,38 +92,42 @@ public Optional<InputStream> streamTaskStatus(String taskid) throws IOException
private Optional<InputStream> streamTaskFile(final long offset, String taskKey) throws IOException
@kfaraz (Contributor) commented Aug 2, 2023

It might be nicer to just add a new method streamTaskFileWithRetry, which calls the existing streamTaskFile.

something like:

private Optional<InputStream> streamTaskFileWithRetry(final long offset, String taskKey)
{
  try {
    return S3Utils.retryS3Operation(() -> streamTaskFile(offset, taskKey));
  }
  catch (Exception e) {
    throw new IOE(e, "Failed to stream logs from: %s", taskKey);
  }
}

The new method can also have a javadoc to mention which failure cases are retried.

@YongGang (Contributor, Author) commented Aug 2, 2023

Note: the existing streamTaskFile method could throw a wrapped exception, e.g. new IOE(e, "Failed to stream logs from: %s", taskKey); in that case the outer streamTaskFileWithRetry method can't tell whether it should retry. That means we also need to refine the exception throwing in the existing streamTaskFile.

@kfaraz (Contributor) replied:

Yeah, you can modify the exception if required, as long as the overall logic remains the same. The advantage of having a separate retry method is readability and a small diff.
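The point being discussed here can be illustrated with a small sketch: once the inner method wraps the S3 error in an IOException, a retry wrapper can no longer check the exception type directly and has to walk the cause chain to decide retryability. All names below are illustrative stand-ins, not Druid's actual implementation.

```java
public class RetryableCauseSketch {
    // Hypothetical stand-in for an AmazonS3Exception with a 503 "Slow Down" status.
    static class SlowDownException extends RuntimeException {
        SlowDownException(String msg) {
            super(msg);
        }
    }

    // Walk the cause chain so a transient error wrapped in another
    // exception (e.g. an IOException) is still recognized as retryable.
    static boolean isRetryable(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SlowDownException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Exception wrapped = new java.io.IOException(
            "Failed to stream logs",
            new SlowDownException("Slow Down")
        );
        System.out.println(isRetryable(wrapped));                     // prints: true
        System.out.println(isRetryable(new IllegalStateException())); // prints: false
    }
}
```

This is why the thread concludes that the exception throwing in the existing method may need refining: if the transient error surfaces unwrapped, a simple type check suffices and no cause-chain walking is needed.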

@@ -67,6 +67,11 @@ public S3TaskLogs(
public Optional<InputStream> streamTaskLog(final String taskid, final long offset) throws IOException
{
final String taskKey = getTaskLogKey(taskid, "log");
// this is to satisfy CodeQL scan
Preconditions.checkArgument(
offset < Long.MAX_VALUE && offset > Long.MIN_VALUE,
@kfaraz (Contributor) commented:

I think we can ignore the CodeQL scan for now, because I am not entirely sure whether Long.MIN_VALUE or Long.MAX_VALUE is being used somewhere on purpose to represent a special scenario.

The scan might also go away if we make the suggested change of adding a new method rather than updating the existing one, although I am not entirely sure that it will.

@YongGang (Contributor, Author) replied:

OK, though the PR can't be merged with an open CodeQL scan issue, right?

@kfaraz (Contributor) left a review:

Changes look good, left some minor comments.

@kfaraz (Contributor) left a review:

Minor suggestions, otherwise looks good.

Comment on lines 494 to 511
// First getObjectMetadata call fails with a transient 503 "Slow Down" error
EasyMock.reset(s3Client);
AmazonS3Exception awsError = new AmazonS3Exception("AWS Error");
awsError.setErrorCode("503");
awsError.setStatusCode(503);
EasyMock.expect(s3Client.getObjectMetadata(EasyMock.anyString(), EasyMock.anyString())).andThrow(awsError);
EasyMock.expectLastCall().once();

// The retried getObjectMetadata call succeeds and returns the status metadata
String logPath = TEST_PREFIX + "/" + KEY_1 + "/status.json";
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(STATUS_CONTENTS.length());
EasyMock.expect(s3Client.getObjectMetadata(TEST_BUCKET, logPath)).andReturn(objectMetadata);

// getObject then returns the status contents for the matching range and ETag
S3Object s3Object = new S3Object();
s3Object.setObjectContent(new ByteArrayInputStream(STATUS_CONTENTS.getBytes(StandardCharsets.UTF_8)));
GetObjectRequest getObjectRequest = new GetObjectRequest(TEST_BUCKET, logPath);
getObjectRequest.setRange(0, STATUS_CONTENTS.length() - 1);
getObjectRequest.withMatchingETagConstraint(objectMetadata.getETag());
EasyMock.expect(s3Client.getObject(getObjectRequest)).andReturn(s3Object);
EasyMock.expectLastCall().once();

replayAll();
@kfaraz (Contributor) commented:
Some 1-line comments might be good here, or at least a logical separation using newlines.

@kfaraz kfaraz merged commit 20c48b6 into apache:master Aug 3, 2023
@YongGang YongGang deleted the retry-s3-task-status branch August 17, 2023 16:02
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
4 participants