S3: Data read has a different checksum than expected #953

Closed
ben-manes opened this issue Dec 14, 2018 · 55 comments
Labels: bug
@ben-manes

This might be fixed by 472f5bf, but I'm not sure. I'm also seeing errors where response.contentLength() differs from the actual size, so there might be a race condition again. This is using 2.1.4 + Netty.

Caused by: software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected.
        at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
        at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:39)
        at software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor.validatePutObjectChecksum(AsyncChecksumValidationInterceptor.java:105)
        at software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor.afterUnmarshalling(AsyncChecksumValidationInterceptor.java:91)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$afterUnmarshalling$9(ExecutionInterceptorChain.java:152)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.reverseForEach(ExecutionInterceptorChain.java:210)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.afterUnmarshalling(ExecutionInterceptorChain.java:152)
        at software.amazon.awssdk.core.client.handler.BaseClientHandler.runAfterUnmarshallingInterceptors(BaseClientHandler.java:120)
        at software.amazon.awssdk.core.client.handler.BaseClientHandler.lambda$interceptorCalling$2(BaseClientHandler.java:133)
        at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:40)
        at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:28)
        at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.lambda$prepare$0(SyncResponseHandlerAdapter.java:85)
@dylanmroweFS

Similar issue here, using 2.1.4. We're uploading files to buckets in multiple regions asynchronously. This happens after a few successful uploads.

java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected.
	at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:61)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:50)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryErrorIfNeeded(AsyncRetryableStage.java:166)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:118)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:103)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$executeHttpRequest$1(MakeAsyncHttpRequestStage.java:136)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected.
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:39)
	at software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor.validatePutObjectChecksum(AsyncChecksumValidationInterceptor.java:105)
	at software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor.afterUnmarshalling(AsyncChecksumValidationInterceptor.java:91)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$afterUnmarshalling$9(ExecutionInterceptorChain.java:152)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.reverseForEach(ExecutionInterceptorChain.java:210)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.afterUnmarshalling(ExecutionInterceptorChain.java:152)
	at software.amazon.awssdk.core.client.handler.BaseClientHandler.runAfterUnmarshallingInterceptors(BaseClientHandler.java:120)
	at software.amazon.awssdk.core.client.handler.BaseClientHandler.lambda$interceptorCalling$2(BaseClientHandler.java:133)
	at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:40)
	at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:28)
	at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.lambda$prepare$0(SyncResponseHandlerAdapter.java:85)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	... 1 common frames omitted

@zoewangg zoewangg added investigating This issue is being investigated and/or work is in progress to resolve the issue. bug This issue is a bug. labels Dec 17, 2018
@GJKrupa

GJKrupa commented Dec 26, 2018

I'm seeing something similar with 2.2.0. I have a basic integration test using LocalStack that produces this checksum error every time the test calls getObject asynchronously with ByteArrayAsyncResponseTransformer. As far as I can tell from debugging, the checksum validation runs before the response content is fully read (the test data is a simple UUID, but the exception is raised with only half the UUID in the transformer's underlying BAOS).

The failing test is in commit bfd3a767 of https://gitlab.com/gkrupa/s3web (S3BackendSpec).

@zoewangg
Contributor

zoewangg commented Jan 9, 2019

Related to #965. Fixed via #966 and the fix will be included in the next release.

@zoewangg zoewangg removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jan 9, 2019
@zoewangg
Contributor

The fix has been released in 2.3.0. Closing the issue.

@harrylaou

I'm using 2.3.1 and still seeing this:
[SdkClientException: Data read has a different checksum than expected. Was -1560488262, but expected 1701186097]

@GJKrupa

GJKrupa commented Jan 16, 2019

Same here. If I update the BOM version in the test I linked above, it still fails.

@harrylaou

@ben-manes is this going to be reopened, or is it fixed in the next release?

@vicpara

vicpara commented Jan 24, 2019

We also have the same problem. The workaround is pretty ugly and inefficient.

@ben-manes
Author

@harrylaou I switched back to v1 after fighting bugs daily with v2 for a few months; since then I've had no problems. Unfortunately AWS didn't dogfood this SDK before releasing it, and I don't have the time to continue beta testing. I'll switch over later once it matures. You might feel similarly. I'm not reopening this, but you can file a new issue if the AWS team doesn't reopen this one.

@millems
Contributor

millems commented Jan 24, 2019

Reopening, because it sounds like there are more cases where this can occur that haven't already been covered by previous fixes.

@spfink
Contributor

spfink commented Jan 24, 2019

@ben-manes, @vicpara, do either of you have an easily reproducible test case for this?

@millems
Contributor

millems commented Feb 8, 2019

We're unfortunately not able to reproduce this on our end. Is there some specific configuration you have on your object that may be causing this?

@vicpara

vicpara commented Feb 19, 2019

@millems We are using the library in Scala. Ideally we would like to call the method you have defined as:
CompletableFuture<ReturnT> getObject(Consumer<GetObjectRequest.Builder> getObjectRequest, AsyncResponseTransformer<GetObjectResponse, ReturnT> asyncResponseTransformer)

In Scala the call would be:
client.getObject(getObjectRequest(bucket, fileName), new ByteArrayAsyncResponseTransformer[GetObjectResponse]()).toTask

Also note that AsyncResponseTransformer.toBytes() wouldn't work because of limitations when calling Java static methods on generic classes from Scala; syntax like AsyncResponseTransformer[GetObjectResponse, ResponseBytes].toBytes() or AsyncResponseTransformer.toBytes(): GetObjectResponse doesn't compile.
Calling new ByteArrayAsyncResponseTransformer[GetObjectResponse]() should work if it were not for #953.

Since this doesn't work, we are using the following workaround:

val bucket = "bucket_one"
val fileName = "file1.txt"
val tempFile: File = File.createTempFile("temp-", fileName.flattenPath)
tempFile.deleteOnExit()
val localPath: Path = tempFile.toPath
client.getObject(getObjectRequest(bucket, fileName), localPath)

@dementiev

dementiev commented Mar 18, 2019

Still seeing this issue on version "software.amazon.awssdk:s3:2.5.10":

Code example:

... this.client = S3Client.create();

        client.getObject(
                GetObjectRequest.builder().bucket(bucket).key(key).build(),
                ResponseTransformer.toFile(download)
        );

Caused by: software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected. Was 1229635159, but expected 0

Any suggestions on how to fix?

@millems
Contributor

millems commented Mar 18, 2019

@dementiev
Please help answer a few questions to enable us to reproduce this issue:

  1. Does this happen sporadically or consistently?
  2. Can you share how you are creating the S3 client?
  3. Are there any 'unusual' characters in the bucket or key name (non-alphanumeric)?
  4. Are there any 'unusual' configurations on the bucket (non-defaults)?

@tim-fdc

tim-fdc commented Mar 25, 2019

Hi,
I'm also getting this error, but only with the async client.

Using version software.amazon.awssdk:aws-sdk-java:2.5.15.

The error happens consistently with the following client (Kotlin code):

class S3CredentialsProvider : AwsCredentialsProvider {
    override fun resolveCredentials(): AwsCredentials {
        val awsCredentialsProcessBuilder = AwsSessionCredentials.create(
                "<SECRET>",
                "<SECRET>",
                "<SECRET>")

        return awsCredentialsProcessBuilder
    }
}
 
   val s3Client = S3AsyncClient.builder()
            .credentialsProvider(S3CredentialsProvider())
            .region(Region.EU_WEST_1)
            .build()

Writing an object as follows:

    val toByteArray = "test".toByteArray()
    val asyncRequestBody = AsyncRequestBody.fromBytes(toByteArray)
    s3Client.putObject(
            PutObjectRequest.builder()
                    .bucket(config.bucket)
                    .key("test")
                    .build(),
            asyncRequestBody).await()

I'm using the Kotlin coroutines extension (await()) to convert the Java CompletableFuture into a coroutine. This works for listings. I am also using a session token; it seems to work without one.

With the non-async client

S3Client.builder()
            .region(Region.EU_WEST_1)
            .credentialsProvider(S3CredentialsProvider())
            .build()

it works.

Here is the exception:

Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected.
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:39)
	at software.amazon.awssdk.services.s3.checksums.ChecksumsEnabledValidator.validatePutObjectChecksum(ChecksumsEnabledValidator.java:134)
	at software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor.afterUnmarshalling(AsyncChecksumValidationInterceptor.java:86)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$afterUnmarshalling$9(ExecutionInterceptorChain.java:152)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.reverseForEach(ExecutionInterceptorChain.java:210)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.afterUnmarshalling(ExecutionInterceptorChain.java:152)
	at software.amazon.awssdk.core.client.handler.BaseClientHandler.runAfterUnmarshallingInterceptors(BaseClientHandler.java:138)
	at software.amazon.awssdk.core.client.handler.BaseClientHandler.lambda$interceptorCalling$2(BaseClientHandler.java:151)
	at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:40)
	at software.amazon.awssdk.core.client.handler.AttachHttpMetadataResponseHandler.handle(AttachHttpMetadataResponseHandler.java:28)
	at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler.lambda$prepare$0(AsyncResponseHandler.java:88)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
	at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler$BaosSubscriber.onComplete(AsyncResponseHandler.java:129)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$FullResponseContentPublisher$1.request(ResponseHandler.java:369)
	at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler$BaosSubscriber.onSubscribe(AsyncResponseHandler.java:108)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$FullResponseContentPublisher.subscribe(ResponseHandler.java:360)
	at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler.onStream(AsyncResponseHandler.java:71)
	at software.amazon.awssdk.core.internal.http.async.AsyncAfterTransmissionInterceptorCallingResponseHandler.onStream(AsyncAfterTransmissionInterceptorCallingResponseHandler.java:86)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.onStream(MakeAsyncHttpRequestStage.java:249)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelRead0(ResponseHandler.java:112)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelRead0(ResponseHandler.java:65)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at com.typesafe.netty.http.HttpStreamsHandler.channelRead(HttpStreamsHandler.java:129)
	at com.typesafe.netty.http.HttpStreamsClientHandler.channelRead(HttpStreamsClientHandler.java:148)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at software.amazon.awssdk.http.nio.netty.internal.FutureCancelHandler.channelRead0(FutureCancelHandler.java:42)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at java.lang.Thread.run(Thread.java:748)

Any ideas?

Thanks!
Tim

@millems
Contributor

millems commented Apr 16, 2019

@tim-fdc Do you have any special bucket configuration, or special characters in the bucket name?

The following works for me (Java runtime):

AwsSessionCredentials credentials = AwsSessionCredentials.create("<>",
                                                                 "<>",
                                                                 "<>");
try (S3AsyncClient client = S3AsyncClient.builder()
                                         .credentialsProvider(() -> credentials)
                                         .region(Region.EU_WEST_1)
                                         .build()) {
    client.createBucket(r -> r.bucket("tmp-millem")).join();
    client.putObject(r -> r.bucket("tmp-millem").key("test"), AsyncRequestBody.fromBytes("test".getBytes())).join();

    System.out.println(client.getObject(r -> r.bucket("tmp-millem").key("test"), AsyncResponseTransformer.toBytes())
                             .join()
                             .asUtf8String());

    client.deleteObject(r -> r.bucket("tmp-millem").key("test")).join();
    client.deleteBucket(r -> r.bucket("tmp-millem")).join();
}

derwasp pushed a commit to albumprinter/junit-synnefo that referenced this issue Apr 18, 2019
derwasp added a commit to albumprinter/junit-synnefo that referenced this issue Apr 21, 2019
 - Use async apis instead of sync apis for everything but the S3 download jobs because of aws/aws-sdk-java-v2#953
 - Actually make the `threads` setting work: now we can limit the amount of parallel executions with it
 - Download artifacts after each job has finished and not at the end of all the jobs
 - Moved the helper functions to separate files

Closes #6 #2
@millems millems assigned millems and unassigned millems Jul 28, 2020
@marc-christian-schulze

marc-christian-schulze commented Jul 30, 2020

@millems, I've provided a reproducer here:
https://github.com/marc-christian-schulze/kotlinx-reproducer-2109

It shows that the race condition appears when the onNext method is called from both a Netty I/O thread and a non-Netty I/O thread. Your example does not reproduce the race condition since it uses Flowable.fromIterable, which delivers the items only on the calling thread.

I've spent quite some time together with colleagues from the Kotlin project narrowing down the issue and finally ended up at OrderedWriteChannelHandlerContext::doInOrder:

private void doInOrder(Runnable task) {
        if (!channel().eventLoop().inEventLoop()) {
            task.run();
        } else {
            // If we're in the event loop, queue a task to perform the write, so that it occurs after writes that were scheduled
            // off of the event loop.
            channel().eventLoop().execute(task);
        }
    }

There the actual "magic" happens, depending on which thread performs the ctx.writeAndFlush invocation. Once threads (Netty I/O and non-Netty I/O) start invoking doInOrder in fast iterations, the network packets can be enqueued in the wrong order, since the two branches take different paths: one runs the write task inline on the calling thread, while the other schedules it onto the channel's event loop.
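
A toy illustration of that hazard (plain JDK code with illustrative names, nothing from the SDK): a write that is merely queued onto the event loop can be overtaken by a later write that runs inline on another thread.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WriteOrderingSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the channel's event loop: a single-threaded executor.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        // Stand-in for the order in which chunks reach the socket.
        List<String> wireOrder = Collections.synchronizedList(new ArrayList<>());

        eventLoop.execute(() -> wireOrder.add("chunk A")); // issued first, but only queued
        wireOrder.add("chunk B");                          // issued second, runs inline on this thread

        eventLoop.shutdown();
        eventLoop.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(wireOrder); // often prints [chunk B, chunk A]
    }
}

In the real client, routing every write through a single ordered path is what OrderedWriteChannelHandlerContext is meant to guarantee.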

@millems
Contributor

millems commented Jul 30, 2020

@marc-christian-schulze OrderedWriteChannelHandlerContext::doInOrder was written with the assumption that the task to be executed will always end up running on the event loop. When that assumption holds, the class behaves correctly.

I can't determine where this assumption is breaking down without a repro case, and I can't fix the issue without determining where that assumption is breaking down.

I'll give your repro case a try and see where that assumption is wrong. Thanks!

@millems
Contributor

millems commented Jul 31, 2020

As an update from today's investigation, I can reproduce the exception using @marc-christian-schulze's code, but cannot attribute it to the OrderedWriteChannelHandlerContext. Even after removing the OrderedWriteChannelHandlerContext, the test still fails.

I don't see anything obviously wrong with the publisher that would take some of the blame off the SDK, but that's difficult to tell at a glance. It might be the publisher and it might be the SDK; more research is still needed. That said, the publisher does look correct from a naive perspective (onNext is never invoked in parallel, it's invoked in the correct order, etc.).

The working theory should be that it's the SDK until proven otherwise...

@millems
Contributor

millems commented Jul 31, 2020

As a last update for today, I think the issue is that the OrderedWriteChannelHandlerContext isn't always being attached (e.g. on the second request using the same channel). The bug we're encountering is the bug that the OrderedWriteChannelHandlerContext is intended to fix. It's just not always being applied. Tomorrow, I'll see if I can fix the OrderedWriteChannelHandlerContext to always be attached, which will ideally fix the issue.

@millems
Contributor

millems commented Jul 31, 2020

Thanks to @marc-christian-schulze for the repro case, which I was able to fix with #1967. It looks like netty-reactive-streams can attach the same handler to a connection twice, so the OrderedWriteChannelHandlerContext wasn't always used for all streaming writes. This means that the race condition it is meant to fix could still occur sometimes. The fix ensures that the ordered-write handling is always applied.

millems added a commit that referenced this issue Jul 31, 2020
This fixes a cause of "S3: Data read has a different checksum than expected": #953
aws-sdk-java-automation added a commit that referenced this issue Sep 11, 2020
…27850d2c

Pull request: release <- staging/a7058983-8371-4514-a451-f9d527850d2c
@pavellikin

Hi. Unfortunately, I am still able to reproduce the issue on SDK version 2.15.11. I prepared a demo project to show the issue.

The project has three tests. The first uploads a ~100 kB array with checksum validation; this is the only green test. The other two tests upload a ~1 MB array, one with and one without checksum validation. In the test with checksum validation, I get the "Data read has a different checksum than expected" error. Without checksum validation, the bytes are uploaded to S3, but they differ from the original byte array.

It looks like the packets can still be reordered, as @marc-christian-schulze mentioned.

@millems
Copy link
Contributor

millems commented Oct 21, 2021

Self-contained repro:

S3AsyncClient s3 = S3AsyncClient.create();
String bucketKey = "some-bucket-name";

try {
    s3.createBucket(r -> r.bucket(bucketKey)).join();
} catch (Exception e) {
}

byte[] expectedOutputBytes = new byte[1_000_000];
ThreadLocalRandom.current().nextBytes(expectedOutputBytes);

for (int iteration = 1; iteration < 1000; iteration++) {
    List<byte[]> output = new ArrayList<>();
    ByteArrayInputStream bytesInput = new ByteArrayInputStream(expectedOutputBytes);
    while (true) {
        byte[] buffer = new byte[ThreadLocalRandom.current().nextInt(0, 1000)];
        int read = bytesInput.read(buffer);
        if (read < 0) {
            break;
        }
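        // Note: near the end of the stream, read() may fill only part of 'buffer', yet the whole
        // buffer is still published, so the total bytes published can exceed the declared contentLength.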

        output.add(buffer);
    }

    System.out.println("Uploading " + output.size() + " chunks");

    s3.putObject(r -> r.bucket(bucketKey).key(bucketKey).contentLength((long) expectedOutputBytes.length),
                 AsyncRequestBody.fromPublisher(Flux.fromIterable(output).map(ByteBuffer::wrap)))
      .join();

    System.out.println("Downloading " + output.size() + " chunks");

    ResponseBytes<GetObjectResponse> response =
        s3.getObject(r -> r.bucket(bucketKey).key(bucketKey),
                     AsyncResponseTransformer.toBytes()).join();

    assertThat(response.asByteArrayUnsafe()).isEqualTo(expectedOutputBytes);
}

@millems
Copy link
Contributor

millems commented Oct 22, 2021

Added a fix for my repro above with #2788. I'll keep looking into the repro from burnout171; theirs is a complicated repro that involves proxying requests to a mock S3.

@millems
Contributor

millems commented Oct 23, 2021

@burnout171's repro fails because it relies on the source publisher being a Netty service. That underlying service reuses the byte buffers passed to the SDK before the SDK reads them, so the data within the ByteBuffer is changed by the SDK caller before the SDK can read it.

This is an unusual case, so I suspect most customers will be fixed by #2788. I'm not even sure whether I want to fix @burnout171's use case, because it would require the SDK to defensively copy every byte provided by the customer. It's an extremely unusual use case, and working around it would potentially impact everyone.

Instead, I wonder whether it's possible to just warn when it happens. I'll think about it and report back. LMK if anyone here has an opinion.

@Bennett-Lynch
Contributor

Bennett-Lynch commented Oct 23, 2021

An opt-in configuration for the SDK to defensively copy all ByteBuffers might make sense, purely as a convenience option, but the general Netty guidance is to not expose pooled ByteBufs to components outside of your control, for the exact reasons above. Overall, I think we should advise users to either perform a deep-copy of pooled ByteBufs or disable ByteBuf pooling altogether.
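
A minimal sketch of that deep-copy advice, assuming the upstream is a Reactor Flux of pooled Netty ByteBufs; the bucket name, key, and data source here are illustrative, not SDK requirements:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import io.netty.buffer.PooledByteBufAllocator;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import reactor.core.publisher.Flux;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class DeepCopyUploadSketch {
    public static void main(String[] args) {
        // Hypothetical upstream of pooled Netty buffers; a real source (e.g. another Netty
        // service) may reuse or release these buffers after handing them off.
        List<ByteBuf> pooledChunks = new ArrayList<>();
        long contentLength = 0;
        for (int i = 0; i < 4; i++) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(16);
            buf.writeBytes(("chunk-" + i).getBytes(StandardCharsets.UTF_8));
            contentLength += buf.readableBytes();
            pooledChunks.add(buf);
        }
        long length = contentLength;

        // Deep-copy each pooled ByteBuf into an independent heap ByteBuffer before the SDK
        // sees it, then release the pooled buffer. ByteBufUtil.getBytes copies the readable bytes.
        Flux<ByteBuffer> safeBody = Flux.fromIterable(pooledChunks).map(buf -> {
            try {
                return ByteBuffer.wrap(ByteBufUtil.getBytes(buf));
            } finally {
                buf.release();
            }
        });

        try (S3AsyncClient s3 = S3AsyncClient.create()) {
            s3.putObject(r -> r.bucket("some-bucket-name").key("some-key").contentLength(length),
                         AsyncRequestBody.fromPublisher(safeBody))
              .join();
        }
    }
}

Copying via ByteBufUtil.getBytes and releasing the pooled buffer immediately means the SDK only ever sees heap buffers it owns, at the cost of one extra copy per chunk.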

millems added a commit that referenced this issue Oct 25, 2021
…length returned by the publisher. (#2788)

This fixes two potential bugs:
1. A source of "checksum mismatch" exceptions (#953) when the published data length exceeds the content-length.
2. A source of "hung futures" (#2576?) when the published data is shorter than the content-length.

This may not be the definitive fix for those two issues.
@millems
Contributor

millems commented Oct 25, 2021

I've updated the error message to indicate that internal byte buffers changing mid-request can be a cause of this, as it was in @burnout171's case. Copying the data or disabling pooling in the underlying data source is an option in that situation: #2798

This error message improvement, in addition to #2788, should put a lid on this issue. We have further error message improvements coming as well with #2796, but the primary cause (the requester specifying a shorter content-length than the data provided by the data source) should be fixed with tomorrow's release.

@millems millems closed this as completed Oct 25, 2021
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
