S3: Data read has a different checksum than expected #953
Comments
Similar issue here, using 2.1.4. We're uploading files to buckets in multiple regions asynchronously. This happens after a few successful uploads.
|
I'm seeing something similar with 2.2.0. I have a basic integration test using LocalStack that produces this checksum error every time the test calls getObject asynchronously with ByteArrayAsyncResponseTransformer. As far as I can see from debugging, the checksum validation runs before the response content is fully read (the test data is a simple UUID, but the exception is raised with only half the UUID in the transformer's underlying BAOS). The failing test is in commit bfd3a767 of https://gitlab.com/gkrupa/s3web (S3BackendSpec). |
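For context, the async read path being described looks roughly like this; a minimal Java sketch, where the LocalStack endpoint, credentials, bucket, and key are assumptions rather than values from the linked test:

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class LocalStackGetObjectSketch {
    public static void main(String[] args) {
        // Assumed LocalStack edge endpoint and dummy credentials.
        try (S3AsyncClient s3 = S3AsyncClient.builder()
                .endpointOverride(URI.create("http://localhost:4566"))
                .region(Region.US_EAST_1)
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("test", "test")))
                .build()) {
            // AsyncResponseTransformer.toBytes() is the public entry point backed by
            // ByteArrayAsyncResponseTransformer, where the partially-read content was observed.
            ResponseBytes<GetObjectResponse> bytes =
                    s3.getObject(r -> r.bucket("test-bucket").key("test-key"),
                                 AsyncResponseTransformer.toBytes())
                      .join();
            System.out.println(bytes.asUtf8String());
        }
    }
}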
The fix has been released in 2.3.0. Closing the issue. |
Using 2.3.1 and still have this. |
Same. If I update the BOM version in the test I linked above it still fails. |
@ben-manes Is this going to be re-opened, or is it fixed in the next release? |
We also have the same problem. The workaround is pretty ugly and inefficient. |
@harrylaou I switched back to v1 as I was fighting bugs daily with v2 for a few months. Since then I've had no problems. Unfortunately AWS didn't dogfood this SDK before releasing it, and I don't have the time to continue beta testing. I'll switch over later once it matures. You might feel similarly. I'm not reopening this, but you can file a new issue if the AWS team doesn't reopen this one. |
Reopening, because it sounds like there are more cases where this can occur that haven't already been covered by previous fixes. |
@ben-manes, @vicpara, do either of you have an easily reproducible test case for this? |
We're unfortunately not able to reproduce this on our end. Is there some specific configuration you have on your object that may be causing this? |
@millems We are using the library in Scala. Ideally we would like to call the method you have defined; in Scala the code would be as follows. Also, please note that, since this doesn't work, we are using the following workaround:
|
Still have this issue on version "software.amazon.awssdk:s3:2.5.10". Code example:
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Data read has a different checksum than expected. Was 1229635159, but expected 0. Any suggestions on how to fix? |
@dementiev
|
Hi, Using version Error happens consistently using the following client (Kotlin code)
Writing an object as follows:
Using Kotlin Coroutines extensions. For the non-async client it works. Here is the exception:
Any ideas? Thanks! |
@tim-fdc Do you have any special bucket configuration, or special characters in the bucket name? The following works for me (Java runtime):
AwsSessionCredentials credentials = AwsSessionCredentials.create("<access key id>",
        "<secret access key>",
        "<session token>");
try (S3AsyncClient client = S3AsyncClient.builder()
.credentialsProvider(() -> credentials)
.region(Region.EU_WEST_1)
.build()) {
client.createBucket(r -> r.bucket("tmp-millem")).join();
client.putObject(r -> r.bucket("tmp-millem").key("test"), AsyncRequestBody.fromBytes("test".getBytes())).join();
System.out.println(client.getObject(r -> r.bucket("tmp-millem").key("test"), AsyncResponseTransformer.toBytes())
.join()
.asUtf8String());
client.deleteObject(r -> r.bucket("tmp-millem").key("test")).join();
client.deleteBucket(r -> r.bucket("tmp-millem")).join();
} |
- Use async apis instead of sync apis for everything but the S3 download jobs because of aws/aws-sdk-java-v2#953
- Actually make the `threads` setting work: now we can limit the amount of parallel executions with it
- Download artifacts after each job has finished and not at the end of all the jobs
- Moved the helper functions to separate files
Closes #6 #2
@millems, I've provided a reproducer here; it shows the conditions under which the race appears. I've spent quite some time together with colleagues from the Kotlin project to narrow down the issue and finally ended up at OrderedWriteChannelHandlerContext::doInOrder.
There the actual "magic" happens depending on which thread is doing the write. |
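For readers who don't have that class in front of them, the pattern being referred to is roughly the standard Netty "run inline if already on the event loop, otherwise hand off to it" dispatch. A simplified, generic illustration follows; it is not the SDK's actual OrderedWriteChannelHandlerContext code, just a sketch of the idea:

import io.netty.channel.ChannelHandlerContext;

final class OrderedWriteSketch {
    // Runs the task inline when the caller is already on the channel's event loop,
    // otherwise submits it to the event loop. Which branch is taken depends on the
    // calling thread, which is why writes issued from different threads can end up
    // being ordered differently than the caller expects.
    static void doInOrder(ChannelHandlerContext ctx, Runnable task) {
        if (ctx.channel().eventLoop().inEventLoop()) {
            task.run();
        } else {
            ctx.channel().eventLoop().execute(task);
        }
    }
}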
@marc-christian-schulze I can't determine where this assumption is breaking down without a repro case, and I can't fix the issue without determining where that assumption is breaking down. I'll give your repro case a try and see where that assumption is wrong. Thanks! |
As an update from today's investigation: I can reproduce the exception using @marc-christian-schulze's code, but cannot yet attribute it to a specific component. I don't see anything obviously wrong with the publisher that would take some of the blame off the SDK, but that's really difficult to tell at a glance. It might be the publisher and it might be the SDK; more research is still needed. That said, the publisher does look like it is working correctly from a naive perspective (onNext is never invoked in parallel, it's being invoked in the correct order, etc.), so the working theory should be that it's the SDK until proven otherwise. |
As a last update for today, I think the issue is that the |
Thanks to @marc-christian-schulze for the repro case, which I was able to fix with #1967. It looks like netty-reactive-streams can attach the same handler to a connection twice, so the |
This fixes a cause of "S3: Data read has a different checksum than expected": #953
Hi. Unfortunately, I am still able to reproduce the issue with SDK version 2.15.11. I prepared a demo project to show the issue. The project has three tests; the first uploads a ~100 kB array with checksum validation, and it is the only green test in the set. It looks like the packets are still being reordered, as @marc-christian-schulze mentioned. |
Self-contained repro:
S3AsyncClient s3 = S3AsyncClient.create();
String bucketKey = "some-bucket-name";
try {
    // The bucket may already exist from a previous run; ignore failures here.
    s3.createBucket(r -> r.bucket(bucketKey)).join();
} catch (Exception e) {
}
byte[] expectedOutputBytes = new byte[1_000_000];
ThreadLocalRandom.current().nextBytes(expectedOutputBytes);
for (int iteration = 1; iteration < 1000; iteration++) {
List<byte[]> output = new ArrayList<>();
ByteArrayInputStream bytesInput = new ByteArrayInputStream(expectedOutputBytes);
while (true) {
    byte[] buffer = new byte[ThreadLocalRandom.current().nextInt(0, 1000)];
    int read = bytesInput.read(buffer);
    if (read < 0) {
        break;
    }
    // The whole buffer is added even when the final read fills it only partially,
    // so the total number of bytes published can exceed expectedOutputBytes.length.
    output.add(buffer);
}
System.out.println("Uploading " + output.size() + " chunks");
// contentLength is the original array length, which can be smaller than the
// combined size of the chunks in output that are actually published.
s3.putObject(r -> r.bucket(bucketKey).key(bucketKey).contentLength((long) expectedOutputBytes.length),
        AsyncRequestBody.fromPublisher(Flux.fromIterable(output).map(ByteBuffer::wrap)))
        .join();
System.out.println("Downloading " + output.size() + " chunks");
ResponseBytes<GetObjectResponse> response =
s3.getObject(r -> r.bucket(bucketKey).key(bucketKey),
AsyncResponseTransformer.toBytes()).join();
assertThat(response.asByteArrayUnsafe()).isEqualTo(expectedOutputBytes);
} |
Added a fix for my repro above with #2788. Will keep looking into the repro from @burnout171; theirs is a complicated one that involves proxying requests to a mock S3. |
@burnout171's repro fails because it relies on the source publisher being a Netty service. That underlying service reuses the byte buffers passed to the SDK before the SDK reads them, so the data inside the ByteBuffer is being changed by the SDK caller before the SDK can read it. This is an extremely unusual case, so I suspect that most customers will be fixed by #2788. I'm not even sure whether I want to fix @burnout171's use case, because it would require the SDK defensively copying every byte provided by the customer. It's an extremely unusual use case, and working around it would potentially impact everyone. Instead, I wonder whether it's possible to just warn when it happens. I'll think about it and report back. Let me know if anyone here has an opinion. |
An opt-in configuration for the SDK to defensively copy all ByteBuffers might make sense, purely as a convenience option, but the general Netty guidance is to not expose pooled ByteBufs to components outside of your control, for the exact reasons above. Overall, I think we should advise users to either perform a deep-copy of pooled ByteBufs or disable ByteBuf pooling altogether. |
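For anyone who ends up in the pooled-buffer situation described above, the deep-copy workaround can be as small as copying the readable bytes into a fresh heap array before handing a ByteBuffer to the SDK. A minimal sketch; the helper name and surrounding wiring are made up, not SDK API:

import java.nio.ByteBuffer;
import io.netty.buffer.ByteBuf;

final class ByteBufCopies {
    // Copies the readable region of a (possibly pooled) Netty ByteBuf into a
    // heap-backed ByteBuffer that the SDK can read at any time, even after the
    // original buffer has been released back to the pool.
    static ByteBuffer deepCopy(ByteBuf pooled) {
        byte[] copy = new byte[pooled.readableBytes()];
        pooled.getBytes(pooled.readerIndex(), copy); // absolute getBytes does not move the reader index
        return ByteBuffer.wrap(copy);
    }
}

A Publisher of ByteBufs coming out of a Netty-backed service could be mapped through such a helper before being wrapped with AsyncRequestBody.fromPublisher; alternatively, disable ByteBuf pooling on the source's allocator as suggested above.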
…length returned by the publisher. (#2788) This fixes two potential bugs:
1. A source of "checksum mismatch" exceptions (#953) when the published data length exceeds the content-length.
2. A source of "hung futures" (#2576?) when the published data is shorter than the content-length.
This may not be the definitive fix for those two issues.
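To illustrate the first bug from the caller's side: when streaming with AsyncRequestBody.fromPublisher, the contentLength on the request has to match the number of bytes the publisher actually emits. A minimal sketch under that assumption, with made-up names and the chunks already materialized in memory:

import java.nio.ByteBuffer;
import java.util.List;
import reactor.core.publisher.Flux;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;

final class ContentLengthSketch {
    // Declares exactly as many bytes as the publisher will emit, avoiding the
    // "published data length exceeds the content-length" mismatch described above.
    static void upload(S3AsyncClient s3, String bucket, String key, List<byte[]> chunks) {
        long contentLength = chunks.stream().mapToLong(c -> c.length).sum();
        s3.putObject(r -> r.bucket(bucket).key(key).contentLength(contentLength),
                     AsyncRequestBody.fromPublisher(Flux.fromIterable(chunks).map(ByteBuffer::wrap)))
          .join();
    }
}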
I've updated the error message to indicate that internal byte buffers changing might be a cause of this, as it was in @burnout171's case. Performing a data copy or disabling pooling in the underlying data source is an option in that situation: #2798. This error message improvement, in addition to #2788, should put a lid on this issue. We have further error message improvements coming down the chain as well with #2796, but the primary cause (content-lengths from the requester that are shorter than the data provided by the data source) should be fixed with tomorrow's release. |
This might be fixed by 472f5bf, but I'm not sure. I'm also seeing some errors where response.contentLength() differs from the actual size again, so there might be a race condition again. This is using 2.1.4 + Netty.