
S3 putObject (async) is hanging when using ByteArrayAsyncRequestBody #2576

Closed
jmassiot opened this issue Jul 5, 2021 · 4 comments
Labels
bug This issue is a bug.

@jmassiot

jmassiot commented Jul 5, 2021

From time to time, we are facing cases where the S3 async putObject API hangs.

Describe the bug

We are currently making quite intensive use of the S3 async putObject API (~400 puts/s), with either a FileAsyncRequestBody or a ByteArrayAsyncRequestBody.
And from time to time (~once a day), some calls to putObject hang:

  • when using the FileAsyncRequestBody, we are most probably hitting the issue fixed by PR Fix a race condition in FileAsyncRequestBody #2536
  • but we see a similar hang when using the ByteArrayAsyncRequestBody, and the cause there is not clear; that is the purpose of this issue

Expected Behavior

Calls to putObject with a ByteArrayAsyncRequestBody should not hang.

Current Behavior

Calls to putObject with a ByteArrayAsyncRequestBody occasionally hang.

Steps to Reproduce

As mentioned above, the issue is very difficult to reproduce.
In our case it happens approximately once a day, with production traffic and several hundred puts per second.

Possible Solution

While comparing FileAsyncRequestBody and ByteArrayAsyncRequestBody, I found the following points I have doubts about.

  • in FileAsyncRequestBody, around the subscribe method, there is this comment:
            // We need to synchronize here because the subscriber could call
            // request() from within onSubscribe which would potentially
            // trigger onNext before onSubscribe is finished.
            Subscription subscription = new FileSubscription(channel, s, chunkSizeInBytes);
            synchronized (subscription) {
                s.onSubscribe(subscription);
            }
  • whereas in ByteArrayAsyncRequestBody, we have the following:
        // As per 2.13, this method must return normally (i.e. not throw).
        try {
            s.onSubscribe(
                    new Subscription() {
                        private boolean done = false;

                        @Override
                        public void request(long n) {
                            if (done) {
                                return;
                            }
                            if (n > 0) {
                                done = true;
                                s.onNext(ByteBuffer.wrap(bytes));
                                s.onComplete();
                            } else {
                                s.onError(new IllegalArgumentException("§3.9: non-positive requests are not allowed!"));
                            }
                        }

So, in the case of ByteArrayAsyncRequestBody, we do not have the same synchronization as in FileAsyncRequestBody, which means onNext may be called on another thread while onSubscribe has not yet finished.
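
For illustration, here is a minimal sketch of how the same synchronized-onSubscribe guard could hypothetically look for a byte-array body; the class is illustrative only (not the SDK source), but it reuses the Subscription logic quoted above:

    import java.nio.ByteBuffer;
    import org.reactivestreams.Subscriber;
    import org.reactivestreams.Subscription;

    // Hypothetical byte-array publisher that applies the FileAsyncRequestBody
    // pattern: hold the subscription's monitor while calling onSubscribe, so a
    // request() arriving from another thread cannot run before onSubscribe ends.
    final class SynchronizedByteArrayBodySketch {

        private final byte[] bytes;

        SynchronizedByteArrayBodySketch(byte[] bytes) {
            this.bytes = bytes.clone();
        }

        public void subscribe(Subscriber<? super ByteBuffer> s) {
            Subscription subscription = new Subscription() {
                private boolean done = false;

                @Override
                public void request(long n) {
                    // Taking the same monitor the publisher holds during
                    // onSubscribe serializes request() with onSubscribe.
                    synchronized (this) {
                        if (done) {
                            return;
                        }
                        if (n > 0) {
                            done = true;
                            s.onNext(ByteBuffer.wrap(bytes));
                            s.onComplete();
                        } else {
                            s.onError(new IllegalArgumentException(
                                    "§3.9: non-positive requests are not allowed!"));
                        }
                    }
                }

                @Override
                public void cancel() {
                    synchronized (this) {
                        done = true;
                    }
                }
            };

            // Same idea as the FileAsyncRequestBody comment quoted above.
            synchronized (subscription) {
                s.onSubscribe(subscription);
            }
        }
    }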

Furthermore, in StreamedRequest within NettyRequestExecutor, we have the following:

    private static class StreamedRequest extends DelegateHttpRequest implements StreamedHttpRequest {

        private final Publisher<ByteBuffer> publisher;
        private final Optional<Long> requestContentLength;
        private long written = 0L;
        private boolean done;
        private Subscription subscription;   <=== the subscription field is not volatile
...
...
        @Override
        public void subscribe(Subscriber<? super HttpContent> subscriber) {
            publisher.subscribe(new Subscriber<ByteBuffer>() {
                @Override
                public void onSubscribe(Subscription subscription) {
                    StreamedRequest.this.subscription = subscription;   <=== the subscription field is set by one thread
                    subscriber.onSubscribe(subscription);
                }

                @Override
                public void onNext(ByteBuffer contentBytes) {
                    if (done) {
                        return;
                    }

                    try {
                        int newLimit = clampedBufferLimit(contentBytes.remaining());
                        contentBytes.limit(newLimit);
                        ByteBuf contentByteBuf = Unpooled.wrappedBuffer(contentBytes);
                        HttpContent content = new DefaultHttpContent(contentByteBuf);

                        subscriber.onNext(content);
                        written += newLimit;

                        if (!shouldContinuePublishing()) {
                            done = true;
                            subscription.cancel();                    <=== and potentially accessed by another thread, no ?
                            subscriber.onComplete();
                        }
                    } catch (Throwable t) {
                        onError(t);
                    }
                }

Don't you think that the subscription field should be volatile here?

  • If one thread is setting this field and another one is trying to access it, the one accessing it may face a NullPointerException, no? (That exception would then go to the onError method, which will not display anything because done has been set to true by the thread calling onNext.)
  • And if that is the case, then the onComplete method will never be called on the subscriber, no? (A simplified sketch of this concern follows below.)
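
To make the concern concrete, here is a simplified sketch (not the SDK's actual class): the field is written by the thread running onSubscribe and read by whichever thread later runs onNext; without volatile (or another happens-before edge established by the publisher), the reading thread could in principle still see null:

    import org.reactivestreams.Subscription;

    // Simplified illustration of the visibility concern; this is not the SDK's
    // StreamedRequest, just the field-access pattern under discussion.
    final class SubscriptionVisibilitySketch {

        // Declaring the field volatile guarantees that the write performed in
        // onSubscribe is visible to any thread that subsequently reads it.
        private volatile Subscription subscription;

        void onSubscribe(Subscription s) {
            this.subscription = s;   // written on the thread delivering onSubscribe
        }

        void cancelFromOnNext() {
            Subscription local = subscription;   // read once into a local variable
            if (local != null) {                 // guards against a null read
                local.cancel();
            }
        }
    }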

Context

We are facing this issue in production.
There are some workarounds, like putting a timeout on the call and then re-launching the put (see the sketch below).
But it would be good if we could avoid such behavior in the first place.
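
For reference, a minimal sketch of that timeout-and-retry workaround, assuming the standard SDK v2 APIs; the bucket, key, payload, and durations are placeholders:

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.concurrent.TimeUnit;
    import software.amazon.awssdk.core.async.AsyncRequestBody;
    import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
    import software.amazon.awssdk.services.s3.S3AsyncClient;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class PutWithTimeoutWorkaround {

        public static void main(String[] args) {
            // apiCallTimeout bounds the whole API call, so a hung put should fail
            // with a timeout instead of leaving the returned future incomplete.
            S3AsyncClient s3 = S3AsyncClient.builder()
                    .overrideConfiguration(ClientOverrideConfiguration.builder()
                            .apiCallTimeout(Duration.ofSeconds(30))
                            .build())
                    .build();

            byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
            PutObjectRequest request = PutObjectRequest.builder()
                    .bucket("my-bucket")   // placeholder
                    .key("my-key")         // placeholder
                    .build();

            try {
                s3.putObject(request, AsyncRequestBody.fromBytes(payload))
                  .orTimeout(60, TimeUnit.SECONDS)   // extra guard on the future itself
                  .join();
            } catch (Exception firstAttemptFailed) {
                // Re-launch the put once, as described above.
                s3.putObject(request, AsyncRequestBody.fromBytes(payload))
                  .orTimeout(60, TimeUnit.SECONDS)
                  .join();
            } finally {
                s3.close();
            }
        }
    }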

Your Environment

  • AWS Java SDK version used: 2.10.62
  • JDK version used: jdk16
  • Operating System and version: Linux
@jmassiot jmassiot added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 5, 2021
@debora-ito
Member

Hi @jmassiot, thank you for reaching out, and apologies for the lack of response.

We're seeing other reports of the same issue you described; we'll do a deeper investigation into possible causes. This may take a while, since it's not easily reproducible.

@debora-ito debora-ito removed the needs-triage This issue or PR still needs to be triaged. label Aug 24, 2021
millems added a commit that referenced this issue Oct 25, 2021
…length returned by the publisher. (#2788)

This fixes two potential bugs:
1. A source of "checksum mismatch" exceptions (#953) when the published data length exceeds the content-length.
2. A source of "hung futures" (#2576?) when the published data is shorter than the content-length.

This may not be the definitive fix for those two issues.
@millems
Contributor

millems commented Oct 26, 2021

I think we fixed one cause of this: specifying a contentLength on putObject that is greater than the byte length of the ByteArrayAsyncRequestBody. After tomorrow's release, can you deploy the fix and verify whether you're still seeing the issue?
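
For anyone else hitting this, a minimal sketch of the mismatch described above; the bucket, key, and byte counts are hypothetical:

    import software.amazon.awssdk.core.async.AsyncRequestBody;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class ContentLengthMismatchSketch {

        public static void main(String[] args) {
            byte[] payload = new byte[1024];
            AsyncRequestBody body = AsyncRequestBody.fromBytes(payload);

            // Problematic pattern: the declared contentLength (2048) is larger
            // than the 1024 bytes the body will actually publish, which can leave
            // the request waiting for data that never arrives.
            PutObjectRequest mismatched = PutObjectRequest.builder()
                    .bucket("my-bucket")        // placeholder
                    .key("my-key")              // placeholder
                    .contentLength(2048L)       // does NOT match payload.length
                    .build();

            // Safer pattern: either omit contentLength so the SDK can take it
            // from the AsyncRequestBody, or make sure it matches exactly.
            PutObjectRequest consistent = PutObjectRequest.builder()
                    .bucket("my-bucket")
                    .key("my-key")
                    .contentLength((long) payload.length)
                    .build();

            // s3AsyncClient.putObject(consistent, body) then publishes exactly
            // contentLength bytes.
        }
    }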

@debora-ito
Member

@jmassiot the fix was released in SDK version 2.17.68, please try it out and let us know if you still see the issue.

@github-actions

github-actions bot commented Nov 3, 2021

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
