Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrent requests cause StreamIDTooSmall error, channel state becomes transientFailure #912

Closed
i9 opened this issue Jul 28, 2020 · 12 comments · Fixed by #922
Closed

concurrent requests cause StreamIDTooSmall error, channel state becomes transientFailure #912

i9 opened this issue Jul 28, 2020 · 12 comments · Fixed by #922
Labels
kind/bug Feature doesn't work as expected.

Comments

@i9
Copy link

i9 commented Jul 28, 2020

Describe the bug

grpc-swift version 1.0.0-alpha.17, swift-nio-http2 from: "1.12.1"

grpc-swift client(iOS App) --- nginx --- grpc server

client sends 3 different rpcs almost at the same time, one failed due to StreamIDTooSmall, the other two failed due to unavailable. error log:

2020-07-27T15:19:49+0800 error: connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=StreamIDTooSmall() grpc client error
2020-07-27T15:19:49+0800 info: new_state=transientFailure old_state=ready connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 connectivity state change
"gRPC state did change from ready to transientFailure"
2020-07-27T15:19:49+0800 error: request_id=9168A18B-3ED8-43F2-BCF5-188C1F68FF9A call_state=active connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=unavailable (14) grpc client error
2020-07-27T15:19:49+0800 error: call_state=active request_id=517D243A-8A9B-413C-97E9-7078E3866614 connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=unavailable (14) grpc client error

But from nginx log, 2 out of 3 responses are returned successfully. one had error due to client prematurely closed connection while processing HTTP/2 connection which we assume is due to client side close and reconnect.

we suspect in swift-nio-http2 createStreamChannel, self.nextOutboundStreamID = HTTP2StreamID(Int32(streamID) + 2) isn't thread safe, but we couldn't find any doc or guidance what to do.

To reproduce

Steps to reproduce the bug you've found:

  1. send concurrent grpc requests

Expected behaviour

All requests succeeded.

Additional information

about 15% of our users are seeing the error. If it's the api caller's responsibility to avoid concurrent requests, we would like to see the doc clearly mention that.

I also filed apple/swift-nio-http2#213 but unsure which repo will fix the issue (if it's indeed a bug)

@i9 i9 added the kind/bug Feature doesn't work as expected. label Jul 28, 2020
@Lukasa
Copy link
Collaborator

Lukasa commented Jul 28, 2020

The underlying issue is in swift-nio-http2, as I discussed in apple/swift-nio-http2#213. However, the fix there is complex, so we should consider whether we can work around the issue in grpc.

There are two possible workarounds that I can see. The first is to apply a queue in grpc to ensure that we flush data in the order we create the streams. This is annoying, but can work. The second is even easier: just consider StreamIDTooSmall errors to be trivially retryable, and to always retry when we hit them.

@glbrntt, thoughts?

@glbrntt
Copy link
Collaborator

glbrntt commented Jul 28, 2020

@Lukasa I like the simplicity of just retrying, but StreamIDTooSmall is thrown at the connection level and I don't think there's currently a way to recover that information at the stream level (since the multiplexer doesn't route errors into the stream channel).

Unless I'm missing something we'd have to make changes to NIOHTTP2 (namely: add the streamID to the error and route that error into the relevant stream via the multiplexer) to allow for this. It'd at least enable other HTTP/2 users to work around this in the same way.

@Lukasa
Copy link
Collaborator

Lukasa commented Jul 28, 2020

Hmm, yeah, that's gonna be trickier. Let's chat about this today.

@glbrntt
Copy link
Collaborator

glbrntt commented Jul 29, 2020

@i9 just to provide an update: we decided to fix this in NIOHTTP2 rather than working around it in gRPC. The issue tracking this is: apple/swift-nio-http2#214

@i9
Copy link
Author

i9 commented Jul 30, 2020

@i9 just to provide an update: we decided to fix this in NIOHTTP2 rather than working around it in gRPC. The issue tracking this is: apple/swift-nio-http2#214

really appreciate the swift action 😉

@givip
Copy link

givip commented Jul 31, 2020

This bug is easy reproduce on iOS device (on Simulator this happens very rarely) using NIOTSEventLoopGroup
I launched two simple RPC calls one after another.
First returns with result, second freezes and after some amount of time fails future with error:

2020-07-31 10:52:02.075090+0400  [] tcp_input [C2:1] flags=[R.] seq=3741818499, ack=2511433662, win=2056 state=ESTABLISHED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.167720+0400  [] tcp_output [C2:1] flags=[R.] seq=2511433662, ack=3741818499, win=1028 state=CLOSED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.169659+0400  [] nw_read_request_report [C2] Receive failed with error "Operation timed out"
2020-07-31 10:52:04.170263+0400  [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0
2020-07-31 10:52:04.171069+0400  [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0, dumping backtrace:
        [arm64] libnetcore-1880.120.4
    0   libnetwork.dylib                    0x000000018eb68e94 __nw_create_backtrace_string + 116
    1   libusrtcp.dylib                     0x000000019184de40 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 273984
    2   libusrtcp.dylib                     0x000000019184cfe8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 270312
    3   libusrtcp.dylib                     0x000000019184c2f8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 267000
    4   libusrtcp.dylib                     0x0000000191851438 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 287800
    5   libusrtcp.dylib                     0x000000019182aa58 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 129624
    6   libdispatch.dylib                   0x0000000108986338 _dispatch_call_block_and_release + 24
    7   libdispatch.dylib                   0x0000000108987730 _dispatch_client_callout + 16
    8   libdispatch.dylib                   0x000000010899081c _dispatch_workloop_invoke + 2480
    9   libdispatch.dylib                   0x000000010899a6c4 _dispatch_workloop_worker_thread + 1304
    10  libsystem_pthread.dylib             0x000000018c21cb74 _pthread_wqthread + 272
    11  libsystem_pthread.dylib             0x000000018c21f740 start_wqthread + 8
2020-07-31T10:52:04+0400 error: error=POSIXErrorCode: Operation timed out connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 grpc client error
2020-07-31T10:52:04+0400 info: old_state=ready connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 new_state=transientFailure connectivity state change
2020-07-31T10:52:04+0400 error: request_id=D83D0506-D423-450D-B1BD-E36F04BEEE2F connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 error=unavailable (14) call_state=active grpc client error

@Lukasa
Copy link
Collaborator

Lukasa commented Jul 31, 2020

@givip There is no evidence in that log of this error occurring. Is there any evidence elsewhere?

@givip
Copy link

givip commented Jul 31, 2020

@Lukasa What you are considering as evidence? I thought, last 3 lines of log mean the same error as @i9 described. May be needed some example project with minimum code to reproduce?

@Lukasa
Copy link
Collaborator

Lukasa commented Jul 31, 2020

This log:

2020-07-27T15:19:49+0800 error: connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=StreamIDTooSmall() grpc client error

Note the presence of the string StreamIDTooSmall() in the log, which marks it as this error.

@givip
Copy link

givip commented Aug 1, 2020

@Lukasa I don't have this error's mention, but it seems that bug behavior is the same. I start two simultaneous requests and the second never performs, after some amount of time if fails with error

@Lukasa
Copy link
Collaborator

Lukasa commented Aug 1, 2020

The symptom may be the same but without that log it cannot be this issue.

@givip
Copy link

givip commented Aug 1, 2020

@Lukasa My issue is exactly #901 . Sorry

glbrntt added a commit to glbrntt/grpc-swift that referenced this issue Aug 6, 2020
Motivation:

The multiplexer from SwiftNIO HTTP/2 required that the first write on
each stream matched the order in which the streams were created.
Violating this led to a connection error; all in-flight and subsequent
RPCs on that connection would fail. Some of our users noticed this
(grpc#912). SwiftNIO HTTP/2 recently reworked the API around how streams
are created: HTTP/2 stream channels are no longer created with a stream
ID and now deliver the frame payload to the channel pipeline rather than
the entire HTTP/2 frame. This allows for a stream ID to be assigned to a
stream when it attempts to flush its first write, rather than when the
stream is created.

Modifications:

- Increase the minimum HTTP/2 version to 1.13.0
- Move away from deprecated APIs: this required changing the inbound-in
  and outbound-out types for `_GRPCClientChannelHandler` as well as a
  few smaller changes elsewhere.

Result:

- RPCs can be created concurrently without fear of violating any stream
  ID ordering rules
- Resolves grpc#912
glbrntt added a commit that referenced this issue Aug 7, 2020
Motivation:

The multiplexer from SwiftNIO HTTP/2 required that the first write on
each stream matched the order in which the streams were created.
Violating this led to a connection error; all in-flight and subsequent
RPCs on that connection would fail. Some of our users noticed this
(#912). SwiftNIO HTTP/2 recently reworked the API around how streams
are created: HTTP/2 stream channels are no longer created with a stream
ID and now deliver the frame payload to the channel pipeline rather than
the entire HTTP/2 frame. This allows for a stream ID to be assigned to a
stream when it attempts to flush its first write, rather than when the
stream is created.

Modifications:

- Increase the minimum HTTP/2 version to 1.13.0
- Move away from deprecated APIs: this required changing the inbound-in
  and outbound-out types for `_GRPCClientChannelHandler` as well as a
  few smaller changes elsewhere.

Result:

- RPCs can be created concurrently without fear of violating any stream
  ID ordering rules
- Resolves #912
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Feature doesn't work as expected.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants