Crashing on iOS after receiving an error from Network stack while idle #1132

Closed
rlasante opened this issue Feb 10, 2021 · 2 comments · Fixed by #1133
Labels
kind/bug Feature doesn't work as expected.

Comments

@rlasante

Describe the bug

Versions:
grpc-swift 1.0.0-alpha.22
protoc-gen-grpc-swift

Platforms:
iOS 13.3.1, 13.4.1, 13.6.1, 14.2, 14.3, 14.4

In our application we're using grpc-swift 1.0.0-alpha.22 to stream audio data from a mobile client up to our server.
In our production crash logs we're seeing the following stack trace: immediately before the crash, the state of our gRPC connection transitions to idle, and then we receive this error.

We appear to be hitting the precondition at ConnectionManager:470.

0 Streaming 0x0000000104b7f76c GRPC.ConnectionManager.channelError(Swift.Error) -> () + 720
1 Streaming 0x0000000104b96658 protocol witness for NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () in conformance GRPC.GRPCIdleHandler : NIO._ChannelInboundHandler in GRPC + 44
2 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
3 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
4 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
5 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
6 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
7 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
8 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
9 Streaming 0x0000000104c1894c NIO.ChannelPipeline.fireErrorCaught(Swift.Error) -> () + 92
10 Streaming 0x0000000104cb10e0 function signature specialization <Arg[1] = Dead> of NIOTransportServices.NIOTSConnectionChannel.(dataReceivedHandler in _8F3A0963B0F58CDCE328C45B7B282707)(content: Foundation.Data?, context: Network.NWConnection.ContentContext?, isComplete: Swift.Bool, error: Network.NWError?) -> () + 704
11 libswiftNetwork.dylib 0x00000001e466db90 closure #1 (__C.OS_dispatch_data?, __C.OS_nw_content_context?, Swift.Bool, __C.OS_nw_error?) -> () in Network.NWConnection.receiveMessage(completion: (Foundation.Data?, Network.NWConnection.ContentContext?, Swift.Bool, Network.NWError?) -> ()) -> () + 168
12 libswiftNetwork.dylib 0x00000001e466d7cc reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed __C.OS_dispatch_data?, @guaranteed __C.OS_nw_content_context?, @unowned Swift.Bool, @guaranteed __C.OS_nw_error?) -> () to @escaping @callee_unowned @convention(block) (@unowned __C.OS_dispatch_data?, @unowned __C.OS_nw_content_context?, @unowned Swift.Bool, @unowned __C.OS_nw_error?) -> () + 96
13 libdispatch.dylib 0x00000001989f7fb8 _dispatch_block_async_invoke2 + 144
14 libdispatch.dylib 0x00000001989e9db0 _dispatch_client_callout + 16
15 libdispatch.dylib 0x00000001989f110c _dispatch_lane_serial_drain + 576
16 libdispatch.dylib 0x00000001989f1c90 _dispatch_lane_invoke + 456
17 libdispatch.dylib 0x00000001989f0fd8 _dispatch_lane_serial_drain + 268
18 libdispatch.dylib 0x00000001989f1c90 _dispatch_lane_invoke + 456
19 libdispatch.dylib 0x00000001989fbd78 _dispatch_workloop_worker_thread + 704
20 libsystem_pthread.dylib 0x00000001e48a5814 _pthread_wqthread + 272
21 libsystem_pthread.dylib 0x00000001e48ac76c start_wqthread + 4
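
The precondition appears to encode an assumption that channel errors can never arrive while the connection is idle. A simplified sketch of that pattern (illustrative only; the type name, state names, and handling here are assumptions, not the real ConnectionManager source):

    enum ConnectionState {
      case idle, connecting, ready, transientFailure, shutdown
    }

    final class SimplifiedConnectionManager {
      private var state: ConnectionState = .idle

      func channelError(_ error: Error) {
        switch state {
        case .idle:
          // This is the pattern that traps: an error while idle is assumed
          // to be impossible, so the process is deliberately crashed.
          preconditionFailure("received error while idle: \(error)")
        case .connecting, .ready:
          state = .transientFailure
        case .transientFailure, .shutdown:
          break  // already failing or shut down; nothing more to do
        }
      }
    }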

To reproduce

I haven't been able to reproduce the issue locally so far; the only way I've gotten a similar stack trace is via a connection termination from the server due to a 120s timeout.

Steps to reproduce the stack trace (note: the current state hasn't been idle for me, so I haven't hit the crash). A rough client sketch follows the list.

  1. Create a secure connection to the server
  2. Make a bidirectional RPC
  3. Start streaming data to the server
  4. Pause sending any data to the server while still connected
  5. Receive a PEER_RESET from the Network framework due to the server cancelling the connection
  6. (So far this hasn't crashed for me because the state wasn't idle)
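
A rough sketch of the client side of those steps (not a drop-in reproduction: the generated Audio_AudioServiceClient, its streamAudio method, and the Audio_AudioChunk message are hypothetical stand-ins for our service, and the connection builder calls are the 1.0.0-alpha grpc-swift API as I understand it):

    import Foundation
    import GRPC
    import NIO

    let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)

    // 1. Create a secure connection to the server.
    let connection = ClientConnection
      .secure(group: group)
      .connect(host: "example.com", port: 443)

    // 2. Make a bidirectional RPC (client type and method are hypothetical
    //    stand-ins for our generated audio-streaming service).
    let client = Audio_AudioServiceClient(channel: connection)
    let call = client.streamAudio { response in
      print("response: \(response)")
    }

    // 3. Start streaming data to the server.
    let chunk = Audio_AudioChunk.with { $0.data = Data(repeating: 0, count: 320) }
    call.sendMessage(chunk, promise: nil)

    // 4. Pause: keep the call open but stop sending for longer than the
    //    server's 120s idle timeout.
    // 5. The server resets the connection and Network.framework reports a
    //    PEER_RESET, which is what eventually reaches GRPCIdleHandler.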

Expected behaviour

Either we don't receive errors from the Network framework while idle, or we don't crash when we do.

Additional information

  • This is using a BidirectionalStreamingCall
  • It seems related to the user having paused recording, so we're not actively streaming audio for varying periods of time
  • We have also frequently received an unavailable (14) error on the RPC prior to the network error
  • The only other way I've been able to get a similar stack trace is when the server has killed the connection on its end (after a 120s timeout)
  • I have not been able to reproduce this locally, but it's happening in production consistently
  • We haven't reproduced it yet with a non-optimized build
  • The one crash that we had extra logging enabled for indicated that the app was in the background when it switched states to idle and then received the network error.

I've tried debugging this a number of times, but it's very intermittent: I can't seem to recreate the timing where we transition to idle before receiving a network error, nor can I figure out how to avoid that timing.
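
For anyone chasing the same timing, the extra logging mentioned above can be attached through grpc-swift's delegate hooks, roughly like this (the delegate protocols and builder methods are named as I understand the 1.0.0-alpha API, so treat the exact spellings as assumptions):

    import GRPC
    import Logging
    import NIO

    // Logs every connectivity transition and every error the client catches,
    // so the "idle, then error" ordering shows up in the logs.
    final class LoggingDelegate: ConnectivityStateDelegate, ClientErrorDelegate {
      func connectivityStateDidChange(from oldState: ConnectivityState, to newState: ConnectivityState) {
        print("connectivity: \(oldState) -> \(newState)")
      }

      func didCatchError(_ error: Error, logger: Logger, file: StaticString, line: Int) {
        print("caught error: \(error) at \(file):\(line)")
      }
    }

    let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)
    let delegate = LoggingDelegate()

    let connection = ClientConnection
      .secure(group: group)
      .withConnectivityStateDelegate(delegate)
      .withErrorDelegate(delegate)
      .connect(host: "example.com", port: 443)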

rlasante added the kind/bug label Feb 10, 2021
@Lukasa
Collaborator

Lukasa commented Feb 10, 2021

Yeah, this crash seems like it shouldn't be possible, so that means there's some misunderstanding in the logic around the management of the GRPCIdleHandler and the ConnectionManager.

@Lukasa
Collaborator

Lukasa commented Feb 10, 2021

Yeah, that precondition just seems excessively aggressive. There is no particular reason that an idle channel shouldn't be able to error. An error there is certainly surprising, but the precondition is insufficiently defensive. In this case I'm prepared to believe that the channelInactive state transition fired first, but that the error from the receive was propagated out anyway. I suspect this can happen in NIOTS, but the only way I see it happening is if we got both data and an error in the same read, which is not something I've ever seen happen.

Either way, grpc is probably being a bit aggressive here.

Lukasa added a commit to Lukasa/grpc-swift that referenced this issue Feb 10, 2021
Motivation:

When a connection is in the idle state it is unlikely, but possible, to
see channel errors. This is most likely to happen when a channel has
just gone idle, and will shortly become entirely unavailable, but some
other I/O operation sees an error in the meantime. However, it could
arguably happen before any activity has been seen at all.

Right now we handle this by crashing. That's not great, mostly because
it's entirely possible to see weird reordering of events or delayed
error delivery from other handlers. It would be better to treat these
similarly to the way we treat errors in transientFailure or shutdown: as
fundamentally not adding drastically new information.

After all, consider the two options: either this error is channel fatal,
or it is not. In either case, being idle is immaterial. If the error is
fatal to the channel, we'll see the channel go inactive shortly and be
happy. Alternatively, the channel may be just fine, in which case we can
safely attempt to re-use it later. If the channel is in a weird broken
state we'll likely hit an immediate error on reconnection and move on
with our lives.

The one minor difference I've proposed here is that it's probably worth
us notifying about errors of this kind. While they aren't likely to
cause a bug in grpc, they could be a source of all kinds of weird
issues, and at the very least they probably represent a bug somewhere
else in the code. We'd like to know about them if we can.

Modifications:

- Remove the hard-crash on error in idle, replace it with a log.

Result:

Fewer crashes.

Fixes grpc#1132.
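
Concretely, the modification described here amounts to something like the following change to the simplified sketch from the issue description: the idle case records the error instead of trapping (again illustrative only, not the actual patch):

    import Logging

    enum ConnectionState {
      case idle, connecting, ready, transientFailure, shutdown
    }

    final class SimplifiedConnectionManager {
      private var state: ConnectionState = .idle
      private let logger = Logger(label: "grpc.connection-manager")

      func channelError(_ error: Error) {
        switch state {
        case .idle:
          // Previously a preconditionFailure. Either the channel goes inactive
          // shortly, or it stays usable and can be re-used later; in both
          // cases a log is enough, and it still surfaces the oddity.
          logger.warning("ignoring error while idle: \(error)")
        case .connecting, .ready:
          state = .transientFailure
        case .transientFailure, .shutdown:
          break  // already failing or shut down; the error adds nothing new
        }
      }
    }
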
Lukasa added a commit to Lukasa/grpc-swift that referenced this issue Feb 11, 2021
Lukasa added a commit that referenced this issue Feb 11, 2021