Crashing on iOS after receiving an error from Network stack while idle #1132

Closed
rlasante opened this issue Feb 10, 2021 · 2 comments · Fixed by #1133
Labels
kind/bug Feature doesn't work as expected.

Comments

@rlasante

Describe the bug

Versions:
grpc-swift 1.0.0-alpha.22
protoc-gen-grpc-swift

Platforms:
iOS 13.3.1, 13.4.1, 13.6.1, 14.2, 14.3, 14.4

In our application we're using grpc-swift 1.0.0-alpha.22 to stream audio data from a mobile client up to our server.
In our production crash logs we're seeing the following stack trace: immediately before the crash, the state of our gRPC connection transitions to idle, and then we receive this error.

We appear to be hitting the precondition at ConnectionManager:470.

0 Streaming 0x0000000104b7f76c GRPC.ConnectionManager.channelError(Swift.Error) -> () + 720
1 Streaming 0x0000000104b96658 protocol witness for NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () in conformance GRPC.GRPCIdleHandler : NIO._ChannelInboundHandler in GRPC + 44
2 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
3 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
4 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
5 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
6 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
7 Streaming 0x0000000104c0d748 (extension in NIO):NIO._ChannelInboundHandler.errorCaught(context: NIO.ChannelHandlerContext, error: Swift.Error) -> () + 36
8 Streaming 0x0000000104c196bc NIO.ChannelHandlerContext.(invokeErrorCaught in _EEC863903996E9F191EBAFEB0FB0DFDD)(Swift.Error) -> () + 64
9 Streaming 0x0000000104c1894c NIO.ChannelPipeline.fireErrorCaught(Swift.Error) -> () + 92
10 Streaming 0x0000000104cb10e0 function signature specialization <Arg[1] = Dead> of NIOTransportServices.NIOTSConnectionChannel.(dataReceivedHandler in _8F3A0963B0F58CDCE328C45B7B282707)(content: Foundation.Data?, context: Network.NWConnection.ContentContext?, isComplete: Swift.Bool, error: Network.NWError?) -> () + 704
11 libswiftNetwork.dylib 0x00000001e466db90 closure #1 (__C.OS_dispatch_data?, __C.OS_nw_content_context?, Swift.Bool, __C.OS_nw_error?) -> () in Network.NWConnection.receiveMessage(completion: (Foundation.Data?, Network.NWConnection.ContentContext?, Swift.Bool, Network.NWError?) -> ()) -> () + 168
12 libswiftNetwork.dylib 0x00000001e466d7cc reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed __C.OS_dispatch_data?, @guaranteed __C.OS_nw_content_context?, @unowned Swift.Bool, @guaranteed __C.OS_nw_error?) -> () to @escaping @callee_unowned @convention(block) (@unowned __C.OS_dispatch_data?, @unowned __C.OS_nw_content_context?, @unowned Swift.Bool, @unowned __C.OS_nw_error?) -> () + 96
13 libdispatch.dylib 0x00000001989f7fb8 _dispatch_block_async_invoke2 + 144
14 libdispatch.dylib 0x00000001989e9db0 _dispatch_client_callout + 16
15 libdispatch.dylib 0x00000001989f110c _dispatch_lane_serial_drain + 576
16 libdispatch.dylib 0x00000001989f1c90 _dispatch_lane_invoke + 456
17 libdispatch.dylib 0x00000001989f0fd8 _dispatch_lane_serial_drain + 268
18 libdispatch.dylib 0x00000001989f1c90 _dispatch_lane_invoke + 456
19 libdispatch.dylib 0x00000001989fbd78 _dispatch_workloop_worker_thread + 704
20 libsystem_pthread.dylib 0x00000001e48a5814 _pthread_wqthread + 272
21 libsystem_pthread.dylib 0x00000001e48ac76c start_wqthread + 4
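
The precondition appears to encode an assumption that channel errors can never arrive while the connection is idle. A simplified sketch of that pattern (illustrative only; the type name, state names, and handling here are assumptions, not the real ConnectionManager source):

    enum ConnectionState {
      case idle, connecting, ready, transientFailure, shutdown
    }

    final class SimplifiedConnectionManager {
      private var state: ConnectionState = .idle

      func channelError(_ error: Error) {
        switch state {
        case .idle:
          // This is the pattern that traps: an error while idle is assumed
          // to be impossible, so the process is deliberately crashed.
          preconditionFailure("received error while idle: \(error)")
        case .connecting, .ready:
          state = .transientFailure
        case .transientFailure, .shutdown:
          break  // already failing or shut down; nothing more to do
        }
      }
    }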

To reproduce

I haven't been able to reproduce the issue locally so far; the only way I've gotten a similar stack trace is via a connection termination from the server due to a 120s timeout.

Steps to reproduce the stack trace (note: the current state hasn't been idle for me, so I haven't hit the crash). A rough client sketch follows the list.

  1. Create a secure connection to the server
  2. Make a bidirectional RPC
  3. Start streaming data to the server
  4. Pause sending any data to the server while still connected
  5. Receive a PEER_RESET from the Network framework due to the server cancelling the connection
  6. (So far this hasn't crashed for me because the state wasn't idle)
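
A rough sketch of the client side of those steps (not a drop-in reproduction: the generated Audio_AudioServiceClient, its streamAudio method, and the Audio_AudioChunk message are hypothetical stand-ins for our service, and the connection builder calls are the 1.0.0-alpha grpc-swift API as I understand it):

    import Foundation
    import GRPC
    import NIO

    let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)

    // 1. Create a secure connection to the server.
    let connection = ClientConnection
      .secure(group: group)
      .connect(host: "example.com", port: 443)

    // 2. Make a bidirectional RPC (client type and method are hypothetical
    //    stand-ins for our generated audio-streaming service).
    let client = Audio_AudioServiceClient(channel: connection)
    let call = client.streamAudio { response in
      print("response: \(response)")
    }

    // 3. Start streaming data to the server.
    let chunk = Audio_AudioChunk.with { $0.data = Data(repeating: 0, count: 320) }
    call.sendMessage(chunk, promise: nil)

    // 4. Pause: keep the call open but stop sending for longer than the
    //    server's 120s idle timeout.
    // 5. The server resets the connection and Network.framework reports a
    //    PEER_RESET, which is what eventually reaches GRPCIdleHandler.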

Expected behaviour

Either we don't receive errors from the Network framework while idle, or we don't crash when we do.

Additional information

  • This is using a BidirectionalStreamingCall
  • It seems related to the user having paused recording, so we're not actively streaming audio for varying periods of time
  • We have also frequently received an unavailable (14) error on the RPC prior to the network error
  • The only other way I've been able to get a similar stack trace is when the server has killed the connection on its end (after a 120s timeout)
  • I have not been able to reproduce this locally, but it's happening in production consistently
  • We haven't reproduced it yet with a non-optimized build
  • The one crash that we had extra logging enabled for indicated that the app was in the background when it switched states to idle and then received the network error.

I've tried debugging this a number of times, but it's very intermittent: I can't seem to recreate the timing where we transition to idle before receiving a network error, nor can I figure out how to avoid that timing.
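
For anyone chasing the same timing, the extra logging mentioned above can be attached through grpc-swift's delegate hooks, roughly like this (the delegate protocols and builder methods are named as I understand the 1.0.0-alpha API, so treat the exact spellings as assumptions):

    import GRPC
    import Logging
    import NIO

    // Logs every connectivity transition and every error the client catches,
    // so the "idle, then error" ordering shows up in the logs.
    final class LoggingDelegate: ConnectivityStateDelegate, ClientErrorDelegate {
      func connectivityStateDidChange(from oldState: ConnectivityState, to newState: ConnectivityState) {
        print("connectivity: \(oldState) -> \(newState)")
      }

      func didCatchError(_ error: Error, logger: Logger, file: StaticString, line: Int) {
        print("caught error: \(error) at \(file):\(line)")
      }
    }

    let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)
    let delegate = LoggingDelegate()

    let connection = ClientConnection
      .secure(group: group)
      .withConnectivityStateDelegate(delegate)
      .withErrorDelegate(delegate)
      .connect(host: "example.com", port: 443)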

rlasante added the kind/bug label Feb 10, 2021
@Lukasa
Collaborator

Lukasa commented Feb 10, 2021

Yeah, this crash seems like it shouldn't be possible, so that means there's some misunderstanding in the logic around the management of the GRPCIdleHandler and the ConnectionManager.

@Lukasa
Collaborator

Lukasa commented Feb 10, 2021

Yeah, that precondition just seems excessively aggressive. There is no particular reason that an idle channel shouldn't be able to error. An error there is certainly surprising, but the precondition is insufficiently defensive. In this case I'm prepared to believe that the channelInactive state transition fired first, but that the error from the receive was propagated out anyway. I suspect this can happen in NIOTS, but the only way I see it happening is if we got both data and an error in the same read, which is not something I've ever seen happen.

Either way, grpc is probably being a bit aggressive here.

Lukasa added a commit to Lukasa/grpc-swift that referenced this issue Feb 10, 2021
Motivation:

When a connection is in the idle state it is unlikely, but possible, to
see channel errors. This is most likely to happen when a channel has
just gone idle, and will shortly become entirely unavailable, but some
other I/O operation sees an error in the meantime. However, it could
arguably happen before any activity has been seen at all.

Right now we handle this by crashing. That's not great, mostly because
it's entirely possible to see weird reordering of events or delayed
error delivery from other handlers. It would be better to treat these
similarly to the way we treat errors in transientFailure or shutdown: as
fundamentally not adding drastically new information.

After all, consider the two options: either this error is channel fatal,
or it is not. In either case, being idle is immaterial. If the error is
fatal to the channel, we'll see the channel go inactive shortly and be
happy. Alternatively, the channel may be just fine, in which case we can
safely attempt to re-use it later. If the channel is in a weird broken
state we'll likely hit an immediate error on reconnection and move on
with our lives.

The one minor difference I've proposed here is that it's probably worth
us notifying about errors of this kind. While they aren't likely to
cause a bug in grpc, they could be a source of all kinds of weird
issues, and at the very least they probably represent a bug somewhere
else in the code. We'd like to know about them if we can.

Modifications:

- Remove the hard-crash on error in idle, replace it with a log.

Result:

Fewer crashes.

Fixes grpc#1132.
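
Concretely, the modification described here amounts to something like the following change to the simplified sketch from the issue description: the idle case records the error instead of trapping (again illustrative only, not the actual patch):

    import Logging

    enum ConnectionState {
      case idle, connecting, ready, transientFailure, shutdown
    }

    final class SimplifiedConnectionManager {
      private var state: ConnectionState = .idle
      private let logger = Logger(label: "grpc.connection-manager")

      func channelError(_ error: Error) {
        switch state {
        case .idle:
          // Previously a preconditionFailure. Either the channel goes inactive
          // shortly, or it stays usable and can be re-used later; in both
          // cases a log is enough, and it still surfaces the oddity.
          logger.warning("ignoring error while idle: \(error)")
        case .connecting, .ready:
          state = .transientFailure
        case .transientFailure, .shutdown:
          break  // already failing or shut down; the error adds nothing new
        }
      }
    }
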
Lukasa added a commit to Lukasa/grpc-swift that referenced this issue Feb 11, 2021
Lukasa added a commit that referenced this issue Feb 11, 2021