crash in alpha 17 #949
Thanks for the bug report @AVVS -- I think I've worked out the set of conditions that would lead to this. Knowing a few extra bits of information would add some certainty that I have found the problem you've hit though.
Motivation:

In certain conditions it's possible for a connection to idle when in the 'active' state, resulting in a precondition failure. The active state is not user facing and represents the state where we have an active connection to a remote peer but have not yet seen the initial settings frame. It's not usually possible for users to get a connection in this state since we normally only vend connections in the 'ready' state (where we have seen the initial settings frame). However, in the 'fastFailure' mode this constraint is lifted. If keepalive is configured, the connection is in the active state, and the client starts an RPC, then it is possible for the keepalive timeout to fire, idling the connection before it reaches the ready state and resulting in a precondition failure.

Modifications:

- Allow the connection manager to idle from the active state

Result:

- Idling from the active state is tolerated
- Resolves #949
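For illustration only, here is a minimal sketch of the state change the commit message describes. The state names follow the description above; this is not the actual grpc-swift ConnectionManager, just the shape of allowing an idle transition from the active state as well as the ready state:

```swift
// Sketch only: state names follow the commit message above, not the real
// grpc-swift internals.
enum ConnectionState {
    case idle
    case connecting
    case active   // connected to the remote peer, initial SETTINGS frame not yet seen
    case ready    // initial SETTINGS frame seen; connections are normally vended here
    case shutdown
}

struct ConnectionStateMachine {
    private(set) var state: ConnectionState = .ready

    mutating func idle() {
        switch self.state {
        case .active, .ready:
            // Previously only .ready was accepted here, so a keepalive timeout
            // firing while the connection was still .active hit a precondition failure.
            self.state = .idle
        case .idle, .connecting, .shutdown:
            preconditionFailure("Invalid transition to idle from \(self.state)")
        }
    }
}
```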
We are not using fast failure, but we do use bidi streams. I noticed that the connection may close itself if there is an active bidi stream but no data has been going back and forth for a while (except for keepalive). In that case we were making sure a reconnect happens, and it's generally been good. I sadly don't have the precondition message; Crashlytics didn't capture it for some reason.
That's unfortunate because it makes it highly unlikely I've fixed your bug! Can you provide some more context on this, like the configuration you're using for your client?
Hmm, how are you observing that? Are you sure this isn't the connection being dropped rather than gRPC actively deciding to close it? We should never close the connection if there are active streams.
Reading delegate logs:

```swift
/// holds grpc NIO chat event loop
private let chatGroup = MultiThreadedEventLoopGroup(numberOfThreads: 1)
/// holds keepalive configuration for chat
private let keepalive = ClientConnectionKeepalive(
interval: .seconds(10),
timeout: .seconds(5),
permitWithoutCalls: true,
maximumPingsWithoutData: 100,
minimumSentPingIntervalWithoutData: .minutes(1)
)
/// holds grpc chat connection manager
private lazy var chatChannel = ClientConnection
.secure(group: chatGroup)
.withKeepalive(keepalive)
.withConnectionIdleTimeout(.hours(24)) // was default earlier, but was going .ready -> .idle often
.withErrorDelegate(self)
.withConnectivityStateDelegate(self)
.withCallStartBehavior(.waitsForConnectivity)
.withBackgroundActivityLogger(Logger(label: "slr.chat"))
.connect(host: Constants.GRPC.chatHost, port: 443)
private lazy var gRPCChatClient = Pbx_NodeClient(channel: chatChannel)
...
func didCatchError(_ error: Error, logger: Logger, file: StaticString, line: Int) {
if let err = error as? IOError {
SLRLog(level: .error, ["IO chat connection err", err.errnoCode, err.description])
} else {
SLRLog(level: .error, ["unexpected chat connection", error.localizedDescription, file, line])
}
}
func connectivityStateDidChange(from oldState: ConnectivityState, to newState: ConnectivityState) {
SLRLog(level: .info, ["chat connection state change from", oldState, "to", newState])
// connection closed when it went to .idle
switch (oldState, newState) {
case (.ready, .idle), (.ready, .transientFailure):
chatState.accept(.error(NIOHTTP2Errors.IOOnClosedConnection())) // triggers reconnect -- not sure why this keeps happening
default: break
}
}
```

It's possible .ready -> .idle would trigger due to ...

The only call that happens with this grpc connection is establishing a bidi loop right away, which is supposed to be active/reconnect:

```swift
loop = gRPCChatClient.messageLoop { [weak self] message in
self?.handleLoop(message)
}
loop?.status.whenComplete { result in
if case .success(let value) = result {
SLRLog(level: .debug, "[chat][loop] \(value.description) - \(value.code)")
} else if case .failure(let err) = result {
SLRLog(level: .debug, "[chat][loop] \(err.localizedDescription)")
}
}
```

Please let me know if there are more details I can provide, or anything in the code worth investigating/changing to reliably trigger the error. It's not happening often, but I noticed that about 50% of the time it happens when the device is still in background mode or is switching from background to foreground. I assume the connection is getting cut since the app stops operating in the background after a few seconds; at the same time, the crash doesn't happen every time. I haven't hit it once on my development devices, but crash logs show it happens for others.
Thanks, there are some useful things to note here; I'll leave some comments inline.
You should prefer using
I suspect the pings aren't actually being received by the remote because the app is in the background. It shouldn't be necessary to set the
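For reference, a pared-down configuration along these lines might look like the sketch below. It reuses the ClientConnectionKeepalive initializer from the snippet above and assumes its remaining parameters have sensible defaults; the interval and timeout values are placeholders, not recommendations.

```swift
// Sketch: placeholder values; the ping-count limits are left at their
// defaults rather than being tuned explicitly.
private let keepalive = ClientConnectionKeepalive(
    interval: .seconds(30),
    timeout: .seconds(10),
    permitWithoutCalls: true
)
```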
The connection is supposed to transition from ready to idle when there are no active RPCs.
Just to check: the reconnect you mention here, is this just creating a new RPC? There should be no need to create a new connection here, gRPC will do that for you.
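To illustrate that point, here is a sketch of restarting just the RPC, reusing the names from the snippets in this thread (gRPCChatClient, messageLoop, handleLoop). A real version would add the same backoff you already have rather than retrying immediately:

```swift
// Sketch only: re-open the bidi RPC on the existing channel when its status
// completes, instead of tearing down and rebuilding the ClientConnection.
private func startMessageLoop() {
    loop = gRPCChatClient.messageLoop { [weak self] message in
        self?.handleLoop(message)
    }

    loop?.status.whenComplete { [weak self] result in
        switch result {
        case .success(let status):
            SLRLog(level: .debug, "[chat][loop] ended with status: \(status)")
        case .failure(let error):
            SLRLog(level: .debug, "[chat][loop] failed: \(error)")
        }
        // The stream ended; start a new RPC on the same connection.
        // (Add backoff here in a real implementation.)
        self?.startMessageLoop()
    }
}
```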
Can you provide some logs from right before a crash? I think that'd be helpful to see.
Motivation:

Related to #949. It's possible for the client's connection to be dropped, and for the connection state to become idle, prior to the scheduled close from the keepalive handler firing. As the idle handler never updates its state to 'closed' in 'channelInactive', when the scheduled close event is picked up in the idle handler it effectively asks the connection manager to double-idle, hitting a precondition failure.

Modifications:

- Move the 'GRPCIdleHandler' to '.closed' in 'channelInactive'
- Cancel scheduled pings and timeouts on 'channelInactive' in the 'GRPCKeepaliveHandlers'

Result:

- We avoid another path to a precondition failure
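As an illustration of the second part of that change, here is a minimal SwiftNIO-style handler showing the general pattern of cancelling scheduled work and recording a closed state in channelInactive. It is not the actual GRPCIdleHandler or GRPCKeepaliveHandlers code, just the idiom being described:

```swift
import NIO

// Sketch only: shows the pattern the commit describes, not grpc-swift's code.
final class ScheduledCloseHandler: ChannelInboundHandler {
    typealias InboundIn = ByteBuffer

    private enum State { case active, closed }
    private var state: State = .active
    private var scheduledClose: Scheduled<Void>?

    func handlerAdded(context: ChannelHandlerContext) {
        // Stand-in for the keepalive timeout: close the channel later unless
        // something cancels this first.
        self.scheduledClose = context.eventLoop.scheduleTask(in: .seconds(5)) {
            guard case .active = self.state else { return }
            context.close(promise: nil)
        }
    }

    func channelInactive(context: ChannelHandlerContext) {
        // The connection was dropped: record the closed state and cancel the
        // scheduled close so it can never fire against an already-closed
        // connection (the "double-idle" described above).
        self.state = .closed
        self.scheduledClose?.cancel()
        self.scheduledClose = nil
        context.fireChannelInactive()
    }
}
```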
It initiates a new bidi stream:

```swift
// onError handler that's called when accept(.error()) is called
private func handleError(_ err: Error) {
SLRLog(level: .error, ["chat connection failed", err])
guard chatConnectionState != .reconnecting,
chatConnectionState != .closing else {
return
}
chatConnectionState = .reconnecting
close().whenComplete { [weak self] _ in
guard let self = self, let backoff = self.backoffSequence.next()?.backoff else {
return
}
self.connectionGroup.next().scheduleTask(in: .seconds(Int64(backoff))) { [unowned self] in
self.chatState.accept(.reconnect(err))
}
}
}
```

This is the closing function, just so you can see that it's internal cleanup:
And this is the reconnect that eventually happens (error -> closing -> connect):

```swift
private func establishLoop() -> EventLoopFuture<Void> {
loop = gRPCChatClient.messageLoop { [weak self] message in
self?.handleLoop(message)
}
loop?.status.whenComplete { result in
if case .success(let value) = result {
SLRLog(level: .debug, "[chat][loop] \(value.description) - \(value.code)")
} else if case .failure(let err) = result {
SLRLog(level: .debug, "[chat][loop] \(err.localizedDescription)")
}
}
return opsGroup.next().makeSucceededFuture(())
}
```
That's very likely, which is totally fine; having the connection re-establish is also absolutely fine. Just trying to get that crash out of the way :)

I just figured out that the logs in Crashlytics are ordered from old to new... Well, that should make things easier because there are useful things in there. Screenshots with the logs are attached; it seems to be following the same pattern.

One thing that could be at play here is that the remote server connection may be getting completely dropped (we were doing load testing, which would lead to servers dying). At times that seems to be the cause of the connection getting dropped, but at times it seems to be just the transition from background to foreground. Please note the timestamps, as they might give some clue as to when this happened (during normal work or a bg/fg transition).
Thanks for all the info! I'm fairly confident that I've now fixed (see #953) the bug you've been hitting. Specifically, it would require the connection to be dropped followed by the scheduled close firing (as a result of the ping not being ack'd). This would hit the precondition failure you've been seeing. It also ties in with your logs since you're seeing the transient failure from the ready state and the error you see just before is "Socket is not connected (errno: 57)" (suggesting the connection was dropped).
Thanks, we made a build and are currently testing it :) Will let you know if it reoccurs, hopefully not though!
Guys, I got this issue again on version 1.11.0:
Can you please attach the full crash report?
This crash seems different: a segmentation fault in |
@Lukasa Thank you, we will research it.
Describe the bug
Crash triggered by transition to idle state from keepalive handler
iOS 13.6.1 alpha 17
Not sure how to reproduce - had these reports in Crashlytics; would be happy to debug further if you can guide me on what I can do to provide more information.
Thanks!