Timed out connection not closed? #24

nefilim · 2014-04-15T18:29:40Z

Running into a strange issue, I'll try to summarize as briefly as possible.

I have two servers, M(onitor) & S(erver). M is receiving data from S, once per second, over a spray-websocket connection.

Browser client connects to M over another web socket connection. As S pushes data to M, it's pushed out to the browser client. If the browser client disappears (I simulate it by quitting my VPN client that browser connects to M over), the data from S fails to be written at M over the web socket connection it had with the browser client.

Brower <---- (websocket/vpn) ---- Monitor <---- (websocket) ---- Server

Spray logs warnings endlessly, it doesn't appear that the connection gets cleaned up:

2014-04-15 18:17:41,701 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: {"node":"10.0.20.202","up":true,"metrics":{"buildI ...
2014-04-15 18:17:41,704 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] event pipeline: dropped CommandFailed(Write(ByteString(-127, 126, 0, -69, 123, 34, 110, 111, 100, 101, 34, 58, 34, 49, 48, 46, 48, 46, 50, 48, 46, 50, 48, 50, 34, 44, 34, 117, 112, 34, 58, 116, 114, 117, 101, 44, 34, 109, 101, 116, 114, 105, 99, 115, 34, 58, 123, 34, 98, 117, 105, 108, 100, 73, 110, 102, 111, 34, 58, 123, 34, 99, 111, 109, 112, 111, 110, 101, 110, 116, 78, 97, 109, 101, 34, 58, 34, 119, 111, 114, 107, 101, 114, 34, 44, 34, 99, 111, 109, 112, 111, 110, 101, 110, 116, 86, 101, 114, 115, 105, 111, 110, 34, 58, 34, 48, 46, 49, 46, 50, 45, 83, 78, 65, 80, 83, 72, 79, 84, 34, 44, 34, 98, 117, 105, 108, 100, 84, 105, 109, 101, 34, 58, 34, 84, 104, 117, 32, 65, 112, 114, 32, 49, 48, 32, 49, 57, 58, 50, 49, 58, 52, 56, 32, 80, 68, 84, 32, 50, 48, 49, 52, 34, 44, 34, 117, 112, 116, 105, 109, 101, 77, 105, 108, 108, 105, 115, 34, 58, 50, 54, 55, 57, 55, 57, 48, 53, 57, 125, 125, 125),NoAck(null)))

I don't get any (unhandled) messages at my WebSocketServerConnection (worker) or my web socket server actor (that created the worker), no indication that the connection is unavailable.

Have you witnessed this issue?

dcaoyuan · 2014-04-16T06:10:37Z

Closed events are logged on debug level usually. Did you tried to tune log level to 'DEBUG'?

Or, you can try to override closeLogic method of WebSocketServerConnection to:

  def closeLogic: Receive = {
    case ev: Http.ConnectionClosed =>
      context.stop(self)
      log.info("Connection closed on event: {}", ev)
  }

nefilim · 2014-04-16T13:42:09Z

Yea I am running at debug level, it never receives Http.ConnectionClosed (or ErrorClosed or PeerClosed that I'm also listening for).
I believe the (worker) actor is never stopped, that's why the HttpServerConnection(really ConnectionHandler) never stops warning.

nefilim · 2014-04-16T15:08:19Z

I looked into this further, the OS (amazon linux in this case) doesn't recognize the connection as timed out on the server side, it remains in the ESTABLISHED state. I tuned the kernel tcp keep alive parameters much more aggressively (60s, 10s, 4 probes) but didn't have much luck.
It took just over 16 minutes for the connection to be closed (this was reflected in Spray with a ErrorClosed("Connection timed out") message).
I wonder what can be done in Spray/Akka, for instance, close the connection if X writes failed in Y time.

dcaoyuan · 2014-04-16T16:14:13Z

We may need to implement some application-level (here, websocket layer) timeout, like spray-can for Http here:

http://spray.io/documentation/1.2.1/spray-can/http-server/

Request Timeouts
If the handler does not respond to a request within the configured request-timeout period a 
spray.http.Timedout message is sent to the timeout handler, which can be the “regular” handler itself 
or another actor (depending on the timeout-handler config setting). The timeout handler then has 
the chance to complete the request within the time period configured as timeout-timeout. Only if the 
timeout handler also misses its deadline for completing the request will the connection actor 
complete the request itself with a “hard-coded” error response.

In order to change the respective config setting for that connection only the application can send 
the following messages to the sender of a request (part) or the connection actor:

spray.io.ConnectionTimeouts.SetIdleTimeout
spray.http.SetRequestTimeout
spray.http.SetTimeoutTimeout

nefilim · 2014-04-16T16:33:26Z

Are those timeouts not in effect when using UHttp extension? Or when the connections have been upgraded only Websocket only? I'm wondering because I was unable to use both extensions at the same time, so using the UHttp extension for serving HTTP content also:

IO(UHttp) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

this didn't work:

IO(Http) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

I also noticed this enhancement in Spray/Akka HTTP, wonder if it will be helpful here at all:
spray/spray#615

dcaoyuan · 2014-04-16T16:42:22Z

Those timeouts work under http. When upgraded to WebSocket, the original http connection (tcp) will shift to new io pipelines which drops all http event/command processing.

So, for

IO(UHttp) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

The http bind on 8080 will be effected by those timeouts settings, on WebSocket binding, no.

dcaoyuan · 2014-04-17T07:45:52Z

Correction:

I looked code of spray.io and spray.can again, here's the correction of above errata:

The pipeline stages of Http are:

ServerFrontend(settings) >>
      RequestChunkAggregation(requestChunkAggregationLimit) >>
      PipeliningLimiter(pipeliningLimit) >>
      StatsSupport(statsHolder.get)  >>
      RemoteAddressHeaderSupport  >>
      SSLSessionInfoSupport  >>
      RequestParsing(settings) >>
      ResponseRendering(settings) >>

      ConnectionTimeouts(idleTimeout) >>
      PreventHalfClosedConnections(sslEncryption) >>
      SslTlsSupport  >>
      TickGenerator(reapingCycle)  >>
      BackPressureHandling

Vs the WebSocket:

WebSocketFrontend(settings, serverHandler) >>
      FrameComposing(wsFrameSizeLimit, wsContext) >>
      FrameParsing(wsFrameSizeLimit) >>
      FrameRendering(maskGen, wsContext) >>

      ConnectionTimeouts(idleTimeout) >>
      PreventHalfClosedConnections(sslEncryption) >>
      SslTlsSupport  >>
      TickGenerator(reapingCycle)  >>
      BackPressureHandling

As the WebSocket pipelines also staged ConnectionTimeouts and PreventHalfClosedConnections, the only difference to Http is requestTimeout, which only processed by Http's stages. I.e. WebSocket will also process idleTimeout.

we did a simple test for:
Browser ----> (websocket) ----> Server

When the client machine running Browser break network to Server, by plugging out the network cable. The server will auto-disconnected after 60 secs, just as the settings of idleTimeout.

But for the case @nefilim mentioned:
Brower <---- (websocket/vpn) ---- Monitor <---- (websocket) ---- Server

It's the server pushing msg to browser client. When network broken, on the server side, the server is still keeping to write data to the Tcp socket, so it's not idle and the idleTimeout may not work, and Akka IO just firing CommandFailed(Write) event. I'm not sure yet, why Akka IO does not detected the broken of network and fires some XXXClosed events.

We'll dig it. Or, you can choose to capture the CommandFailed events and disconnect it positively as a temporary solution. To disconnect the socket manually, you can send Tcp.Close message to serverConnection from your worker.

nefilim · 2014-04-17T14:34:17Z

Thanks for looking. I had the same thought re timeouts, the periodic server push is keeping it alive, hence another heuristic like X failed writes in Y time. I'm wondering if this is how they plan on implementing spray/spray#615

Unfortunately nothing is received at my actors, I think Spray is swallowing that event:

  def baseEventPipeline(tcpConnection: ActorRef): Pipeline[Event] = {
    case x: Tcp.ConnectionClosed ⇒
      log.debug("Connection was {}, awaiting TcpConnection termination...", x)
      context.become {
        case Terminated(`tcpConnection`) ⇒
          log.debug("TcpConnection terminated, stopping")
          context.stop(self)
      }

    case _: Droppable ⇒ // don't warn
    case ev           ⇒ log.warning("event pipeline: dropped {}", ev) // <==== SWALLOWED?
  }

unless I'm doing something wrong :)

I have a few small issues/possible improvements around closing connections, I'll open a pull request.

dcaoyuan · 2014-04-17T15:15:48Z

Did you try on the newest committed code? at least after ee08024

nefilim · 2014-04-17T16:22:32Z

Ah, very nice thanks! I'm handling FrameCommandFailed appropriately now on my end.

One curiosity I've observed, a connection created with Chrome takes a lot longer to time out (~2.5 minutes vs ~10 seconds) server side than a connection created by Safari, 100% repeatable in my scenario. Also, the Chrome created connection write fails with:

2014-04-17 16:18:50,572 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: ��q�������-M,��� …

while the Safari created connection fails with:

2014-04-17 16:12:54,287 WARN [ReportingActorSystem-akka.actor.default-dispatcher-8] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: {"node":"10.0.20.173","up":true,"metrics":{"buildI …

both are text frames with exactly the same format JSON, I'm not sure what is going on with the Chrome created connection, binary encoded perhaps?

fommil · 2015-07-02T14:58:03Z

This one is stinging me very reproducibly for Frames of about 5MB, thank you for doing the leg work on it! I've not fixed it yet though, so any further advice well appreciated.

Am I right in reading that the only way to increase the timeout settings is to do this?

      connection ! spray.io.ConnectionTimeouts.SetIdleTimeout(1 minute)
      connection ! spray.http.SetRequestTimeout(1 minute)
      connection ! spray.http.SetTimeoutTimeout(1 minute)

is there really no other way, like setting these config values

spray.can.server {
    idle-timeout = 60 s
    request-timeout = 20 s
    timeout-timeout = 2 s
}

If I am getting timeouts, how can I be sure? I'm not seeing anything except the CommandFailed for Tcp.Write and the event pipeline: dropped messages. Enabling DEBUG gives me insanely large unhandled message errors inside the UHTTP-IO hierarchy for the ByteString.

fommil · 2015-07-02T15:05:12Z

what am I talking about... I'm on the client side, so I should be setting a client side timeout!

fommil · 2015-07-02T15:18:42Z

ok, bumping the timeouts doesn't make a difference.

I start getting FrameCommandFailed and then shortly after that, the actor dispatcher seems to become completely saturated with work somewhere else (the actor is not stopped or dead, but it doesn't receive any more messages).

Any ideas what could be going on?

fommil · 2015-07-02T15:45:10Z

I'm hijacking your ticket, I'll create a new one with an example of what I'm seeing.

fommil mentioned this issue Jul 2, 2015

crazy sized log messages in ConnectionHandler spray/spray#1060

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timed out connection not closed? #24

Timed out connection not closed? #24

nefilim commented Apr 15, 2014

dcaoyuan commented Apr 16, 2014

nefilim commented Apr 16, 2014

nefilim commented Apr 16, 2014

dcaoyuan commented Apr 16, 2014

nefilim commented Apr 16, 2014

dcaoyuan commented Apr 16, 2014

dcaoyuan commented Apr 17, 2014

nefilim commented Apr 17, 2014

dcaoyuan commented Apr 17, 2014

nefilim commented Apr 17, 2014

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015

Timed out connection not closed? #24

Timed out connection not closed? #24

Comments

nefilim commented Apr 15, 2014

dcaoyuan commented Apr 16, 2014

nefilim commented Apr 16, 2014

nefilim commented Apr 16, 2014

dcaoyuan commented Apr 16, 2014

nefilim commented Apr 16, 2014

dcaoyuan commented Apr 16, 2014

dcaoyuan commented Apr 17, 2014

nefilim commented Apr 17, 2014

dcaoyuan commented Apr 17, 2014

nefilim commented Apr 17, 2014

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015

fommil commented Jul 2, 2015