Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timed out connection not closed? #24

Open
nefilim opened this issue Apr 15, 2014 · 14 comments
Open

Timed out connection not closed? #24

nefilim opened this issue Apr 15, 2014 · 14 comments

Comments

@nefilim
Copy link
Contributor

nefilim commented Apr 15, 2014

Running into a strange issue, I'll try to summarize as briefly as possible.

I have two servers, M(onitor) & S(erver). M is receiving data from S, once per second, over a spray-websocket connection.

Browser client connects to M over another web socket connection. As S pushes data to M, it's pushed out to the browser client. If the browser client disappears (I simulate it by quitting my VPN client that browser connects to M over), the data from S fails to be written at M over the web socket connection it had with the browser client.

Brower <---- (websocket/vpn) ---- Monitor <---- (websocket) ---- Server

Spray logs warnings endlessly, it doesn't appear that the connection gets cleaned up:

2014-04-15 18:17:41,701 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: {"node":"10.0.20.202","up":true,"metrics":{"buildI ...
2014-04-15 18:17:41,704 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] event pipeline: dropped CommandFailed(Write(ByteString(-127, 126, 0, -69, 123, 34, 110, 111, 100, 101, 34, 58, 34, 49, 48, 46, 48, 46, 50, 48, 46, 50, 48, 50, 34, 44, 34, 117, 112, 34, 58, 116, 114, 117, 101, 44, 34, 109, 101, 116, 114, 105, 99, 115, 34, 58, 123, 34, 98, 117, 105, 108, 100, 73, 110, 102, 111, 34, 58, 123, 34, 99, 111, 109, 112, 111, 110, 101, 110, 116, 78, 97, 109, 101, 34, 58, 34, 119, 111, 114, 107, 101, 114, 34, 44, 34, 99, 111, 109, 112, 111, 110, 101, 110, 116, 86, 101, 114, 115, 105, 111, 110, 34, 58, 34, 48, 46, 49, 46, 50, 45, 83, 78, 65, 80, 83, 72, 79, 84, 34, 44, 34, 98, 117, 105, 108, 100, 84, 105, 109, 101, 34, 58, 34, 84, 104, 117, 32, 65, 112, 114, 32, 49, 48, 32, 49, 57, 58, 50, 49, 58, 52, 56, 32, 80, 68, 84, 32, 50, 48, 49, 52, 34, 44, 34, 117, 112, 116, 105, 109, 101, 77, 105, 108, 108, 105, 115, 34, 58, 50, 54, 55, 57, 55, 57, 48, 53, 57, 125, 125, 125),NoAck(null)))

I don't get any (unhandled) messages at my WebSocketServerConnection (worker) or my web socket server actor (that created the worker), no indication that the connection is unavailable.

Have you witnessed this issue?

@dcaoyuan
Copy link
Owner

Closed events are logged on debug level usually. Did you tried to tune log level to 'DEBUG'?

Or, you can try to override closeLogic method of WebSocketServerConnection to:

  def closeLogic: Receive = {
    case ev: Http.ConnectionClosed =>
      context.stop(self)
      log.info("Connection closed on event: {}", ev)
  }

@nefilim
Copy link
Contributor Author

nefilim commented Apr 16, 2014

Yea I am running at debug level, it never receives Http.ConnectionClosed (or ErrorClosed or PeerClosed that I'm also listening for).
I believe the (worker) actor is never stopped, that's why the HttpServerConnection(really ConnectionHandler) never stops warning.

@nefilim
Copy link
Contributor Author

nefilim commented Apr 16, 2014

I looked into this further, the OS (amazon linux in this case) doesn't recognize the connection as timed out on the server side, it remains in the ESTABLISHED state. I tuned the kernel tcp keep alive parameters much more aggressively (60s, 10s, 4 probes) but didn't have much luck.
It took just over 16 minutes for the connection to be closed (this was reflected in Spray with a ErrorClosed("Connection timed out") message).
I wonder what can be done in Spray/Akka, for instance, close the connection if X writes failed in Y time.

@dcaoyuan
Copy link
Owner

We may need to implement some application-level (here, websocket layer) timeout, like spray-can for Http here:

http://spray.io/documentation/1.2.1/spray-can/http-server/

Request Timeouts
If the handler does not respond to a request within the configured request-timeout period a 
spray.http.Timedout message is sent to the timeout handler, which can be the “regular” handler itself 
or another actor (depending on the timeout-handler config setting). The timeout handler then has 
the chance to complete the request within the time period configured as timeout-timeout. Only if the 
timeout handler also misses its deadline for completing the request will the connection actor 
complete the request itself with a “hard-coded” error response.

In order to change the respective config setting for that connection only the application can send 
the following messages to the sender of a request (part) or the connection actor:

spray.io.ConnectionTimeouts.SetIdleTimeout
spray.http.SetRequestTimeout
spray.http.SetTimeoutTimeout

@nefilim
Copy link
Contributor Author

nefilim commented Apr 16, 2014

Are those timeouts not in effect when using UHttp extension? Or when the connections have been upgraded only Websocket only? I'm wondering because I was unable to use both extensions at the same time, so using the UHttp extension for serving HTTP content also:

IO(UHttp) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

this didn't work:

IO(Http) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

I also noticed this enhancement in Spray/Akka HTTP, wonder if it will be helpful here at all:
spray/spray#615

@dcaoyuan
Copy link
Owner

Those timeouts work under http. When upgraded to WebSocket, the original http connection (tcp) will shift to new io pipelines which drops all http event/command processing.

So, for

IO(UHttp) ! Http.Bind(service, interface = "0.0.0.0", port = 8080)
IO(UHttp) ! Http.Bind(webSocketServer, interface = "0.0.0.0", port = WebSocketServer.defaultPort)

The http bind on 8080 will be effected by those timeouts settings, on WebSocket binding, no.

@dcaoyuan
Copy link
Owner

Correction:

I looked code of spray.io and spray.can again, here's the correction of above errata:

The pipeline stages of Http are:

ServerFrontend(settings) >>
      RequestChunkAggregation(requestChunkAggregationLimit) >>
      PipeliningLimiter(pipeliningLimit) >>
      StatsSupport(statsHolder.get)  >>
      RemoteAddressHeaderSupport  >>
      SSLSessionInfoSupport  >>
      RequestParsing(settings) >>
      ResponseRendering(settings) >>

      ConnectionTimeouts(idleTimeout) >>
      PreventHalfClosedConnections(sslEncryption) >>
      SslTlsSupport  >>
      TickGenerator(reapingCycle)  >>
      BackPressureHandling 

Vs the WebSocket:

WebSocketFrontend(settings, serverHandler) >>
      FrameComposing(wsFrameSizeLimit, wsContext) >>
      FrameParsing(wsFrameSizeLimit) >>
      FrameRendering(maskGen, wsContext) >>

      ConnectionTimeouts(idleTimeout) >>
      PreventHalfClosedConnections(sslEncryption) >>
      SslTlsSupport  >>
      TickGenerator(reapingCycle)  >>
      BackPressureHandling 

As the WebSocket pipelines also staged ConnectionTimeouts and PreventHalfClosedConnections, the only difference to Http is requestTimeout, which only processed by Http's stages. I.e. WebSocket will also process idleTimeout.

we did a simple test for:
Browser ----> (websocket) ----> Server

When the client machine running Browser break network to Server, by plugging out the network cable. The server will auto-disconnected after 60 secs, just as the settings of idleTimeout.

But for the case @nefilim mentioned:
Brower <---- (websocket/vpn) ---- Monitor <---- (websocket) ---- Server

It's the server pushing msg to browser client. When network broken, on the server side, the server is still keeping to write data to the Tcp socket, so it's not idle and the idleTimeout may not work, and Akka IO just firing CommandFailed(Write) event. I'm not sure yet, why Akka IO does not detected the broken of network and fires some XXXClosed events.

We'll dig it. Or, you can choose to capture the CommandFailed events and disconnect it positively as a temporary solution. To disconnect the socket manually, you can send Tcp.Close message to serverConnection from your worker.

@nefilim
Copy link
Contributor Author

nefilim commented Apr 17, 2014

Thanks for looking. I had the same thought re timeouts, the periodic server push is keeping it alive, hence another heuristic like X failed writes in Y time. I'm wondering if this is how they plan on implementing spray/spray#615

Unfortunately nothing is received at my actors, I think Spray is swallowing that event:

  def baseEventPipeline(tcpConnection: ActorRef): Pipeline[Event] = {
    case x: Tcp.ConnectionClosed 
      log.debug("Connection was {}, awaiting TcpConnection termination...", x)
      context.become {
        case Terminated(`tcpConnection`) 
          log.debug("TcpConnection terminated, stopping")
          context.stop(self)
      }

    case _: Droppable  // don't warn
    case ev            log.warning("event pipeline: dropped {}", ev) // <==== SWALLOWED?
  }

unless I'm doing something wrong :)

I have a few small issues/possible improvements around closing connections, I'll open a pull request.

@dcaoyuan
Copy link
Owner

Did you try on the newest committed code? at least after ee08024

@nefilim
Copy link
Contributor Author

nefilim commented Apr 17, 2014

Ah, very nice thanks! I'm handling FrameCommandFailed appropriately now on my end.

One curiosity I've observed, a connection created with Chrome takes a lot longer to time out (~2.5 minutes vs ~10 seconds) server side than a connection created by Safari, 100% repeatable in my scenario. Also, the Chrome created connection write fails with:

2014-04-17 16:18:50,572 WARN [ReportingActorSystem-akka.actor.default-dispatcher-10] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: ��q�������-M,��� …

while the Safari created connection fails with:

2014-04-17 16:12:54,287 WARN [ReportingActorSystem-akka.actor.default-dispatcher-8] s.c.s.HttpServerConnection [Slf4jLogger.scala : 71] CommandFailed for Tcp.Write text frame: {"node":"10.0.20.173","up":true,"metrics":{"buildI …

both are text frames with exactly the same format JSON, I'm not sure what is going on with the Chrome created connection, binary encoded perhaps?

@fommil
Copy link

fommil commented Jul 2, 2015

This one is stinging me very reproducibly for Frames of about 5MB, thank you for doing the leg work on it! I've not fixed it yet though, so any further advice well appreciated.

Am I right in reading that the only way to increase the timeout settings is to do this?

      connection ! spray.io.ConnectionTimeouts.SetIdleTimeout(1 minute)
      connection ! spray.http.SetRequestTimeout(1 minute)
      connection ! spray.http.SetTimeoutTimeout(1 minute)

is there really no other way, like setting these config values

spray.can.server {
    idle-timeout = 60 s
    request-timeout = 20 s
    timeout-timeout = 2 s
}

If I am getting timeouts, how can I be sure? I'm not seeing anything except the CommandFailed for Tcp.Write and the event pipeline: dropped messages. Enabling DEBUG gives me insanely large unhandled message errors inside the UHTTP-IO hierarchy for the ByteString.

@fommil
Copy link

fommil commented Jul 2, 2015

what am I talking about... I'm on the client side, so I should be setting a client side timeout!

@fommil
Copy link

fommil commented Jul 2, 2015

ok, bumping the timeouts doesn't make a difference.

I start getting FrameCommandFailed and then shortly after that, the actor dispatcher seems to become completely saturated with work somewhere else (the actor is not stopped or dead, but it doesn't receive any more messages).

Any ideas what could be going on?

@fommil
Copy link

fommil commented Jul 2, 2015

I'm hijacking your ticket, I'll create a new one with an example of what I'm seeing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants