Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket timeout issues #2026

Closed
donovanhide opened this issue Feb 22, 2017 · 72 comments
Closed

Websocket timeout issues #2026

donovanhide opened this issue Feb 22, 2017 · 72 comments

Comments

@donovanhide
Copy link
Contributor

With 0.50.2 repeated calls to ledger_data whilst also streaming the ledger and transaction streams leads to timeouts in the Server code in rippled with log_level set to trace:

2017-Feb-22 20:10:35 Server:TRC Websocket received '
{
        "binary" : true,
        "command" : "ledger_data",
        "id" : 39,
        "ledger" : 27861637,
        "marker" : "1AC61580EF42AA58D69059995313F05C5342E6AC2BA6A46A51EFB508C0F79C75"
}
'
2017-Feb-22 20:10:35 Server:TRC Websocket received '
{
        "binary" : true,
        "command" : "ledger_data",
        "id" : 40,
        "ledger" : 27861637,
        "marker" : "1B7EA1321C09D8421502B0A63EAD615EA3CB92EEB6393909312EA76C002C7289"
}
'
2017-Feb-22 20:10:35 Server:TRC ##22 timer: Connection timed out
2017-Feb-22 20:10:35 Server:TRC ##23 read: End of file
2017-Feb-22 20:10:35 Server:TRC ##24 write: Broken pipe

Problem is not seen in 0.40.

@vinniefalco
Copy link
Contributor

How big do you expect the returned data-set to be? in megabytes?

@donovanhide
Copy link
Contributor Author

Changing this timeout to 30 from 3 seems to fix the issue:
https://github.com/ripple/rippled/blob/209fe8f7a92a74b173f2b23ae263c4885fc5b16d/src/ripple/server/impl/BaseWSPeer.h#L368
I'm connecting locally, so I guess I'm not processing both the streamed data and the ledger_data responses within 3 seconds in all cases.

@vinniefalco
Copy link
Contributor

Hmm...that look sketchy! I will investigate

@vinniefalco
Copy link
Contributor

We're going to make a change so that the 3 second timeout will only happen during unit tests. Websocket connections from the loopback address will be 30 seconds.

@donovanhide
Copy link
Contributor Author

Sounds good to me.

@donovanhide
Copy link
Contributor Author

Looks like the same issue crops up more sporadically, one or two times a day:

2017-Feb-23 18:58:50 Server:TRC #23277 destroyed: 1 request
2017-Feb-23 18:58:58 Server:TRC #23278 accept:    138.197.193.51
2017-Feb-23 18:58:58 Server:TRC #23278 destroyed: 1 request
2017-Feb-23 18:59:00 Server:TRC #23279 accept:    178.19.226.15
2017-Feb-23 18:59:01 Server:TRC #23279 destroyed: 1 request
2017-Feb-23 18:59:02 Server:TRC #23280 accept:    52.25.10.5
2017-Feb-23 18:59:03 Server:TRC #23280 destroyed: 1 request
2017-Feb-23 18:59:15 Server:TRC ##8 timer: Connection timed out
2017-Feb-23 18:59:15 Server:TRC ##9 read: End of file
2017-Feb-23 18:59:23 Server:TRC #23281 accept:    96.127.136.18
2017-Feb-23 18:59:23 Server:TRC #23281 destroyed: 1 request
2017-Feb-23 18:59:25 Server:TRC #23282 accept:    52.25.10.5
2017-Feb-23 18:59:25 Server:TRC #23282 destroyed: 1 request
2017-Feb-23 18:59:41 Server:TRC #23283 accept:    178.19.226.15
2017-Feb-23 18:59:41 Server:TRC #23283 destroyed: 1 request

##8 and ##9 refer to my connection.

@vinniefalco
Copy link
Contributor

Is it because you are requesting data that takes more than 30 seconds? Try changing the timeout to 5 minutes?

@donovanhide
Copy link
Contributor Author

After the first 10 or 15 seconds at startup, I've done all the ledger_data requests I'll ever do. After that it's just reading the transaction and ledger streams and the occasional submit. When the timeout gets logged, no API call was in progress.

@vinniefalco
Copy link
Contributor

So could this be that... it actually timed out due to inactivity?

@donovanhide
Copy link
Contributor Author

Well, my code got disconnected last time around ledger 27881937. Doesn't seem like there was any large gaps between ledgers around that time, so the stream data should have been been resetting the timeout I assume.
I use a separate connection for submitting trades however. Maybe that one is timing out? Where is the timeout set between API requests? Looks like the timeout might be roughly an hour after the last submit on that second connection. Not sure an inactivity timeout is great for a bot when markets go quiet :-)

@vinniefalco
Copy link
Contributor

If a client goes idle, rippled should issue a websocket ping every so often to keep the connection alive. We are looking into it. Probably a bug in the server.

@vinniefalco
Copy link
Contributor

Try giving this a spin? https://github.com/vinniefalco/rippled/tree/0.50.3

Thanks

@donovanhide
Copy link
Contributor Author

/usr/bin/g++ -o build/gcc.release/src/ripple/unity/app_main.o -c -Wall -frtti -std=c++14 -Wno-invalid-offsetof -pthread -Wno-sign-compare -Wno-char-subscripts -Wno-format -g -Wno-unused-but-set-variable -Wno-deprecated -O3 -fno-strict-aliasing -Wno-unuse
d-local-typedefs -DOPENSSL_NO_SSL2 -DDEPRECATED_IN_MAC_OS_X_VERSION_10_7_AND_LATER -DHAVE_USLEEP=1 -DSOCI_CXX_C11=1 -D_SILENCE_STDEXT_HASH_DEPRECATION_WARNINGS -DBOOST_NO_AUTO_PTR -DBOOST_COROUTINE_NO_DEPRECATION_WARNING -DBOOST_COROUTINES_NO_DEPRECATION
_WARNING -DNDEBUG -D_FILE_OFFSET_BITS=64 -Isrc -Isrc/beast -Isrc/beast/include -Isrc/beast/extras -Isrc/nudb/include -Ibuild/proto -Isrc/soci/src -Isrc/soci/include src/ripple/unity/app_main.cpp
In file included from src/ripple/server/impl/PlainWSPeer.h:23:0,
                 from src/ripple/server/impl/PlainHTTPPeer.h:24,
                 from src/ripple/server/impl/Door.h:26,
                 from src/ripple/server/impl/ServerImpl.h:25,
                 from src/ripple/server/Server.h:24,
                 from src/ripple/rpc/impl/ServerHandlerImp.h:25,
                 from src/ripple/rpc/ServerHandler.h:29,
                 from src/ripple/overlay/make_Overlay.h:23,
                 from src/ripple/app/main/Application.cpp:65,
                 from src/ripple/unity/app_main.cpp:24:
src/ripple/server/impl/BaseWSPeer.h: In member function ‘void ripple::BaseWSPeer<Handler, Impl>::start_timer()’:
src/ripple/server/impl/BaseWSPeer.h:436:13: error: ‘remote_address_’ was not declared in this scope
             remote_address_.address().is_loopback() ?
             ^
scons: *** [build/gcc.release/src/ripple/unity/app_main.o] Error 1
scons: building terminated because of errors.

@vinniefalco
Copy link
Contributor

Sorry about that, Visual Studio strikes again. I've fixed the error (hopefully) waiting on Travis now. The branch has been rewritten.

@donovanhide
Copy link
Contributor Author

Seems to close the connection even more aggressively now (both connections with in around 5 seconds) . Can you replicate my original issue at your end using 0.50.2?

@vinniefalco
Copy link
Contributor

I know this one works
https://github.com/vinniefalco/rippled/commits/develop

@vinniefalco
Copy link
Contributor

This one also works, tested it using a browser script. It stays connected even when idle
https://github.com/vinniefalco/rippled/tree/0.50.3

I am not sure what your issue is. It might be something other than an idle timeout.

@donovanhide
Copy link
Contributor Author

I tried your 0.50.3 tag and it disconnected more quickly as described. Not sure a browser implementation of a websockets client is going to cover all bases. In my bot, using Go, I'm also pinging every 60 seconds:

https://github.com/rubblelabs/ripple/blob/master/websockets/remote.go#L552-L558

This is a regression as 0.40 could keep running for weeks/months at a time :-) If you need a Go repro I can knock one together tomorrow, but you'd have to install Go to build it.

@vinniefalco
Copy link
Contributor

Well 0.40.0 used websocketpp while 0.50.0 uses Beast. Its not surprising that there's a bug. We'll figure it out!

@sublimator
Copy link
Contributor

sublimator commented Feb 25, 2017 via email

@vinniefalco
Copy link
Contributor

@donovanhide is your Go websocket client responding to pings (I see that it sends them but does it also reply)? rippled will not reset its timer just because it hears a ping. It only resets the timer if it gets a pong in response to a sent ping.

@donovanhide
Copy link
Contributor Author

Default behaviour is to reply to all pings with a pong:

https://godoc.org/github.com/gorilla/websocket#hdr-Control_Messages
https://github.com/gorilla/websocket/blob/master/conn.go#L1041-L1054

I'll try and get a barebones repro built.

@vinniefalco
Copy link
Contributor

See #2032

@donovanhide
Copy link
Contributor Author

Is this fix not going to be in 0.60?

@vinniefalco
Copy link
Contributor

There were two versions of the fix. A good one, and a great one. The great one includes improvements to Beast that allow concurrent ping and write operations. Its being sorted out so it can go into some version.

@nbougalis
Copy link
Contributor

The fix Vinnie describes will go into 0.70.0.

@donovanhide
Copy link
Contributor Author

donovanhide commented Mar 24, 2017

I'm still getting intermittent disconnections with 0.70.0. This sounds like other people experiencing the same:

https://www.xrpchat.com/topic/3341-websocket-closing-ripple-lib

@bachase
Copy link
Collaborator

bachase commented Mar 24, 2017

@donovanhide Were you testing with 0.70.0-b2? The fix should be 15f969a.

@donovanhide
Copy link
Contributor Author

donovanhide commented Mar 24, 2017

Yep, can't paste in the git log, but that's the exact commit I'm using. Gaps are longer between disconnections, but still occurs.

@miguelportilla
Copy link
Contributor

miguelportilla commented Mar 31, 2017

"closing slow client" means the Websocket write queue filled up. Writing to the socket outperformed sending. We currently impose a limit of 100 queue items before considering the client too slow and disconnecting.

@donovanhide
Copy link
Contributor Author

Seems a bit aggressive if the plan is to process 1000 tx/sec :-)

@vinniefalco
Copy link
Contributor

@donovanhide I agree, something ain't right there. The server implementation needs to be redesigned anyway. But in this case i think you are really too slow, it means that the other end is not reading the data fast enough. You should be able to read on your end fast enough to keep the queue empty.

@donovanhide
Copy link
Contributor Author

My code has been happily running for over two years, so not sure I want to optimise it because rippled changed its websocket library.

@miguelportilla
Copy link
Contributor

miguelportilla commented Mar 31, 2017

@donovanhide That is the limit of a single Websocket write queue. It's unusual for a queue to get beyond a few items in my experience. I can work with you to figure out why it happened.

@vinniefalco Another possibility is the server upstream bandwidth was momentarily saturated.

@vinniefalco
Copy link
Contributor

vinniefalco commented Mar 31, 2017

If rippled queues more than 100 items at a time, it would cause that problem.

@donovanhide Are you running your own server? You could increase the queue limit to 10,000 or remove it completely as a test.

@donovanhide
Copy link
Contributor Author

Sounds like it should be a configurable. My code is the only client of that server. I need to add each item of metaData that I receive into my version of the ledger state. I could buffer them all and then process them, but seems like quite a lot of work because the number 100 was hard-coded.

@vinniefalco
Copy link
Contributor

vinniefalco commented Mar 31, 2017

@donovanhide How do you prevent a memory exhaustion attack where someone subscribes to a million streams and then reads only one byte at a time per second on the connection?

@donovanhide
Copy link
Contributor Author

Here's a typical snapshot of my time budget for each ledger:

Receiving 91.583198ms Bellman: 148.022203ms Signing: 688.565µs Submitting: 16.415657ms

@donovanhide
Copy link
Contributor Author

In that situation, you leave the send queue limit to a low value. For private servers, it's not an issue.

@donovanhide
Copy link
Contributor Author

Or you put an nginx websocket proxy in front of it and configure the rate limiter :-)

@vinniefalco
Copy link
Contributor

Or you put an nginx websocket proxy in front of it and configure the rate limiter :-)

Oh I like that much better

@miguelportilla
Copy link
Contributor

@donovanhide Can you please try a large queue limit size to see if it eliminates the problem? It might be a good idea to also log the max size the queue grows to.

@donovanhide
Copy link
Contributor Author

Just to be clear, a rate-limited nginx websocket proxy is a solution to attack prevention, not my slow transaction stream message reading problem :-)
@miguelportilla Yep, I'll whack it up. If you want to reproduce, you could replay that ledger with a timed sleep between each message read. If the transaction per ledger volume goes up though, it's more likely that others will hit this with their javascript code.

@miguelportilla
Copy link
Contributor

@donovanhide Thank you. I will replay that ledger and reproduce on my end. What does your RPC subscription look like?

@donovanhide
Copy link
Contributor Author

donovanhide commented Mar 31, 2017

@miguelportilla

{
        "command" : "subscribe",
        "id" : 1,
        "streams" : [ "ledger", "transactions", "server" ]
}

@donovanhide
Copy link
Contributor Author

Actually just reviewed my code (it's been a while...) and seen that I am buffering all received messages in the transaction stream for a ledger before processing them, which suggests that the typical time of around 70ms to receive a typical ledger is not down to what my code is doing, as far as I can see.

Would be interesting to see how long it takes for you to receive a full ledger's worth of transactions. That is, examine TxCount in the ledger message, start timer, receive TxCount transaction messages, stop timer. Ledger 28,696,629 was another biggie :-)

@vinniefalco
Copy link
Contributor

Just to be clear, a rate-limited nginx websocket proxy is a solution to attack prevention, not my slow transaction stream message reading problem :-)

Yes, I get that. I would rather rippled focus on what it does best, implementing RCL, and offload security solutions to other products that specialize in it. I don't think rippled can compete with nginx when it comes to offering versatility for managing network resources (nor would I want it to).

@tuloski
Copy link

tuloski commented Apr 10, 2017

Now it seems stable. My bots lost around 100k XRP...but fine...maybe I should stop this "business" :)

@tuloski
Copy link

tuloski commented Apr 10, 2017

Mmmm...it is not true...it is not stable.
Got 5 disconnections in 6 hours :(

@miguelportilla
Copy link
Contributor

@tuloski Are you using you running rippled? If so, which version and what is it hosted on?

@tuloski
Copy link

tuloski commented Apr 10, 2017

No I'm using Ripple's public servers.

@miguelportilla
Copy link
Contributor

What does your RPC subscription look like?

@tuloski
Copy link

tuloski commented Apr 11, 2017

I'm using Ripple-lib...

@miguelportilla
Copy link
Contributor

miguelportilla commented Apr 11, 2017

@tuloski I would like to figure why you are getting disconnected. Would you please share with me the relevant sections of your JS script so that I may use them to debug on my end? I can be reached at miguel at ripple dot com

@tuloski
Copy link

tuloski commented Apr 11, 2017

Well it's Ripple-lib 0.12.6: https://github.com/ripple/ripple-lib/tree/0.12.6

@tuloski
Copy link

tuloski commented Apr 20, 2017

This is what I'm getting:
/home/odroid/node_modules/ripple-lib/node_modules/ws/lib/WebSocket.js:214
else throw new Error('not opened');
^

Error: not opened
at WebSocket.send (/home/odroid/node_modules/ripple-lib/node_modules/ws/lib/WebSocket.js:214:16)
at Server._sendMessage (/home/odroid/node_modules/ripple-lib/dist/npm/server.js:746:14)

The servers I'm connected are: wss://s1.ripple.com:443, wss://s-west.ripple.com:443

@miguelportilla
Copy link
Contributor

@tuloski What error message or code do you receive when you are disconnected?

@tuloski
Copy link

tuloski commented Apr 20, 2017

I don't know...it doesn't report the error.
And I don't know how to modify the code to show me the error...all that js drives me crazy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants