
cmd/cacheserver: too many timeouts #313

Closed
robpike opened this issue Mar 6, 2017 · 5 comments
robpike (Contributor) commented Mar 6, 2017

When creating a large data set with many largish files (a music library), it all worked but took about twice as long as it should have, according to the bandwidth I was seeing and the size of the data, and was accompanied by a great many errors like this:

rpc.Invoke: 400 Bad Request: read tcp 127.0.0.1:8443->127.0.0.1:44044: i/o timeout 2017/03/03 20:51:52.841992 store/storecache.writer: writeback failed: store/remote.Put("remote,upspin.XXX.com:443"): I/O error:

(Note too the dangling colon.)

I believe what's happening is that the client is doing parallel 1MB writes, and a significant fraction of them time out just before completion, in this case halving my bandwidth. I imagine the parameters of my push matter, but I could see this problem producing much worse slowdowns.

Understand and fix.

robpike (Contributor, Author) commented Mar 9, 2017

Not fully debugged yet, but here is what I have uncovered.

The actual error is produced in the Go runtime by epoll and surfaces first in the Upspin code in rpc/server, in the storage server running on Google Cloud. The error happens when the HTTP server, trying to read the payload for a store.Put method invocation, times out at upspin.io/rpc/server.go:173.

I was unable to find a timeout whose value affected the behavior, so I tried another approach and changed the value of writers in store/storecache. It is set to 20 originally, and at 20 I get timeouts every second or two. At 5 they appear a few times a minute, at 4 once every few minutes, and at 3 never.

Let's look at the 4 setting, as that almost never times out and keeps the line saturated as well as 20 (sic). My home line is steady at delivering 1.5MB/s upstream, and with a writers setting of 4, this rate is maintained. A setting of 4 means about 4MB are outstanding on the wire, and that is right on the cusp of timing out. At 1.5MB/s, 4MB takes 3 or 4 seconds. Thus we would expect to see a timeout somewhere in the system in the neighborhood of 3-5 seconds, but I cannot find one.

The network code in the Go runtime is inscrutable to me. (The amazing thing about epoll is that it's better than its predecessor.) Someone who understands that code, or maybe the HTTP code, might know where the relevant timeout is, and may be able to adjust it.

Meanwhile if I get a chance with a different throughput network I'll see what the sweet value of writers is on that, and maybe find a way to set it dynamically.

For now, I will hand-tune my value of writers.

This isn't over yet.

presotto (Collaborator) commented Mar 9, 2017

Without understanding the underlying extraordinarily complex library code, we can't assume that there is any fairness going on in the different streams. Therefore, the applicable timeout may be much more than 3-5 seconds, i.e., one stream could be getting starved.

robpike (Contributor, Author) commented Mar 10, 2017

The relevant timeout is 15s and is in cloud/https. See https://upspin-review.googlesource.com/c/8285/

There are several free parameters that affect the timeout rate, and one fixed parameter. The fixed one is the available bandwidth; the free ones are the number of parallel writers, the block size, and the timeout. It should be possible to adjust one or more of the free parameters based on the observed bandwidth, although I realize this is not going to be easy.

The current settings will make it all but impossible for store.Put to succeed on a slow link.

n2vi (Contributor) commented Mar 10, 2017 via email

adg pushed a commit that referenced this issue Mar 10, 2017
4 is still arbitrary but at least on my home line generates almost
no timeouts while still keeping the uplink saturated.

Update #313

Change-Id: Ib641313ac7b98151d5fb80b1b95a987005fedb4b
Reviewed-on: https://upspin-review.googlesource.com/8320
Reviewed-by: David Presotto <[email protected]>
@adg adg added the bug label Mar 12, 2017
robpike (Contributor, Author) commented Apr 27, 2017

This has been resolved, mostly.

@robpike robpike closed this as completed Apr 27, 2017
4 participants