
cmd/cacheserver: too many timeouts #313

Closed
robpike opened this issue Mar 6, 2017 · 5 comments
robpike (Contributor) commented Mar 6, 2017

When creating a large data set with many largish files (a music library), it all worked but took about twice as long as it should have, according to the bandwidth I was seeing and the size of the data, and was accompanied by a great many errors like this:

rpc.Invoke: 400 Bad Request: read tcp 127.0.0.1:8443->127.0.0.1:44044: i/o timeout 2017/03/03 20:51:52.841992 store/storecache.writer: writeback failed: store/remote.Put("remote,upspin.XXX.com:443"): I/O error:

(Note too the dangling colon.)

I believe what's happening is that the client is doing parallel 1MB writes, and a significant fraction of them time out just before completion, in this case halving my bandwidth. I imagine the parameters of my push matter, but I could see this problem producing much worse slowdowns.

Understand and fix.

robpike (Contributor, Author) commented Mar 9, 2017

Not fully debugged yet, but here is what I have uncovered.

The actual error is produced in the Go runtime by epoll and surfaces first in the Upspin code in rpc/server, in the storage server running on Google Cloud. The error happens when the HTTP server, trying to read the payload for a store.Put method invocation, times out at upspin.io/rpc/server.go:173.

I was unable to find a timeout whose value affected the behavior, so I tried another approach and changed the value of writers in store/storecache. It is set to 20 originally, and at 20 I get timeouts every second or two. At 5 they appear a few times a minute, at 4 once every few minutes, and at 3 never.

Let's look at the 4 setting, as that almost never times out and keeps the line saturated as well as 20 (sic). My home line is steady at delivering 1.5MB/s upstream, and with a writers setting of 4, this rate is maintained. A setting of 4 means about 4MB are outstanding on the wire, and that is right on the cusp of timing out. At 1.5MB/s, 4MB takes 3 or 4 seconds. Thus we would expect to see a timeout somewhere in the system in the neighborhood of 3-5 seconds, but I cannot find one.

The network code in the Go runtime is inscrutable to me. (The amazing thing about epoll is that it's better than its predecessor.) Someone who understands that code, or maybe the HTTP code, might know where the relevant timeout is, and may be able to adjust it.

Meanwhile if I get a chance with a different throughput network I'll see what the sweet value of writers is on that, and maybe find a way to set it dynamically.

For now, I will hand-tune my value of writers.

This isn't over yet.

presotto (Collaborator) commented Mar 9, 2017

Without understanding the underlying extraordinarily complex library code, we can't assume that there is any fairness going on in the different streams. Therefore, the applicable timeout may be much more than 3-5 seconds, i.e., one stream could be getting starved.

robpike (Contributor, Author) commented Mar 10, 2017

The relevant timeout is 15s and is in cloud/https. See https://upspin-review.googlesource.com/c/8285/

There are several free parameters that affect the timeout rate, and one fixed parameter. The fixed one is the available bandwidth; the free ones are the number of parallel writers, the block size, and the timeout. It should be possible to adjust one or more of the free parameters based on the observed bandwidth, although I realize this is not going to be easy.

The current settings will make it all but impossible for store.Put to succeed on a slow link.

n2vi (Contributor) commented Mar 10, 2017 via email

adg pushed a commit that referenced this issue Mar 10, 2017
4 is still arbitrary but at least on my home line generates almost
no timeouts while still keeping the uplink saturated.

Update #313

Change-Id: Ib641313ac7b98151d5fb80b1b95a987005fedb4b
Reviewed-on: https://upspin-review.googlesource.com/8320
Reviewed-by: David Presotto <[email protected]>
@adg adg added the bug label Mar 12, 2017
robpike (Contributor, Author) commented Apr 27, 2017

This has been resolved, mostly.

@robpike robpike closed this as completed Apr 27, 2017
4 participants