gvfs-helper: auto-retry after network errors, resource throttling, split GET and POST semantics #208
Conversation
@wilbaker Thanks for the pointers.
See microsoft/git#208 for the Git changes. This should make the gvfs-helper more robust to intermittent failures when a single network call fails.
/azp run Microsoft.git
Azure Pipelines successfully started running 1 pipeline(s).
@jeffhostetler after the 503 update, I was unable to repro any network failures in the C# functional tests. (I got a different flaky failure in the C# code, but that's different ;) ) I'm happy to re-test and approve after you do the cleanup to make this not WIP.
Thanks for the confirmation! Almost finished....
/azp run Microsoft.git
Azure Pipelines successfully started running 1 pipeline(s).
Add a robust retry mechanism to automatically retry a request after network errors. This includes retry after:

- transient network problems reported by CURL
- HTTP 429 throttling (with associated Retry-After)
- HTTP 503 server unavailable (with associated Retry-After)

Add voluntary throttling using Azure X-RateLimit-* hints to avoid being soft-throttled (tarpitted) or hard-throttled (429) on later requests.

Add global (outside of a single request) azure-throttle data to track the rate-limit hints from the cache-server and the main Git server independently.

Add exponential retry backoff. This is used for transient network problems when we don't have a Retry-After hint.

Move the call to index-pack earlier in the response/error handling sequence, so that if we receive a 200 but the packfile is truncated or corrupted, we can use the regular retry logic to get it again.

Refactor the way we create tempfiles for packfiles to use <odb>/pack/tempPacks/ rather than working directly in the <odb>/pack/ directory.

Move the code that creates a new tempfile to the start of each single request attempt (initial and retry attempts), rather than the overall start of a request. This gives us a fresh tempfile for each network request attempt, which simplifies the retry mechanism, isolates us from the file-ownership issues hidden within the tempfile class, and avoids the need to truncate previous incomplete results. This was necessary because index-pack was pulled into the retry loop.

Minor: add support for logging X-VSS-E2EID to telemetry on network errors.

Minor: rename variable params.b_no_cache_server --> params.b_permit_cache_server_if_defined. This variable indicates whether we should try to use the cache-server when it is defined; the rename gets rid of the double-negative logic.

Minor: rename variable params.label --> params.tr2_label to clarify that this variable is only used with trace2 logging.

Minor: move the code that automatically maps cache-server 400 responses to normal 401 responses earlier in the response/error handling sequence to simplify the later retry logic.

Minor: decorate trace2 messages with "(cs)" or "(main)" to identify the server in log messages. Add params->server_type to simplify this.

Signed-off-by: Jeff Hostetler <[email protected]>
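To make the retry shape concrete, here is a minimal sketch of the flow described above. It is not the actual gvfs-helper code: `do_one_attempt()`, `MAX_RETRIES`, and the fields of `attempt_result` are hypothetical names; the point is only the order of decisions (hard failures stop immediately, a Retry-After hint wins when present, and otherwise we back off exponentially).

```c
/*
 * Minimal sketch only -- not the actual gvfs-helper code.
 * do_one_attempt(), MAX_RETRIES, and the fields of attempt_result
 * are hypothetical names used to illustrate the retry shape.
 */
#include <unistd.h>

struct attempt_result {
	int ok;                /* 1 on success */
	int transient_error;   /* CURL-level network failure */
	long http_status;      /* e.g. 429 or 503 */
	long retry_after_sec;  /* parsed Retry-After header, 0 if absent */
};

struct attempt_result do_one_attempt(void); /* hypothetical */

#define MAX_RETRIES 6 /* hypothetical retry budget */

int do_request_with_retry(void)
{
	unsigned int backoff = 1;
	int attempt;

	for (attempt = 0; attempt <= MAX_RETRIES; attempt++) {
		struct attempt_result r = do_one_attempt();

		if (r.ok)
			return 0;

		/* only transient errors, 429, and 503 are retryable */
		if (!r.transient_error &&
		    r.http_status != 429 && r.http_status != 503)
			return -1;

		if (r.retry_after_sec > 0) {
			sleep((unsigned int)r.retry_after_sec); /* honor hint */
		} else {
			sleep(backoff);
			backoff *= 2; /* exponential backoff */
		}
	}
	return -1; /* retry budget exhausted */
}
```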
9a844ff to 0519894 (Compare)
}

/*
 * Get exactly 1 object immediately.
 * Ignore any queued objects.
So, we are assuming that any queued objects will get a flush
request eventually? That sounds reasonable.
Yeah, I've split the queued and immediate usage now. The dry-run/pre-scan loops already handle the queue and drain (for missing blobs usually). The main difference now is that any missing trees (or commits) during those loops will be immediately fetched in isolation, but the queue will remain.
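For anyone reading along, here is a rough sketch of how a caller would use the two paths after this split. The prototypes below are guesses for illustration only and may not match the real declarations in gvfs-helper-client.h.

```c
/*
 * Sketch only: these prototypes are guesses for illustration and
 * may not match the real gvfs-helper-client.h declarations.
 */
struct object_id;

/* fetch exactly one object right away ("objects.get"); queue untouched */
int gh_client__get_immediate(const struct object_id *oid);

/* batch path: accumulate oids, then flush them with "objects.post" */
void gh_client__queue_oid(const struct object_id *oid);
int gh_client__drain_queue(void);

int example(const struct object_id *missing_tree,
	    const struct object_id **missing_blobs, int nr)
{
	int i;

	/* a missing tree is fetched immediately, in isolation */
	if (gh_client__get_immediate(missing_tree))
		return -1;

	/* missing blobs stay queued until the drain, then go out as a batch */
	for (i = 0; i < nr; i++)
		gh_client__queue_oid(missing_blobs[i]);
	return gh_client__drain_queue();
}
```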
Expose the differences in the semantics of GET and POST for the "gvfs/objects" API:

- HTTP GET: fetches a single loose object over the network. When a commit object is requested, it returns just that single object.
- HTTP POST: fetches a batch of objects over the network. When the oid-set contains a commit object, all referenced trees are also included in the response.

gvfs-helper is updated to take "get" and "post" command line options. The gvfs-helper "server" mode is updated to take "objects.get" and "objects.post" verbs.

For convenience, the "get" option and the "objects.get" verb do allow more than one object to be requested. gvfs-helper will automatically issue a series of (single object) HTTP GET requests and create a series of loose objects.

The "post" option and the "objects.post" verb perform bulk object fetching using batch-size chunking. An individual HTTP POST request containing more than one object will be created as a packfile; an HTTP POST for a single object will create a loose object.

This commit also contains some refactoring to eliminate the assumption that POST is always associated with packfiles.

In gvfs-helper-client.c, gh_client__get_immediate() now uses the "objects.get" verb and ignores any currently queued objects.

In gvfs-helper-client.c, the OIDSET built by gh_client__queue_oid() is only processed when gh_client__drain_queue() is called. The queue is processed using the "objects.post" verb.

Signed-off-by: Jeff Hostetler <[email protected]>
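Put differently, the loose-vs-packfile behavior described above boils down to a dispatch roughly like the following. All helper names and the chunk size are made up for the example; this is not the real gvfs-helper internals.

```c
/*
 * Sketch of the dispatch described above. http_get_loose_object(),
 * http_post_batch(), and BATCH_SIZE are hypothetical names, not the
 * real gvfs-helper internals.
 */
#include <stddef.h>

#define BATCH_SIZE 4000 /* hypothetical chunk size for POST requests */

int http_get_loose_object(const char *oid_hex);              /* hypothetical */
int http_post_batch(const char *const *oid_hex, size_t nr);  /* hypothetical */

int fetch_objects(const char *const *oid_hex, size_t nr, int use_post)
{
	size_t i;

	if (!use_post) {
		/*
		 * "get": one HTTP GET per oid; each response is written
		 * as a loose object, and a commit oid returns only that
		 * single commit.
		 */
		for (i = 0; i < nr; i++)
			if (http_get_loose_object(oid_hex[i]))
				return -1;
		return 0;
	}

	/*
	 * "post": send the oid-set in chunks. A multi-object response
	 * arrives as a packfile; a single-object response is written as
	 * a loose object. Commit oids also pull in the trees they
	 * reference.
	 */
	for (i = 0; i < nr; i += BATCH_SIZE) {
		size_t chunk = (nr - i < BATCH_SIZE) ? nr - i : BATCH_SIZE;

		if (http_post_batch(oid_hex + i, chunk))
			return -1;
	}
	return 0;
}
```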
a5d67f2 to 5c65e9a (Compare)
@derrickstolee I think I'm done tinkering with this one. 2.20191023.1 has already been through the functional tests and is good. I just did a final squash on my fixups and am running 2.20191023.2 through its paces.
I'll rebase these commits onto tentative/features/sparse-checkout-2.24.0
for #214 after you merge.
Thanks for all your help!
…out-2.23.0
Upgrade to 2.20191023.7-sc, which corresponds to commit microsoft/git@a782a7e and includes the following changes (since 2.20191015.2-sc):
- microsoft/git#208
- microsoft/git#210
Resolves #195. Includes the following updates to `microsoft/git`:
* microsoft/git#208: gvfs-helper: auto-retry after network errors, resource throttling, split GET and POST semantics.
* microsoft/git#215: gvfs-helper: dramatically reduce progress noise.
I'm marking this WIP because I haven't done a cleanup round nor squashed things yet.
But I want to give the CI builds a chance to run tonight.
This series attempts to:
[x] auto-retry after network outages
[x] throttle back when requested (or demanded) by the server.
Questions:
[done] What is the right default for the network retry limit?
[done] Should the throttle back have a time limit? (It's one thing to wait 3 or 4 minutes between
packfiles because we hit it too hard, but another if it says we should wait an hour or two.)
[no] Should the network retry look at the amount of data received and try to resume it?
[no] Should the network retry split large packfile requests if we can tell the user's network is flakey?
Basic testing with 5 concurrent fetches shows that it is pretty easy to get throttled, which
makes me wonder whether we should even bother with multi-threading this. Perhaps we just
limit it to waiting for index-pack in another thread, but only plan to have 1 network thread.
Or maybe that is only when talking to the main server -- we might be able to multi-thread
when talking to the cache-server.