Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

randomize initial zone selection #583

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Conversation

joshk
Copy link
Contributor

@joshk joshk commented Apr 1, 2019

What is the problem that this PR is trying to fix?

Workers, when deployed, are bound to a primary zone. Whenever a job (and VM) is started, the primary zone is used first, then an alternate zone is selected.

The core problem is when we start hitting zone exhaustion errors for a zone. Since there are pools of workers per zone, and each worker will try its primary zone first, which just puts more pressure on the api, raising the risk of api rate limit issues.

What approach did you choose and why?

This is a first step towards a fix, but not yet the full fix.

This changes the concept of a primary zone, and instead has a random zone picked each try. A zone can still be defined in the config, but this zone will be used each and every time.

The follow up work to this PR is to block a zone for 10mins across all workers (possibly using Redis), which will help reduce api usage, and a better self healing worker setup.

How can you test this?

I've tested this locally by connecting it to staging, and it worked a treat.

What feedback would you like, if any?

This is for general discussion

I also need help understanding how to fix the test failures

@soulshake
Copy link
Contributor

soulshake commented Apr 1, 2019

Note: this comment has been redacted; see followup below

The core problem is when we start hitting zone exhaustion errors for a zone.

Which zone exhaustion errors are you referring to? IIUC, most resource quotas (as well as API quotas) are per project and/or per region rather than per zone. (See GCE Quotas page)

Example from recent outage:

Error
QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded. Limit: 800000.0 in region us-central1. 

^ note the exhaustion is in the region us-central1, not a zone like us-central1-c.

(Note that we have occasionally hit ZONE_RESOURCE_POOL_EXHAUSTED errors in the past, but that has to do with the global resource usage for that zone rather than our projects specifically. And it was not the case in the recent outage.)

Since there are pools of workers per zone, and each worker will try its primary zone first, which just puts more pressure on the api, raising the risk of api rate limit issues.

I think this is only the case when a zone is specified in the worker config. In our case it looks like instances are pretty well distributed across zones already, no?
image

I wonder if it would make more sense to have each worker create job instances in its own zone, if possible, since those are already automatically distributed by the managed instance group.

Also, not all resource types (in particular GPUs) are available in all zones. So I'm not sure it makes sense to make zone pinning an all-or-nothing thing, because I believe that will cause problems when GPUs (and perhaps some specific CPUs) are specified.

Disclaimer: I still don't understand what actually triggered the recent outage, beyond the issue of not being able to delete instances because we had exceeded the API rate limits, thus causing the resource exhaustion. We still don't know what exactly caused us to exceed the API limits in the first place, do we?

@soulshake
Copy link
Contributor

Edit: I take it all back, I see there were plenty of these errors during the last outage:

Mar 29 23:32:53 production-2-worker-org-gce-4mr9 travis-worker-wrapper: 
time="2019-03-30T04:32:53Z" level=error msg="couldn't start instance, attempting 
requeue" err="code=ZONE_RESOURCE_POOL_EXHAUSTED location= 
message=The zone 'projects/travis-ci-prod-2/zones/us-central1-c' does not have 
enough resources available to fulfill the request.  Try a different zone, or try again later." 
job_id=123456798 job_path=xyz/xyz/jobs/123456798 pid=1 
processor=ed8d48ea-5209-4f2b-b595-bfda4c06ce13@1.production-2-worker-org-gce-4mr9 
repository=xyz/xyz self=step_start_instance start_timeout=8m0s uuid=84c6f99e-f021-4b13-baf1-ba101c22e3ab

@joshk
Copy link
Contributor Author

joshk commented Apr 1, 2019 via email

@soulshake
Copy link
Contributor

Thanks for the feedback AJ. I'm just about to hit the hay, but I thought I would also add that our Google reps recommended temporarily not using a zone when we hit these errors. I'll write up a more detailed reply when I wake up.

This definitely makes more sense now that I know we had hit ZONE_RESOURCE_POOL_EXHAUSTED errors during the last incident. 👍

"NETWORK": fmt.Sprintf("network name (default %q)", defaultGCENetwork),
"PREEMPTIBLE": "boot job instances with preemptible flag enabled (default false)",
"PREMIUM_MACHINE_TYPE": fmt.Sprintf("premium machine type (default %q)", defaultGCEPremiumMachineType),
"PROJECT_ID": "[REQUIRED] GCE project id",
"PROJECT_ID": "[REQUIRED] GCE project id (will try to auto detect it if not set)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, [REQUIRED] can be dropped

backend/gce.go Outdated
@@ -119,7 +119,7 @@ var (
"WARMER_URL": "URL for warmer service",
"WARMER_TIMEOUT": fmt.Sprintf("timeout for requests to warmer service (default %v)", defaultGCEWarmerTimeout),
"WARMER_SSH_PASSPHRASE": fmt.Sprintf("The passphrase used to decipher instace SSH keys"),
"ZONE": fmt.Sprintf("zone name (default %q)", defaultGCEZone),
"ZONE": "zone in which to deploy job instaces into (default is to use all zones in the region)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo (instaces)

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some fixes!

P.S. share your ideas, feedbacks or issues with us at https://github.com/fixmie/feedback (this message will be removed after the beta stage).

joshk and others added 3 commits June 25, 2019 19:16
Co-Authored-By: fixmie[bot] <44270338+fixmie[bot]@users.noreply.github.com>
@joshk joshk removed the request for review from emdantrim June 25, 2019 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants