Ambiguity in docs & multi-datacenter deployments #5697
Comments
Hey @flyinprogrammer, first off, thanks for distilling this down from #304; it's much appreciated! I think you're right that there's some ambiguity that can lead to misinterpreting the docs you've highlighted. I think the biggest cause of confusion comes from these two lines:
So let's break this down with some more detailed explanations for these requirements. Low latency between servers of the same region is required because the servers must coordinate the scheduling decisions of that cluster (here I'm using cluster and region interchangeably). As this latency increases, the time it takes to make a scheduling decision can increase exponentially. For this reason it is recommended that the latency between servers of the same region be kept below 10ms.

Now, in order to achieve higher availability, it is also recommended to spread these servers among availability zones within a cloud region. An example with AWS would be to run one server in each of us-east-1a, us-east-1b, and us-east-1c. Clients of this region could then be deployed across those availability zones, either as a single datacenter or multiple.

Multi-region high availability comes in the form of federating multiple regions into a single cluster. Notice I'm changing the distinction between region and cluster now. A region is the scheduling boundary, meaning when you submit a job to Nomad it is registered with a specific region; there is currently no scheduling done across regions. Because of this, region federation is not bound to the same latency requirements as servers within a single region. As seen in this image, requests made to region A can be forwarded to region B if that is the specified region for the request.

I hope that explanation helps clear things up. To get right to your question: we are working on putting together a sort of running-in-production guide that outlines best practices we've observed from users running Nomad clusters. If you have 7 datacenters in a region (so keeping to that 10ms requirement), I would recommend starting with 5 servers. You can then deploy clients with whatever datacenter layout works best for you. Some users like to have separate clients for different security zones/VPCs, availability zones or racks, or even just one datacenter. It's really up to you and your requirements. I would recommend starting by mapping clients to your 7 datacenters if you need somewhere to start.
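To make the intra-region layout above concrete, here is a minimal sketch of Nomad agent configs, assuming a three-server cluster spread across AWS availability zones; the region, datacenter, and file names are illustrative, not from the thread:

```hcl
# server-us-east-1a.hcl — one of three servers, each placed in a different
# availability zone of the same cloud region (assumed names).
region     = "us-east-1"
datacenter = "us-east-1a"

server {
  enabled          = true
  bootstrap_expect = 3 # wait for 3 servers before electing a leader
}
```

```hcl
# client-us-east-1b.hcl — a client node; the datacenter field can map to an
# availability zone, a rack, or a security zone, per the advice above.
region     = "us-east-1"
datacenter = "us-east-1b"

client {
  enabled = true
}
```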
So if this is the architecture y'all want to articulate, then this sentence needs to go:
And be replaced with: a us-east-1 region can include datacenters us-east-1a, us-east-1b, etc. Similarly, the analogy between Nomad and Consul terminology can be written as: a Nomad region == a Consul datacenter, and a Nomad datacenter is a partition of a Nomad region, typically a cloud provider availability zone. Hopefully in Nomad, Consul, and Vault v2 we'd see a normalized naming convention whereby you might have:
But I digress.
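As a hedged illustration of the terminology mapping proposed above (the values are hypothetical, and these are just the standard top-level agent options for each tool):

```hcl
# Nomad agent: "region" is the scheduling boundary (Consul's "datacenter"
# plays this role), while "datacenter" partitions the region, typically by AZ.
region     = "us-east-1"
datacenter = "us-east-1a"
```

```hcl
# Consul agent: a single "datacenter" setting, aligned here with the Nomad region.
datacenter = "us-east-1"
```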
Typically what you've described is the common case, but there are instances where Nomad clusters consist of a single region with clients spread across cloud provider regions. While you lose the multi-region high availability of running multiple sets of servers, you gain scheduling across multiple regions. It's a tradeoff some choose to make. I appreciate your feedback and will take it back to the education team. Thanks @flyinprogrammer!
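A sketch of what that tradeoff looks like in practice: a single Nomad region whose job spreads across datacenters that happen to live in different cloud provider regions. The job, group, and datacenter names here are assumptions for illustration:

```hcl
job "cross-dc" {
  # One Nomad region, but its datacenters map to AZs in different
  # cloud provider regions (assumed names).
  region      = "global"
  datacenters = ["us-east-1a", "us-west-2a"]

  group "web" {
    count = 2

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:stable"
      }
    }
  }
}
```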
It seems this is still in the docs:
versus
I really can't see how those are compatible with each other, or why this ticket was closed.
@MikeN123 let's open a new issue, please, instead of commenting on one that was closed two years ago.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Documentation references:
https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L29
https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L33
https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L65
California -> Boston: 44ms theoretical best-case latency
Actual AWS latencies:
us-east-1 -> us-east-2: 16.11ms
us-east-1 -> us-west-1: 70.89ms
us-east-1 -> us-west-2: 78.84ms
So this issue is in reference to #304. We claim that a Nomad server cluster should always be 3, 5, or at most 7 nodes. We then say that a region can contain the datacenters us-east-1 and us-west-2. We then recommend running Nomad servers across different availability zones, and that servers need to maintain a latency of below 10ms. With AWS, the latency between us-east-1 and us-west-2 is approximately 70ms, meaning this claimed architecture doesn't follow our own requirements.
How is someone supposed to actually deploy multi-region, multi-datacenter, highly available Nomad clusters? And what should our production deployment look like when we have more than 7 datacenters in a region?
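One layout consistent with the maintainer's explanation above keeps each region's servers inside a single cloud region (satisfying the 10ms rule) and federates regions over WAN gossip, where that rule does not apply. A minimal sketch with assumed names:

```hcl
# server-east.hcl — all servers for region "us-east-1" stay inside that
# cloud region, keeping inter-server latency well under 10ms.
region     = "us-east-1"
datacenter = "us-east-1a"

server {
  enabled          = true
  bootstrap_expect = 3
}

# An identical cluster runs separately as region "us-west-2". Once both are
# up, federate them from any server with:
#   nomad server join <server-in-us-west-2>:4648
# (4648 is Nomad's default serf/gossip port.)
```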