
Ambiguity in docs & multi-datacenter deployments #5697

Closed
flyinprogrammer opened this issue May 14, 2019 · 6 comments
Labels: theme/docs (Documentation issues and enhancements), type/question

Comments

flyinprogrammer (Contributor) commented May 14, 2019

Documentation references:

A Nomad cluster typically comprises three or five servers (but no more than seven) and a number of client agents.

Nomad differs slightly from Consul in that it divides infrastructure into regions which are served by one Nomad server cluster, but can manage multiple datacenters or availability zones. For example, a US Region can include datacenters us-east-1 and us-west-2.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L29

In cloud environments, a single cluster may be deployed across multiple availability zones. For example, in AWS each Nomad server can be deployed to an associated EC2 instance, and those EC2 instances distributed across multiple AZs. Similarly, Nomad server clusters can be deployed to multiple cloud regions to allow for region level HA scenarios.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L33

Nomad servers are expected to be able to communicate in high bandwidth, low latency network environments and have below 10 millisecond latencies between cluster members. Nomad servers can be spread across cloud regions or datacenters if they satisfy these latency requirements.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L65

http://www.stuartcheshire.org/papers/LatencyQuest.html

California -> Boston: 44ms theoretical best-case latency

https://www.cloudping.co/

Actual AWS latencies:
us-east-1 -> us-east-2: 16.11ms
us-east-1 -> us-west-1: 70.89ms
us-east-1 -> us-west-2: 78.84ms


So this issue is in reference to #304. We claim that a Nomad server cluster should always be 3, 5, or at most 7 nodes. We then say that a region can contain datacenters us-east-1 and us-west-2. We then recommend running Nomad servers across different availability zones, and say that servers need to maintain a latency of below 10 ms. With AWS, the latency between us-east-1 and us-west-2 is approximately 70 ms, meaning this claimed architecture doesn't follow our own requirements.

How is someone supposed to actually deploy a multi-region, multi-datacenter, highly available Nomad cluster (or clusters)? And what should our production deployment look like when we have more than 7 datacenters in a region?

nickethier (Member) commented:
Hey @flyinprogrammer, first off thanks for distilling this down from #304, it's much appreciated!

I think you're right that there's some ambiguity that can lead to misinterpreting the docs you've highlighted. I think the biggest cause of confusion comes from these two lines:

Similarly, Nomad server clusters can be deployed to multiple cloud regions to allow for region level HA scenarios.

Nomad servers can be spread across cloud regions or datacenters if they satisfy these latency requirements.

So let's break this down with some more detailed explanations of these requirements. Low latency between servers of the same region is required because the servers must coordinate the scheduling decisions of that cluster (here I'm using cluster and region interchangeably). As this latency increases, the time it takes to make a scheduling decision can increase exponentially. For this reason it is recommended that the latency between servers of the same region be kept below 10ms. To achieve higher availability, it is also recommended to spread these servers across availability zones within a cloud region. An example with AWS would be to run a server in each of the us-east-1a, us-east-1b, and us-east-1c availability zones for the us-east-1 region.
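
To make that concrete, here is a minimal sketch of what the agent configuration for one of those servers could look like. The region, datacenter, and path values are hypothetical illustrations of the layout described above, not taken from the docs:

```hcl
# Hypothetical agent config for one Nomad server in the "us-east-1" region,
# placed in the us-east-1a availability zone. The other servers would use the
# same region value but datacenter "us-east-1b" or "us-east-1c".
region     = "us-east-1"
datacenter = "us-east-1a"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}
```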

Clients of this region could then be deployed across those availability zones either as a single datacenter or multiple.

Multi-region high availability comes in the form of federating multiple regions into a single cluster. Notice I'm changing the distinction between region and cluster now. A region is the scheduling boundary, meaning when you submit a job to Nomad it is registered with a specific region. There is currently no scheduling done across regions. Because of this, region federation is not bound to the same latency requirements as servers within a single region. As seen in this image, requests made to region A can be forwarded to region B if that is the specified region for the request:
[diagram: a request submitted to region A's servers being forwarded to region B's servers]
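
As a rough illustration of that federation model, a server in a second region runs in its own independent server cluster; the names and counts below are placeholders, and federation itself would be established afterwards, for example by running `nomad server join <address-of-a-us-east-1-server>` from one of these servers:

```hcl
# Hypothetical agent config for a server in a second, federated region.
# This cluster elects its own leader and makes its own scheduling decisions;
# it is only joined to the us-east-1 region for request forwarding.
region     = "us-west-2"
datacenter = "us-west-2a"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}
```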

I hope that explanation helps clear things up. To get right to your question: we are working on putting together a sort of running-in-production guide that outlines best practices we've observed from users running Nomad clusters. If you have 7 datacenters in a region (so keeping to that 10ms requirement), I would recommend starting with 5 servers. You can then deploy clients with whatever datacenter layout works best for you. Some users like to have separate client datacenters for different security zones/VPCs, availability zones or racks, or even just one datacenter. It's really up to you and your requirements. If you need somewhere to start, I would recommend mapping client datacenters to your 7 datacenters to begin with.
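
For completeness, a job submitted to that region would then name whichever of those datacenters it is allowed to run in, along these lines. The job name, driver, and image here are placeholders, not anything from the docs:

```hcl
# Hypothetical job spec: registered with the us-east-1 region and eligible
# to be placed in three of that region's datacenters (availability zones).
job "example" {
  region      = "us-east-1"
  datacenters = ["us-east-1a", "us-east-1b", "us-east-1c"]

  group "web" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:stable"
      }
    }
  }
}
```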

flyinprogrammer (Contributor, Author) commented:
So if this is the architecture y'all want to articulate, then this sentence needs to go:

For example, a US Region can include datacenters us-east-1 and us-west-2.

And be replaced with something like: "a us-east-1 region can include datacenters us-east-1a, us-east-1b, etc."

Similarly, the analogy between Nomad and Consul terminology can be written as: a Nomad region == a Consul datacenter, and a Nomad datacenter is a partition of a Nomad region, typically a cloud provider availability zone.

Hopefully in Nomad, Consul, and Vault v2 we'd see a normalized naming convention whereby you might have:

  • Territory: a collection of Regions
  • Region: a collection of Availability Zones
  • Availability Zone: a collection of compute

But I digress.

nickethier (Member) commented:
What you've described is the common case, but there are instances where Nomad clusters consist of a single region with clients spread across cloud provider regions. While you lose the multi-region high availability of running multiple sets of servers, you gain scheduling across multiple cloud regions. It's a tradeoff some choose to make.
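
A rough sketch of that layout, assuming a single server cluster in the default "global" region with a client agent running in a different cloud provider region (the datacenter name and server addresses are placeholders):

```hcl
# Hypothetical agent config for a client in a distant cloud provider region
# that still belongs to the single "global" Nomad region. The 10ms guidance
# applies to server-to-server traffic; client-to-server RPCs tolerate higher
# WAN latency, at the cost of slower interactions with the servers.
region     = "global"
datacenter = "eu-west-1a"
data_dir   = "/opt/nomad/data"

client {
  enabled = true
  # RPC addresses of the servers in the single server cluster (placeholders)
  servers = ["10.0.1.10:4647", "10.0.1.11:4647", "10.0.1.12:4647"]
}
```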

I appreciate your feedback and will take it back to the education team. Thanks @flyinprogrammer!

MikeN123 (Contributor) commented May 5, 2021

It seems this is still in the docs:

For example, a US Region can include datacenters us-east-1 and us-west-2

versus

Nomad servers are expected to be able to communicate in high bandwidth, low latency network environments and have below 10 millisecond latencies between cluster members

I really can't see how these are compatible with each other, or why this ticket was closed.

tgross (Member) commented May 5, 2021

@MikeN123 let's open a new issue please, instead of commenting on one that was closed two years ago.

github-actions (bot) commented:
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 20, 2022