
Ambiguity in docs & multi-datacenter deployments #5697

Closed
flyinprogrammer opened this issue May 14, 2019 · 6 comments
Labels: theme/docs (Documentation issues and enhancements), type/question

Comments

flyinprogrammer (Contributor) commented May 14, 2019

Documentation references:

A Nomad cluster typically comprises three or five servers (but no more than seven) and a number of client agents.

Nomad differs slightly from Consul in that it divides infrastructure into regions which are served by one Nomad server cluster, but can manage multiple datacenters or availability zones. For example, a US Region can include datacenters us-east-1 and us-west-2.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L29

In cloud environments, a single cluster may be deployed across multiple availability zones. For example, in AWS each Nomad server can be deployed to an associated EC2 instance, and those EC2 instances distributed across multiple AZs. Similarly, Nomad server clusters can be deployed to multiple cloud regions to allow for region level HA scenarios.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L33

Nomad servers are expected to be able to communicate in high bandwidth, low latency network environments and have below 10 millisecond latencies between cluster members. Nomad servers can be spread across cloud regions or datacenters if they satisfy these latency requirements.

https://github.com/hashicorp/nomad/blame/master/website/source/guides/install/production/reference-architecture.html.md#L65

http://www.stuartcheshire.org/papers/LatencyQuest.html

California -> Boston: 44ms theoretical best-case latency

https://www.cloudping.co/

Actual AWS latencies:
us-east-1 -> us-east-2: 16.11ms
us-east-1 -> us-west-1: 70.89ms
us-east-1 -> us-west-2: 78.84ms


So this issue is in reference to #304. We claim that a Nomad server cluster should always be 3, 5, or at most 7 nodes. We then say that a region can contain datacenters us-east-1 and us-west-2. We then recommend running Nomad servers across different availability zones, and say that servers need to maintain a latency of below 10 ms. With AWS, the latency between us-east-1 and us-west-2 is approximately 70 ms, meaning this claimed architecture doesn't follow our own requirements.

How is someone supposed to actually deploy a multi-region, multi-datacenter, highly available Nomad cluster (or clusters)? And what should our production deployment look like when we have more than 7 datacenters in a region?

nickethier (Member) commented:
Hey @flyinprogrammer, first off thanks for distilling this down from #304, it's much appreciated!

I think you're right that there's some ambiguity that can lead to misinterpreting the docs you've highlighted. I think the biggest cause of confusion comes from these two lines:

Similarly, Nomad server clusters can be deployed to multiple cloud regions to allow for region level HA scenarios.

Nomad servers can be spread across cloud regions or datacenters if they satisfy these latency requirements.

So let's break this down with some more detailed explanations of these requirements. Low latency between servers of the same region is required because the servers must coordinate the scheduling decisions of that cluster (here I'm using cluster and region interchangeably). As this latency increases, the time it takes to make a scheduling decision can increase exponentially. For this reason it is recommended that the latency between servers of the same region be kept below 10ms. To achieve higher availability, it is also recommended to spread these servers across availability zones within a cloud region. An example with AWS would be to run a server in each of the us-east-1a, us-east-1b, and us-east-1c availability zones for the us-east-1 region.
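
To make that concrete, here is a minimal sketch of what the agent configuration for one of those servers could look like. The region, datacenter, and path values are hypothetical illustrations of the layout described above, not taken from the docs:

```hcl
# Hypothetical agent config for one Nomad server in the "us-east-1" region,
# placed in the us-east-1a availability zone. The other servers would use the
# same region value but datacenter "us-east-1b" or "us-east-1c".
region     = "us-east-1"
datacenter = "us-east-1a"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}
```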

Clients of this region could then be deployed across those availability zones either as a single datacenter or multiple.

Multi-region high availability comes in the form of federating multiple regions into a single cluster. Notice I'm changing the distinction between region and cluster now. A region is the scheduling boundary, meaning when you submit a job to Nomad it is registered with a specific region. There is currently no scheduling done across regions. Because of this, region federation is not bound to the same latency requirements as servers within a single region. As seen in this image, requests made to region A can be forwarded to region B if that is the specified region for the request:
[diagram: a request submitted to region A's servers being forwarded to region B's servers]
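
As a rough illustration of that federation model, a server in a second region runs in its own independent server cluster; the names and counts below are placeholders, and federation itself would be established afterwards, for example by running `nomad server join <address-of-a-us-east-1-server>` from one of these servers:

```hcl
# Hypothetical agent config for a server in a second, federated region.
# This cluster elects its own leader and makes its own scheduling decisions;
# it is only joined to the us-east-1 region for request forwarding.
region     = "us-west-2"
datacenter = "us-west-2a"
data_dir   = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}
```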

I hope that explanation helps clear things up. To get right to your question: we are working on putting together a sort of running-in-production guide that outlines best practices we've observed from users running Nomad clusters. If you have 7 datacenters in a region (so keeping to that 10ms requirement), I would recommend starting with 5 servers. You can then deploy clients with whatever datacenter layout works best for you. Some users like to have separate client datacenters for different security zones/VPCs, availability zones or racks, or even just one datacenter. It's really up to you and your requirements. If you need somewhere to start, I would recommend mapping client datacenters to your 7 datacenters to begin with.
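
For completeness, a job submitted to that region would then name whichever of those datacenters it is allowed to run in, along these lines. The job name, driver, and image here are placeholders, not anything from the docs:

```hcl
# Hypothetical job spec: registered with the us-east-1 region and eligible
# to be placed in three of that region's datacenters (availability zones).
job "example" {
  region      = "us-east-1"
  datacenters = ["us-east-1a", "us-east-1b", "us-east-1c"]

  group "web" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:stable"
      }
    }
  }
}
```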

flyinprogrammer (Contributor, Author) commented:
So if this is the architecture y'all want to articulate, then this sentence needs to go:

For example, a US Region can include datacenters us-east-1 and us-west-2.

And be replaced with something like: "a us-east-1 region can include datacenters us-east-1a, us-east-1b, etc."

Similarly, the analogy between Nomad and Consul terminology can be written as: a Nomad region == a Consul datacenter, and a Nomad datacenter is a partition of a Nomad region, typically a cloud provider availability zone.

Hopefully in Nomad, Consul, and Vault v2 we'd see a normalized naming convention whereby you might have:

  • Territory: a collection of Regions
  • Region: a collection of Availability Zones
  • Availability Zone: a collection of compute

But I digress.

nickethier (Member) commented:
What you've described is the common case, but there are instances where Nomad clusters consist of a single region with clients spread across cloud provider regions. While you lose the multi-region high availability of running multiple sets of servers, you gain scheduling across multiple cloud regions. It's a tradeoff some choose to make.
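
A rough sketch of that layout, assuming a single server cluster in the default "global" region with a client agent running in a different cloud provider region (the datacenter name and server addresses are placeholders):

```hcl
# Hypothetical agent config for a client in a distant cloud provider region
# that still belongs to the single "global" Nomad region. The 10ms guidance
# applies to server-to-server traffic; client-to-server RPCs tolerate higher
# WAN latency, at the cost of slower interactions with the servers.
region     = "global"
datacenter = "eu-west-1a"
data_dir   = "/opt/nomad/data"

client {
  enabled = true
  # RPC addresses of the servers in the single server cluster (placeholders)
  servers = ["10.0.1.10:4647", "10.0.1.11:4647", "10.0.1.12:4647"]
}
```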

I appreciate your feedback and will take it back to the education team. Thanks @flyinprogrammer!

MikeN123 (Contributor) commented May 5, 2021

It seems this is still in the docs:

For example, a US Region can include datacenters us-east-1 and us-west-2

versus

Nomad servers are expected to be able to communicate in high bandwidth, low latency network environments and have below 10 millisecond latencies between cluster members

I really can't see how these are compatible with each other, or why this ticket was closed.

tgross (Member) commented May 5, 2021

@MikeN123 let's open a new issue please, instead of commenting on one that was closed two years ago.

github-actions (bot) commented:
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 20, 2022