Ambiguity in docs regarding advertise, addresses configuration parameters and multi-datacenter deployments #304
I am interested in setting up a multi-datacenter cluster. The documentation for configuring the `advertise` and `addresses` options is ambiguous. In particular:

- In the `addresses` section, it says that the `rpc` address should be accessible to all agents, but should only be accessible to cluster members. This sounds a bit contradictory. Should the `rpc` address be accessible only to agents within the datacenter the server is in?
- In the `addresses` section, the `serf` address should only be exposed to other servers within the datacenter. But it also says that this address is used for gossiping, which according to the architecture is supposed to be used to talk to other servers in other data centers and regions.
- In the `advertise` section, the `rpc` address should be accessible by all agents. Is this within a datacenter, a region, or every single agent across all regions and data centers?
- In the `advertise` section, the `serf` address should be reachable from all servers (I am assuming across all regions and data centers), but it contradicts the `serf` parameter within the `addresses` section, which says it should only be reached by servers within their own data center.

In regards to multi-datacenter requirements, there are also some things that are not clear:

- In the `High-Level Overview` section of the `Architecture` docs, the image does not show which data centers the servers reside in.
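For concreteness, here is a minimal sketch of the two blocks being asked about, on a single server agent. The IPs, region, and datacenter names are placeholders, and the bind/advertise split shown in the comments is an illustrative assumption rather than a quote from the docs:

```hcl
# Illustrative Nomad server agent config (placeholder values; default ports
# assumed: 4646 http, 4647 rpc, 4648 serf).
region     = "global"
datacenter = "dc1"

server {
  enabled          = true
  bootstrap_expect = 3
}

# addresses: the local interfaces this agent binds to.
addresses {
  http = "0.0.0.0"
  rpc  = "10.0.1.5"
  serf = "10.0.1.5"
}

# advertise: the addresses other agents and servers are told to use
# when reaching this agent.
advertise {
  http = "10.0.1.5:4646"
  rpc  = "10.0.1.5:4647"
  serf = "10.0.1.5:4648"
}
```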
If someone from the Nomad team could answer those questions in the meantime, that would be awesome! 😄 I am keen to build a test virtualized cluster simulating a multi-datacenter deployment.
@F21 We can certainly make these docs more clear. To clarify a few things:

Terminology

Multi-datacenter

The Nomad server cluster operates only at the region level. A single region deployment can manage scheduling for multiple datacenters, and unlike Consul you do not deploy Nomad per datacenter. We expect most organizations will only ever need to run one Nomad cluster worldwide (essentially a "global" region), which will simplify deployment. You may choose to create multiple regions for business or regulatory reasons, or for clusters larger than tens of thousands of nodes. Network rules inside a region are as follows:

Beyond that, datacenter is just a label on client nodes that is used for job placement.

Multi-region

If your total fleet is very large (upwards of tens of thousands of machines) you may need to shard into multiple regions. In this case, each region will be represented by a cluster of 3-5 Nomad servers. These servers communicate with each other over serf, both intra-regionally and inter-regionally. Additionally, servers in different regions will need to access each other's RPC address in order to forward jobs to another region. These inter-region links are optional. You could have completely isolated regions, but you will not be able to forward jobs from Region A to Region B, for example. Network rules for forwarding are as follows:

Serf (technical details)

Serf operates in LAN and WAN modes. LAN mode is used for (typically) physically local, low-latency gossip pools, while WAN is used for higher-latency pools on a global level. There are actually two pools here -- one using LAN mode and another using WAN mode. If you have 5 regions with 3 Nomad servers each, you will have 5 local serf rings with 3 nodes each, and one global serf ring with 15 nodes.
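As a hedged illustration of the single-region, multi-datacenter model described above (all names and addresses below are placeholders): clients in different datacenters carry different `datacenter` labels but point at the same regional server cluster.

```hcl
# client.hcl -- a client agent in a second datacenter of the same region.
# The datacenter value is only a placement label; all clients in the region
# talk to the same server cluster.
region     = "global"
datacenter = "us-west"

client {
  enabled = true
  servers = ["10.0.1.10:4647"]   # placeholder: any server in the region, RPC port
}
```

Jobs then select where to run via the job-level `datacenters` list, rather than by pointing at a different server cluster.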
@cbednarski Thanks! That was very useful. Another question about multi-region deployments: there doesn't appear to be any setting for a list of server IPs that servers should join. Do servers automatically discover each other and join (as long as they can reach each other)?
@dadgar Is this also something I can set in a configuration file?
It is not something that can be set in the configuration file. The reason is that in most cases the IP to join is not known when writing the static config file. You can see how we overcome this when bootstrapping a cluster in this Terraform file. The file launches three Nomad servers and then calls the HTTP API endpoint with the dynamic IPs of the other servers.
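The referenced Terraform file isn't reproduced here; a rough sketch of the same idea might look like the following. The `aws_instance.nomad_server` resource (with `count = 3`) is a hypothetical name, while `PUT /v1/agent/join` is Nomad's agent join endpoint and 4646/4648 are the default HTTP and serf ports.

```hcl
# Hypothetical sketch: once the three servers exist, tell the first one to
# join the serf (gossip) addresses of the other two via the HTTP API.
resource "null_resource" "join_nomad_servers" {
  depends_on = [aws_instance.nomad_server]

  provisioner "local-exec" {
    command = "curl -X PUT 'http://${aws_instance.nomad_server[0].public_ip}:4646/v1/agent/join?address=${aws_instance.nomad_server[1].private_ip}:4648&address=${aws_instance.nomad_server[2].private_ip}:4648'"
  }
}
```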
Probably getting a bit off-topic here, but wouldn't it be possible to have something like consul's |
+1, I agree with @F21, it's very easy to automate the cluster installation & startup with |
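For what it's worth, later Nomad releases did add a config-file equivalent: a `server_join` stanza with `retry_join`, similar in spirit to Consul's. A minimal sketch, assuming a Nomad version that supports it and placeholder server addresses:

```hcl
# Hypothetical server config using retry_join (available in later Nomad
# releases via the server_join stanza); addresses are placeholders.
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
    retry_max      = 3
    retry_interval = "15s"
  }
}
```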
@cbednarski, @dadgar I have another question, in Consul we have two gossip pools, LAN and WAN. In order to join servers in a WAN pool we use the |
@cbednarski that's a great explanation. I suggest adding it to the documentation.
I'm interested in getting a multi-region Nomad cluster set up for a POC. I have followed the steps outlined in the documentation and in this post above. Once I have stood everything up, I run the `nomad server-members` command and I see all my servers listed as being in the appropriate regions. When I try submitting a job destined for region2 via the HTTP API to a server in region1, the job does not get forwarded to a server in region2. A scheduler in region1 fails to place the allocation and my evaluation ends up blocked. Your documentation implies that this should work. I have also looked at the code and there appears to be `forwardRegion` logic that looks like it should do the trick. Can someone please document a minimal multi-region deployment?
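For reference, a minimal sketch of what a job "destined for region2" might look like. The region name comes from the comment above; the datacenter label and task contents are placeholders.

```hcl
# Job targeting region2; per the forwarding behavior under discussion,
# submitting this to a region1 server should result in it being forwarded
# to the region2 servers.
job "example" {
  region      = "region2"
  datacenters = ["dc2"]        # placeholder datacenter label in region2

  group "app" {
    task "web" {
      driver = "docker"
      config {
        image = "nginx:alpine"
      }
    }
  }
}
```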
I'm seeing the same behavior described above. Any more info on this?
@xiang Can you expand on the issue you are facing? The issue immediately above yours was fixed, as it was an issue in the CLI.
As a Nomad operator, and huge fan - these statements seem to contradict each other, on the basis that in my opinion there is nothing "physically local, low-latency" about a truly global deployment. So are you suggesting or implying that most folks do not have a literal global deployment of infrastructure? Maybe if we quantified what a "physically local, low-latency" network is, that would help. Is ~65ms latency over a VPN between us-west-1 and us-east-1 in AWS considered "physically local, low-latency"? What if it's over a Direct Connect with VPN backup? What if instead of us-west-1 it's us-west-2, which has ~90ms latency? Are we then also proposing to run only one Nomad server per datacenter, where a datacenter is a VPC in each AWS region? What if my region only consists of 2 datacenters, us-east-1 and us-west-1: how will quorum work in the event of a network split? Is my "global" infrastructure now down? And what if I have more than 5 datacenters in a region because I'm "multicloud"? I realize these questions could come off as being belligerent; however, I assure you that I just want to get on the same page with where we're going with this product and then show proof of success.
Amazing coincidence, finding this thread. All my infra is on private IPs and we have many AWS regions which are VPC-peered, though not full mesh, and will not be full mesh. The fact that servers sharing a common region try to talk to each other on 4647 hit my deployment plans. Now I am planning to switch my region names to the AWS region names, so my regions will be 'us-east-1' instead of just 'us'. Lesson learnt! 👍👍👍
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this. Thanks!
This is still an issue.
Speaking for myself: I see a few issues with this issue:
In my opinion, this issue should be laid to rest, with maybe a PR opened for the doc changes, or a new issue created describing what is confusing or missing in the docs. Thoughts?
After reading through this issue for the first time, I would agree with @shantanugadgil. Initially there was a pretty broad scope to the questions this issue raised, and further discussion seems to have answered some of them, but others were raised. I'm going to close this one. If anyone in this thread feels there is still some part of this issue that has not been resolved, please open a new one. Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.