max_replicas_per_node in paused/drained nodes drives to -> no suitable node when updating service #2979
Is the service configured "stop-first" or "start-first"?
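For reference, "stop-first" / "start-first" here refers to the service's update order. A minimal sketch of where it is set, assuming Compose file format 3.8 and the service-A name used later in this thread (the same setting can be changed on an existing service with docker service update --update-order):

```yaml
# docker-compose.yml (fragment): rolling-update order for the service
services:
  service-A:
    deploy:
      update_config:
        order: stop-first    # stop the old task before starting its replacement
        # order: start-first # start the replacement before stopping the old task
```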
@sebastianfelipe I would say that it is by design. You should either:
It cannot be like that, because:
Adding to this, it is a service with stop-first, so in theory max_replicas_per_node should work.
Well, this kind of use case didn't come to my mind when I implemented the max replicas feature. We use it just to make sure that all replicas do not get scheduled onto one worker when another one goes down, e.g. in case of a virtualization platform crash, network failure or node reboot. So this needs to be handled on the swarmkit side, meaning someone implements a test case which covers this situation and then modifies the logic to handle it correctly. PS. IMO patching existing workers is a very old-school approach. If you create a flow which first adds a new, already-patched worker to the swarm and then drops the old one, you will not see this issue.
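A rough sketch of the replace-the-worker flow described above, assuming a single manager and hypothetical node names old-worker and new-worker:

```sh
# On the manager: print the join token for workers
docker swarm join-token worker

# On the new, already-patched VM: join it to the swarm
docker swarm join --token <worker-token> <manager-ip>:2377

# Back on the manager: drain the old worker so its tasks are rescheduled,
# then remove it once it has been shut down
docker node update --availability drain old-worker
docker node rm old-worker
```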
@olljanat thanks for answering. I didn't quite understand: you're saying you designed it for the case where a VM goes down, so what happened here? Because I did turn off a VM completely and then that error appeared, so the question is, why is it trying to deploy services on down hosts? Maybe I didn't understand your point very well :/
@sebastianfelipe I mean that if you deploy a service with two replicas to a swarm with two workers, each of them will run one replica. If you now reboot one of those workers, swarm will notice that only one replica of the service is running and schedule the second replica on the only available worker, so the end result is that one worker has two replicas of the service and the other worker has zero. The only way to fix that without max replicas is to scale all services to 1 and back to 2 after the worker reboot. Otherwise you end up in a situation where your application is no longer fault tolerant to a worker crash (if the worker which has all the replicas goes down, then the whole application is down until swarm restarts the replicas on another worker). Btw, you can see how this was implemented and which kind of tests were included in moby/swarmkit#2758, and that is what needs to be enhanced to handle your use case correctly.
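The scale-down/up workaround mentioned above, as a sketch using the two-replica example and the service-A name from this thread:

```sh
# After the rebooted worker is back, force the replicas to spread out again
docker service scale service-A=1
docker service scale service-A=2
```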
This repo is not for Docker swarm; you are looking for https://github.com/docker/swarmkit
Hi everyone!
Well, I found what I think is a really huge bug here. I work on a project with 4 VMs: 1 manager and 3 workers. 2 of the 3 workers are active and the other one is paused (it was also drained and the issue stayed the same). Some services have max_replicas_per_node set.
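For reference, a sketch of how a worker is taken out of scheduling; the node name worker-3 is a placeholder:

```sh
# Pause: no new tasks are scheduled on the node, existing tasks keep running
docker node update --availability pause worker-3

# Drain: existing tasks are rescheduled to other nodes as well
docker node update --availability drain worker-3

# Check the AVAILABILITY column for all nodes
docker node ls
```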
Let's say the service is called "service-A" and has "max_replicas_per_node = 2" and "replicas = 6", so in theory service-A should be running on the 3 workers with 2 replicas on each one. Everything works fine. Now I turn off a worker, so that service-A has 4/6 replicas running. Everything OK. Now comes the issue. I have a new release for service-A, so I re-run the compose file, or I just run docker service update --force service-A, but I get this: "1/6: no suitable node (max replicas per node limit exceed; scheduling constrain…"
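A minimal compose fragment matching the configuration described above, assuming Compose file format 3.8 (the image name is a placeholder):

```yaml
services:
  service-A:
    image: registry.example.com/service-a:latest   # placeholder
    deploy:
      replicas: 6
      placement:
        max_replicas_per_node: 2
      update_config:
        order: stop-first
```

This would be deployed with docker stack deploy -c docker-compose.yml and then updated with docker service update --force on the resulting service.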
I think this issue is very, very important, because it doesn't let us roll out new releases: either a machine that is supposed to be excluded from scheduling is being asked to schedule something it can't, or maybe the new container being created counts toward max_replicas_per_node, so it can't be created. Well, I'm not sure exactly what is going on here, but this issue doesn't let me deploy releases.
I hope we can find some solution to this.
Thanks,
Sebastián.