When multiple servers run in a non-HA setup connected to the same db, hosts start flapping in and out of the reconnecting state. I haven't looked into why yet, but my immediate hypothesis is that the secondary servers are waiting to receive pings from the agents. The agents use the load-balanced hostname, which always directs to the primary instance, so the secondaries never receive ping responses, decide the agents are unavailable, and start marking them as reconnecting. Meanwhile the primary is receiving pings and keeps marking them back, in a not-so-beautiful dance.
The simplest solution is probably to add a service to the server image that starts/stops the rancher server container depending on whether or not that node is the primary (rough sketch below). This means failover might take a bit longer, depending on the polling interval for serf membership changes (though it could also be triggered with serf events) and on how long the rancher container takes to initialize. In any case, I can't see it taking more than 30s. If you only have one instance running and it goes down, you basically need to wait for the cluster to detect this and for AWS to start up a new instance, which can take 10-20 minutes depending on the circumstances. If you can tolerate that downtime, there's no reason to run multiple instances; otherwise, it's a good idea to run a secondary.
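For illustration, here's a rough sketch of what that start/stop sidecar could look like, written in Python and polling serf rather than hooking serf event handlers. The container name, the "alphabetically-first alive member is primary" election rule, and the poll interval are all assumptions made up for the example, not anything Rancher actually ships:

```python
#!/usr/bin/env python3
"""Hypothetical sidecar: keep the rancher server container running only on the
node this script considers the serf "primary". All names/conventions here are
assumptions for illustration."""
import socket
import subprocess
import time

CONTAINER = "rancher-server"   # assumed container name
POLL_INTERVAL = 10             # seconds between serf membership checks


def alive_members():
    """Return the sorted names of serf members currently reported as alive."""
    out = subprocess.run(["serf", "members"], capture_output=True, text=True, check=True)
    return sorted(
        fields[0]
        for fields in (line.split() for line in out.stdout.splitlines())
        if len(fields) >= 3 and fields[2] == "alive"
    )


def is_primary():
    """Assumed election rule: the alphabetically-first alive member is primary."""
    members = alive_members()
    return bool(members) and members[0] == socket.gethostname()


def set_container_state(should_run):
    """Start or stop the rancher server container to match the desired state."""
    running = subprocess.run(
        ["docker", "inspect", "-f", "{{.State.Running}}", CONTAINER],
        capture_output=True, text=True,
    ).stdout.strip() == "true"
    if should_run and not running:
        subprocess.run(["docker", "start", CONTAINER], check=True)
    elif not should_run and running:
        subprocess.run(["docker", "stop", CONTAINER], check=True)


if __name__ == "__main__":
    while True:
        set_container_state(is_primary())
        time.sleep(POLL_INTERVAL)
```

If the polling delay turns out to matter, the same logic could be invoked from a serf member-join/member-failed event handler instead, which would cut failover time down to roughly the container startup time.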
So in the meantime, we're limited to single-server setups, which means accepting extended downtime in the case of a node failure or termination.