[bitnami/redis] All pods in CrashLoopBackOff status after rebooting worker nodes #5181
Comments
Rebooted again after 1 hour (to test another chart, bitnami/rabbitmq), and now all redis pods are in Running status. Then I tested again by rebooting the worker nodes: the redis pods always end up in CrashLoopBackOff status, while the rabbitmq pods change to Running after several minutes. So if the pods are always in CrashLoopBackOff status after rebooting, is there any solution to manually recover them without rebooting again? Thanks.
Hi @wantdrink, thanks for the info, it is very related to the linked issue. We will check if we can reproduce it in our local environments and figure out the issue.
Hi, we have opened an internal task in order to investigate this error a bit more; unfortunately, we cannot give you an ETA. BTW, any action or research results are welcome if you are performing more tests on this. Thanks for the feedback!
Thank you @dani8art.
We have a similar issue with a number of Redis sentinel deployments in our GKE cluster. We have a chaos test running that randomly kills pods every couple of minutes; eventually it kills the pod serving as the Redis master, and that soon leaves the other pods in Error/CrashLoopBackOff state. Below is a brief snippet of the logs from one of our pods when it sees the master go down (IPs are ephemeral, service names changed for security). This cluster is running on image bitnami/redis:4.0.14
Hi, could you share the code of your chaos testing? It would be a great help for our investigation.
Unfortunately the tool is an in-house application, so it cannot be shared. (The public equivalent would be chaos-mesh carrying out a pod-kill action.)
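For anyone wanting to reproduce this without an in-house tool, a minimal chaos-mesh experiment along those lines might look like the sketch below. The namespace and label selector are assumptions and need to match the labels your chart release actually applies to the redis pods.

```yaml
# Hypothetical chaos-mesh PodChaos experiment: kill one randomly selected redis pod.
# Namespace and labels are placeholders; adjust them to your release.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: redis-pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one                           # kill a single randomly selected pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/name: redis   # assumed label; check your pods' labels
```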
@javsalgar what he said. Be advised, we are experimenting with setting |
Hi, we are aware that Redis Sentinel has issues recovering from failures affecting the master node where Kubernetes has to redeploy the pod on a different node. We are currently working on it and will get back to you as soon as possible. Sorry for the inconvenience.
Is there any sort of timeline on this? Just trying to manage my expectations :P
We are currently working on a fix, but I don't have an ETA, as it will probably require major changes in the Redis chart. I will keep you updated on any changes.
Hi,
Hi @rafariossaa,
after rebooting all worker nodes the error logs are the same as before:
redis:
metrics:
Hi @wantdrink, thank you for your feedback. We will continue investigating this issue, as it seems to persist after our fix attempt.
Thank you @migruiz4.
Just to add, we are also experiencing this on GKE. We are using preemptible instances, which pre-empt at least once a day, so the situation is similar to @corey-hammerton's chaos testing. We're using chart 12.8.3. Here are sentinel logs during testing with a cluster of 4 and quorum 2 (I know the quorum should normally be higher). When removing two nodes, the remaining 2 nodes should still have quorum; however, failover fails to happen.
Other sentinel:
Logging in with the CLI, both sentinels have empty replica lists, while both remaining replicas are still mastered to 192.168.0.48, which is now down. I'll continue to try to figure out what's happening, and post here if I find anything interesting.

Edit: after testing for a while, I've found a couple of times that a replica fails to get configured to a new master. This seems consistent with the above, where sentinels somehow "forget" about replicas. In the worst case, they forget ALL of the replicas and have nothing to promote. In slightly less severe cases, they forget one replica, which effectively drops out of the cluster. (In those cases, that node will still report as healthy to Kubernetes, so there's no easy indication of the problem without some external monitoring tools.) I've gone as far as I can with debugging this. I feel there's something odd about how the sentinels forget replicas, which seems to be close to the root of the issue. I'm going to try to work around this on our end by increasing the cluster size and adding some monitoring tools, and perhaps patching the readiness check to test whether redis is a master or is linked to one (a sketch of that idea is below).
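For reference, that readiness-check idea could be sketched as the probe below. This is not the chart's built-in probe; it assumes redis-cli is available in the container and that, if auth is enabled, the password is exposed to the probe (for example via the REDISCLI_AUTH environment variable).

```yaml
# Hypothetical readinessProbe for the redis container: report ready only if this
# node is a master, or a replica whose link to its master is up.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        info="$(redis-cli -h 127.0.0.1 -p 6379 info replication)" || exit 1
        echo "$info" | grep -q 'role:master' && exit 0
        echo "$info" | grep -q 'master_link_status:up'
  initialDelaySeconds: 20
  periodSeconds: 10
```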
Hi @meseta, thank you for the additional information. Following up on this: we are continuing to investigate and look for a permanent fix, and we will keep you updated.
thanks! I'd like to add that we increased the cluster size to 5 (quorum 3), reduced the |
@migruiz4 @meseta We were experiencing the same issue on GKE. We were able to reproduce it by manually deleting the VM that the master node was running on; perhaps you could try that to confirm. I tested switching the redis (/opt/bitnami/redis/etc) and sentinel (/opt/bitnami/redis-sentinel/etc) config directories from emptyDir to persistent volumes, and the issue was resolved. I have not been able to break the cluster since making this change.
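For anyone who wants to try the same workaround, the change described above essentially means replacing the emptyDir config volumes in the node StatefulSet with per-pod PVCs. A rough sketch, with illustrative volume names and sizes (these are not chart values):

```yaml
# Hypothetical StatefulSet fragment: give each pod a small PVC for the redis and
# sentinel config directories instead of the default emptyDir volumes.
volumeClaimTemplates:
  - metadata:
      name: redis-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Mi
  - metadata:
      name: sentinel-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Mi
# ...mounted where the chart previously mounted the emptyDir volumes:
#   volumeMounts:
#     - name: redis-conf
#       mountPath: /opt/bitnami/redis/etc
#     - name: sentinel-conf
#       mountPath: /opt/bitnami/redis-sentinel/etc
```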
Thanks, I'll give that a try if we experience further instability.
Chart - appVersion: 6.0.12, version: 12.8.3. I am using Docker Desktop for Windows, which comes with a one-node Kubernetes cluster that can be enabled. With the setup I specified above, if I delete the master node pod, another one pops back up and communication between the slave nodes and the master is successful. However, if I shut the one-node cluster down and power it back on, I receive the same error as reported above, where all of the pods are in CrashLoopBackOff. I have tried the suggestion of creating persistent volumes for the redis and sentinel configs that @robertb724 mentions, but the success was inconsistent: it worked the first time after a reboot, but failed every other time I have tried. Commenting just to stay up to date on this issue.
Hi all, I've just re-opened the internal task so we can work on a proper fix. Unfortunately I cannot give any ETA due to Easter vacations. If you happen to find any issues in the meantime and would like to contribute a fix, feel free to send a PR; we'd be glad to help with the review/release process.
@marcosbc Now that Easter is out of the way, is it possible to give an ETA on this? Thanks
We have found that pairing |
Thanks for the info @robertb724! I will forward this to the rest of the engineering team.
Not ours :(
Hi @bm1216, could you add more information about your case?
I am also not seeing success with this solution. My situation is described in my previous post above on Mar. 29th, 2021.
We have experienced this in one of our clusters, and it would seem that a hostname is not being populated somewhere, but after a cursory look through the charts, I cannot quite pin down where. If you execute a redis command with a blank hostname, you will get the same error:
So it would appear that somewhere a redis hostname is not resolving correctly. My hunch is that the redis container attempts to interrogate the sentinels before quorum has been reached, which results in a null hostname; it then attempts to connect, which triggers the container to exit and the liveness checks to fail, at which point k8s collects the whole pod, killing off the sentinel sidecar before quorum is reached... only to start the cycle all over again when it is redeployed.
Hi @pippy44, there is a new version of Redis; what version are you using?
One thing to note regarding our setup that I have not mentioned: we made adjustments to the script so that all references to sentinel/redis use the headless service entry rather than the IP.
I'm using app version 6.2.4, chart redis-14.6.1, and still seeing this issue.
sentinel:
redis:
Hi @zhx828, I will add this information to the internal task; we will update the thread when we have more information.
@qeternity which bitnami/redis version are you using? I'm not specifying the redis version in my deployment, so it should be the default version from bitnami.
@zhx828 we are using chart version 14.1.0 and redis/sentinel version 6.2.2.
@qeternity Still seeing the issue with these versions after restarting Docker, same error log. I saw you posted last month about a similar issue; was it just downgrading the version that helped?
I've been facing this issue for ages on my k8s cluster, and I never realized that I did not have persistence enabled for the sentinels; that could help a lot to retain configuration. I will test that and report back.
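A minimal values fragment for that experiment might look like the sketch below; the persistence-related keys vary between chart versions (a sentinel-level toggle only exists in newer releases), so treat the key names as assumptions and verify them against `helm show values bitnami/redis` for the version in use.

```yaml
# Hypothetical values.yaml fragment; key names are assumptions and may not exist
# under these exact paths in every chart version.
sentinel:
  enabled: true
  persistence:
    enabled: true    # persist the generated sentinel configuration, if supported
persistence:
  enabled: true      # persist redis data and configuration across pod restarts
```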
Which chart:
bitnami/redis redis-12.6.2 app version 6.0.10
Describe the bug
After rebooting all of the K8S worker nodes, the redis pods are always in CrashLoopBackOff status.
To Reproduce
Steps to reproduce the behavior:
```
NAME                   READY   STATUS             RESTARTS   AGE
mycache-redis-node-0   1/3     CrashLoopBackOff   19         30m
mycache-redis-node-1   1/3     CrashLoopBackOff   19         30m
mycache-redis-node-2   1/3     CrashLoopBackOff   19         30m
```
Version of Helm and Kubernetes:
helm version
kubectl version
Additional context