[bitnami/redis] All pods in CrashLoopBackOff status after rebooting worker nodes #5181

Closed
wantdrink opened this issue Jan 22, 2021 · 40 comments · Fixed by #7278

@wantdrink

Which chart:
bitnami/redis redis-12.6.2 app version 6.0.10

Describe the bug
After rebooting all of the K8S worker nodes, the redis pods stay in CrashLoopBackOff status.

To Reproduce
Steps to reproduce the behavior:

  1. Set the redis image to 5.0.10-debian-10-r81 and the sentinel image to 5.0.10-debian-10-r78.
  2. cluster = true, slaveCount=3
  3. helm install mycache bitnami/redis -n test -f values.yaml (a values.yaml sketch follows the list)
  4. Test and confirm everything works fine, then reboot all K8S worker nodes.
  5. All 3 redis pods stay in CrashLoopBackOff status:
    NAME READY STATUS RESTARTS AGE
    mycache-redis-node-0 1/3 CrashLoopBackOff 19 30m
    mycache-redis-node-1 1/3 CrashLoopBackOff 19 30m
    mycache-redis-node-2 1/3 CrashLoopBackOff 19 30m
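
A minimal values.yaml sketch matching the steps above (key names as documented for the 12.x chart; treat them as assumptions and verify against your chart version):

image:
  tag: 5.0.10-debian-10-r81     # redis image from step 1
sentinel:
  enabled: true
  image:
    tag: 5.0.10-debian-10-r78   # sentinel image from step 1
cluster:
  enabled: true                 # "cluster = true" from step 2
  slaveCount: 3                 # "slaveCount=3" from step 2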

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.2.4", GitCommit:"0ad800ef43d3b826f31a5ad8dfbb4fe05d143688", GitTreeState:"clean", GoVersion:"go1.13.12"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:47:41Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:39:24Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Additional context

kubectl describe pod mycache-redis-node-x:

Events:
  Type     Reason                  Age                    From                  Message
  ----     ------                  ----                   ----                  -------
  Normal   Scheduled               <unknown>              default-scheduler     Successfully assigned test/mycache-redis-node-0 to app3.jh.com
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis:5.0.10-debian-10-r81" already present on machine
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container redis
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container sentinel
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container redis
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container sentinel
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-exporter:1.15.1-debian-10-r2" already present on machine
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container metrics
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container metrics
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-sentinel:5.0.10-debian-10-r78" already present on machine
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d871e82c81b6d1ed1d45514f1d1942ed37b2e1316bca7f90a340cd449c774f36" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a99244ae1b247a0e9a855266fe70c2d715a3faddba7d0779f88822d90f1f3f38" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          25m (x4 over 25m)      kubelet, app3.jh.com  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cadd3e0edeef3b1672c83a83cd3a8bcda35fdf9942e333478b6a77fbe6047afd" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Normal   Started                 25m                    kubelet, app3.jh.com  Started container metrics
  Normal   Created                 25m                    kubelet, app3.jh.com  Created container metrics
  Normal   Pulled                  25m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-exporter:1.15.1-debian-10-r2" already present on machine
  Normal   Pulled                  25m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-sentinel:5.0.10-debian-10-r78" already present on machine
  Normal   Started                 25m                    kubelet, app3.jh.com  Started container sentinel
  Normal   Created                 25m                    kubelet, app3.jh.com  Created container sentinel
  Warning  BackOff                 24m (x3 over 24m)      kubelet, app3.jh.com  Back-off restarting failed container
  Normal   Started                 24m (x2 over 25m)      kubelet, app3.jh.com  Started container redis
  Normal   Created                 24m (x2 over 25m)      kubelet, app3.jh.com  Created container redis
  Normal   Pulled                  24m (x2 over 25m)      kubelet, app3.jh.com  Container image "docker.io/bitnami/redis:5.0.10-debian-10-r81" already present on machine
  Warning  BackOff                 4m51s (x111 over 24m)  kubelet, app3.jh.com  Back-off restarting failed container

kubectl logs mycache-redis-node-0 redis

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

kubectl logs mycache-redis-node-0 sentinel

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

kubectl logs mycache-redis-node-0 metrics

time="2021-01-22T02:25:11Z" level=info msg="Redis Metrics Exporter v1.15.1    build date: 2021-01-12-02:52:00    sha1: 5bbffe05d7ba4347d6ffc482b70b321abab32209    Go: go1.15.6    GOOS: linux    GOARCH: amd64"
time="2021-01-22T02:25:11Z" level=info msg="Providing metrics at :9121/metrics"
time="2021-01-22T02:25:29Z" level=error msg="Couldn't connect to redis instance"
time="2021-01-22T02:26:29Z" level=error msg="Couldn't connect to redis instance"
time="2021-01-22T02:27:29Z" level=error msg="Couldn't connect to redis instance"
......
@wantdrink
Author

wantdrink commented Jan 22, 2021

I rebooted again after 1 hour (to test another chart, bitnami/rabbitmq), and now all redis pods are Running.
The bitnami/rabbitmq pods also recovered automatically: one pod was running first, and the other two changed from 0/1 Running to 1/1 after more than 10 minutes.

Then I tested again by rebooting the worker nodes; this time the redis pods stayed in CrashLoopBackOff, while the rabbitmq pods changed to Running after several minutes.

So if the pods are stuck in CrashLoopBackOff after rebooting, is there any way to recover them manually without rebooting again?
Also, is it possible to decrease the time it takes the bitnami/rabbitmq pods to recover?

Thanks.

@dani8art
Contributor

Hi @wantdrink, thanks for the info. It looks closely related to the linked issue; we will check whether we can reproduce it in our local environments and figure out the cause.

@wantdrink
Author

Thanks @dani8art. #1544 looks the same.

@dani8art
Contributor

dani8art commented Jan 26, 2021

Hi,

We have opened an internal task to investigate this error further; unfortunately, we cannot give you an ETA.

BTW, if you run more tests on this, any findings or research results are welcome.

Thanks for the feedback!

@wantdrink
Author

Thank you @dani8art.
Are there any specific logs that might be helpful for troubleshooting? I'll test and try to grab them.

@corey-hammerton

corey-hammerton commented Jan 30, 2021

We have a similar issue with a number of Redis sentinel deployments in our GKE cluster. We have a chaos test running that randomly kills pods every couple of minutes; eventually it kills the pod serving as the Redis master, and that soon leaves the other pods in an Error/CrashLoopBackOff state. Below is a brief snippet of the logs from one of our pods when it sees the master go down (IPs are ephemeral, service names changed for security).

This cluster is running on image bitnami/redis:4.0.14

redis		1:S 30 Jan 12:12:57.366 # Connection with master lost.
redis		1:S 30 Jan 12:12:57.366 * Caching the disconnected master state.
redis		1:S 30 Jan 12:12:57.827 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:12:57.826 * Connecting to MASTER 10.8.23.141:6379
sentinel	1:X 30 Jan 12:13:12.419 # +sdown master my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.481 # +sdown sentinel a7be8b915634304ba3d164b382204822a7cc35c1 10.8.23.141 26379 @ my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.481 # +sdown sentinel 5a7dca1bf749b69d90e32ef1384b7f15428890cb 10.8.24.136 26379 @ my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.557 # +sdown slave 10.8.24.136:6379 10.8.24.136 6379 @ my_redis_service 10.8.23.141 6379
redis		1:S 30 Jan 12:13:29.772 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:13:29.874 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:13:29.874 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:01.516 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:14:01.926 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:14:01.926 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:33.772 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:14:33.985 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:14:33.985 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:37.129 # Error condition on socket for SYNC: No route to host
redis		1:S 30 Jan 12:14:37.993 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:37.993 * Connecting to MASTER 10.8.23.141:6379

@javsalgar
Contributor

Hi,

Could you share the code of your chaos testing? It would be great for us for the investigation.

@harroguk

Unfortunately the tool is an in-house application and cannot be shared (the public equivalent would be chaos-mesh carrying out a pod-kill action).

@corey-hammerton

@javsalgar what he said.

Be advised, we are experimenting with setting sentinel.staticID and running similar chaos tests

@migruiz4
Member

Hi,

We are aware that Redis Sentinel has issues recovering from failures affecting the master node when Kubernetes has to redeploy the pod on a different node.

We are currently working on it and will get back to you as soon as possible.

Sorry for the inconvenience.

migruiz4 added the on-hold label (Issues or Pull Requests with this label will never be considered stale) on Feb 22, 2021
@harroguk

Is there any sort of timeline on this?
Days, Weeks, Months, Years?

Just trying to manage my expectations :P

@migruiz4
Member

We are currently working on a fix, but I don't have an ETA as it will probably require major changes in the Redis chart.

I will keep you updated on any changes.

@rafariossaa
Contributor

Hi,
A new version of the chart was released.
Could you give it a try and check whether it fixes the issue for you?

@wantdrink
Author

Hi @rafariossaa,
I tested with

chart version: redis-12.8.0 app version 6.0.11
image: redis 5.0.10-debian-10-r81, sentinel 5.0.10-debian-10-r78

after rebooting all worker nodes, and the error logs are the same as before:
sentinel:

Could not connect to Redis at 10.244.3.7:26379: Connection refused
 08:19:59.19 INFO  ==> Sentinels clean up done
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

redis:

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

metrics:

time="2021-02-27T07:52:10Z" level=info msg="Redis Metrics Exporter v1.17.1    build date: 2021-02-20-13:14:11    sha1: 39f8ddd5c6bd6e8a14f37779e4899aa884d8a201    Go: go1.16    GOOS: linux    GOARCH: amd64"
time="2021-02-27T07:52:10Z" level=info msg="Providing metrics at :9121/metrics"
time="2021-02-27T07:53:20Z" level=error msg="Couldn't connect to redis instance"

@migruiz4
Member

migruiz4 commented Mar 3, 2021

Hi @wantdrink,

Thank you for your feedback.

We will continue investigating this issue, as it seems to persist after our fix attempt.

@wantdrink
Author

Thank you @migruiz4 .

@meseta

meseta commented Mar 20, 2021

Just to add, we are also experiencing this on GKE. We are using preemptible instances, which get preempted at least once a day, so the situation is similar to @corey-hammerton's chaos testing. We're using chart 12.8.3.

Here are sentinel logs during testing with a cluster of 4 with quorum 2 (I know quorum should normally be higher). When two nodes are removed, the remaining 2 nodes should still have quorum; however, failover fails to happen.

1:X 19 Mar 2021 19:22:14.633 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 19 Mar 2021 19:22:14.634 # Redis version=6.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 19 Mar 2021 19:22:14.634 # Configuration loaded
1:X 19 Mar 2021 19:22:14.635 * Running mode=sentinel, port=26379.
1:X 19 Mar 2021 19:22:14.644 # Sentinel ID is d8853bd87946558c814cb0c1ef5407d04b24c494
1:X 19 Mar 2021 19:22:14.645 # +monitor master mymaster 192.168.0.48 6379 quorum 2
1:X 19 Mar 2021 19:22:14.649 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:14.656 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:14.664 * +slave slave 192.168.1.5:6379 192.168.1.5 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:15.111 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:15.128 # +new-epoch 1
1:X 19 Mar 2021 19:22:15.534 * +sentinel sentinel 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 192.168.3.24 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:16.133 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.778 # +sdown sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.232 # +sdown slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.232 # +sdown sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.671 # +sdown master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.762 # +odown master mymaster 192.168.0.48 6379 #quorum 2/2 
1:X 19 Mar 2021 20:16:13.762 # +new-epoch 2 
1:X 19 Mar 2021 20:16:13.762 # +try-failover master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.771 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 2 
1:X 19 Mar 2021 20:16:13.794 # 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 voted for d8853bd87946558c814cb0c1ef5407d04b24c494 2 
1:X 19 Mar 2021 20:16:23.990 # -failover-abort-not-elected master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:24.062 # Next failover delay: I will not start a failover before Fri Mar 19 20:16:49 2021 
1:X 19 Mar 2021 20:16:49.957 # +new-epoch 3 
1:X 19 Mar 2021 20:16:49.957 # +try-failover master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:49.967 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 3 
1:X 19 Mar 2021 20:16:49.988 # 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 voted for d8853bd87946558c814cb0c1ef5407d04b24c494 3 
1:X 19 Mar 2021 20:17:00.144 # -failover-abort-not-elected master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:17:00.203 # Next failover delay: I will not start a failover before Fri Mar 19 20:17:26 2021 
1:X 19 Mar 2021 20:17:26.186 # +new-epoch 4 

Other sentinel:

1:X 19 Mar 2021 19:20:39.726 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
--
1:X 19 Mar 2021 19:20:39.729 # Redis version=6.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 19 Mar 2021 19:20:39.730 # Configuration loaded
1:X 19 Mar 2021 19:20:39.731 * Running mode=sentinel, port=26379.
1:X 19 Mar 2021 19:20:39.751 # Sentinel ID is 0e890f56b5aaddc6cf964888a8f5060607d4f2f0
1:X 19 Mar 2021 19:20:39.751 # +monitor master mymaster 192.168.0.48 6379 quorum 2
1:X 19 Mar 2021 19:20:39.756 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:39.768 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:40.114 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:40.137 # +new-epoch 1
1:X 19 Mar 2021 19:20:40.552 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:21:59.545 # +reset-master master mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:21:59.838 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.048 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.129 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.140 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.147 * +slave slave 192.168.1.5:6379 192.168.1.5 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:16.643 * +sentinel sentinel d8853bd87946558c814cb0c1ef5407d04b24c494 192.168.1.5 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.712 # +sdown master mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.712 # +sdown slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.136 # +sdown sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.136 # +sdown sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.784 # +new-epoch 2
1:X 19 Mar 2021 20:16:13.793 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 2
1:X 19 Mar 2021 20:16:13.848 # +odown master mymaster 192.168.0.48 6379 #quorum 2/2
1:X 19 Mar 2021 20:16:13.849 # Next failover delay: I will not start a failover before Fri Mar 19 20:16:50 2021
1:X 19 Mar 2021 20:16:49.979 # +new-epoch 3
1:X 19 Mar 2021 20:16:49.987 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 3
1:X 19 Mar 2021 20:16:49.994 # Next failover delay: I will not start a failover before Fri Mar 19 20:17:26 2021
1:X 19 Mar 2021 20:17:26.179 # +new-epoch 4

Logging in with the CLI, both sentinels have empty replica lists, while both remaining replicas are still pointed at master 192.168.0.48, which is now down.

I'll continue to try to figure out what's happening, and post here if I find anything interesting.

edit: after testing for a while, I've found a couple of times that a replica fails to get reconfigured to the new master. This seems consistent with the issue above, where sentinels somehow "forget" about replicas. In the worst case, they forget ALL of the replicas and have nothing to promote. In slightly less severe cases, they forget one replica, which effectively drops out of the cluster. (In those cases the node still reports as healthy to Kubernetes, so there's no easy indication of the problem without external monitoring tools.)

I've gone as far as I can with debugging this. I feel there's something odd about how the sentinels forget replicas, which seems to be close to the root of the issue. I'm going to try to work around it on our end by increasing the cluster size and adding some monitoring tools, and perhaps patching the readiness check to test whether redis is a master or is linked to one.
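
For illustration, such a readiness check could be as simple as the following (a sketch only; the REDIS_PASSWORD variable is an assumption, and this is not something the chart currently ships):

#!/bin/bash
# Sketch of a stricter readiness probe: succeed only if this node is a master,
# or a replica whose link to its master is up.
args=(-h 127.0.0.1 -p 6379)
[ -n "$REDIS_PASSWORD" ] && args+=(-a "$REDIS_PASSWORD")  # assumed env var
info="$(redis-cli "${args[@]}" info replication)"
echo "$info" | grep -q '^role:master' && exit 0
echo "$info" | grep -q '^master_link_status:up' && exit 0
exit 1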

@migruiz4
Member

Hi @meseta,

Thank you for the additional information.

Following up on this, we continue investigating this looking for a permanent fix, we will keep you updated.

@meseta

meseta commented Mar 25, 2021

thanks! I'd like to add that we increased the cluster size to 5 (quorum 3), reduced downAfterMilliseconds to 10000, and switched from soft anti-affinity to hard anti-affinity, and we haven't observed the described issues since those changes, so we'll be running in this configuration for a while. Perhaps the problem is more likely to happen when two redis nodes fail in quick succession, which our changes have made less likely? I hope that helps. Good luck!
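
For reference, that configuration corresponds to chart values roughly like the following (a hedged sketch: sentinel.quorum and sentinel.downAfterMilliseconds are documented chart values, while the anti-affinity key names vary between chart versions and should be verified):

sentinel:
  enabled: true
  quorum: 3
  downAfterMilliseconds: 10000
cluster:
  enabled: true
  slaveCount: 5
# Hard pod anti-affinity; exact key names depend on the chart version.
master:
  podAntiAffinityPreset: hard
slave:
  podAntiAffinityPreset: hard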

@robertb724

@migruiz4 @meseta We were experiencing the same issue on GKE. We were able to reproduce it by manually deleting the VM that the master node was running on; perhaps you could try that to confirm. I tested switching the redis (/opt/bitnami/redis/etc) and sentinel (/opt/bitnami/redis-sentinel/etc) config directories from emptyDir to persistent volumes, and the issue was resolved. I have not been able to break the cluster since making this change.
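
As an illustration only, that change amounts to a StatefulSet fragment along these lines (the volume names here are hypothetical; the chart's actual volume names, and whether they can be overridden cleanly, depend on the chart version):

# Hypothetical fragment: persist the generated config instead of emptyDir.
volumeClaimTemplates:
  - metadata:
      name: redis-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Mi
  - metadata:
      name: sentinel-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Mi
# ...with the corresponding volumeMounts in the redis and sentinel containers:
#   - name: redis-conf
#     mountPath: /opt/bitnami/redis/etc
#   - name: sentinel-conf
#     mountPath: /opt/bitnami/redis-sentinel/etc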

@meseta

meseta commented Mar 26, 2021

thanks, I'll give that a try if we experience further instability

@pippy44

pippy44 commented Mar 29, 2021

Chart - appVersion: 6.0.12, version: 12.8.3
Values - cluster.enabled=true, sentinel.enabled=true, cluster.slaveCount=3

I'm using Docker Desktop for Windows, which comes with a one-node Kubernetes cluster that can be enabled. With the setup I specified above, if I delete the master pod, another one pops back up and communication between the slave nodes and the master is successful. However, if I shut the one-node cluster down and power it back on, I receive the same error as reported above, where all of the pods are in CrashLoopBackOff. I have tried the suggestion of creating persistent volumes for the redis and sentinel configs that @robertb724 mentions, but success was inconsistent: it worked the first time after a reboot, but failed every other time I have tried. Commenting just to stay up to date on this issue.

@robertb724

@pippy44 Yes, just this morning our cluster went down again after nodes were preempted. The persistent volume seems to help, but there are more issues. Will keep looking for improvements.

@migruiz4 @meseta

@marcosbc
Contributor

marcosbc commented Apr 1, 2021

Hi all, I've just re-opened the internal task so we can look for a proper fix. Unfortunately, I cannot give any ETA due to Easter vacations.

If you happen to find any issues in the meantime and would like to contribute a fix, feel free to send a PR, we'd be glad to help with the review/release process.

@harroguk

@marcosbc Now that Easter is out of the way, is it possible to give an ETA on this?

Thanks

@robertb724

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems
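
For reference, the static-ID half of that is a regular chart value, so something along these lines should enable it on an existing release (release name and namespace below are taken from the original report as placeholders; verify the exact key, sentinel.staticID, against your chart version). The persistent config volumes are the manual StatefulSet change sketched in the earlier comment.

helm upgrade mycache bitnami/redis -n test --reuse-values \
  --set sentinel.staticID=true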

@javsalgar
Contributor

Thanks for the info @robertb724 ! I will forward this to the rest of the engineering team

@bm1216

bm1216 commented May 12, 2021

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

Not ours :(

@Mauraza
Contributor

Mauraza commented May 13, 2021

Hi @bm1216,

Could you add more information about your case?

@pippy44

pippy44 commented May 13, 2021

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

Not ours :(

I am also not seeing success with this solution. My situation is described in my previous post above on Mar. 29th, 2021.

@qeternity

qeternity commented May 14, 2021

We have experienced this in one of our clusters, and it would seem that a hostname is not being populated somewhere, but after a cursory look through the charts, I cannot quite pin down where. If you execute a redis command with a blank hostname, you get the same error:

I have no name!@redis-sentinel-backend-node-0:/$ redis-cli -h -p 6379 ping
Could not connect to Redis at -p:6379: Name or service not known

So it would appear that somewhere a redis hostname is not resolving correctly. My hunch is that the redis container interrogates the sentinels before quorum has been reached, which yields an empty hostname; the connection attempt then fails, the container exits, the liveness checks fail, and k8s collects the whole pod, killing the sentinel sidecar before quorum is reached... only to start the cycle all over again when the pod is redeployed.
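
In other words, if whatever variable a startup script interpolates into -h expands to empty (the variable name below is purely illustrative), the same error is reproduced:

# An empty, unquoted host variable makes redis-cli consume "-p" as the hostname:
REDIS_MASTER_HOST=""
redis-cli -h $REDIS_MASTER_HOST -p 6379 ping
# Could not connect to Redis at -p:6379: Name or service not known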

@Mauraza
Contributor

Mauraza commented May 17, 2021

Hi @pippy44,

There is a new version of Redis. What version are you using?

@robertb724

One thing to note regarding our setup that I have not mentioned: we made adjustments to the script so that all references to sentinel/redis use the headless service entry rather than the IP.

@zhx828

zhx828 commented Jun 25, 2021

I'm using app version 6.2.4 (chart redis-14.6.1) and am still seeing this issue.
The error logs are the same as wantdrink's:

sentinel

Could not connect to Redis at 10.1.0.186:26379: Connection refused

redis

26379: Connection refused
Could not connect to Redis at -p:6379: Name or service not known

@Mauraza
Contributor

Mauraza commented Jun 28, 2021

Hi @zhx828,

I will add this information to the internal task, we will update the thread when we have more information.

@qeternity

@Mauraza @zhx828 we are pinned to 14.1.0 and have not seen this issue...can you try using that version? Perhaps there has been a regression in more recent versions.

@zhx828

zhx828 commented Jun 28, 2021

@qeternity which bitnami/redis version are you using? I'm not specifying the redis version in my deployment; it should be the default version from bitnami.

@qeternity

@zhx828 we are using chart version 14.1.0 and redis/sentinel version 6.2.2

@zhx828

zhx828 commented Jun 28, 2021

@qeternity Still seeing the issue with these versions after restarting Docker. Same error log. I saw you posted last month about a similar issue. Was it just downgrading the version that helped?

@vfiset

vfiset commented May 16, 2022

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

I've been facing this issue for ages on my k8s cluster and never realized that I did not have persistence enabled for the sentinels; that could help a lot with retaining configuration. I will test that and report back.
