[bitnami/redis] All pods in CrashLoopBackOff status after rebooting worker nodes #5181

Closed
wantdrink opened this issue Jan 22, 2021 · 40 comments · Fixed by #7278

@wantdrink

Which chart:
bitnami/redis redis-12.6.2 app version 6.0.10

Describe the bug
After rebooting all of the K8S worker nodes, the redis pods stay in CrashLoopBackOff status.

To Reproduce
Steps to reproduce the behavior:

  1. Set the redis image to 5.0.10-debian-10-r81 and the sentinel image to 5.0.10-debian-10-r78.
  2. cluster = true, slaveCount=3
  3. helm install mycache bitnami/redis -n test -f values.yaml (a values.yaml sketch follows the list)
  4. Test and confirm everything works fine, then reboot all K8S worker nodes.
  5. All 3 redis pods stay in CrashLoopBackOff status:
    NAME READY STATUS RESTARTS AGE
    mycache-redis-node-0 1/3 CrashLoopBackOff 19 30m
    mycache-redis-node-1 1/3 CrashLoopBackOff 19 30m
    mycache-redis-node-2 1/3 CrashLoopBackOff 19 30m
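
A minimal values.yaml sketch matching the steps above (key names as documented for the 12.x chart; treat them as assumptions and verify against your chart version):

image:
  tag: 5.0.10-debian-10-r81     # redis image from step 1
sentinel:
  enabled: true
  image:
    tag: 5.0.10-debian-10-r78   # sentinel image from step 1
cluster:
  enabled: true                 # "cluster = true" from step 2
  slaveCount: 3                 # "slaveCount=3" from step 2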

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.2.4", GitCommit:"0ad800ef43d3b826f31a5ad8dfbb4fe05d143688", GitTreeState:"clean", GoVersion:"go1.13.12"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:47:41Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:39:24Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Additional context

kubectl describe pod mycache-redis-node-x:

Events:
  Type     Reason                  Age                    From                  Message
  ----     ------                  ----                   ----                  -------
  Normal   Scheduled               <unknown>              default-scheduler     Successfully assigned test/mycache-redis-node-0 to app3.jh.com
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis:5.0.10-debian-10-r81" already present on machine
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container redis
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container sentinel
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container redis
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container sentinel
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-exporter:1.15.1-debian-10-r2" already present on machine
  Normal   Created                 32m                    kubelet, app3.jh.com  Created container metrics
  Normal   Started                 32m                    kubelet, app3.jh.com  Started container metrics
  Normal   Pulled                  32m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-sentinel:5.0.10-debian-10-r78" already present on machine
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d871e82c81b6d1ed1d45514f1d1942ed37b2e1316bca7f90a340cd449c774f36" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a99244ae1b247a0e9a855266fe70c2d715a3faddba7d0779f88822d90f1f3f38" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          25m (x4 over 25m)      kubelet, app3.jh.com  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  25m                    kubelet, app3.jh.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cadd3e0edeef3b1672c83a83cd3a8bcda35fdf9942e333478b6a77fbe6047afd" network for pod "mycache-redis-node-0": networkPlugin cni failed to set up pod "mycache-redis-node-0_test" network: open /run/flannel/subnet.env: no such file or directory
  Normal   Started                 25m                    kubelet, app3.jh.com  Started container metrics
  Normal   Created                 25m                    kubelet, app3.jh.com  Created container metrics
  Normal   Pulled                  25m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-exporter:1.15.1-debian-10-r2" already present on machine
  Normal   Pulled                  25m                    kubelet, app3.jh.com  Container image "docker.io/bitnami/redis-sentinel:5.0.10-debian-10-r78" already present on machine
  Normal   Started                 25m                    kubelet, app3.jh.com  Started container sentinel
  Normal   Created                 25m                    kubelet, app3.jh.com  Created container sentinel
  Warning  BackOff                 24m (x3 over 24m)      kubelet, app3.jh.com  Back-off restarting failed container
  Normal   Started                 24m (x2 over 25m)      kubelet, app3.jh.com  Started container redis
  Normal   Created                 24m (x2 over 25m)      kubelet, app3.jh.com  Created container redis
  Normal   Pulled                  24m (x2 over 25m)      kubelet, app3.jh.com  Container image "docker.io/bitnami/redis:5.0.10-debian-10-r81" already present on machine
  Warning  BackOff                 4m51s (x111 over 24m)  kubelet, app3.jh.com  Back-off restarting failed container

kubectl logs mycache-redis-node-0 redis

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

kubectl logs mycache-redis-node-0 sentinel

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

kubectl logs mycache-redis-node-0 metrics

time="2021-01-22T02:25:11Z" level=info msg="Redis Metrics Exporter v1.15.1    build date: 2021-01-12-02:52:00    sha1: 5bbffe05d7ba4347d6ffc482b70b321abab32209    Go: go1.15.6    GOOS: linux    GOARCH: amd64"
time="2021-01-22T02:25:11Z" level=info msg="Providing metrics at :9121/metrics"
time="2021-01-22T02:25:29Z" level=error msg="Couldn't connect to redis instance"
time="2021-01-22T02:26:29Z" level=error msg="Couldn't connect to redis instance"
time="2021-01-22T02:27:29Z" level=error msg="Couldn't connect to redis instance"
......
@wantdrink
Author

wantdrink commented Jan 22, 2021

I rebooted again after 1 hour (to test another chart, bitnami/rabbitmq), and now all redis pods are Running.
The bitnami/rabbitmq pods also recovered automatically: one pod was running first, and the other two changed from 0/1 Running to 1/1 after more than 10 minutes.

Then I tested again by rebooting the worker nodes; this time the redis pods stayed in CrashLoopBackOff, while the rabbitmq pods changed to Running after several minutes.

So if the pods are stuck in CrashLoopBackOff after rebooting, is there any way to recover them manually without rebooting again?
Also, is it possible to decrease the time it takes the bitnami/rabbitmq pods to recover?

Thanks.

@dani8art
Contributor

Hi @wantdrink, thanks for the info. It looks closely related to the linked issue; we will check whether we can reproduce it in our local environments and figure out the cause.

@wantdrink
Author

Thanks @dani8art. #1544 looks the same.

@dani8art
Contributor

dani8art commented Jan 26, 2021

Hi,

We have opened an internal task to investigate this error further; unfortunately, we cannot give you an ETA.

BTW, if you run more tests on this, any findings or research results are welcome.

Thanks for the feedback!

@wantdrink
Author

Thank you @dani8art.
Are there any specific logs that might be helpful for troubleshooting? I'll test and try to grab them.

@corey-hammerton

corey-hammerton commented Jan 30, 2021

We have a similar issue with a number of Redis sentinel deployments in our GKE cluster. We have a chaos test running that randomly kills pods every couple of minutes; eventually it kills the pod serving as the Redis master, and that soon leaves the other pods in an Error/CrashLoopBackOff state. Below is a brief snippet of the logs from one of our pods when it sees the master go down (IPs are ephemeral, service names changed for security).

This cluster is running on image bitnami/redis:4.0.14

redis		1:S 30 Jan 12:12:57.366 # Connection with master lost.
redis		1:S 30 Jan 12:12:57.366 * Caching the disconnected master state.
redis		1:S 30 Jan 12:12:57.827 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:12:57.826 * Connecting to MASTER 10.8.23.141:6379
sentinel	1:X 30 Jan 12:13:12.419 # +sdown master my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.481 # +sdown sentinel a7be8b915634304ba3d164b382204822a7cc35c1 10.8.23.141 26379 @ my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.481 # +sdown sentinel 5a7dca1bf749b69d90e32ef1384b7f15428890cb 10.8.24.136 26379 @ my_redis_service 10.8.23.141 6379
sentinel	1:X 30 Jan 12:13:12.557 # +sdown slave 10.8.24.136:6379 10.8.24.136 6379 @ my_redis_service 10.8.23.141 6379
redis		1:S 30 Jan 12:13:29.772 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:13:29.874 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:13:29.874 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:01.516 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:14:01.926 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:14:01.926 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:33.772 # Error condition on socket for SYNC: Connection timed out
redis		1:S 30 Jan 12:14:33.985 * Connecting to MASTER 10.8.23.141:6379
redis		1:S 30 Jan 12:14:33.985 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:37.129 # Error condition on socket for SYNC: No route to host
redis		1:S 30 Jan 12:14:37.993 * MASTER <-> SLAVE sync started
redis		1:S 30 Jan 12:14:37.993 * Connecting to MASTER 10.8.23.141:6379

@javsalgar
Contributor

Hi,

Could you share the code of your chaos testing? It would be great for us for the investigation.

@harroguk

Unfortunately the tool is an in-house application and cannot be shared (the public equivalent would be chaos-mesh carrying out a pod-kill action).

@corey-hammerton

@javsalgar what he said.

Be advised, we are experimenting with setting sentinel.staticID and running similar chaos tests

@migruiz4
Member

Hi,

We are aware that Redis Sentinel has issues recovering from failures affecting the master node when Kubernetes has to redeploy the pod on a different node.

We are currently working on it and will get back to you as soon as possible.

Sorry for the inconvenience.

migruiz4 added the on-hold label (Issues or Pull Requests with this label will never be considered stale) on Feb 22, 2021
@harroguk

Is there any sort of timeline on this?
Days, Weeks, Months, Years?

Just trying to manage my expectations :P

@migruiz4
Member

We are currently working on a fix, but I don't have an ETA as it will probably require major changes in the Redis chart.

I will keep you updated on any changes.

@rafariossaa
Contributor

Hi,
A new version of the chart was released.
Could you give it a try and check whether it fixes the issue for you?

@wantdrink
Author

Hi @rafariossaa,
I tested with

chart version: redis-12.8.0 app version 6.0.11
image: redis 5.0.10-debian-10-r81, sentinel 5.0.10-debian-10-r78

after rebooting all worker nodes, and the error logs are the same as before:
sentinel:

Could not connect to Redis at 10.244.3.7:26379: Connection refused
 08:19:59.19 INFO  ==> Sentinels clean up done
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

redis:

Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at mycache-redis.test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

metrics:

time="2021-02-27T07:52:10Z" level=info msg="Redis Metrics Exporter v1.17.1    build date: 2021-02-20-13:14:11    sha1: 39f8ddd5c6bd6e8a14f37779e4899aa884d8a201    Go: go1.16    GOOS: linux    GOARCH: amd64"
time="2021-02-27T07:52:10Z" level=info msg="Providing metrics at :9121/metrics"
time="2021-02-27T07:53:20Z" level=error msg="Couldn't connect to redis instance"

@migruiz4
Member

migruiz4 commented Mar 3, 2021

Hi @wantdrink,

Thank you for your feedback.

We will continue investigating this issue, as it seems to persist after our fix attempt.

@wantdrink
Author

Thank you @migruiz4 .

@meseta

meseta commented Mar 20, 2021

Just to add, we are also experiencing this on GKE. We are using preemptible instances, which get preempted at least once a day, so the situation is similar to @corey-hammerton's chaos testing. We're using chart 12.8.3.

Here are sentinel logs during testing with a cluster of 4 with quorum 2 (I know quorum should normally be higher). When two nodes are removed, the remaining 2 nodes should still have quorum; however, failover fails to happen.

1:X 19 Mar 2021 19:22:14.633 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 19 Mar 2021 19:22:14.634 # Redis version=6.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 19 Mar 2021 19:22:14.634 # Configuration loaded
1:X 19 Mar 2021 19:22:14.635 * Running mode=sentinel, port=26379.
1:X 19 Mar 2021 19:22:14.644 # Sentinel ID is d8853bd87946558c814cb0c1ef5407d04b24c494
1:X 19 Mar 2021 19:22:14.645 # +monitor master mymaster 192.168.0.48 6379 quorum 2
1:X 19 Mar 2021 19:22:14.649 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:14.656 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:14.664 * +slave slave 192.168.1.5:6379 192.168.1.5 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:15.111 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:15.128 # +new-epoch 1
1:X 19 Mar 2021 19:22:15.534 * +sentinel sentinel 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 192.168.3.24 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:16.133 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.778 # +sdown sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.232 # +sdown slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.232 # +sdown sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.671 # +sdown master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.762 # +odown master mymaster 192.168.0.48 6379 #quorum 2/2 
1:X 19 Mar 2021 20:16:13.762 # +new-epoch 2 
1:X 19 Mar 2021 20:16:13.762 # +try-failover master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:13.771 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 2 
1:X 19 Mar 2021 20:16:13.794 # 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 voted for d8853bd87946558c814cb0c1ef5407d04b24c494 2 
1:X 19 Mar 2021 20:16:23.990 # -failover-abort-not-elected master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:24.062 # Next failover delay: I will not start a failover before Fri Mar 19 20:16:49 2021 
1:X 19 Mar 2021 20:16:49.957 # +new-epoch 3 
1:X 19 Mar 2021 20:16:49.957 # +try-failover master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:16:49.967 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 3 
1:X 19 Mar 2021 20:16:49.988 # 0e890f56b5aaddc6cf964888a8f5060607d4f2f0 voted for d8853bd87946558c814cb0c1ef5407d04b24c494 3 
1:X 19 Mar 2021 20:17:00.144 # -failover-abort-not-elected master mymaster 192.168.0.48 6379 
1:X 19 Mar 2021 20:17:00.203 # Next failover delay: I will not start a failover before Fri Mar 19 20:17:26 2021 
1:X 19 Mar 2021 20:17:26.186 # +new-epoch 4 

Other sentinel:

1:X 19 Mar 2021 19:20:39.726 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
--
1:X 19 Mar 2021 19:20:39.729 # Redis version=6.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 19 Mar 2021 19:20:39.730 # Configuration loaded
1:X 19 Mar 2021 19:20:39.731 * Running mode=sentinel, port=26379.
1:X 19 Mar 2021 19:20:39.751 # Sentinel ID is 0e890f56b5aaddc6cf964888a8f5060607d4f2f0
1:X 19 Mar 2021 19:20:39.751 # +monitor master mymaster 192.168.0.48 6379 quorum 2
1:X 19 Mar 2021 19:20:39.756 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:39.768 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:40.114 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:20:40.137 # +new-epoch 1
1:X 19 Mar 2021 19:20:40.552 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:21:59.545 # +reset-master master mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:21:59.838 * +sentinel sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.048 * +sentinel sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.129 * +slave slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.140 * +slave slave 192.168.3.24:6379 192.168.3.24 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:00.147 * +slave slave 192.168.1.5:6379 192.168.1.5 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 19:22:16.643 * +sentinel sentinel d8853bd87946558c814cb0c1ef5407d04b24c494 192.168.1.5 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.712 # +sdown master mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:12.712 # +sdown slave 192.168.0.56:6379 192.168.0.56 6379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.136 # +sdown sentinel d6704bfc411c197d3a9c2fbecd9a71183999e756 192.168.0.56 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.136 # +sdown sentinel 0ed17fc118aa73ea3912d058a27664a87bc2c874 192.168.0.48 26379 @ mymaster 192.168.0.48 6379
1:X 19 Mar 2021 20:16:13.784 # +new-epoch 2
1:X 19 Mar 2021 20:16:13.793 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 2
1:X 19 Mar 2021 20:16:13.848 # +odown master mymaster 192.168.0.48 6379 #quorum 2/2
1:X 19 Mar 2021 20:16:13.849 # Next failover delay: I will not start a failover before Fri Mar 19 20:16:50 2021
1:X 19 Mar 2021 20:16:49.979 # +new-epoch 3
1:X 19 Mar 2021 20:16:49.987 # +vote-for-leader d8853bd87946558c814cb0c1ef5407d04b24c494 3
1:X 19 Mar 2021 20:16:49.994 # Next failover delay: I will not start a failover before Fri Mar 19 20:17:26 2021
1:X 19 Mar 2021 20:17:26.179 # +new-epoch 4

Logging in with the CLI, both sentinels have empty replica lists, while both remaining replicas are still pointed at master 192.168.0.48, which is now down.

I'll continue to try to figure out what's happening, and post here if I find anything interesting.

edit: after testing for a while, I've found a couple of times that a replica fails to get reconfigured to the new master. This seems consistent with the issue above, where sentinels somehow "forget" about replicas. In the worst case, they forget ALL of the replicas and have nothing to promote. In slightly less severe cases, they forget one replica, which effectively drops out of the cluster. (In those cases the node still reports as healthy to Kubernetes, so there's no easy indication of the problem without external monitoring tools.)

I've gone as far as I can with debugging this. I feel there's something odd about how the sentinels forget replicas, which seems to be close to the root of the issue. I'm going to try to work around it on our end by increasing the cluster size and adding some monitoring tools, and perhaps patching the readiness check to test whether redis is a master or is linked to one.
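
For illustration, such a readiness check could be as simple as the following (a sketch only; the REDIS_PASSWORD variable is an assumption, and this is not something the chart currently ships):

#!/bin/bash
# Sketch of a stricter readiness probe: succeed only if this node is a master,
# or a replica whose link to its master is up.
args=(-h 127.0.0.1 -p 6379)
[ -n "$REDIS_PASSWORD" ] && args+=(-a "$REDIS_PASSWORD")  # assumed env var
info="$(redis-cli "${args[@]}" info replication)"
echo "$info" | grep -q '^role:master' && exit 0
echo "$info" | grep -q '^master_link_status:up' && exit 0
exit 1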

@migruiz4
Member

Hi @meseta,

Thank you for the additional information.

Following up on this, we continue investigating this looking for a permanent fix, we will keep you updated.

@meseta

meseta commented Mar 25, 2021

thanks! I'd like to add that we increased the cluster size to 5 (quorum 3), reduced downAfterMilliseconds to 10000, and switched from soft anti-affinity to hard anti-affinity, and we haven't observed the described issues since those changes, so we'll be running in this configuration for a while. Perhaps the problem is more likely to happen when two redis nodes fail in quick succession, which our changes have made less likely? I hope that helps. Good luck!
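
For reference, that configuration corresponds to chart values roughly like the following (a hedged sketch: sentinel.quorum and sentinel.downAfterMilliseconds are documented chart values, while the anti-affinity key names vary between chart versions and should be verified):

sentinel:
  enabled: true
  quorum: 3
  downAfterMilliseconds: 10000
cluster:
  enabled: true
  slaveCount: 5
# Hard pod anti-affinity; exact key names depend on the chart version.
master:
  podAntiAffinityPreset: hard
slave:
  podAntiAffinityPreset: hard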

@robertb724

@migruiz4 @meseta We were experiencing the same issue on GKE. We were able to reproduce it by manually deleting the VM that the master node was running on; perhaps you could try that to confirm. I tested switching the redis (/opt/bitnami/redis/etc) and sentinel (/opt/bitnami/redis-sentinel/etc) config directories from emptyDir to persistent volumes, and the issue was resolved. I have not been able to break the cluster since making this change.
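
As an illustration only, that change amounts to a StatefulSet fragment along these lines (the volume names here are hypothetical; the chart's actual volume names, and whether they can be overridden cleanly, depend on the chart version):

# Hypothetical fragment: persist the generated config instead of emptyDir.
volumeClaimTemplates:
  - metadata:
      name: redis-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Mi
  - metadata:
      name: sentinel-conf
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Mi
# ...with the corresponding volumeMounts in the redis and sentinel containers:
#   - name: redis-conf
#     mountPath: /opt/bitnami/redis/etc
#   - name: sentinel-conf
#     mountPath: /opt/bitnami/redis-sentinel/etc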

@meseta

meseta commented Mar 26, 2021

thanks, I'll give that a try if we experience further instability

@pippy44

pippy44 commented Mar 29, 2021

Chart - appVersion: 6.0.12, version: 12.8.3
Values - cluster.enabled=true, sentinel.enabled=true, cluster.slaveCount=3

I'm using Docker Desktop for Windows, which comes with a one-node Kubernetes cluster that can be enabled. With the setup I specified above, if I delete the master pod, another one pops back up and communication between the slave nodes and the master is successful. However, if I shut the one-node cluster down and power it back on, I receive the same error as reported above, where all of the pods are in CrashLoopBackOff. I have tried the suggestion of creating persistent volumes for the redis and sentinel configs that @robertb724 mentions, but success was inconsistent: it worked the first time after a reboot, but failed every other time I have tried. Commenting just to stay up to date on this issue.

@robertb724

@pippy44 Yes, just this morning our cluster went down again after nodes were preempted. The persistent volume seems to help, but there are more issues. Will keep looking for improvements.

@migruiz4 @meseta

@marcosbc
Contributor

marcosbc commented Apr 1, 2021

Hi all, I've just re-opened the internal task so we can look for a proper fix. Unfortunately, I cannot give any ETA due to Easter vacations.

If you happen to find any issues in the meantime and would like to contribute a fix, feel free to send a PR, we'd be glad to help with the review/release process.

@harroguk

@marcosbc Now that Easter is out of the way, is it possible to give an ETA on this?

Thanks

@robertb724

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems
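
For reference, the static-ID half of that is a regular chart value, so something along these lines should enable it on an existing release (release name and namespace below are taken from the original report as placeholders; verify the exact key, sentinel.staticID, against your chart version). The persistent config volumes are the manual StatefulSet change sketched in the earlier comment.

helm upgrade mycache bitnami/redis -n test --reuse-values \
  --set sentinel.staticID=true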

@javsalgar
Contributor

Thanks for the info @robertb724 ! I will forward this to the rest of the engineering team

@bm1216

bm1216 commented May 12, 2021

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

Not ours :(

@Mauraza
Contributor

Mauraza commented May 13, 2021

Hi @bm1216,

Could you add more information about your case?

@pippy44

pippy44 commented May 13, 2021

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

Not ours :(

I am also not seeing success with this solution. My situation is described in my previous post above on Mar. 29th, 2021.

@qeternity

qeternity commented May 14, 2021

We have experienced this in one of our clusters, and it would seem that a hostname is not being populated somewhere, but after a cursory look through the charts, I cannot quite pin down where. If you execute a redis command with a blank hostname, you get the same error:

I have no name!@redis-sentinel-backend-node-0:/$ redis-cli -h -p 6379 ping
Could not connect to Redis at -p:6379: Name or service not known

So it would appear that somewhere a redis hostname is not resolving correctly. My hunch is that the redis container interrogates the sentinels before quorum has been reached, which yields an empty hostname; the connection attempt then fails, the container exits, the liveness checks fail, and k8s collects the whole pod, killing the sentinel sidecar before quorum is reached... only to start the cycle all over again when the pod is redeployed.
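
In other words, if whatever variable a startup script interpolates into -h expands to empty (the variable name below is purely illustrative), the same error is reproduced:

# An empty, unquoted host variable makes redis-cli consume "-p" as the hostname:
REDIS_MASTER_HOST=""
redis-cli -h $REDIS_MASTER_HOST -p 6379 ping
# Could not connect to Redis at -p:6379: Name or service not known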

@Mauraza
Contributor

Mauraza commented May 17, 2021

Hi @pippy44,

There is a new version of Redis. What version are you using?

@robertb724

One thing to note regarding our setup that I have not mentioned: we made adjustments to the script so that all references to sentinel/redis use the headless service entry rather than the IP.

@zhx828

zhx828 commented Jun 25, 2021

I'm using app version 6.2.4 (chart redis-14.6.1) and am still seeing this issue.
The error logs are the same as wantdrink's:

sentinel

Could not connect to Redis at 10.1.0.186:26379: Connection refused

redis

26379: Connection refused
Could not connect to Redis at -p:6379: Name or service not known

@Mauraza
Contributor

Mauraza commented Jun 28, 2021

Hi @zhx828,

I will add this information to the internal task, we will update the thread when we have more information.

@qeternity

@Mauraza @zhx828 we are pinned to 14.1.0 and have not seen this issue...can you try using that version? Perhaps there has been a regression in more recent versions.

@zhx828

zhx828 commented Jun 28, 2021

@qeternity which bitnami/redis version are you using? I'm not specifying the redis version in my deployment; it should be the default version from bitnami.

@qeternity

@zhx828 we are using chart version 14.1.0 and redis/sentinel version 6.2.2

@zhx828

zhx828 commented Jun 28, 2021

@qeternity Still seeing the issue with these versions after restarting Docker. Same error log. I saw you posted last month about a similar issue. Was it just downgrading the version that helped?

@vfiset

vfiset commented May 16, 2022

We have found that pairing sentinel.staticId = true with the persistent volume for redis and sentinel config has solved our problems

I've been facing this issue for ages on my k8s cluster and never realized that I did not have persistence enabled for the sentinels; that could help a lot with retaining configuration. I will test that and report back.
