
[redis] Sentinel from master does not recover itself #3562

Closed
EduardoFLima opened this issue Aug 31, 2020 · 4 comments
Labels: on-hold

Comments

@EduardoFLima

Which chart:
bitnami/redis, 10.7.16

Describe the bug
We're using the Redis Helm chart in a topology with one master and two slaves.
When the sentinel container inside the pod that contains the master crashes, it is not able to recover.

In the logs, we can see the error below:

sentinel known-sentinel mymaster 172.17.0.10 26379 2156b04daef7355ec8796e6f493a1d0285f1adc5
*** FATAL CONFIG FILE ERROR (Redis 6.0.6) ***
Reading the configuration file, at line 22
>>> 'sentinel known-sentinel mymaster 172.17.0.10 26379 2156b04daef7355ec8796e6f493a1d0285f1adc5'
Wrong hostname or port for sentinel.
sentinel known-sentinel mymaster 172.17.0.9 26379 101ee1d442b7f7db844587d150c4accfe371e529
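
For context, a quick check (a sketch on our side; the config path below is the Bitnami default we assume for this chart version and may differ) is to compare the known-sentinel entries persisted by a still-healthy sentinel with the current pod IPs:

# List the known-sentinel lines persisted by a healthy peer sentinel
# (assumed config path; adjust it if your chart version places the file elsewhere).
kubectl exec -it my-redis-slave-0 -c sentinel -- \
  grep known-sentinel /opt/bitnami/redis-sentinel/etc/sentinel.conf

# List the current pod IPs to compare against the addresses above.
kubectl get pods -o wide | grep my-redis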

To Reproduce
Steps to reproduce the behavior:

  1. Install the chart with:
helm install my-redis bitnami/redis --namespace default -f values.yaml

Where values.yaml is:

image:
  tag: 6.0.6

sentinel:
  enabled: true
  image:
    tag: 6.0.6
  downAfterMilliseconds: 5000
  failoverTimeout: 10000

usePassword: true
password: password

master:
  resources:
    requests:
      memory: 200Mi
      cpu: 100m
    limits:
      cpu: 1000m
      memory: 1Gi
  persistence:
    enabled: false

slave:
  resources:
    requests:
      memory: 200Mi
      cpu: 100m
    limits:
      cpu: 1000m
      memory: 1Gi

  persistence:
    enabled: false
  2. Wait until the pods are up and running and test Redis.
    Redis container in the master pod:
kubectl exec -it my-redis-master-0 -- redis-cli -a password info replication
# Replication
role:master
connected_slaves:2
slave0:ip=172.17.0.3,port=6379,state=online,offset=14904,lag=0
slave1:ip=172.17.0.10,port=6379,state=online,offset=14769,lag=1
master_replid:22b4a8cc55fd70d1896519429f46affb9718ba95
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:14904
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:14904

Sentinel container in the master pod:

kubectl exec -it my-redis-master-0 -c sentinel -- redis-cli -p 26379 -a password info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=172.17.0.9:6379,slaves=2,sentinels=3

Master pod ip:

kubectl describe pod my-redis-master-0 | grep IP:
IP:           172.17.0.9
  3. Shut down the sentinel inside the pod that contains one of the slaves:
    kubectl exec -it my-redis-slave-0 -c sentinel -- redis-cli -p 26379 -a password shutdown
  4. Note that the my-redis-slave-0 sentinel crashes and is able to recover itself.

  5. Shut down the sentinel inside the pod that contains the master:

    kubectl exec -it my-redis-master-0 -c sentinel -- redis-cli -p 26379 -a password shutdown
  6. Note that the my-redis-master-0 sentinel crashes and keeps trying to recover itself, but is not able to (the pod status keeps alternating between CrashLoopBackOff and Error).

In the log of the sentinel container of the my-redis-master-0 pod:

sentinel known-sentinel mymaster 172.17.0.10 26379 2156b04daef7355ec8796e6f493a1d0285f1adc5

*** FATAL CONFIG FILE ERROR (Redis 6.0.6) ***
Reading the configuration file, at line 22
>>> 'sentinel known-sentinel mymaster 172.17.0.10 26379 2156b04daef7355ec8796e6f493a1d0285f1adc5'
Wrong hostname or port for sentinel.
sentinel known-sentinel mymaster 172.17.0.9 26379 101ee1d442b7f7db844587d150c4accfe371e529

In the log of the sentinel container of the my-redis-slave-0 pod:

1:X 28 Aug 2020 06:57:03.971 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 28 Aug 2020 06:57:03.971 # Redis version=6.0.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 28 Aug 2020 06:57:03.971 # Configuration loaded
1:X 28 Aug 2020 06:57:03.972 * Running mode=sentinel, port=26379.
1:X 28 Aug 2020 06:57:03.973 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 28 Aug 2020 06:57:03.973 # Sentinel ID is 757f43b7ef7152735bb37b5b6f7aceed2bdc6bf5
1:X 28 Aug 2020 06:57:03.973 # +monitor master mymaster 172.17.0.9 6379 quorum 2
1:X 28 Aug 2020 06:58:00.966 # +sdown sentinel 101ee1d442b7f7db844587d150c4accfe371e529 172.17.0.9 26379 @ mymaster 172.17.0.9 6379
  7. From the logs, we understand that the sentinel of the master pod is starting and trying to connect to one of the slave sentinels (172.17.0.10) but is unable to for some reason (see the check sketched right after this list).
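
To narrow this down, a check we find useful (a sketch using the same pod names and password as above) is to ask a surviving sentinel which peers it still knows about, and optionally drop the stale entries so discovery starts over:

# Ask a healthy sentinel which peer sentinels it currently knows about;
# stale entries still list the old master-pod address.
kubectl exec -it my-redis-slave-0 -c sentinel -- \
  redis-cli -p 26379 -a password sentinel sentinels mymaster

# Optionally discard stale state and restart discovery: SENTINEL RESET clears
# the known sentinels and slaves for the given master name.
kubectl exec -it my-redis-slave-0 -c sentinel -- \
  redis-cli -p 26379 -a password sentinel reset mymaster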

Expected behavior
The sentinel container inside the master pod is expected to recover itself, just as the one in the slave pod was able to (step 4 above).
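
As a temporary workaround on our side (an assumption, not a documented fix from the chart), deleting the crash-looping pod should let the StatefulSet recreate it; since persistence is disabled in the values above, the generated sentinel configuration does not survive the pod and is rebuilt from scratch:

# Delete the pod whose sentinel container is crash-looping; the StatefulSet
# recreates it with a freshly generated configuration (persistence is disabled).
kubectl delete pod my-redis-master-0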

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.3.0", GitCommit:"8a4aeec08d67a7b84472007529e8097ec3742105", GitTreeState:"dirty", GoVersion:"go1.14.7"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
@javsalgar
Contributor

Hi,

We have plans to improve our current Redis Sentinel configuration, as we detected some failover issues. We will update this ticket as soon as we have more news. Thank you very much for reporting!

@javsalgar added the on-hold label on Aug 31, 2020
@pascalrimann

Is there any solution on this topic yet? We are currently facing the same issue.

@javsalgar
Contributor

Hi,

Could you share the logs of the issue so we can check whether there is any difference from the ones shared by the OP? Is it something you can easily reproduce by removing the pods? We've been making improvements to the failover mechanism and would like to understand what caused the issue this time.
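
For reference, something along these lines (using the pod and container names from the report above; adjust them to your release) should capture the relevant output, including the previously crashed sentinel container:

# Current and previous logs of the crash-looping sentinel container.
kubectl logs my-redis-master-0 -c sentinel
kubectl logs my-redis-master-0 -c sentinel --previous

# Logs of a healthy peer sentinel, for comparison.
kubectl logs my-redis-slave-0 -c sentinel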

@carrodher
Member

Unfortunately, this issue was created a long time ago and, although there is an internal task to fix it, it was not prioritized as something to address in the short or mid term. This is not for a technical reason but rather a matter of capacity, since we're a small team.

That being said, contributions via PRs are more than welcome in both repositories (containers and charts), in case you would like to contribute.

In the meantime, there have been several releases of this asset, and it's possible the issue has been resolved as part of other changes. If that's not the case and you are still experiencing it, please feel free to reopen the issue and we will re-evaluate it.
