[bitnami/redis] Sentinel cluster doesn't elect new master after master pod deletion #6165

Closed
wilsoniya opened this issue Apr 20, 2021 · 29 comments

@wilsoniya

Which chart:
Chart: bitnami/redis
Version: 13.0.1

Describe the bug
When the master pod is manually deleted, the remaining replicas occasionally appear to keep re-electing the now-nonexistent master. When the replacement pod comes up, it is unable to connect to the master reported by the remaining replicas, since that address is the IP of the deleted master pod.

To Reproduce
I'm not able to deterministically reproduce the behavior described above. I'd say the errant behavior occurs ~20% of the time.

Steps to reproduce the behavior:

  1. Create a sentinel cluster with the values below and wait for it to come online
  2. Determine which pod is master and delete it (see the sketch after these steps)
  3. (with some probability) the replacement pod can't start Redis: the remaining sentinels still report the IP of the now-deleted pod as master, so the replacement can't connect to a working master
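
For reference, a minimal sketch of step 2, assuming the release/namespace used elsewhere in this report (redis, redis-test), the chart's sentinel container name, the default master set name mymaster, and that the password is available as $REDIS_PASSWORD inside the container; adjust for your own deployment:

# Ask any sentinel which address it currently considers master
kubectl exec -n redis-test redis-node-0 -c sentinel -- \
  redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel get-master-addr-by-name mymaster
# Match that IP to a pod, then delete it
kubectl get pods -n redis-test -o wide
kubectl delete pod -n redis-test <name-of-the-master-pod>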

Expected behavior
When a pod is deleted, the remaining cluster members should elect a new master among themselves, and the replacement pod should be able to connect to the elected master when it comes online.

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.4.0", GitCommit:"7090a89efc8a18f3d8178bf47d2462450349a004", GitTreeState:"clean", GoVersion:"go1.14.10"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:08:59Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15", GitCommit:"73dd5c840662bb066a146d0871216333181f4b64", GitTreeState:"clean", BuildDate:"2021-01-13T13:14:05Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Additional context

values
## Bitnami Redis(TM) image version
## ref: https://hub.docker.com/r/bitnami/redis/tags/
##
image:
  registry: docker.io
  repository: bitnami/redis
  ## Bitnami Redis(TM) image tag
  ## ref: https://github.com/bitnami/bitnami-docker-redis#supported-tags-and-respective-dockerfile-links
  ##
  tag: "6.2.1-debian-10-r36"
  ## Specify an imagePullPolicy
  ## Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
  ## ref: http://kubernetes.io/docs/user-guide/images/#pre-pulling-images
  ##
  pullPolicy: IfNotPresent

## Cluster settings
##
cluster:
  enabled: true
  slaveCount: 3

## Use redis sentinel in the redis pod. This will disable the master and slave services and
## create one redis service with ports to the sentinel and the redis instances
##
sentinel:
  enabled: true
  ## Require password authentication on the sentinel itself
  ## ref: https://redis.io/topics/sentinel
  ##
  usePassword: true
  ## Bitnami Redis(TM) Sentinel image version
  ## ref: https://hub.docker.com/r/bitnami/redis-sentinel/tags/
  ##
  image:
    registry: docker.io
    repository: bitnami/redis-sentinel
    ## Bitnami Redis(TM) image tag
    ## ref: https://github.com/bitnami/bitnami-docker-redis-sentinel#supported-tags-and-respective-dockerfile-links
    ##
    tag: "6.2.1-debian-10-r35"
    ## Specify an imagePullPolicy
    ## Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
    ## ref: http://kubernetes.io/docs/user-guide/images/#pre-pulling-images
    ##
    pullPolicy: IfNotPresent

## Use password authentication
##
usePassword: true
## Redis(TM) password (both master and slave)
## Defaults to a random 10-character alphanumeric string if not set and usePassword is true
## ref: https://github.com/bitnami/bitnami-docker-redis#setting-the-server-password-on-first-run
##
password: "password"

##
## Redis(TM) Master parameters
##
master:
  ## Comma-separated list of Redis(TM) commands to disable
  ##
  ## Can be used to disable Redis(TM) commands for security reasons.
  ## Commands will be completely disabled by renaming each to an empty string.
  ## ref: https://redis.io/topics/security#disabling-of-specific-commands
  ##
  disableCommands:
  # - FLUSHDB
  # - FLUSHALL

  ## Redis(TM) Master additional pod labels and annotations
  ## ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
  ##
  podLabels: {}
  podAnnotations:
    # Datadog redis metrics autodiscovery
    # See: https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes#datadog-redis-integration
    ad.datadoghq.com/redis.check_names: '["redisdb"]'
    ad.datadoghq.com/redis.init_configs: '[{}]'
    ad.datadoghq.com/redis.instances: |
      [
        {
          "host": "%%host%%",
          "port":"6379",
          "password":"{{ .Values.secrets.rms.cache.backend_config.password }}"
        }
      ]

  ## Redis(TM) Master resource requests and limits
  ## ref: http://kubernetes.io/docs/user-guide/compute-resources/
  resources:
    requests:
      memory: 512Mi
      cpu: 300m
    limits:
      memory: 1024Mi
      cpu: 600m

  ## Enable persistence using Persistent Volume Claims
  ## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  ##
  persistence:
    enabled: false

##
## Redis(TM) Slave properties
## Note: service.type is a mandatory parameter
## The rest of the parameters are either optional or, if undefined, will inherit those declared in Redis(TM) Master
##
slave:
  ## List of Redis(TM) commands to disable
  ##
  disableCommands:
  # - FLUSHDB
  # - FLUSHALL

  ## Redis(TM) slave Resource
  resources:
    requests:
      memory: 512Mi
      cpu: 300m
    limits:
      memory: 1024Mi
      cpu: 600m

  podAnnotations:
    # Datadog redis metrics autodiscovery
    # See: https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes#datadog-redis-integration
    ad.datadoghq.com/redis.check_names: '["redisdb"]'
    ad.datadoghq.com/redis.init_configs: '[{}]'
    ad.datadoghq.com/redis.instances: |
      [
        {
          "host": "%%host%%",
          "port":"6379",
          "password":"{{ .Values.secrets.rms.cache.backend_config.password }}"
        }
      ]

  ## Enable persistence using Persistent Volume Claims
  ## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  ##
  persistence:
    enabled: false

## Sysctl InitContainer
## used to perform sysctl operation to modify Kernel settings (needed sometimes to avoid warnings)
##
sysctlImage:
  enabled: true
  command:
    - /bin/sh
    - -c
    - |-
      sysctl -w net.core.somaxconn=10000
      echo never > /host-sys/kernel/mm/transparent_hugepage/enabled
  registry: docker.io
  repository: bitnami/bitnami-shell
  tag: "10"
  pullPolicy: Always
  mountHostSys: true

installation command

helm install redis . -f custom-values.yaml --atomic --namespace redis-test

cluster log output

The output below occurs on an otherwise healthy sentinel cluster after I run kubectl delete pod redis-node-2 (please note: the logs are collected via stern, which I believe explains the unexpected error: stream error: stream ID 19; INTERNAL_ERROR occurrences).

redis-node-2 redis 1:signal-handler (1618955501) Received SIGTERM scheduling shutdown...
redis-node-2 redis 1:M 20 Apr 2021 21:51:41.945 # User requested shutdown...
redis-node-2 redis 1:M 20 Apr 2021 21:51:41.945 * Calling fsync() on the AOF file.
redis-node-2 redis 1:M 20 Apr 2021 21:51:41.945 # Redis is now ready to exit, bye bye...
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.949 # Executing user requested FAILOVER of 'mymaster'
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.949 # +new-epoch 6
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.949 # +try-failover master mymaster 10.42.12.213 6379
redis-node-1 redis 1:S 20 Apr 2021 21:51:41.954 # Connection with master lost.
redis-node-1 redis 1:S 20 Apr 2021 21:51:41.954 * Caching the disconnected master state.
redis-node-1 redis 1:S 20 Apr 2021 21:51:41.954 * Reconnecting to MASTER 10.42.12.213:6379
redis-node-1 redis 1:S 20 Apr 2021 21:51:41.954 * MASTER <-> REPLICA sync started
redis-node-1 redis 1:S 20 Apr 2021 21:51:41.955 # Error condition on socket for SYNC: Connection refused
redis-node-0 redis 1:S 20 Apr 2021 21:51:41.952 # Connection with master lost.
redis-node-0 redis 1:S 20 Apr 2021 21:51:41.952 * Caching the disconnected master state.
redis-node-0 redis 1:S 20 Apr 2021 21:51:41.952 * Reconnecting to MASTER 10.42.12.213:6379
redis-node-0 redis 1:S 20 Apr 2021 21:51:41.952 * MASTER <-> REPLICA sync started
redis-node-0 redis 1:S 20 Apr 2021 21:51:41.953 # Error condition on socket for SYNC: Connection refused
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.993 # +vote-for-leader bc33c65f6d573da2c50da570ccf4dc629a32426d 6
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.993 # +elected-leader master mymaster 10.42.12.213 6379
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:41.993 # +failover-state-select-slave master mymaster 10.42.12.213 6379
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:42.054 # +selected-slave slave 10.42.9.18:6379 10.42.9.18 6379 @ mymaster 10.42.12.213 6379
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:42.054 * +failover-state-send-slaveof-noone slave 10.42.9.18:6379 10.42.9.18 6379 @ mymaster 10.42.12.213 6379
redis-node-2 sentinel 1:signal-handler (1618955502) Received SIGTERM scheduling shutdown...
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:42.121 # User requested shutdown...
redis-node-2 sentinel 1:X 20 Apr 2021 21:51:42.121 # Sentinel is now ready to exit, bye bye...
redis-node-0 redis 1:S 20 Apr 2021 21:51:42.175 * Connecting to MASTER 10.42.12.213:6379
redis-node-0 redis 1:S 20 Apr 2021 21:51:42.175 * MASTER <-> REPLICA sync started
redis-node-0 redis 1:S 20 Apr 2021 21:51:42.177 # Error condition on socket for SYNC: Connection refused
redis-node-1 redis 1:S 20 Apr 2021 21:51:42.310 * Connecting to MASTER 10.42.12.213:6379
redis-node-1 redis 1:S 20 Apr 2021 21:51:42.310 * MASTER <-> REPLICA sync started
redis-node-1 redis 1:S 20 Apr 2021 21:51:42.312 # Error condition on socket for SYNC: Connection refused
- redis-node-2 › redis
- redis-node-2 › sentinel
redis-node-0 redis 1:S 20 Apr 2021 21:51:43.185 * Connecting to MASTER 10.42.12.213:6379
redis-node-0 redis 1:S 20 Apr 2021 21:51:43.185 * MASTER <-> REPLICA sync started
redis-node-1 redis 1:S 20 Apr 2021 21:51:43.328 * Connecting to MASTER 10.42.12.213:6379
redis-node-1 redis 1:S 20 Apr 2021 21:51:43.328 * MASTER <-> REPLICA sync started
redis-node-1 sentinel 1:X 20 Apr 2021 21:51:52.310 # +reset-master master mymaster 10.42.12.213 6379
+ redis-node-2 › sentinel
+ redis-node-2 › redis
redis-node-2 sentinel  21:51:52.24 INFO  ==> redis-headless.redis-test.svc.cluster.local has my IP: 10.42.12.214
redis-node-2 sentinel  21:51:52.29 INFO  ==> Cleaning sentinels in sentinel node: 10.42.9.18
redis-node-2 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-2 sentinel 1
redis-node-2 redis  21:51:51.92 INFO  ==> redis-headless.redis-test.svc.cluster.local has my IP: 10.42.12.214
redis-node-2 redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-2 redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-1 sentinel 1:X 20 Apr 2021 21:51:53.211 * +sentinel sentinel b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 10.42.16.216 26379 @ mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:51:57.322 # +reset-master master mymaster 10.42.12.213 6379
redis-node-2 sentinel  21:51:57.31 INFO  ==> Cleaning sentinels in sentinel node: 10.42.16.216
redis-node-2 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-2 sentinel 1
redis-node-0 sentinel 1:X 20 Apr 2021 21:51:59.379 * +sentinel sentinel 11f8f53ef3e904a0cfe2822709d6d6ca611daaf6 10.42.9.18 26379 @ mymaster 10.42.12.213 6379
redis-node-2 sentinel  21:52:02.32 INFO  ==> Sentinels clean up done
redis-node-2 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-2 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:12.350 # +sdown master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.333 # +sdown master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.388 # +odown master mymaster 10.42.12.213 6379 #quorum 2/2
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.388 # +new-epoch 6
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.388 # +try-failover master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.397 # +vote-for-leader b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 6
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:17.407 # +new-epoch 6
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.420 # 11f8f53ef3e904a0cfe2822709d6d6ca611daaf6 voted for b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 6
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:17.422 # +vote-for-leader b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 6
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.480 # +elected-leader master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.480 # +failover-state-select-slave master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.556 # -failover-abort-no-good-slave master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:17.623 # Next failover delay: I will not start a failover before Tue Apr 20 21:52:54 2021
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:17.716 # +odown master mymaster 10.42.12.213 6379 #quorum 2/2
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:17.716 # Next failover delay: I will not start a failover before Tue Apr 20 21:52:53 2021
unexpected error: stream error: stream ID 19; INTERNAL_ERROR
unexpected error: stream error: stream ID 29; INTERNAL_ERROR
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:49.303 # +reset-master master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:49.700 # -odown master mymaster 10.42.12.213 6379
redis-node-1 sentinel 1:X 20 Apr 2021 21:52:50.232 * +sentinel sentinel b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 10.42.16.216 26379 @ mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:54.314 # +reset-master master mymaster 10.42.12.213 6379
unexpected error: stream error: stream ID 33; INTERNAL_ERROR
redis-node-0 sentinel 1:X 20 Apr 2021 21:52:54.329 * +sentinel sentinel 11f8f53ef3e904a0cfe2822709d6d6ca611daaf6 10.42.9.18 26379 @ mymaster 10.42.12.213 6379
redis-node-1 sentinel 1:X 20 Apr 2021 21:53:09.384 # +sdown master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.328 # +sdown master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.411 # +odown master mymaster 10.42.12.213 6379 #quorum 2/2
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.411 # +new-epoch 7
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.411 # +try-failover master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.422 # +vote-for-leader b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 7
redis-node-1 sentinel 1:X 20 Apr 2021 21:53:14.437 # +new-epoch 7
redis-node-1 sentinel 1:X 20 Apr 2021 21:53:14.450 # +vote-for-leader b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 7
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.448 # 11f8f53ef3e904a0cfe2822709d6d6ca611daaf6 voted for b942a249aa6aaca842ead4ff6ad2fd01cdd6797b 7
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.488 # +elected-leader master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.488 # +failover-state-select-slave master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.550 # -failover-abort-no-good-slave master mymaster 10.42.12.213 6379
redis-node-0 sentinel 1:X 20 Apr 2021 21:53:14.640 # Next failover delay: I will not start a failover before Tue Apr 20 21:53:51 2021
redis-node-1 sentinel 1:X 20 Apr 2021 21:53:14.695 # +odown master mymaster 10.42.12.213 6379 #quorum 2/2
redis-node-1 sentinel 1:X 20 Apr 2021 21:53:14.695 # Next failover delay: I will not start a failover before Tue Apr 20 21:53:51 2021
- redis-node-2 › redis
- redis-node-2 › sentinel
+ redis-node-2 › sentinel
+ redis-node-2 › redis

@Mauraza
Contributor

Mauraza commented Apr 21, 2021

Hi @wilsoniya,

I have not been able to reproduce the issue. It may be related to this Helm issue: helm/helm#7997. Could you check it?

@wilsoniya
Author

Thanks for your reply, @Mauraza :)

I've never had problems with helm install or helm list like those mentioned in the issue you referenced. That is, I never see helm commands return errors mentioning Context timeouts, or which take a long time. The redis charts I install always result in a healthy sentinel cluster. It's only after deleting the master pod that I sometimes see my issue occur, and deleting a single pod only involves a kubectl command, not helm.

So I'd be surprised if my issue was related. Thanks again!

@Mauraza
Copy link
Contributor

Mauraza commented Apr 22, 2021

Hi @wilsoniya,

I was digging a little more; I think your issue may be related to #3700. Could you confirm that?

@wilsoniya
Copy link
Author

@Mauraza Thank you for continuing to work with me on this.

I think this comment by @dustinrue is pretty similar to what I'm seeing: #3700 (comment). Quoting it:

After more digging I discovered that the new pod is getting the old master info back because the remaining pods haven't yet selected a new master. The new pod then gets stuck, unable to determine what to connect to in order to move forward. I put in a PR that just causes the liveness check to fail and force the sentinel container to restart. Hopefully once this has happened the remaining pods have selected a new master.

However, their message implies that eventually a new master is elected by the remaining pods, and eventually the restarted pod is able to rejoin the cluster.

This isn't the behavior I'm seeing. Instead, the remaining two pods never elect a new master; they keep treating the IP of the old (deleted) master as if it still exists and is valid. This causes the replacement pod to fail to start, because it resolves the deleted pod's IP as the master from the two remaining pods.

@javsalgar
Contributor

javsalgar commented Apr 23, 2021

Hi,

Could it be because of quorum issues? I see that there are only two pods doing the election.

@wilsoniya
Author

@javsalgar thanks for the reply.

I don't know enough about the workings of sentinel to answer, tbh, though that sounds plausible. I believe I have quorum set at 2; wouldn't that be sufficient?
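
For what it's worth, one hedged way to check the configured quorum and whether the sentinels can currently reach it (assuming the pod/container names from the logs above and that $REDIS_PASSWORD is set in the container):

kubectl exec -n redis-test redis-node-0 -c sentinel -- \
  redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel master mymaster | grep -A1 quorum
kubectl exec -n redis-test redis-node-0 -c sentinel -- \
  redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel ckquorum mymaster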

@Mauraza
Contributor

Mauraza commented Apr 26, 2021

Hi @wilsoniya,

There is a new major version of the chart, could you try it?

@wilsoniya
Author

@Mauraza thanks for letting me know about the new major version.

The upgrade seems to be a major improvement, and I wasn't able to reproduce the issue by deleting the master pod or rolling the statefulset.

However, I was able to reproduce the issue by ungracefully deleting the master pod:

kubectl delete pod redis-node-0 --force --grace-period=0 

This resulted in the remaining sentinels continuing to think the IP of the deleted pod was master, thus preventing the new pod from discovering a functioning master.

While this seems to be an improvement, I think in general we can't depend on master pods shutting down gracefully. For example, what happens if the k8s node serving the master pod suddenly disappears?

@Mauraza
Contributor

Mauraza commented Apr 30, 2021

Hi @wilsoniya,

thanks for trying the new version; I was also able to reproduce it.
I'm going to create an internal task to investigate this. We will update this thread when we have more information.

@github-actions

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label May 16, 2021
@wilsoniya
Author

Hey @Mauraza, do you know of any updates on this issue? Are there any other issues which might represent the work to fix the underlying issue?

Thanks!

@Mauraza Mauraza added on-hold Issues or Pull Requests with this label will never be considered stale and removed stale 15 days without activity labels May 17, 2021
@Mauraza
Contributor

Mauraza commented May 18, 2021

Hi @wilsoniya,

sorry, this is still a work in progress; when we have more information we will update the issue.

@pablogalegoc
Contributor

Hi @wilsoniya! We've finally had time to investigate this issue; here's our best guess:

It still seems closely related to #3700 (comment): a race condition involving the sentinel on the redis master. Once the pod containing the redis master server and its sentinel (1) is forcefully deleted, the period until a new master is elected is sentinel.downAfterMilliseconds + sentinel.failoverTimeout (which with the defaults used here is 60000ms + 18000ms). That is 1min 18sec during which the sentinels of the replicas (pods 2 and 3) think the killed master (1) is still alive, so when a new pod (4) comes up and its sentinel

  1. issues a SENTINEL RESET to all the other sentinels, then
  2. queries 2 and 3 for the master

it gets the IP of 1, since a new master has not been elected, and enters a CrashLoopBackOff. Pod restarts have an exponential backoff delay, so eventually this delay grows beyond the 1min 18sec, and that is when the sentinels try to elect a new master. However, when trying to elect the master they enter the -failover-abort-no-good-slave loop. So it then becomes a question of “is there a good candidate to be master?” (redis/redis#7825). Unfortunately, we have not been able to debug the specific reason why replicas 2 and 3 end up not being suitable for promotion to master.
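
For anyone debugging the -failover-abort-no-good-slave part further, a hedged way to inspect how the sentinels currently see the replicas (reusing the pod/container names from the logs in this issue, with the password in $REDIS_PASSWORD):

# Shows, for each replica known to this sentinel, its flags (e.g. s_down,
# disconnected), slave-priority and slave-repl-offset
kubectl exec -n redis-test redis-node-0 -c sentinel -- \
  redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel replicas mymaster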

Mitigation: We've been able to mitigate this issue by reducing sentinel.downAfterMilliseconds and sentinel.failoverTimeout to beat the pod restart delay.

We are aware this is somewhat nondeterministic, but avoiding these race conditions programmatically does not seem trivial, unfortunately. I'm going to mention #6320 and #6484 so they are also aware of this, and if any of you folks can come up with a solution we would be happy to review your contributions!

@pablogalegoc pablogalegoc removed the on-hold Issues or Pull Requests with this label will never be considered stale label Jun 2, 2021
@bluecrabs007

Mitigation: We've been able to mitigate this issue by reducing sentinel.downAfterMilliseconds and sentinel.failoverTimeout to beat the pod restart delay.

Looks like this works; I was able to work around the issue by setting these two config options to:

-  downAfterMilliseconds: 60000
-  failoverTimeout: 18000
+  downAfterMilliseconds: 4000
+  failoverTimeout: 2000 
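
For context, these options live under the sentinel block of the chart values, so the override looks roughly like the sketch below (the numbers are simply the ones quoted above, not a general recommendation):

sentinel:
  enabled: true
  downAfterMilliseconds: 4000
  failoverTimeout: 2000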

@rlees85

rlees85 commented Jun 24, 2021

Sorry for my misunderstanding, but what exactly do you mean by "pod restart delay" in the comment:

Mitigation: We've been able to mitigate this issue by reducing sentinel.downAfterMilliseconds and sentinel.failoverTimeout to beat the pod restart delay.

Do you mean the termination grace period, or the retry backoff on the original master that will not come back up?

Also thank you for the configuration example:

-  downAfterMilliseconds: 60000
-  failoverTimeout: 18000
+  downAfterMilliseconds: 4000
+  failoverTimeout: 2000 

I will try these for now, but I was wondering whether they could be tweaked higher and still keep things working; hence the question above.

Thanks all!

edit: could the sentinel reset command be causing an in-progress failover to abort? I am sure it's there for a good reason though...

@pablogalegoc
Contributor

Hi @rlees85,

Sorry, by pod restart delay I meant the time needed for the creation of the new pod after the original master dies. If the other replicas are able to elect the new master before that, then the new pod will be added as a replica of the newly elected master. Hope that clears things up!

@mblaschke
Contributor

mblaschke commented Jun 29, 2021

Can confirm this issue when running the Redis chart with sentinel on Azure AKS.

Sometimes I see this in the logs:

redis container:

Could not connect to Redis at redis.XXXXX.svc.cluster.local:26379: Connection timed out            
Could not connect to Redis at -p:6379: Name or service not known

@Jacq

Jacq commented Jun 30, 2021

I experienced a similar issue where deleting node-0 caused a crash loop and no new master was elected. I tried several recommendations mentioned here and in other issues, but the problem persisted.

In my case I debugged the sentinel log contents and found that the slaves were all registering with the same IP, which was the IP of the Kubernetes node where the current master was located. We do not have similar problems in other pods, so I have no idea why the master reports the node IP for all the slaves.

Due to the above IP reporting, the slave registration produced some inconsistencies: the redis-cli "info" command reported "slaves=1,sentinels=3" instead of "slaves=2". This also caused several problems for sentinel when I deleted a slave pod or the master one: it tried to reconnect to its old master IP, no master was re-elected, and the whole cluster went down.
I applied @bluecrabs007's delay reduction, but it still did not fix the problem.

I have also applied the fix mentioned in #4082 based on the "replica-announce-ip":

replica:
  persistence:
    enabled: false
  preExecCmds:  |
    echo "" >>  #/opt/bitnami/redis/etc/replica.conf
    echo "replica-announce-ip $POD_IP" >> /opt/bitnami/redis/etc/replica.conf
  extraEnvVars:
    - name: "POD_IP"
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

With the above fix, each replica reports its own IP correctly and slaves=2 is registered; the cluster now recovers correctly after the master is deleted. I hope this fixes someone else's problem.
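
A quick way to verify the registration after applying this (a sketch, reusing the pod/container names from earlier in the thread) is to check the sentinel's INFO output and confirm it now reports slaves=2:

kubectl exec -n redis-test redis-node-0 -c sentinel -- \
  redis-cli -p 26379 -a "$REDIS_PASSWORD" info sentinel | grep master0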
Cheers

@pablogalegoc
Contributor

Thanks for sharing it @Jacq!

@mblaschke did any of the above suggestions fixed your problem?

@rjasper-frohraum

rjasper-frohraum commented Jul 7, 2021

I think the problem lies in the prestop-sentinel.sh script:

failover_finished() {
    REDIS_SENTINEL_INFO=($(run_sentinel_command get-master-addr-by-name "{{ .Values.sentinel.masterSet }}"))
    REDIS_MASTER_HOST="${REDIS_SENTINEL_INFO[0]}"
    [[ "$REDIS_MASTER_HOST" != "${myip}" ]]
}

I suspect $myip is not set at that point.

I tried to verify this by replacing ${myip} with $(hostname -i), which worked for me. Nevertheless, I think a proper solution should use something like the snippet below, similar to the other scripts:

    # If there are more than one IP, use the first IPv4 address
    if [[ "$myip" = *" "* ]]; then
        myip=$(echo $myip | awk '{if ( match($0,/([0-9]+\.)([0-9]+\.)([0-9]+\.)[0-9]+/) ) { print substr($0,RSTART,RLENGTH); } }')
    fi

@pablogalegoc
Contributor

Hi @rjasper-frohraum!

Thanks for that, I'll look into it and report back what I find.

@Vanosz

Vanosz commented Jul 13, 2021

I experienced a similar issue where deleting node-0 caused a crash loop and no new master was elected. I tried several recommendations mentioned here and in other issues, but the problem persisted.

In my case I debugged the sentinel log contents and found that the slaves were all registering with the same IP, which was the IP of the Kubernetes node where the current master was located. We do not have similar problems in other pods, so I have no idea why the master reports the node IP for all the slaves.

Due to the above IP reporting, the slave registration produced some inconsistencies: the redis-cli "info" command reported "slaves=1,sentinels=3" instead of "slaves=2". This also caused several problems for sentinel when I deleted a slave pod or the master one: it tried to reconnect to its old master IP, no master was re-elected, and the whole cluster went down.
I applied @bluecrabs007's delay reduction, but it still did not fix the problem.

I have also applied the fix mentioned in #4082 based on the "replica-announce-ip":

replica:
  persistence:
    enabled: false
  preExecCmds:  |
    echo "" >>  #/opt/bitnami/redis/etc/replica.conf
    echo "replica-announce-ip $POD_IP" >> /opt/bitnami/redis/etc/replica.conf
  extraEnvVars:
    - name: "POD_IP"
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

With the above fix, each replica reports its own IP correctly and slaves=2 is registered; the cluster now recovers correctly after the master is deleted. I hope this fixes someone else's problem.
Cheers

Hi, you have a syntax error here:
echo "" >> #/opt/bitnami/redis/etc/replica.conf
As I see it, it should be without the '#'.
I've tried your method and had good results (in old points); I haven't tested it in all cases yet, but I will post updated info soon.

@ThWoywod

Thank you @rjasper-frohraum for your comment. We have noticed the same problem today.
In my opinion, the real problem here is that myip is never set in the "prestop-sentinel.sh" script. (https://github.com/bitnami/charts/blob/master/bitnami/redis/templates/scripts-configmap.yaml#L290)

We fixed the issue by adding this to the top of the script:

    myip=$(hostname -i)
    
    # If there are more than one IP, use the first IPv4 address
    if [[ "$myip" = *" "* ]]; then
        myip=$(echo $myip | awk '{if ( match($0,/([0-9]+\.){3}[0-9]+/) ) { print substr($0,RSTART,RLENGTH); } }')
    fi

@rjasper-frohraum

Just wanted to note that the problem I described is most likely not the same as the OP's; it just has similar symptoms. To my knowledge, the $myip bug was introduced in chart version 14.2.0 (the OP's is 13.0.1).

@rush-skills

Just wanted to drop this here: I was also facing the same issue, where killing the master pod causes a race condition and the cluster is not able to elect a new master. By applying

sentinel:
  downAfterMilliseconds: 10000 
  failoverTimeout: 5000 
  livenessProbe:
    enabled: true
    initialDelaySeconds: 120

I was able to fix the issue, since a new master is now elected before the (old) master pod is resurrected, and the new pod joins as a replica. I didn't need to make the myip changes, however (chart version 14.6.2).

@manisha-tanwar

Just wanted to drop this here: I was also facing the same issue, where killing the master pod causes a race condition and the cluster is not able to elect a new master. By applying

sentinel:
  downAfterMilliseconds: 10000 
  failoverTimeout: 5000 
  livenessProbe:
    enabled: true
    initialDelaySeconds: 120

I was able to fix the issue, since a new master is now elected before the (old) master pod is resurrected, and the new pod joins as a replica. I didn't need to make the myip changes, however (chart version 14.6.2).

Think this also fixed my issue. (Chart version 14.1.1)

f-w added a commit to bcgov/NotifyBC that referenced this issue Jul 28, 2021
@github-actions

github-actions bot commented Aug 5, 2021

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Aug 5, 2021
@github-actions

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@igorwwwwwwwwwwwwwwwwwwww
Contributor

Fixed by #7835.

mhaswell-bcgov pushed a commit to bcgov/des-notifybc-helmonly that referenced this issue Nov 2, 2023