Lettuce cannot recover from connection problems #1428

Closed
adrianpasternak opened this issue Sep 24, 2020 · 20 comments
Labels
for: stackoverflow A question that is better suited to stackoverflow.com

Comments

@adrianpasternak

adrianpasternak commented Sep 24, 2020

Bug Report

Current Behavior

During troubleshooting of our production issues with Lettuce and Redis Cluster, we have discovered issues with re-connection of Pub/Sub subscriptions after network problems.

Lettuce does not send any keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration it will wait for hours until the OS closes the connection. In the meantime, all messages published to the channel are lost.

Input Code

Minimal code from Lettuce docs is enough to reproduce the issue.

        RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2, node3));

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(topologyRefreshOptions)
                .build());

        StatefulRedisPubSubConnection<String, String> connection = clusterClient.connectPubSub();
        connection.addListener(new RedisPubSubListener<String, String>() { ... } );

        RedisPubSubCommands<String, String> sync = connection.sync();
        sync.subscribe("broadcast");

To reproduce the issue:

  • Start Redis Cluster.
  • Connect to the cluster and subscribe to the channel using the above code.
  • Find out which server the client is connected to using tcpdump, or by checking with redis-cli PUBSUB CHANNELS *.
  • Block all network traffic on that server using iptables (killing Redis process is not enough - OS will send FIN packets, and Lettuce will detect a problem and recover the subscription).
  • Redis Cluster will recover the cluster by promoting one of the replicas to the master.
  • Lettuce will not detect that the connection is no longer working, and won't receive messages published to channels. The unused connection will be closed by the OS after a couple of hours, and only then might Lettuce be able to fix the problem.

We were also able to reproduce the issue with standalone Redis:

  • Connect to Pub/Sub using Lettuce.
  • Kill traffic on master using iptables. Restart VM with Redis and restore traffic.
  • Lettuce does not detect the issue and keeps listening on a dead connection.

Expected behavior/code

Lettuce should be able to detect a broken connection to fix Pub/Sub subscriptions.

Environment

  • Lettuce version(s): 5.3.4.RELEASE
  • Redis version: 5.0.5

Possible Solution

We've run similar tests using the redis-cli client. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.

It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to fix Pub/Sub subscriptions.

Workarounds

We've found a workaround for this problem by tweaking OS params (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would want to avoid changing OS params on all our machines that use Lettuce as a Redis client.

@adrianpasternak adrianpasternak added the type: bug A general bug label Sep 24, 2020
@adrianpasternak
Author

adrianpasternak commented Sep 25, 2020

I did some research on the problem.

I tried to figure out how redis-cli is able to send keep-alives, and found the following code:
https://github.com/redis/redis/blob/1c71038540f8877adfd5eb2b6a6013a1a761bc6c/src/redis-cli.c#L908

        /* Set aggressive KEEP_ALIVE socket option in the Redis context socket
         * in order to prevent timeouts caused by the execution of long
         * commands. At the same time this improves the detection of real
         * errors. */
        anetKeepAlive(NULL, context->fd, REDIS_CLI_KEEPALIVE_INTERVAL);

Which leads us to: https://github.com/redis/redis/blob/efb6495a446a92328512f8a66db701dab95fb933/src/anet.c#L95

int anetKeepAlive(char *err, int fd, int interval)
{
    int val = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) == -1)
    {
        anetSetError(err, "setsockopt SO_KEEPALIVE: %s", strerror(errno));
        return ANET_ERR;
    }

#ifdef __linux__
    /* Default settings are more or less garbage, with the keepalive time
     * set to 7200 by default on Linux. Modify settings to make the feature
     * actually useful. */

    /* Send first probe after interval. */
    val = interval;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPIDLE: %s\n", strerror(errno));
        return ANET_ERR;
    }

    /* Send next probes after the specified interval. Note that we set the
     * delay as interval / 3, as we send three probes before detecting
     * an error (see the next setsockopt call). */
    val = interval/3;
    if (val == 0) val = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPINTVL: %s\n", strerror(errno));
        return ANET_ERR;
    }

    /* Consider the socket in error state after three we send three ACK
     * probes without getting a reply. */
    val = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPCNT: %s\n", strerror(errno));
        return ANET_ERR;
    }
#else
    ((void) interval); /* Avoid unused var warning for non Linux systems. */
#endif

    return ANET_OK;
}

As we can see, keep-alive is achieved by setting SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT on the socket via setsockopt. That is probably why we were able to fix the issue by tweaking the corresponding parameters at the OS level.
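To make the effect of TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT concrete, here is a small sketch that estimates how long an idle, silently dropped connection can go unnoticed: roughly the idle time before the first probe plus one interval per unanswered probe. The class and method names are illustrative and not part of Redis or Lettuce. The Linux defaults used in main (7200 s idle, 75 s interval, 9 probes) are the usual kernel defaults; check the sysctl values on your own machines.

```java
import java.time.Duration;

public class KeepAliveDetection {

    /** Approximate worst-case time to detect a dead idle connection: idle + interval * count. */
    public static Duration detectionTime(int idleSeconds, int intervalSeconds, int probeCount) {
        return Duration.ofSeconds(idleSeconds + (long) intervalSeconds * probeCount);
    }

    /** Mirrors the arithmetic in anetKeepAlive: probe interval is interval/3 (at least 1), with 3 probes. */
    public static Duration redisCliStyle(int intervalSeconds) {
        int probeInterval = Math.max(1, intervalSeconds / 3);
        return detectionTime(intervalSeconds, probeInterval, 3);
    }

    public static void main(String[] args) {
        // redis-cli passes an interval of 15 seconds
        System.out.println("redis-cli style (15 s): " + redisCliStyle(15).getSeconds() + " s");
        // Assumed Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9
        System.out.println("Linux defaults: " + detectionTime(7200, 75, 9).getSeconds() + " s");
    }
}
```

With an interval of 15 seconds this gives about 30 seconds, while the assumed Linux defaults add up to 7875 seconds, a little over two hours, which matches the "hours" observed in the original report.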

It would be best if Lettuce could set those parameters at the application level. After some research I figured it could be done in at least 2 ways:

  • using ExtendedSocketOptions introduced in Java 11
  • using EpollChannelOption available in Netty Epoll

This does not look like an easy task, but I did not find any other, easier solution.
Without those changes it seems that Lettuce cannot keep reliable Pub/Sub connections without making undocumented changes in OS configuration.
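As a rough illustration of the first option, the following sketch sets the keep-alive parameters on a plain JDK SocketChannel. It assumes Java 11 or later and a platform where the extended options are supported (the code checks before setting them). The class name and the values 15/5/3 are only examples taken from the redis-cli settings above; this is not how Lettuce or Netty configure their channels.

```java
import java.io.IOException;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

public class JdkKeepAliveSketch {

    /**
     * Enables SO_KEEPALIVE and, where supported, the per-socket keep-alive timings.
     * Returns true only if all three extended options were applied.
     */
    public static boolean configure(SocketChannel channel, int idleSeconds, int intervalSeconds, int count)
            throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);

        Set<SocketOption<?>> supported = channel.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false; // platform or JDK build does not expose these options
        }
        channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, count);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // The channel is not connected; the options are set on the freshly opened socket.
        try (SocketChannel channel = SocketChannel.open()) {
            boolean extended = configure(channel, 15, 5, 3);
            System.out.println("SO_KEEPALIVE=" + channel.getOption(StandardSocketOptions.SO_KEEPALIVE)
                    + ", extended options applied=" + extended);
        }
    }
}
```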

@mp911de
Collaborator

mp911de commented Sep 28, 2020

Have you seen SocketOptions#keepAlive which can be configured through ClientOptions? You can customize channel options by registering a NettyCustomizer and apply customization in afterBootstrapInitialized(Bootstrap).

I'm not sure whether netty already supports ExtendedSocketOptions, since it targets Java 6 (netty/netty#8259) and Lettuce depends on netty for I/O configuration.

@mp911de mp911de added status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage and removed type: bug A general bug labels Sep 28, 2020
@adrianpasternak
Author

Thank you for the hint.

I've managed to fix the problem by adding netty-transport-native-epoll to the classpath and configuring Netty:

SocketOptions socketOptions = SocketOptions.builder()
	.keepAlive(true)
	.build();

ClientResources clientResources = ClientResources.builder()
	.nettyCustomizer(new NettyCustomizer() {
		@Override
		public void afterBootstrapInitialized(Bootstrap bootstrap) {
			bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
			bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
			bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
		}
	})
	.build();

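RedisClient client = RedisClient.create(clientResources, node);
// setOptions expects ClientOptions, so wrap the SocketOptions
client.setOptions(ClientOptions.builder().socketOptions(socketOptions).build());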

I also submitted a bug to Redis: redis/redis#7855, because we think this should be documented a little better. Without the above code, Pub/Sub will work incorrectly after network issues. It was quite challenging to reproduce and troubleshoot this issue.

@mp911de
Collaborator

mp911de commented Sep 28, 2020

Thanks for letting us know.

@mp911de mp911de added for: stackoverflow A question that is better suited to stackoverflow.com and removed status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage labels Sep 28, 2020
@adrianpasternak
Author

It seems that it's possible to make it work without the EPOLL native library, using the default NIO transport:

bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 15);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 5);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 3);

@tzickel

tzickel commented Sep 30, 2020

Just wanted to note that this isn't just a Pub/Sub issue, even in Redis:

redis/redis#7855 (comment)

@mp911de
Collaborator

mp911de commented Sep 30, 2020

Thanks for the link. For now, we can enable and disable keep-alive without the extended configuration. The comments above outline how to manually configure channel options depending on the used channel type. I filed #1437 to add proper support of extended Keep-Alive options.

@mp911de mp911de changed the title Lettuce Pub/Sub cannot recover from connection problems Lettuce cannot recover from connection problems Sep 30, 2020
@qixiaobo

Very Interesting.
I found another way to fix this problem! http://libkeepalive.sourceforge.net/

LD_PRELOAD=/the/path/libkeepalive.so \
  KEEPCNT=20 \
  KEEPIDLE=180 \
  KEEPINTVL=60 \
  java -jar /your/path/yourapp.jar &

@NgSekLong

In our case, even though we implemented the keep-alive workaround with netty-transport-native-epoll, we still saw the 15-minute error (although at a much reduced rate). We are connecting to Azure Redis.

Has anybody else experienced this issue after deploying the fix? If not, then maybe it is another issue on our side. Thanks!

@fbeaufume

In a Spring Boot 2.5.4 web application we use Spring Session Redis, which uses Lettuce.

When Azure Redis is patched we also get command timeouts for 15 minutes in our application running in Kubernetes (hence Linux), even though Redis correctly accepts new connections.

I added netty-transport-native-epoll and configured the keep-alive, but we still face the 15-minute problem.

Am I missing something obvious here? :)

The code changes are rather small. The extra dependency:

    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-transport-native-epoll</artifactId>
        <version>${netty.version}</version>
    </dependency>

The keep-alive config in a Spring configuration class:

@Bean
public LettuceClientConfigurationBuilderCustomizer lettuceCustomizer() {
    return builder -> {
        builder.clientOptions(ClientOptions.builder().socketOptions(SocketOptions.builder()
            .keepAlive(SocketOptions.KeepAliveOptions.builder()
                .enable(true)
                .idle(Duration.ofMinutes(3))
                .count(3)
                .interval(Duration.ofSeconds(10))
                .build()
        ).build()).build());
    };
}

@NgSekLong

Forgot to update: we finally fixed this by also adding TCP_USER_TIMEOUT (i.e. a socket timeout).

The final add-on code looks something like this:

ClientResources clientResources = ClientResources.builder()
  .nettyCustomizer(new NettyCustomizer() {
    @Override
    public void afterBootstrapInitialized(Bootstrap bootstrap) {
      bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
      bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
      bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
      // Socket Timeout (milliseconds)
      bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 60000);
    }
  })
  .build();
// Enabled keep alive
SocketOptions socketOptions = SocketOptions.builder()
  .keepAlive(true)
  .build();
ClientOptions clientOptions = ClientOptions.builder()
  .socketOptions(socketOptions)
  .build();

We have not had the "15 mins connection timeout issue" for over 7 days now. You can try it out as well and see if it works for you. Cheers!

@wpstan

wpstan commented Apr 25, 2022

Replying to @fbeaufume's comment above (keep-alive configured with netty-transport-native-epoll, but command timeouts still last 15 minutes):

Try adding the classifier to your dependency:

<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-transport-native-epoll</artifactId>
  <version>${netty.version}</version>
  <classifier>linux-x86_64</classifier>
</dependency>

@mselmansezgin

mselmansezgin commented Sep 15, 2022

#1428 (comment)

How can I do the same configuration for the Jedis client?
As far as I can see, there is no such thing as JedisPoolingClientConfiguration,
and I couldn't find a way to set keep-alive in JedisClientConfiguration.
Any suggestions?

@kishore25kumar

kishore25kumar commented Jan 19, 2023

@adrianpasternak Is there a way to reproduce this on a single-node Redis instance? I'm observing similar behavior on a single-node Redis, and the issue is not resolved unless I restart the application.

@adrianpasternak
Author

@kishore25kumar yes, you should be able to reproduce the issue on a single-node Redis instance.
Subscribe to a channel, block incoming/outgoing TCP packets to Redis by adding DROP rules to iptables, then restart Redis and remove the blocking rules from iptables.
With the default Linux TCP keep-alive settings, the app will only find out that the connection is gone after more than 2 hours.

@huaxne

huaxne commented Aug 30, 2023

Replying to @NgSekLong's comment above about adding TCP_USER_TIMEOUT:

@NgSekLong Which JDK version are you using? I seem to get an error when using JDK 8.

@flyletc

flyletc commented Aug 31, 2023

Replying to @NgSekLong's comment above about adding TCP_USER_TIMEOUT:

Hello, I ran into a problem using this method:
Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_USER_TIMEOUT' for channel
Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPIDLE' for channel
Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPINTVL' for channel
Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPCNT' for channel

version:
jdk:1.8
netty-transport-native-epoll:4.1.65.Final(linux-x86_64)

Sorry, I'm not very familiar with netty. Can you give me some advice?

@huaxne

huaxne commented Aug 31, 2023

@flyletc
I think you are testing on Windows, which is why the [Unknown channel option] warnings are logged.
Maybe you should test it on Linux.

@flyletc

flyletc commented Aug 31, 2023

@flyletc I think you are testing in Windows, so the [Unknown channel option] error logs happened. Maybe you should test it in Linux.

Sorry for the incomplete information I provided. We run the service on a Linux server in Docker.

  • linux:
    Rocky Linux release 9.2 (Blue Onyx)

  • docker version:
    Client: Docker Engine - Community
    Version: 24.0.2
    API version: 1.43
    Go version: go1.20.4
    Git commit: cb74dfc
    Built: Thu May 25 21:53:24 2023
    OS/Arch: linux/amd64
    Context: default

    Server: Docker Engine - Community
    Engine:
    Version: 24.0.2
    API version: 1.43 (minimum version 1.12)
    Go version: go1.20.4
    Git commit: 659604f
    Built: Thu May 25 21:51:50 2023
    OS/Arch: linux/amd64
    Experimental: false
    containerd:
    Version: 1.6.21
    GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
    runc:
    Version: 1.1.7
    GitCommit: v1.1.7-0-g860f061
    docker-init:
    Version: 0.19.0
    GitCommit: de40ad0

  • docker-compose version
    Docker Compose version v2.17.3

  • log information
    2023-08-31 06:01:17.125 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPIDLE' for channel '[id: 0xa7a4f5d3]'
    2023-08-31 06:01:17.125 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPINTVL' for channel '[id: 0xa7a4f5d3]'
    2023-08-31 06:01:17.125 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPCNT' for channel '[id: 0xa7a4f5d3]'
    2023-08-31 06:01:17.126 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_USER_TIMEOUT' for channel '[id: 0xa7a4f5d3]'
    2023-08-31 06:01:17.128 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPIDLE' for channel '[id: 0x3305fcec]'
    2023-08-31 06:01:17.128 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPINTVL' for channel '[id: 0x3305fcec]'
    2023-08-31 06:01:17.128 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPCNT' for channel '[id: 0x3305fcec]'
    2023-08-31 06:01:17.128 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_USER_TIMEOUT' for channel '[id: 0x3305fcec]'
    2023-08-31 06:01:17.130 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPIDLE' for channel '[id: 0xa978a8c4]'
    2023-08-31 06:01:17.130 WARN 1 --- [xecutorLoop-3-2] io.netty.bootstrap.Bootstrap : Unknown channel option 'io.netty.channel.epoll.EpollChannelOption#TCP_KEEPINTVL' for channel '[id: 0xa978a8c4]'

  • jdk
    1.8

  • netty
    netty-transport-native-epoll:4.1.65.Final(linux-x86_64)

  • springboot
    2.6.15

@flyletc

flyletc commented Sep 1, 2023

@flyletc I think you are testing in Windows, so the [Unknown channel option] error logs happened. Maybe you should test it in Linux.

Sorry, it's due to a dependency conflict
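
For anyone hitting the same warnings: Netty logs "Unknown channel option" when an option is not supported by the channel type actually in use, which is what would happen if the EpollChannelOption values end up on a non-epoll (for example NIO) channel because the native transport could not be loaded. A minimal diagnostic sketch, assuming nothing beyond the JDK at compile time: it looks up Netty's Epoll class reflectively so it still compiles and runs without Netty. In an application that already has Netty on the classpath, calling Epoll.isAvailable() and Epoll.unavailabilityCause() directly would be simpler. The class name EpollCheck is illustrative.

```java
import java.lang.reflect.Method;

public class EpollCheck {

    /** Describes whether Netty's native epoll transport is usable in this JVM. */
    public static String describe() {
        try {
            Class<?> epoll = Class.forName("io.netty.channel.epoll.Epoll");
            Method isAvailable = epoll.getMethod("isAvailable");
            if ((Boolean) isAvailable.invoke(null)) {
                return "epoll transport available";
            }
            // unavailabilityCause() returns the Throwable explaining why loading failed
            Object cause = epoll.getMethod("unavailabilityCause").invoke(null);
            return "epoll transport unavailable: " + cause;
        } catch (ClassNotFoundException e) {
            return "netty-transport-native-epoll is not on the classpath";
        } catch (ReflectiveOperationException | LinkageError e) {
            return "could not query epoll transport: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this inside the same container and with the same classpath as the application shows whether the native library was actually loaded, which can help narrow down a missing classifier or a dependency conflict.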
