Lettuce cannot recover from connection problems #1428
Comments
I did some research on the problem. I tried to figure out how redis-cli is able to send keep-alives, and found the following code:
/* Set aggressive KEEP_ALIVE socket option in the Redis context socket
* in order to prevent timeouts caused by the execution of long
* commands. At the same time this improves the detection of real
* errors. */
anetKeepAlive(NULL, context->fd, REDIS_CLI_KEEPALIVE_INTERVAL);
Which leads us to: https://github.com/redis/redis/blob/efb6495a446a92328512f8a66db701dab95fb933/src/anet.c#L95
int anetKeepAlive(char *err, int fd, int interval)
{
    int val = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) == -1)
    {
        anetSetError(err, "setsockopt SO_KEEPALIVE: %s", strerror(errno));
        return ANET_ERR;
    }
#ifdef __linux__
    /* Default settings are more or less garbage, with the keepalive time
     * set to 7200 by default on Linux. Modify settings to make the feature
     * actually useful. */
    /* Send first probe after interval. */
    val = interval;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPIDLE: %s\n", strerror(errno));
        return ANET_ERR;
    }
    /* Send next probes after the specified interval. Note that we set the
     * delay as interval / 3, as we send three probes before detecting
     * an error (see the next setsockopt call). */
    val = interval/3;
    if (val == 0) val = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPINTVL: %s\n", strerror(errno));
        return ANET_ERR;
    }
    /* Consider the socket in error state after three we send three ACK
     * probes without getting a reply. */
    val = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) < 0) {
        anetSetError(err, "setsockopt TCP_KEEPCNT: %s\n", strerror(errno));
        return ANET_ERR;
    }
#else
    ((void) interval); /* Avoid unused var warning for non Linux systems. */
#endif
    return ANET_OK;
}
As we can see, keep-alive is achieved by setting SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT on the socket, that is, inside the process. That is probably why we were able to fix the issue by tweaking the equivalent parameters at the OS level. It would be best if Lettuce could set those parameters at the application level. After some research I figured it could be done in at least 2 ways:
It does not look like an easy task, but I did not find any other, easier solution. |
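For readers who do not read C, the parameter choices made by `anetKeepAlive` above can be sketched in plain Java. This is an illustration only, not Redis or Lettuce code: the class `KeepAliveSettings` and its methods are hypothetical names, the guard against non-positive intervals is an addition, and the detection-time formula (idle time plus probe count times probe interval) is an approximation of how the Linux keep-alive timers behave.

```java
import java.time.Duration;

/** Hypothetical helper mirroring the parameter choices in anetKeepAlive(). */
final class KeepAliveSettings {
    final int idleSeconds;      // TCP_KEEPIDLE: idle time before the first probe
    final int intervalSeconds;  // TCP_KEEPINTVL: delay between unanswered probes
    final int probeCount;       // TCP_KEEPCNT: unanswered probes before giving up

    private KeepAliveSettings(int idleSeconds, int intervalSeconds, int probeCount) {
        this.idleSeconds = idleSeconds;
        this.intervalSeconds = intervalSeconds;
        this.probeCount = probeCount;
    }

    /** Same derivation as the C code: idle = interval, probe delay = interval / 3 (at least 1), 3 probes. */
    static KeepAliveSettings fromInterval(int interval) {
        if (interval <= 0) {
            // Added guard; the C function does not validate its argument.
            throw new IllegalArgumentException("interval must be positive");
        }
        int probeInterval = Math.max(1, interval / 3);
        return new KeepAliveSettings(interval, probeInterval, 3);
    }

    /** Approximate time until a silent peer is declared dead. */
    Duration approximateDetectionTime() {
        return Duration.ofSeconds(idleSeconds + (long) intervalSeconds * probeCount);
    }

    public static void main(String[] args) {
        KeepAliveSettings s = fromInterval(15);
        System.out.println("idle=" + s.idleSeconds
                + "s interval=" + s.intervalSeconds
                + "s count=" + s.probeCount
                + " detection~" + s.approximateDetectionTime().getSeconds() + "s");
    }
}
```

With an interval of 15 seconds this yields 15/5/3, so under these assumptions a dead peer on an idle connection would be noticed after roughly 30 seconds instead of hours.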
Have you seen I'm not sure whether netty supports already |
Thank you for the hint. I've managed to fix the problem by adding netty-transport-native-epoll to the classpath and configuring Netty:
// Enable SO_KEEPALIVE on the connection.
SocketOptions socketOptions = SocketOptions.builder()
        .keepAlive(true)
        .build();
// Tune the keep-alive timers (requires the native epoll transport, Linux only).
ClientResources clientResources = ClientResources.builder()
        .nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);  // seconds idle before the first probe
                bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);  // seconds between probes
                bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);    // unanswered probes before the socket errors
            }
        })
        .build();
RedisClient client = RedisClient.create(clientResources, node);
// setOptions expects ClientOptions, so wrap the socket options.
client.setOptions(ClientOptions.builder().socketOptions(socketOptions).build());
I also submitted a bug to Redis: redis/redis#7855, because we think this should be documented a little better. Without the above code, Pub/Sub will work incorrectly after network issues. It was quite challenging to reproduce and troubleshoot this issue. |
Thanks for letting us know. |
It seems that it's possible to make it work without the EPOLL native library, using the default NIO transport:
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 15);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 5);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 3); |
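The NIO variant above relies on `jdk.net.ExtendedSocketOptions`, whose keep-alive options were added in JDK 11 and are not available on every platform. A minimal sketch, using only the JDK and no Netty or Lettuce classes, of how one might check for support before applying them. The class name `NioKeepAlive` and its method names are hypothetical.

```java
import java.io.IOException;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

/** Hypothetical helper: applies extended keep-alive options only where the platform supports them. */
final class NioKeepAlive {

    /** True if this JDK and OS expose all three extended keep-alive options for the channel. */
    static boolean extendedOptionsSupported(SocketChannel channel) {
        Set<SocketOption<?>> supported = channel.supportedOptions();
        return supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                && supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                && supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT);
    }

    /**
     * Enables SO_KEEPALIVE and, if supported, tunes the timers.
     * Returns false when only the OS default timers are in effect.
     */
    static boolean configure(SocketChannel channel, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        if (!extendedOptionsSupported(channel)) {
            return false;
        }
        channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            boolean tuned = configure(channel, 15, 5, 3);
            System.out.println("extended keep-alive options applied: " + tuned);
        }
    }
}
```

A `false` result means keep-alive is enabled but still governed by the OS defaults, which is the situation the bug report describes as taking hours to detect a dead connection.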
Just wanted to note that this isn't only a Pub/Sub issue, even in Redis: |
Thanks for the link. For now, we can enable and disable keep-alive without the extended configuration. The comments above outline how to manually configure channel options depending on the used channel type. I filed #1437 to add proper support of extended Keep-Alive options. |
Very interesting.
LD_PRELOAD=/the/path/libkeepalive.so \
KEEPCNT=20 \
KEEPIDLE=180 \
KEEPINTVL=60 \
java -jar /your/path/yourapp.jar & |
In our case, even though we implemented the keep-alive workaround with netty-transport-native-epoll, we still saw the 15-minute error happen (at a much reduced rate). We are connecting to Azure Redis. Has anybody else experienced this issue after deploying the fix? If not, then maybe it is another issue on our side... Thanks! |
In a Spring Boot 2.5.4 web application we use Spring Session Redis, which uses Lettuce. When Azure Redis is patched, we also get command timeouts for 15 minutes in our application running in Kubernetes (hence Linux), even though Redis correctly accepts new connections. I added netty-transport-native-epoll and configured the keep-alive, but we still face the 15-minute problem. Am I missing something obvious here? :) The code changes are rather small. The extra dependency:
The keep-alive config in a Spring configuration class:
|
Forgot to update: we finally fixed this by adding an extra socket option. The final add-on code looks something like this:
ClientResources clientResources = ClientResources.builder()
        .nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
                bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
                bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
                // Socket Timeout (milliseconds)
                bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 60000);
            }
        })
        .build();
// Enable keep-alive
SocketOptions socketOptions = SocketOptions.builder()
        .keepAlive(true)
        .build();
ClientOptions clientOptions = ClientOptions.builder()
        .socketOptions(socketOptions)
        .build();
We have not had the "15 mins connection timeout issue" for over 7 days now. You can try it out as well and see if it works for you. Cheers! |
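The thread does not say why the stall lasted about 15 minutes or why `TCP_USER_TIMEOUT` helped. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis: on Linux, keep-alive probes are only sent while a connection is idle. Once a command has been written and remains unacknowledged, the retransmission timer takes over, and the kernel gives up only after `tcp_retries2` retransmissions (default 15). The sketch below works through that arithmetic, assuming a minimum retransmission timeout of 200 ms that doubles on each attempt and is capped at 120 s. The class name `RetransmitTimeoutEstimate` is hypothetical.

```java
/** Hypothetical back-of-the-envelope estimate of the Linux retransmission give-up time. */
final class RetransmitTimeoutEstimate {
    static final long RTO_MIN_MILLIS = 200;      // assumed lower bound for the retransmission timeout
    static final long RTO_MAX_MILLIS = 120_000;  // assumed upper bound for the retransmission timeout

    /** Sum of the waits for the initial send plus the given number of retransmissions. */
    static long hypotheticalTimeoutMillis(int retries) {
        long total = 0;
        long rto = RTO_MIN_MILLIS;
        for (int attempt = 0; attempt <= retries; attempt++) {
            total += rto;
            rto = Math.min(rto * 2, RTO_MAX_MILLIS);  // exponential backoff with a cap
        }
        return total;
    }

    public static void main(String[] args) {
        long millis = hypotheticalTimeoutMillis(15);  // 15 = assumed default of tcp_retries2
        System.out.println(millis + " ms, about " + (millis / 60_000) + " minutes");
    }
}
```

Under these assumptions the total is 924,600 ms, a little over 15 minutes, which lines up with the stall reported above. Setting `TCP_USER_TIMEOUT` to 60000 ms, as in the snippet, would bound the wait for unacknowledged data at about one minute instead.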
Try adding a classifier to your dependency.
|
How can I do the same configuration for Jedis client? |
@adrianpasternak Is there a way to reproduce this on a single-node Redis instance? I am observing similar behavior on a single-node Redis, and the issue is not resolved unless I restart the application. |
@kishore25kumar yes, you should be able to reproduce the issue on a single node Redis instance. |
@NgSekLong |
Hello sir, I encountered a problem using this method: version: Sorry, I'm not very familiar with Netty. Can you give me some advice? |
@flyletc |
Sorry for the incomplete information I provided. We will run the service on a Linux server via Docker.
|
Sorry, it's due to a dependency conflict |
Bug Report
Current Behavior
During troubleshooting of our production issues with Lettuce and Redis Cluster, we have discovered issues with re-connection of Pub/Sub subscriptions after network problems.
Lettuce does not send any keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration, it will wait for hours until the OS closes the connection. In the meantime, all messages published to a channel are lost.
Input Code
Minimal code from Lettuce docs is enough to reproduce the issue.
To reproduce the issue:
We were also able to reproduce the issue with Redis Standalone:
Expected behavior/code
Lettuce should be able to detect a broken connection to fix Pub/Sub subscriptions.
Environment
Possible Solution
We ran similar tests using the redis-cli client. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.
It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to fix Pub/Sub subscriptions.
Workarounds
We've found a workaround for this problem by tweaking OS params (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would like to avoid changing OS params on all our machines that use Lettuce as a Redis client.
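To see what a given Linux host is currently configured with, the three parameters named above can be read from `/proc/sys/net/ipv4`. A minimal sketch, assuming a Linux host with that standard location (on other systems every lookup simply comes back empty). The class name `KernelKeepAliveDefaults` and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

/** Hypothetical helper that reads the kernel-wide TCP keep-alive settings on Linux. */
final class KernelKeepAliveDefaults {
    // Assumed location of the sysctl values on Linux.
    static final Path SYSCTL_DIR = Paths.get("/proc/sys/net/ipv4");

    /** Reads a single integer from a file, or returns empty if it is missing or malformed. */
    static OptionalInt readInt(Path file) {
        try {
            return OptionalInt.of(Integer.parseInt(Files.readString(file).trim()));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String[] names = {"tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"};
        for (String name : names) {
            OptionalInt value = readInt(SYSCTL_DIR.resolve(name));
            System.out.println(name + " = " + (value.isPresent() ? value.getAsInt() : "unavailable"));
        }
    }
}
```

These values only take effect for sockets that enable `SO_KEEPALIVE` and do not override them per socket.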