Add exponential backoff for cluster connections. #121

Closed
nihohit wants to merge 1 commit from the cluster-backoff branch

Conversation


@nihohit nihohit commented Feb 18, 2024

Issue #, if available: valkey-io/valkey-glide#473

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

fn default() -> Self {
    const DEFAULT_CONNECTION_RETRY_EXPONENT_BASE: u32 = 2;
    const DEFAULT_CONNECTION_RETRY_FACTOR: u32 = 100;
    const DEFAULT_NUMBER_OF_CONNECTION_RETRIESE: u32 = 6;


Typo: DEFAULT_NUMBER_OF_CONNECTION_RETRIESE => DEFAULT_NUMBER_OF_CONNECTION_RETRIES
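For intuition, here is a minimal sketch (not the PR's code) of the delay schedule these defaults imply, assuming the wait before attempt n is factor · exponent_base^n milliseconds:

```rust
// Sketch of the schedule under the PR defaults, assuming
// delay(n) = factor * exponent_base^n milliseconds.
fn main() {
    let factor: u64 = 100; // DEFAULT_CONNECTION_RETRY_FACTOR
    let base: u64 = 2; // DEFAULT_CONNECTION_RETRY_EXPONENT_BASE
    let retries: u32 = 6; // DEFAULT_NUMBER_OF_CONNECTION_RETRIES
    let mut total_ms = 0;
    for attempt in 0..retries {
        let delay_ms = factor * base.pow(attempt);
        total_ms += delay_ms;
        println!("attempt {attempt}: wait {delay_ms} ms");
    }
    // 100 + 200 + 400 + 800 + 1600 + 3200 = 6300 ms in total.
    println!("total backoff: {total_ms} ms");
}
```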

- let info = get_connection_info(node, params)?;
- C::connect(info, response_timeout, connection_timeout, socket_addr).await
+ let info = get_connection_info(node, params.clone())?;
+ let counter = std::sync::atomic::AtomicU32::new(params.exponential_backoff.number_of_retries);


My only concern is that, if I'm not mistaken, when we try to refresh connections we actually block the entire client from new requests, since we lock the connection container with the write lock, right?
Also: what is number_of_retries for a single node, and what is the max waiting time?
Since refresh_connections can be called with multiple connection identifiers, we should see whether we need to free the lock at some stages so the client won't get fully blocked (for example, retry the whole refresh_connections function rather than per node, as sketched below).
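A hypothetical sketch of that alternative: apply the backoff around a whole refresh pass instead of per node, so any lock taken during the pass is released while sleeping. `refresh_once` is an illustrative stand-in for a single refresh_connections pass (returning how many nodes still failed); none of these names come from the PR.

```rust
use std::time::Duration;

// Hypothetical: retry the whole refresh pass, not each node. Any lock taken
// inside `refresh_once` is dropped before we sleep, so client requests can
// proceed between attempts.
async fn retry_refresh<F, Fut>(mut refresh_once: F, retries: u32, factor_ms: u64, base: u64)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = usize>, // number of nodes that still failed
{
    for attempt in 0..retries {
        if refresh_once().await == 0 {
            return; // every node refreshed successfully
        }
        // No connection-container lock is held during the sleep.
        tokio::time::sleep(Duration::from_millis(factor_ms * base.pow(attempt))).await;
    }
}
```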


@ikolomi ikolomi left a comment


I'm not sure why we want this feature. What you're doing is adding exponential backoff on top of the TCP retry mechanism, which is already a form of exponential backoff, maxed out at about 130 seconds; see /proc/sys/net/ipv4/tcp_syn_retries.
Why would we have two exponential backoffs on top of each other?
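(For context, the ~130 s figure follows from the kernel's SYN retransmission schedule: assuming the standard 1 s initial retransmission timeout and the default tcp_syn_retries = 6, the SYN is resent after 1, 2, 4, 8, 16, and 32 seconds, and the attempt gives up 64 s after the last retransmission, for a total of about 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 s.)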

params.exponential_backoff.factor as u64,
),
multiplier: params.exponential_backoff.exponent_base as f64,
max_elapsed_time: None,


We should have max elapsed time set to tens of minutes (if that is what this parameter means).
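For scale: under the PR defaults (factor = 100 ms, base = 2, 6 retries) the delays sum to 100 · (2^6 − 1) = 6300 ms, i.e. about 6.3 s, so a cap in the tens of minutes would only bite with much larger parameters. Assuming this is the backoff crate's max_elapsed_time field, a ten-minute cap would read max_elapsed_time: Some(Duration::from_secs(600)) (value illustrative only).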

}
})
})
.await


This might block the method for a very long time?

@@ -75,6 +76,27 @@ impl RetryParams {
}
}

#[derive(Clone)]
pub(crate) struct ExponentialBackoffStrategy {
pub(crate) exponent_base: u32,


Why does this differ from ExponentialBackoff of std? Actually, the std terminology is the correct one: you have a base and a multiplier.
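For comparison, here is a minimal sketch of the same defaults expressed in the vocabulary of the backoff crate's ExponentialBackoff (presumably the type the diff builds and the comment refers to; Rust's std itself has no ExponentialBackoff). Note that number_of_retries has no direct counterpart there, since that crate bounds retries by elapsed time rather than by count:

```rust
use backoff::ExponentialBackoff;
use std::time::Duration;

fn main() {
    // The PR's { factor: 100, exponent_base: 2 } in backoff-crate terms.
    let _strategy = ExponentialBackoff {
        initial_interval: Duration::from_millis(100), // factor
        multiplier: 2.0,                              // exponent_base
        max_elapsed_time: None,                       // unbounded, as in the diff
        ..Default::default()
    };
}
```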


shachlanAmazon commented Feb 28, 2024

> My only concern is that, if I'm not mistaken, when we try to refresh connections we actually block the entire client from new requests, since we lock the connection container with the write lock, right?

Actually, we block new requests both by taking the lock and by changing the state to ConnectionState::Recover(RecoverFuture::Reconnect(future)), which, as far as I can see, blocks new requests. So no matter what we do, if we take longer to refresh connections, we'll take longer to accept new requests. We could change refresh_connections to only take the write lock on connection, but

fn poll_ready(

this will be blocked until the state is PollComplete.

@nihohit nihohit closed this Jul 12, 2024
@nihohit nihohit deleted the cluster-backoff branch July 12, 2024 10:17