Receive: Implement a sloppy quorum #5809

Open
matej-g opened this issue Oct 21, 2022 · 2 comments

matej-g commented Oct 21, 2022

This is an idea I've had for some time; I was wondering about the opinion of others.

Is your proposal related to a problem?

Even though the receiver is scalable by increasing or decreasing the number of nodes in a hashring, when replication is on, the way the receiver handles requests (splitting each incoming request into a smaller batch per node and replicating those batched requests) means a user can afford to lose only a small number of nodes before we can no longer guarantee quorum and have to respond to the client with a failure (for example, with replication factor 3 it's only one node). In other words, increasing the number of nodes in the hashring does not increase the resiliency of the system. On the contrary, the larger the hashring, the higher the probability that more than one node will be down at the same time.
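
For illustration, here is a minimal sketch of the quorum arithmetic behind that statement, assuming the write quorum is a strict majority of the replication factor (which matches the "replication factor 3 tolerates one failure" example above); the helper names are made up for this sketch and are not actual Thanos identifiers:

```go
package main

import "fmt"

// writeQuorum returns the minimum number of successful replica writes
// needed to acknowledge a request, assuming a strict-majority quorum.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// tolerableFailures returns how many replica writes can fail while the
// write quorum can still be reached.
func tolerableFailures(replicationFactor int) int {
	return replicationFactor - writeQuorum(replicationFactor)
}

func main() {
	for _, rf := range []int{1, 3, 5} {
		fmt.Printf("replication factor %d: quorum %d, tolerable failures %d\n",
			rf, writeQuorum(rf), tolerableFailures(rf))
	}
	// e.g. replication factor 3: quorum 2, tolerable failures 1
	//      replication factor 5: quorum 3, tolerable failures 2
}
```

Note that the number of tolerable failures grows only with the replication factor, not with the size of the hashring, which is exactly the limitation described above.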

Under normal circumstances and for many use cases this might be fine - as is widely known, Prometheus clients will keep retrying a request that returns a 5xx for up to 2 hours, which should be enough to resolve most issues, so there is only a small danger of data loss. However, for users who want to guarantee higher write availability and "near-live" data querying, a time window of up to 2 hours can be unacceptable. A good example is users with alerts defined on metrics written to the receiver - in such a situation, they are clearly interested in having those metrics available ASAP. Even though the receiver does not do any rollback on partial writes (i.e. even if quorum is not reached, some requests may have been written despite the receiver returning a 5xx error), so these metrics might be available, this cannot be guaranteed, and for the duration of any outage it is unknown whether the request was written to at least one node.

Describe the solution you'd like

For cases where higher availability of receive writes is desired, we could consider accepting a sloppy quorum. This could take the form of nodes accepting and temporarily storing writes on behalf of others, and then retrying to send the data to the appropriate node with backoff. A very rough idea would be to use something akin to the agent, which would periodically retry delivering the data to the appropriate node.
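
A very rough sketch of what such a hinted-handoff retry loop could look like; all types and names here are hypothetical, not existing Thanos code, and only illustrate the "store on behalf of another node and retry with backoff" idea:

```go
package handoff

import (
	"context"
	"time"
)

// hint represents a write that could not be delivered to its intended
// node and is stored locally on behalf of that node.
type hint struct {
	targetNode string
	payload    []byte // serialized remote-write request
}

// forwarder is anything that can deliver a payload to a given node.
type forwarder interface {
	Forward(ctx context.Context, node string, payload []byte) error
}

// runHandoff drains stored hints and retries delivering each one to its
// intended node, roughly like an agent that keeps the data around until
// the target becomes reachable again.
func runHandoff(ctx context.Context, fw forwarder, hints <-chan hint) {
	for {
		select {
		case <-ctx.Done():
			return
		case h := <-hints:
			go retryDeliver(ctx, fw, h)
		}
	}
}

// retryDeliver attempts delivery with exponential backoff until it
// succeeds or the context is cancelled.
func retryDeliver(ctx context.Context, fw forwarder, h hint) {
	backoff := time.Second
	const maxBackoff = 5 * time.Minute
	for {
		if err := fw.Forward(ctx, h.targetNode, h.payload); err == nil {
			return // delivered; the hint can be dropped
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

In practice the hints would presumably need to be persisted (e.g. in a WAL, similar to what the agent does) rather than kept in memory, so they survive a restart of the node holding them.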

To ensure instant read availability, we might consider accepting a sloppy quorum only if at least one replica successfully ended up on the correct node and can be queried. Another option, although probably more complex and less performant, would be to make it possible to query the data that is being temporarily stored on behalf of another node.
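
A minimal sketch of that acceptance rule, again with hypothetical names: the request is acknowledged only if at least one replica landed on the node that actually owns the series (so it is immediately queryable) and the total number of durable copies, whether written directly or stored as a hint, reaches the quorum:

```go
package handoff

// replicaResult summarizes one replica write attempt.
type replicaResult struct {
	onCorrectNode bool  // written to the node that owns the series
	hinted        bool  // stored locally on behalf of the unreachable node
	err           error // non-nil if neither succeeded
}

// sloppyQuorumReached reports whether a request can be acknowledged under
// the relaxed rule: at least one copy on the correct node plus enough
// durable copies (direct or hinted) to reach the quorum.
func sloppyQuorumReached(results []replicaResult, quorum int) bool {
	correct, durable := 0, 0
	for _, r := range results {
		if r.err != nil {
			continue
		}
		if r.onCorrectNode {
			correct++
		}
		durable++
	}
	return correct >= 1 && durable >= quorum
}
```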

Describe alternatives you've considered

  • Take advantage of the built-in forward retry mechanism in the receiver - this works fine for intermittent network errors, but cannot help during longer node outages.
  • Increase the replication factor - with a higher replication factor, a user can afford to lose more nodes. However, this results in unnecessary data redundancy and the overhead of storing the extra replicated data, even though the user has no need for it.

Additional context

Performance implications should certainly be considered here, as receivers already have a large footprint (see e.g. #5751). Nevertheless, the trade-off might be favorable for the use cases described above.

cc @philipgough

@ahurtaud
Contributor

Hello, we are starting to implement the stateless ruler, which now queries receive for ALERTS_FOR_STATE.
We would be very interested in this: as of today, if one receive ingester is down and for whatever reason does not recover quickly, we run into problems regardless of the replication factor we set.

matej-g commented Jan 31, 2023

Thanks @ahurtaud, unfortunately I don't think I'll be able to take this on in the foreseeable future 😢, but I'd be happy to let someone else take over / help review etc.
