An idea I have had for some time; I was wondering about the opinion of others.
Is your proposal related to a problem?
Even though the receiver is scalable, when replication is on, the way the receiver handles requests (splitting each request into smaller per-node batches and replicating those batches) means a user can afford to lose only a small number of nodes before we can no longer guarantee quorum and have to respond to the client with a failure (for replication factor 3, that is just one node). This means that increasing the number of nodes in the hashring does not increase the resiliency of the system. On the contrary: the larger the hashring, the higher the probability that more than one node is down at the same time.
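To make the failure-tolerance math above concrete, here is a minimal sketch (the function names are illustrative, not part of the Thanos codebase) of how a majority quorum limits the number of nodes that may fail:

```go
package main

import "fmt"

// quorum returns the minimum number of successful replica writes
// required to acknowledge a write: a strict majority of the replicas.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// tolerableFailures returns how many replica nodes may be down
// before the write quorum can no longer be reached.
func tolerableFailures(replicationFactor int) int {
	return replicationFactor - quorum(replicationFactor)
}

func main() {
	for _, rf := range []int{3, 5, 7} {
		fmt.Printf("replication=%d quorum=%d tolerable failures=%d\n",
			rf, quorum(rf), tolerableFailures(rf))
	}
	// replication=3 quorum=2 tolerable failures=1
	// replication=5 quorum=3 tolerable failures=2
	// replication=7 quorum=4 tolerable failures=3
}
```

Note that the tolerable-failure count depends only on the replication factor, not on the total hashring size, which is exactly why adding nodes does not improve write availability.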
Under normal circumstances and for many use cases, this might be fine - as is widely known, Prometheus clients will keep retrying on 5xx responses for up to 2 hours, which should be enough time to resolve issues, so there is only a small danger of data loss. However, for users who want to guarantee higher write availability and "near-live" data querying, a time window of up to 2 hours can be unacceptable. A good example would be users with alerts defined on metrics written to the receiver - such users are clearly interested in having those metrics available ASAP. Even though the receiver does not roll back partial writes (i.e. even if quorum is not reached, some requests may be written despite the receiver returning a 5xx error) and these metrics might therefore be available, this cannot be guaranteed, and for the duration of any outage it is unknown whether the request will be written to at least one node.
Describe the solution you'd like
For cases where higher availability of receive writes is desired, we could consider accepting a sloppy quorum. This could take the form of nodes accepting and temporarily storing writes on behalf of others and then retrying to send the data to the appropriate node with backoff. A very rough idea would be to use something akin to the agent, which would periodically retry delivering the data to the appropriate node.
To ensure instant read availability, we might consider accepting a sloppy quorum only if at least one replica successfully ended up on the correct node and can be queried. Another option, although probably more complex and less performant, would be to make it possible to query data that is being temporarily stored on behalf of another node.
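The first option above boils down to a small acceptance rule. A minimal sketch, with a hypothetical `acceptWrite` helper that is not part of any existing API:

```go
package main

import "fmt"

// acceptWrite decides whether a replicated write can be acknowledged
// under sloppy quorum: enough copies must exist in total, AND at least
// one copy must have landed on a node that actually owns the series in
// the hashring, so the data is immediately queryable via the normal
// read path.
func acceptWrite(copiesOnOwners, copiesOnSubstitutes, quorum int) bool {
	return copiesOnOwners >= 1 && copiesOnOwners+copiesOnSubstitutes >= quorum
}

func main() {
	// Replication factor 3 => quorum 2.
	fmt.Println(acceptWrite(1, 1, 2)) // one real copy + one hint: accepted
	fmt.Println(acceptWrite(0, 3, 2)) // only hints, nothing queryable: rejected
	// true
	// false
}
```

This keeps the read path unchanged at the cost of rejecting writes when every owning replica is down, which is the trade-off between the two options described above.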
Describe alternatives you've considered
Take advantage of the built-in forward retry mechanism in the receiver - this works fine for intermittent network errors, but cannot help during longer node outages.
Increase the replication factor - with a higher replication factor, a user can afford to lose more nodes. However, this would result in unnecessary data redundancy and the overhead of storing the extra replicated data, even though the user has no need for it.
Additional context
Performance implications should certainly be considered here, as receivers already have a large footprint (see e.g. #5751). Nevertheless, the trade-off might be favorable for the use cases described above.
Hello, we are starting to implement the stateless ruler, which now queries receive for ALERTS_FOR_STATE.
We would be super interested in this: as of today, if one receive ingester is down and for whatever reason does not recover quickly, we have problems regardless of the replication factor we set.
Thanks @ahurtaud, unfortunately I don't think I'll be able to take this on in the foreseeable future 😢, but I'd be happy to let someone else take over / help review etc.
cc @philipgough