Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MaxUnavailableZones computed by PartitionInstanceRing.GetReplicationSetsForOperation() #489

Merged
merged 2 commits into from
Feb 20, 2024

Conversation

pracucci
Copy link
Contributor

What this PR does:

PartitionInstanceRing.GetReplicationSetsForOperation() currently returns ReplicationSet with MaxUnavailableZones computed as len(instances) - 1. This works fine if all instances are in different zones, but we have some (or all) instances in the same zone it will not work as expected because MaxUnavailableZones will be >= the actual number of zones among instances. For example, when that's the case, the zoneAwareResultTracker will not issue any request at all because if MaxUnavailableZones is >= the actual number of zones... well the quorum is reached even without sending any request.

In this PR I propose to change MaxUnavailableZones to len(zones) - 1. Naturally this has an impact on quorum if some (or all instances) run in the same zone, but we expect not to be the case when using the partitions ring. However, if that will happen (e.g. local dev env, a demo, ...) after this change the system will keep working as expected, instead of weird bugs being surfaced (e.g. no requests sent to instances at all).

Benchmark:

goos: darwin
goarch: amd64
pkg: github.com/grafana/dskit/ring
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
                                                        │ before.txt  │             after.txt              │
                                                        │   sec/op    │   sec/op     vs base               │
PartitionInstanceRing_GetReplicationSetsForOperation-12   37.99µ ± 1%   41.91µ ± 1%  +10.33% (p=0.002 n=6)

                                                        │  before.txt  │            after.txt            │
                                                        │     B/op     │     B/op      vs base           │
PartitionInstanceRing_GetReplicationSetsForOperation-12   32.88Ki ± 0%   32.88Ki ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

                                                        │ before.txt │           after.txt           │
                                                        │ allocs/op  │ allocs/op   vs base           │
PartitionInstanceRing_GetReplicationSetsForOperation-12   101.0 ± 0%   101.0 ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Which issue(s) this PR fixes:

N/A

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@dimitarvdimitrov
Copy link
Contributor

dimitarvdimitrov commented Feb 19, 2024

we have some (or all) instances in the same zone

With the implementation in this PR would we end up querying both instances in the same zone or only one of them (assuming we use DoUntilQuorum)?

@pracucci
Copy link
Contributor Author

we have some (or all) instances in the same zone

With the implementation in this PR would we end up querying both instances in the same zone or only one of them (assuming we use DoUntilQuorum)?

Both. That's what I meant by 👇

Naturally this has an impact on quorum if some (or all instances) run in the same zone, but we expect not to be the case when using the partitions ring. However, if that will happen (e.g. local dev env, a demo, ...) after this change the system will keep working as expected, instead of weird bugs being surfaced (e.g. no requests sent to instances at all).

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests is failing and it sounds somewhat related. Have you checked it?

@pracucci
Copy link
Contributor Author

TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests

I don't think it's related because the modified code doesn't touch the code run by that test at all, but I'm not in a hurry on this PR and tomorrow I will look at why TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests is flaky.

@pracucci
Copy link
Contributor Author

TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests

I don't think it's related because the modified code doesn't touch the code run by that test at all, but I'm not in a hurry on this PR and tomorrow I will look at why TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests is flaky.

Yes, it's unrelated: #491

@pracucci pracucci mentioned this pull request Feb 20, 2024
4 tasks
@pracucci pracucci marked this pull request as ready for review February 20, 2024 11:06
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 64 to 66
"instance-1-a": {Id: "instance-1-a", State: ACTIVE, Zone: "a", Timestamp: now.Unix()},
"instance-2-a": {Id: "instance-2-a", State: ACTIVE, Zone: "a", Timestamp: now.Unix()},
"instance-2-b": {Id: "instance-2-b", State: ACTIVE, Zone: "b", Timestamp: now.Unix()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reverse partition-zone naming is confusing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Fixed in a2609c7

@pracucci pracucci force-pushed the improve-max-unavailable-zones branch from 3e9574f to a2609c7 Compare February 20, 2024 13:06
@pracucci pracucci enabled auto-merge (squash) February 20, 2024 13:06
@pracucci pracucci merged commit 501176d into main Feb 20, 2024
3 checks passed
@pracucci pracucci deleted the improve-max-unavailable-zones branch February 20, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants