-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MaxUnavailableZones computed by PartitionInstanceRing.GetReplicationSetsForOperation() #489
Conversation
With the implementation in this PR would we end up querying both instances in the same zone or only one of them (assuming we use DoUntilQuorum)? |
Both. That's what I meant by 👇
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but TestDefaultResultTracker_StartMinimumRequests_NoFailingRequests
is failing and it sounds somewhat related. Have you checked it?
I don't think it's related because the modified code doesn't touch the code run by that test at all, but I'm not in a hurry on this PR and tomorrow I will look at why |
Yes, it's unrelated: #491 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ring/partition_instance_ring_test.go
Outdated
"instance-1-a": {Id: "instance-1-a", State: ACTIVE, Zone: "a", Timestamp: now.Unix()}, | ||
"instance-2-a": {Id: "instance-2-a", State: ACTIVE, Zone: "a", Timestamp: now.Unix()}, | ||
"instance-2-b": {Id: "instance-2-b", State: ACTIVE, Zone: "b", Timestamp: now.Unix()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reverse partition-zone
naming is confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Fixed in a2609c7
…ionSetsForOperation() Signed-off-by: Marco Pracucci <[email protected]>
…sForOperation Signed-off-by: Marco Pracucci <[email protected]>
3e9574f
to
a2609c7
Compare
What this PR does:
PartitionInstanceRing.GetReplicationSetsForOperation()
currently returnsReplicationSet
withMaxUnavailableZones
computed aslen(instances) - 1
. This works fine if all instances are in different zones, but we have some (or all) instances in the same zone it will not work as expected becauseMaxUnavailableZones
will be >= the actual number of zones amonginstances
. For example, when that's the case, thezoneAwareResultTracker
will not issue any request at all because ifMaxUnavailableZones
is >= the actual number of zones... well the quorum is reached even without sending any request.In this PR I propose to change
MaxUnavailableZones
tolen(zones) - 1
. Naturally this has an impact on quorum if some (or all instances) run in the same zone, but we expect not to be the case when using the partitions ring. However, if that will happen (e.g. local dev env, a demo, ...) after this change the system will keep working as expected, instead of weird bugs being surfaced (e.g. no requests sent to instances at all).Benchmark:
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]