-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new update session manager based on Azure Table optimistic locking #61
Conversation
29e0ba3
to
68d06dc
Compare
{ | ||
public class UpdateDomainEntity : TableEntity | ||
{ | ||
public UpdateDomainEntity() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This represents a row in the Azure Table, it's used for all Yams rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant the empty constructor, can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, a default constructor is required by Azure Table client. Given that there is no other constructors, it can probably be removed (it's implicit).
{ | ||
private readonly TimeSpan _ttl; | ||
private readonly CloudTable _table; | ||
public const string UpdateSessionTableName = "YamsUpdateSession"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these consts are used elsewhere, can we move them to a Constants file?
if (updateDomainEntity != null) | ||
{ | ||
// filter out entities that do not match the active update domain (were leftover for some reason) | ||
instancesEntities = instancesEntities.Where(entity => entity.UpdateDomain == updateDomainEntity.UpdateDomain); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid iterating twice maybe by adding the check in the foreach above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not possible because the updateDomainEntity must be found first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you run a query to find the entity with the RowKey equal to UpdateDomainRowKey and then another query maybe to find the non-expired ones matching the UpdateDomain? Not a big concern but if you're querying over indexed facets it will likely be faster than iterating for big clusters. Your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query twice would be problematic because the status of the cluster might change in between. I can use a dictionary to avoid looping twice, should be easy.
{ | ||
public class UpdateSessionStatus | ||
{ | ||
public IEnumerable<UpdateDomainEntity> InstancesEntities { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you made this a List or even an array, you'll get a count for free and then you won't have to do .Any() later which causes to iterate over? Also, does it need to be IEnumerable? We're not iterating over the list aside from the "exists" check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any does not iterate over, it returns right away if there is one element in the backing container. It doesn't need to be an IEnumerable but there is no downsides in this case.
@@ -0,0 +1,73 @@ | |||
using System; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this just renamed, hard to review when they don't show side by side, are there any changes to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sadly GitHub does not detect that this is a rename. There was some changes in wiring dependencies and they are tested in end to end tests.
int startUpdateSessionRetryIntervalInSeconds = 1) | ||
{ | ||
var containerBuilder = new ContainerBuilder(); | ||
containerBuilder.RegisterType<BlobLeaseFactory>().As<IBlobLeaseFactory>().SingleInstance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need the BlobLeaseFactory registered (or at all)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I'll remove it, thanks
} | ||
catch (StorageException e) | ||
{ | ||
if (e.RequestInformation.HttpStatusCode != 412 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document with a comment why it doesn't throw for 412 and 409?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point, will do.
throw; | ||
} | ||
} | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning false can only happen for 412 and 409 so why don't we do that right away to keep the code tighter together?
string updateDomain, | ||
string connectionString, | ||
TimeSpan updateSessionTtl, | ||
int storageExceptionRetryCount = 20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have to default the parameter values or can you force the caller to always pass them in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are defaulted here for convenience for people who don't care
containerBuilder.RegisterType<BlobLeaseFactory>().As<IBlobLeaseFactory>().SingleInstance(); | ||
|
||
containerBuilder.Register<RetryStrategy>( | ||
c => new FixedInterval(storageExceptionRetryCount, TimeSpan.FromSeconds(storageExceptionRetryIntervalInSeconds))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exponential backoff retry maybe?
@@ -21,6 +21,7 @@ public class YamsConfigBuilder | |||
_appHeartBeatTimeout = TimeSpan.FromSeconds(60); | |||
_ipcConnectTimeout = TimeSpan.FromSeconds(60); | |||
_appInitTimeout = TimeSpan.FromSeconds(60); | |||
_updateSessionTtl = TimeSpan.FromHours(1); | |||
} | |||
|
|||
public YamsConfigBuilder SetCheckForUpdatesPeriodInSeconds(int value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value passed in not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TTL is more for leftover rows if an instance has crashed and left the cluster in an inconsistent state. This should be very rare. The advantage of TTL is that users won't have to manually clear the table in such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment made it to the wrong place, I was referring to the SetCheckForUpdatesPeriodInSeconds method, which doesn't use the int value passed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckForUpdatesPeriodInSeconds defines the frequency at which Yams pools the DeploymentConfig.json
for changes. There is a default value of 10 seconds but it can be changed using SetCheckForUpdatesPeriodInSeconds
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issue here, the diff view hides a line of code I thought it was missing. :)
{ | ||
// set a new update domain (if no other instance beats us to it) | ||
transaction.ReplaceUpdateDomain(updateSessionStatus); // will fail if current update domain changes | ||
transaction.FailIfInstanceListModified(updateSessionStatus); // will fail if instance list changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this in ReplaceUpdateDomain? No other calls to FailIfInstanceListModified...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could but then I'd be giving the transaction class knowledge about how I am managing update domains, which is not part of its responsibility.
{ | ||
// handle the case where an instance enlisted itself after the update domain has changed, | ||
string updateDomain = await _updateSessionTable.GetActiveUpdateDomain(_clusterId, appId); | ||
if (updateDomain != _instanceUpdateDomain) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the active update domain changes after this if succeeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instance will just fail to start the update session and try again later. This is the main trick used to avoid locking between instances in the same update domain.
|
||
public async Task DeleteInstanceEntity(string clusterId, string instanceId, string appId) | ||
{ | ||
TableOperation retrieveOperation = TableOperation.Retrieve<UpdateDomainEntity>(GetPartitionKey(clusterId, appId), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe create a PartitionKey class initialized with clusterId and appId that sets the partition key once and exposes it through a getter? Then you pass/inject that instance around? GetPartitionKey is invariant in its result.
var instanceEntity = (UpdateDomainEntity)retrievedResult.Result; | ||
if (instanceEntity != null) | ||
{ | ||
TableOperation deleteOperation = TableOperation.Delete(instanceEntity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you can't just delete, you have to retrieve the entity and then delete it? Aw...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly... maybe they did it this way to avoid race conditions between the entity changing and being deleted.
68d06dc
to
37a6dc6
Compare
This PR replaces the current blob based update session manager with one that is based on Azure table. It uses the same storage account, so the transition should be seamless.
The update session manager is the component that is used to coordinate between different nodes during a rolling upgrade. The current blob based solution uses a blob for inter-node synchronization. The blob contains the current list of nodes being updated and is locked when a node tries to enlist/delist itself to avoid race conditions. This works well for a relatively small number of nodes but the blob can quickly become a bottleneck when a large number of nodes is involved.
The new Azure table based approach uses optimistic locking in a way that if a node is in the active update domain, it'll succeed in enlisting itself and immediately start updating without any locking. Optimistic locking will only take place when moving from one update domain to another (to avoid race conditions between different update domains being activated and race conditions between changing the current update domain concurrently while a node in the current update domain is enlisting itself concurrently). This approach should speed upgrades significantly when a large number of nodes is involved.
I added tests for all race conditions that I can think of and it's looking good!
The code is hopefully self explanatory :)