Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new update session manager based on Azure Table optimistic locking #61

Merged
merged 1 commit into from
Nov 8, 2017

Conversation

nehmebilal
Copy link
Collaborator

@nehmebilal nehmebilal commented Nov 5, 2017

This PR replaces the current blob based update session manager with one that is based on Azure table. It uses the same storage account, so the transition should be seamless.

The update session manager is the component that is used to coordinate between different nodes during a rolling upgrade. The current blob based solution uses a blob for inter-node synchronization. The blob contains the current list of nodes being updated and is locked when a node tries to enlist/delist itself to avoid race conditions. This works well for a relatively small number of nodes but the blob can quickly become a bottleneck when a large number of nodes is involved.

The new Azure table based approach uses optimistic locking in a way that if a node is in the active update domain, it'll succeed in enlisting itself and immediately start updating without any locking. Optimistic locking will only take place when moving from one update domain to another (to avoid race conditions between different update domains being activated and race conditions between changing the current update domain concurrently while a node in the current update domain is enlisting itself concurrently). This approach should speed upgrades significantly when a large number of nodes is involved.

I added tests for all race conditions that I can think of and it's looking good!

The code is hopefully self explanatory :)

{
public class UpdateDomainEntity : TableEntity
{
public UpdateDomainEntity()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This represents a row in the Azure Table, it's used for all Yams rows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the empty constructor, can it be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, a default constructor is required by Azure Table client. Given that there is no other constructors, it can probably be removed (it's implicit).

{
private readonly TimeSpan _ttl;
private readonly CloudTable _table;
public const string UpdateSessionTableName = "YamsUpdateSession";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these consts are used elsewhere, can we move them to a Constants file?

if (updateDomainEntity != null)
{
// filter out entities that do not match the active update domain (were leftover for some reason)
instancesEntities = instancesEntities.Where(entity => entity.UpdateDomain == updateDomainEntity.UpdateDomain);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid iterating twice maybe by adding the check in the foreach above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not possible because the updateDomainEntity must be found first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you run a query to find the entity with the RowKey equal to UpdateDomainRowKey and then another query maybe to find the non-expired ones matching the UpdateDomain? Not a big concern but if you're querying over indexed facets it will likely be faster than iterating for big clusters. Your call.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query twice would be problematic because the status of the cluster might change in between. I can use a dictionary to avoid looping twice, should be easy.

{
public class UpdateSessionStatus
{
public IEnumerable<UpdateDomainEntity> InstancesEntities { get; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you made this a List or even an array, you'll get a count for free and then you won't have to do .Any() later which causes to iterate over? Also, does it need to be IEnumerable? We're not iterating over the list aside from the "exists" check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any does not iterate over, it returns right away if there is one element in the backing container. It doesn't need to be an IEnumerable but there is no downsides in this case.

@@ -0,0 +1,73 @@
using System;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this just renamed, hard to review when they don't show side by side, are there any changes to it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sadly GitHub does not detect that this is a rename. There was some changes in wiring dependencies and they are tested in end to end tests.

int startUpdateSessionRetryIntervalInSeconds = 1)
{
var containerBuilder = new ContainerBuilder();
containerBuilder.RegisterType<BlobLeaseFactory>().As<IBlobLeaseFactory>().SingleInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need the BlobLeaseFactory registered (or at all)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I'll remove it, thanks

}
catch (StorageException e)
{
if (e.RequestInformation.HttpStatusCode != 412
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document with a comment why it doesn't throw for 412 and 409?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point, will do.

throw;
}
}
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning false can only happen for 412 and 409 so why don't we do that right away to keep the code tighter together?

string updateDomain,
string connectionString,
TimeSpan updateSessionTtl,
int storageExceptionRetryCount = 20,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have to default the parameter values or can you force the caller to always pass them in?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are defaulted here for convenience for people who don't care

containerBuilder.RegisterType<BlobLeaseFactory>().As<IBlobLeaseFactory>().SingleInstance();

containerBuilder.Register<RetryStrategy>(
c => new FixedInterval(storageExceptionRetryCount, TimeSpan.FromSeconds(storageExceptionRetryIntervalInSeconds)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exponential backoff retry maybe?

@@ -21,6 +21,7 @@ public class YamsConfigBuilder
_appHeartBeatTimeout = TimeSpan.FromSeconds(60);
_ipcConnectTimeout = TimeSpan.FromSeconds(60);
_appInitTimeout = TimeSpan.FromSeconds(60);
_updateSessionTtl = TimeSpan.FromHours(1);
}

public YamsConfigBuilder SetCheckForUpdatesPeriodInSeconds(int value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value passed in not used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TTL is more for leftover rows if an instance has crashed and left the cluster in an inconsistent state. This should be very rare. The advantage of TTL is that users won't have to manually clear the table in such cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment made it to the wrong place, I was referring to the SetCheckForUpdatesPeriodInSeconds method, which doesn't use the int value passed in.

Copy link
Collaborator Author

@nehmebilal nehmebilal Nov 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckForUpdatesPeriodInSeconds defines the frequency at which Yams pools the DeploymentConfig.json for changes. There is a default value of 10 seconds but it can be changed using SetCheckForUpdatesPeriodInSeconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issue here, the diff view hides a line of code I thought it was missing. :)

{
// set a new update domain (if no other instance beats us to it)
transaction.ReplaceUpdateDomain(updateSessionStatus); // will fail if current update domain changes
transaction.FailIfInstanceListModified(updateSessionStatus); // will fail if instance list changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this in ReplaceUpdateDomain? No other calls to FailIfInstanceListModified...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could but then I'd be giving the transaction class knowledge about how I am managing update domains, which is not part of its responsibility.

{
// handle the case where an instance enlisted itself after the update domain has changed,
string updateDomain = await _updateSessionTable.GetActiveUpdateDomain(_clusterId, appId);
if (updateDomain != _instanceUpdateDomain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the active update domain changes after this if succeeds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instance will just fail to start the update session and try again later. This is the main trick used to avoid locking between instances in the same update domain.


public async Task DeleteInstanceEntity(string clusterId, string instanceId, string appId)
{
TableOperation retrieveOperation = TableOperation.Retrieve<UpdateDomainEntity>(GetPartitionKey(clusterId, appId),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe create a PartitionKey class initialized with clusterId and appId that sets the partition key once and exposes it through a getter? Then you pass/inject that instance around? GetPartitionKey is invariant in its result.

var instanceEntity = (UpdateDomainEntity)retrievedResult.Result;
if (instanceEntity != null)
{
TableOperation deleteOperation = TableOperation.Delete(instanceEntity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you can't just delete, you have to retrieve the entity and then delete it? Aw...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly... maybe they did it this way to avoid race conditions between the entity changing and being deleted.

@nehmebilal nehmebilal merged commit ab3fcbb into microsoft:master Nov 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants