Potential data loss in DynamoDB backend #10181
Conversation
physical/dynamodb/dynamodb.go (outdated)
```go
			break
		} else {
			batch = output.UnprocessedItems
			time.Sleep(boff.NextBackOff())
		}
```
I think based on the defaults you need to handle boff.Stop as a potential value. backoff.Stop is the duration -1, which wouldn't sleep at all, so you'd go into a tight loop.
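For illustration, a minimal sketch of how that tight loop could be avoided, assuming the cenkalti/backoff package (the exact module path/version in the vendor tree may differ):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

// retrySleep demonstrates the concern above: NextBackOff can return
// backoff.Stop (time.Duration(-1)); sleeping on a negative duration returns
// immediately, so the caller must treat Stop as "give up" to avoid spinning.
func retrySleep(b backoff.BackOff) error {
	next := b.NextBackOff()
	if next == backoff.Stop {
		return errors.New("backoff exhausted; giving up on unprocessed items")
	}
	time.Sleep(next)
	return nil
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 100 * time.Millisecond // small budget so Stop is hit quickly
	for {
		if err := retrySleep(b); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```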
Also needs a quick remerge.
It is my understanding that with MaxElapsedTime = 0 on line 510, the exponential backoff will never time out and never return backoff.Stop; instead it caps itself using MaxInterval (default 60s, which is passed to the randomization algorithm).
With this in mind I don't believe we need to add a check for backoff.Stop; however, if it is still felt to be necessary, I can add it in.
I will get started on the remerge.
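To illustrate the behavior being described, a sketch only, assuming the cenkalti/backoff defaults:

```go
package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

func main() {
	// With MaxElapsedTime left at zero the backoff never gives up:
	// NextBackOff never returns backoff.Stop, and the randomized intervals
	// grow until they plateau around MaxInterval (60s by default).
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 0

	for i := 0; i < 12; i++ {
		d := b.NextBackOff()
		fmt.Printf("attempt %2d: sleep %v (stop=%v)\n", i+1, d, d == backoff.Stop)
	}
}
```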
Ah, I see. I didn't catch the extra setting there. My next comment, then, is that we should probably have a max elapsed time; otherwise this could hypothetically hang requests indefinitely. Even if it's a pretty long interval, I'd be more comfortable if this could eventually fail when it needed to.
One concern I do have, and why I did not put a timeout in initially: if we use a timeout, what impact would failing to commit part of a batch have on record integrity? Specifically, without transaction capability we could end up with some data committed and other data not, potentially leaving data hanging in the database. Though, looking at the code again, this issue may already be present if the batch passed to batchWriteRequests is larger than 25 and one of the sub-batches passed to the AWS SDK fails.
I did intentionally keep the entire batch, including the retry sleep interval, inside of the PermitPool (controlled by max_parallel) so that stalled writes essentially act as back pressure, slowing down the amount of traffic that can hit the database at one time.
That said, thinking through the code operation while considering a timeout, I see another risk in letting these run for an extended period: the probability that we end up with an out-of-order write, and the impact that could have on Vault storage. These can already happen today since the AWS SDK can retry the batches on its own, but extending the time a request sits in the queue does increase the risk.
Offhand I'm not sure what a good balance between these two concerns is when it comes to setting a timeout, since we are fighting two data-integrity-impacting issues. Would you have a suggestion for what you would consider a comfortable balance for the time? I'll note that the AWS SDK may retry for some time before it even returns the first batch with unprocessed items, so we would want to at least allow that to occur once to be sure we get a chance to retry the submission.
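For reference, a rough sketch (not the backend's actual batchWriteRequests code) of the sub-batching behavior described above, using the aws-sdk-go DynamoDB client; an error on a later chunk leaves earlier chunks already committed:

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// dynamoDBMaxBatchSize mirrors the BatchWriteItem limit of 25 write requests.
const dynamoDBMaxBatchSize = 25

// writeInChunks splits a large set of write requests into 25-item chunks.
// If chunk N fails, chunks 1..N-1 are already durable in DynamoDB, which is
// the partial-commit ("hanging data") risk discussed above. UnprocessedItems
// handling is omitted here for brevity.
func writeInChunks(svc *dynamodb.DynamoDB, table string, requests []*dynamodb.WriteRequest) error {
	for len(requests) > 0 {
		n := len(requests)
		if n > dynamoDBMaxBatchSize {
			n = dynamoDBMaxBatchSize
		}
		chunk := requests[:n]
		requests = requests[n:]

		_, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: map[string][]*dynamodb.WriteRequest{table: chunk},
		})
		if err != nil {
			return err
		}
	}
	return nil
}
```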
Actually, I think we can take this in another direction, if you don't mind. It looks like DynamoDB added transactions earlier in the year. So if we could convert this to use TransactWriteItems rather than BatchWriteItem, then a timeout is possible and we could roll back and return an error if it's reached. Thoughts?
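A hedged sketch of what that could look like with the aws-sdk-go v1 client (this is not an actual patch to the backend):

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// transactPut sketches the direction suggested above: write the items
// atomically with TransactWriteItems (limit: 25 items per call), so the call
// either commits everything or returns an error and commits nothing.
func transactPut(svc *dynamodb.DynamoDB, table string, items []map[string]*dynamodb.AttributeValue) error {
	tx := make([]*dynamodb.TransactWriteItem, 0, len(items))
	for _, item := range items {
		tx = append(tx, &dynamodb.TransactWriteItem{
			Put: &dynamodb.Put{
				TableName: aws.String(table),
				Item:      item,
			},
		})
	}
	_, err := svc.TransactWriteItems(&dynamodb.TransactWriteItemsInput{
		TransactItems: tx,
	})
	return err
}
```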
My apologies for the delay; I needed to take a step back and think on this one. I'm far from a DynamoDB expert or a Vault expert, so I needed to double-check a few items, and I still have at least one part I'm not fully sure about.
First, I'll note that DynamoDB transactions are limited to 25 items in a single write, similar to how BatchWrite is limited. This means transactions would reduce the hanging-data problem I was concerned about for a partial data commit, but it wouldn't be eliminated if a batch is over 25 items. I'm not sure how often that occurs, but it sounds like a better scenario to me since it reduces the amount of hanging data.
I also don't think this would hit the 4 MB transaction limit in most cases, given the maximum key size restriction Vault has, based on my brief review of the data stored in my own test lab setup. There are ACID-guarantee caveats noted by Amazon for transactions on replicated tables, but I believe that would likely be more of a documentation issue than a showstopper.
The first big issue I can see with switching to transactions is that it will amplify the number of Write Units consumed on DynamoDB, as Amazon bills a transactional write at 2x a non-transactional write. I assume this would need to be documented well so that when someone upgrades they know to scale their environment.
Assuming transactions are the method, I don't believe we need to check for UnprocessedItems, as the transaction either commits or it does not. There would therefore be no need for the exponential backoff code; we just leave retries to the AWS SDK: if it errors we return the error, and if not we know the data made it. I do like that this significantly shortens the maximum time the function could take to return a response.
The only part I need to spend a little more time looking at is understanding the code inside of the loop `for _, prefix := range physical.Prefixes(entry.Key) {`, because under DynamoDB transactions one cannot write the same key in two concurrent transactions without one being rejected, and at a high-level glance that loop looks like it may do just that. I need to spend more time understanding how the hashed storage keys play into that and whether it is actually an issue.
I believe that loop creates empty placeholder records in DynamoDB for each parent "folder", to assist the list operation. For example, writing foo/bar/baz creates empty structures at "foo" and "foo/bar", and the actual value-containing record at "foo/bar/baz".
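As a toy illustration of that expansion (parentPrefixes is a hypothetical stand-in, not Vault's actual physical.Prefixes helper):

```go
package main

import (
	"fmt"
	"strings"
)

// parentPrefixes returns every parent "folder" of a key, excluding the key
// itself, mirroring the prefix expansion described above.
func parentPrefixes(key string) []string {
	var prefixes []string
	parts := strings.Split(key, "/")
	for i := 1; i < len(parts); i++ {
		prefixes = append(prefixes, strings.Join(parts[:i], "/"))
	}
	return prefixes
}

func main() {
	fmt.Println(parentPrefixes("foo/bar/baz")) // [foo foo/bar]
}
```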
I agree with your analysis so far. The size limitation shouldn't be an issue, as there should be practically no bytes written for the elements of the transaction other than the leaf entry, which would contain the same amount of data we currently write in the batch version.
The read/write capacity issue is interesting: some customers may not care, while others may be cost sensitive. In addition, when this ships, customers would need to adjust their provisioning upward. That's a release-notes issue, but it may surprise some folks. What do you think about making this configurable, transactional by default? If it's turned off, we use batching with retries.
Thank you for the confirmation on how that loop functions. With that I now have a firmer understanding of the code.
From some basic testing I believe I can confirm that the path-creation loop, as it exists, poses a significant issue for transactions in DynamoDB. Specifically, if two transactions run at the same time on a shared path, such as putting data at /foo/bar and /foo/baz, we end up with a TransactionConflict exception from DynamoDB on /foo. This occurs even if the data is being written with the same values, and it applies at every level of the path, so a transaction collision is not a question of 'if' but 'when'. Ultimately this means the path-creation portion would need to remain outside of transactions and be handled by a batch write, and since placing the actual key record afterwards is only a single put, I don't see any real advantage to running that as a single-item transaction.
There may be some merit in splitting out the code so that the batch creating the path prefixes is sent first and awaited for success before sending the single value key, to avoid any hanging data; however, that would increase latency.
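A rough sketch of that split, under the assumption that the prefix placeholder items and the leaf item are already marshalled into DynamoDB attribute maps (hypothetical helper, not the backend's code):

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// putWithPrefixes writes the parent "folder" placeholder records with an
// ordinary BatchWriteItem (which tolerates two writers touching the same
// prefix), and only once that batch succeeds does it put the single leaf
// record. Chunking to 25 requests and UnprocessedItems handling are omitted
// for brevity.
func putWithPrefixes(svc *dynamodb.DynamoDB, table string,
	prefixItems []map[string]*dynamodb.AttributeValue,
	leafItem map[string]*dynamodb.AttributeValue) error {

	if len(prefixItems) > 0 {
		requests := make([]*dynamodb.WriteRequest, 0, len(prefixItems))
		for _, item := range prefixItems {
			requests = append(requests, &dynamodb.WriteRequest{
				PutRequest: &dynamodb.PutRequest{Item: item},
			})
		}
		if _, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: map[string][]*dynamodb.WriteRequest{table: requests},
		}); err != nil {
			return err
		}
	}

	_, err := svc.PutItem(&dynamodb.PutItemInput{
		TableName: aws.String(table),
		Item:      leafItem,
	})
	return err
}
```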
As it relates to this particular bug fix, I think we should skip implementing transactions and just settle on a reasonable timeout as you originally suggested, unless there is something I am missing.
Great analysis. I agree with you, then, that best-effort batching with a timeout is the way to go.
I've redone the patch, taking the advice previously given. I arbitrarily put the timeout at 600 seconds and am open to any opinions on changing that. If I'm reading the AWS SDK correctly, depending on exactly how the issue occurs it may be possible for the original batchWrite call to take longer than 600 seconds in extreme conditions, though I'm not sure that would ever happen in production. Additionally, there is an undocumented "dynamodb_max_retries" configuration parameter that, if used, would affect how long the AWS SDK may take before it hands control back. The default for this is set in the AWS SDK as 10 retries.
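For readers following along, a minimal sketch of the retry shape being described (a 600-second overall budget around resubmitting UnprocessedItems); it is an approximation, not the merged implementation:

```go
package dynamodbsketch

import (
	"errors"
	"time"

	"github.com/aws/aws-sdk-go/service/dynamodb"
	"github.com/cenkalti/backoff"
)

// retryUnprocessed resubmits whatever DynamoDB reports back in
// UnprocessedItems, sleeping an exponentially increasing interval between
// attempts, and gives up once the backoff's 600s overall budget is spent.
func retryUnprocessed(svc *dynamodb.DynamoDB, batch map[string][]*dynamodb.WriteRequest) error {
	boff := backoff.NewExponentialBackOff()
	boff.MaxElapsedTime = 600 * time.Second

	for len(batch) > 0 {
		output, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: batch,
		})
		if err != nil {
			return err
		}
		if len(output.UnprocessedItems) == 0 {
			return nil
		}

		next := boff.NextBackOff()
		if next == backoff.Stop {
			return errors.New("dynamodb: timed out resubmitting unprocessed batch items")
		}
		batch = output.UnprocessedItems
		time.Sleep(next)
	}
	return nil
}
```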
I think 10 minutes is reasonable as you have here. Looks good to me.
Do you mind merging up from hashicorp:master? The only thing preventing us from merging this is the failing CI test, which may be fixed in master.
fixes #5836
When throttled, DynamoDB may return a 2xx response without committing all submitted items to the database. Depending on load, every action in a BatchWriteItem may be throttled with ProvisionedThroughputExceededException, in which case the AWS SDK handles the retry. If only some of the messages were throttled, ProvisionedThroughputExceededException is not returned to the SDK and it is up to us to resubmit the request. This change uses an exponential backoff, as recommended by the AWS SDK, for the times we get partially throttled repeatedly.
Rebased per request. Looks like we got past the previously failing check. The new one that is failing, "ci/circleci: build-common-layers", doesn't appear to me, on glancing at the logs, to be related to any of the code in this PR.
@sgmiller There's high interest in this fix since we appear to be facing this very issue - we are running a bunch of verification tests to prove it mitigates the data loss we're seeing.
This adds a `-check` flag to `vault operator migrate`, which makes it possible to verify that all data has been successfully migrated to the new backend. We needed this after facing issues with missing data when migrating to DynamoDB. Also related:
* hashicorp#5836
* hashicorp#10181
Thank you @cmlara for this, really appreciate it! We just faced the problem you described in the mentioned issue while running a migration from Consul to DynamoDB. After applying your fix, the migration went through fine and we could confirm that this time all data had been migrated. I would appreciate it if this could be merged, as I think it has quite some impact, especially since the bug risks losing data without any apparent errors.
Great timing @sgmiller, we were just discussing this morning how much longer we'd need to run our patched version of Vault with this DynamoDB fix. Thanks!
<bow> Thanks @cmlara!
Thanks for the merge, and for the help throughout in verifying how the Vault side functioned, @sgmiller!
* Potential data loss in DynamoDB backend (#10181)

  fixes #5836

  When throttled, DynamoDB may return a 2xx response without committing all submitted items to the database. Depending on load, every action in a BatchWriteItem may be throttled with ProvisionedThroughputExceededException, in which case the AWS SDK handles the retry. If only some of the messages were throttled, ProvisionedThroughputExceededException is not returned to the SDK and it is up to us to resubmit the request. This change uses an exponential backoff, as recommended by the AWS SDK, for the times we get partially throttled repeatedly.

* Add a Changelog entry for 10181

* Fix err shadowing

Co-authored-by: Conrad Lara <[email protected]>