Potential data loss in DynamoDB backend #10181
Conversation
physical/dynamodb/dynamodb.go (outdated)
```go
			break
		} else {
			batch = output.UnprocessedItems
			time.Sleep(boff.NextBackOff())
		}
```
I think based on the defaults you need to handle boff.Stop as a potential value. backoff.Stop is the duration -1, which wouldn't sleep at all, so you'd go into a tight loop.
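For illustration, a minimal sketch of how that tight loop could be avoided, assuming the cenkalti/backoff package (the exact module path/version in the vendor tree may differ):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

// retrySleep demonstrates the concern above: NextBackOff can return
// backoff.Stop (time.Duration(-1)); sleeping on a negative duration returns
// immediately, so the caller must treat Stop as "give up" to avoid spinning.
func retrySleep(b backoff.BackOff) error {
	next := b.NextBackOff()
	if next == backoff.Stop {
		return errors.New("backoff exhausted; giving up on unprocessed items")
	}
	time.Sleep(next)
	return nil
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 100 * time.Millisecond // small budget so Stop is hit quickly
	for {
		if err := retrySleep(b); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```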
Also needs a quick remerge.
It is my understanding that with MaxElapsedTime = 0 on line 510, the exponential backoff will never time out and never return backoff.Stop; instead it caps itself using MaxInterval (default 60s, which is passed to the randomization algorithm).
With this in mind I don't believe we need to add a check for backoff.Stop; however, if it is still felt to be necessary, I can add it in.
I will get started on the remerge.
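To illustrate the behavior being described, a sketch only, assuming the cenkalti/backoff defaults:

```go
package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

func main() {
	// With MaxElapsedTime left at zero the backoff never gives up:
	// NextBackOff never returns backoff.Stop, and the randomized intervals
	// grow until they plateau around MaxInterval (60s by default).
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 0

	for i := 0; i < 12; i++ {
		d := b.NextBackOff()
		fmt.Printf("attempt %2d: sleep %v (stop=%v)\n", i+1, d, d == backoff.Stop)
	}
}
```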
Ah, I see. I didn't catch the extra setting there. My next comment, then, is that we should probably have a max elapsed time; otherwise this could hypothetically hang requests indefinitely. Even if it's a pretty long interval, I'd be more comfortable if this could eventually fail when it needed to.
One concern I do have, and why I did not put a timeout in initially: if we use a timeout, what impact would failing to commit part of a batch have on record integrity? Specifically, without transaction capability we could end up with some data committed and other data not, potentially leaving data hanging in the database. Though, looking at the code again, this issue may already be present if the batch passed to batchWriteRequests is larger than 25 and one of the sub-batches passed to the AWS SDK fails.
I did intentionally keep the entire batch, including the retry sleep interval, inside of the PermitPool (controlled by max_parallel) so that stalled writes essentially act as back pressure, slowing down the amount of traffic that can hit the database at one time.
That said, thinking through the code operation while considering a timeout, I see another risk in letting these run for an extended period: the probability that we end up with an out-of-order write, and the impact that could have on Vault storage. These can already happen today since the AWS SDK can retry the batches on its own, but extending the time a request sits in the queue does increase the risk.
Offhand I'm not sure what a good balance between these two concerns is when it comes to setting a timeout, since we are fighting two data-integrity-impacting issues. Would you have a suggestion for what you would consider a comfortable balance for the time? I'll note that the AWS SDK may retry for some time before it even returns the first batch with unprocessed items, so we would want to at least allow that to occur once to be sure we get a chance to retry the submission.
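For reference, a rough sketch (not the backend's actual batchWriteRequests code) of the sub-batching behavior described above, using the aws-sdk-go DynamoDB client; an error on a later chunk leaves earlier chunks already committed:

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// dynamoDBMaxBatchSize mirrors the BatchWriteItem limit of 25 write requests.
const dynamoDBMaxBatchSize = 25

// writeInChunks splits a large set of write requests into 25-item chunks.
// If chunk N fails, chunks 1..N-1 are already durable in DynamoDB, which is
// the partial-commit ("hanging data") risk discussed above. UnprocessedItems
// handling is omitted here for brevity.
func writeInChunks(svc *dynamodb.DynamoDB, table string, requests []*dynamodb.WriteRequest) error {
	for len(requests) > 0 {
		n := len(requests)
		if n > dynamoDBMaxBatchSize {
			n = dynamoDBMaxBatchSize
		}
		chunk := requests[:n]
		requests = requests[n:]

		_, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: map[string][]*dynamodb.WriteRequest{table: chunk},
		})
		if err != nil {
			return err
		}
	}
	return nil
}
```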
Actually, I think we can take this in another direction, if you don't mind. It looks like DynamoDB added transactions earlier in the year. So if we could convert this to use TransactWriteItems rather than BatchWriteItem, then a timeout is possible and we could roll back and return an error if it's reached. Thoughts?
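A hedged sketch of what that could look like with the aws-sdk-go v1 client (this is not an actual patch to the backend):

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// transactPut sketches the direction suggested above: write the items
// atomically with TransactWriteItems (limit: 25 items per call), so the call
// either commits everything or returns an error and commits nothing.
func transactPut(svc *dynamodb.DynamoDB, table string, items []map[string]*dynamodb.AttributeValue) error {
	tx := make([]*dynamodb.TransactWriteItem, 0, len(items))
	for _, item := range items {
		tx = append(tx, &dynamodb.TransactWriteItem{
			Put: &dynamodb.Put{
				TableName: aws.String(table),
				Item:      item,
			},
		})
	}
	_, err := svc.TransactWriteItems(&dynamodb.TransactWriteItemsInput{
		TransactItems: tx,
	})
	return err
}
```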
My apologies for the delay; I needed to take a step back and think on this one. I'm far from a DynamoDB expert or a Vault expert, so I needed to double-check a few items, and I still have at least one part I'm not fully sure about.
First, I'll note that DynamoDB transactions are limited to 25 items in a single write, similar to how BatchWrite is limited. This means transactions would reduce the hanging-data problem I was concerned about for a partial data commit, but it wouldn't be eliminated if a batch is over 25 items. I'm not sure how often that occurs, but it sounds like a better scenario to me since it reduces the amount of hanging data.
I also don't think this would hit the 4 MB transaction limit in most cases, given the maximum key size restriction Vault has, based on my brief review of the data stored in my own test lab setup. There are ACID-guarantee caveats noted by Amazon for transactions on replicated tables, but I believe that would likely be more of a documentation issue than a showstopper.
The first big issue I can see with switching to transactions is that it will amplify the number of Write Units consumed on DynamoDB, as Amazon bills a transactional write at 2x a non-transactional write. I assume this would need to be documented well so that when someone upgrades they know to scale their environment.
Assuming transactions are the method, I don't believe we need to check for UnprocessedItems, as the transaction either commits or it does not. There would therefore be no need for the exponential backoff code; we just leave retries to the AWS SDK: if it errors we return the error, and if not we know the data made it. I do like that this significantly shortens the maximum time the function could take to return a response.
The only part I need to spend a little more time looking at is understanding the code inside of the loop `for _, prefix := range physical.Prefixes(entry.Key) {`, because under DynamoDB transactions one cannot write the same key in two concurrent transactions without one being rejected, and at a high-level glance that loop looks like it may do just that. I need to spend more time understanding how the hashed storage keys play into that and whether it is actually an issue.
I believe that loop creates empty placeholder records in DynamoDB for each parent "folder", to assist the list operation. For example, writing foo/bar/baz creates empty structures at "foo" and "foo/bar", and the actual value-containing record at "foo/bar/baz".
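As a toy illustration of that expansion (parentPrefixes is a hypothetical stand-in, not Vault's actual physical.Prefixes helper):

```go
package main

import (
	"fmt"
	"strings"
)

// parentPrefixes returns every parent "folder" of a key, excluding the key
// itself, mirroring the prefix expansion described above.
func parentPrefixes(key string) []string {
	var prefixes []string
	parts := strings.Split(key, "/")
	for i := 1; i < len(parts); i++ {
		prefixes = append(prefixes, strings.Join(parts[:i], "/"))
	}
	return prefixes
}

func main() {
	fmt.Println(parentPrefixes("foo/bar/baz")) // [foo foo/bar]
}
```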
I agree with your analysis so far. The size limitation shouldn't be an issue, as there should be practically no bytes written for the elements of the transaction other than the leaf entry, which would contain the same amount of data we currently write in the batch version.
The read/write capacity issue is interesting: some customers may not care, while others may be cost sensitive. In addition, when this ships, customers would need to adjust their provisioning upward. That's a release-notes issue, but it may surprise some folks. What do you think about making this configurable, transactional by default? If it's turned off, we use batching with retries.
Thank you for the confirmation on how that loop functions. With that I now have a firmer understanding of the code.
From some basic testing I believe I can confirm that the path-creation loop, as it exists, poses a significant issue for transactions in DynamoDB. Specifically, if two transactions run at the same time on a shared path, such as putting data at /foo/bar and /foo/baz, we end up with a TransactionConflict exception from DynamoDB on /foo. This occurs even if the data is being written with the same values, and it applies at every level of the path, so a transaction collision is not a question of 'if' but 'when'. Ultimately this means the path-creation portion would need to remain outside of transactions and be handled by a batch write, and since placing the actual key record afterwards is only a single put, I don't see any real advantage to running that as a single-item transaction.
There may be some merit in splitting out the code so that the batch creating the path prefixes is sent first and awaited for success before sending the single value key, to avoid any hanging data; however, that would increase latency.
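A rough sketch of that split, under the assumption that the prefix placeholder items and the leaf item are already marshalled into DynamoDB attribute maps (hypothetical helper, not the backend's code):

```go
package dynamodbsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// putWithPrefixes writes the parent "folder" placeholder records with an
// ordinary BatchWriteItem (which tolerates two writers touching the same
// prefix), and only once that batch succeeds does it put the single leaf
// record. Chunking to 25 requests and UnprocessedItems handling are omitted
// for brevity.
func putWithPrefixes(svc *dynamodb.DynamoDB, table string,
	prefixItems []map[string]*dynamodb.AttributeValue,
	leafItem map[string]*dynamodb.AttributeValue) error {

	if len(prefixItems) > 0 {
		requests := make([]*dynamodb.WriteRequest, 0, len(prefixItems))
		for _, item := range prefixItems {
			requests = append(requests, &dynamodb.WriteRequest{
				PutRequest: &dynamodb.PutRequest{Item: item},
			})
		}
		if _, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: map[string][]*dynamodb.WriteRequest{table: requests},
		}); err != nil {
			return err
		}
	}

	_, err := svc.PutItem(&dynamodb.PutItemInput{
		TableName: aws.String(table),
		Item:      leafItem,
	})
	return err
}
```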
As it relates to this particular bug fix, I think we should skip implementing transactions and just settle on a reasonable timeout as you originally suggested, unless there is something I am missing.
Great analysis. I agree with you, then, that best-effort batching with a timeout is the way to go.
I've redone the patch, taking the advice previously given. I arbitrarily put the timeout at 600 seconds and am open to any opinions on changing that. If I'm reading the AWS SDK correctly, depending on exactly how the issue occurs it may be possible for the original batchWrite call to take longer than 600 seconds in extreme conditions, though I'm not sure that would ever happen in production. Additionally, there is an undocumented "dynamodb_max_retries" configuration parameter that, if used, would affect how long the AWS SDK may take before it hands control back. The default for this is set in the AWS SDK as 10 retries.
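For readers following along, a minimal sketch of the retry shape being described (a 600-second overall budget around resubmitting UnprocessedItems); it is an approximation, not the merged implementation:

```go
package dynamodbsketch

import (
	"errors"
	"time"

	"github.com/aws/aws-sdk-go/service/dynamodb"
	"github.com/cenkalti/backoff"
)

// retryUnprocessed resubmits whatever DynamoDB reports back in
// UnprocessedItems, sleeping an exponentially increasing interval between
// attempts, and gives up once the backoff's 600s overall budget is spent.
func retryUnprocessed(svc *dynamodb.DynamoDB, batch map[string][]*dynamodb.WriteRequest) error {
	boff := backoff.NewExponentialBackOff()
	boff.MaxElapsedTime = 600 * time.Second

	for len(batch) > 0 {
		output, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
			RequestItems: batch,
		})
		if err != nil {
			return err
		}
		if len(output.UnprocessedItems) == 0 {
			return nil
		}

		next := boff.NextBackOff()
		if next == backoff.Stop {
			return errors.New("dynamodb: timed out resubmitting unprocessed batch items")
		}
		batch = output.UnprocessedItems
		time.Sleep(next)
	}
	return nil
}
```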
I think 10 minutes is reasonable as you have here. Looks good to me.
Do you mind merging up from hashicorp:master? The only thing preventing us from merging this is the failing CI test, which may be fixed in master.
fixes #5836
When throttled, DynamoDB may return a 2xx response without committing all submitted items to the database. Depending on load, every action in a BatchWriteItem may be throttled with ProvisionedThroughputExceededException, in which case the AWS SDK handles the retry. If only some of the messages were throttled, ProvisionedThroughputExceededException is not returned to the SDK and it is up to us to resubmit the request. This change uses an exponential backoff, as recommended by the AWS SDK, for the times we get partially throttled repeatedly.
Rebased per request. Looks like we got past the previously failing check. The new one that is failing, "ci/circleci: build-common-layers", doesn't appear to me, on glancing at the logs, to be related to any of the code in this PR.
@sgmiller There's high interest in this fix since we appear to be facing this very issue - we are running a bunch of verification tests to prove it mitigates the data loss we're seeing.
This adds a `-check` flag to `vault operator migrate`, which makes it possible to verify that all data has been successfully migrated to the new backend. We needed this after facing issues with missing data when migrating to DynamoDB. Also related:
* hashicorp#5836
* hashicorp#10181
Thank you @cmlara for this, really appreciate it! We just faced the problem you described in the mentioned issue while running a migration from Consul to DynamoDB. After applying your fix, the migration went through fine and we could confirm that this time all data had been migrated. I would appreciate it if this could be merged, as I think it has quite some impact, especially since the bug risks losing data without any apparent errors.
Great timing @sgmiller, we were just discussing this morning how much longer we'd need to run our patched version of Vault with this DynamoDB fix. Thanks!
<bow> Thanks @cmlara!
Thanks for the merge, and for the help throughout in verifying how the Vault side functioned, @sgmiller!
* Potential data loss in DynamoDB backend (#10181)

  fixes #5836

  When throttled, DynamoDB may return a 2xx response without committing all submitted items to the database. Depending on load, every action in a BatchWriteItem may be throttled with ProvisionedThroughputExceededException, in which case the AWS SDK handles the retry. If only some of the messages were throttled, ProvisionedThroughputExceededException is not returned to the SDK and it is up to us to resubmit the request. This change uses an exponential backoff, as recommended by the AWS SDK, for the times we get partially throttled repeatedly.

* Add a Changelog entry for 10181

* Fix err shadowing

Co-authored-by: Conrad Lara <[email protected]>