Potential data loss in DynamoDB backend #10181

Merged 1 commit into hashicorp:master on Apr 7, 2021

Conversation

@cmlara (Contributor) commented Oct 19, 2020

fixes #5836

When throttled, DynamoDB may return a 2xx response without committing all submitted items to the database.

Depending on load, every action in a BatchWriteItem request may be throttled with ProvisionedThroughputExceededException, in which case the AWS SDK handles the retry. If only some items were throttled, ProvisionedThroughputExceededException is not returned to the SDK and it is up to us to resubmit the request.

An exponential backoff, as recommended in the AWS SDK, is used for the cases where we may be partially throttled repeatedly.
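
For reference, a minimal sketch of the approach described above, assuming the aws-sdk-go and cenkalti/backoff libraries; the function name, client, table, and requests are placeholders, not the actual PR diff:

    package dynamodbsketch

    import (
        "time"

        "github.com/aws/aws-sdk-go/service/dynamodb"
        "github.com/cenkalti/backoff/v3"
    )

    // batchWriteWithRetry resubmits whatever DynamoDB reports back as
    // UnprocessedItems, sleeping with an exponential backoff between attempts.
    func batchWriteWithRetry(client *dynamodb.DynamoDB, table string, requests []*dynamodb.WriteRequest) error {
        boff := backoff.NewExponentialBackOff()
        boff.MaxElapsedTime = 0 // as first submitted: retry indefinitely, delays plateau at MaxInterval

        batch := map[string][]*dynamodb.WriteRequest{table: requests}
        for len(batch) > 0 {
            output, err := client.BatchWriteItem(&dynamodb.BatchWriteItemInput{
                RequestItems: batch,
            })
            if err != nil {
                // Fully throttled calls surface here after the SDK's own retries.
                return err
            }
            if len(output.UnprocessedItems) == 0 {
                break // only an empty UnprocessedItems means every item landed; a 2xx alone does not
            }
            // Partially throttled: DynamoDB returned 2xx but skipped some items,
            // so resubmit just those after backing off.
            batch = output.UnprocessedItems
            time.Sleep(boff.NextBackOff())
        }
        return nil
    }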

@hashicorp-cla commented Oct 19, 2020

CLA assistant check
All committers have signed the CLA.

        break
    } else {
        batch = output.UnprocessedItems
        time.Sleep(boff.NextBackOff())

Collaborator

I think, based on the defaults, you need to handle boff.Stop as a potential value. backoff.Stop is the duration -1, which wouldn't sleep at all, so you'd go into a tight loop.
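
A minimal sketch of the guard being suggested here, assuming the cenkalti/backoff library; the function and error names are placeholders:

    package sketch

    import (
        "errors"
        "time"

        "github.com/cenkalti/backoff/v3"
    )

    var errRetriesExhausted = errors.New("dynamodb: unprocessed items remain after retry window")

    // sleepOrStop treats backoff.Stop (time.Duration(-1)) as "stop retrying"
    // instead of sleeping on it, which would not block at all.
    func sleepOrStop(boff backoff.BackOff) error {
        d := boff.NextBackOff()
        if d == backoff.Stop {
            return errRetriesExhausted
        }
        time.Sleep(d)
        return nil
    }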

Collaborator

Also needs a quick remerge.

Contributor Author

It is my understanding that with MaxElapsedTime = 0 on line 510, the exponential backoff will never time out and so will never return backoff.Stop; instead it caps itself at MaxInterval (default 60s), which is what gets passed to the randomization algorithm (see the sketch below).

With this in mind I don't believe we need to add checks for backoff.Stop; however, if it is still felt they are necessary, I can add them in.

I will get started on the remerge.
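
A quick way to see the MaxElapsedTime = 0 behavior described above (a standalone sketch, not Vault code):

    package main

    import (
        "fmt"

        "github.com/cenkalti/backoff/v3"
    )

    func main() {
        boff := backoff.NewExponentialBackOff()
        boff.MaxElapsedTime = 0 // never give up, so NextBackOff never yields backoff.Stop
        for i := 0; i < 20; i++ {
            d := boff.NextBackOff()
            // The second column stays false; the underlying interval grows until
            // it is capped by MaxInterval (60s by default) before randomization.
            fmt.Println(d, d == backoff.Stop)
        }
    }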

Collaborator

Ah, I see. I didn't catch the extra setting there. My next comment, then, is that we should probably have a max elapsed time; otherwise this could hypothetically hang requests indefinitely. Even if it's a pretty long interval, I'd be more comfortable if this could eventually fail when it needed to.

Contributor Author

One concern I do have, and the reason I did not put in a timeout initially: if we use a timeout, what impact would failing to commit part of a batch have on record integrity? Specifically, without transaction support we would end up with some data committed and other data not, potentially leaving data hanging in the database. Though, looking at the code again, this issue may already be present if the batch passed to batchWriteRequests is larger than 25 and one of the sub-batches passed to the AWS SDK fails.

I intentionally kept the entire batch, including the retry sleep interval, inside the PermitPool (controlled by max_parallel) so that stalled writes act essentially as back pressure, slowing down how much can hit the database at one time.

That said, thinking through how the code operates and considering a timeout, I see another risk in letting these run for an extended period: the probability that we end up with an out-of-order write, and the impact that could have on Vault's storage. These can already happen today, since the AWS SDK can retry the batches on its own, but extending the time a request sits in the queue does increase the risk.

Offhand I'm not sure what a good balance between these two concerns is as it relates to setting a timeout, since we are weighing two data-integrity issues against each other. Would you have a suggestion for what you'd consider a comfortable time limit? I'll note that the AWS SDK may retry for some time before it even returns the first batch with unprocessed items, so we would want to at least allow that to occur once to be sure we get a chance to retry the submission.

Collaborator

Actually, I think we can take this in another direction, if you don't mind. It looks like DynamoDB added transactions earlier in the year. If we could convert this to use TransactWriteItems rather than BatchWriteItem, then a timeout is possible, and we could roll back and return an error if it's reached. Thoughts?
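
A hedged sketch of that direction, assuming the aws-sdk-go DynamoDB client; svc, table, and items are placeholders, and this is not code from the PR:

    package sketch

    import (
        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    // transactPut writes a group of items atomically: either every put commits
    // or none do, so a failed or timed-out call leaves no partial data behind.
    func transactPut(svc *dynamodb.DynamoDB, table string, items []map[string]*dynamodb.AttributeValue) error {
        writes := make([]*dynamodb.TransactWriteItem, 0, len(items))
        for _, item := range items {
            writes = append(writes, &dynamodb.TransactWriteItem{
                Put: &dynamodb.Put{
                    TableName: aws.String(table),
                    Item:      item,
                },
            })
        }
        _, err := svc.TransactWriteItems(&dynamodb.TransactWriteItemsInput{
            TransactItems: writes,
        })
        return err // on error nothing was committed; no UnprocessedItems to resubmit
    }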

Contributor Author

My apologies for the delay; I needed to take a step back and think on this one. I'm far from a DynamoDB expert or a Vault expert, so I needed to double-check a few items, and I still have at least one part I'm not fully sure about.

First, I'll note that DynamoDB transactions are limited to 25 items in a single write, similar to how BatchWriteItem is limited. This means that, yes, transactions would reduce the hanging-data problem I was concerned about with a partial commit, but it wouldn't be eliminated if a batch is over 25 items. I'm not sure how often that occurs; it still sounds like a better scenario to me, though, as it reduces the amount of hanging data.

I don't think this would hit the 4 MB transaction limit in most cases either, given the maximum key size restriction Vault has, based on my brief review of the data stored in my own test lab setup. There are ACID guarantee caveats noted by Amazon for transactions on replicated tables, but I believe that would likely be more of a documentation issue than a showstopper.

The first big issue with switching to transactions that I can see is that it would amplify the number of write units consumed on DynamoDB, since Amazon bills a transactional write at 2x a non-transactional write. I assume this is something that would need to be documented well, so that when someone upgrades they know to scale their environment.

Assuming transactions are the method, I don't believe we need to check for UnprocessedItems, as the transaction either commits or it does not; therefore there is no need for the exponential backoff code and we can leave retries to the AWS SDK. If it errors we return the error; if not, we know the data made it. I do like that this significantly shortens the maximum time the function could take to return a response.

The only part I need to spend a little more time on is understanding the code inside the for _, prefix := range physical.Prefixes(entry.Key) loop, because with DynamoDB transactions you cannot write to the same key from two transactions at once without one being rejected, and at a high-level glance that loop looks like it may do just that. I need to spend more time understanding how the hashed storage keys play into that and whether it is actually an issue or not.

Collaborator

I believe that loop creates empty placeholder records in DynamoDB for each parent "folder", to assist the list operation. For example, writing foo/bar/baz creates empty structures at "foo" and "foo/bar", and the actual value-containing record at "foo/bar/baz".
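
An illustrative sketch of that placeholder behavior; prefixesOf is a stand-in for Vault's physical.Prefixes, not its actual implementation:

    package main

    import (
        "fmt"
        "strings"
    )

    // prefixesOf returns each parent "folder" of a key, e.g.
    // "foo/bar/baz" -> ["foo", "foo/bar"].
    func prefixesOf(key string) []string {
        parts := strings.Split(key, "/")
        prefixes := make([]string, 0, len(parts)-1)
        for i := 1; i < len(parts); i++ {
            prefixes = append(prefixes, strings.Join(parts[:i], "/"))
        }
        return prefixes
    }

    func main() {
        // Each prefix gets an empty placeholder record so List can discover it;
        // the value itself is stored only at foo/bar/baz.
        fmt.Println(prefixesOf("foo/bar/baz")) // [foo foo/bar]
    }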

I agree with your analysis so far. The size limitation shouldn't be an issue, as there should be practically no bytes written for the elements of the transaction except the leaf entry, which would contain the same amount of data we currently write in the batch version.

The read/write capacity issue is interesting: some customers may not care, while others may be cost-sensitive. In addition, when this ships, customers would need to adjust their provisioning upward. That's a release-notes issue, but it may surprise some folks. What do you think about making this configurable, transactional by default? If it's turned off, we use batching with retries.

Contributor Author

Thank you for the confirmation on how that loop functions. With that I now have a firmer understanding of the code.

I believe from some basic testing I can confirm that the path-creation loop, as it exists, poses a significant issue for transactions in DynamoDB. Specifically, if two transactions run at the same time on a shared path, such as putting data at /foo/bar and /foo/baz, we end up with a TransactionConflict exception from DynamoDB on /foo. This occurs even if the data is being put in with the same values, and it applies to every level of the path, so ultimately a transaction collision is not a question of 'if' but 'when'. This means the path-creation portion would need to remain outside of transactions and be handled by a batch write, and after that, since it is only a single put to place the actual key record, I see no real advantage to running that as a single-item transaction.

There may be some merit in splitting out the code so that the batch creating the path prefixes is sent and awaited for success before sending the single value key, to avoid any hanging data; however, that would add latency.

As it relates to this particular bug fix, I think we should skip implementing transactions and just settle on a reasonable timeout, as you originally suggested, unless there is something I am missing.

Collaborator

Great analysis. I agree with you, then, that best-effort batching with a timeout is the way to go.

@cmlara commented Jan 14, 2021

I've redone the patch taking the advice previously given.

I arbitrarily put the timeout at 600 seconds, and am open to any opinions on a change to that.

If I'm reading the AWS SDK correctly, then depending on exactly how the issue occurs, it may be possible for the original batch write to take longer than 600 seconds in extreme conditions, though I'm not sure that would ever happen in production.

Additionally, there is an undocumented "dynamodb_max_retries" configuration parameter that, if used, would affect how long the AWS SDK may take before it hands control back. Its default in the AWS SDK is 10 retries.
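
A sketch of the two knobs described in this comment; the wiring and names are assumptions, not the PR diff:

    package sketch

    import (
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
        "github.com/cenkalti/backoff/v3"
    )

    func newThrottleControls(maxRetries int) (*backoff.ExponentialBackOff, *dynamodb.DynamoDB) {
        // Stop resubmitting UnprocessedItems after roughly 10 minutes.
        boff := backoff.NewExponentialBackOff()
        boff.MaxElapsedTime = 600 * time.Second

        // The SDK retries throttled calls on its own before handing control back;
        // maxRetries overrides its default (10 retries for DynamoDB).
        sess := session.Must(session.NewSession())
        svc := dynamodb.New(sess, aws.NewConfig().WithMaxRetries(maxRetries))
        return boff, svc
    }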

@sgmiller self-requested a review on January 20, 2021 17:22
@sgmiller (Collaborator) left a comment

I think 10 minutes is reasonable as you have here. Looks good to me.

@sgmiller (Collaborator)

Do you mind merging up from hashicorp:master? The only thing preventing us from merging this is the failing CI test, which may be fixed in master.

@cmlara commented Mar 29, 2021

Rebased per request.

Looks like we got past the previously failing check. The new failure, "ci/circleci: build-common-layers", doesn't appear (from a glance at the logs) to be related to any of the code in this PR.

@universam1 commented Apr 1, 2021

@sgmiller There is high interest in this fix, since we appear to be facing this very issue. We are running a bunch of verification tests to prove that it mitigates the data loss we see.

AndreasSko pushed a commit to AndreasSko/vault that referenced this pull request Apr 1, 2021
This adds a `-check` flag to `vault operator migrate`, which makes it possible to verify that all data has been successfully migrated to the new backend.

We needed this after facing issues with missing data after migrating to DynamoDB.

Also related:
* hashicorp#5836
* hashicorp#10181
@AndreasSko

Thank you @cmlara for this, really appreciate it! We just faced the problem you described in the linked issue while running a migration from Consul to DynamoDB. After applying your fix, the migration went through fine and we could confirm that this time all data had been migrated. I would appreciate it if this could be merged, as I think it has quite an impact, especially since the bug risks losing data without any apparent errors.

@dhohengassner

@cmlara awesome work!
Glad there is already a fix prepared, because we are also getting reports of missing secrets.
I think this is an important fix for ensuring that DynamoDB works reliably as a backend.

@sgmiller any chance to get this moving?

Thanks a lot!

@sgmiller merged commit 1a41ae8 into hashicorp:master on Apr 7, 2021
@webframp commented Apr 7, 2021

Great timing, @sgmiller. We were just discussing this morning how much longer we'd need to run our patched version of Vault with this DynamoDB fix. Thanks!

sgmiller pushed a commit that referenced this pull request Apr 7, 2021
@kalafut added this to the 1.7.1 milestone on Apr 7, 2021
@sgmiller (Collaborator) commented Apr 7, 2021

<bow> Thanks @cmlara!

@cmlara commented Apr 7, 2021

Thanks for the merge, and for the help throughout in verifying how the Vault side functions, @sgmiller!

sgmiller added a commit that referenced this pull request Apr 7, 2021
* Potential data loss in DynamoDB backend (#10181)

* Add a Changelog entry for 10181

* Fix err shadowing

Co-authored-by: Conrad Lara <[email protected]>