Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENG-9317 Fix concurrent workflow create bug #342

Merged

Conversation

kqdeng
Copy link
Contributor

@kqdeng kqdeng commented Oct 15, 2020

Signed-off-by: Kelly Deng [email protected]

Description

Lowers the transaction isolation level from serializable to repeatable read as serializable appears to be too overly strict.

Why is this needed

When creating several workflows in parallel, an error is returned:

rpc error: code = Unknown desc = Failed to workflow: inserting e3ea67ac-2d5d-45ce-8cb3-fca42abc896f, INSERT in to workflow: pq: could not serialize access due to read/write dependencies among transactions

example:

/ # cat inputs | parallel
Created Workflow:  7c8404ee-52a7-49a9-bacc-57ae9171e8c1
2020/10/14 15:08:18 rpc error: code = Unknown desc = Failed to workflow: inserting e3ea67ac-2d5d-45ce-8cb3-fca42abc896f, INSERT in to workflow: pq: could not serialize access due to read/write dependencies among transactions
Created Workflow:  6981c47b-443b-4d69-932b-4e6b6d1ce678
2020/10/14 15:08:18 rpc error: code = Unknown desc = Failed to workflow: inserting 48de21ee-5bac-4498-96c4-21c4ad46ae33, INSERT in to workflow: pq: could not serialize access due to read/write dependencies among transactions
Created Workflow:  a0921f77-e7f6-4aeb-8c5b-823e1d57168c
Created Workflow:  875a9033-51ce-4b5a-a786-1d3c0520eeb4
2020/10/14 15:08:18 rpc error: code = Unknown desc = Failed to insert in workflow_state: inserting 5f1a1f7c-d264-4339-b0fc-12f5fb72bd81, INSERT in to workflow_state: pq: could not serialize access due to read/write dependencies among transactions
Created Workflow:  cde501b0-51f0-4df8-9844-7c0e581a2e69
Created Workflow:  4acef5ac-5299-45ec-bbe9-56871a6fbe61
Created Workflow:  83a6dc36-bfae-456d-8b46-501e6dd16f03

inputs file contents

tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'
tink workflow create -t 67a0b806-6315-4a98-a56d-99fb03b30573 -r '{"device_1": "b4:96:91:5f:af:c0"}'

Fixes: #

How Has This Been Tested?

Manually, using parallel to run workflow create commands in parallel

Lowering the isolation level to repeatable read and using the same input file from above, the workflow create commands succeed without error:

/ # cat inputs | parallel
Created Workflow:  e3651718-3657-444d-93cd-a47d37d8b3df
Created Workflow:  706cb112-104a-4f2d-9996-0de9742c3633
Created Workflow:  6fada3af-f601-4463-bde3-63af3323955c
Created Workflow:  b7bf0f64-8767-4436-9cc4-68c585b47e43
Created Workflow:  5c6bc466-b2d1-4e49-af13-2ecedee81e24
Created Workflow:  3e681699-6f39-47d2-bb44-2f4c504dc1fd
Created Workflow:  661c5c0d-57bd-4dda-bd15-f2c81defb1a0
Created Workflow:  9c8e8ec4-c768-406e-b2ce-f4286ab6eb86
Created Workflow:  6099a384-442c-419b-996e-8f2e172bfab6
Created Workflow:  dec9db7b-1cf9-4b07-96f2-e52757614e04

How are existing users impacted? What migration steps/scripts do we need?

No migration needed.

Checklist:

I have:

  • updated the documentation and/or roadmap (if required)
  • added unit or e2e tests
  • provided instructions on how to upgrade

…ead to avoid concurrency error due to level serializable being overly strict

Signed-off-by: Kelly Deng <[email protected]>
@codecov
Copy link

codecov bot commented Oct 15, 2020

Codecov Report

Merging #342 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #342   +/-   ##
=======================================
  Coverage   23.79%   23.79%           
=======================================
  Files          14       14           
  Lines        1244     1244           
=======================================
  Hits          296      296           
  Misses        928      928           
  Partials       20       20           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e368e84...97cf1a1. Read the comment docs.

@kqdeng kqdeng requested review from DavidHwu and jonawong October 19, 2020 16:01
Copy link

@DavidHwu DavidHwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am a bit concerned about lowering the isolation level. Think the only potential negative test case here is the outcome of 2+ concurrent provisioning of the same bare metal. Something that team discussed and is guarded at the inventory level.
The tests here show the isolation level "Repeatable Read" does indeed solve for the DB concurrency issue without a decent amount of code change.
Think this is an acceptable risk. Feedback and other suggestions welcomed.

Copy link

@jonawong jonawong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Ran this branch locally and tested with parallels and saw no issues.

@kqdeng kqdeng added the ready-to-merge Signal to Mergify to merge the PR. label Oct 20, 2020
@mergify mergify bot merged commit 12fba51 into tinkerbell:master Oct 20, 2020
mergify bot added a commit that referenced this pull request Oct 23, 2020
## Description

When multiple process concurrently create workflow, the code to access workflow_worker_map is not concurrent safe. This PR fix this issue. 

## Why is this needed

During workflow creation, new rows may be add to table `workflow_worker_map` in the same transaction. The original design use ` INSERT ... SELECT`. This doesn't work concurrently with SERRIELIZABLE isolation.

Fixes: #

- Create unique index on workflow_worker_map table
- Use `INSERT ... ON CONFLICT DO NOTHING;` to avoid table lock.

## How Has This Been Tested?

Manually as #342.

## How are existing users impacted? What migration steps/scripts do we need?

- No impact
- database migrate to 202010221010-add-unique-index.


## Checklist:

I have:

- [ ] updated the documentation and/or roadmap (if required)
- [ ] added unit or e2e tests
- [ ] provided instructions on how to upgrade
@mmlb mmlb removed the ready-to-merge Signal to Mergify to merge the PR. label Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants