fix(tee): fix race condition in batch locking #3342

Merged
merged 10 commits into from
Dec 3, 2024

Conversation

pbeza
Collaborator

@pbeza pbeza commented Nov 28, 2024

What ❔

After scaling zksync-tee-prover to two instances/replicas on Azure for azure-stage2, azure-testnet2, and azure-mainnet2, we started seeing duplicated proving for some batches (see logs).
While this is not an error, it wastes resources. The duplication was caused by a race condition in batch locking. This PR fixes the issue by adding atomic batch locking.

Why ❔

To fix the bug that only activates after running zksync-tee-prover on multiple instances.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zkstack dev fmt and zkstack dev lint.

@pbeza pbeza requested review from haraldh and slowli November 28, 2024 14:03
@pbeza pbeza force-pushed the tee/fix/atomic-batch-locking branch from a86fc98 to e95cb27 Compare November 28, 2024 18:10
@pbeza pbeza requested a review from RomanBrodetski November 29, 2024 11:59
@pbeza pbeza force-pushed the tee/fix/atomic-batch-locking branch from e95cb27 to 46dcfde Compare November 29, 2024 12:06
After [scaling][1] [zksync-tee-prover][2] to two instances/replicas on
Azure for azure-stage2, azure-testnet2, and azure-mainnet2, we started
experiencing [duplicated proving for some batches][3]. While this is not
an erroneous situation, it is wasteful from a resource perspective. This
was due to a race condition in batch locking. This PR fixes the issue by
adding atomic batch locking.

[1]: https://github.com/matter-labs/gitops-kubernetes/pull/7033/files
[2]: https://github.com/matter-labs/zksync-era/blob/aaca32b6ab411d5cdc1234c20af8b5c1092195d7/core/bin/zksync_tee_prover/src/main.rs
[3]: https://grafana.matterlabs.dev/goto/M1I_Bq7HR?orgId=1
@pbeza pbeza force-pushed the tee/fix/atomic-batch-locking branch from 46dcfde to 7d96c1c Compare November 29, 2024 12:10
Contributor

@slowli slowli left a comment

Dumb question: How is the locking made atomic in this PR? AFAIU, the first SELECT statement, if executed concurrently, can still return the same L1 batch number unless some kind of row-level locking is implemented (cf. SELECT FOR UPDATE SKIP LOCKED in this contract verifier query). I'm not even sure the UPDATE query will fail for the transaction committed last in case of a race (maybe it would with the serializable isolation level, but I'd argue that erroring is not the best course of action here; row-level locks seem to work better).
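
The race described in this comment can be sketched with a small in-memory model. This is a hypothetical Python stand-in, not the project's Rust/SQL code: `BatchTable`, `select_unpicked`, `mark_picked`, and `claim_atomically` are invented names, and a `threading.Lock` is assumed to play the role of whatever database locking makes the select-and-update step indivisible.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class BatchTable:
    """Hypothetical in-memory stand-in for a proof-job table."""
    status: dict = field(default_factory=dict)  # batch number -> "unpicked" | "picked"
    lock: threading.Lock = field(default_factory=threading.Lock)

    def select_unpicked(self) -> Optional[int]:
        # Step 1 of the racy flow: a plain SELECT of the oldest unpicked batch.
        candidates = [n for n, s in sorted(self.status.items()) if s == "unpicked"]
        return candidates[0] if candidates else None

    def mark_picked(self, batch: int) -> None:
        # Step 2 of the racy flow: an UPDATE that does not re-check the status.
        self.status[batch] = "picked"

    def claim_atomically(self) -> Optional[int]:
        # Select and update as one indivisible step (assumed locking model).
        with self.lock:
            batch = self.select_unpicked()
            if batch is not None:
                self.mark_picked(batch)
            return batch


def racy_interleaving(table: BatchTable) -> Tuple[Optional[int], Optional[int]]:
    # Both provers run their SELECT before either one runs its UPDATE.
    a = table.select_unpicked()
    b = table.select_unpicked()
    table.mark_picked(a)
    table.mark_picked(b)
    return a, b


def atomic_interleaving(table: BatchTable) -> Tuple[Optional[int], Optional[int]]:
    # Each prover's select-and-update completes before the other one starts.
    return table.claim_atomically(), table.claim_atomically()
```

With two unpicked batches, `racy_interleaving` hands the same batch to both provers, while `atomic_interleaving` hands out two different ones. The interleaving is written out by hand so that the outcome does not depend on thread scheduling.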

core/lib/dal/src/models/storage_tee_proof.rs (outdated, resolved)
core/lib/dal/src/tee_proof_generation_dal.rs (outdated, resolved)
@pbeza
Collaborator Author

pbeza commented Nov 29, 2024

Dumb question: How is the locking made atomic in this PR? (...)

Not a dumb question at all! The dumb one here was me! ;P I totally misunderstood what SQL transactions can actually handle in this context. Had to brush up on the finer details of SQL locking. Thanks for steering me in the right direction! These two links were super helpful:

@pbeza
Collaborator Author

pbeza commented Nov 29, 2024

@slowli, I’ve addressed your code review comments. Take a look when you get a chance.

It’s kinda hard to test properly without deploying it to stage and letting it run for a while. Specifically, let me know if locking rows in the proof_generation_details table is okay (instead of just locking tee_proof_generation_details rows).

@pbeza pbeza requested a review from slowli November 29, 2024 19:00
slowli
slowli previously approved these changes Dec 2, 2024
@pbeza pbeza requested a review from slowli December 3, 2024 12:19
slowli
slowli previously approved these changes Dec 3, 2024
@pbeza
Collaborator Author

pbeza commented Dec 3, 2024

@slowli, @haraldh suggested locking the entire tee_proof_generation_details table to keep things simpler. He also raised a concern that if one TEE prover locks the batch, a second TEE prover instance will just get a no job response instead of waiting for new batches to become available.

Let me know if this more fine-grained locking approach still works for you, or if we’re missing something – or maybe there’s an easier way we haven’t considered.
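
To make the "no job" concern concrete, here is a minimal sketch of skip-locked semantics, assuming a simplified model in which each row carries its own non-blocking lock. `Row`, `claim_skip_locked`, and `commit` are hypothetical names for illustration; this is a Python model of the behavior, not PostgreSQL and not the code in this PR.

```python
import threading
from typing import List, Optional


class Row:
    """Hypothetical stand-in for one job row."""

    def __init__(self, batch: int) -> None:
        self.batch = batch
        self.picked = False           # committed state, visible to every prover
        self.lock = threading.Lock()  # stands in for a row-level lock


def claim_skip_locked(rows: List[Row]) -> Optional[Row]:
    """Lock and return the first unpicked row that nobody else has locked."""
    for row in rows:
        if row.picked:
            continue
        # Never waits: a row locked by another prover is simply skipped.
        if row.lock.acquire(blocking=False):
            if not row.picked:  # re-check now that we hold the lock
                return row
            row.lock.release()
    return None  # the "no job" response


def commit(row: Row) -> None:
    """Publish the claim and release the row lock."""
    row.picked = True
    row.lock.release()
```

In this model, two provers facing two free rows get different batches. If only one row exists and the first prover still holds its lock, the second prover receives `None` immediately instead of waiting, which is the behavior the concern above describes.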

@pbeza pbeza requested a review from slowli December 3, 2024 13:38
@haraldh haraldh enabled auto-merge December 3, 2024 16:32
@haraldh haraldh added this pull request to the merge queue Dec 3, 2024
Merged via the queue into main with commit a7dc0ed Dec 3, 2024
32 checks passed
@haraldh haraldh deleted the tee/fix/atomic-batch-locking branch December 3, 2024 17:15
pbeza added a commit that referenced this pull request Dec 4, 2024
Commit a7dc0ed (PR #3342) was supposed
to fix a race condition in batch locking by introducing SQL row-locking,
but it didn't work as expected. Now we are switching back to
coarser-grained table-level locking as [originally suggested][1] by
Harald. The original fix was hard to test unless deployed to `stage` due
to the nondeterministic nature of the problem, so we needed to merge it
to the `main` branch to properly test it.

[1]: #3342 (comment)
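
The coarser-grained approach can be modeled roughly as follows, assuming a single lock that guards the whole job table. This is a hypothetical Python sketch (`claim_under_table_lock` and `run_provers` are invented names), not the SQL that the follow-up commit actually uses.

```python
import threading
from typing import Dict, List, Optional


def claim_under_table_lock(table_lock: threading.Lock,
                           status: Dict[int, str]) -> Optional[int]:
    """Claim the oldest unpicked batch while holding the table-wide lock."""
    with table_lock:  # blocks until the current holder releases the lock
        for batch in sorted(status):
            if status[batch] == "unpicked":
                status[batch] = "picked"
                return batch
    return None


def run_provers(num_provers: int, status: Dict[int, str]) -> List[Optional[int]]:
    """Let several simulated provers claim concurrently; return what each one got."""
    table_lock = threading.Lock()
    results: List[Optional[int]] = [None] * num_provers

    def worker(i: int) -> None:
        results[i] = claim_under_table_lock(table_lock, status)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_provers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every claim is serialized behind one lock, no batch can be handed out twice, at the cost of provers queuing behind each other even when they would have picked different rows.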
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2024
…3358)

## What ❔

Commit a7dc0ed (PR #3342) was supposed
to fix a race condition in batch locking by introducing SQL row-locking,
but it [didn't work][2] as expected.
![Screenshot From 2024-12-04 11-32-32](https://github.com/user-attachments/assets/959ffc3c-593f-409a-87ab-68ec197040a0)
Now we are switching back to coarser-grained table-level locking as
[originally suggested][1] by Harald. The original fix was hard to test
unless deployed to `stage` due to the nondeterministic nature of the
problem, so we needed to merge it to the `main` branch to properly test
it.

[1]: #3342 (comment)
[2]: https://grafana.matterlabs.dev/goto/AhEd5FVNg?orgId=1

## Why ❔

To fix the bug that only activates after running `zksync-tee-prover` on
multiple instances.

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [x] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.
github-merge-queue bot pushed a commit that referenced this pull request Dec 11, 2024
🤖 I have created a release *beep* *boop*
---


##
[25.3.0](core-v25.2.0...core-v25.3.0)
(2024-12-11)


### Features

* change seal criteria for gateway
([#3320](#3320))
([a0a74aa](a0a74aa))
* **contract-verifier:** Download compilers from GH automatically
([#3291](#3291))
([a10c4ba](a10c4ba))
* integrate gateway changes for some components
([#3274](#3274))
([cbc91e3](cbc91e3))
* **proof-data-handler:** exclude batches without object file in GCS
([#2980](#2980))
([3e309e0](3e309e0))
* **pruning:** Record L1 batch root hash in pruning logs
([#3266](#3266))
([7b6e590](7b6e590))
* **state-keeper:** mempool io opens batch if there is protocol upgrade
tx ([#3360](#3360))
([f6422cd](f6422cd))
* **tee:** add error handling for unstable_getTeeProofs API endpoint
([#3321](#3321))
([26f630c](26f630c))
* **zksync_cli:** Health checkpoint improvements
([#3193](#3193))
([440fe8d](440fe8d))


### Bug Fixes

* **api:** batch fee input scaling for `debug_traceCall`
([#3344](#3344))
([7ace594](7ace594))
* **tee:** correct previous fix for race condition in batch locking
([#3358](#3358))
([b12da8d](b12da8d))
* **tee:** fix race condition in batch locking
([#3342](#3342))
([a7dc0ed](a7dc0ed))
* **tracer:** adds vm error to flatCallTracer error field if exists
([#3374](#3374))
([5d77727](5d77727))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>
gianbelinche added a commit to lambdaclass/zksync-era that referenced this pull request Jan 3, 2025
…vars (#371)

* feat(state-keeper): mempool io opens batch if there is protocol upgrade tx (matter-labs#3360)

## What ❔

Mempool io opens batch if there is protocol upgrade tx

## Why ❔

Currently if mempool is empty but there is protocol upgrade tx, then
batch is not opened

## Checklist

- [ ] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* fix: Fixed cargo deny (matter-labs#3372)

## What ❔

Fixes cargo deny CI fail.

* docs: interop docs update (matter-labs#3366)

## What ❔

## Why ❔

## Checklist

- [ ] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* fix(tracer): adds vm error to flatCallTracer error field if exists (matter-labs#3374)

## What ❔

- Updates `flatCallTracer` error to include vm error if it exists 

## Why ❔

- MM has requested that if an error exists we should populate within
`flatCallTracer` as this is what others do, prior to this PR it was only
revert_reason introduced here:
matter-labs#3306. However, if we have
a vm error the error field is not populated as seen in this tx:
`0x6c85bf34666dcdaa885f2bc6e95186029d2b25f2a3bbdff21c36878e2d4a19ed`
which failed due to a vm panic.

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* chore(main): release core 25.3.0 (matter-labs#3313)

:robot: I have created a release *beep* *boop*
---


##
[25.3.0](matter-labs/zksync-era@core-v25.2.0...core-v25.3.0)
(2024-12-11)


### Features

* change seal criteria for gateway
([matter-labs#3320](matter-labs#3320))
([a0a74aa](matter-labs@a0a74aa))
* **contract-verifier:** Download compilers from GH automatically
([matter-labs#3291](matter-labs#3291))
([a10c4ba](matter-labs@a10c4ba))
* integrate gateway changes for some components
([matter-labs#3274](matter-labs#3274))
([cbc91e3](matter-labs@cbc91e3))
* **proof-data-handler:** exclude batches without object file in GCS
([matter-labs#2980](matter-labs#2980))
([3e309e0](matter-labs@3e309e0))
* **pruning:** Record L1 batch root hash in pruning logs
([matter-labs#3266](matter-labs#3266))
([7b6e590](matter-labs@7b6e590))
* **state-keeper:** mempool io opens batch if there is protocol upgrade
tx ([matter-labs#3360](matter-labs#3360))
([f6422cd](matter-labs@f6422cd))
* **tee:** add error handling for unstable_getTeeProofs API endpoint
([matter-labs#3321](matter-labs#3321))
([26f630c](matter-labs@26f630c))
* **zksync_cli:** Health checkpoint improvements
([matter-labs#3193](matter-labs#3193))
([440fe8d](matter-labs@440fe8d))


### Bug Fixes

* **api:** batch fee input scaling for `debug_traceCall`
([matter-labs#3344](matter-labs#3344))
([7ace594](matter-labs@7ace594))
* **tee:** correct previous fix for race condition in batch locking
([matter-labs#3358](matter-labs#3358))
([b12da8d](matter-labs@b12da8d))
* **tee:** fix race condition in batch locking
([matter-labs#3342](matter-labs#3342))
([a7dc0ed](matter-labs@a7dc0ed))
* **tracer:** adds vm error to flatCallTracer error field if exists
([matter-labs#3374](matter-labs#3374))
([5d77727](matter-labs@5d77727))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>

* feat(eigen-client-extra-features): Fix PR comments (#369)

* Add envy load

* Readd proto reference

* Rename blob id to request id

* Make literals constants

* Make point size constant

* Get pool unique

* Remaining comments

* Fix comment

* Add check for failed states

* Change l1 name

* Cargo lock conflicts

* remove concurrent dispatcher leftovers

* Solve comments (#372)

* remove METRICS var

* feat(eigen-client-extra-features): address PR comments (#375)

* Change settlement layer for u32

* Change string to address

* Remove unwraps

* Remove error from name

* Remove unused to bytes

* Rename call for get blob data

* Revert "Change string to address"

This reverts commit 6dd94d4.

* Change string for address

* feat(eigen-client-extra-features): address PR comments (part 2) (#374)

* initial commit

* clippy suggestion

* feat(eigen-client-extra-features): address PR comments (part 3) (#376)

* use keccak256 fn

* simplify get_context_block

* use saturating sub

* feat(eigen-client-extra-features): address PR comments (part 4) (#378)

* Replace decode bytes for ethabi

* Add default to eigenconfig

* Change str to url

* Add index to data availability table

* Address comments

* Change error to verificationerror

* Format code

* feat(eigen-client-extra-features): address PR comments (part 5) (#377)

* use trait object

* prevent blocking non async code

* clippy suggestion

---------

Co-authored-by: juan518munoz <[email protected]>

---------

Co-authored-by: Gianbelinche <[email protected]>

---------

Co-authored-by: Gianbelinche <[email protected]>

* Format code

---------

Co-authored-by: juan518munoz <[email protected]>

---------

Co-authored-by: perekopskiy <[email protected]>
Co-authored-by: Bruno França <[email protected]>
Co-authored-by: kelemeno <[email protected]>
Co-authored-by: Dustin Brickwood <[email protected]>
Co-authored-by: zksync-era-bot <[email protected]>
Co-authored-by: zksync-era-bot <[email protected]>
Co-authored-by: Gianbelinche <[email protected]>
gianbelinche added a commit to lambdaclass/zksync-era that referenced this pull request Jan 3, 2025
* feat(state-keeper): mempool io opens batch if there is protocol upgrade tx (matter-labs#3360)

## What ❔

Mempool io opens batch if there is protocol upgrade tx

## Why ❔

Currently if mempool is empty but there is protocol upgrade tx, then
batch is not opened

## Checklist

- [ ] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* fix: Fixed cargo deny (matter-labs#3372)

## What ❔

Fixes cargo deny CI fail.

* docs: interop docs update (matter-labs#3366)

## What ❔

## Why ❔

## Checklist

- [ ] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* fix(tracer): adds vm error to flatCallTracer error field if exists (matter-labs#3374)

## What ❔

- Updates `flatCallTracer` error to include vm error if it exists 

## Why ❔

- MM has requested that if an error exists we should populate within
`flatCallTracer` as this is what others do, prior to this PR it was only
revert_reason introduced here:
matter-labs#3306. However, if we have
a vm error the error field is not populated as seen in this tx:
`0x6c85bf34666dcdaa885f2bc6e95186029d2b25f2a3bbdff21c36878e2d4a19ed`
which failed due to a vm panic.

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

* chore(main): release core 25.3.0 (matter-labs#3313)

:robot: I have created a release *beep* *boop*
---


##
[25.3.0](matter-labs/zksync-era@core-v25.2.0...core-v25.3.0)
(2024-12-11)


### Features

* change seal criteria for gateway
([matter-labs#3320](matter-labs#3320))
([a0a74aa](matter-labs@a0a74aa))
* **contract-verifier:** Download compilers from GH automatically
([matter-labs#3291](matter-labs#3291))
([a10c4ba](matter-labs@a10c4ba))
* integrate gateway changes for some components
([matter-labs#3274](matter-labs#3274))
([cbc91e3](matter-labs@cbc91e3))
* **proof-data-handler:** exclude batches without object file in GCS
([matter-labs#2980](matter-labs#2980))
([3e309e0](matter-labs@3e309e0))
* **pruning:** Record L1 batch root hash in pruning logs
([matter-labs#3266](matter-labs#3266))
([7b6e590](matter-labs@7b6e590))
* **state-keeper:** mempool io opens batch if there is protocol upgrade
tx ([matter-labs#3360](matter-labs#3360))
([f6422cd](matter-labs@f6422cd))
* **tee:** add error handling for unstable_getTeeProofs API endpoint
([matter-labs#3321](matter-labs#3321))
([26f630c](matter-labs@26f630c))
* **zksync_cli:** Health checkpoint improvements
([matter-labs#3193](matter-labs#3193))
([440fe8d](matter-labs@440fe8d))


### Bug Fixes

* **api:** batch fee input scaling for `debug_traceCall`
([matter-labs#3344](matter-labs#3344))
([7ace594](matter-labs@7ace594))
* **tee:** correct previous fix for race condition in batch locking
([matter-labs#3358](matter-labs#3358))
([b12da8d](matter-labs@b12da8d))
* **tee:** fix race condition in batch locking
([matter-labs#3342](matter-labs#3342))
([a7dc0ed](matter-labs@a7dc0ed))
* **tracer:** adds vm error to flatCallTracer error field if exists
([matter-labs#3374](matter-labs#3374))
([5d77727](matter-labs@5d77727))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>

* feat(eigen-client-extra-features): Fix PR comments (#369)

* Add envy load

* Readd proto reference

* Rename blob id to request id

* Make literals constants

* Make point size constant

* Get pool unique

* Remaining comments

* Fix comment

* Add check for failed states

* Change l1 name

* Cargo lock conflicts

* remove concurrent dispatcher leftovers

* Solve comments (#372)

* Remove eigen client for external crate

* Add real repo

* remove METRICS var

* Change proxy name and remove generic

* feat(eigen-client-extra-features): address PR comments (#375)

* Change settlement layer for u32

* Change string to address

* Remove unwraps

* Remove error from name

* Remove unused to bytes

* Rename call for get blob data

* Revert "Change string to address"

This reverts commit 6dd94d4.

* Change string for address

* feat(eigen-client-extra-features): address PR comments (part 2) (#374)

* initial commit

* clippy suggestion

* feat(eigen-client-extra-features): address PR comments (part 3) (#376)

* use keccak256 fn

* simplify get_context_block

* use saturating sub

* feat(eigen-client-extra-features): address PR comments (part 4) (#378)

* Replace decode bytes for ethabi

* Add default to eigenconfig

* Change str to url

* Add index to data availability table

* Address comments

* Change error to verificationerror

* Format code

* feat(eigen-client-extra-features): address PR comments (part 5) (#377)

* use trait object

* prevent blocking non async code

* clippy suggestion

---------

Co-authored-by: juan518munoz <[email protected]>

---------

Co-authored-by: Gianbelinche <[email protected]>

---------

Co-authored-by: Gianbelinche <[email protected]>

* Format code

---------

Co-authored-by: juan518munoz <[email protected]>

* Fix compilation

* Update branch

---------

Co-authored-by: perekopskiy <[email protected]>
Co-authored-by: Bruno França <[email protected]>
Co-authored-by: kelemeno <[email protected]>
Co-authored-by: Dustin Brickwood <[email protected]>
Co-authored-by: zksync-era-bot <[email protected]>
Co-authored-by: zksync-era-bot <[email protected]>
Co-authored-by: Juan Munoz <[email protected]>
Co-authored-by: juan518munoz <[email protected]>
3 participants