Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug deadlock while applying DB migrations #4588

Closed
bartekn opened this issue Sep 14, 2022 · 4 comments · Fixed by #4587
Closed

Debug deadlock while applying DB migrations #4588

bartekn opened this issue Sep 14, 2022 · 4 comments · Fixed by #4587

Comments

@bartekn
Copy link
Contributor

bartekn commented Sep 14, 2022

What version are you using?

2.20.0

What did you do?

Apply 58_add_index_by_id_optimization.sql migration file while ingestion is running on another ingesting instance.

What did you expect to see?

No error.

What did you see instead?

Deadlock error:

pq: deadlock detected handling 58_add_index_by_id_optimization.sql

There are two things that are important to remember: 1. the migrations are run in a db transaction and 2. because we have multiple ingesting instance, ingestion is running along db migrations (also in a DB transaction). This is the chain of events I think that leads to deadlock:

  1. first CREATE INDEX query in the migration file acquires ShareLock on history_operation_claimable_balances (this lock is going to be released at the end of transaction).
  2. one of ingesting instances INSERT a row into history_operation_liquidity_pools and by doing this acquires RowExclusiveLock (it conflicts with ShareLock and is also released at the end of transaction).
  3. now, CREATE INDEX on history_operation_liquidity_pools is executed, again ShareLock acquire attempt but on history_operation_liquidity_pools table and needs to wait until ingestion commits because there’s already a conflicting lock on this table.
  4. ingesting instance attempts to INSERT a new row into history_operation_claimable_balances but waits because in step 1 a ShareLock was acquired.
  5. deadlock because both txs wait for the other tx lock to be released.

I created a simple local repro.

session 1:

deadlock_repro=# create table t1 (f1 int);
CREATE TABLE
deadlock_repro=# create table t2 (f1 int);
CREATE TABLE
deadlock_repro=# begin;
BEGIN
deadlock_repro=*# create index i1 on t1(f1);
CREATE INDEX
deadlock_repro=*# create index i2 on t2(f1);
CREATE INDEX
deadlock_repro=*# 

session 2:

deadlock_repro=# begin;
BEGIN
deadlock_repro=*# insert into t2 (f1) values (0);
INSERT 0 1
deadlock_repro=*# insert into t1 (f1) values (0);
ERROR:  deadlock detected
LINE 1: insert into t1 (f1) values (0);
                    ^
DETAIL:  Process 80691 waits for RowExclusiveLock on relation 1638638 of database 1638622; blocked by process 80642.
Process 80642 waits for ShareLock on relation 1638641 of database 1638622; blocked by process 80691.
HINT:  See server log for query details.

(the queries within txs are run one after another in two sessions: S1, S2, S1, S2)

@tsachiherman
Copy link
Contributor

What I'm going to propose might be an overkill :
Should we require the migration transaction to take a share lock, so that all the index creation operations would block the ingestion ?

@bartekn
Copy link
Contributor Author

bartekn commented Sep 21, 2022

Yes, this is exactly the purpose of #4587. There is no other way to prevent deadlocks if ingestion is running along db migations.

Repository owner moved this from In Progress to Done in Platform Scrum Sep 27, 2022
@Shaptic
Copy link
Contributor

Shaptic commented Oct 13, 2022

Upon releasing v2.22.0, it appears that non-ingesting instances are trying to write to the read-only database. The error during deployment via deployinator was:

pq: cannot set transaction read-write mode during recovery

Is it possible that the PR at #4587 is triggering that? The workaround was to run the Restart job after running the Install job, as it seems that restarting the service does not attempt to apply any migrations.

@Shaptic Shaptic reopened this Oct 13, 2022
@tamirms
Copy link
Contributor

tamirms commented Oct 18, 2022

Upon releasing v2.22.0, it appears that non-ingesting instances are trying to write to the read-only database. The error during deployment via deployinator was:

pq: cannot set transaction read-write mode during recovery

Is it possible that the PR at #4587 is triggering that? The workaround was to run the Restart job after running the Install job, as it seems that restarting the service does not attempt to apply any migrations.

this is actually a pre-existing issue #4580 which was fixed by #4664

@tamirms tamirms closed this as completed Oct 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants