
VerificationTest > killingTheRuntime failed because runtime tried to insert journal entry twice #524

Closed
tillrohrmann opened this issue Jun 22, 2023 · 1 comment · Fixed by #527

@tillrohrmann
Contributor

In the logs one can observe that we run multiple InvocationTasks for interpreter.CommandInterpreter-FwoTCgo1MzQzOTQ3MzA1EAMYDiCIJxAi-0188e298f31e72eaa312d51064b4ce8d, which is very strange. One instance of this invocation is executed for partition leader epoch (9, 1) and the other for (8, 1). Once the second task tries to append a GetStateEntry for journal index 2, the runtime panics with:

thread 'restate' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `3`: Expect to receive next journal entry for interpreter.CommandInterpreter-FwoTCgo1MzQzOTQ3MzA1EAMYDiCIJxAi-0188e298f31e72eaa312d51064b4ce8d', /restate/src/worker/src/partition/state_machine/mod.rs:434:9
stack backtrace:

So it looks as if those two InvocationTasks, even though they are executed for different partition leader epochs (i.e. different PartitionInvocationStateMachineCoordinators), produce entries into the same partition processor.
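
For illustration, here is a minimal, self-contained sketch of the journal-index invariant that fires here. The names (JournalStatus, append_entry) and the placeholder invocation id are made up; the real assertion lives in src/worker/src/partition/state_machine/mod.rs:434.

// Sketch of the journal-append invariant behind the panic above (hypothetical names).
struct JournalStatus {
    invocation_id: String,
    // Index the state machine expects for the next appended journal entry.
    next_entry_index: u32,
}

impl JournalStatus {
    fn append_entry(&mut self, entry_index: u32) {
        // If two InvocationTasks feed the same partition processor, the second
        // append arrives with a stale index and this assertion fails.
        assert_eq!(
            entry_index, self.next_entry_index,
            "Expect to receive next journal entry for {}",
            self.invocation_id
        );
        self.next_entry_index += 1;
    }
}

fn main() {
    let mut journal = JournalStatus {
        // Placeholder for the full interpreter.CommandInterpreter-... id from the log.
        invocation_id: "interpreter.CommandInterpreter-<id>".to_string(),
        next_entry_index: 2,
    };
    // The task for leader epoch (9, 1) appends entry 2; the expected index advances to 3.
    journal.append_entry(2);
    // A stale task from leader epoch (8, 1) replays entry 2 and the assertion
    // panics with `left: 2, right: 3`, matching the log above.
    journal.append_entry(2);
}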

A slightly related question is why we run multiple InvocationTasks for the same sid in the first place.

https://github.com/restatedev/restate/actions/runs/5343947305/jobs/9688317378#step:14:2093
container-logs (2).zip

@tillrohrmann
Contributor Author

I think this is caused by the network using a different partition table than the one the partition processors believe they are responsible for. Because of this, it can happen at recovery time that an invocation is processed by two partition processors (one that receives the message via the shuffle/network and another that retrieves it from RocksDB).

To fix this problem, we need to update the network to use the proper partition table. Additionally, we should add a check in the partition processor which verifies that it only processes messages for the partition range it owns; a sketch of such a guard follows below.
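
As a rough sketch of the second part, assuming the partition processor knows the key range it owns (PartitionProcessor, partition_key_range and handle_message are made-up names, not the actual API):

use std::ops::RangeInclusive;

type PartitionKey = u64;

// Hypothetical partition processor that guards against messages routed with an
// outdated partition table.
struct PartitionProcessor {
    partition_key_range: RangeInclusive<PartitionKey>,
}

impl PartitionProcessor {
    fn handle_message(&self, partition_key: PartitionKey) -> Result<(), String> {
        // Reject messages outside the owned range instead of silently applying
        // them and corrupting another partition's journal.
        if !self.partition_key_range.contains(&partition_key) {
            return Err(format!(
                "received message for partition key {partition_key} outside of owned range {:?}",
                self.partition_key_range
            ));
        }
        // ... actual message processing would happen here ...
        Ok(())
    }
}

fn main() {
    let processor = PartitionProcessor { partition_key_range: 0..=511 };
    assert!(processor.handle_message(100).is_ok());
    // A message routed with a stale partition table is rejected loudly.
    assert!(processor.handle_message(900).is_err());
}

Failing loudly here would surface a stale partition table immediately instead of manifesting later as the journal-index panic above.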
