PVF validation failures #3581

pepyakin · 2021-08-06T13:53:50Z

Several validators report candidate validation failures (i.e. internal errors during candidate validation) in their logs.

Such a log line states (edited for legibility):

Failed to validate candidate due to internal error 
err=ValidationFailed(
  "failed to read the artifact at /home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660: Custom { kind: NotFound, error: VerboseError { source: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }, message: \"could not read file `/home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660`\" 
  } 
  }"
)

This happens when a request to execute a PVF is received by the execution worker, but the worker cannot read the artifact.

Here is the reasoning for why this should not happen. Apparently one of these assumptions doesn't hold.

The control over the artifact files is exclusive to the validation host.
During the node start-up, the artifacts cache is cleaned up.
In order to be executed, a PVF should be prepared first. This means that artifacts should have an entry ArtifactsState::Prepared for that artifact. If there is none, the preparation process kicks in. The execution request is stashed until after the preparation is done. Preparation goes through the preparation queue and the pool.
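
As a rough illustration of the lookup-or-prepare step above, here is a minimal sketch. Only `ArtifactsState::Prepared` and `last_time_needed` come from the description; the `Preparing` variant, the `Action` enum, `ArtifactId`, and `on_execute_request` are hypothetical names made up for this example and are not taken from the actual host code.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Hypothetical stand-in for the key that identifies an artifact.
type ArtifactId = u64;

/// Simplified states. `Preparing` is an assumed variant for this sketch.
#[derive(Debug)]
enum ArtifactsState {
    /// The artifact is ready to be executed.
    Prepared { last_time_needed: SystemTime },
    /// Preparation is in flight; execution requests are stashed here.
    Preparing { pending: Vec<String> },
}

/// Hypothetical description of what the host does with a request.
#[derive(Debug, PartialEq)]
enum Action {
    Execute,
    StartPreparation,
    WaitForPreparation,
}

fn on_execute_request(
    artifacts: &mut HashMap<ArtifactId, ArtifactsState>,
    id: ArtifactId,
    request: String,
    now: SystemTime,
) -> Action {
    match artifacts.get_mut(&id) {
        // Already prepared: bump the timestamp and execute right away.
        Some(ArtifactsState::Prepared { last_time_needed }) => {
            *last_time_needed = now;
            Action::Execute
        }
        // Preparation already running: stash the request until it is done.
        Some(ArtifactsState::Preparing { pending }) => {
            pending.push(request);
            Action::WaitForPreparation
        }
        // No entry at all: stash the request and kick off preparation.
        None => {
            artifacts.insert(id, ArtifactsState::Preparing { pending: vec![request] });
            Action::StartPreparation
        }
    }
}
```

The first request for an unknown artifact yields `Action::StartPreparation`, later requests for the same artifact wait, and requests for a prepared artifact execute immediately.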

The pool gets an available worker and instructs it to work on the given PVF. The host creates a temporary file for the artifact and passes it to the worker. The worker starts compilation. When the worker finishes, it writes the serialized artifact into the temporary file and notifies the host that it's done. The host atomically moves (renames) the temporary file to the destination filename of the artifact. If the worker didn't meet the deadline, the host writes an artifact with an error description into the destination file.
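
The write-then-rename step can be sketched as follows. This is an illustration using the synchronous `std::fs` API, not the actual code (which is async); the function names `write_artifact_to_tmp` and `publish_artifact` are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write the serialized artifact into the temporary file.
fn write_artifact_to_tmp(tmp_path: &Path, bytes: &[u8]) -> io::Result<()> {
    fs::write(tmp_path, bytes)
}

/// Publish the artifact by renaming the temporary file to its final name.
/// On POSIX systems a rename within one filesystem replaces the destination
/// atomically, so a reader sees either the complete file or no file at all.
fn publish_artifact(tmp_path: &Path, artifact_path: &Path) -> io::Result<()> {
    fs::rename(tmp_path, artifact_path)
}
```

Both functions return `io::Result`, so a failure in either step is visible to the caller. Note that the atomicity argument assumes the temporary file and the destination live on the same filesystem.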

One suspicious point is that the host does not check the error when it writes the artifact file.

polkadot/node/core/pvf/src/prepare/worker.rs

Lines 172 to 173 in c4ee9d4

    
    // best effort: there is nothing we can do here if the write fails.
    let _ = async_std::fs::write(&artifact_path, &bytes).await;
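
For comparison, a variant that surfaces the write failure instead of discarding it might look like the sketch below. It uses the synchronous `std::fs::write` rather than `async_std` to stay self-contained, and `write_artifact_checked` is a hypothetical name, not a function from the codebase.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write the artifact bytes and return any I/O failure to the caller,
/// with the path added to the message, instead of dropping it via `let _ =`.
fn write_artifact_checked(artifact_path: &Path, bytes: &[u8]) -> io::Result<()> {
    fs::write(artifact_path, bytes).map_err(|err| {
        io::Error::new(
            err.kind(),
            format!("failed to write artifact {}: {}", artifact_path.display(), err),
        )
    })
}
```

With a result like this, the caller could log the error or decline to mark the artifact as prepared when the file was never written. Whether that is the right reaction is a design question this sketch does not answer.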

If the worker concluded or "didn't make it", then the pool notifies the queue. In both cases, the queue reports to the host that the artifact is prepared now.
The host will react by adding an entry ArtifactsState::Prepared to artifacts for the PVF in question. The last_time_needed will be set to the current time. It will also dispatch the pending execution requests.
An execution request will come through the queue and ultimately be processed by an execution worker. When the execution worker receives the request, it will read the requested artifact. If the artifact doesn't exist, the worker reports an internal error. A request for execution will bump the last_time_needed to the current time.
There is a separate process for pruning the prepared artifacts whose last_time_needed is older than a predefined parameter. This process runs very rarely (say, once a day). Once an artifact has expired, it is removed from artifacts eagerly and atomically. Therefore, pruning cannot explain the persistent failures we are observing.
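
The pruning rule can be sketched roughly as below. The map from a numeric id to its `last_time_needed`, the `ttl` parameter name, and the `prune` function are assumptions made for this example; the real table holds richer state.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Remove every entry whose `last_time_needed` is older than `ttl`
/// relative to `now`. Returns the removed ids (sorted) so the caller
/// can delete the matching artifact files.
fn prune(
    last_time_needed: &mut HashMap<u64, SystemTime>,
    ttl: Duration,
    now: SystemTime,
) -> Vec<u64> {
    let mut expired = Vec::new();
    last_time_needed.retain(|id, t| {
        // A timestamp in the future (clock skew) is treated as not expired.
        let is_expired = matches!(now.duration_since(*t), Ok(age) if age > ttl);
        if is_expired {
            expired.push(*id);
        }
        !is_expired
    });
    expired.sort_unstable();
    expired
}
```

Because every execution request bumps `last_time_needed`, an artifact that is in active use should never be older than `ttl` under this rule, which is why pruning alone is an unlikely cause of repeated failures.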

This may or may not be related to #3499.


mrcnski · 2023-01-10T16:33:47Z

Is this ticket still relevant?

This is not the case anymore:

"One suspicious point is that the host does not check the error when it writes the artifact file."

https://github.com/paritytech/polkadot/blob/30005e6b6e/node/core/pvf/src/prepare/worker.rs#L370

Also, I believe we no longer save the error into the artifact. We should update this line:

https://github.com/paritytech/polkadot/blob/30005e6b6e/node/core/pvf/src/lib.rs#L78

I think the flow described in this ticket would be nice to copy/paste into a module docstring at artifacts.rs (after fixing the out-of-date parts). Thoughts?

cc @slumber @eskimor

eskimor · 2023-01-11T11:13:06Z

Yes please. Let's update the docs and close. If nothing is obviously wrong, keeping this open without being able to reproduce is rather pointless. Thanks @mrcnski !

pepyakin added the I3-bug Fails to follow expected behavior. label Aug 6, 2021

pepyakin mentioned this issue Aug 8, 2021

Add logging to PVF and other related parts #3596

Merged

pepyakin mentioned this issue Aug 25, 2021

PVF validation host: move artifacts states into memory #3720

Closed

mrcnski self-assigned this Jan 12, 2023

mrcnski mentioned this issue Jan 13, 2023

pvf: Update docs for PVF artifacts #6551

Merged

paritytech-processbot bot closed this as completed in #6551 Apr 19, 2023
