Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF validation failures #3581

Closed
pepyakin opened this issue Aug 6, 2021 · 2 comments · Fixed by #6551
Closed

PVF validation failures #3581

pepyakin opened this issue Aug 6, 2021 · 2 comments · Fixed by #6551
Assignees
Labels
I3-bug Fails to follow expected behavior.

Comments

@pepyakin
Copy link
Contributor

pepyakin commented Aug 6, 2021

Several validators report in their logs candidate validation failures (aka internal errors during candidate validation).

Such a log line states (edited for legibility):

Failed to validate candidate due to internal error 
err=ValidationFailed(
  "failed to read the artifact at /home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660: Custom { kind: NotFound, error: VerboseError { source: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }, message: \"could not read file `/home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660`\" 
  } 
  }"
)

This happens when an request for executing a PVF is received by the execution worker, but the worker cannot read the artifact.

Here is the logic why this should not happen. Apparently something doesn't hold true.

  1. The control over the artifact files is exclusive to the validation host.
  2. During the node start-up, the artifacts cache is cleaned up.
  3. In order to be executed, a PVF should be prepared first. This means that artifacts should have an entry ArtifactsState::Prepared for that artifact. If there is no, the preparation process kicks in. The execution request is stashed until after the preparation is done. Preparation goes through the preparation queue and the pool.
  4. The pool gets an available worker and instructs it to work on the given PVF. The host creates a temporary file for the artifact and passes it to the worker. The worker starts compilation. When the worker finishes, it writes the serialized artifact into the temporary file and notifies the host that it's done. The worker atomically moves (renames) the temporary file to the destination filename of the artifact. If the worker didn't meet the deadline the host writes an artifact with error description into the destination file.
    • one suspicious point is that the host does not check the error when it writes the artifact file.
      // best effort: there is nothing we can do here if the write fails.
      let _ = async_std::fs::write(&artifact_path, &bytes).await;
  5. If the worker concluded or "didn't make it", then the pool notifies the queue. In both cases, the queue reports to the host that the artifact is prepared now.
  6. The host will react by adding an entry ArtifactsState::Prepared to artifacts for the PVF in question. The last_time_needed will be set to the current time. It will also dispatch the pending execution requests.
  7. Execution request will come through the queue and ultimately processed by an execution worker. When execution worker receives the request, it will read the requested artifact. If it doesn't exist it reports the internal error. A request for execution will bump the last_time_needed to the current time.
  8. There is a separate process for pruning the prepared artifacts whose last_time_needed is older by a predefined parameter. This process is run very rarely (say, once in a day). Once the artifact is expired it is removed from the artifacts eagerly atomically. That is, it cannot explain persistent failures we are observing.

This may or may not be related #3499

@mrcnski
Copy link
Contributor

mrcnski commented Jan 10, 2023

Is this ticket still relevant?

  1. This is not the case anymore:

one suspicious point is that the host does not check the error when it writes the artifact file.

https://github.com/paritytech/polkadot/blob/30005e6b6e/node/core/pvf/src/prepare/worker.rs#L370

  1. Also, I believe we no longer save the error into the artifact. We should update this line:

https://github.com/paritytech/polkadot/blob/30005e6b6e/node/core/pvf/src/lib.rs#L78

  1. I think the flow described in this ticket be nice to copy/paste into a module docstring at artifacts.rs (after fixing the out-of-date parts). Thoughts?

cc @slumber @eskimor

@eskimor
Copy link
Member

eskimor commented Jan 11, 2023

Yes please. Let's update docs and close, if nothing is obviously wrong, then keeping this open without being able to reproduce is rather pointless. Thanks @mrcnski !

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I3-bug Fails to follow expected behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants