Several validators report in their logs candidate validation failures (aka internal errors during candidate validation).
Such a log line states (edited for legibility):
Failed to validate candidate due to internal error
err=ValidationFailed(
"failed to read the artifact at /home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660: Custom { kind: NotFound, error: VerboseError { source: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }, message: \"could not read file `/home/parity/.local/share/polkadot/chains/ksmcc3/db/pvf-artifacts/wasmtime_0xc159229b363fccf10ceaab1f7f6f3ac2b0c9557cf7db14b33e978531e0925660`\"
}
}"
)
This happens when a request to execute a PVF is received by the execution worker, but the worker cannot read the artifact.
Here is the reasoning for why this should not happen; apparently, one of these assumptions does not hold.
Control over the artifact files is exclusive to the validation host.
During node start-up, the artifact cache is cleaned up.
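As a rough sketch of that start-up cleanup (the `wasmtime_` file prefix matches the path in the log above, but the function and constant names here are assumptions, not the real code):

```rust
use std::{fs, io, path::Path};

// Hypothetical sketch: remove leftover compiled artifacts so the cache starts empty.
const ARTIFACT_PREFIX: &str = "wasmtime_";

fn clean_artifact_cache(cache_dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(cache_dir)? {
        let entry = entry?;
        let is_artifact = entry
            .file_name()
            .to_str()
            .map_or(false, |name| name.starts_with(ARTIFACT_PREFIX));
        if is_artifact {
            // Best effort: failing to delete one stale file should not abort start-up.
            let _ = fs::remove_file(entry.path());
        }
    }
    Ok(())
}
```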
In order to be executed, a PVF must be prepared first. This means that artifacts should contain an ArtifactsState::Prepared entry for that artifact. If there is none, the preparation process kicks in, and the execution request is stashed until the preparation is done. Preparation goes through the preparation queue and the pool.
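A minimal sketch of that bookkeeping, assuming a simplified shape for artifacts and its entries (PendingExecution, on_execute_request, and the hash type are made up for illustration and do not mirror the real definitions in polkadot/node/core/pvf):

```rust
use std::{collections::HashMap, time::SystemTime};

// Sketch only: one state entry per PVF, keyed by its hash.
enum ArtifactsState {
    /// The artifact is on disk and ready to be executed.
    Prepared { last_time_needed: SystemTime },
    /// Compilation is in flight; execution requests are stashed until it is done.
    Preparing { waiting_executions: Vec<PendingExecution> },
}

/// A stashed execution request (hypothetical shape).
struct PendingExecution {
    params: Vec<u8>,
}

struct Artifacts {
    table: HashMap<[u8; 32], ArtifactsState>,
}

impl Artifacts {
    /// What the host does with an incoming execution request for `pvf_hash`.
    fn on_execute_request(&mut self, pvf_hash: [u8; 32], request: PendingExecution) {
        match self.table.get_mut(&pvf_hash) {
            Some(ArtifactsState::Prepared { last_time_needed }) => {
                // Already prepared: bump the timestamp and execute right away.
                *last_time_needed = SystemTime::now();
                let _ = request; // ... send the request to the execution queue ...
            }
            Some(ArtifactsState::Preparing { waiting_executions }) => {
                // Preparation already in flight: stash the request until it is done.
                waiting_executions.push(request);
            }
            None => {
                // Unknown PVF: kick off preparation and stash the request.
                self.table.insert(
                    pvf_hash,
                    ArtifactsState::Preparing { waiting_executions: vec![request] },
                );
                // ... enqueue the PVF into the preparation queue ...
            }
        }
    }
}
```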
The pool gets an available worker and instructs it to work on the given PVF. The host creates a temporary file for the artifact and passes it to the worker. The worker starts compilation. When the worker finishes, it writes the serialized artifact into the temporary file and notifies the host that it is done. The worker then atomically moves (renames) the temporary file to the destination filename of the artifact. If the worker didn't meet the deadline, the host writes an artifact with an error description into the destination file.
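The "temporary file, then atomic rename" step might look like the following sketch (function and parameter names are assumptions; note that std::fs::rename is only atomic when both paths are on the same filesystem, which is presumably why the temporary file lives next to the destination):

```rust
use std::{fs, io, path::Path};

// Sketch of the commit step: write to a temporary file, then rename into place.
fn commit_artifact(tmp_path: &Path, artifact_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Write the serialized artifact into the temporary file first ...
    fs::write(tmp_path, bytes)?;
    // ... then move it into place in a single step. A reader at `artifact_path`
    // either sees the complete artifact or no file at all, never a torn write.
    fs::rename(tmp_path, artifact_path)
}
```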
One suspicious point is that the host does not check for an error when it writes the artifact file:
// best effort: there is nothing we can do here if the write fails.
let _ = async_std::fs::write(&artifact_path, &bytes).await;
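One possible hardening, shown as a sketch rather than an actual fix (it assumes the tracing crate is available for logging): at least record the failure, so that a later "artifact not found" error can be traced back to a failed write.

```rust
use std::path::Path;

// Hypothetical hardened variant of the write above: log the error instead of
// silently discarding it.
async fn write_artifact_checked(artifact_path: &Path, bytes: &[u8]) {
    if let Err(err) = async_std::fs::write(artifact_path, bytes).await {
        tracing::warn!(
            artifact_path = %artifact_path.display(),
            "failed to write prepared artifact: {}",
            err,
        );
    }
}
```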
Whether the worker concluded or "didn't make it", the pool notifies the queue, and in both cases the queue reports to the host that the artifact is now prepared.
The host will react by adding an entry ArtifactsState::Prepared to artifacts for the PVF in question. The last_time_needed will be set to the current time. It will also dispatch the pending execution requests.
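Continuing the hypothetical Artifacts sketch from above (this fragment reuses its types and is not self-contained on its own), the transition when the queue reports that preparation has concluded might look like this:

```rust
impl Artifacts {
    /// Called when the preparation queue reports that work on `pvf_hash` has
    /// concluded, whether it succeeded or not (continuing the sketch above).
    fn on_prepare_concluded(&mut self, pvf_hash: [u8; 32]) {
        // Replace whatever state we had with `Prepared`, stamped with "now".
        let previous = self.table.insert(
            pvf_hash,
            ArtifactsState::Prepared { last_time_needed: SystemTime::now() },
        );
        // Any execution requests stashed while the PVF was being prepared are
        // dispatched now.
        if let Some(ArtifactsState::Preparing { waiting_executions }) = previous {
            for _request in waiting_executions {
                // ... send each stashed request to the execution queue ...
            }
        }
    }
}
```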
The execution request will come through the queue and ultimately be processed by an execution worker. When the execution worker receives the request, it will read the requested artifact; if it doesn't exist, it reports an internal error. A request for execution will also bump last_time_needed to the current time.
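On the execution side, the read that fails in the reported logs is essentially just this (names are assumptions; the error string in the actual log comes from the underlying file read):

```rust
use std::path::Path;

// Sketch of the execution worker loading the artifact it was asked to run.
// A `NotFound` error here is what surfaces as the internal validation error.
fn load_artifact(artifact_path: &Path) -> Result<Vec<u8>, String> {
    std::fs::read(artifact_path).map_err(|err| {
        format!("failed to read the artifact at {}: {}", artifact_path.display(), err)
    })
}
```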
There is a separate process for pruning prepared artifacts whose last_time_needed is older than a predefined parameter. This process runs very rarely (say, once a day). Once an artifact expires, it is eagerly and atomically removed from artifacts. That is, pruning cannot explain the persistent failures we are observing. This may or may not be related to #3499.
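A sketch of that pruning pass, assuming a one-day TTL and a simplified table keyed by PVF hash (both are assumptions for illustration; the text above only says the threshold is a predefined parameter):

```rust
use std::{
    collections::HashMap,
    time::{Duration, SystemTime},
};

// Hypothetical TTL; the real value is whatever the predefined parameter is.
const ARTIFACT_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns the hashes of expired artifacts, removing them from the table; the
/// caller is expected to delete the corresponding files afterwards.
fn prune(table: &mut HashMap<[u8; 32], SystemTime>, now: SystemTime) -> Vec<[u8; 32]> {
    let expired: Vec<[u8; 32]> = table
        .iter()
        .filter_map(|(hash, last_time_needed)| {
            let age = now.duration_since(*last_time_needed).ok()?;
            if age > ARTIFACT_TTL { Some(*hash) } else { None }
        })
        .collect();
    for hash in &expired {
        table.remove(hash);
    }
    expired
}
```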
I think the flow described in this ticket would be nice to copy/paste into a module docstring at artifacts.rs (after fixing the out-of-date parts). Thoughts?
Yes please. Let's update the docs and close; if nothing is obviously wrong, then keeping this open without being able to reproduce is rather pointless. Thanks @mrcnski!