-
Notifications
You must be signed in to change notification settings - Fork 798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: re-preparing artifact on failed runtime construction #3187
PVF: re-preparing artifact on failed runtime construction #3187
Conversation
@s0me0ne-unkn0wn before I continue, could you please have a look? Just to confirm whether it is a right direction |
Looks good, thank you! TBH, I had a feeling that it would require a more complex design, but it looks like you managed to keep it simple. Considering the complexity of PVF host internals, we could definitely use a test to check the behavior is correct, but probably creating it is not as easy as it sounds 🤔 |
@s0me0ne-unkn0wn hi!:) I believe I was wrong on the ordering of the reply to result channel of ValidationBackend and the artifact removal in case of RuntimeConstruction. They are concurrent but as the artifact name is random it's not a race issue (and in practice, the artifact is removed first in majority of cases - I did some tests). |
Still struggling to find time to review it myself and run some local tests, maybe other guys will be faster than me. Overall, looks very good! I just need to dive a bit deeper into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good! I've done some local tests, and it works even better than I expected; it catches some cases of on-disk artifact corruption that were only supposed to be addressed by checking hashes.
Left some nits, comments, and questions below, but none are blockers.
/// A possibly transient runtime instantion error happend during the execution; maybe retried | ||
/// with preparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// A possibly transient runtime instantion error happend during the execution; maybe retried | |
/// with preparation | |
/// A possibly transient runtime instantiation error happened during the execution; may be retried | |
/// with re-preparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad :) thank you! Fixed at a10f16d
polkadot/node/core/pvf/src/host.rs
Outdated
) -> Result<(), Fatal> { | ||
let (artifact_id, path) = artifacts | ||
.remove(artifact_id) | ||
.expect("artifact sent by the execute queue for removal exists; qed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessarily true? Could the pruning procedure remove it in the meantime? I'm just discoursing on it now, not asserting. Probably, it's not a valid concern, as the execution should mark the artifact as recently used, and pruning shouldn't touch it. But how bad could it be if we silently ignored the fact that we tried to remove an artifact that didn't even exist in the cache? Sounds safe to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it is safe and there is no need for qed assertion :). Fixed at a10f16d
@@ -238,6 +238,14 @@ impl Artifacts { | |||
.is_none()); | |||
} | |||
|
|||
/// Remove artifact by its id. | |||
pub fn remove(&mut self, artifact_id: ArtifactId) -> Option<(ArtifactId, PathBuf)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I get it right, we don't currently remove anything from disk, either here or in the unused artifact pruning procedure? Are they always removed from the memory cache table only, and the disk is cleaned up only on node startup? It's totally okay for now, but we shouldn't forget that if we ever decide to re-enable artifact persistence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it seems like I indeed missed the moment when artifact names became really random, so right now, it's not a concern at all. Never mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the logic of remove
is the same as with pruning, i.e. it affects only the cache
@s0me0ne-unkn0wn thank you for the catch with artifacts' stale cache and a potential panic because of it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, thank you!
loop { | ||
// Stop retrying if we exceeded the timeout. | ||
if total_time_start.elapsed() + retry_delay > exec_timeout { | ||
break | ||
} | ||
|
||
let mut wait_retry_delay = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about retry_immediately
? I would appreciate a brief comment on why we need it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thank you! Fixed d722ccf
4261366
…head-data * origin/master: Fix call enum's metadata regression (#3513) Enable elastic scaling node feature in local testnets genesis (#3509) update development setup in sdk-docs (#3506) Fix accidental no-shows on node restart (#3277) Remove `AssignmentProviderConfig` and use parameters from `HostConfiguration` instead (#3181) [Deprecation] Remove sp_weights::OldWeight (#3491) Fixup multi-collator parachain transition to async backing (#3510) Multi-Block-Migrations, `poll` hook and new System callbacks (#1781) Snowbridge - Extract Ethereum Chain ID (#3501) PVF: re-preparing artifact on failed runtime construction (#3187) Add documentation around FRAME Offchain workers (#3463) [prdoc] Optional SemVer bumps and Docs (#3441) rpc-v2/tx/tests: Add transaction broadcast tests and check propagated tx status (#3193)
…data * ao-collator-parent-head-data: Fix call enum's metadata regression (#3513) Enable elastic scaling node feature in local testnets genesis (#3509) update development setup in sdk-docs (#3506) Fix accidental no-shows on node restart (#3277) Remove `AssignmentProviderConfig` and use parameters from `HostConfiguration` instead (#3181) [Deprecation] Remove sp_weights::OldWeight (#3491) Fixup multi-collator parachain transition to async backing (#3510) Multi-Block-Migrations, `poll` hook and new System callbacks (#1781) Snowbridge - Extract Ethereum Chain ID (#3501) PVF: re-preparing artifact on failed runtime construction (#3187) Add documentation around FRAME Offchain workers (#3463) [prdoc] Optional SemVer bumps and Docs (#3441) rpc-v2/tx/tests: Add transaction broadcast tests and check propagated tx status (#3193)
resolve #3139
execute_artifact
RuntimeConstruction
error during the executionvalidate_candidate_with_retry
ofValidationBackend
with the case of retriableRuntimeConstruction
error during the execution