-
Notifications
You must be signed in to change notification settings - Fork 799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: Incorporate wasmtime version in worker version checks #2742
Conversation
This was there for the single-binary puppet workers, but as those have been retired we can now simplify this.
@@ -100,24 +100,6 @@ pub fn build_workers_and_get_paths() -> (PathBuf, PathBuf) { | |||
let mut execute_worker_path = workers_path.clone(); | |||
execute_worker_path.push(EXECUTE_BINARY_NAME); | |||
|
|||
// explain why a build happens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed this because the explanation is only ever shown when -- --nocapture
is passed in, and even then it does not explain a rebuild most of the time. This code would have needed to be updated, and I didn't feel like updating un-useful code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Two concerns:
- Do we still need it at all if we're falling back to pruning all the artifacts?
- Does the new artifact name always fit into
MAX_PATH
?
Thanks for the review!
Yes, actually this PR is still useful to make sure that we're running a version of the workers with the intended wasmtime version. We don't enforce constant rebuilds of the workers during development as it would be bad DevEx, but we should at least make sure that the wasmtime version between polkadot, prepare-worker, and execute-worker is always consistent, erroring if not. I don't think we have the same concerns here as in the other PR, namely CPU features, because I believe wasmtime/cranelift detect those at runtime.
This PR doesn't change the artifact naming scheme (except to make sure that the wasmtime version portion is correct), but that will anyway be superseded by #2895. I did some reading on In any case, what would happen is, we get an error after preparation when renaming the temp compiled artifact to the final one. This would not result in a dispute, so I don't think we need to specifically handle this (which based on reading the above and here, seems full of pitfalls: "Pathnames are very evil, insecure and path_max is a lie and not even a constant (it might be different on different OS functions)."). |
Considering the complexity of #2871 and the discussion therein, as well as the further complexity introduced by the hardening in #2742, as well as the eventual replacement of wasmtime by PolkaVM, it seems best to remove this persistence as it is creating more problems than it solves. ## Related Closes #2863
To finish this we'd need to add back the wasmtime version which was removed (here). However, we should do it a different way than calling |
Is this really a common issue that has happened once? We also don't bump |
@bkchr Perhaps you're right. I thought this was a possible vector for disputes - but maybe fixing this is not worth the added complexity. I can close if we don't want this, we can always re-open if it ever becomes an issue. |
On production we are guarded by the worker version check and that we delete the cache now on startup. So, we should not be able to run into any kind of problem with this? |
@bkchr Since #1495 the worker version check only accounts for the node version. My full concern from the related issue is:
Seems like a viable concern to me (I've talked to a validator who was running on |
In general, wouldn't it fail on loading the wasm file from the disk if the wasmtime version changed? I mean this shouldn't be a reason for a dispute? |
Currently, it does not hold AFAICT. There was an intent in #661 to solve that in a more straightforward way, but it has drowned in discussions, although, TBH, I think that way is superior to constantly increasing the number of checks over artifacts, workers, node version, and everything. Also, I'm not sure if we should put a lot of effort into it right now, considering the transition to PolkaVM. |
I mean that we can create a dispute because of a faulty disc, sounds really bad to me. (I didn't jumped into the issue and read the discussion) Should it be really that complicated to make at least some simple distinction between actual execution errors and other errors? |
Yes, I do agree that's no good. The complexity of implementation highly depends on what we're considering to be a transient error (and that's what all those discussions are about). The least complex approach discussed is like this: considering we check for successful runtime construction during PVF pre-checking, we can consider failed runtime construction during execution a transient problem (corrupted artifact, wrong Wastime version, etc.), and in that case, we could try to re-prepare the artifact from scratch and try to execute it again. That sounds like a solid strategy to me. However, right now, after shifting our priorities for 2024, we lack the capacity to implement that. СС @eskimor in case we want to push it. Also, maybe some of our external contributors will want to work on it? CC @eagr @jpserrat @maksimryndin |
@s0me0ne-unkn0wn I can try to tackle it but I need some time to understand all the discussions :) |
@maksimryndin don't hesitate to ask questions, and also jump on call/chat with me to discuss! I'll try to formulate a separate issue for public discussion later today. |
Thanks @mrcnski for interesting insights on PATH_MAX! |
@s0me0ne-unkn0wn @bkchr @jpserrat @eagr Let me summarize what I've understood so far :) Problem Initial solution More general solution (after discussions in this and related PRs, merge of #2895) for a more general problem of raising disputes for transient local errors I believe the following is also related Now I am going to understand the logic of workers interaction more |
@maksimryndin I've just raised #3139 with a hopefully exhaustive description of my vision of the problem. Feel free to ask questions there! |
Considering the complexity of paritytech#2871 and the discussion therein, as well as the further complexity introduced by the hardening in paritytech#2742, as well as the eventual replacement of wasmtime by PolkaVM, it seems best to remove this persistence as it is creating more problems than it solves. ## Related Closes paritytech#2863
Closing as superseded by #3187 |
Hardens the worker version check.
Let me know if anything is not clear.
Related
Closes #2398