Check spawned worker version vs node version before PVF preparation #6861
I think you can send a signal to the overseer to let it shut down?
It's the PVF host; it isn't a subsystem by itself, and it's not aware of the overseer's existence 🙄
To handle it the right way, the error should be propagated all the way up to the candidate validation and PVF pre-check subsystems and handled by them separately. Do you think it's worth the effort?
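A minimal sketch of what that propagation could look like, assuming a hypothetical error enum on the host side (the variant name and plumbing here are illustrative, not the actual polkadot PVF host API):

```rust
// Hypothetical error type on the PVF host side; names are illustrative only.
#[derive(Debug)]
enum PrepareError {
    TimedOut,
    /// The spawned worker reported a version different from the node's.
    VersionMismatch { node_version: String, worker_version: String },
}

// Instead of exiting inside the host, the error would be returned through the
// existing result channel and handled by the subsystems that requested the work.
fn check_worker_version(node_version: &str, worker_version: &str) -> Result<(), PrepareError> {
    if node_version != worker_version {
        return Err(PrepareError::VersionMismatch {
            node_version: node_version.to_string(),
            worker_version: worker_version.to_string(),
        });
    }
    Ok(())
}
```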
You shouldn't just kill the process here; you should bubble up this error and let the host kill the worker properly. Look at how the host handles e.g. `TimedOut` for an example. Also, I'm pretty sure this code is running on the host, so `exit` would kill the node, not the worker, which I think is not what we want? (Haven't followed the disputes issue that closely over the weekend.)
Yes, it's exactly what we want. At this point, we've found out that the node owner screwed up the node upgrade process: we're still running the old node software, but the binary is already new, and we're trying to execute a new worker from the old node. In that case, we want to tear the whole node down and let the node owner handle the restart.
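A rough sketch of the intended behaviour, assuming the mismatch has already been detected (the message wording and exit code are illustrative, not the actual implementation):

```rust
// Tear the whole node down on a node/worker version mismatch and let the
// operator restart it with the new binary.
fn on_version_mismatch(node_version: &str, worker_version: &str) -> ! {
    eprintln!(
        "Node and worker versions differ (node: {node_version}, worker: {worker_version}); \
         this usually means the binary was replaced under a running node. Shutting down."
    );
    std::process::exit(1)
}
```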
Oh, got it. Is any clean-up of the workers necessary, or do they get killed automatically as child processes?
Yes, for sure we should kill the workers; they don't do anything important that could get lost when we kill them.
This should be a separate PR.
Good point! But I've reviewed the code, and it looks like @pepyakin thought this through before us. `start_work()` on the host side provides the worker with a temporary path where it should store the artifact. If the worker reports a positive result, then in `handle_result()`, again on the host side, the artifact gets renamed to its real `artifact_id` path. If the host dies between `start_work()` and `handle_result()`, the only leftover is a stale tmp file, which will be pruned on the next node restart.

It's still a good idea to kill preparation workers on version mismatch.
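A small sketch of that tmp-then-rename pattern; the function names here are made up for illustration and do not mirror the actual PVF host code:

```rust
use std::fs;
use std::path::Path;

// The worker writes the artifact to a temporary path; only a successful result
// promotes it to its final location.
fn promote_artifact(tmp_path: &Path, final_path: &Path) -> std::io::Result<()> {
    // Renaming within the same filesystem is atomic, so readers either see the
    // complete artifact at `final_path` or nothing at all.
    fs::rename(tmp_path, final_path)
}

// On startup, any leftover `*.tmp` file is from a run that died between
// start_work() and handle_result(); it is safe to delete.
fn prune_stale_tmp_files(cache_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(cache_dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "tmp") {
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}
```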
Hmm... thinking about this some more: since we delete the cache anyway at each startup, couldn't we alternatively just hash a random number into the filename? I don't remember if we already do this, but assuming we write the cache atomically to disk (that is, save the cache to a file with e.g. a `.tmp` extension or into another directory, and then rename it to where we want it; writing is not atomic, but renaming always is), we'd be essentially guaranteed that we won't load a stale file. This would also allow us to just forcibly kill the worker on exit without having to wait for it to shut down.
@koute you mean some random number we generate on node startup and use throughout the node instance lifetime? I think it's a good approach too, but your first idea about including the node version number in the artifact name is somewhat more... deterministic, I'd say? :)
Yeah, just generate a random number and then hash that in. Assuming the number is actually random and the hash is cryptographic (e.g. BLAKE2), we'll have a guarantee that no two separate runs will try to load the same files.
Well, sure, but considering that we don't want to reuse those cache files after the node restarts, what's the point of them having stable filenames across restarts? Isn't that just unnecessary complexity? (:
(Unless we do want to reuse them after a restart, in which case carry on.)
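A small sketch of the per-run token idea, using the `blake2` and `rand` crates; the naming scheme and field choices are assumptions for illustration, not the actual artifact naming code:

```rust
use blake2::{Blake2b512, Digest};
use rand::RngCore;

/// Derive an artifact filename from the code hash plus a token generated once
/// per node run, so two separate runs can never produce (or load) the same name.
fn artifact_file_name(code_hash: &[u8], per_run_token: &[u8; 32]) -> String {
    let mut hasher = Blake2b512::new();
    hasher.update(per_run_token);
    hasher.update(code_hash);
    let digest = hasher.finalize();
    // Hex-encode the first 16 bytes of the digest for a compact name.
    digest.iter().take(16).map(|b| format!("{b:02x}")).collect()
}

fn main() {
    // The token is random per node instance and kept in memory for its lifetime.
    let mut per_run_token = [0u8; 32];
    rand::thread_rng().fill_bytes(&mut per_run_token);

    let name = artifact_file_name(b"wasm-code-blob", &per_run_token);
    println!("artifact_{name}.pvf");
}
```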