nvidia mismatch recovery #915
For documentation purposes to test the log output, I'm manually adding to the test job |
58bb1d0 to 7b8b6dd
@j-rivero would appreciate a review!
While reviewing the PR, in the section that passes the parameters to rebuild the failed build, I looked into the naginator plugin that we have in the buildfarm, which usually does really good work on unsuccessful builds by just pressing "Retry" in the GUI. The plugin claims to support: "Only rebuild the job if the build's log output contains a given regular expression", which seems to me like it could be the same thing that we are implementing with custom code here. The GUI is:
The custom implementation here has some advantages, like restricting it to only the relevant gpu nodes on Linux, but using a plugin can be more reliable than maintaining and debugging custom groovy code. Before I continue with the review, what do you think @claraberendsen: can we keep going with this PR, or should we move it to use the retry plugin features?
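To make the comparison concrete, here is a minimal Python sketch of the decision both approaches boil down to: rebuild only when the build log matches a regular expression, with the custom approach additionally gating on the node type. The pattern below is an assumption (the usual NVML mismatch message); the exact regex used in this PR is not shown in the conversation, and `should_rebuild` and `is_gpu_node` are hypothetical names.

```python
import re

# Assumed pattern: the PR only mentions "nvml version mismatch errors";
# the exact expression used by the job is not shown here.
NVML_MISMATCH = re.compile(
    r"Failed to initialize NVML: Driver/library version mismatch"
)


def should_rebuild(log_text: str, is_gpu_node: bool) -> bool:
    """Return True only if the log matches AND the build ran on a GPU node.

    A plain log-regex retry (plugin style) would correspond to calling this
    with is_gpu_node=True for every build.
    """
    return is_gpu_node and NVML_MISMATCH.search(log_text) is not None
```

The extra `is_gpu_node` check is what the custom code buys over a plain log match: a build on another node that happens to print the same line would not be retried.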
As this PR proceeds I want to raise an alternative tactic we could take and why I think the current action is the preferred one. Instead of dropping all other labels, we could add the |
I also want to point out that the script is resilient to any failure in the steps subsequent to removing all labels. If for any reason the rebuilding of the job fails, the catch will add the old labels back to the node and report that we couldn't recover. This is to have some atomicity on these "sudo" actions we are performing from the Jenkins master. The only edge case is if Jenkins fails before the shutdown command is scheduled: we may then find our agent with the recovery-process label and have to manually restart it.
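The rollback behaviour described above can be sketched roughly as follows. This is a Python illustration, not the Groovy code from the PR: `Agent`, `recover`, `requeue` and `schedule_restart` are hypothetical names, and the agent is modelled as nothing more than a list of labels.

```python
from typing import Callable, List

RECOVERY_LABEL = "recovery-process"


class Agent:
    """Hypothetical stand-in for a Jenkins agent; it only tracks labels."""

    def __init__(self, labels: List[str]) -> None:
        self.labels = list(labels)


def recover(
    agent: Agent,
    requeue: Callable[[], None],
    schedule_restart: Callable[[], None],
) -> bool:
    """Swap labels, then requeue and schedule the restart.

    If any step after the label swap raises, the old labels are restored
    and False is returned so the caller can report the failed recovery.
    """
    old_labels = list(agent.labels)
    agent.labels = [RECOVERY_LABEL]
    try:
        requeue()
        schedule_restart()
    except Exception:
        agent.labels = old_labels  # roll back so the agent is usable again
        return False
    return True
```

The edge case mentioned in the comment is not covered by this pattern: if the process running `recover` dies between the label swap and the restart scheduling, no `except` branch runs and the agent keeps the recovery label.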
I tried the retry plugin and I found this particular issue:
@j-rivero If you know of a way to use that conditional in the DSL, I would prefer that strategy over the current one for maintainability. On the custom run only on gpu nodes: with the naginator plugin, even though we won't have complete certainty that it will run only on gpu-nvidia nodes, we are not expecting to encounter this log in a normal workflow on any other node. The only case I can see right now is that if someone intentionally adds that log to a build, it would force the rebuild of the job and we won't be able to stop that.
Co-authored-by: Jose Luis Rivero <[email protected]>
Looks like this particular feature or part of the API has not been ported to the DSL. There is a way I used in the past to inject XML almost directly from the DSL; it is called the |
Post review testing
The testing is limited to the areas modified in the review: rebuild a job when a log is matched and regex matching on postbuild actions. Tests |
Good to go. Great work @claraberendsen !
DESCRIPTION
Since moving to cloud instances for the nvidia agents we have encountered issues with versioning resulting in nvml version mismatch errors (See issue). This is an unrecoverable error for the builds, and requires manual steps to bring the agent back to a functional state.
This PR automates those steps as postbuild actions. If the build failed due to the Nvidia error, we do the following steps:
* Remove all labels from the current agent and add a recovery-process label to indicate that the agent is performing recovery actions, preventing it from taking on any other build.
* Requeue the job with the same parameters with a delay of 70s to give time for the next step.
* Schedule a system restart in 1 minute (the delay is needed here so that the postbuild action finishes correctly).
TESTS
- regex match
- dummy not actionable labels to agent to not build jobs between current and restart
- [x] Agent is put back online with correct labels after restart.
Manual rebuild: https://build.osrfoundation.org/job/_test_job_from_dsl/80/
Automatic rebuild after recovery process: https://build.osrfoundation.org/job/_test_job_from_dsl/81/
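As a rough illustration of why the description uses a 70s requeue delay against a 1 minute restart delay, the ordering of the recovery events can be laid out as below. This is a Python sketch under the assumption that both delays are counted from the same moment (the postbuild action); the function name and event wording are hypothetical.

```python
from typing import List, Tuple

REQUEUE_DELAY_S = 70  # requeue delay stated in the description
RESTART_DELAY_S = 60  # "1 minute" restart delay stated in the description


def recovery_timeline(t0: float = 0.0) -> List[Tuple[float, str]]:
    """Return the recovery events sorted by the time they take effect."""
    events = [
        (t0, "labels swapped to recovery-process"),
        (t0 + REQUEUE_DELAY_S, "requeued build becomes eligible to run"),
        (t0 + RESTART_DELAY_S, "agent restart begins"),
    ]
    return sorted(events)
```

With these values the restart begins 10 seconds before the requeued build becomes eligible, which matches the stated intent of giving the restart step time to happen first.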