
nvidia mismatch recovery #915

Merged · 11 commits into master · May 11, 2023

Conversation

claraberendsen
Contributor

@claraberendsen claraberendsen commented Apr 25, 2023

DESCRIPTION

Since moving the nvidia agents to cloud instances, we have encountered versioning issues resulting in nvml version mismatch errors (see issue). This is an unrecoverable error for the builds, and it requires manual steps to bring the agent back to a functional state.
This PR automates those steps as postbuild actions. If the build failed due to the Nvidia error, we do the following steps:

  1. Remove all labels from the current agent and add a recovery-process label to indicate that the agent is performing recovery actions, preventing it from taking on any other build.
  2. Requeue the job with the same parameters and a delay of 70 s, to give time for the next step.
  3. Schedule a system restart in 1 minute (the delay is needed here so that the postbuild action finishes correctly).
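The three steps above could be sketched as a Groovy Postbuild action roughly like this (a hedged sketch, not the PR's actual code: `manager` is the Groovy Postbuild plugin's script binding, and `runOnAgent` is a hypothetical helper standing in for however the shutdown command reaches the agent):

```groovy
// Hedged sketch of the recovery postbuild action. Everything except the
// "recovery-process" label name and the delays is illustrative.
def node = manager.build.getBuiltOn()
def oldLabels = node.getLabelString()

// 1. Swap all labels for a single recovery label so the agent
//    cannot pick up any other build while it is being recovered.
node.setLabelString('recovery-process')

// 2. Requeue the same job with its original parameters, with a
//    70 s quiet period to leave time for the restart to be scheduled.
def params = manager.build.getAction(hudson.model.ParametersAction.class)
manager.build.getProject().scheduleBuild2(70, new hudson.model.Cause.UserIdCause(), params)

// 3. Schedule a reboot on the agent in one minute, so this postbuild
//    action can finish before the connection drops.
runOnAgent(node, 'sudo shutdown -r +1')   // hypothetical helper
```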

TESTS

  • Post build actions only run on failed builds Build Status
  • Post build actions only run on nvidia agents Build Status
  • Steps run only if regex match Build Status
  • Adds dummy, non-actionable labels to the agent so that no jobs are built between the current build and the restart

- [x] Agent is put back online with correct labels after restart.

@claraberendsen
Contributor Author

Build that succeeded after restart
Build Status
Build that restarted the agent
Build Status

WIP

  • Job fails with agent disconnection (will add a sleep so the build can finish).
  • Rebuild job, code commented. Need to test it further.

@claraberendsen
Contributor Author

For documentation purposes: to test the log output, I'm manually adding to the test job _test_job_from_dsl a build step that echoes the text I'm looking for.
I'll open a PR to add this as a parameter, so we have a test job into which we can inject any necessary text into the logs. It would also be possible to add another parameter to force a STATUS for the build (e.g. force the build to fail).
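The parameterized test job described here could look roughly like the following Job DSL sketch (the parameter name `LOG_TEXT` is a placeholder I'm assuming, not part of the PR):

```groovy
// Hypothetical Job DSL sketch of the test-job idea: a string parameter
// whose value is echoed into the build log, so any text (e.g. the nvml
// mismatch message) can be injected for testing the postbuild regex.
job('_test_job_from_dsl') {
    parameters {
        stringParam('LOG_TEXT', '', 'Text to echo into the build log')
    }
    steps {
        shell('echo "${LOG_TEXT}"')
    }
}
```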

@claraberendsen claraberendsen force-pushed the feature/automization-fix-nvidia-version-mismatch branch from 58bb1d0 to 7b8b6dd Compare May 4, 2023 21:17
@claraberendsen claraberendsen marked this pull request as ready for review May 4, 2023 21:19
@claraberendsen claraberendsen requested a review from j-rivero as a code owner May 4, 2023 21:19
@claraberendsen
Contributor Author

@j-rivero would appreciate a review!

Contributor

@j-rivero j-rivero left a comment


While reviewing the PR, in the section passing the parameters to rebuild the failed build, I looked into the naginator plugin that we have in the buildfarm, which usually does a really good job on unsuccessful builds: you just press "Retry" in the GUI. The plugin claims to support "Only rebuild the job if the build's log output contains a given regular expression", which seems to me like the same thing we are implementing with custom code here. The GUI is:

[screenshot: Naginator plugin retry options in the job configuration GUI]

The custom implementation here has some advantages, like restricting it to only the relevant GPU nodes on Linux, but using a plugin can be more reliable than maintaining and debugging custom Groovy code. Before I continue with the review, what do you think @claraberendsen: should we keep going with this PR, or move it to use the retry plugin features?

@nuclearsandwich
Contributor

Remove all labels from current agent and add a recovery-process label to indicate that that agent is performing recovery actions, preventing it from taking on any other build.

As this PR proceeds I want to raise an alternative tactic we could take and why I think the current action is the preferred one. Instead of dropping all other labels, we could add the recovery-process label and maintain the others. However, doing that without accounting for it everywhere would mean that singleton labels applied to the node being recovered would still show up as present while the node is no longer running jobs. I was slightly fearful of destructive edits to labels but I think it's actually the best thing to do with autorecovery scripts. 👍🏼

@claraberendsen
Contributor Author

claraberendsen commented May 8, 2023

As this PR proceeds I want to raise an alternative tactic we could take and why I think the current action is the preferred one. Instead of dropping all other labels, we could add the recovery-process label and maintain the others. However, doing that without accounting for it everywhere would mean that singleton labels applied to the node being recovered would still show up as present while the node is no longer running jobs. I was slightly fearful of destructive edits to labels but I think it's actually the best thing to do with autorecovery scripts. 👍🏼

I also want to point out that the script is resilient to any failure in the steps subsequent to removing all labels. If for any reason the rebuilding of the job fails, the catch will add the old labels back to the node and report that we couldn't recover. This gives some atomicity to these "sudo" actions we are performing from the Jenkins master. The only edge case is if Jenkins fails before the shutdown command is scheduled; then we may find our agent with the recovery-process label and have to restart it manually.
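The restore-on-failure behaviour described here could be sketched like this (illustrative only, not the PR's exact code; the two helper calls are hypothetical stand-ins for the requeue and restart steps):

```groovy
// Sketch of the atomicity guard: if any recovery step throws, the
// original labels are restored and the failure is reported.
def oldLabels = node.getLabelString()
try {
    node.setLabelString('recovery-process')
    requeueBuildWithSameParameters()   // hypothetical helper
    scheduleAgentRestart()             // hypothetical helper
} catch (Exception e) {
    node.setLabelString(oldLabels)     // undo the destructive label edit
    manager.listener.logger.println("Could not recover agent: ${e.message}")
}
```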

@claraberendsen
Contributor Author

claraberendsen commented May 8, 2023

The custom implementation here has some advantages, like restricting it to only the relevant GPU nodes on Linux, but using a plugin can be more reliable than maintaining and debugging custom Groovy code. Before I continue with the review, what do you think @claraberendsen: should we keep going with this PR, or move it to use the retry plugin features?

I tried the retry plugin and found this particular issue:

  • I could not find the DSL docs for the "only re-run if the expression is found" part. These are the docs I was using as a reference.

@j-rivero If you know of a way to use that conditional in the DSL, I would prefer that strategy over the current one for maintainability. On restricting the run to GPU nodes: with the naginator plugin we won't have complete certainty that it runs only on gpu-nvidia nodes, but we don't expect to encounter this log in a normal workflow on any other node. The only case I can see right now is someone adding that log to a build with intent, which would force a rebuild of the job, and we couldn't stop that.

@claraberendsen claraberendsen self-assigned this May 8, 2023
@claraberendsen claraberendsen requested a review from j-rivero May 9, 2023 14:53
@j-rivero
Contributor

j-rivero commented May 9, 2023

I tried the retry plugin and found this particular issue:
I could not find the DSL docs for the "only re-run if the expression is found" part. These are the docs I was using as a reference.
@j-rivero If you know of a way to use that conditional in the DSL, I would prefer that strategy over the current one for maintainability.

It looks like this particular feature, or that part of the API, has not been ported to the DSL. There is a way I used in the past to inject XML almost directly from the DSL: it is called the configure block. You can see it in action, for example, in part of the OSRFWinCompilation class.
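A configure block injecting the Naginator publisher XML could look roughly like this (a sketch only: the element names are taken from how the plugin typically serializes to config.xml and should be checked against a real job's XML before use):

```groovy
// Sketch: inject Naginator's regex-based retry directly as XML,
// since this option is not exposed in the Job DSL API.
job('example-job') {
    configure { project ->
        project / 'publishers' << 'com.chikli.hudson.plugin.naginator.NaginatorPublisher' {
            regexpForRerun('Failed to initialize NVML: Driver/library version mismatch')
            checkRegexp(true)      // only rerun when the regex matches the log
            rerunIfUnstable(false)
            maxSchedule(1)         // retry at most once
        }
    }
}
```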

@claraberendsen
Contributor Author

claraberendsen commented May 9, 2023

Post review testing

The testing is limited to the areas modified in the review: rebuild a job when a log is matched and regex matching on postbuild actions.

Tests

  • Script executes on agent with gpu-nvidia label #86
  • Script doesn't execute on agent with no-gpu-nvidia label #87
  • Script doesn't execute on agent with no gpu-nvidia label #88
  • Rebuilds job with same parameters only once when error found in log: Failing job #89 -> Rebuilt #90

@claraberendsen claraberendsen requested a review from j-rivero May 11, 2023 14:06
Contributor

@j-rivero j-rivero left a comment


Good to go. Great work @claraberendsen !

@j-rivero j-rivero merged commit 9fbfe60 into master May 11, 2023
@j-rivero j-rivero deleted the feature/automization-fix-nvidia-version-mismatch branch May 11, 2023 14:55
j-rivero pushed a commit that referenced this pull request Jun 28, 2023
If the build failed due to the Nvidia error, we do the following steps:

 *  Remove all labels from the current agent and add a recovery-process label to indicate that the agent is performing recovery actions, preventing it from taking on any other build.
 *  Requeue the job with the same parameters and a delay of 70 s, to give time for the next step.
 *  Schedule a system restart in 1 minute (the delay is needed here so that the postbuild action finishes correctly).