Failing submission to GPU partition in CI run #242

Open
laraPPr opened this issue Feb 18, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@laraPPr
Collaborator

laraPPr commented Feb 18, 2025

All rfm_job.sh scripts fail to be submitted by ReFrame on the GPU partitions, with the following error:

sbatch: error: Invalid generic resource (gres) specification

However, when I go to the stage directory and run the command that failed, the job gets submitted and runs. There must be something I'm missing that is set by ReFrame but is not in the job script or the launch command, and that causes the job not to be submitted. I've already investigated a little, but I can't find what the difference might be.

ReFrame version: 4.7.3 & 4.7.2
Test-suite version: 0.5.1

@laraPPr laraPPr added the bug Something isn't working label Feb 18, 2025
@boegel
Contributor

boegel commented Feb 19, 2025

@laraPPr Can you easily obtain the contents of the job script, and the exact sbatch command that is being used?

@laraPPr
Collaborator Author

laraPPr commented Feb 19, 2025

I did some further digging, and the problem can't be with the job script itself, the test suite, or ReFrame.

If I run ReFrame outside of the CI, the error does not occur, and the test suite runs fine.

So it has to be something picked up in the CI environment that causes the error, but I haven't tracked it down yet.
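
One way to narrow this down could be to dump the environment in both settings (for example `env > ci.env` inside the CI run and `env > shell.env` in the interactive session) and compare the two. The sketch below is only an illustration under those assumptions: the file names and the helpers `parse_env` and `diff_env` are hypothetical and not part of ReFrame or the test suite. It lists variables starting with `SBATCH_` or `SLURM_` first, because sbatch reads some of its options from `SBATCH_*` input environment variables, and a CI that itself runs inside a Slurm job would carry `SLURM_*` variables along.

```python
from typing import Dict, List, Tuple

# Prefixes that are most likely to influence how sbatch behaves.
SLURM_PREFIXES = ("SBATCH_", "SLURM_")


def parse_env(text: str) -> Dict[str, str]:
    """Parse the output of `env` into a dict (multi-line values are not handled)."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(ci: Dict[str, str], shell: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, ci_value, shell_value) for every variable that differs.

    A variable missing on one side is reported as "<unset>".
    Slurm-related names are sorted to the top, the rest alphabetically.
    """
    rows = []
    for name in set(ci) | set(shell):
        ci_value = ci.get(name, "<unset>")
        shell_value = shell.get(name, "<unset>")
        if ci_value != shell_value:
            rows.append((name, ci_value, shell_value))
    rows.sort(key=lambda row: (not row[0].startswith(SLURM_PREFIXES), row[0]))
    return rows


# Hypothetical usage with the two dumps mentioned above:
#   with open("ci.env") as f_ci, open("shell.env") as f_shell:
#       for name, a, b in diff_env(parse_env(f_ci.read()), parse_env(f_shell.read())):
#           print(f"{name}: CI={a!r} shell={b!r}")
```

Any variable that shows up only on the CI side would be a candidate to unset before launching ReFrame, to see whether the gres error goes away.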

@laraPPr
Collaborator Author

laraPPr commented Feb 19, 2025

> @laraPPr Can you easily obtain the contents of the job script, and the exact sbatch command that is being used?

Yes, you can find the job script in the stage directory and the run command in the logs, but the problem is not with either of those.
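
For anyone who wants to double-check the job script side, here is a minimal sketch that lists the `#SBATCH` directives in a script such as rfm_job.sh and keeps only the GPU-related ones. The helper names `sbatch_directives` and `gpu_related` are hypothetical, and filtering on the `--gres` and `--gpus` option prefixes is an assumption about which options matter for this error.

```python
from typing import List


def sbatch_directives(script_text: str) -> List[str]:
    """Return the option part of every `#SBATCH` line in a job script."""
    options = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            options.append(stripped[len("#SBATCH"):].strip())
    return options


def gpu_related(options: List[str]) -> List[str]:
    """Keep only options that request generic resources or GPUs."""
    return [opt for opt in options if opt.startswith(("--gres", "--gpus"))]


# Hypothetical usage, run from the stage directory:
#   with open("rfm_job.sh") as f:
#       print(gpu_related(sbatch_directives(f.read())))
```

If the options printed here look valid for the partition, that supports the conclusion that the script itself is not the cause.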
