If your deployment succeeds but your jobs fail with the following error:
```text
$ srun -N 6 -p compute hostname
srun: PrologSlurmctld failed, job killed
srun: Force Terminated job 2
srun: error: Job allocation 2 has been revoked
```
Possible causes include insufficient quota, insufficient capacity, placement group failures, or insufficient permissions for the service account attached to the controller. Also see the Slurm user guide.
It may be that you have sufficient quota to deploy your cluster but insufficient quota to bring up the compute nodes.
You can confirm this by SSHing into the controller VM and checking the `resume.log` file:
```text
$ cat /var/log/slurm/resume.log
...
resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.". Details: "[{'message': "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">
```
The solution here is to request more of the specified quota, `C2 CPUs` in the example above. Alternatively, you could switch the partition's machine type to one for which you have sufficient quota.
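Before filing a quota increase request, you can check the current limit and usage for the quota named in the error. A quick sketch with `gcloud`, where the region and quota metric are taken from the example error above and should be replaced with your own:

```shell
# Show the C2_CPUS quota limit and current usage for the region in the error.
gcloud compute regions describe europe-west4 --format=json \
  | grep -B2 -A2 '"C2_CPUS"'
```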
It may be that the zone the partition is deployed in has no remaining capacity to create the compute nodes required to run your submitted job.
Check the `resume.log` file for possible errors by SSHing into the controller VM and running the following:

```shell
sudo cat /var/log/slurm/resume.log
```
One example of an error message which appears in `resume.log` due to insufficient capacity is:

```text
bulkInsert operation errors: VM_MIN_COUNT_NOT_REACHED
```
When this happens, the output of `sacct` will show the job's status as `NODE_FAIL`. Jobs submitted via `srun` will not be requeued; however, jobs submitted via `sbatch` will be requeued.
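For example, you can confirm the failed state by querying `sacct` for the job; the job and output shown below are illustrative:

```shell
$ sacct -X --format=JobID,JobName,Partition,State,ExitCode
JobID           JobName  Partition      State ExitCode
------------ ---------- ---------- ---------- --------
2              hostname    compute  NODE_FAIL      0:0
```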
By default, placement groups (also called affinity groups) are enabled on the compute partition. This places VMs close to each other to achieve lower network latency. If it is not possible to provide the requested number of VMs in the same placement group, the job may fail to run.
Again, you can confirm this by SSHing into the controller VM and checking the `resume.log` file:
```text
$ cat /var/log/slurm/resume.log
...
resume.py ERROR: group operation failed: Requested minimum count of 6 VMs could not be created.
```
One way to resolve this is to set `enable_placement` to `false` on the partition in question.
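For example, in a blueprint the partition definition might look like the following. This is a minimal sketch: the module `id`, `source`, and surrounding settings are illustrative and assume a partition module that exposes `enable_placement` (such as the v5 partition module); adapt it to the modules in your own blueprint.

```yaml
  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [network1]
    settings:
      partition_name: compute
      enable_placement: false  # do not place these VMs in a placement group
```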
If VMs get stuck in `status: staging` when using the `vm-instance` module with placement enabled, it may be because you need to allow Terraform to make more concurrent requests. See this note in the `vm-instance` README.
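The README note linked above has the authoritative guidance; one common way to allow more concurrent requests is Terraform's `-parallelism` flag (the value below is arbitrary):

```shell
terraform apply -parallelism=32
```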
By default, the Slurm controller, login and compute nodes use the Google Compute Engine Service Account (GCE SA). If this service account or a custom SA used by the Slurm modules does not have sufficient permissions, configuring the controller or running a job in Slurm may fail.
If configuration of the Slurm controller fails, the error can be seen by viewing the startup script output on the controller:
```shell
sudo journalctl -u google-startup-scripts.service | less
```
An error similar to the following indicates missing permissions for the service account:
```text
Required 'compute.machineTypes.get' permission for ...
```
To solve this error, ensure your service account has the `compute.instanceAdmin.v1` IAM role:
```shell
SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
```
If Slurm failed to run a job, view the resume log on the controller instance with the following command:
```shell
sudo cat /var/log/slurm/resume.log
```
An error in `resume.log` similar to the following also indicates a permissions issue:
```text
The user does not have access to service account '[email protected]'. User: ''. Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']
```
As indicated, the service account must have the `iam.serviceAccountUser` IAM role. This can be set with the following command:
```shell
SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
```
If the GCE SA is being used and cannot be updated, a new service account can be created and used with the correct permissions. Instructions for how to do this can be found in the Slurm on Google Cloud User Guide, specifically the section titled "Create Service Accounts".
After creating the service account, it can be set via the `compute_node_service_account` and `controller_service_account` settings on the slurm-on-gcp controller module and the `login_service_account` setting on the slurm-on-gcp login module.
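As a rough sketch, creating and granting roles to such a service account with `gcloud` might look like the following; the account name is illustrative and the exact role list should come from the user guide:

```shell
# Create a dedicated service account (the name is illustrative).
gcloud iam service-accounts create slurm-cluster-sa \
    --display-name="Slurm cluster service account"

# Grant the roles discussed above; adjust the list to match the user guide.
SA_ADDRESS=slurm-cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com
for role in roles/compute.instanceAdmin.v1 roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding ${PROJECT_ID} \
      --member=serviceAccount:${SA_ADDRESS} --role=${role}
done
```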
If you observe failure of startup scripts in version 6 of the Slurm module, it may be due to a 300 second maximum timeout on scripts. All startup script logging is found in `/slurm/scripts/setup.log` on every node in a Slurm cluster.
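For example, to inspect the end of the log on a node:

```shell
sudo tail -n 100 /slurm/scripts/setup.log
```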
The error will appear similar to:
```text
2022-01-01 00:00:00,000 setup.py DEBUG: custom scripts to run: /slurm/custom_scripts/(login_r3qmskc0.d/ghpc_startup.sh)
2022-01-01 00:00:00,000 setup.py INFO: running script ghpc_startup.sh
2022-01-01 00:00:00,000 util DEBUG: run: /slurm/custom_scripts/login_r3qmskc0.d/ghpc_startup.sh
2022-01-01 00:00:00,000 setup.py ERROR: TimeoutExpired:
    command=/slurm/custom_scripts/login_r3qmskc0.d/ghpc_startup.sh
    timeout=300
    stdout:
```
We anticipate that this limit will be made configurable in future releases of the Slurm module; however, we recommend using a dedicated build VM where possible to execute scripts of significant duration. This pattern is demonstrated in the AMD-optimized Slurm cluster example.
Example error in `/slurm/scripts/setup.log` (on a Slurm V6 controller):

```text
exportfs: /****** does not support NFS export
```
This can be caused when you are mounting a Filestore instance that uses the same name for `local_mount` and `filestore_share_name`. For example:
```yaml
  - id: samesharefs  # fails to exportfs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      filestore_share_name: same
      local_mount: /same
```
This is a known issue; the recommended workaround is to use different names for `local_mount` and `filestore_share_name`, as in the sketch below.
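For instance, the same module with a share name that differs from the local mount (the names here are illustrative):

```yaml
  - id: homefs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      filestore_share_name: nfsshare  # differs from the local mount below
      local_mount: /home
```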
Using the `enable_reconfigure` setting with Slurm v6 modules uses `local-exec` provisioners to perform additional cluster configuration. Some common issues experienced when using this feature are missing local Python requirements and an incorrectly configured gcloud CLI. There is more information about these issues and their fixes in the `schedmd-slurm-gcp-v6-controller` documentation.
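As a quick local sanity check before re-running the deployment, you can confirm that the gcloud CLI is authenticated and pointed at the intended project, and install the module's local Python requirements. The requirements file path below is illustrative; check the module documentation for the exact location:

```shell
# Confirm the active account and project for the gcloud CLI.
gcloud auth list
gcloud config list

# Install the local Python requirements (path is illustrative; see the
# schedmd-slurm-gcp-v6-controller documentation for the exact file).
pip3 install -r community/modules/scheduler/schedmd-slurm-gcp-v6-controller/requirements.txt
```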