Slurm Troubleshooting

Failure to Create Auto Scale Nodes (Slurm)

If your deployment succeeds but your jobs fail with the following error:

$ srun -N 6 -p compute hostname
srun: PrologSlurmctld failed, job killed
srun: Force Terminated job 2
srun: error: Job allocation 2 has been revoked

Possible causes include insufficient quota, insufficient capacity, placement group constraints, or insufficient permissions for the service account attached to the controller. Also see the Slurm user guide.

Insufficient Quota

It may be that you have sufficient quota to deploy your cluster but insufficient quota to bring up the compute nodes.

You can confirm this by SSHing into the controller VM and checking the resume.log file:

$ cat /var/log/slurm/resume.log
...
resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.". Details: "[{'message': "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">

The solution is to request an increase in the specified quota (C2 CPUs in the example above). Alternatively, you can switch the partition's machine type to one for which you have sufficient quota.
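
To confirm the current limit and usage before requesting an increase, one possible check (using the region and quota metric from the example error above) is:

# Show the limit and current usage for the C2_CPUS quota in this region
gcloud compute regions describe europe-west4 --format=yaml | grep -B1 -A1 "metric: C2_CPUS"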

Insufficient Capacity

It may be that the zone the partition is deployed in has no remaining capacity to create the compute nodes required to run your submitted job.

Check the resume.log file for possible errors by SSHing into the controller VM and running the following:

sudo cat /var/log/slurm/resume.log

One example of an error message which appears in resume.log due to insufficient capacity is:

bulkInsert operation errors: VM_MIN_COUNT_NOT_REACHED

When this happens, the output of sacct will show the job's status as NODE_FAIL.

Jobs submitted via srun will not be requeued; however, jobs submitted via sbatch will be requeued.
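
For example, one way to spot these failures from the controller or login node is to list recent jobs and their states with sacct (the time window below is arbitrary):

# A State of NODE_FAIL indicates the compute nodes could not be created
sacct --starttime=now-2hours --format=JobID,JobName,Partition,State,ExitCode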

Placement Groups (Slurm)

By default, placement groups (also called affinity groups) are enabled on the compute partition. This places VMs close to each other to achieve lower network latency. If it is not possible to provide the requested number of VMs in the same placement group, the job may fail to run.

Again, you can confirm this by SSHing into the controller VM and checking the resume.log file:

$ cat /var/log/slurm/resume.log
...
resume.py ERROR: group operation failed: Requested minimum count of 6 VMs could not be created.

One way to resolve this is to set enable_placement to false on the partition in question.

VMs Get Stuck in Status Staging When Using Placement Groups With vm-instance

If VMs get stuck in status: staging when using the vm-instance module with placement enabled, it may be because Terraform needs to be allowed to make more concurrent requests. See this note in the vm-instance README.
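
If the note refers to Terraform's request concurrency (an assumption here, not a detail taken from the README), one way to raise it is the -parallelism flag on terraform apply, which defaults to 10:

# Allow more concurrent Terraform operations; the value 32 is only illustrative
terraform apply -parallelism=32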

Insufficient Service Account Permissions

By default, the Slurm controller, login, and compute nodes use the Google Compute Engine default service account (GCE SA). If this service account, or a custom service account used by the Slurm modules, does not have sufficient permissions, configuring the controller or running a job in Slurm may fail.

If configuration of the Slurm controller fails, the error can be seen by viewing the startup script on the controller:

sudo journalctl -u google-startup-scripts.service | less

An error similar to the following indicates missing permissions for the service account:

Required 'compute.machineTypes.get' permission for ...

To solve this error, ensure your service account has the compute.instanceAdmin.v1 IAM role:

SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
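
As an optional follow-up check, you can list the roles currently granted to the service account on the project:

# Print every role bound to the service account in this project's IAM policy
gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${SA_ADDRESS}" \
    --format="value(bindings.role)"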

If Slurm failed to run a job, view the resume log on the controller instance with the following command:

sudo cat /var/log/slurm/resume.log

An error in resume.log similar to the following indicates a permissions issue as well:

The user does not have access to service account '[email protected]'.  User: ''.  Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']

As indicated, the service account must have the iam.serviceAccountUser IAM role. This can be granted with the following command:

SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
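
Alternatively, since the error message refers to access on the service account itself, the role can be granted on the service account resource rather than on the whole project. A sketch of that narrower grant, where MEMBER is a placeholder for the user or service account that needs to act as ${SA_ADDRESS}:

# Grant iam.serviceAccountUser on the service account only (MEMBER is a placeholder)
gcloud iam service-accounts add-iam-policy-binding ${SA_ADDRESS} \
    --member="user:MEMBER" --role=roles/iam.serviceAccountUser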

If the GCE SA is being used and cannot be updated, a new service account can be created and used with the correct permissions. Instructions for how to do this can be found in the Slurm on Google Cloud User Guide, specifically the section titled "Create Service Accounts".

After creating the service account, it can be set via the compute_node_service_account and controller_service_account settings on the slurm-on-gcp controller module and the login_service_account setting on the slurm-on-gcp login module.
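
If you do create a dedicated service account, a minimal sketch of the gcloud steps (the account name below is a placeholder; see the user guide for the authoritative instructions) is:

# Create a dedicated service account for the Slurm nodes (name is a placeholder)
gcloud iam service-accounts create slurm-nodes --display-name="Slurm node service account"

# Grant the roles discussed above to the new account
SA_ADDRESS=slurm-nodes@${PROJECT_ID}.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser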

Timeout Error / Startup Script Failure (Slurm V6)

If you observe startup script failures in version 6 of the Slurm module, they may be due to a 300-second maximum timeout on scripts. All startup script logging is found in /slurm/scripts/setup.log on every node in a Slurm cluster. The error will appear similar to:

2022-01-01 00:00:00,000 setup.py DEBUG: custom scripts to run: /slurm/custom_scripts/(login_r3qmskc0.d/ghpc_startup.sh)
2022-01-01 00:00:00,000 setup.py INFO: running script ghpc_startup.sh
2022-01-01 00:00:00,000 util DEBUG: run: /slurm/custom_scripts/login_r3qmskc0.d/ghpc_startup.sh
2022-01-01 00:00:00,000 setup.py ERROR: TimeoutExpired:
    command=/slurm/custom_scripts/login_r3qmskc0.d/ghpc_startup.sh
    timeout=300
    stdout:

We anticipate that this limit will become configurable in future releases of the Slurm module; in the meantime, we recommend using a dedicated build VM where possible to execute scripts of significant duration. This pattern is demonstrated in the AMD-optimized Slurm cluster example.
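
To locate the script that timed out, one possible check on the affected node is:

# Show the TimeoutExpired error and its surrounding context in the startup log
sudo grep -B 2 -A 4 "TimeoutExpired" /slurm/scripts/setup.log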

Slurm Controller Startup Fails with exportfs Error

Example error in /slurm/scripts/setup.log (on Slurm V6 controller):

exportfs: /****** does not support NFS export

This can occur when you mount a Filestore instance that uses the same name for local_mount and filestore_share_name.

For example:

  - id: samesharefs  # fails to exportfs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      filestore_share_name: same
      local_mount: /same

This is a known issue; the recommended workaround is to use different names for local_mount and filestore_share_name.

local-exec Provisioner Error During Terraform Apply

The enable_reconfigure setting on Slurm v6 modules uses local-exec provisioners to perform additional cluster configuration. Common issues experienced when using this feature include missing local Python requirements and an incorrectly configured gcloud CLI. More information about these issues and their fixes can be found in the schedmd-slurm-gcp-v6-controller documentation.
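
A quick first pass at the two causes named above (these checks are generic, not specific to the module) is:

# Confirm the gcloud CLI is authenticated and pointed at the expected project
gcloud auth list
gcloud config get-value project

# Spot-check that the local Python environment has the Google client libraries installed
python3 -m pip list | grep -i google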