PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

Open
stefan-maxar opened this issue Oct 3, 2024 · 12 comments

@stefan-maxar

Hello,

We have been testing an upgrade from PCluster 3.8.0 to 3.11.0 and, after extensive testing of our applications, noticed some differences that impact performance. We run hybrid MPI-OpenMP applications on HPC6a.48xlarge instances, and with PCluster 3.10.1 or 3.11.0 all of our applications run ~40% slower than on 3.8.0, using the out-of-the-box PCluster AMIs associated with each version. We tried to narrow down the issue by changing versions of performance-impacting software (such as downgrading the EFA installer to v1.32.0 or v1.33.0), switching how the job is submitted and run in Slurm (Hydra bootstrap with mpiexec vs. PMIv2 with srun), and making some other changes, none of which improved the degraded performance.

Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd log from each PCluster version follow:

HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when considering NUMA node as socket):

[2024-10-03T09:14:54.114] Considering each NUMA node as a socket
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries SocketsPerBoard=96:4(hw) CoresPerSocket=1:24(hw)
[2024-10-03T09:14:54.116] Considering each NUMA node as a socket
[2024-10-03T09:14:54.124] CPU frequency setting not configured for this node
[2024-10-03T09:14:54.130] slurmd version 23.02.7 started
[2024-10-03T09:14:54.168] slurmd started on Thu, 03 Oct 2024 09:14:54 +0000
[2024-10-03T09:14:54.169] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=240 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:

[2024-10-01T13:38:57.884] Considering each NUMA node as a socket
[2024-10-01T13:38:57.960] Considering each NUMA node as a socket
[2024-10-01T13:38:57.965] CPU frequency setting not configured for this node
[2024-10-01T13:38:58.142] slurmd version 23.11.7 started
[2024-10-01T13:38:58.221] slurmd started on Tue, 01 Oct 2024 13:38:58 +0000
[2024-10-01T13:38:58.221] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=123 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:

[2024-10-03T13:56:38.733] Considering each NUMA node as a socket
[2024-10-03T13:56:38.735] Considering each NUMA node as a socket
[2024-10-03T13:56:38.740] CPU frequency setting not configured for this node
[2024-10-03T13:56:39.387] pyxis: version v0.20.0
[2024-10-03T13:56:39.388] slurmd version 23.11.10 started
[2024-10-03T13:56:39.830] slurmd started on Thu, 03 Oct 2024 13:56:39 +0000
[2024-10-03T13:56:39.831] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=377 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

lscpu from a HPC6a.48xlarge instance:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             2420.130
BogoMIPS:            5299.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95

Is there some fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process/script that was run in 3.8.0 (e.g. line: [2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...) is either not being run or not running properly. We'd prefer not to hard-code the proper node configuration in the PCluster compute resource YAML, as we dynamically spin up/down clusters and could use different instance types in a given compute resource depending on resource availability.
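For anyone who wants to check this mismatch mechanically, here is a minimal sketch in Python. It parses the slurmd startup summary line shown above and compares it with the layout expected when each NUMA node is treated as a socket. The function names are hypothetical, and the expected values (4 NUMA nodes, 96 CPUs, 1 thread per core) are taken from the lscpu output in this comment rather than discovered automatically.

```python
import re

# Topology fields as printed on slurmd's startup summary line.
FIELDS = ("CPUs", "Boards", "Sockets", "Cores", "Threads")


def parse_slurmd_topology(line: str) -> dict:
    """Extract the integer KEY=VALUE topology fields from a slurmd log line."""
    found = dict(re.findall(r"\b(\w+)=(\d+)\b", line))
    return {key: int(found[key]) for key in FIELDS if key in found}


def matches_numa_layout(topo: dict, numa_nodes: int, cpus: int,
                        threads_per_core: int = 1) -> bool:
    """True if slurmd reports one socket per NUMA node with evenly split cores."""
    expected_cores = cpus // (numa_nodes * threads_per_core)
    return topo.get("Sockets") == numa_nodes and topo.get("Cores") == expected_cores


if __name__ == "__main__":
    # Summary lines copied from the 3.8.0 and 3.11.0 logs above (truncated).
    samples = {
        "3.8.0": "CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805",
        "3.11.0": "CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805",
    }
    for version, line in samples.items():
        topo = parse_slurmd_topology(line)
        ok = matches_numa_layout(topo, numa_nodes=4, cpus=96)
        print(version, topo, "matches NUMA layout" if ok else "MISMATCH")
```

Run against the lines above, this should flag the 3.10.1 and 3.11.0 summaries and accept the 3.8.0 one.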

Thanks for any help you can provide!

@hanwen-pcluste
Contributor

Hi Stefan!

Thank you for the detailed description. I could reproduce the same issue. The same logs appear in the slurmd.log on compute nodes.

I am actively working on this and will keep you updated!

Thank you,
Hanwen

@demartinofra
Contributor

demartinofra commented Oct 15, 2024

Hi Stefan,

ParallelCluster has never explicitly configured Sockets and Cores for Slurm nodes, so Slurm uses its defaults. The difference could be due to Slurm 23.11 changing the way the values for Sockets and Cores are computed. Were you able to confirm that the performance degradation is resolved after setting the expected values for Sockets and Cores in slurm.conf? I would not expect the lack of Sockets/Cores configuration to change scheduling behaviour enough to justify such a big regression.

Would you be able to extract some logs showing how processes are mapped to the various cores? Also, if you don't mind, can you share the cluster configuration and a potential reproducer?
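One possible way to collect such a mapping is sketched below in Python. Each process prints the CPUs it is allowed to run on and the NUMA nodes those CPUs belong to. The NUMA ranges are hard-coded from the lscpu output earlier in this thread, the helper names are hypothetical, and the sketch assumes it is launched once per task (for example under srun) on Linux so that SLURM_PROCID is set.

```python
import os

# NUMA node -> inclusive CPU range, copied from the lscpu output in this thread.
NUMA_CPU_RANGES = {0: (0, 23), 1: (24, 47), 2: (48, 71), 3: (72, 95)}


def numa_node_of(cpu, ranges=NUMA_CPU_RANGES):
    """Return the NUMA node that owns a CPU id, or None if it is out of range."""
    for node, (first, last) in ranges.items():
        if first <= cpu <= last:
            return node
    return None


def numa_nodes_spanned(cpus, ranges=NUMA_CPU_RANGES):
    """Return the sorted list of NUMA nodes touched by a set of CPU ids."""
    nodes = {numa_node_of(cpu, ranges) for cpu in cpus}
    nodes.discard(None)
    return sorted(nodes)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # CPUs this process may run on, as set by the launcher or Slurm's task plugin.
    allowed = sorted(os.sched_getaffinity(0))
    rank = os.environ.get("SLURM_PROCID", "?")
    host = os.environ.get("SLURMD_NODENAME", os.uname().nodename)
    print(f"rank={rank} host={host} cpus={allowed} "
          f"numa_nodes={numa_nodes_spanned(allowed)}")
```

A rank whose allowed CPUs span more than one NUMA node would be a hint that placement differs between the two clusters.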

Also, if you don't mind, could you share the full Slurm config from both clusters? You can retrieve it with scontrol write config /tmp/slurm.conf

If Sockets and Cores configuration turns out to be a red herring here is another potential issue to look into:
The Amazon Linux kernel versions [v6.1.82, v5.15.152, v5.10.213] contain mitigations for CVE-2023-20569. The SRSO mitigations are enabled by default but may have a performance impact for very specific workloads. It is possible to disable these security mitigations to avoid a possible performance impact; however, users should carefully consider the security implications. To disable them, specify spec_rstack_overflow=off as a kernel boot parameter. For further details see https://docs.kernel.org/admin-guide/hw-vuln/srso.html

Francesco

@stefan-maxar
Author

Hey @demartinofra - thanks for the reply!

For my testing, I did set the following in the PCluster configuration to force the proper configuration:

      ComputeResources:
          CustomSlurmSettings:
            Sockets: 4
            CoresPerSocket: 24

Which did yield proper configuration via slurmd (HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10):

[2024-10-15T19:13:36.503] Considering each NUMA node as a socket
[2024-10-15T19:13:36.505] Considering each NUMA node as a socket
[2024-10-15T19:13:36.509] CPU frequency setting not configured for this node
[2024-10-15T19:13:36.569] pyxis: version v0.20.0
[2024-10-15T19:13:36.570] slurmd version 23.11.10 started
[2024-10-15T19:13:36.619] slurmd started on Tue, 15 Oct 2024 19:13:36 +0000
[2024-10-15T19:13:36.619] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=205 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

For our applications, I did note some improvement in performance and recouped a few percent of the ~40% degradation using the proper hardware configuration. So, not quite a red herring, but definitely not the solution either!
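Since hard-coding these values per instance type is what we would like to avoid, here is a rough Python sketch of how the two settings above could be derived from lscpu output when each NUMA node is treated as a socket. The function names are hypothetical, and the sketch only computes the numbers; feeding them into the cluster configuration is left out.

```python
def parse_lscpu(text: str) -> dict:
    """Turn 'Key: value' lines from lscpu into a dictionary of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def numa_as_socket_settings(info: dict) -> dict:
    """Compute Sockets/CoresPerSocket with one socket per NUMA node."""
    cpus = int(info["CPU(s)"])
    numa_nodes = int(info["NUMA node(s)"])
    threads_per_core = int(info["Thread(s) per core"])
    return {
        "Sockets": numa_nodes,
        "CoresPerSocket": cpus // (numa_nodes * threads_per_core),
    }


if __name__ == "__main__":
    # Excerpt of the HPC6a.48xlarge lscpu output posted earlier in this thread.
    sample = """\
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
"""
    print(numa_as_socket_settings(parse_lscpu(sample)))
```

For the excerpt above this gives Sockets 4 and CoresPerSocket 24, the same values set manually in CustomSlurmSettings.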

Regarding the SRSO mitigation - thanks for passing this along. This is news to me and is definitely something I am going to investigate further. From what I can see, HPC6a with the PCluster 3.11 base AMI has the patch you refer to:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Mitigation: safe RET, no microcode
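To run the same check across many nodes, the status string can be classified with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the classification relies on the leading word of the sysfs text ("Mitigation", "Vulnerable", "Not affected"), which is an assumption about how the kernel formats this file.

```python
from pathlib import Path

# sysfs file that reports the SRSO status, as shown in the line above.
SRSO_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow")


def classify_srso(status: str) -> str:
    """Map a status string (or a 'path:status' grep line) to a coarse label."""
    # Drop a leading 'path:' prefix if the line came from grep output.
    text = status.split("spec_rstack_overflow:", 1)[-1].strip()
    if text.startswith("Mitigation"):
        return "mitigated"
    if text.startswith("Vulnerable"):
        return "vulnerable"
    if text.startswith("Not affected"):
        return "not affected"
    return "unknown"


def read_srso_status(path: Path = SRSO_SYSFS):
    """Return the raw status text, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_srso_status()
    print(raw, "->", classify_srso(raw) if raw is not None else "unavailable")
```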

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable it upon instance startup via PCluster? The patch seemingly can't be disabled during post-install procedures because that requires an instance reboot, and once you reboot, Slurm will detect the instance as "down" and swap it out. I would rather not have to create a custom AMI if there is some other way to test this out. Thanks!

@demartinofra
Contributor

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable this upon instance startup via PCluster?

If you want to test it real quick one option is to run the following on the compute nodes:

sudo grubby --update-kernel=ALL --args='spec_rstack_overflow=off'
sudo sync

and then reboot them through the scheduler, so that Slurm does not mark the nodes as unhealthy and the reboot is successful:

sudo -i scontrol reboot <nodelist>
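After the reboot, one way to confirm that the parameter actually reached the kernel is to look at /proc/cmdline. The Python sketch below does that; the helper name is hypothetical, and taking the last occurrence of a repeated parameter is an assumption made for simplicity.

```python
def kernel_arg(cmdline: str, name: str):
    """Return the value of name=value on a kernel command line.

    Returns '' for a bare flag without a value and None if the
    parameter is absent. If it appears more than once, the last
    occurrence is returned (an assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            cmdline = handle.read()
    except OSError:
        cmdline = ""
    setting = kernel_arg(cmdline, "spec_rstack_overflow")
    print("spec_rstack_overflow =", setting if setting is not None else "(not set)")
```

If this prints "off" on every node in the list, the grubby change took effect.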

@stefan-maxar
Author

Hi @demartinofra - I ran the commands you suggested to disable SRSO mitigation and rebooted via slurm which resulted in the patching being disabled:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Vulnerable, no microcode

I then ran one of our smaller-scale hybrid MPI-OpenMP jobs and the performance was as expected, with no ~40% performance degradation (I also corrected the HPC6a configuration, which helped performance a little as well). So, it definitely seems like this SRSO mitigation is the culprit for our application slowdowns...and I'll doubly confirm with our larger-scale job.

What do you suggest as a more formal workaround for the SRSO mitigation in the PCluster realm? Custom AMI? Something else? When we had performance issues because of the log4j patch, it was a simple yum remove that we could run during post install. This is a bit more involved and since we spin up and down clusters daily from the base PCluster AMI, it would be great if you could provide some recommendations. Thanks for bringing this to our attention again!

@hanwen-pcluste
Contributor

Hi Stefan,

We will work on a wiki page describing the mitigation in the PCluster realm and let you know when it is done.

Thank you Stefan and Francesco for discovering the issue!
Hanwen

@hanwen-pcluste
Contributor

Also, please avoid using 3.11.0 because of the known issue https://github.com/aws/aws-parallelcluster/wiki/(3.11.0)-Job-submission-failure-caused-by-race-condition-in-Pyxis-configuration

@hanwen-pcluste
Contributor

hanwen-pcluste commented Oct 23, 2024

Hi Stefan,

We've published the wiki page "(3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors".

Moreover, we've released ParallelCluster 3.11.1

Cheers,
Hanwen

@stefan-maxar
Author

A follow-up on this, as we have finally been able to do a lot more testing with newer versions of PCluster. We have disabled SRSO following the guide for PCluster 3.11.1 AMIs, on both AL2 and AL2023. For our large-scale hybrid MPI-OpenMP application that runs on ~200 hpc6a, we still see substantial performance degradation compared to PCluster 3.8.0, even with SRSO disabled on both OSes.

Further, the PCluster 3.8.0 AL2 AMI we currently use in production does ship with the SRSO mitigation enabled; we have never disabled it. Spinning up a hpc6a with the base us-east-2 PCluster 3.8.0 AL2 AMI (ami-03e71395f1580f16e) yields:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Mitigation: safe RET, no microcode

So, something else is going on that is causing issues with large-scale applications/jobs. It's worth reiterating - disabling SRSO in PCluster 3.11.1 AMIs DID help return performance back to near-normal for a small-scale (2 hpc6a) MPI job, but it wasn't the cure for our job using ~200 hpc6a. Were there any other foundational changes that could cause scaling issues in newer versions of PCluster?

@stefan-maxar
Author

@hanwen-pcluste @demartinofra

Another update here as we've continued testing with PCluster 3.12.0. We are still seeing performance degradation at scale with PCluster 3.10+, including 3.12.0 on both AL2 and AL2023. After a lot more digging, we've noticed that the network throughput (EFA traffic) is substantially less in the newer versions. Our latest tests were with the following:

Cluster 1:

  • MPI job on ~200 hpc6a using Intel MPI
  • PCluster 3.12.0 Base AMI configured with SRSO disabled and EFA Installer 1.37.0

Cluster 2 (current production environment):

  • MPI job on ~200 hpc6a using Intel MPI
  • PCluster 3.8.0 Base AMI with EFA Installer 1.30.0 (we did not disable SRSO)

Attaching screenshots of one instance from each test cluster showing network in/out and network packets in/out using 5-min averages (top is cluster 1, bottom is cluster 2). During the main MPI job, test cluster 2 has a consistent 5-min average network in/out of 115+ Gb, with packets exceeding 30M. In contrast, test cluster 1 has a significantly lower 5-min average network in/out, varying between 80 and 90 Gb, with packets hovering around 24-26M. Further, the traffic is much more volatile (sawtooth pattern). This performance degradation is consistent across other instances within the cluster, but for ease of plotting, we isolated it down to 1 compute instance from each.
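To put the numbers above in relative terms, here is a small Python calculation. It treats 115 Gb and 30M packets as the cluster 2 baseline. Both are stated above as lower bounds, so the real gap may be somewhat larger than what this prints.

```python
def relative_drop(baseline: float, observed: float) -> float:
    """Percentage by which observed falls short of baseline."""
    return (baseline - observed) / baseline * 100.0


# (label, cluster 2 baseline, cluster 1 low, cluster 1 high) from the text above.
measurements = [
    ("network in/out (Gb)", 115, 80, 90),
    ("packets in/out (M)", 30, 24, 26),
]

for label, baseline, low, high in measurements:
    best = relative_drop(baseline, high)   # smallest shortfall
    worst = relative_drop(baseline, low)   # largest shortfall
    print(f"{label}: {best:.0f}% to {worst:.0f}% below cluster 2")
```

That works out to roughly 22 to 30 percent less throughput and 13 to 20 percent fewer packets on cluster 1.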

I am not sure what could be causing this performance drop and could use some pointers on where to dig next, in case any EFA-related configurations have changed. Since it's a pretty large version bump in the EFA installer, I'm sure there are a lot of moving parts that could be the culprit.

- Stefan

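[Screenshot: network in/out and packets in/out, 5-min averages, cluster 1]
[Screenshot: network in/out and packets in/out, 5-min averages, cluster 2]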

@hanwen-cluster
Contributor

Hi Stefan,

We are still looking at the issue. We apologize for the late reply.

Thank you,
Hanwen

@stefan-maxar
Author

stefan-maxar commented Jan 16, 2025

Thanks Hanwen!

FWIW - I am continuing to test some of our smaller-scale jobs and am not seeing any performance issues. Latest testing for a hybrid MPI-OpenMP job that uses 4 hpc6a:

Cluster 1:

  • MPI job on 4 hpc6a using Intel MPI
  • PCluster 3.12.0 AL2023 Base AMI configured with SRSO disabled and EFA Installer 1.37.0

Cluster 2 (current production environment):

  • MPI job on 4 hpc6a using Intel MPI
  • PCluster 3.8.0 AL2 Base AMI with EFA Installer 1.30.0 (we did not disable SRSO)

The total wall clock times for the jobs on cluster 1 were nearly identical to our production wall clock times on cluster 2. So, this seems to only be an issue at scale, at least from what I have seen.
