supporting Arm Neoverse V2 CPUs: NVIDIA Grace, AWS Graviton 4, Google Axion #845

Open
boegel opened this issue Jan 10, 2025 · 11 comments

Comments

@boegel
Contributor

boegel commented Jan 10, 2025

While looking into implementing support in archdetect for detecting the Neoverse V2-based Graviton 4, I noticed that the CPUs that implement this microarchitecture only partially overlap (based on https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html):

instructions Amazon Graviton 4 NVIDIA Grace Google Axion
sve2 x x x
paca x x
pacg x x
rng x x
sm3 x x
sm4 x x
svesm4 x x
ssbs x x

Google Axion not supporting ssbs is particularly interesting, since that means that our aarch64/neoverse_v1 installations may not even work there...

@ocaisa
Member

ocaisa commented Jan 10, 2025

I was wondering if we can inspect which instructions our binaries actually require?

@ocaisa
Member

ocaisa commented Jan 10, 2025

Seems to be possible: https://github.com/pkgw/elfx86exts (says it works for x86 and ARM)

@ocaisa
Member

ocaisa commented Jan 10, 2025

We could inspect the instructions required by the software, and have CI complain if it goes outside the common options.

@boegel
Contributor Author

boegel commented Jan 10, 2025

Seems to be possible: https://github.com/pkgw/elfx86exts (says it works for x86 and ARM)

Step 0:

@ocaisa
Member

ocaisa commented Jan 13, 2025

Sample output from elfx86exts:

$ elfx86exts $(which elfx86exts)
File format and CPU architecture: Elf, X86_64
MODE64 (call)
CMOV (cmova)
BMI2 (mulx)
AVX (vpxor)
NOVLX (vpxor)
AVX2 (vpbroadcastq)
BMI (tzcnt)
SSE2 (pause)
SSE1 (xorps)
BWI (vpbroadcastb)
VLX (vpbroadcastb)
AVX512 (kortestw)
DQI (kshiftrb)
Instruction set extensions used: AVX, AVX2, AVX512, BMI, BMI2, BWI, CMOV, DQI, MODE64, NOVLX, SSE1, SSE2, VLX
CPU Generation: Unknown
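
The CI check suggested earlier could be built on the summary line of this output. Below is a minimal sketch in Python, assuming the `Instruction set extensions used:` line keeps the format shown above; the allow-list and the function names are hypothetical and only illustrate the idea.

```python
SUMMARY_PREFIX = "Instruction set extensions used:"

# Summary line copied from the sample output above.
SAMPLE_SUMMARY = (
    "Instruction set extensions used: AVX, AVX2, AVX512, BMI, BMI2, BWI, "
    "CMOV, DQI, MODE64, NOVLX, SSE1, SSE2, VLX"
)

# Hypothetical allow-list for one architecture branch (an assumption, not EESSI policy).
ALLOWED = {"MODE64", "CMOV", "SSE1", "SSE2", "AVX", "AVX2", "BMI", "BMI2"}


def parse_extensions(output):
    """Return the set of extension names found on the summary line, if any."""
    for line in output.splitlines():
        if line.startswith(SUMMARY_PREFIX):
            names = line[len(SUMMARY_PREFIX):].split(",")
            return {name.strip() for name in names if name.strip()}
    return set()


def disallowed_extensions(output, allowed):
    """Return the extensions used by the binary that are not in the allow-list."""
    return parse_extensions(output) - set(allowed)


extra = disallowed_extensions(SAMPLE_SUMMARY, ALLOWED)
if extra:
    print("Extensions outside the allowed set:", ", ".join(sorted(extra)))
```

In CI, the text passed to `disallowed_extensions` would be the captured output of the tool for each binary, and a non-empty result would fail the job.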

@ocaisa
Member

ocaisa commented Jan 13, 2025

Instruction set extensions cover a range of assembly instructions relevant to different categories (security, performance, AI/ML, cryptography, ...). We could consider keeping a catalogue of which assembly instructions are required for each software package.

We'd still need a map from the assembly instructions back to the CPU features to know whether or not a particular CPU can run the code for a particular architecture branch of EESSI. I couldn't easily find this information, so I asked ChatGPT for a helpful table for the NVIDIA Grace (which uses the Armv9-A ISA) and got something that is at the very least a good starting point:

| Category | Feature Name | Description | Relevant Assembly Instructions |
| --- | --- | --- | --- |
| Security | armv9.0-a, rme | Confidential Compute Architecture (CCA): Introduces Realms for hardware-based workload isolation. | RMI (Realm Management Instructions), SEAL, UNSEAL |
| | mte | Memory Tagging Extension (MTE): Tags memory allocations to detect overflows and use-after-free errors. | STG, LDG, STZG, LDGZ, IRG |
| | pauth | Pointer Authentication (PAC): Improves control-flow integrity with efficient pointer signing. | PACIA, PACIB, AUTIA, AUTIB, XPACD, XPACI |
| | bti | Branch Target Identification (BTI): Ensures valid branch targets to mitigate exploitation risks. | BTI |
| | specres | Speculative Execution Mitigations: Adds protections against speculative execution vulnerabilities. | No specific assembly instructions, protections are hardware-based. |
| Performance | sve2 | Scalable Vector Extension 2 (SVE2): SIMD support for variable-length vectors (128–2048 bits) for HPC, AI. | Vector operations like ADDVL, FMUL, UDOT, SDOT, EORV |
| | flagm2, flagm | Enhanced Flag Manipulation: Efficient flag-setting and checking for conditional execution. | SETF8, SETF16 |
| | ls64, ls64v, ls64acc | Enhanced Atomic Operations: Adds efficient atomic operations for multithreaded applications. | ST64B, LD64B, CASP64, LDAPRB |
| | fgt | Fine-Grained Traps (FGT): Provides precise control over exception trapping for debugging. | WFET, WFIT |
| | fp16fml | Half-Precision Multiply-Accumulate (FP16FML): Optimizes FP16-based computations for ML workloads. | FMLAL, FMLSL |
| AI/ML | i8mm | Integer Matrix Multiplication Extension (I8MM): Accelerates integer matrix math for AI/ML tasks. | UDOT, SDOT |
| | bf16 | BFloat16 Support (BF16): Enables efficient low-precision floating-point operations for AI/ML. | BFDOT, BFMLAL |
| | sve2 | SVE2 Optimizations: Adds AI-specific instructions, including support for dot products and convolutions. | UDOT, SDOT, SQRDMULH, SMIN, UMAX |
| Cryptography | sha3 | SHA-3 Support: Accelerates secure hashing for data integrity and cryptographic applications. | EOR3, RAX1, XAR |
| | sm4 | SM4 Instructions: Hardware acceleration for the SM4 block cipher (Chinese cryptographic standard). | SM4E, SM4EKEY |
| | aes | AES Extensions: Efficient support for AES encryption and decryption, including GCM modes. | AESE, AESD, AESMC, AESIMC |
| | sha512 | SHA-512 Extensions: Optimizes SHA-512 for secure workloads and high-performance environments. | SHA512H, SHA512H2, SHA512SU0, SHA512SU1 |
| | sm3 | SM3 Instructions: Hardware support for the SM3 hash algorithm (Chinese cryptographic standard). | SM3PARTW1, SM3PARTW2, SM3SS1, SM3TT1A, SM3TT1B, SM3TT2A, SM3TT2B |
| Compatibility | armv8.6-a, armv8.4-a, asimddp | Backward Compatibility: Full support for previous Armv8-A features, including FP16 and dot products. | Instructions like UDOT, SDOT, FMADD |
| | tlbi | TLB Enhancements: Improves Translation Lookaside Buffer management for virtualized systems. | TLBIALL, TLBIVMALL, TLBIIPAS2E1IS |
| | nv | Nested Virtualization: Enables running virtual machines within virtual machines. | HVC, ERET, SYSREG |
| Debugging | sysreg | System Register Enhancements: Adds support for performance monitoring and secure system debugging. | System register access instructions like MRS, MSR. |
| | dbg | Improved Debug Architecture: Enhances system-level debugging, especially in isolated environments. | Debug instructions like BRK, DBG, and exception-generating instructions. |

We only really care about a certain set of categories, so small differences in ARM instruction set extensions may not be relevant to the software stacks we ship.

@ocaisa
Member

ocaisa commented Jan 13, 2025

So a hardware check would be more like asking the question: this software stack requires <LIST> assembly instructions, does your CPU support them?
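
As a rough illustration of such a check, the sketch below compares a required list against what the kernel reports in `/proc/cpuinfo` (the `flags` line on x86, the `Features` line on aarch64). Note that this works at the level of CPU feature flags rather than individual instructions, so it assumes the instruction-to-feature map discussed above already exists. The function names and the sample text are hypothetical.

```python
def cpu_features(cpuinfo_text):
    """Return the feature flags listed in /proc/cpuinfo-style text.

    Linux lists them on a 'flags' line on x86 and a 'Features' line on aarch64.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text):
    """Return the required features that the CPU does not report, sorted."""
    return sorted(set(required) - cpu_features(cpuinfo_text))


# Hypothetical excerpt; on a real system use open("/proc/cpuinfo").read() instead.
SAMPLE_CPUINFO = "processor\t: 0\nFeatures\t: fp asimd sve2 sm3 sm4\n"

print(missing_features(["sve2", "ssbs"], SAMPLE_CPUINFO))
```

An empty list would mean every required feature is reported by the CPU.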

@ocaisa
Member

ocaisa commented Jan 31, 2025

It seems you can check this with a compiler:

#include <stdio.h>

int main() {
    __asm__ volatile ("mov %%eax, %%eax" ::: "eax");
    printf("Instruction executed!\n");
    return 0;
}

I wonder if we can compile a small executable with the full list of instructions for the stack and then just run that on the CPU? If it doesn't throw an "Illegal instruction" then it is supported? Is that enough? Could things like cache sizes etc. have an impact?

EDIT: Tested the AI generated code above, doesn't seem to work out-of-the-box, but could it work in principle?
2nd EDIT: Code above now seems to work

@ocaisa
Member

ocaisa commented Jan 31, 2025

Indeed, in principle that approach could work:

ocaisa@LAPTOP-O6HF2IKC:~$ cat test_instruction_sapphaire_rapids.c
#include <stdio.h>

int main() {
    __asm__ volatile (".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0"); // Encodes `tilezero tmm0`
    printf("AMX instruction executed!\n");
    return 0;
}
ocaisa@LAPTOP-O6HF2IKC:~$ gcc test_instruction_sapphaire_rapids.c
ocaisa@LAPTOP-O6HF2IKC:~$ ./a.out
Illegal instruction (core dumped)
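
The "Illegal instruction" outcome above can also be detected from a script rather than read off the terminal. A minimal sketch, assuming a POSIX system where Python's `subprocess` reports a child killed by signal N as return code -N; the helper name `runs_without_sigill` is hypothetical.

```python
import signal
import subprocess


def runs_without_sigill(command):
    """Run a command and return False if it was killed by SIGILL, True otherwise."""
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, a negative return code -N means the child was terminated by signal N.
    return result.returncode != -signal.SIGILL
```

For the binary built above this would be called as `runs_without_sigill(["./a.out"])`. It only distinguishes SIGILL from everything else, so a probe that fails for another reason still counts as "no illegal instruction".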

@ocaisa
Member

ocaisa commented Jan 31, 2025

A couple more AI-generated proofs of concept.
Based on an instructions.json file like

[
    {
        "name": "AVX vaddps",
        "assembly": "vaddps %xmm0, %xmm1, %xmm2"
    },
    {
        "name": "AMX tilezero",
        "assembly": ".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0"
    },
    {
        "name": "SSE movaps",
        "assembly": "movaps %xmm0, %xmm1"
    }
]

you could have a Python code generator like

import json

# Template for the generated C file (Fixed escaping `{}` using double `{{}}`)
C_TEMPLATE = """\
#include <stdio.h>
#include <setjmp.h>
#include <signal.h>

// Global variable for signal handling
sigjmp_buf buf;

// Signal handler for illegal instructions
void sigill_handler(int sig) {{
    siglongjmp(buf, 1);  // Jump back to safety
}}

// Function to test an instruction
void test_instruction(const char *name, void (*instr_func)()) {{
    printf("[*] Testing: %s... ", name);

    if (sigsetjmp(buf, 1) == 0) {{
        instr_func();  // Run the instruction
        printf("Success\\n");
    }} else {{
        printf("Failed (Illegal Instruction)\\n");
    }}
}}

// Instruction test functions
{function_definitions}

int main() {{
    // Set up signal handler for SIGILL
    signal(SIGILL, sigill_handler);

    printf("\\n=== CPU Instruction Test ===\\n");

    // Run all tests
{function_calls}

    printf("\\n=== Test Complete ===\\n");
    return 0;
}}
"""

def generate_c_code(instructions):
    """Generates C code based on the provided instructions."""
    function_definitions = []
    function_calls = []

    for instr in instructions:
        func_name = f"test_{instr['name'].replace(' ', '_').lower()}"
        function_definitions.append(f"void {func_name}() {{ __asm__ volatile (\"{instr['assembly']}\\n\"); }}")
        function_calls.append(f"    test_instruction(\"{instr['name']}\", {func_name});")

    return C_TEMPLATE.format(
        function_definitions="\n".join(function_definitions),
        function_calls="\n".join(function_calls)
    )

def main():
    # Load JSON file
    with open("instructions.json", "r") as f:
        instructions = json.load(f)

    # Generate C code
    c_code = generate_c_code(instructions)

    # Write to file
    with open("test_instructions.c", "w") as f:
        f.write(c_code)

    print("[+] Generated 'test_instructions.c' successfully!")

if __name__ == "__main__":
    main()

and then

ocaisa@LAPTOP-O6HF2IKC:~$ python generate_c_code.py
[+] Generated 'test_instructions.c' successfully!
ocaisa@LAPTOP-O6HF2IKC:~$ gcc test_instructions.c
ocaisa@LAPTOP-O6HF2IKC:~$ ./a.out

=== CPU Instruction Test ===
[*] Testing: AVX vaddps... Success
[*] Testing: AMX tilezero... Failed (Illegal Instruction)
[*] Testing: SSE movaps... Success

=== Test Complete ===

I could see having an individual instructions.json for each package, and then generating a global one for a whole software stack simply by merging them.
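
Merging the per-package files could look roughly like the sketch below. It assumes each file holds a list of objects with `name` and `assembly` keys, as in the example above, and treats two entries with the same `assembly` string as duplicates; the function names and file layout are hypothetical.

```python
import json
from pathlib import Path


def merge_instruction_lists(lists):
    """Merge per-package instruction lists, keeping the first entry per assembly string."""
    merged = {}
    for instructions in lists:
        for instr in instructions:
            merged.setdefault(instr["assembly"], instr)
    return list(merged.values())


def merge_instruction_files(paths, output_path):
    """Read several instructions.json files and write one merged file."""
    lists = [json.loads(Path(path).read_text()) for path in paths]
    merged = merge_instruction_lists(lists)
    Path(output_path).write_text(json.dumps(merged, indent=4))
    return merged


# Two hypothetical packages that share one instruction.
package_a = [{"name": "AVX vaddps", "assembly": "vaddps %xmm0, %xmm1, %xmm2"}]
package_b = [
    {"name": "AVX vaddps", "assembly": "vaddps %xmm0, %xmm1, %xmm2"},
    {"name": "SSE movaps", "assembly": "movaps %xmm0, %xmm1"},
]
print(len(merge_instruction_lists([package_a, package_b])))
```

The merged file can then be fed to the generator above unchanged, since it has the same structure as a per-package file.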

@ocaisa
Member

ocaisa commented Jan 31, 2025

To me, this looks promising:

[ocaisa@login1 ~]$ srun --partition=x86-64-generic-node --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
[ocaisa@x86-64-generic-node1 ~]$ ./a.out 

=== CPU Instruction Test ===
[*] Testing: AVX vaddps... Success
[*] Testing: AMX tilezero... Failed (Illegal Instruction)
[*] Testing: SSE movaps... Success
[*] Testing: AVX512 vaddps... Failed (Illegal Instruction)
[*] Testing: AVX512 vpdpbusd... Failed (Illegal Instruction)
[*] Testing: FMA vfmadd231ps... Success
[*] Testing: BMI2 pext... Success
[*] Testing: BMI2 mulx... Success
[*] Testing: SHA sha256rnds2... Failed (Illegal Instruction)
[*] Testing: TSX xbegin... Success

=== Test Complete ===
[ocaisa@x86-64-generic-node1 ~]$ exit
exit
[ocaisa@login1 ~]$ srun --partition=x86-64-intel-skylake-node --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
[ocaisa@x86-64-intel-skylake-node1 ~]$ ./a.out 

=== CPU Instruction Test ===
[*] Testing: AVX vaddps... Success
[*] Testing: AMX tilezero... Failed (Illegal Instruction)
[*] Testing: SSE movaps... Success
[*] Testing: AVX512 vaddps... Success
[*] Testing: AVX512 vpdpbusd... Success
[*] Testing: FMA vfmadd231ps... Success
[*] Testing: BMI2 pext... Success
[*] Testing: BMI2 mulx... Success
[*] Testing: SHA sha256rnds2... Failed (Illegal Instruction)
[*] Testing: TSX xbegin... Success

=== Test Complete ===
[ocaisa@x86-64-intel-skylake-node1 ~]$ exit
exit
[ocaisa@login1 ~]$ srun --partition=x86-64-intel-srapids-node --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
[ocaisa@x86-64-intel-srapids-node1 ~]$ ./a.out 

=== CPU Instruction Test ===
[*] Testing: AVX vaddps... Success
[*] Testing: AMX tilezero... Success
[*] Testing: SSE movaps... Success
[*] Testing: AVX512 vaddps... Success
[*] Testing: AVX512 vpdpbusd... Success
[*] Testing: FMA vfmadd231ps... Success
[*] Testing: BMI2 pext... Success
[*] Testing: BMI2 mulx... Success
[*] Testing: SHA sha256rnds2... Success
[*] Testing: TSX xbegin... Success

=== Test Complete ===
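
To turn output like this into a yes/no answer for a given node, the result lines could be parsed and compared against what a software stack needs. A minimal sketch, assuming the `[*] Testing: <name>... <status>` line format printed by the generated program; the function names and the required list are hypothetical.

```python
import re

RESULT_LINE = re.compile(r"^\[\*\] Testing: (?P<name>.+?)\.\.\. (?P<status>Success|Failed)")

# A few lines copied from the generic node run above.
SAMPLE_OUTPUT = """\
=== CPU Instruction Test ===
[*] Testing: AVX vaddps... Success
[*] Testing: AMX tilezero... Failed (Illegal Instruction)
[*] Testing: SSE movaps... Success
[*] Testing: AVX512 vaddps... Failed (Illegal Instruction)
=== Test Complete ===
"""


def parse_results(output):
    """Map each tested instruction name to True (Success) or False (Failed)."""
    results = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match["name"]] = match["status"] == "Success"
    return results


def unsupported(required, output):
    """Return required instructions that failed or were not tested at all."""
    results = parse_results(output)
    return [name for name in required if not results.get(name, False)]


print(unsupported(["AVX vaddps", "AVX512 vaddps"], SAMPLE_OUTPUT))
```

Treating untested instructions as unsupported is a deliberately conservative choice here; a real check might prefer to report them separately.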
