
AES-NI #1433

Open
MaxGraey opened this issue Jul 23, 2021 · 23 comments
@MaxGraey

I'm not sure if this is the right place for this proposal, but I would like to suggest some very useful instructions to speed up cryptography, especially cryptographic and non-cryptographic hashing.

What are the instructions being proposed?

AES-NI (Advanced Encryption Standard New Instructions) is an extended instruction set that accelerates AES encryption/decryption.

v128.aes.enc(a, b)

Perform one round of an AES encryption flow

general:

fn aesenc(v128 a, v128 b) {
  return MixColumns(ShiftRows(SubBytes(a))) ^ b
}

x86: aesenc
ARM: AESMC + AESE + EOR
PPC: vcipher

v128.aes.enc_last(a, b)

Perform the last round of an AES encryption flow

general:

fn aesenc_last(v128 a, v128 b) {
  return ShiftRows(SubBytes(a)) ^ b
}

x86: aesenclast
ARM: AESE + EOR
PPC: vcipherlast

v128.aes.dec(a, b)

Perform one round of an AES decryption flow

general:

fn aesdec(v128 a, v128 b) {
  return MixColumnsInv(ShiftRowsInv(SubBytesInv(a))) ^ b
}

x86: aesdec
ARM: AESIMC + AESD + EOR
PPC: vncipher

v128.aes.dec_last(a, b)

Perform the last round of an AES decryption flow

general:

fn aesdec_last(v128 a, v128 b) {
  return ShiftRowsInv(SubBytesInv(a)) ^ b
}

x86: aesdeclast
ARM: AESD + EOR
PPC: vncipherlast
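
For context, a full AES-128 block encryption chains the four operations above: an initial round-key XOR, nine middle rounds of v128.aes.enc, and one v128.aes.enc_last. A sketch in the same pseudocode, assuming rk[0..10] is an already-expanded key schedule:

fn aes128_encrypt_block(v128 block, v128 rk[11]) {
  v128 state = block ^ rk[0]                // initial AddRoundKey
  for (i in 1..9) {
    state = v128.aes.enc(state, rk[i])      // SubBytes + ShiftRows + MixColumns + key XOR
  }
  return v128.aes.enc_last(state, rk[10])   // last round omits MixColumns
}

Decryption mirrors this with v128.aes.dec / v128.aes.dec_last; note that, as with x86 aesdec, the middle round keys would need the inverse MixColumns transform applied and would be used in reverse order.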

v128.aes.keygen(a, imm8)

Generating the round keys used for encryption

general:

fn aeskeygen(v128 a, u8 imm8) {
  X3[31:0] = a[127:96]
  X2[31:0] = a[95:64]
  X1[31:0] = a[63:32]
  X0[31:0] = a[31:0]
  RCON[31:0] = ZeroExtend(imm8)
  vDst[31:0] = SubWord(X1)
  vDst[63:32] = RotWord(SubWord(X1)) ^ RCON
  vDst[95:64] = SubWord(X3)
  vDst[127:96] = RotWord(SubWord(X3)) ^ RCON
  return vDst
}

x86: aeskeygenassist
ARM: efficient emulation is possible (see emulating-x86-aes-intrinsics-on-armv8-a):

__m128i _mm_aeskeygenassist_si128_arm(__m128i a, const int imm8) {
    a = vaeseq_u8(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

PPC:

__m128i _mm_aeskeygenassist_si128_ppc(__m128i a, const int imm8) {
    a = __builtin_crypto_vcipherlast(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from vcipherlast and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}
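
To expand a full AES-128 key schedule from this assist operation, each new round key would be derived from the previous one roughly as follows (a sketch in the same pseudocode; splat_lane32 and shift_bytes_left are assumed helpers that broadcast one 32-bit lane and shift a vector left by whole bytes):

fn aes128_next_round_key(v128 prev, u8 rcon) {
  v128 t = v128.aes.keygen(prev, rcon)
  t = splat_lane32(t, 3)              // broadcast RotWord(SubWord(X3)) ^ rcon
  prev ^= shift_bytes_left(prev, 4)   // fold the previous key words into
  prev ^= shift_bytes_left(prev, 8)   // a running XOR prefix
  return prev ^ t
}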

For details about operations such as MixColumns and ShiftRowsInv, see Intel's white paper.

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

  • x86: supported by Intel (Westmere, Sandy/Ivy Bridge, Haswell, Skylake, etc.) and AMD (>= Jaguar, >= Puma, >= Zen 1).
  • ARM: optionally supported on ARMv8-A (ARM Cortex-A30/50/70 cores), Qualcomm 805, Exynos 3 series.
  • RISC-V has no such dedicated instructions, but a number of RISC-V chips include integrated AES co-processors, and such instructions may be standardized in the future.
  • POWER8/9/10 also support this (thanks to @nemequ for pointing that out).

What use cases are there?

@Maratyszcza

AES instructions differ very substantially between x86 and ARM. The proposal should address a way to unify over these differences.

@nemequ

nemequ commented Jul 23, 2021

POWER also has AES instructions, FWIW:

  • vcipher (vec_cipher_be)
  • vcipherlast (vec_cipherlast_be)
  • vncipher (vec_ncipher_be)
  • vncipherlast (vec_ncipherlast_be)
  • vsbox (vec_sbox_be)

They were all added in POWER9 (ISA 3.0).

That said, I don't think these fit here since they're not really relaxed. Is there anywhere to gather ideas for a second version of the simd proposal?

@MaxGraey
Author

MaxGraey commented Jul 23, 2021

@Maratyszcza I added some details for x86 and ARM. ARM has slightly more low-level intrinsics, but they all seem to fit into one API. Later I'll add a couple more instructions related to AES.

@MaxGraey
Author

I added another operation, v128.aes.keygen. It's pretty easy to emulate on ARM; on PowerPC you can do something similar.

@penzn

penzn commented Jul 26, 2021

Looks reasonable to me.

@jan-wassenberg

+1 to these being useful. @nemequ we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right? SVE2 does that (unfortunately it's an optional extension), and that would also cover x86 VAES.

@penzn

penzn commented Jul 28, 2021

we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right?

@jan-wassenberg, AFAIK, this proposal is limited to 128 bit operations, however we can probably consider it SIMD 2.0 of sorts, which should let us include these operations. I would be interested to pull this into flexible vectors to have the semantics you described 😉

@jan-wassenberg

@penzn OK :) I'd also welcome both v128 and flexible AES.

@jlb6740

jlb6740 commented Jul 30, 2021

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform? What's the fallback if a VM doesn't support these relaxed instructions?

@MaxGraey
Author

MaxGraey commented Jul 30, 2021

What's the fallback if a VM doesn't support these relaxed instructions?

You can follow the same strategy as with the qfma / qfms instructions, which expose an extra environment variable (fpenv). AES-NI is very similar in this respect: I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs, so it is better to follow a different strategy for implementing AES or hashes on such platforms.

It is also worth remembering that these instructions execute in constant time, which is quite important in terms of security.

The AES instructions are designed to mitigate all of the known timing and cache side channel leakage of sensitive data (from Ring 3 spy processes). Their latency is data-independent, and since all the computations are performed internally by the hardware, no lookup tables are required. Therefore, if the AES instructions are used properly (e.g., as in the following code examples) the AES encryption/decryption, as well as the Key Expansion, would have data-independent timing and would involve only data-independent memory access. Consequently, the AES instructions allow for writing high performance AES software which is, at the same time, protected against the currently known software side channel attacks

@jan-wassenberg

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I recently implemented a constant-time fallback using basic SIMD only.

@MaxGraey
Author

MaxGraey commented Jul 30, 2021

As far as I know, the most problematic part for non-SIMD emulation is the Galois field multiplication.

@penzn

penzn commented Jul 30, 2021

I recently implemented a constant-time fallback using basic SIMD only.

This reminds me of some instructions we have added late in the SIMD proposal. I think it would be doable, but we would need to prototype and measure.

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I don't think we should worry about scalar-only VMs, especially after SIMD making it to stage 5.

@lars-t-hansen
Contributor

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform?

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

@MaxGraey
Author

MaxGraey commented Aug 2, 2021

Actually, I don't mind if it's a Fixed SIMD 2.0 or Fixed SIMD Extensions proposal. It's just that such a repository does not exist yet

@nemequ

nemequ commented Aug 2, 2021

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

The summary for this proposal is:

This proposal adds a set of useful SIMD instructions that introduce local non-determinism (where the results of the instructions may vary based on hardware support).

So unless I'm missing something (definitely possible, I'm not really familiar with how these work internally) this seems out of scope to me. FWIW, I'd very much like to start gathering potential instructions for a SIMD 2.0 extension, and I think this would be a great addition to that list.

@lars-t-hansen
Contributor

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

Again, being pedantic, FMA is not a "relaxed" version of any previous instruction, it is a new instruction with relaxed behavior. IMO all 128-bit instructions with relaxed behavior are in scope.

I don't know whether the AES instructions fall into that category; I was simply reacting to @jlb6740's wording and wanted to clarify that instructions with new behaviors are not out of scope for relaxed-simd.

@MaxGraey
Author

MaxGraey commented Aug 3, 2021

As I understand it, Relaxed SIMD is a category of SIMD operations that do not guarantee deterministic behavior (e.g. with NaN or signed zero) or accuracy (as with the FMA instruction on platforms where it is not available). In the case of AES we have only one source of non-determinism: we cannot guarantee the same performance (cost model) on all platforms. In fact, many of the already standardized fixed-width SIMD instructions cannot guarantee this either. So as far as I'm concerned, this really belongs with SIMD 2.0.

@ngzhian
Member

ngzhian commented Aug 5, 2021

I think we agree that these AES instructions don't fit into relaxed-simd, but we also don't have a SIMD 2.0 proposal. As concrete next step, we can approach this as a new feature and follow the process that we have in place for new features, starting with filing an issue on the design repository (I can also transfer this issue there).

It is up to interested participants to scope out the proposal: it could be a SIMD 2.0 proposal consisting of even more SIMD instructions, or it could be only AES instructions (as detailed here), or it could be a broader crypto instructions proposal. This will be driven by the proposal champion(s).

I can provide pointers and help guide the process if @MaxGraey (or anyone else) is willing to take this on.

Update: in the 2021-08-06 SIMD sync, we discussed a potential streamlined/lightweight process to make small changes to the spec, this is not confirmed yet, but could also work for AES (if/when it happens).

@penzn

penzn commented Oct 1, 2021

@MaxGraey, from discussion at the SIMD subgroup meeting, we should probably highlight some use cases for these instructions, then it can be taken to the CG.

@MaxGraey
Author

MaxGraey commented Oct 1, 2021

@penzn the "What use cases are there?" point at the end lists some use cases. To summarize, these instructions are used mainly in two ways: to speed up AES encryption/decryption, and to speed up cryptographic and non-cryptographic hash functions. Cryptographic hashes can often be used for databases and for very simple HashMap / HashSet containers with high resistance to HashDoS attacks.
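
To illustrate the hashing use case: designs in the spirit of aHash or Meow Hash treat a single enc round as a cheap mixing primitive rather than as encryption, folding each input block into a running state (a sketch in the thread's pseudocode, not any specific hash's exact construction):

fn hash_mix(v128 state, v128 input) {
  // One round diffuses each state byte across a four-byte column
  // (SubBytes + ShiftRows + MixColumns) and XORs in the input block.
  return v128.aes.enc(state, input)
}

A few such rounds over the input, plus some finalization rounds, give a keyed, HashDoS-resistant hash far cheaper than a full cryptographic hash.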

@penzn

penzn commented Oct 1, 2021

Thanks @MaxGraey, to me this looks sufficient. @dtig, what do you think? Should we move the issue to design?

@dtig
Member

dtig commented Oct 4, 2021

Thanks @penzn and @MaxGraey this sounds like a good start, I'll transfer this issue to the design repository as it is out of scope for relaxed-simd. Please note that a streamlined/lightweight process will still need a champion and satisfy the requirements outlined in the Phases document.

@dtig dtig transferred this issue from WebAssembly/relaxed-simd Oct 4, 2021