
AES-NI #1433

Open
MaxGraey opened this issue Jul 23, 2021 · 23 comments
@MaxGraey

I'm not sure if this is the right place for this proposal, but I would like to suggest some very useful instructions to speed up cryptography, especially cryptographic and non-cryptographic hashing.

What are the instructions being proposed?

AES-NI (Advanced Encryption Standard New Instructions) is an extended instruction set that accelerates AES encryption/decryption.

v128.aes.enc(a, b)

Perform one round of an AES encryption flow

general:

fn aesenc(v128 a, v128 b) {
  return MixColumns(ShiftRows(SubBytes(a))) ^ b
}

x86: aesenc
ARM: AESMC + AESE + EOR
PPC: vcipher

v128.aes.enc_last(a, b)

Perform the last round of an AES encryption flow

general:

fn aesenc_last(v128 a, v128 b) {
  return ShiftRows(SubBytes(a)) ^ b
}

x86: aesenclast
ARM: AESE + EOR
PPC: vcipherlast

v128.aes.dec(a, b)

Perform one round of an AES decryption flow

general:

fn aesdec(v128 a, v128 b) {
  return MixColumnsInv(ShiftRowsInv(SubBytesInv(a))) ^ b
}

x86: aesdec
ARM: AESIMC + AESD + EOR
PPC: vncipher

v128.aes.dec_last(a, b)

Perform the last round of an AES decryption flow

general:

fn aesdec_last(v128 a, v128 b) {
  return ShiftRowsInv(SubBytesInv(a)) ^ b
}

x86: aesdeclast
ARM: AESD + EOR
PPC: vncipherlast
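
For context, a full AES-128 block encryption chains the four operations above: an initial round-key XOR, nine middle rounds of v128.aes.enc, and one v128.aes.enc_last. A sketch in the same pseudocode, assuming rk[0..10] is an already-expanded key schedule:

fn aes128_encrypt_block(v128 block, v128 rk[11]) {
  v128 state = block ^ rk[0]                // initial AddRoundKey
  for (i in 1..9) {
    state = v128.aes.enc(state, rk[i])      // SubBytes + ShiftRows + MixColumns + key XOR
  }
  return v128.aes.enc_last(state, rk[10])   // last round omits MixColumns
}

Decryption mirrors this with v128.aes.dec / v128.aes.dec_last; note that, as with x86 aesdec, the middle round keys would need the inverse MixColumns transform applied and would be used in reverse order.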

v128.aes.keygen(a, imm8)

Generating the round keys used for encryption

general:

fn aeskeygen(v128 a, u8 imm8) {
  X3[31:0] = a[127:96]
  X2[31:0] = a[95:64]
  X1[31:0] = a[63:32]
  X0[31:0] = a[31:0]
  RCON[31:0] = ZeroExtend(imm8)
  vDst[31:0] = SubWord(X1)
  vDst[63:32] = RotWord(SubWord(X1)) ^ RCON
  vDst[95:64] = SubWord(X3)
  vDst[127:96] = RotWord(SubWord(X3)) ^ RCON
  return vDst
}

x86: aeskeygenassist
ARM: efficient emulation is possible (see emulating-x86-aes-intrinsics-on-armv8-a):

__m128i _mm_aeskeygenassist_si128_arm(__m128i a, const int imm8) {
    a = vaeseq_u8(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}

PPC:

__m128i _mm_aeskeygenassist_si128_ppc(__m128i a, const int imm8) {
    a = __builtin_crypto_vcipherlast(a, (__m128i){}); // perform ShiftRows and SubBytes on "a"
    uint32_t rcon = (uint32_t)(uint8_t)imm8;
    __m128i dest = {
        // Undo ShiftRows step from vcipherlast and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, rcon, 0, rcon});
}
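
To expand a full AES-128 key schedule from this assist operation, each new round key would be derived from the previous one roughly as follows (a sketch in the same pseudocode; splat_lane32 and shift_bytes_left are assumed helpers that broadcast one 32-bit lane and shift a vector left by whole bytes):

fn aes128_next_round_key(v128 prev, u8 rcon) {
  v128 t = v128.aes.keygen(prev, rcon)
  t = splat_lane32(t, 3)              // broadcast RotWord(SubWord(X3)) ^ rcon
  prev ^= shift_bytes_left(prev, 4)   // fold the previous key words into
  prev ^= shift_bytes_left(prev, 8)   // a running XOR prefix
  return prev ^ t
}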

For details about operations such as MixColumns and ShiftRowsInv, see Intel's white paper.

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

  • x86: supported by Intel (Westmere, Sandy/Ivy Bridge, Haswell, Skylake, etc.) and AMD (>= Jaguar, >= Puma, >= Zen 1).
  • ARM: optionally supported on ARMv8-A (ARM Cortex-A30/50/70 cores), Qualcomm 805, Exynos 3 series.
  • RISC-V has no such dedicated instructions, but a number of RISC-V chips include integrated AES co-processors, and such instructions may be standardized in the future.
  • POWER8/9/10 also support this (thanks to @nemequ for pointing that out).

What use cases are there?

@Maratyszcza

AES instructions differ very substantially between x86 and ARM. The proposal should address a way to unify over these differences.

@nemequ

nemequ commented Jul 23, 2021

POWER also has AES instructions, FWIW:

  • vcipher (vec_cipher_be)
  • vcipherlast (vec_cipherlast_be)
  • vncipher (vec_ncipher_be)
  • vncipherlast (vec_ncipherlast_be)
  • vsbox (vec_sbox_be)

They were all added in POWER9 (ISA 3.0).

That said, I don't think these fit here since they're not really relaxed. Is there anywhere to gather ideas for a second version of the simd proposal?

@MaxGraey
Author

MaxGraey commented Jul 23, 2021

@Maratyszcza I added some details for x86 and ARM. ARM has slightly more low-level intrinsics, but they all seem to fit into one API. Later I'll add a couple more instructions related to AES.

@MaxGraey
Author

I added another operation, v128.aes.keygen. It's pretty easy to emulate on ARM; on PowerPC you can do something similar.

@penzn

penzn commented Jul 26, 2021

Looks reasonable to me.

@jan-wassenberg

+1 to these being useful. @nemequ we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right? SVE2 does that (unfortunately it's an optional extension), and that would also cover x86 VAES.

@penzn

penzn commented Jul 28, 2021

we can consider them 'relaxed' if we extend the definition to "each 128-bit block within a vector", right?

@jan-wassenberg, AFAIK, this proposal is limited to 128 bit operations, however we can probably consider it SIMD 2.0 of sorts, which should let us include these operations. I would be interested to pull this into flexible vectors to have the semantics you described 😉

@jan-wassenberg

@penzn OK :) I'd also welcome both v128 and flexible AES.

@jlb6740

jlb6740 commented Jul 30, 2021

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform? What's the fallback if a VM doesn't support these relaxed instructions?

@MaxGraey
Author

MaxGraey commented Jul 30, 2021

What's the fallback if a VM doesn't support these relaxed instructions?

You can follow the same strategy as with the qfma / qfms instructions, which expose an extra environment variable (fpenv). AES-NI is very similar in this respect: I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs, so it is better to follow a different strategy for implementing AES or hashes on such platforms.

It is also worth remembering that these instructions execute in constant time, which is quite important in terms of security.

The AES instructions are designed to mitigate all of the known timing and cache side channel leakage of sensitive data (from Ring 3 spy processes). Their latency is data-independent, and since all the computations are performed internally by the hardware, no lookup tables are required. Therefore, if the AES instructions are used properly (e.g., as in the following code examples) the AES encryption/decryption, as well as the Key Expansion, would have data-independent timing and would involve only data-independent memory access. Consequently, the AES instructions allow for writing high performance AES software which is, at the same time, protected against the currently known software side channel attacks

@jan-wassenberg

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I recently implemented a constant-time fallback using basic SIMD only.

@MaxGraey
Author

MaxGraey commented Jul 30, 2021

As far as I know, the most problematic part for non-SIMD emulation is the Galois field multiplication.

@penzn

penzn commented Jul 30, 2021

I recently implemented a constant-time fallback using basic SIMD only.

This reminds me of some instructions we have added late in the SIMD proposal. I think it would be doable, but we would need to prototype and measure.

I don't think it is possible to safely and efficiently emulate these instructions on scalar-only VMs

I don't think we should worry about scalar-only VMs, especially after SIMD making it to stage 5.

@lars-t-hansen
Contributor

Explicit instructions for AES support are useful but I share the concern/question that it doesn't really fit here. What SIMD instructions are being relaxed here? Don't we expect the same result no matter the platform?

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

@MaxGraey
Author

MaxGraey commented Aug 2, 2021

Actually, I don't mind if it's a Fixed SIMD 2.0 or Fixed SIMD Extensions proposal. It's just that such a repository does not exist yet

@nemequ

nemequ commented Aug 2, 2021

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

The summary for this proposal is:

This proposal adds a set of useful SIMD instructions that introduce local non-determinism (where the results of the instructions may vary based on hardware support).

So unless I'm missing something (definitely possible, I'm not really familiar with how these work internally) this seems out of scope to me. FWIW, I'd very much like to start gathering potential instructions for a SIMD 2.0 extension, and I think this would be a great addition to that list.

@lars-t-hansen
Contributor

Just to be pedantic, I don't think we should restrict this proposal to "relaxed" variants of existing SIMD instructions, assuming that that's what you were intending to say here. IMO any 128-bit SIMD operation whose best implementation might have different corner case behavior on some platform of interest should be fair game for the proposal (fma being a case in point).

Does this have different behavior on different platforms? FMA fits because the results will be different on different platforms since the operations may or may not actually be fused.

Again, being pedantic, FMA is not a "relaxed" version of any previous instruction, it is a new instruction with relaxed behavior. IMO all 128-bit instructions with relaxed behavior are in scope.

I don't know whether the AES instructions fall into that category; I was simply reacting to @jlb6740's wording and wanted to clarify that instructions with new behaviors are not out of scope for relaxed-simd.

@MaxGraey
Author

MaxGraey commented Aug 3, 2021

As I understand it, Relaxed SIMD is a category of SIMD operations that do not guarantee deterministic behavior (e.g. with NaN or signed zero) or accuracy (as with the FMA instruction on platforms where it is not available). In the case of AES we have only one source of non-determinism: we cannot guarantee the same performance (cost model) on all platforms. In fact, many of the already standardized fixed-width SIMD instructions cannot guarantee this either. So as far as I'm concerned, this really belongs with SIMD 2.0.

@ngzhian
Member

ngzhian commented Aug 5, 2021

I think we agree that these AES instructions don't fit into relaxed-simd, but we also don't have a SIMD 2.0 proposal. As concrete next step, we can approach this as a new feature and follow the process that we have in place for new features, starting with filing an issue on the design repository (I can also transfer this issue there).

It is up to interested participants to scope out the proposal: it could be a SIMD 2.0 proposal consisting of even more SIMD instructions, or it could be only AES instructions (as detailed here), or it could be a broader crypto instructions proposal. This will be driven by the proposal champion(s).

I can provide pointers and help guide the process if @MaxGraey (or anyone else) is willing to take this on.

Update: in the 2021-08-06 SIMD sync, we discussed a potential streamlined/lightweight process to make small changes to the spec, this is not confirmed yet, but could also work for AES (if/when it happens).

@penzn

penzn commented Oct 1, 2021

@MaxGraey, from discussion at the SIMD subgroup meeting, we should probably highlight some use cases for these instructions, then it can be taken to the CG.

@MaxGraey
Author

MaxGraey commented Oct 1, 2021

@penzn the "What use cases are there?" point at the end lists some use cases. To summarize, these instructions are used mainly in two ways: to speed up AES encryption/decryption, and to speed up cryptographic and non-cryptographic hash functions. Cryptographic hashes can often be used for databases and for very simple HashMap / HashSet containers with high resistance to HashDoS attacks.
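
To illustrate the hashing use case: designs in the spirit of aHash or Meow Hash treat a single enc round as a cheap mixing primitive rather than as encryption, folding each input block into a running state (a sketch in the thread's pseudocode, not any specific hash's exact construction):

fn hash_mix(v128 state, v128 input) {
  // One round diffuses each state byte across a four-byte column
  // (SubBytes + ShiftRows + MixColumns) and XORs in the input block.
  return v128.aes.enc(state, input)
}

A few such rounds over the input, plus some finalization rounds, give a keyed, HashDoS-resistant hash far cheaper than a full cryptographic hash.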

@penzn

penzn commented Oct 1, 2021

Thanks @MaxGraey, to me this looks sufficient. @dtig, what do you think? Should we move the issue to design?

@dtig
Member

dtig commented Oct 4, 2021

Thanks @penzn and @MaxGraey this sounds like a good start, I'll transfer this issue to the design repository as it is out of scope for relaxed-simd. Please note that a streamlined/lightweight process will still need a champion and satisfy the requirements outlined in the Phases document.

@dtig dtig transferred this issue from WebAssembly/relaxed-simd Oct 4, 2021