Improve AES performance #12

Merged · 6 commits · Feb 24, 2022

Conversation

@qmuntal (Member) commented on Feb 22, 2022

This PR contains two complementary AES performance improvements:

Using AES wrappers

Commit: 99e3044

Several AES functions and methods were calling EVP_EncryptUpdate, EVP_DecryptUpdate and EVP_CipherUpdate without using the third parameter, out_len, an integer pointer that OpenSSL fills with the length of the output buffer.

When allocated on the Go side, out_len escapes to the heap, which puts unnecessary pressure on the garbage collector and degrades performance. To avoid that, we now allocate out_len on the C stack inside a small wrapper and simply discard the result.
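
To make the pattern concrete, here is a minimal, self-contained sketch of the wrapper idea. The C functions stub_update and stub_update_wrapper are stand-ins invented for illustration, not the actual OpenSSL bindings: the out-parameter lives on the C stack inside the wrapper, so the Go caller never allocates an int that escapes to the heap.

// A stand-in sketch of the out_len wrapper pattern (not the real bindings).
package main

/*
// Stand-in for an OpenSSL-style *_Update function with an out_len out-parameter.
static int stub_update(unsigned char *out, int *out_len,
                       const unsigned char *in, int in_len) {
    for (int i = 0; i < in_len; i++) out[i] = in[i];
    *out_len = in_len;
    return 1;
}

// Wrapper: out_len lives on the C stack and is discarded, so the Go caller
// does not need to allocate an int that escapes to the heap.
static int stub_update_wrapper(unsigned char *out,
                               const unsigned char *in, int in_len) {
    int out_len;
    return stub_update(out, &out_len, in, in_len);
}
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    in := []byte("one AES block!!!")
    out := make([]byte, len(in))
    // Without the wrapper, Go would declare `var outLen C.int` and pass &outLen
    // to C; that variable escapes to the heap on every call.
    ok := C.stub_update_wrapper(
        (*C.uchar)(unsafe.Pointer(&out[0])),
        (*C.uchar)(unsafe.Pointer(&in[0])),
        C.int(len(in)),
    )
    fmt.Println(ok == 1, string(out))
}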

Declare pointer types in C

Commit: 9356eff

We were storing EVP_CIPHER_CTX and EVP_CIPHER objects as Go pointer types even though they are opaque C pointers: Go never creates them nor performs any special operation on them, everything happens on the C side, so Go does not need to know they are pointers. If we store those objects behind void* typedefs, Go escape analysis has a better chance of keeping them on the stack.
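
As an illustration, here is a minimal, self-contained sketch of the opaque-handle pattern. The GO_CTX_PTR typedef and the ctx_new/ctx_free helpers are invented for this example, not the names used in the PR: Go only stores and passes the handle, never dereferences it, and never needs the underlying struct definition.

// A stand-in sketch of the void* typedef pattern (not the real bindings).
package main

/*
#include <stdlib.h>

// Opaque handle: to Go this is just a word-sized value, not a typed C struct pointer.
typedef void* GO_CTX_PTR;

static GO_CTX_PTR ctx_new(void)            { return malloc(16); }
static void       ctx_free(GO_CTX_PTR ctx) { free(ctx); }
*/
import "C"

import "fmt"

func main() {
    // The handle is created, used and freed entirely on the C side;
    // Go never inspects what it points to.
    ctx := C.ctx_new()
    defer C.ctx_free(ctx)
    fmt.Println("got opaque handle:", ctx != nil)
}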

A side-effect of this change is that we rely less on types defined in OpenSSL headers, which is a step in the right direction: the "north star" is to eliminate the prerequisite of OpenSSL development headers for building this package.

As a side note, boringcrypto does not suffer as much from these allocations because BoringSSL does not take opaque pointers as function arguments but ordinary structs, which are easier to allocate on the stack.

Results

AES encryption and decryption no longer allocate and have ~70% more throughput.
AES Seal and Open allocate ~75% less and have ~20% more throughput.

name                  old time/op    new time/op     delta
AESEncrypt-4             171ns ± 3%      102ns ± 6%   -40.22%  (p=0.000 n=10+10)
AESDecrypt-4             173ns ± 4%      101ns ± 6%   -41.36%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4    1.28µs ± 2%     1.07µs ± 2%   -16.65%  (p=0.000 n=10+9)
AESGCM/Seal-128-64-4    1.31µs ± 3%     1.07µs ± 4%   -18.69%  (p=0.000 n=10+10)

name                  old speed      new speed       delta
AESEncrypt-4          93.5MB/s ± 3%  156.5MB/s ± 6%   +67.45%  (p=0.000 n=10+10)
AESDecrypt-4          92.7MB/s ± 4%  158.1MB/s ± 5%   +70.61%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4  50.0MB/s ± 2%   59.7MB/s ± 5%   +19.31%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4  48.7MB/s ± 3%   59.9MB/s ± 4%   +22.93%  (p=0.000 n=10+10)

name                  old B/op   new B/op    delta
AESEncrypt-4             16.0B ± 0%       0.0B       -100.00%  (p=0.000 n=10+10)
AESDecrypt-4             16.0B ± 0%       0.0B       -100.00%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4      120B ± 0%        32B ± 0%   -73.33%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4      112B ± 0%        32B ± 0%   -71.43%  (p=0.000 n=10+10)

name                  old allocs/op  new allocs/op   delta
AESEncrypt-4              2.00 ± 0%       0.00       -100.00%  (p=0.000 n=10+10)
AESDecrypt-4              2.00 ± 0%       0.00       -100.00%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4      12.0 ± 0%        3.0 ± 0%   -75.00%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4      11.0 ± 0%        3.0 ± 0%   -72.73%  (p=0.000 n=10+10)

@qmuntal (Member, Author) commented on Feb 22, 2022

AES encrypt and decrypt performance is almost maxed out, even though we are still about 10 times slower than standard library crypto: ~10ns/op for the standard library vs ~100ns/op here in my dev environment.

If we generate a CPU profile from BenchmarkAESEncrypt we can see that the cgo call and its boilerplate account for ~90% of the time, and we can't reduce the number of cgo calls any further.

$ go test -run=- -bench=BenchmarkAESEncrypt -memprofile mem.out -cpuprofile cpu.out
$ go tool pprof cpu.out 
File: openssl.test
Build ID: 998359c183439631807bc0a0210fe2d55a6d6072
Type: cpu
Time: Feb 22, 2022 at 10:34am (UTC)
Duration: 1.40s, Total samples = 1.28s (91.18%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.27s, 99.22% of 1.28s total
Showing top 10 nodes out of 15
      flat  flat%   sum%        cum   cum%
     0.80s 62.50% 62.50%      1.23s 96.09%  runtime.cgocall
     0.15s 11.72% 74.22%      0.22s 17.19%  runtime.exitsyscallfast
     0.09s  7.03% 81.25%      0.11s  8.59%  runtime.casgstatus
     0.09s  7.03% 88.28%      0.42s 32.81%  runtime.exitsyscall
     0.06s  4.69% 92.97%      0.06s  4.69%  runtime.wirep
     0.02s  1.56% 94.53%      1.27s 99.22%  github.com/microsoft/go-crypto-openssl/openssl.(*aesCipher).Encrypt
     0.02s  1.56% 96.09%      1.25s 97.66%  github.com/microsoft/go-crypto-openssl/openssl._Cfunc_go_openssl_EVP_EncryptUpdate_wrapper
     0.02s  1.56% 97.66%      0.02s  1.56%  runtime.nanotime (inline)
     0.01s  0.78% 98.44%      1.28s   100%  github.com/microsoft/go-crypto-openssl/openssl.BenchmarkAESEncrypt
     0.01s  0.78% 99.22%      0.01s  0.78%  runtime.exitsyscallfast_reacquired

@qmuntal (Member, Author) commented on Feb 22, 2022

AES Open and Seal still have room for improvement if we batch 5 cgo calls into one, but that would mean moving logic from Go to C, which I would rather avoid for safety's sake.
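
For reference, a batched Seal wrapper could look roughly like the sketch below. This is not part of this PR: go_seal_batched is an invented name, the five-call sequence (re-init with the nonce, AAD update, plaintext update, final, get tag) is my reading of the usual EVP GCM flow, and building it needs OpenSSL development headers and libcrypto.

// A hedged sketch of batching the GCM Seal sequence into one cgo call.
package main

/*
#cgo LDFLAGS: -lcrypto
#include <openssl/evp.h>

// One cgo call instead of five EVP calls crossed individually from Go.
static int go_seal_batched(EVP_CIPHER_CTX *ctx,
                           unsigned char *out, unsigned char *tag,
                           const unsigned char *nonce,
                           const unsigned char *aad, int aad_len,
                           const unsigned char *in, int in_len) {
    int len;
    if (!EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce)) return 0;
    if (aad_len > 0 && !EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len)) return 0;
    if (!EVP_EncryptUpdate(ctx, out, &len, in, in_len)) return 0;
    if (!EVP_EncryptFinal_ex(ctx, out + len, &len)) return 0;
    return EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
}
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    key := make([]byte, 16)   // all-zero key, for illustration only
    nonce := make([]byte, 12) // GCM default nonce length
    plaintext := []byte("hello, batched seal")

    ctx := C.EVP_CIPHER_CTX_new()
    defer C.EVP_CIPHER_CTX_free(ctx)
    // Bind cipher and key once; the per-operation state is set in the wrapper.
    if C.EVP_EncryptInit_ex(ctx, C.EVP_aes_128_gcm(), nil,
        (*C.uchar)(unsafe.Pointer(&key[0])), nil) != 1 {
        panic("EVP_EncryptInit_ex failed")
    }

    out := make([]byte, len(plaintext)+16) // room for ciphertext (tag is separate)
    tag := make([]byte, 16)
    ok := C.go_seal_batched(ctx,
        (*C.uchar)(unsafe.Pointer(&out[0])), (*C.uchar)(unsafe.Pointer(&tag[0])),
        (*C.uchar)(unsafe.Pointer(&nonce[0])),
        nil, 0,
        (*C.uchar)(unsafe.Pointer(&plaintext[0])), C.int(len(plaintext)))
    fmt.Println("sealed:", ok == 1, "ciphertext bytes:", len(plaintext))
}

The trade-off mentioned above is visible here: error handling and buffer arithmetic move into C, outside Go's memory-safety guarantees.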

@qmuntal (Member, Author) commented on Feb 22, 2022

We run ~7% faster on go1.18rc1 than on go1.17.7, which is strange: that is a significant speedup, and go1.18rc1 does not contain any major improvement to the linux-amd64 ABI or to the cgo boilerplate, at least none that I know of.

@qmuntal force-pushed the dev/qmuntal/bench branch 2 times, most recently from 5cd4fbd to 6eb30c9 on February 22, 2022 at 14:55
@qmuntal changed the title from "Improve AES perfomance" to "Improve AES performance" on Feb 22, 2022
@@ -65,11 +68,25 @@ FOR_ALL_OPENSSL_FUNCTIONS
 #undef DEFINEFUNC_RENAMED
 #undef DEFINEFUNC_FALLBACK

-// This wrapper allocate out_len on the C stack, and check that it matches the expected
+// These wrappers allocate out_len on the C stack, and check that it matches the expected
Contributor commented:

How do these wrappers actually check that the value is what was expected?

qmuntal (Member, Author) replied:

Good catch, I took the wrapping idea from the boring branch and just copied the comment as-is.

We use these wrappers in functions where we can't know the expected output length, so I will reword the comment.

}

func BenchmarkAESGCM(b *testing.B) {
	for _, length := range []int{64, 1350} {
Member commented:

Is this just random "small" and "large", or is it testing both sides of a threshold or something like that?

qmuntal (Member, Author) replied:

It's just a random small and large size, but I will remove the large benchmark because it does not add any information that is not already covered by the small case.

openssl/goopenssl.h: an additional review comment, marked outdated and resolved.