Improve AES performance #12

Merged · 6 commits · Feb 24, 2022

Conversation

@qmuntal (Member) commented on Feb 22, 2022

This PR contains two complementary AES performance improvements:

Using AES wrappers

Commit: 99e3044

Several AES functions and methods were calling EVP_EncryptUpdate, EVP_DecryptUpdate and EVP_CipherUpdate without using the third parameter, out_len, an integer pointer that OpenSSL fills with the length of the output buffer.

When allocated on the Go side, out_len escapes to the heap, which puts unnecessary pressure on the garbage collector and degrades performance. To avoid that, we now allocate out_len on the C stack inside a small wrapper and simply discard the result.
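
To make the pattern concrete, here is a minimal, self-contained sketch of the wrapper idea. The C functions stub_update and stub_update_wrapper are stand-ins invented for illustration, not the actual OpenSSL bindings: the out-parameter lives on the C stack inside the wrapper, so the Go caller never allocates an int that escapes to the heap.

// A stand-in sketch of the out_len wrapper pattern (not the real bindings).
package main

/*
// Stand-in for an OpenSSL-style *_Update function with an out_len out-parameter.
static int stub_update(unsigned char *out, int *out_len,
                       const unsigned char *in, int in_len) {
    for (int i = 0; i < in_len; i++) out[i] = in[i];
    *out_len = in_len;
    return 1;
}

// Wrapper: out_len lives on the C stack and is discarded, so the Go caller
// does not need to allocate an int that escapes to the heap.
static int stub_update_wrapper(unsigned char *out,
                               const unsigned char *in, int in_len) {
    int out_len;
    return stub_update(out, &out_len, in, in_len);
}
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    in := []byte("one AES block!!!")
    out := make([]byte, len(in))
    // Without the wrapper, Go would declare `var outLen C.int` and pass &outLen
    // to C; that variable escapes to the heap on every call.
    ok := C.stub_update_wrapper(
        (*C.uchar)(unsafe.Pointer(&out[0])),
        (*C.uchar)(unsafe.Pointer(&in[0])),
        C.int(len(in)),
    )
    fmt.Println(ok == 1, string(out))
}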

Declare pointer types in C

Commit: 9356eff

We were storing EVP_CIPHER_CTX and EVP_CIPHER objects as Go pointer types even though they are opaque C pointers: Go never creates them nor performs any special operation on them, everything happens on the C side, so Go does not need to know they are pointers. If we store those objects behind void* typedefs, Go escape analysis has a better chance of keeping them on the stack.
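
As an illustration, here is a minimal, self-contained sketch of the opaque-handle pattern. The GO_CTX_PTR typedef and the ctx_new/ctx_free helpers are invented for this example, not the names used in the PR: Go only stores and passes the handle, never dereferences it, and never needs the underlying struct definition.

// A stand-in sketch of the void* typedef pattern (not the real bindings).
package main

/*
#include <stdlib.h>

// Opaque handle: to Go this is just a word-sized value, not a typed C struct pointer.
typedef void* GO_CTX_PTR;

static GO_CTX_PTR ctx_new(void)            { return malloc(16); }
static void       ctx_free(GO_CTX_PTR ctx) { free(ctx); }
*/
import "C"

import "fmt"

func main() {
    // The handle is created, used and freed entirely on the C side;
    // Go never inspects what it points to.
    ctx := C.ctx_new()
    defer C.ctx_free(ctx)
    fmt.Println("got opaque handle:", ctx != nil)
}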

A side-effect of this change is that we rely less on types defined in OpenSSL headers, which is a step in the right direction: the "north star" is to eliminate the prerequisite of OpenSSL development headers for building this package.

As a side note, boringcrypto does not suffer as much from these allocations because BoringSSL does not take opaque pointers as function arguments but ordinary structs, which are easier to allocate on the stack.

Results

AES encryption and decryption no longer allocate and have ~70% more throughput.
AES Seal and Open allocate ~75% less and have ~20% more throughput.

name                  old time/op    new time/op     delta
AESEncrypt-4             171ns ± 3%      102ns ± 6%   -40.22%  (p=0.000 n=10+10)
AESDecrypt-4             173ns ± 4%      101ns ± 6%   -41.36%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4    1.28µs ± 2%     1.07µs ± 2%   -16.65%  (p=0.000 n=10+9)
AESGCM/Seal-128-64-4    1.31µs ± 3%     1.07µs ± 4%   -18.69%  (p=0.000 n=10+10)

name                  old speed      new speed       delta
AESEncrypt-4          93.5MB/s ± 3%  156.5MB/s ± 6%   +67.45%  (p=0.000 n=10+10)
AESDecrypt-4          92.7MB/s ± 4%  158.1MB/s ± 5%   +70.61%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4  50.0MB/s ± 2%   59.7MB/s ± 5%   +19.31%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4  48.7MB/s ± 3%   59.9MB/s ± 4%   +22.93%  (p=0.000 n=10+10)

name                  old B/op   new B/op    delta
AESEncrypt-4             16.0B ± 0%       0.0B       -100.00%  (p=0.000 n=10+10)
AESDecrypt-4             16.0B ± 0%       0.0B       -100.00%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4      120B ± 0%        32B ± 0%   -73.33%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4      112B ± 0%        32B ± 0%   -71.43%  (p=0.000 n=10+10)

name                  old allocs/op  new allocs/op   delta
AESEncrypt-4              2.00 ± 0%       0.00       -100.00%  (p=0.000 n=10+10)
AESDecrypt-4              2.00 ± 0%       0.00       -100.00%  (p=0.000 n=10+10)
AESGCM/Open-128-64-4      12.0 ± 0%        3.0 ± 0%   -75.00%  (p=0.000 n=10+10)
AESGCM/Seal-128-64-4      11.0 ± 0%        3.0 ± 0%   -72.73%  (p=0.000 n=10+10)

@qmuntal (Member, Author) commented on Feb 22, 2022

AES encrypt and decrypt performance is almost maxed out, even though we are still about 10 times slower than standard library crypto: ~10ns/op for the standard library vs ~100ns/op here in my dev environment.

If we generate a CPU profile from BenchmarkAESEncrypt we can see that the cgo call and its boilerplate account for ~90% of the time, and we can't reduce the number of cgo calls any further.

$ go test -run=- -bench=BenchmarkAESEncrypt -memprofile mem.out -cpuprofile cpu.out
$ go tool pprof cpu.out 
File: openssl.test
Build ID: 998359c183439631807bc0a0210fe2d55a6d6072
Type: cpu
Time: Feb 22, 2022 at 10:34am (UTC)
Duration: 1.40s, Total samples = 1.28s (91.18%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.27s, 99.22% of 1.28s total
Showing top 10 nodes out of 15
      flat  flat%   sum%        cum   cum%
     0.80s 62.50% 62.50%      1.23s 96.09%  runtime.cgocall
     0.15s 11.72% 74.22%      0.22s 17.19%  runtime.exitsyscallfast
     0.09s  7.03% 81.25%      0.11s  8.59%  runtime.casgstatus
     0.09s  7.03% 88.28%      0.42s 32.81%  runtime.exitsyscall
     0.06s  4.69% 92.97%      0.06s  4.69%  runtime.wirep
     0.02s  1.56% 94.53%      1.27s 99.22%  github.com/microsoft/go-crypto-openssl/openssl.(*aesCipher).Encrypt
     0.02s  1.56% 96.09%      1.25s 97.66%  github.com/microsoft/go-crypto-openssl/openssl._Cfunc_go_openssl_EVP_EncryptUpdate_wrapper
     0.02s  1.56% 97.66%      0.02s  1.56%  runtime.nanotime (inline)
     0.01s  0.78% 98.44%      1.28s   100%  github.com/microsoft/go-crypto-openssl/openssl.BenchmarkAESEncrypt
     0.01s  0.78% 99.22%      0.01s  0.78%  runtime.exitsyscallfast_reacquired

@qmuntal (Member, Author) commented on Feb 22, 2022

AES Open and Seal still have room for improvement if we batch 5 cgo calls into one, but that would mean moving logic from Go to C, which I would rather avoid for safety's sake.
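
For reference, a batched Seal wrapper could look roughly like the sketch below. This is not part of this PR: go_seal_batched is an invented name, the five-call sequence (re-init with the nonce, AAD update, plaintext update, final, get tag) is my reading of the usual EVP GCM flow, and building it needs OpenSSL development headers and libcrypto.

// A hedged sketch of batching the GCM Seal sequence into one cgo call.
package main

/*
#cgo LDFLAGS: -lcrypto
#include <openssl/evp.h>

// One cgo call instead of five EVP calls crossed individually from Go.
static int go_seal_batched(EVP_CIPHER_CTX *ctx,
                           unsigned char *out, unsigned char *tag,
                           const unsigned char *nonce,
                           const unsigned char *aad, int aad_len,
                           const unsigned char *in, int in_len) {
    int len;
    if (!EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, nonce)) return 0;
    if (aad_len > 0 && !EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len)) return 0;
    if (!EVP_EncryptUpdate(ctx, out, &len, in, in_len)) return 0;
    if (!EVP_EncryptFinal_ex(ctx, out + len, &len)) return 0;
    return EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
}
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    key := make([]byte, 16)   // all-zero key, for illustration only
    nonce := make([]byte, 12) // GCM default nonce length
    plaintext := []byte("hello, batched seal")

    ctx := C.EVP_CIPHER_CTX_new()
    defer C.EVP_CIPHER_CTX_free(ctx)
    // Bind cipher and key once; the per-operation state is set in the wrapper.
    if C.EVP_EncryptInit_ex(ctx, C.EVP_aes_128_gcm(), nil,
        (*C.uchar)(unsafe.Pointer(&key[0])), nil) != 1 {
        panic("EVP_EncryptInit_ex failed")
    }

    out := make([]byte, len(plaintext)+16) // room for ciphertext (tag is separate)
    tag := make([]byte, 16)
    ok := C.go_seal_batched(ctx,
        (*C.uchar)(unsafe.Pointer(&out[0])), (*C.uchar)(unsafe.Pointer(&tag[0])),
        (*C.uchar)(unsafe.Pointer(&nonce[0])),
        nil, 0,
        (*C.uchar)(unsafe.Pointer(&plaintext[0])), C.int(len(plaintext)))
    fmt.Println("sealed:", ok == 1, "ciphertext bytes:", len(plaintext))
}

The trade-off mentioned above is visible here: error handling and buffer arithmetic move into C, outside Go's memory-safety guarantees.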

@qmuntal (Member, Author) commented on Feb 22, 2022

We run ~7% faster on go1.18rc1 than on go1.17.7, which is strange: that is a significant speedup, and go1.18rc1 does not contain any major improvement to the linux-amd64 ABI or to the cgo boilerplate, at least none that I know of.

@qmuntal force-pushed the dev/qmuntal/bench branch 2 times, most recently from 5cd4fbd to 6eb30c9 on February 22, 2022 at 14:55
@qmuntal changed the title from "Improve AES perfomance" to "Improve AES performance" on Feb 22, 2022
@@ -65,11 +68,25 @@ FOR_ALL_OPENSSL_FUNCTIONS
 #undef DEFINEFUNC_RENAMED
 #undef DEFINEFUNC_FALLBACK

-// This wrapper allocate out_len on the C stack, and check that it matches the expected
+// These wrappers allocate out_len on the C stack, and check that it matches the expected
Contributor commented:

How do these wrappers actually check that the value is what was expected?

qmuntal (Member, Author) replied:

Good catch, I took the wrapping idea from the boring branch and just copied the comment as-is.

We use these wrappers in functions where we can't know the expected output length, so I will reword the comment.

}

func BenchmarkAESGCM(b *testing.B) {
	for _, length := range []int{64, 1350} {
Member commented:

Is this just random "small" and "large", or is it testing both sides of a threshold or something like that?

qmuntal (Member, Author) replied:

It's just a random small and large size, but I will remove the large benchmark because it does not add any information that is not already covered by the small case.

openssl/goopenssl.h: an additional review comment, marked outdated and resolved.