crypto/cipher: A performance optimization idea of “crypto” lib for ARM-arch #42010

Open
kkoogqw opened this issue Oct 16, 2020 · 2 comments

kkoogqw commented Oct 16, 2020

What version of Go are you using (go version)?

$ go version 1.14.4

While profiling AES-CBC on amd64 and arm64, I found that `func xorBytes(dst, a, b []byte) int` and `func safeXORBytes(dst, a, b []byte, n int)` (in crypto/cipher/xor_generic.go) always appear in the top 15 of the pprof list on arm64. On amd64, by contrast, the same operation uses SSE2 SIMD instructions through `func xorBytesSSE2(dst, a, b *byte, n int)`.

```bash
(pprof) top10
Showing nodes accounting for 700ms, 55.12% of 1270ms total
Showing top 10 nodes out of 113
      flat  flat%   sum%        cum   cum%
     170ms 13.39% 13.39%      530ms 41.73%  runtime.mallocgc
      90ms  7.09% 20.47%       90ms  7.09%  crypto/cipher.safeXORBytes
      90ms  7.09% 27.56%      130ms 10.24%  syscall.Syscall
      80ms  6.30% 33.86%       80ms  6.30%  runtime.nextFreeFast (inline)
      60ms  4.72% 38.58%       60ms  4.72%  runtime.publicationBarrier
      50ms  3.94% 42.52%       50ms  3.94%  crypto/aes.expandKeyAsm
      50ms  3.94% 46.46%      140ms 11.02%  crypto/cipher.xorBytes
      40ms  3.15% 49.61%       40ms  3.15%  runtime.acquirem (inline)
      40ms  3.15% 52.76%       40ms  3.15%  runtime.memclrNoHeapPointers
      30ms  2.36% 55.12%       30ms  2.36%  crypto/internal/subtle.InexactOverlap
```

Could we use arm64 SIMD instructions to optimize this function?

@kkoogqw kkoogqw changed the title A performance optimization idea of “crypto” lib for ARM-arch crypto/cipher: A performance optimization idea of “crypto” lib for ARM-arch Oct 16, 2020
@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 16, 2020
@toothrot toothrot added this to the Backlog milestone Oct 16, 2020
@gopherbot
Contributor

Change https://golang.org/cl/142537 mentions this issue: crypto/cipher: use Neon for xor on arm64

gopherbot pushed a commit that referenced this issue Nov 7, 2020
cpu: HiSilicon(R) Kirin 970 2.4GHz

name                 old time/op    new time/op    delta
XORBytes/8Bytes        39.8ns ± 0%    17.3ns ± 0%    -56.53%  (p=0.000 n=10+10)
XORBytes/128Bytes       376ns ± 0%      28ns ± 0%    -92.63%  (p=0.000 n=10+8)
XORBytes/2048Bytes     5.67µs ± 0%    0.22µs ± 0%    -96.03%  (p=0.000 n=10+10)
XORBytes/32768Bytes    90.3µs ± 0%     3.5µs ± 0%    -96.12%  (p=0.000 n=10+10)
AESGCMSeal1K            853ns ± 0%     853ns ± 0%       ~     (all equal)
AESGCMOpen1K            876ns ± 0%     874ns ± 0%     -0.23%  (p=0.000 n=10+10)
AESGCMSign8K           3.09µs ± 0%    3.08µs ± 0%     -0.34%  (p=0.000 n=10+9)
AESGCMSeal8K           5.87µs ± 0%    5.87µs ± 0%     +0.01%  (p=0.008 n=10+8)
AESGCMOpen8K           5.82µs ± 0%    5.82µs ± 0%     +0.02%  (p=0.037 n=10+10)
AESCFBEncrypt1K        7.05µs ± 0%    4.27µs ± 0%    -39.38%  (p=0.000 n=10+10)
AESCFBDecrypt1K        7.12µs ± 0%    4.30µs ± 0%    -39.54%  (p=0.000 n=10+9)
AESCFBDecrypt8K        56.7µs ± 0%    34.1µs ± 0%    -39.82%  (p=0.000 n=10+10)
AESOFB1K               5.20µs ± 0%    2.54µs ± 0%    -51.07%  (p=0.000 n=10+10)
AESCTR1K               4.96µs ± 0%    2.30µs ± 0%    -53.62%  (p=0.000 n=9+10)
AESCTR8K               39.5µs ± 0%    18.2µs ± 0%    -53.98%  (p=0.000 n=8+10)
AESCBCEncrypt1K        5.81µs ± 0%    3.07µs ± 0%    -47.13%  (p=0.000 n=10+8)
AESCBCDecrypt1K        5.83µs ± 0%    3.10µs ± 0%    -46.84%  (p=0.000 n=10+8)

name                 old speed      new speed      delta
XORBytes/8Bytes       201MB/s ± 0%   461MB/s ± 0%   +129.80%  (p=0.000 n=6+10)
XORBytes/128Bytes     340MB/s ± 0%  4625MB/s ± 0%  +1259.91%  (p=0.000 n=8+10)
XORBytes/2048Bytes    361MB/s ± 0%  9088MB/s ± 0%  +2414.23%  (p=0.000 n=8+10)
XORBytes/32768Bytes   363MB/s ± 0%  9350MB/s ± 0%  +2477.44%  (p=0.000 n=10+10)
AESGCMSeal1K         1.20GB/s ± 0%  1.20GB/s ± 0%     -0.02%  (p=0.041 n=10+10)
AESGCMOpen1K         1.17GB/s ± 0%  1.17GB/s ± 0%     +0.20%  (p=0.000 n=10+10)
AESGCMSign8K         2.65GB/s ± 0%  2.66GB/s ± 0%     +0.35%  (p=0.000 n=10+9)
AESGCMSeal8K         1.40GB/s ± 0%  1.40GB/s ± 0%     -0.01%  (p=0.000 n=10+7)
AESGCMOpen8K         1.41GB/s ± 0%  1.41GB/s ± 0%     -0.03%  (p=0.022 n=10+10)
AESCFBEncrypt1K       145MB/s ± 0%   238MB/s ± 0%    +64.95%  (p=0.000 n=10+10)
AESCFBDecrypt1K       143MB/s ± 0%   237MB/s ± 0%    +65.39%  (p=0.000 n=10+9)
AESCFBDecrypt8K       144MB/s ± 0%   240MB/s ± 0%    +66.15%  (p=0.000 n=10+10)
AESOFB1K              196MB/s ± 0%   401MB/s ± 0%   +104.35%  (p=0.000 n=9+10)
AESCTR1K              205MB/s ± 0%   443MB/s ± 0%   +115.57%  (p=0.000 n=7+10)
AESCTR8K              207MB/s ± 0%   450MB/s ± 0%   +117.27%  (p=0.000 n=10+10)
AESCBCEncrypt1K       176MB/s ± 0%   334MB/s ± 0%    +89.15%  (p=0.000 n=10+8)
AESCBCDecrypt1K       176MB/s ± 0%   330MB/s ± 0%    +88.08%  (p=0.000 n=10+9)

Updates #42010

Change-Id: I75e6d66fd0070e184d93b020c55a7580c713647c
Reviewed-on: https://go-review.googlesource.com/c/go/+/142537
Reviewed-by: Meng Zhuo <[email protected]>
Reviewed-by: Filippo Valsorda <[email protected]>
Run-TryBot: Meng Zhuo <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Trust: Meng Zhuo <[email protected]>
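
The XORBytes rows in the tables above report time per operation and throughput for several input sizes. A rough way to take a similar measurement outside of `go test` is sketched below, assuming a hypothetical stand-in `xorGeneric` (the function benchmarked in the commit is unexported, so it cannot be called directly) and a hypothetical helper `measure`. Numbers will differ by machine and say nothing about the figures quoted above.

```go
package main

import (
	"fmt"
	"testing"
)

// xorGeneric is a hypothetical byte-wise stand-in for the function under test.
func xorGeneric(dst, a, b []byte) {
	for i := range dst {
		dst[i] = a[i] ^ b[i]
	}
}

// measure benchmarks xorGeneric on inputs of the given size in bytes.
func measure(size int) testing.BenchmarkResult {
	testing.Init() // registers testing flags; no effect if already called
	a := make([]byte, size)
	b := make([]byte, size)
	dst := make([]byte, size)
	return testing.Benchmark(func(tb *testing.B) {
		tb.SetBytes(int64(size)) // lets the result carry bytes per iteration
		for i := 0; i < tb.N; i++ {
			xorGeneric(dst, a, b)
		}
	})
}

func main() {
	for _, size := range []int{8, 128, 2048, 32768} {
		r := measure(size)
		// Throughput in MB/s: total bytes processed divided by elapsed time.
		mbps := float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
		fmt.Printf("XORBytes/%dBytes\t%d ns/op\t%.0f MB/s\n", size, r.NsPerOp(), mbps)
	}
}
```

To compare two implementations the way the tables do, each would be run several times and the outputs fed to a tool such as benchstat.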
@adriancable

See my PR #53154, which adds non-NEON and NEON implementations of xorBytes for 32-bit ARM. This bridges the gap with ARM64.
