perf: adds (some) arm NEON assembly for koalabear and babybear #588

Merged on Jan 6, 2025 (81 commits)

Commits
539e90b
checkpoint
gbotrel Oct 15, 2024
c60ce4c
checkpoint
gbotrel Oct 15, 2024
5832e15
build: update bavard
gbotrel Oct 15, 2024
ec77f9c
checkpoint
gbotrel Oct 15, 2024
c9363b5
checkpoint
gbotrel Oct 16, 2024
0015ca0
checkpoint
gbotrel Oct 16, 2024
e38579f
checkpoint
gbotrel Oct 16, 2024
a334387
checkpoint
gbotrel Oct 16, 2024
637a68a
feat: added butterfly asm, experiment
gbotrel Oct 16, 2024
222036a
checkpoint
gbotrel Oct 16, 2024
6fbfb48
checkpoint
gbotrel Oct 16, 2024
6f4d15b
checkpoint
gbotrel Oct 18, 2024
8bc8267
checkpoint
gbotrel Oct 18, 2024
13521a9
checkpoint
gbotrel Oct 19, 2024
a0e023e
code cleaning
gbotrel Oct 19, 2024
ea8340b
feat: refactor code generation to allow space for arm64
gbotrel Oct 21, 2024
99283df
style: code cleaning
gbotrel Oct 21, 2024
08a7afa
feat: add reduce arm64 for test purposes
gbotrel Oct 21, 2024
771ce54
feat: restore vectors.json mimc
gbotrel Oct 22, 2024
e303fc2
style: add trace in mimc generate
gbotrel Oct 22, 2024
548dfd1
feat: update build tags for 32 bit target
gbotrel Oct 22, 2024
147a62e
feat: generalize arm64 mul for larger modulus
gbotrel Oct 22, 2024
3638f44
feat: add missing files
gbotrel Oct 22, 2024
ed06e35
checkpoint
gbotrel Nov 22, 2024
a170cb1
test passing
gbotrel Nov 23, 2024
dddb22d
feat: add babybear and koalabear
gbotrel Nov 25, 2024
6ae75a9
fix: restore line return to minimize diff
gbotrel Nov 25, 2024
18b7374
feat: restore non-big int inverse
gbotrel Nov 25, 2024
f86ef72
test: fix field config test to take word size into account
gbotrel Nov 25, 2024
2576e72
feat: cleanup add template for field element
gbotrel Dec 4, 2024
93b6669
feat: less ops in mul generic 31bits
gbotrel Dec 4, 2024
f276812
style: cleaning PR
gbotrel Dec 6, 2024
db78c7b
style: more cleaning
gbotrel Dec 6, 2024
5c569d0
feat: cleaner mont mul, slower
gbotrel Dec 6, 2024
0a20412
fix integration test
gbotrel Dec 6, 2024
ac9720a
style: more cleaning
gbotrel Dec 6, 2024
725a476
test: fix failing generator test
gbotrel Dec 6, 2024
5b11602
test: fix field config test to use bitsize
gbotrel Dec 6, 2024
8b3b4d1
feat: on 31bit field better branch-less add and sub
gbotrel Dec 6, 2024
d022a64
feat: skeletton for vec assembly on F31
gbotrel Dec 7, 2024
9216305
refactor: rename asm generation code for 4 words
gbotrel Dec 7, 2024
deb1d7f
refactor: rename asm generation code for 4 words
gbotrel Dec 7, 2024
8f230c2
feat: added F31 avx512 add
gbotrel Dec 8, 2024
d5a6b4d
feat: add avx512 sub for f31
gbotrel Dec 8, 2024
cf8370d
feat: added avx512 sum for f31
gbotrel Dec 8, 2024
8289d3d
feat: working version of the mul, optims to come
gbotrel Dec 8, 2024
f5e93c6
feat: clean up mul avx f31
gbotrel Dec 8, 2024
17beb83
feat: add avx512 scalarMul vec for F31
gbotrel Dec 8, 2024
c1b06b4
feat: add innerProdVec avx512 for f31
gbotrel Dec 8, 2024
9af3aed
style: code cleaning
gbotrel Dec 8, 2024
5514b84
style: more cleaning
gbotrel Dec 8, 2024
ccb7ad1
refactor: give nb bits to asm generation
gbotrel Dec 8, 2024
9651c06
refactor: distinguish nb of bits in file name for generated assembly
gbotrel Dec 8, 2024
31b74f2
feat: add missing file
gbotrel Dec 8, 2024
0f549b5
test: fix broken integration test
gbotrel Dec 9, 2024
58ffe06
Merge branch 'master' into experiment/31bits
gbotrel Dec 10, 2024
1a6e1bc
Merge branch 'experiment/31bits' into perf/f31_avx
gbotrel Dec 10, 2024
34fdd6e
chore: run go mod tidy
gbotrel Dec 10, 2024
e704847
Merge branch 'master' into perf/f31_avx
gbotrel Dec 10, 2024
6dd8803
Merge branch 'master' into perf/f31_avx
gbotrel Dec 10, 2024
21b9b80
chore: re run go generate to update doc
gbotrel Dec 10, 2024
38324d3
feat: prepare skeletton for NEON on F31
gbotrel Dec 10, 2024
fd46a1e
checkpoint
gbotrel Dec 10, 2024
93c01c4
feat: adds f31 neon add
gbotrel Dec 11, 2024
daa6642
feat: adds f31 neon sub
gbotrel Dec 11, 2024
48206ac
feat: add neon f31 sum
gbotrel Dec 11, 2024
5747fb2
feat,perf: faster sum on neon f31
gbotrel Dec 11, 2024
e76768f
perf: move q from const in vector broadcast
gbotrel Dec 11, 2024
18390fd
style: cleaning stuff
gbotrel Dec 11, 2024
5dcb744
checkpoint
gbotrel Dec 13, 2024
e8a14ad
Merge branch 'master' into perf/f31_avx
gbotrel Dec 13, 2024
69dd02d
test: fix integration test
gbotrel Dec 13, 2024
5840713
Merge branch 'perf/f31_avx' into perf/f31_neon
gbotrel Dec 13, 2024
26d7a9d
checkpoint
gbotrel Dec 16, 2024
4f6dde1
checkpoint
gbotrel Dec 19, 2024
de912a0
Merge branch 'master' into perf/f31_neon
gbotrel Dec 19, 2024
50e000b
feat: remove neon mul, not enough support in golang
gbotrel Dec 19, 2024
fa8cf7b
style: remove reg aliases in asm
gbotrel Dec 19, 2024
ca6efa4
docs: add more context to asm arm doc
gbotrel Dec 19, 2024
be88799
style: minor code cleaning
gbotrel Dec 19, 2024
7048ffe
feat: fix template for amd
gbotrel Dec 19, 2024
90 changes: 90 additions & 0 deletions field/asm/element_31b_arm64.s
@@ -0,0 +1,90 @@
// Code generated by gnark-crypto/generator. DO NOT EDIT.
#include "textflag.h"
#include "funcdata.h"
#include "go_asm.h"

// addVec(res, a, b *Element, n uint64)
// n is the number of blocks of 4 uint32 to process
TEXT ·addVec(SB), NOFRAME|NOSPLIT, $0-32
LDP res+0(FP), (R0, R1)
LDP b+16(FP), (R2, R3)
VMOVS $const_q, V3
VDUP V3.S[0], V3.S4 // broadcast q into V3

loop1:
CBZ R3, done2
VLD1.P 16(R1), [V0.S4]
VLD1.P 16(R2), [V1.S4]
VADD V0.S4, V1.S4, V1.S4 // b = a + b
VSUB V3.S4, V1.S4, V2.S4 // t = b - q
VUMIN V2.S4, V1.S4, V1.S4 // b = min(t, b)
VST1.P [V1.S4], 16(R0) // res = b
SUB $1, R3, R3
JMP loop1

done2:
RET
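
The add-then-min reduction in addVec can be modelled one lane at a time in plain Go. The sketch below is illustrative only and is not code from this PR: the names `addMod` and `minU32` are hypothetical, and the modulus is assumed to be the BabyBear prime 2^31 - 2^27 + 1. It relies on both inputs being below q (hence below 2^31), so `a + b` fits in 32 bits, and on `s - q` wrapping to a large unsigned value whenever `s < q`.

```go
package main

import "fmt"

// q is assumed to be the BabyBear modulus 2^31 - 2^27 + 1 (illustration only).
const q uint32 = 2013265921

// minU32 returns the smaller of two unsigned 32-bit values,
// playing the role of one lane of VUMIN.
func minU32(x, y uint32) uint32 {
	if x < y {
		return x
	}
	return y
}

// addMod returns (a + b) mod q for a, b in [0, q).
// s = a + b cannot overflow because a, b < 2^31.
// t = s - q wraps around to a value above s when s < q,
// so the unsigned minimum always picks the reduced result.
func addMod(a, b uint32) uint32 {
	s := a + b
	t := s - q
	return minU32(s, t)
}

func main() {
	fmt.Println(addMod(1, 2))              // small inputs, no reduction needed
	fmt.Println(addMod(q-1, 1) == 0)       // wraps exactly to zero
	fmt.Println(addMod(q-1, q-1) == q-2)   // largest possible inputs
}
```

The same reasoning applies to each of the four 32-bit lanes processed per loop iteration.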

// subVec(res, a, b *Element, n uint64)
// n is the number of blocks of 4 uint32 to process
TEXT ·subVec(SB), NOFRAME|NOSPLIT, $0-32
LDP res+0(FP), (R0, R1)
LDP b+16(FP), (R2, R3)
VMOVS $const_q, V3
VDUP V3.S[0], V3.S4 // broadcast q into V3

loop3:
CBZ R3, done4
VLD1.P 16(R1), [V0.S4]
VLD1.P 16(R2), [V1.S4]
VSUB V1.S4, V0.S4, V1.S4 // b = a - b
VADD V1.S4, V3.S4, V2.S4 // t = b + q
VUMIN V2.S4, V1.S4, V1.S4 // b = min(t, b)
VST1.P [V1.S4], 16(R0) // res = b
SUB $1, R3, R3
JMP loop3

done4:
RET
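
subVec uses the mirror image of the same trick: subtract, add q, and keep the unsigned minimum. A hedged scalar sketch follows; `subMod` and `minU32` are hypothetical names, and the modulus is again assumed to be the BabyBear prime. When `a < b`, the difference wraps to a value above 2^31, and adding q wraps it back into [0, q).

```go
package main

import "fmt"

// q is assumed to be the BabyBear modulus 2^31 - 2^27 + 1 (illustration only).
const q uint32 = 2013265921

// minU32 returns the smaller of two unsigned 32-bit values,
// playing the role of one lane of VUMIN.
func minU32(x, y uint32) uint32 {
	if x < y {
		return x
	}
	return y
}

// subMod returns (a - b) mod q for a, b in [0, q).
// If a >= b, d = a - b is already reduced and d + q is larger.
// If a < b, d wraps modulo 2^32 and d + q wraps back to q - (b - a),
// which is the smaller of the two candidates.
func subMod(a, b uint32) uint32 {
	d := a - b
	t := d + q
	return minU32(d, t)
}

func main() {
	fmt.Println(subMod(5, 3))          // no borrow
	fmt.Println(subMod(3, 3))          // equal inputs give zero
	fmt.Println(subMod(0, 1) == q-1)   // borrow case
}
```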

// sumVec(t *uint64, a *[]uint32, n uint64) res = sum(a[0...n])
// n is the number of blocks of 16 uint32 to process
TEXT ·sumVec(SB), NOFRAME|NOSPLIT, $0-24
// zeroing accumulators
VMOVQ $0, $0, V4
VMOVQ $0, $0, V5
VMOVQ $0, $0, V6
VMOVQ $0, $0, V7
LDP t+0(FP), (R1, R0)
MOVD n+16(FP), R2

loop5:
CBZ R2, done6

// blockSize is 16 uint32; each iteration loads 4 vectors of 4 uint32
// (4*4)*4 = 64 bytes ~= 1 cache line
// since the values fit in 31 bits, the vectors can be added pairwise in 32 bits without overflow
// this leaves 2 vectors of 4x32-bit values
// which are widened and accumulated into 4 vectors of 2x64-bit accumulators
// the caller reduces the accumulators mod q.

VLD2.P 32(R0), [V0.S4, V1.S4]
VADD V0.S4, V1.S4, V0.S4 // a1 += a2
VLD2.P 32(R0), [V2.S4, V3.S4]
VADD V2.S4, V3.S4, V2.S4 // a3 += a4
VUSHLL $0, V0.S2, V1.D2 // convert low words to 64 bits
VADD V1.D2, V5.D2, V5.D2 // acc2 += a2
VUSHLL2 $0, V0.S4, V0.D2 // convert high words to 64 bits
VADD V0.D2, V4.D2, V4.D2 // acc1 += a1
VUSHLL $0, V2.S2, V3.D2 // convert low words to 64 bits
VADD V3.D2, V7.D2, V7.D2 // acc4 += a4
VUSHLL2 $0, V2.S4, V2.D2 // convert high words to 64 bits
VADD V2.D2, V6.D2, V6.D2 // acc3 += a3
SUB $1, R2, R2
JMP loop5

done6:
VADD V4.D2, V6.D2, V4.D2 // acc1 += acc3
VADD V5.D2, V7.D2, V5.D2 // acc2 += acc4
VST2.P [V4.D2, V5.D2], 0(R1) // store acc1 and acc2
RET
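
The accumulation strategy described in the sumVec comments (pairwise 32-bit adds, widening to 64 bits, reduction left to the caller) can be sketched in scalar Go. This is an approximation under stated assumptions, not the PR's code: `sumMod` is a hypothetical name, the modulus is assumed to be the BabyBear prime, the lane assignment is simplified relative to the VLD2 de-interleaving, and the scalar tail loop is added here only so the function works for any input length.

```go
package main

import "fmt"

// q is assumed to be the BabyBear modulus 2^31 - 2^27 + 1 (illustration only).
const q uint64 = 2013265921

// sumMod returns the sum of all elements of a modulo q.
// Every element is assumed to be in [0, q), so below 2^31.
func sumMod(a []uint32) uint32 {
	var acc [4]uint64 // stand-in for the four 64-bit lanes stored by the assembly
	i := 0
	// Process whole blocks of 16 elements, as the assembly loop does.
	for ; i+16 <= len(a); i += 16 {
		for j := 0; j < 8; j++ {
			// Adjacent elements are added in 32 bits; this cannot
			// overflow because both operands are below 2^31.
			pair := a[i+2*j] + a[i+2*j+1]
			// The pair sum is widened to 64 bits before accumulating.
			acc[j%4] += uint64(pair)
		}
	}
	// Caller-side reduction: fold the accumulators modulo q.
	var total uint64
	for _, v := range acc {
		total += v % q
	}
	// Scalar tail for the elements that do not fill a block.
	for ; i < len(a); i++ {
		total += uint64(a[i])
	}
	return uint32(total % q)
}

func main() {
	ones := make([]uint32, 35)
	for i := range ones {
		ones[i] = 1
	}
	fmt.Println(sumMod(ones)) // two full blocks plus a tail of three
}
```

Deferring the reduction keeps the inner loop free of modular arithmetic; each 64-bit lane can absorb on the order of 2^32 pair sums before it could overflow.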
2 changes: 1 addition & 1 deletion field/babybear/doc.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions field/babybear/element_arm64.s
@@ -0,0 +1,10 @@
//go:build !purego

// Copyright 2020-2024 Consensys Software Inc.
// Licensed under the Apache License, Version 2.0. See the LICENSE file for details.

// Code generated by consensys/gnark-crypto DO NOT EDIT

// We include the hash to force the Go compiler to recompile: 8620676634583589757
#include "../asm/element_31b_arm64.s"

6 changes: 2 additions & 4 deletions field/babybear/vector_amd64.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

104 changes: 104 additions & 0 deletions field/babybear/vector_arm64.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
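
Since the generated vector_arm64.go is not rendered here, the following is only a guess at what a caller of addVec has to do, based on the contract stated in the assembly comments ("n is the number of blocks of 4 uint32 to process"). All names are hypothetical: `addVecBlocks` is a pure-Go stand-in for the assembly routine, and `Add` is an assumed wrapper that hands whole blocks to it and finishes the remainder with scalar code. The modulus is assumed to be the BabyBear prime.

```go
package main

import "fmt"

// q is assumed to be the BabyBear modulus 2^31 - 2^27 + 1 (illustration only).
const q uint32 = 2013265921

// addMod returns (a + b) mod q for a, b in [0, q).
func addMod(a, b uint32) uint32 {
	s := a + b
	if t := s - q; t < s {
		return t
	}
	return s
}

// addVecBlocks is a pure-Go stand-in for the assembly routine:
// it processes exactly n blocks of 4 elements and nothing else.
func addVecBlocks(res, a, b []uint32, n int) {
	for i := 0; i < 4*n; i++ {
		res[i] = addMod(a[i], b[i])
	}
}

// Add sets res[i] = a[i] + b[i] mod q for every index.
// Whole blocks go to the block routine; the tail is handled in scalar code.
func Add(res, a, b []uint32) {
	if len(a) != len(b) || len(a) != len(res) {
		panic("vector: mismatched lengths")
	}
	const blockSize = 4
	n := len(a) / blockSize
	if n > 0 {
		addVecBlocks(res, a, b, n)
	}
	for i := n * blockSize; i < len(a); i++ {
		res[i] = addMod(a[i], b[i])
	}
}

func main() {
	a := []uint32{1, 2, 3, 4, 5, q - 1}
	b := []uint32{1, 1, 1, 1, 1, 1}
	res := make([]uint32, len(a))
	Add(res, a, b)
	fmt.Println(res)
}
```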

2 changes: 1 addition & 1 deletion field/babybear/vector_purego.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 0 additions & 2 deletions field/generator/asm/amd64/build.go
@@ -221,10 +221,8 @@ func GenerateCommonASM(w io.Writer, nbWords, nbBits int, hasVector bool) error {
		if nbBits == 31 {
			return GenerateF31ASM(f, hasVector)
		} else {
			fmt.Printf("nbWords: %d, nbBits: %d\n", nbWords, nbBits)
			panic("not implemented")
		}

	}

	f.GenerateReduceDefine()
20 changes: 20 additions & 0 deletions field/generator/asm/arm64/build.go
@@ -55,6 +55,14 @@ func GenerateCommonASM(w io.Writer, nbWords, nbBits int, hasVector bool) error {
	f.WriteLn("#include \"go_asm.h\"")
	f.WriteLn("")

	if nbWords == 1 {
		if nbBits == 31 {
			return GenerateF31ASM(f, hasVector)
		} else {
			panic("not implemented")
		}
	}

	if f.NbWords%2 != 0 {
		panic("NbWords must be even")
	}
@@ -216,3 +224,15 @@ func ElementASMFileName(nbWords, nbBits int) string {
	}
	return fmt.Sprintf(nameWN, nbWords)
}

func GenerateF31ASM(f *FFArm64, hasVector bool) error {
	if !hasVector {
		return nil // nothing for now.
	}

	f.generateAddVecF31()
	f.generateSubVecF31()
	f.generateSumVecF31()

	return nil
}