-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP ARM Neon/AVX2 SIMD implementation. #730
base: master
Are you sure you want to change the base?
Conversation
Seems quite a big deal on some benchmarks:
I was initially not very enthusiastic about SIMD because we need runtime detection for it to be viable, and that seemed like a lot of work, but I very recently found https://github.com/abetlen/simdinfo, which is simple enough, so it could be worth trying. |
ext/json/ext/generator/generator.c
Outdated
const uint8x16_t lower_bound = vdupq_n_u8(32); | ||
const uint8x16_t backslash = vdupq_n_u8(92); | ||
const uint8x16_t dblquote = vdupq_n_u8(34); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const uint8x16_t lower_bound = vdupq_n_u8(32); | |
const uint8x16_t backslash = vdupq_n_u8(92); | |
const uint8x16_t dblquote = vdupq_n_u8(34); | |
const uint8x16_t lower_bound = vdupq_n_u8(' '); | |
const uint8x16_t backslash = vdupq_n_u8('\\'); | |
const uint8x16_t dblquote = vdupq_n_u8('\"'); |
Presumably we can use characters here.
ext/json/ext/generator/generator.c
Outdated
const uint8x16_t dblquote = vdupq_n_u8(34); | ||
|
||
while (pos+16 < len) { | ||
uint8x16_t chunk = vld1q_u8((const uint8_t*)&ptr[pos]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think you could add a few comments?
I usually am of the thinking only the why should be commented, not the how, but SIMD being not a super well known API, I think it's a good exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! I've been iterating pretty quick on various versions. I'll add some documentation as I think I'm relatively happy with the code at the moment.
ext/json/ext/generator/generator.c
Outdated
|
||
invalid = vorrq_u8(invalid, has_escaped_char); | ||
|
||
if (vmaxvq_u8(invalid) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we could check the number of leading zeros to find the exact position of the first character we need to escape, no?
ext/json/ext/generator/generator.c
Outdated
tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4))); | ||
|
||
uint8_t arr[16]; | ||
vst1q_u8(arr, tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH actually, I suppose that's what i was thinking about. It tells you the position of the backslashes and quotes.
Note to self, before merging any of that we should setup ARM64 CI: https://github.blog/changelog/2024-09-03-github-actions-arm64-linux-and-windows-runners-are-now-generally-available/ |
Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning, but not too sure what the support is like in various compilers. |
Nevermind, it's limited to C++ on i386 arch, so basically unusable. |
ext/json/ext/generator/generator.c
Outdated
uint8x16_t tmp = vandq_u8(too_low, vdupq_n_u8(0x1)); | ||
tmp = vorrq_u8(tmp, vandq_u8(has_backslash, vdupq_n_u8(0x2))); | ||
tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4))); | ||
|
||
uint8_t arr[16]; | ||
vst1q_u8(arr, tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried removing this part, and simply falling back to the lookup table for the 16B chunk, and got pretty much the same performance, which isn't that surprising, the biggest gain is from being able to quickly scan the happy path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out this was unnecessary anyway. It didn't matter which case hit if the character needed to be escaped. I removed the chain of vorrq_u8
's this morning.
|
||
if (vmaxvq_u8(invalid) == 0) { | ||
pos += 16; | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could get a lot of performance here by directly writing the NEON register into the destination.
Because ultimately we'll call MEMCPY, which will load the same bytes into a NEON registry too, so it would be way more efficient to do it immediately.
I'm not too sure which intrinsic is used for that, should be the inverse of vld1q_u8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vst1q_u8
is the opposite of vld1q_u8
. If we know the capacity of the buffer is large enough, this should be a win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying that a few minutes ago:
diff --git a/ext/json/ext/generator/generator.c b/ext/json/ext/generator/generator.c
index 3112e89..fd7b948 100644
--- a/ext/json/ext/generator/generator.c
+++ b/ext/json/ext/generator/generator.c
@@ -260,7 +260,12 @@ static void convert_UTF8_to_JSON(FBuffer *out_buffer, VALUE str)
invalid = vorrq_u8(invalid, has_escaped_char);
if (vmaxvq_u8(invalid) == 0) {
+ FLUSH_POS(0);
+ fbuffer_inc_capa(out_buffer, 16);
+ vst1q_u8((const uint8_t*)(out_buffer->ptr + out_buffer->len), chunk);
+ out_buffer->len += 16;
pos += 16;
+ beg = pos;
continue;
}
But if anything it is very sligthly slower, which is surprising, but I may be doing something wrong. I'm on a very spotty airport wifi so not easy to look at doc etc.
That said, I also see there are way to load/store up to 4 128-bit register, so trying to exploit that too may helpo,
and perhaps that's how MEMCPY
is still competitive.
…8_to_JSON. It doesn't matter which case is hit when a byte needs to be escaped. In that case, remove the vorr_q chain and simply use the combined 'needs_escape' vector.
} else { \ | ||
pos++; \ | ||
} | ||
|
||
static void convert_UTF8_to_ASCII_only_JSON(FBuffer *out_buffer, VALUE str, const unsigned char escape_table[256]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: you could entirely ignore convert_UTF8_to_ASCII_only_JSON
, it's a very niche feature, not worth the extra complexity.
Apologies about the no |
No worries. |
ext/json/ext/generator/generator.c
Outdated
uint8x16_t too_low = vcltq_u8(chunk, lower_bound); | ||
uint8x16_t has_backslash = vceqq_u8(chunk, backslash); | ||
uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be possible to check for both too_low
and has_doublequote
with a single vqtbl4q_u8
, because we're searching for < 32
, 34
and 92
.
The first two can be a lookup table (0 -> 64), and for the second we can just to a vceqq_u8
.
Assuming vqtbl4q_u8
and vceqq_u8
have similar-ish performance (no idea), that would save two instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is on the list of things I'd like to try. I can't find the reference now, but somewhere in the Neon reference (I believe) it hinted that table lookups can be more expensive than a series of comparison instructions. It's definitely worth a test though.
…d flags. These can be set with the JSON_GENERATOR_CONFIGURE_OPTS environment variable prior to running rake. Additionally, set the stage for different SIMD implementations.
@@ -86,7 +86,7 @@ end | |||
|
|||
file EXT_GENERATOR_DL => EXT_GENERATOR_SRC do | |||
cd EXT_GENERATOR_DIR do | |||
ruby 'extconf.rb' | |||
ruby "extconf.rb #{ENV['JSON_GENERATOR_CONFIGURE_OPTS']}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is probably a better way to accomplish this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, also I cleaned up the Rakefile the other day, so this will cause a merge conflict.
ext/json/ext/generator/extconf.h
Outdated
@@ -0,0 +1,8 @@ | |||
#ifndef EXTCONF_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file gets generated by mkmf
. It probably shouldn't be checked in...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it shouldn't be committed.
|
||
#define SIMD_VEC_STRIDE 16 | ||
|
||
#define simd_vec_type uint8x16_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully these help readability a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I prefer uint8x16_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I did this to be able to support different platforms in a generic way. This likely comes at the cost of efficiency as there may be better platform specific algorithms based on the provided ISA.
I can remove this and implement the multiple paths in the generator if you'd prefer.
} | ||
SRC | ||
$defs.push("-DENABLE_SIMD") | ||
append_cflags('-mavx2') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely not a safe assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. ruby/json
is shipped with ruby
which means will likely be distributed as precompiled binary.
What we should try to do is to use the various __target__("arch=...")
attributes to compile SIMD enabled versions of some functions, and then use something like https://github.com/abetlen/simdinfo to dispatch at runtime.
That's really the tricky part and blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target-function-attribute
int sse3_func (void) __attribute__ ((__target__ ("sse3")));
Seem to be the syntax, and I think clang
supports it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like both clang
and gcc
support this idea, differently, of course...
However, I'm working on samyron#1 to refactor to use runtime SIMD detection. I also specialized the SSE version a bit. See the link for a few benchmarks.
|
||
#ifdef HAVE_X86INTRIN_H | ||
#include <x86intrin.h> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in this section is likely not optimal. SSE/AVX/AVX2 do not seem to have great support for unsigned bytes.
WORK IN PROGRESS
Initial implementation providing an ARM Neon (and now AVX2) SIMD implementation of the
convert_UTF8_to_JSON*
functions.There is still more work to be done on the
convert_UTF8_to_ASCII_only_JSON
. Right now it only uses SIMD to skip the prefix of characters that do not need escaping. I will fix that in the coming days.I started the implementation of x86_64 support. Currently using
__m256i
seems significantly slower on my machine. I plan to make it configurable.Additionally, the algorithm between the ARM Neon and x86_64 (AVX2) is the same. This may not be ideal as there may be better instructions on one/either platform which would be more efficient.
Benchmarks
My machine is an M1 Macbook Air with 16MB of RAM.
I do not have YJIT built.
Baseline: code in
master
(at the time fork was created)With SIMD
Benchmark encoding a long string.
Before and After (using the before & after script)
Existing benchmarks