Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP ARM Neon/AVX2 SIMD implementation. #730

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

samyron
Copy link

@samyron samyron commented Jan 17, 2025

WORK IN PROGRESS

Initial implementation providing an ARM Neon (and now AVX2) SIMD implementation of the convert_UTF8_to_JSON* functions.

There is still more work to be done on the convert_UTF8_to_ASCII_only_JSON. Right now it only uses SIMD to skip the prefix of characters that do not need escaping. I will fix that in the coming days.

I started the implementation of x86_64 support. Currently using __m256i seems significantly slower on my machine. I plan to make it configurable.

Additionally, the algorithm between the ARM Neon and x86_64 (AVX2) is the same. This may not be ideal as there may be better instructions on one/either platform which would be more efficient.

Benchmarks

My machine is an M1 Macbook Air with 16MB of RAM.

% ruby -v
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]

I do not have YJIT built.

Baseline: code in master (at the time fork was created)

% ruby -Ilib:ext benchmark/encoder-simple.rb 
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.465k i/100ms
                  oj     2.089k i/100ms
Calculating -------------------------------------
                json     14.440k (± 2.5%) i/s   (69.25 μs/i) -     73.250k in   5.075975s
                  oj     20.504k (± 1.9%) i/s   (48.77 μs/i) -    104.450k in   5.096055s

Comparison:
                json:    14439.8 i/s
                  oj:    20503.7 i/s - 1.42x  faster

With SIMD

Benchmark encoding a long string.

% ruby -Ilib:ext benchmark/encoder-simple.rb
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     3.721k i/100ms
                  oj     2.006k i/100ms
Calculating -------------------------------------
                json     39.497k (± 3.9%) i/s   (25.32 μs/i) -    200.934k in   5.094877s
                  oj     20.486k (± 2.6%) i/s   (48.81 μs/i) -    104.312k in   5.095360s

Comparison:
                json:    39497.0 i/s
                  oj:    20486.2 i/s - 1.93x  slower

Before and After (using the before & after script)

== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     4.386k i/100ms
Calculating -------------------------------------
               after     42.103k (± 4.4%) i/s   (23.75 μs/i) -    210.528k in   5.010512s

Comparison:
              before:    14901.3 i/s
               after:    42103.4 i/s - 2.83x  faster

Existing benchmarks

== Encoding small mixed (34 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   431.030k i/100ms
                  oj   425.858k i/100ms
Calculating -------------------------------------
                json      4.443M (± 0.8%) i/s  (225.05 ns/i) -     22.414M in   5.044440s
                  oj      4.240M (± 1.2%) i/s  (235.86 ns/i) -     21.293M in   5.022778s

Comparison:
                json:  4443492.7 i/s
                  oj:  4239867.2 i/s - 1.05x  slower


== Encoding small nested array (121 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   202.481k i/100ms
                  oj   173.531k i/100ms
Calculating -------------------------------------
                json      2.060M (± 3.4%) i/s  (485.48 ns/i) -     10.327M in   5.020528s
                  oj      1.740M (± 1.7%) i/s  (574.79 ns/i) -      8.850M in   5.088416s

Comparison:
                json:  2059819.6 i/s
                  oj:  1739779.8 i/s - 1.18x  slower


== Encoding small hash (65 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   440.390k i/100ms
                  oj   485.263k i/100ms
Calculating -------------------------------------
                json      4.432M (± 0.9%) i/s  (225.61 ns/i) -     22.460M in   5.067690s
                  oj      4.834M (± 1.0%) i/s  (206.86 ns/i) -     24.263M in   5.019614s

Comparison:
                json:  4432356.6 i/s
                  oj:  4834123.7 i/s - 1.09x  faster


== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    79.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    671.096 (±11.8%) i/s    (1.49 ms/i) -      3.318k in   5.004342s
                  oj    350.462 (± 4.3%) i/s    (2.85 ms/i) -      1.764k in   5.042945s

Comparison:
                json:      671.1 i/s
                  oj:      350.5 i/s - 1.91x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    58.000 i/100ms
                  oj    35.000 i/100ms
Calculating -------------------------------------
                json    657.830 (±18.7%) i/s    (1.52 ms/i) -      3.190k in   5.014119s
                  oj    346.778 (± 3.5%) i/s    (2.88 ms/i) -      1.750k in   5.052975s

Comparison:
                json:      657.8 i/s
                  oj:      346.8 i/s - 1.90x  slower


== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     7.817k i/100ms
                  oj     7.297k i/100ms
Calculating -------------------------------------
                json     76.183k (± 6.1%) i/s   (13.13 μs/i) -    383.033k in   5.050410s
                  oj     72.630k (± 2.4%) i/s   (13.77 μs/i) -    364.850k in   5.026326s

Comparison:
                json:    76183.3 i/s
                  oj:    72630.4 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.292k i/100ms
                  oj     1.506k i/100ms
Calculating -------------------------------------
                json     22.532k (± 2.9%) i/s   (44.38 μs/i) -    114.600k in   5.090538s
                  oj     15.440k (± 2.2%) i/s   (64.77 μs/i) -     78.312k in   5.074521s

Comparison:
                json:    22531.7 i/s
                  oj:    15440.0 i/s - 1.46x  slower


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   109.000 i/100ms
                  oj    91.000 i/100ms
Calculating -------------------------------------
                json      1.102k (± 1.8%) i/s  (907.47 μs/i) -      5.559k in   5.046287s
                  oj    927.507 (± 1.7%) i/s    (1.08 ms/i) -      4.641k in   5.005234s

Comparison:
                json:     1102.0 i/s
                  oj:      927.5 i/s - 1.19x  slower


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   220.000 i/100ms
                  oj   185.000 i/100ms
Calculating -------------------------------------
                json      2.229k (± 2.5%) i/s  (448.54 μs/i) -     11.220k in   5.035750s
                  oj      1.842k (± 3.4%) i/s  (542.90 μs/i) -      9.250k in   5.027607s

Comparison:
                json:     2229.5 i/s
                  oj:     1842.0 i/s - 1.21x  slower


== Encoding canada.json (2090234 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.000 i/100ms
                  oj     1.000 i/100ms
Calculating -------------------------------------
                json     11.450 (± 0.0%) i/s   (87.33 ms/i) -     58.000 in   5.066321s
                  oj     11.135 (± 0.0%) i/s   (89.81 ms/i) -     56.000 in   5.029813s

Comparison:
                json:       11.5 i/s
                  oj:       11.1 i/s - 1.03x  slower


== Encoding many #to_json calls (2701 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.434k i/100ms
                  oj     2.105k i/100ms
Calculating -------------------------------------
                json     24.448k (± 0.8%) i/s   (40.90 μs/i) -    124.134k in   5.077829s
                  oj     20.934k (± 0.9%) i/s   (47.77 μs/i) -    105.250k in   5.028205s

Comparison:
                json:    24447.9 i/s
                  oj:    20933.5 i/s - 1.17x  slower

@byroot
Copy link
Member

byroot commented Jan 17, 2025

Seems quite a big deal on some benchmarks:

== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     9.553k i/100ms
Calculating -------------------------------------
               after     96.345k (± 0.5%) i/s   (10.38 μs/i) -    487.203k in   5.056999s

Comparison:
              before:    97143.7 i/s
               after:    96344.5 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.797k i/100ms
Calculating -------------------------------------
               after     28.500k (± 1.4%) i/s   (35.09 μs/i) -    142.647k in   5.006212s

Comparison:
              before:    21655.6 i/s
               after:    28499.8 i/s - 1.32x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   134.000 i/100ms
Calculating -------------------------------------
               after      1.356k (± 0.7%) i/s  (737.44 μs/i) -      6.834k in   5.039867s

Comparison:
              before:     1369.0 i/s
               after:     1356.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   277.000 i/100ms
Calculating -------------------------------------
               after      2.773k (± 0.5%) i/s  (360.57 μs/i) -     14.127k in   5.093951s

Comparison:
              before:     2407.9 i/s
               after:     2773.4 i/s - 1.15x  faster

I was initially not very enthusiastic about SIMD because we need runtime detection for it to be viable, and that seemed like a lot of work, but I very recently found https://github.com/abetlen/simdinfo, which is simple enough, so it could be worth trying.

Comment on lines 246 to 248
const uint8x16_t lower_bound = vdupq_n_u8(32);
const uint8x16_t backslash = vdupq_n_u8(92);
const uint8x16_t dblquote = vdupq_n_u8(34);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const uint8x16_t lower_bound = vdupq_n_u8(32);
const uint8x16_t backslash = vdupq_n_u8(92);
const uint8x16_t dblquote = vdupq_n_u8(34);
const uint8x16_t lower_bound = vdupq_n_u8(' ');
const uint8x16_t backslash = vdupq_n_u8('\\');
const uint8x16_t dblquote = vdupq_n_u8('\"');

Presumably we can use characters here.

const uint8x16_t dblquote = vdupq_n_u8(34);

while (pos+16 < len) {
uint8x16_t chunk = vld1q_u8((const uint8_t*)&ptr[pos]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could add a few comments?

I usually am of the thinking only the why should be commented, not the how, but SIMD being not a super well known API, I think it's a good exception.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! I've been iterating pretty quick on various versions. I'll add some documentation as I think I'm relatively happy with the code at the moment.


invalid = vorrq_u8(invalid, has_escaped_char);

if (vmaxvq_u8(invalid) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could check the number of leading zeros to find the exact position of the first character we need to escape, no?

tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH actually, I suppose that's what i was thinking about. It tells you the position of the backslashes and quotes.

@byroot
Copy link
Member

byroot commented Jan 17, 2025

Note to self, before merging any of that we should setup ARM64 CI: https://github.blog/changelog/2024-09-03-github-actions-arm64-linux-and-windows-runners-are-now-generally-available/

@byroot
Copy link
Member

byroot commented Jan 17, 2025

I very recently found https://github.com/abetlen/simdinfo, which is simple enough

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning, but not too sure what the support is like in various compilers. ruby/json being included in Ruby, it's tested against some pretty old and exotic compilers.

@byroot
Copy link
Member

byroot commented Jan 17, 2025

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning,

Nevermind, it's limited to C++ on i386 arch, so basically unusable.

Comment on lines 267 to 272
uint8x16_t tmp = vandq_u8(too_low, vdupq_n_u8(0x1));
tmp = vorrq_u8(tmp, vandq_u8(has_backslash, vdupq_n_u8(0x2)));
tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing this part, and simply falling back to the lookup table for the 16B chunk, and got pretty much the same performance, which isn't that surprising, the biggest gain is from being able to quickly scan the happy path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this was unnecessary anyway. It didn't matter which case hit if the character needed to be escaped. I removed the chain of vorrq_u8's this morning.


if (vmaxvq_u8(invalid) == 0) {
pos += 16;
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could get a lot of performance here by directly writing the NEON register into the destination.

Because ultimately we'll call MEMCPY, which will load the same bytes into a NEON registry too, so it would be way more efficient to do it immediately.

I'm not too sure which intrinsic is used for that, should be the inverse of vld1q_u8

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vst1q_u8 is the opposite of vld1q_u8. If we know the capacity of the buffer is large enough, this should be a win.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying that a few minutes ago:

diff --git a/ext/json/ext/generator/generator.c b/ext/json/ext/generator/generator.c
index 3112e89..fd7b948 100644
--- a/ext/json/ext/generator/generator.c
+++ b/ext/json/ext/generator/generator.c
@@ -260,7 +260,12 @@ static void convert_UTF8_to_JSON(FBuffer *out_buffer, VALUE str)
         invalid = vorrq_u8(invalid, has_escaped_char);
 
         if (vmaxvq_u8(invalid) == 0) {
+            FLUSH_POS(0);
+            fbuffer_inc_capa(out_buffer, 16);
+            vst1q_u8((const uint8_t*)(out_buffer->ptr + out_buffer->len), chunk);
+            out_buffer->len += 16;
             pos += 16;
+            beg = pos;
             continue;
         }

But if anything it is very sligthly slower, which is surprising, but I may be doing something wrong. I'm on a very spotty airport wifi so not easy to look at doc etc.

That said, I also see there are way to load/store up to 4 128-bit register, so trying to exploit that too may helpo,
and perhaps that's how MEMCPY is still competitive.

…8_to_JSON. It doesn't matter which case is hit when a byte needs to be escaped. In that case, remove the vorr_q chain and simply use the combined 'needs_escape' vector.
} else { \
pos++; \
}

static void convert_UTF8_to_ASCII_only_JSON(FBuffer *out_buffer, VALUE str, const unsigned char escape_table[256])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: you could entirely ignore convert_UTF8_to_ASCII_only_JSON, it's a very niche feature, not worth the extra complexity.

@samyron
Copy link
Author

samyron commented Jan 18, 2025

Apologies about the no extconf.h. Forgot to stash that change. I'm currently trying to use the extconf.rb detect NEON (with other implementations coming soon). Additionally trying to add configure flags so we can enable/disable this feature.

@byroot
Copy link
Member

byroot commented Jan 18, 2025

No worries.

Comment on lines 295 to 297
uint8x16_t too_low = vcltq_u8(chunk, lower_bound);
uint8x16_t has_backslash = vceqq_u8(chunk, backslash);
uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be possible to check for both too_low and has_doublequote with a single vqtbl4q_u8, because we're searching for < 32, 34 and 92.

The first two can be a lookup table (0 -> 64), and for the second we can just to a vceqq_u8.

Assuming vqtbl4q_u8 and vceqq_u8 have similar-ish performance (no idea), that would save two instructions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is on the list of things I'd like to try. I can't find the reference now, but somewhere in the Neon reference (I believe) it hinted that table lookups can be more expensive than a series of comparison instructions. It's definitely worth a test though.

…d flags. These can be set with the JSON_GENERATOR_CONFIGURE_OPTS environment variable prior to running rake. Additionally, set the stage for different SIMD implementations.
@@ -86,7 +86,7 @@ end

file EXT_GENERATOR_DL => EXT_GENERATOR_SRC do
cd EXT_GENERATOR_DIR do
ruby 'extconf.rb'
ruby "extconf.rb #{ENV['JSON_GENERATOR_CONFIGURE_OPTS']}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is probably a better way to accomplish this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, also I cleaned up the Rakefile the other day, so this will cause a merge conflict.

@@ -0,0 +1,8 @@
#ifndef EXTCONF_H
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file gets generated by mkmf. It probably shouldn't be checked in...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it shouldn't be committed.


#define SIMD_VEC_STRIDE 16

#define simd_vec_type uint8x16_t
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully these help readability a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I prefer uint8x16_t.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I did this to be able to support different platforms in a generic way. This likely comes at the cost of efficiency as there may be better platform specific algorithms based on the provided ISA.

I can remove this and implement the multiple paths in the generator if you'd prefer.

}
SRC
$defs.push("-DENABLE_SIMD")
append_cflags('-mavx2')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely not a safe assumption.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. ruby/json is shipped with ruby which means will likely be distributed as precompiled binary.

What we should try to do is to use the various __target__("arch=...") attributes to compile SIMD enabled versions of some functions, and then use something like https://github.com/abetlen/simdinfo to dispatch at runtime.

That's really the tricky part and blocker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target-function-attribute

int sse3_func (void) __attribute__ ((__target__ ("sse3")));

Seem to be the syntax, and I think clang supports it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like both clang and gcc support this idea, differently, of course...

However, I'm working on samyron#1 to refactor to use runtime SIMD detection. I also specialized the SSE version a bit. See the link for a few benchmarks.


#ifdef HAVE_X86INTRIN_H
#include <x86intrin.h>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in this section is likely not optimal. SSE/AVX/AVX2 do not seem to have great support for unsigned bytes.

@samyron samyron changed the title WIP ARM Neon SIMD implementation. WIP ARM Neon/AVX2 SIMD implementation. Jan 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants