
LIB: add support for ARM NEON #11

Closed

Conversation

onereddogmedia
Contributor

No description provided.

0x1337c0d3 and others added 2 commits November 21, 2021 12:53
@swesterfeld
Owner

That is an interesting approach to supporting SIMD optimizations on ARM without changing the existing algorithms. I have a few comments:

As submitted, it doesn't actually optimize anything, because all the existing SSE code checks if __SSE__ is defined. You can see that there are three files containing code that could potentially run faster through vectorization:

$ grep -l __SSE__ */*hh */*cc
lib/smnoisedecoder.cc
lib/smpandaresampler.cc
lib/smmath.hh

So at least these __SSE__ checks need to be changed, maybe to something like #if defined (__SSE__) || defined (SM_ARM_SSE) or similar (a sketch of what I mean follows the list below). All in all we have three vectorized algorithms:

  • NoiseDecoder
  • fast_vector_sin
  • PandaResampler
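
To illustrate, a widened guard could look roughly like this (scale_block is a made-up example function, not actual SpectMorph code; on ARM the _mm_* names would have to come from the NEON mapping in this PR):

#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

#if defined (__SSE__) || defined (SM_ARM_SSE)
/* vectorized path: processes four floats per iteration via _mm_* intrinsics */
static void
scale_block (float *samples, size_t n, float gain)
{
  const __m128 g = _mm_set1_ps (gain);
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    _mm_storeu_ps (samples + i, _mm_mul_ps (_mm_loadu_ps (samples + i), g));
  for (; i < n; i++)    /* scalar tail for the remaining samples */
    samples[i] *= gain;
}
#else
/* plain scalar fallback when neither SSE nor the NEON mapping is available */
static void
scale_block (float *samples, size_t n, float gain)
{
  for (size_t i = 0; i < n; i++)
    samples[i] *= gain;
}
#endif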

As with any performance optimization, it is probably a good idea to check whether the vectorized code actually improves performance. I cannot do this myself, because I don't have ARM hardware.

For the NoiseDecoder, running tests/testnoiseperf would be interesting; for fast_vector_sin, the same goes for tests/testfastsin perf with and without your ARM vectorization code.
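
Concretely, the two runs would simply be

$ tests/testnoiseperf
$ tests/testfastsin perf

built once with and once without the NEON definitions enabled, so the numbers can be compared directly.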

PandaResampler is special, because its code lives in its own repository (https://github.com/swesterfeld/pandaresampler). Since it is a header-only library, you'd have to copy-paste the necessary _mm_mul_ps and _mm_add_ps definitions directly into the lib/pandaresampler.cc file, then run tests/testresampler perf --fpu and compare it with tests/testresampler perf. Later, the PandaResampler copy in SpectMorph would have to be updated with the new pandaresampler.cc.
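
To sketch what that copy-paste could look like (just the rough idea in the spirit of this PR and SIMDe, not the exact code that would end up in pandaresampler.cc):

#include <arm_neon.h>

typedef float32x4_t __m128;   /* four packed floats, same width as SSE's __m128 */

static inline __m128 _mm_mul_ps (__m128 a, __m128 b) { return vmulq_f32 (a, b); }
static inline __m128 _mm_add_ps (__m128 a, __m128 b) { return vaddq_f32 (a, b); }

With definitions like these in place, the existing _mm_* code in PandaResampler should compile unchanged on ARM, and comparing tests/testresampler perf with tests/testresampler perf --fpu would show whether the vectorization actually pays off there.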

As for licensing, I've recently changed SpectMorph to LGPL v2.1 or later (c59dd65), so that license would be needed for the code vectorizing NoiseDecoder and fast_vector_sin. PandaResampler is MPL-2.0.

If you've implemented all the code yourself, it would be great if you'd allow it to be used under both licenses. I also see the possibility that your code could be used in Anklang (https://anklang.testbit.eu/), which is also MPL-2.0. If you copy-pasted your code from elsewhere, the original license would also need to be checked.

@onereddogmedia
Contributor Author

onereddogmedia commented Dec 8, 2021

Thanks for replying. This approach to supporting SIMD on ARM NEON is similar to that of the SIMDe library; the SIMD intrinsics are almost the same between NEON and SSE. Interestingly, there's not a huge difference in performance on the Apple M1. I have a subset of SpectMorph running in Xcode, and I'm curious to see the timings on the slower Apple ARM processors used in the iPad; I'll update you.

Sorry, I haven't looked at PandaResampler.

Regarding the license: I marked it MPL, but it might be MIT. The macros are basically from SIMDe, without all their preprocessor flags.

swesterfeld added a commit that referenced this pull request Dec 9, 2021
Support for ARM NEON from pull request

  #11

based on code from Peter Johnson <[email protected]>
swesterfeld added a commit that referenced this pull request Dec 12, 2021
  #11

based on code from Peter Johnson <[email protected]>
@swesterfeld
Owner

I've tried to merge various bits and pieces from your PR into master. It is not a 1:1 merge; I made some changes while merging. Still, you should be able to build from master now, either without changes or with minimal ones. I'd be interested in your feedback on this.

> Sorry, I haven't looked at PandaResampler.

Ok, finally I found some time to add NEON support based on your code to PandaResampler. You'll find it in the master branch, along with a new performance test. On my devel system (Ryzen 7), SSE is about 3 times faster than the FPU code:

$ tests/testpandaperf 
SSE: 5.57 ns/sample
FPU: 16.61 ns/sample

SSE/FPU speedup: 2.98

I'd be interested in seeing the results of this test using NEON on your M1 system.

@onereddogmedia
Contributor Author

Apologies for the late reply. I get the following results on a MacBook Air (M1, 2020):

$ ./testpandaperf
SSE: 3.76 ns/sample
FPU: 13.55 ns/sample

SSE/FPU speedup: 3.61

@onereddogmedia
Contributor Author

MacBook Pro (14-inch, 2021, M1 Pro):

$ ./tests/testpandaperf 
SSE: 3.65 ns/sample
FPU: 13.42 ns/sample

SSE/FPU speedup: 3.68

@swesterfeld
Owner

Ok, the results for PandaResampler on M1 look really good. So after all, your code and your suggestion to SIMDify things using NEON really do improve performance for the resampling code on M1.

As far as I can see, everything is merged into master and tested now, so I'll close the PR. Thanks!
