Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(python): Specify tune-cpu & add more features #17615

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

ruihe774
Copy link
Contributor

@ruihe774 ruihe774 commented Jul 13, 2024

This PR makes use of standard x86 microarhitecture levels1 for target-cpu instead of manually specified target-features when building the release wheels. This brings the following benefits:

  • Consistent with the standard: Many Linux distros, e.g. openSUSE2 and CachyOS3, use x86 microarhitecture levels when compiling optimized versions of packages. Using standard levels better aligns with the downstream packaging.
  • Reduce maintain burden: No need to define and update4 a list of features. Each level includes all supported features.
  • Performance improvement: Setting target-cpu implies tune-cpu, while setting target-features does not. e.g., target-cpu=x86-84-v3 enables tuning for those v3 cpu models5.

This PR uses:

  • x86-64-v3 for normal release; it includes sse, avx, avx2, etc.
  • x86-64-v2 for lts-cpu release; it includes sse, etc.

Supported CPU models (i.e. compatibility) of either of them is not changed.

The special configuration for macOS is removed. I cannot find a cpu model that support fma but not avx2. Please correct me if I'm wrong.

This PR does not affect _cpu_check.py. A list of features is still given to it.

XXX:

  • Is it worthy to also set tune-cpu? x86-64-v3 is tuned for a kind of old cpu model (Haswell). We may set target-cpu and tune-cpu to different values, e.g. -C target-cpu=x86-64-v3 -Z tune-cpu=skylake.
  • It is not necessary to check every feature in _cpu_check.py. e.g. a cpu that has avx also has sse. We can simplify its logic and only check a sentinel feature.

Footnotes

  1. https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels

  2. https://news.opensuse.org/2023/03/02/tw-gains-optional-optimizations

  3. https://wiki.cachyos.org/cachyos_repositories/what_are_the_cachyos_repo/#performance-and-optimizations

  4. https://github.com/pola-rs/polars/pull/14746

  5. https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86.td

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars labels Jul 13, 2024
@ritchie46
Copy link
Member

@orlp can you take a look here?

@orlp
Copy link
Collaborator

orlp commented Jul 15, 2024

I'd prefer to keep using a manual list of features, on an as-needed basis. This also makes it easy to keep the CPU check module up-to-date.

Another example of a pitfall of x86-64-v3 is that your PR would have silently disabled pclmulqdq.

@orlp orlp closed this Jul 15, 2024
@ruihe774
Copy link
Contributor Author

ruihe774 commented Jul 15, 2024

Another example of a pitfall of x86-64-v3 is that your PR would have silently disabled pclmulqdq.

Sorry, I did not notice that.

I'd prefer to keep using a manual list of features, on an as-needed basis. This also makes it easy to keep the CPU check module up-to-date.'

I want to argue that the compiler can implicitly make use of arch features. An "as-needed basis" is sub-optimal for the compiler. For example, if x86-64-v3 is specified, the compiler can use movbe and f16c, which are present in all AVX2-equppied CPU models and are not specified in the current "manual list of features".

Another point, as I've mentioned, is the tunings. Specifying a manual list of features does not turn on the tunings for the target CPU - tunings for generic CPU is used. Tunings in x86-64-v3 like TuningFastScalarFSQRT, TuningFastVariableCrossLaneShuffle, TuningFastVariablePerLaneShuffle, etc, are not enabled. This prevent the compiler from e.g. using a native sqrt implementation, and fallbacks to software Newton-Raphson.

Despite the aforementioned benefits of using microarch levels, if a manual list of features is still preferred, IMO we can at least complete the list with all features that the target CPU supports, and manually specify a -Z tune-cpu. @orlp Would you plz reconsider it?

@orlp
Copy link
Collaborator

orlp commented Jul 16, 2024

@ruihe774 I would be open to adding more features if it means we don't exclude any CPUs we support now, and adding a reasonable tuning target. movbe sounds useful, f16c I wouldn't bother with since we don't use that.

I'm not very open to relaxing the CPU check. It's not that expensive and it means users get a proper explained error rather than a segfault. Also please consider that there are 'CPUs' which support weird combinations of feature sets, such as certain emulators like Rosetta.

@orlp orlp reopened this Jul 16, 2024
@ruihe774
Copy link
Contributor Author

ruihe774 commented Jul 17, 2024

@ruihe774 I would be open to adding more features if it means we don't exclude any CPUs we support now, and adding a reasonable tuning target. movbe sounds useful, f16c I wouldn't bother with since we don't use that.

OK. I'm going to work on this.

BTW what is the "reasonable tuning target" in your opinion? IMO I would choose tune-cpu=skylake because 1) skylake and later cpu models are widely used while prior cpu models are rare now, 2) it has TuningFastGather and TuningAllowLight256Bit which allow more efficient auto-vectorization, 3) it has similar tunings to ryzen.

such as certain emulators like Rosetta.

Rosetta does not support AVX at all1; it can only run polars-lts-cpu.

BTW, I'm still wondering what the purpose of the macOS branch is:

elif [[ "$IS_MACOS" = true ]]; then
    FEATURES=+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+avx,+fma,+pclmulqdq

I cannot find a cpu model that support fma but not avx2.

Footnotes

  1. https://developer.apple.com/documentation/apple-silicon/about-the-rosetta-translation-environment#What-Cant-Be-Translated

@ruihe774 ruihe774 changed the title perf(python): Use standard x86 microarch levels instead of features perf(python): Specify tune-cpu & add more features Jul 17, 2024
@ruihe774
Copy link
Contributor Author

tune-cpu is set to skylake for polars and x86-64-v2 for polars-lts-cpu; movbe and cmpxchg16b are added.

Copy link

codecov bot commented Jul 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.66%. Comparing base (f304a0c) to head (624a6b2).
Report is 27 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17615      +/-   ##
==========================================
+ Coverage   80.64%   80.66%   +0.01%     
==========================================
  Files        1484     1490       +6     
  Lines      195509   195931     +422     
  Branches     2780     2789       +9     
==========================================
+ Hits       157675   158043     +368     
- Misses      37324    37375      +51     
- Partials      510      513       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@orlp
Copy link
Collaborator

orlp commented Jul 17, 2024

@ruihe774 We talked internally and it seems no one can remember the reason why that MacOS check was there. So let's just get rid of it and we'll see if it turns out to be a regression. If so we can always fix it again and place a comment why it's so then :)

@orlp orlp merged commit 8367381 into pola-rs:main Jul 17, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants