perf(python): Specify tune-cpu & add more features #17615

ruihe774 · 2024-07-13T11:54:29Z

This PR makes use of standard x86 microarchitecture levels¹ for target-cpu instead of manually specified target-features when building the release wheels. This brings the following benefits:

Consistent with the standard: Many Linux distros, e.g. openSUSE² and CachyOS³, use x86 microarchitecture levels when compiling optimized versions of packages. Using standard levels aligns better with downstream packaging.
Reduced maintenance burden: No need to define and update⁴ a list of features. Each level includes all of the features it supports.
Performance improvement: Setting target-cpu implies tune-cpu, while setting target-features does not. E.g., target-cpu=x86-64-v3 enables tuning for v3 CPU models⁵.

This PR uses:

x86-64-v3 for normal release; it includes sse, avx, avx2, etc.
x86-64-v2 for lts-cpu release; it includes sse, etc.
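
As a rough illustration of the level-based approach described above, the sketch below maps each release variant to a single target-cpu value. The helper names `level_for_variant` and `level_rustflags` are hypothetical and are not taken from the project's build scripts; only the `-C target-cpu` flag and the two level names come from the discussion.

```shell
# Hypothetical helper: map a release variant to a microarchitecture level.
# The variant names mirror the two wheels discussed in this PR.
level_for_variant() {
    case "$1" in
        polars)         printf '%s\n' "x86-64-v3" ;;  # normal release
        polars-lts-cpu) printf '%s\n' "x86-64-v2" ;;  # lts-cpu release
        *)              printf '%s\n' "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Level-based flags: one target-cpu value replaces the whole feature list.
level_rustflags() {
    level=$(level_for_variant "$1") || return 1
    printf '%s\n' "-C target-cpu=$level"
}

level_rustflags polars
level_rustflags polars-lts-cpu
```

The output of `level_rustflags` would be exported as `RUSTFLAGS` before building the wheel.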

The set of supported CPU models (i.e. compatibility) is unchanged for both releases.

The special configuration for macOS is removed. I cannot find a CPU model that supports fma but not avx2. Please correct me if I'm wrong.

This PR does not affect _cpu_check.py. A list of features is still given to it.

XXX:

Is it worth also setting tune-cpu? x86-64-v3 is tuned for a fairly old CPU model (Haswell). We may set target-cpu and tune-cpu to different values, e.g. -C target-cpu=x86-64-v3 -Z tune-cpu=skylake.
It is not necessary to check every feature in _cpu_check.py, e.g. a CPU that has avx also has sse. We can simplify its logic and only check a sentinel feature.

ritchie46 · 2024-07-15T07:04:56Z

@orlp can you take a look here?

orlp · 2024-07-15T11:08:15Z

I'd prefer to keep using a manual list of features, on an as-needed basis. This also makes it easy to keep the CPU check module up-to-date.

Another example of a pitfall of x86-64-v3 is that your PR would have silently disabled pclmulqdq.

ruihe774 · 2024-07-15T11:36:16Z

Another example of a pitfall of x86-64-v3 is that your PR would have silently disabled pclmulqdq.

Sorry, I did not notice that.

I'd prefer to keep using a manual list of features, on an as-needed basis. This also makes it easy to keep the CPU check module up-to-date.

I want to argue that the compiler can implicitly make use of arch features. An "as-needed basis" is sub-optimal for the compiler. For example, if x86-64-v3 is specified, the compiler can use movbe and f16c, which are present in all AVX2-equipped CPU models and are not specified in the current "manual list of features".
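
To make the trade-off between the two approaches concrete, here is a minimal sketch that compares a manual feature list against a microarchitecture level. Both lists are assumptions for illustration: `MANUAL` is a made-up list in the style of the build script, and `LEVEL_V3` is an abbreviated rendering of x86-64-v3 using rustc feature names. The helper `only_in_first` is hypothetical.

```shell
# Hypothetical helper: print the entries of comma-separated list $1
# that do not appear in comma-separated list $2.
only_in_first() {
    first=$(mktemp)
    second=$(mktemp)
    # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
    printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort > "$first"
    printf '%s\n' "$2" | tr ',' '\n' | LC_ALL=C sort > "$second"
    LC_ALL=C comm -23 "$first" "$second"
    rm -f "$first" "$second"
}

# Illustrative lists only (assumptions, not the project's actual configuration).
MANUAL=sse3,ssse3,sse4.1,sse4.2,popcnt,avx,avx2,fma,bmi1,bmi2,lzcnt,pclmulqdq
LEVEL_V3=sse3,ssse3,sse4.1,sse4.2,popcnt,cmpxchg16b,avx,avx2,fma,bmi1,bmi2,lzcnt,movbe,f16c

# Features that switching to the level would drop (here: pclmulqdq).
only_in_first "$MANUAL" "$LEVEL_V3"

# Features the level would add beyond the manual list.
only_in_first "$LEVEL_V3" "$MANUAL"
```

With these inputs the first call surfaces the pclmulqdq pitfall raised earlier, and the second lists the extra instructions the compiler would be allowed to use.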

Another point, as I've mentioned, is the tunings. Specifying a manual list of features does not turn on the tunings for the target CPU; the tunings for a generic CPU are used instead. Tunings in x86-64-v3 like TuningFastScalarFSQRT, TuningFastVariableCrossLaneShuffle, TuningFastVariablePerLaneShuffle, etc., are not enabled. This prevents the compiler from e.g. using a native sqrt implementation, so it falls back to software Newton-Raphson.

Despite the aforementioned benefits of using microarch levels, if a manual list of features is still preferred, IMO we can at least complete the list with all features that the target CPU supports, and manually specify a -Z tune-cpu. @orlp Would you plz reconsider it?

orlp · 2024-07-16T13:36:18Z

@ruihe774 I would be open to adding more features if it means we don't exclude any CPUs we support now, and adding a reasonable tuning target. movbe sounds useful, f16c I wouldn't bother with since we don't use that.

I'm not very open to relaxing the CPU check. It's not that expensive and it means users get a proper explained error rather than a segfault. Also please consider that there are 'CPUs' which support weird combinations of feature sets, such as certain emulators like Rosetta.
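
The idea of checking every required feature and failing with an explained error can be sketched as follows. This is not the logic of _cpu_check.py; the function name `check_cpu`, the error wording, and the flag lists are hypothetical, and feature names are assumed to be already normalized to one spelling.

```shell
# Hypothetical check: report missing features instead of crashing later.
# $1: comma-separated required features
# $2: space-separated features the CPU (or emulator) reports
check_cpu() {
    missing=""
    old_ifs=$IFS
    IFS=,
    for feature in $1; do
        case " $2 " in
            *" $feature "*) ;;                  # feature is present
            *) missing="$missing $feature" ;;   # record what is absent
        esac
    done
    IFS=$old_ifs
    if [ -n "$missing" ]; then
        printf '%s\n' "CPU is missing required features:$missing" >&2
        return 1
    fi
}

# An environment without AVX (made-up flag list) fails the check for a
# build that requires avx and fma, and passes for a baseline build.
REPORTED="sse3 ssse3 sse4.1 sse4.2 popcnt"
check_cpu "sse4.2,popcnt,avx,fma" "$REPORTED" || printf '%s\n' "refusing to load this build"
check_cpu "sse4.2,popcnt" "$REPORTED" && printf '%s\n' "baseline build is fine"
```

Checking each feature individually, rather than a single sentinel, is what lets the error name exactly what is missing on environments with unusual feature combinations.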

ruihe774 · 2024-07-17T00:08:24Z

@ruihe774 I would be open to adding more features if it means we don't exclude any CPUs we support now, and adding a reasonable tuning target. movbe sounds useful, f16c I wouldn't bother with since we don't use that.

OK. I'm going to work on this.

BTW what is the "reasonable tuning target" in your opinion? IMO I would choose tune-cpu=skylake because 1) skylake and later cpu models are widely used while prior cpu models are rare now, 2) it has TuningFastGather and TuningAllowLight256Bit which allow more efficient auto-vectorization, 3) it has similar tunings to ryzen.

such as certain emulators like Rosetta.

Rosetta does not support AVX at all¹; it can only run polars-lts-cpu.

BTW, I'm still wondering what the purpose of the macOS branch is:

elif [[ "$IS_MACOS" = true ]]; then
    FEATURES=+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+avx,+fma,+pclmulqdq

I cannot find a CPU model that supports fma but not avx2.

¹ https://developer.apple.com/documentation/apple-silicon/about-the-rosetta-translation-environment#What-Cant-Be-Translated

ruihe774 · 2024-07-17T01:44:14Z

tune-cpu is set to skylake for polars and x86-64-v2 for polars-lts-cpu; movbe and cmpxchg16b are added.
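
A minimal sketch of what combining an explicit feature list with a separate tuning target might look like. The helper `build_rustflags` is hypothetical and the feature list is abbreviated for illustration; it is not the list used in the release workflow. Note that `-Z` flags are unstable and require a nightly toolchain.

```shell
# Hypothetical helper: build RUSTFLAGS from an explicit feature list
# ($1) plus a tuning target ($2).
build_rustflags() {
    printf '%s\n' "-C target-feature=$1 -Z tune-cpu=$2"
}

# Illustrative, abbreviated feature list (an assumption, not the real one).
FEATURES=+avx,+avx2,+fma,+movbe

build_rustflags "$FEATURES" skylake
```

Keeping the features explicit preserves control over which CPUs are supported, while the tuning target only influences scheduling and instruction selection heuristics.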

codecov · 2024-07-17T02:09:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.66%. Comparing base (f304a0c) to head (624a6b2).
Report is 27 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #17615      +/-   ##
==========================================
+ Coverage   80.64%   80.66%   +0.01%     
==========================================
  Files        1484     1490       +6     
  Lines      195509   195931     +422     
  Branches     2780     2789       +9     
==========================================
+ Hits       157675   158043     +368     
- Misses      37324    37375      +51     
- Partials      510      513       +3


orlp · 2024-07-17T12:55:35Z

@ruihe774 We talked internally and it seems no one can remember the reason why that MacOS check was there. So let's just get rid of it and we'll see if it turns out to be a regression. If so we can always fix it again and place a comment why it's so then :)

ruihe774 requested review from ritchie46, stinodego and c-peters as code owners July 13, 2024 11:54

github-actions bot added performance Performance issues or improvements python Related to Python Polars labels Jul 13, 2024

orlp closed this Jul 15, 2024

orlp reopened this Jul 16, 2024

ruihe774 added 3 commits July 17, 2024 09:17

perf(python): specify tune-cpu

9fc12fd

perf(python): add feature +movbe

d7f828c

perf(python): add feature +cmpxchg16b

624a6b2

ruihe774 force-pushed the x86-features branch from d1dff6d to 624a6b2 Compare July 17, 2024 01:41

ruihe774 requested review from alexander-beedie, MarcoGorelli and reswqa as code owners July 17, 2024 01:41

ruihe774 changed the title ~~perf(python): Use standard x86 microarch levels instead of features~~ perf(python): Specify tune-cpu & add more features Jul 17, 2024

orlp approved these changes Jul 17, 2024


orlp merged commit 8367381 into pola-rs:main Jul 17, 2024
15 checks passed
