gh-90536: Add support for the BOLT post-link binary optimizer #95908

Merged
merged 11 commits into from
Aug 18, 2022

Conversation

kmod
Copy link
Contributor

@kmod kmod commented Aug 11, 2022

Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6
sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior, and the benchmarks in the pyperformance
suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.
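
For anyone trying this locally, here is a minimal sketch of a pre-flight check before configuring with `--enable-bolt`. It only looks for the two BOLT programs on `PATH`. The name `merge-fdata` is an assumption for the tool behind the `@MERGE_FDATA@` Makefile placeholder, `missing_tools` is a hypothetical helper, and the configure flags are the ones used for the benchmark runs in this thread rather than a statement of what is strictly required.

```python
import shutil
from typing import Callable, List, Optional

# llvm-bolt is named in the PR description; merge-fdata is assumed to be the
# program substituted for @MERGE_FDATA@ in the Makefile.
REQUIRED_TOOLS = ("llvm-bolt", "merge-fdata")

# Flags copied from the benchmark configuration quoted in this thread.
CONFIGURE_ARGS = ["./configure", "--enable-optimizations", "--with-lto", "--enable-bolt"]


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the required BOLT tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("cannot use --enable-bolt, missing:", ", ".join(missing))
    else:
        print("ready to run:", " ".join(CONFIGURE_ARGS))
```

The lookup function is injectable so the check can be exercised without the tools installed.
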
@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@gvanrossum
Copy link
Member

Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.

@pablogsal
Copy link
Member

and maybe @pablogsal will be willing to backport it to 3.11.

Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.

@corona10 corona10 self-requested a review August 11, 2022 23:28
@corona10
Copy link
Member

Nice work! I will take a look at this PR by this weekend

@corona10 corona10 changed the title Add support for the BOLT post-link binary optimizer gh-90536: Add support for the BOLT post-link binary optimizer Aug 11, 2022
@corona10

This comment was marked as resolved.

Copy link
Member

@corona10 corona10 left a comment

Two things need to be checked.

  • I failed to build the binary with this patch. This may be due to a BOLT bug, but I would like to know which BOLT version you used. -> solved
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
 #0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
 #4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
 #5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
 #6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
 #7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
 #8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
 #9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
#10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
#12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
  • While profiling, I hit a test failure. Could you check that the optimized binary passes the full standard Python test suite (e.g. python -m test)? I ran into a related issue with the previous attempt, and it was solved by profiling through python -m test. -> solved
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c

Total duration: 1 min 21 sec
Tests result: FAILURE

I will share further investigation into this patch.
FYI, this is my environment.

- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0

@corona10
Copy link
Member

Hmm, I will try to build BOLT from LLVM 14.0.6

@corona10
Copy link
Member

corona10 commented Aug 13, 2022

I found out why BOLT failed; I will downgrade gcc to version 10.


DWARF 5 has become the default in GCC 11
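
A minimal sketch of how a build script might detect this situation up front, assuming the only signal needed is the GCC major version. The two sample strings are the compiler banners quoted in this thread; `gcc_major_version` and `defaults_to_dwarf5` are hypothetical helper names. Whether forcing an older debug format (for example with GCC's `-gdwarf-4`) avoids the crash was not tested here, so treat that as an untested idea rather than a fix.

```python
import re
from typing import Optional


def gcc_major_version(version_line: str) -> Optional[int]:
    """Extract the major version from the first line of `gcc --version`."""
    # Drop parenthesised vendor strings such as "(Ubuntu 11.2.0-19ubuntu1)",
    # which contain version-like text of their own.
    stripped = re.sub(r"\([^)]*\)", "", version_line)
    match = re.search(r"(\d+)\.\d+\.\d+", stripped)
    return int(match.group(1)) if match else None


def defaults_to_dwarf5(version_line: str) -> bool:
    """True when the compiler is GCC 11 or newer (DWARF 5 by default)."""
    major = gcc_major_version(version_line)
    return major is not None and major >= 11


if __name__ == "__main__":
    for line in (
        "gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0",
        "gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)",
    ):
        print(line, "->", "DWARF 5 default" if defaults_to_dwarf5(line) else "older default")
```
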

Copy link
Member

@corona10 corona10 left a comment

Thanks for the work! The whole pipeline works correctly.

Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst too, I will update the whats new if you are too busy)
But please emphasize that this feature is experimental optimization support.

I am going to measure the performance improvement soon with pyperformance, along with the L1 i-cache miss ratio.

It looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.1[1-2] yet, right? Please let me know if I am wrong.

Also, please add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@corona10
Copy link
Member

corona10 commented Aug 13, 2022

@gvanrossum @kmod cc @markshannon

Interesting result!
The following benchmark was measured on AWS c5n.metal with gcc-10 (base commit: f235178).
I would also like to re-measure the benchmark on the Faster CPython project machine.
I will measure the L1 i-cache miss ratio soon, on a machine where the perf tool is available.

| Benchmark | CPython 3.12 `./configure --enable-optimizations --with-lto` | CPython 3.12 `./configure --enable-optimizations --with-lto --enable-bolt` |
|---|---|---|
| 2to3 | 269 ms | 255 ms: 1.05x faster |
| chameleon | 7.39 ms | 7.02 ms: 1.05x faster |
| chaos | 74.1 ms | 68.8 ms: 1.08x faster |
| crypto_pyaes | 82.3 ms | 77.2 ms: 1.07x faster |
| deltablue | 3.65 ms | 3.41 ms: 1.07x faster |
| django_template | 38.6 ms | 35.3 ms: 1.09x faster |
| dulwich_log | 67.6 ms | 58.7 ms: 1.15x faster |
| fannkuch | 385 ms | 380 ms: 1.02x faster |
| float | 73.2 ms | 72.4 ms: 1.01x faster |
| genshi_text | 24.3 ms | 23.3 ms: 1.04x faster |
| genshi_xml | 56.4 ms | 52.8 ms: 1.07x faster |
| go | 140 ms | 136 ms: 1.03x faster |
| hexiom | 6.40 ms | 6.25 ms: 1.02x faster |
| html5lib | 65.0 ms | 60.7 ms: 1.07x faster |
| json_dumps | 11.1 ms | 10.4 ms: 1.07x faster |
| json_loads | 28.7 us | 26.3 us: 1.09x faster |
| logging_format | 7.29 us | 6.69 us: 1.09x faster |
| logging_silent | 101 ns | 97.6 ns: 1.03x faster |
| logging_simple | 6.48 us | 6.01 us: 1.08x faster |
| mako | 10.6 ms | 9.91 ms: 1.07x faster |
| meteor_contest | 106 ms | 102 ms: 1.04x faster |
| nbody | 86.4 ms | 87.7 ms: 1.02x slower |
| nqueens | 91.3 ms | 88.1 ms: 1.04x faster |
| pathlib | 19.0 ms | 16.8 ms: 1.13x faster |
| pickle_dict | 32.2 us | 32.6 us: 1.01x slower |
| pickle_list | 4.69 us | 4.62 us: 1.02x faster |
| pickle_pure_python | 297 us | 282 us: 1.05x faster |
| pidigits | 177 ms | 176 ms: 1.01x faster |
| pyflate | 423 ms | 416 ms: 1.02x faster |
| python_startup | 8.72 ms | 8.15 ms: 1.07x faster |
| python_startup_no_site | 6.35 ms | 5.97 ms: 1.06x faster |
| raytrace | 312 ms | 293 ms: 1.06x faster |
| regex_compile | 139 ms | 131 ms: 1.06x faster |
| regex_dna | 180 ms | 185 ms: 1.03x slower |
| regex_effbot | 2.99 ms | 2.82 ms: 1.06x faster |
| regex_v8 | 21.4 ms | 20.4 ms: 1.05x faster |
| richards | 48.6 ms | 46.3 ms: 1.05x faster |
| scimark_fft | 348 ms | 338 ms: 1.03x faster |
| scimark_lu | 120 ms | 117 ms: 1.02x faster |
| scimark_monte_carlo | 67.0 ms | 65.4 ms: 1.02x faster |
| scimark_sor | 116 ms | 113 ms: 1.02x faster |
| spectral_norm | 101 ms | 102 ms: 1.01x slower |
| sqlalchemy_declarative | 143 ms | 135 ms: 1.06x faster |
| sqlalchemy_imperative | 19.0 ms | 17.0 ms: 1.12x faster |
| sqlite_synth | 2.50 us | 2.29 us: 1.09x faster |
| sympy_expand | 507 ms | 465 ms: 1.09x faster |
| sympy_integrate | 21.7 ms | 20.5 ms: 1.06x faster |
| sympy_sum | 176 ms | 164 ms: 1.08x faster |
| sympy_str | 311 ms | 286 ms: 1.09x faster |
| telco | 7.02 ms | 6.36 ms: 1.10x faster |
| tornado_http | 125 ms | 113 ms: 1.10x faster |
| unpickle | 15.7 us | 15.1 us: 1.04x faster |
| unpickle_list | 4.74 us | 4.56 us: 1.04x faster |
| unpickle_pure_python | 229 us | 219 us: 1.05x faster |
| xml_etree_parse | 158 ms | 155 ms: 1.02x faster |
| xml_etree_iterparse | 103 ms | 101 ms: 1.02x faster |
| xml_etree_generate | 91.0 ms | 84.3 ms: 1.08x faster |
| xml_etree_process | 61.9 ms | 58.4 ms: 1.06x faster |
| Geometric mean | (ref) | 1.05x faster |

Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
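
To make the summary row easier to interpret, here is a minimal sketch of how per-benchmark speedups and their geometric mean can be computed. It uses only four rows copied from the table above, so its result will not match the 1.05x figure, which covers the whole suite. The helper names are hypothetical and this is not the pyperformance implementation.

```python
import math
from typing import Dict, Iterable, Tuple

# (baseline ms, BOLT ms) for a few rows copied from the table above.
SAMPLE: Dict[str, Tuple[float, float]] = {
    "dulwich_log": (67.6, 58.7),
    "pathlib": (19.0, 16.8),
    "telco": (7.02, 6.36),
    "nbody": (86.4, 87.7),
}


def speedup(baseline: float, candidate: float) -> float:
    """Ratio above 1 means the candidate build is faster."""
    return baseline / candidate


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean computed as the exponential of the mean logarithm."""
    items = list(values)
    return math.exp(sum(math.log(v) for v in items) / len(items))


if __name__ == "__main__":
    ratios = {name: speedup(*pair) for name, pair in SAMPLE.items()}
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x")
    print(f"geometric mean of this subset: {geometric_mean(ratios.values()):.2f}x")
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 2x slowdown then cancel out exactly.
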

@corona10 corona10 self-assigned this Aug 13, 2022
@corona10
Copy link
Member

corona10 commented Aug 14, 2022

Another benchmark, from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0):
https://gist.github.com/corona10/c2aa0108a5ffcc96be449c0ce033412d

But let's measure the benchmark on the Faster CPython machine after the PR is merged.

Makefile.pre.in Outdated
@@ -640,6 +640,15 @@ profile-opt: profile-run-stamp
-rm -f profile-clean-stamp
$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

bolt-opt: @PREBOLT_RULE@
rm -f *.fdata
@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
Copy link
Member

Suggested change
@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst

Makefile.pre.in Outdated
@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
Copy link
Member

Suggested change
@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
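
For readers following the `bolt-opt` rule, the sequence can be sketched as a list of shell commands: instrument the binary, run the profiling task, merge the per-process profiles, then rewrite the binary using the merged profile. This is a sketch under assumptions, not the Makefile itself: `bolt_steps` is a hypothetical helper, `llvm-bolt` and `merge-fdata` stand in for the `@LLVM_BOLT@` and `@MERGE_FDATA@` substitutions, and the profiling task is the one shown in the log earlier in this thread. The commands are only built as strings, nothing is executed.

```python
import os
from typing import List

# Optimization flags copied from the bolt-opt rule under review.
BOLT_OPT_FLAGS = [
    "-update-debug-sections", "-reorder-blocks=ext-tsp",
    "-reorder-functions=hfsort+", "-split-functions=3", "-icf=1",
    "-inline-all", "-split-eh", "-reorder-functions-use-hot-size",
    "-peepholes=all", "-jump-tables=aggressive", "-inline-ap",
    "-indirect-call-promotion=all", "-dyno-stats", "-use-gnu-stack",
    "-frame-opt=hot",
]


def bolt_steps(
    binary: str = "python",
    profile_task: str = "-m test --pgo --timeout=1200",
    llvm_bolt: str = "llvm-bolt",      # assumed value of @LLVM_BOLT@
    merge_fdata: str = "merge-fdata",  # assumed value of @MERGE_FDATA@
) -> List[str]:
    """Return the shell commands of the instrument/profile/merge/optimize flow."""
    profile_prefix = os.path.abspath(binary + ".bolt")
    instrumented = binary + ".bolt_inst"
    return [
        # 1. Remove stale profiles from earlier runs.
        "rm -f *.fdata",
        # 2. Emit an instrumented copy; each process writes its own .fdata file.
        f"{llvm_bolt} ./{binary} -instrument -instrumentation-file-append-pid "
        f"-instrumentation-file={profile_prefix} -o {instrumented}",
        # 3. Run the training workload; failures are tolerated, as in the rule.
        f"./{instrumented} {profile_task} || true",
        # 4. Merge the per-process profiles into one file.
        f"{merge_fdata} {binary}.*.fdata > {binary}.fdata",
        # 5. Rewrite the original binary using the merged profile.
        f"{llvm_bolt} ./{binary} -o {binary}.bolt -data={binary}.fdata "
        + " ".join(BOLT_OPT_FLAGS),
    ]


if __name__ == "__main__":
    for step in bolt_steps():
        print(step)
```
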

@corona10
Copy link
Member

corona10 commented Aug 15, 2022

I managed to collect cache-miss data, and I also got a pyperformance result that is similar to my previous attempts and Kevin's report.
I have not analyzed whether the GCC version or OS version affects the performance result.
But I can conclude that BOLT definitely makes CPython faster.

Environment

  • Hardware: AWS c5n.metal
  • Red Hat Enterprise Linux release 8.6 (Ootpa)
  • gcc: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
  • LLVM version 14.0.6

Binary Size

  • Without BOLT: 79M
  • With BOLT: 36M

ICache miss

| Experiment | instructions | L1-icache-misses | ratio |
|---|---|---|---|
| PGO + LTO | 8,330,863,079,932 | 77,047,357,163 | 0.92% |
| PGO + LTO + BOLT | 8,312,698,165,975 | 65,319,225,064 | 0.79% |
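
The ratio column appears to be L1 i-cache misses divided by retired instructions. A minimal sketch that recomputes it from the counters above, and also derives the relative drop in misses (a figure not reported in the table), with hypothetical helper names:

```python
# (instructions, L1-icache-misses) copied from the table above.
RUNS = {
    "PGO + LTO": (8_330_863_079_932, 77_047_357_163),
    "PGO + LTO + BOLT": (8_312_698_165_975, 65_319_225_064),
}


def miss_ratio_percent(instructions: int, misses: int) -> float:
    """L1 i-cache misses per instruction, as a percentage."""
    return 100.0 * misses / instructions


def relative_reduction_percent(before: float, after: float) -> float:
    """How much smaller `after` is than `before`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, (instructions, misses) in RUNS.items():
        print(f"{name}: {miss_ratio_percent(instructions, misses):.2f}%")
    base_misses = RUNS["PGO + LTO"][1]
    bolt_misses = RUNS["PGO + LTO + BOLT"][1]
    drop = relative_reduction_percent(base_misses, bolt_misses)
    print(f"absolute L1 i-cache misses reduced by {drop:.1f}%")
```
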

Benchmark (1.01x faster)

https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5

@corona10
Copy link
Member

@aaupov I would recommend opening an issue for your suggestions on https://github.com/faster-cpython/ideas or https://github.com/python/cpython/issues. I think the faster-cpython repo is the more appropriate place :)

vmarkovtsev added a commit to athenianco/athenian-api that referenced this pull request Nov 16, 2022
@osevan
Copy link

osevan commented Dec 18, 2022

Another question, and an important point about performance tuning.

GCC PGO and Clang PGO are different: GCC's PGO instrumentation (profile-generate) can collect more detailed profile data than Clang's profile-generate.

So it would be nice to add new flags for a GCC toolchain:

--enable-lto-gcc --enable-pgo-gcc, taking into account the reorder flag that BOLT needs at the GCC level, plus bolting with the Clang/LLVM tools

And one toolchain entirely in Clang:
--enable-lto-llvm --enable-pgo-llvm plus bolt

Thank you very much

vmarkovtsev added a commit to athenianco/athenian-api that referenced this pull request Dec 20, 2022