-
Notifications
You must be signed in to change notification settings - Fork 12.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ELF] Making cdsort default for function reordering #68638
Conversation
f8fc1d0
to
00b33de
Compare
@llvm/pr-subscribers-lld-elf @llvm/pr-subscribers-lld Author: None (spupyrev) ChangesCDSort function reordering outperforms the existing default heuristic ( High-level perf impact on two selected binaries: More detailed measurements on the clang binary is at [here](https://reviews.llvm.org/D152834#4445042) Full diff: https://github.com/llvm/llvm-project/pull/68638.diff 8 Files Affected:
diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 6272276e94b2d35..e2100d00d54ede6 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -1095,7 +1095,7 @@ static void ltoValidateAllVtablesHaveTypeInfos(opt::InputArgList &args) {
}
static CGProfileSortKind getCGProfileSortKind(opt::InputArgList &args) {
- StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "hfsort");
+ StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "cdsort");
if (s == "hfsort")
return CGProfileSortKind::Hfsort;
if (s == "cdsort")
diff --git a/lld/docs/ld.lld.1 b/lld/docs/ld.lld.1
index 2e46fc18132f3e0..12b17dd37796d13 100644
--- a/lld/docs/ld.lld.1
+++ b/lld/docs/ld.lld.1
@@ -128,9 +128,9 @@ may be:
.It Cm none
Ignore call graph profile.
.It Cm hfsort
-Use hfsort (default).
+Use hfsort.
.It Cm cdsort
-Use cdsort.
+Use cdsort (default).
.El
.Pp
.It Fl -color-diagnostics Ns = Ns Ar value
diff --git a/lld/test/ELF/cgprofile-bad-clusters.s b/lld/test/ELF/cgprofile-bad-clusters.s
index c162e981acdd633..88e68bfb7b2c0a1 100644
--- a/lld/test/ELF/cgprofile-bad-clusters.s
+++ b/lld/test/ELF/cgprofile-bad-clusters.s
@@ -10,7 +10,7 @@
# RUN: echo "F G 6" >> %t.call_graph
# RUN: echo "G H 5" >> %t.call_graph
# RUN: echo "H I 4" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2
# RUN: llvm-readobj --symbols %t2 | FileCheck %s
.section .text.A,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-icf.s b/lld/test/ELF/cgprofile-icf.s
index a9de5613917cbf1..e28630d0eb30bf0 100644
--- a/lld/test/ELF/cgprofile-icf.s
+++ b/lld/test/ELF/cgprofile-icf.s
@@ -5,9 +5,9 @@
# RUN: echo "A B 100" > %t.call_graph
# RUN: echo "A C 40" >> %t.call_graph
# RUN: echo "C D 61" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t.out -icf=all
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t.out -icf=all
# RUN: llvm-readobj --symbols %t.out | FileCheck %s
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2.out
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2.out
# RUN: llvm-readobj --symbols %t2.out | FileCheck %s --check-prefix=NOICF
.section .text.D,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-print.s b/lld/test/ELF/cgprofile-print.s
index b103ef5109effbb..d6f15f2926b5729 100644
--- a/lld/test/ELF/cgprofile-print.s
+++ b/lld/test/ELF/cgprofile-print.s
@@ -5,7 +5,7 @@
# RUN: echo "B C 50" >> %t.call_graph
# RUN: echo "C D 40" >> %t.call_graph
# RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
# RUN: FileCheck %s --input-file %t3
# CHECK: B
@@ -32,6 +32,3 @@ C:
.globl D
D:
nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-rela.test b/lld/test/ELF/cgprofile-rela.test
index 189f169e65481ee..141dfd4c65b1ea1 100644
--- a/lld/test/ELF/cgprofile-rela.test
+++ b/lld/test/ELF/cgprofile-rela.test
@@ -3,7 +3,7 @@
# REQUIRES: x86
# RUN: yaml2obj %s -o %t.o
-# RUN: ld.lld %t.o -o %t
+# RUN: ld.lld --call-graph-profile-sort=hfsort %t.o -o %t
# RUN: llvm-nm --no-sort %t | FileCheck %s
# RUN: ld.lld --no-call-graph-profile-sort %t.o -o %t
# RUN: llvm-nm --no-sort %t | FileCheck %s --check-prefix=NO-CG
diff --git a/lld/test/ELF/cgprofile-reproduce.s b/lld/test/ELF/cgprofile-reproduce.s
index b9cb269e4580d78..1b1b36151da99d1 100644
--- a/lld/test/ELF/cgprofile-reproduce.s
+++ b/lld/test/ELF/cgprofile-reproduce.s
@@ -5,7 +5,7 @@
# RUN: echo "B C 50" >> %t.call_graph
# RUN: echo "C D 40" >> %t.call_graph
# RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
# RUN: ld.lld -e A %t --symbol-ordering-file %t3 -o %t2
# RUN: llvm-readobj --symbols %t2 | FileCheck %s
@@ -37,6 +37,3 @@ C:
.globl D
D:
nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-txt.s b/lld/test/ELF/cgprofile-txt.s
index c9194bbbc43cbe0..cf5b17627cfb63c 100644
--- a/lld/test/ELF/cgprofile-txt.s
+++ b/lld/test/ELF/cgprofile-txt.s
@@ -26,12 +26,12 @@
# RUN: echo "TooManyPreds10 TooManyPreds 11" >> %t.call_graph
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2
# RUN: llvm-readobj --symbols %t2 | FileCheck %s
-## --call-graph-profile-sort=hfsort is the default.
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
-# RUN: cmp %t2 %t2b
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=cdsort -o %t2
# RUN: llvm-readobj --symbols %t2 | FileCheck %s --check-prefix=CDSORT
+## --call-graph-profile-sort=cdsort is the default.
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
+# RUN: cmp %t2 %t2b
# RUN: not ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=sort \
# RUN: -o /dev/null 2>&1 | FileCheck %s --check-prefix=UNKNOWN
|
@@ -10,7 +10,7 @@ | |||
# RUN: echo "F G 6" >> %t.call_graph | |||
# RUN: echo "G H 5" >> %t.call_graph | |||
# RUN: echo "H I 4" >> %t.call_graph | |||
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 | |||
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excessive spaces before -o
00b33de
to
86e91d4
Compare
86e91d4
to
8fb3c96
Compare
If cdsort is better, after some baking period we should probably remove other alternative call graph layout algorithm (hfsort) in the code base just to keep things simple. |
I am afraid we cannot remove hfsort as the high time complexity of cache-directed sort is a real issue.
The work amount can be measured by the total size of |
The high time complexity of cache-directed sort is a real issue and is not appropriate as the default, at least for now (#68638 (comment)).
Can we apply the same optimization for cdsort as in #68617? (ie limiting the size of the max chain to 512). Block reordering becomes 50x faster on some instances after the change, without sacrificing the quality. I am happy to test this too, if you share (some version of) the problematic call graph. |
Let me try... For the internal program that I saw a timeout (20+min for the total build on a machine that may be slower than my pc). On my pc:
While lld with
|
When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cds-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cds-max-chain-size=64: 1.321813 * -mllvm -cds-max-chain-size=128: 2.030425 * -mllvm -cds-max-chain-size=256: 2.927684 * -mllvm -cds-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.
When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cdsort-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cdsort-max-chain-size=64: 1.321813 * -mllvm -cdsort-max-chain-size=128: 2.030425 * -mllvm -cdsort-max-chain-size=256: 2.927684 * -mllvm -cdsort-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.
Given the latest build time improvements of the algorithm, any concerns to re-enable this default? @MaskRay |
Yes, you can update the patch to try again... Make sure to update |
Edited lld/ELF/Options.td to cdsort as well CDSort function reordering outperforms the existing default heuristic ( hfsort/C^3) in terms of the performance of generated binaries while being (almost) as fast. Thus, the suggestion is to change the default. The speedup is up to 1.5% perf for large front-end binaries, and can be moderate/neutral for "small" benchmarks. High-level **perf impact** on two selected binaries: clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%] rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%] More detailed measurements on the clang binary is at [here](https://reviews.llvm.org/D152834#4445042)
CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.
High-level perf impact on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%]
More detailed measurements on the clang binary is at here