Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ELF] Making cdsort default for function reordering #68638

Merged
merged 1 commit into from
Oct 10, 2023

Conversation

spupyrev
Copy link
Contributor

@spupyrev spupyrev commented Oct 9, 2023

CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.

High-level perf impact on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%]

More detailed measurements on the clang binary is at here

@spupyrev spupyrev force-pushed the cdsort-linker-default branch 4 times, most recently from f8fc1d0 to 00b33de Compare October 10, 2023 01:37
@spupyrev spupyrev requested a review from MaskRay October 10, 2023 02:24
@spupyrev spupyrev marked this pull request as ready for review October 10, 2023 02:24
@llvmbot
Copy link
Member

llvmbot commented Oct 10, 2023

@llvm/pr-subscribers-lld-elf

@llvm/pr-subscribers-lld

Author: None (spupyrev)

Changes

CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.

High-level perf impact on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%]

More detailed measurements on the clang binary is at [here](https://reviews.llvm.org/D152834#4445042)


Full diff: https://github.com/llvm/llvm-project/pull/68638.diff

8 Files Affected:

  • (modified) lld/ELF/Driver.cpp (+1-1)
  • (modified) lld/docs/ld.lld.1 (+2-2)
  • (modified) lld/test/ELF/cgprofile-bad-clusters.s (+1-1)
  • (modified) lld/test/ELF/cgprofile-icf.s (+2-2)
  • (modified) lld/test/ELF/cgprofile-print.s (+1-4)
  • (modified) lld/test/ELF/cgprofile-rela.test (+1-1)
  • (modified) lld/test/ELF/cgprofile-reproduce.s (+1-4)
  • (modified) lld/test/ELF/cgprofile-txt.s (+3-3)
diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 6272276e94b2d35..e2100d00d54ede6 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -1095,7 +1095,7 @@ static void ltoValidateAllVtablesHaveTypeInfos(opt::InputArgList &args) {
 }
 
 static CGProfileSortKind getCGProfileSortKind(opt::InputArgList &args) {
-  StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "hfsort");
+  StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "cdsort");
   if (s == "hfsort")
     return CGProfileSortKind::Hfsort;
   if (s == "cdsort")
diff --git a/lld/docs/ld.lld.1 b/lld/docs/ld.lld.1
index 2e46fc18132f3e0..12b17dd37796d13 100644
--- a/lld/docs/ld.lld.1
+++ b/lld/docs/ld.lld.1
@@ -128,9 +128,9 @@ may be:
 .It Cm none
 Ignore call graph profile.
 .It Cm hfsort
-Use hfsort (default).
+Use hfsort.
 .It Cm cdsort
-Use cdsort.
+Use cdsort (default).
 .El
 .Pp
 .It Fl -color-diagnostics Ns = Ns Ar value
diff --git a/lld/test/ELF/cgprofile-bad-clusters.s b/lld/test/ELF/cgprofile-bad-clusters.s
index c162e981acdd633..88e68bfb7b2c0a1 100644
--- a/lld/test/ELF/cgprofile-bad-clusters.s
+++ b/lld/test/ELF/cgprofile-bad-clusters.s
@@ -10,7 +10,7 @@
 # RUN: echo "F G 6" >> %t.call_graph
 # RUN: echo "G H 5" >> %t.call_graph
 # RUN: echo "H I 4" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort  -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
 
     .section    .text.A,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-icf.s b/lld/test/ELF/cgprofile-icf.s
index a9de5613917cbf1..e28630d0eb30bf0 100644
--- a/lld/test/ELF/cgprofile-icf.s
+++ b/lld/test/ELF/cgprofile-icf.s
@@ -5,9 +5,9 @@
 # RUN: echo "A B 100" > %t.call_graph
 # RUN: echo "A C 40" >> %t.call_graph
 # RUN: echo "C D 61" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t.out -icf=all
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t.out -icf=all
 # RUN: llvm-readobj --symbols %t.out | FileCheck %s
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2.out
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2.out
 # RUN: llvm-readobj --symbols %t2.out | FileCheck %s --check-prefix=NOICF
 
     .section    .text.D,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-print.s b/lld/test/ELF/cgprofile-print.s
index b103ef5109effbb..d6f15f2926b5729 100644
--- a/lld/test/ELF/cgprofile-print.s
+++ b/lld/test/ELF/cgprofile-print.s
@@ -5,7 +5,7 @@
 # RUN: echo "B C 50" >> %t.call_graph
 # RUN: echo "C D 40" >> %t.call_graph
 # RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
 # RUN: FileCheck %s --input-file %t3
 
 # CHECK: B
@@ -32,6 +32,3 @@ C:
 .globl  D
 D:
  nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-rela.test b/lld/test/ELF/cgprofile-rela.test
index 189f169e65481ee..141dfd4c65b1ea1 100644
--- a/lld/test/ELF/cgprofile-rela.test
+++ b/lld/test/ELF/cgprofile-rela.test
@@ -3,7 +3,7 @@
 # REQUIRES: x86
 
 # RUN: yaml2obj %s -o %t.o
-# RUN: ld.lld %t.o -o %t
+# RUN: ld.lld --call-graph-profile-sort=hfsort %t.o -o %t
 # RUN: llvm-nm --no-sort %t | FileCheck %s
 # RUN: ld.lld --no-call-graph-profile-sort %t.o -o %t
 # RUN: llvm-nm --no-sort %t | FileCheck %s --check-prefix=NO-CG
diff --git a/lld/test/ELF/cgprofile-reproduce.s b/lld/test/ELF/cgprofile-reproduce.s
index b9cb269e4580d78..1b1b36151da99d1 100644
--- a/lld/test/ELF/cgprofile-reproduce.s
+++ b/lld/test/ELF/cgprofile-reproduce.s
@@ -5,7 +5,7 @@
 # RUN: echo "B C 50" >> %t.call_graph
 # RUN: echo "C D 40" >> %t.call_graph
 # RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
 # RUN: ld.lld -e A %t --symbol-ordering-file %t3 -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
 
@@ -37,6 +37,3 @@ C:
 .globl  D
 D:
  nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-txt.s b/lld/test/ELF/cgprofile-txt.s
index c9194bbbc43cbe0..cf5b17627cfb63c 100644
--- a/lld/test/ELF/cgprofile-txt.s
+++ b/lld/test/ELF/cgprofile-txt.s
@@ -26,12 +26,12 @@
 # RUN: echo "TooManyPreds10 TooManyPreds 11" >> %t.call_graph
 # RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
-## --call-graph-profile-sort=hfsort is the default.
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
-# RUN: cmp %t2 %t2b
 
 # RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=cdsort -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s --check-prefix=CDSORT
+## --call-graph-profile-sort=cdsort is the default.
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
+# RUN: cmp %t2 %t2b
 
 # RUN: not ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=sort \
 # RUN:   -o /dev/null 2>&1 | FileCheck %s --check-prefix=UNKNOWN

@@ -10,7 +10,7 @@
# RUN: echo "F G 6" >> %t.call_graph
# RUN: echo "G H 5" >> %t.call_graph
# RUN: echo "H I 4" >> %t.call_graph
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excessive spaces before -o

@spupyrev spupyrev force-pushed the cdsort-linker-default branch from 00b33de to 86e91d4 Compare October 10, 2023 15:41
@spupyrev spupyrev force-pushed the cdsort-linker-default branch from 86e91d4 to 8fb3c96 Compare October 10, 2023 16:04
@spupyrev spupyrev merged commit d5c1d73 into llvm:main Oct 10, 2023
@spupyrev spupyrev deleted the cdsort-linker-default branch October 10, 2023 16:09
@WenleiHe
Copy link
Member

If cdsort is better, after some baking period we should probably remove other alternative call graph layout algorithm (hfsort) in the code base just to keep things simple.

@MaskRay
Copy link
Member

MaskRay commented Oct 13, 2023

If cdsort is better, after some baking period we should probably remove other alternative call graph layout algorithm (hfsort) in the code base just to keep things simple.

I am afraid we cannot remove hfsort as the high time complexity of cache-directed sort is a real issue.
In a not-so-large executable, CodeLayout.cpp took 20+ minutes, which is not acceptable. I'm sorry but I'll revert this patch as the cdsort default is too risky now. I'm happy to make more investigation and try making CodeLayout.cpp faster.

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown
   2.8%  12.7Mi    .debug_rnglists
   2.0%  8.88Mi    .rela.dyn
   1.3%  6.09Mi    .symtab
   1.3%  5.72Mi    .rodata
   1.2%  5.58Mi    [54 Others]

The work amount can be measured by the total size of BestSrcChain->Edges and BestDstChain->Edges (repeatedly counted) in llvm/lib/Transforms/Utils/CodeLayout.cpp:mergeChainPairs. After 7 minutes, the total size has increased to 18182463.

MaskRay added a commit that referenced this pull request Oct 13, 2023


The high time complexity of cache-directed sort is a real issue and is not
appropriate as the default, at least for now
(#68638 (comment)).
@spupyrev
Copy link
Contributor Author

Can we apply the same optimization for cdsort as in #68617? (ie limiting the size of the max chain to 512). Block reordering becomes 50x faster on some instances after the change, without sacrificing the quality. I am happy to test this too, if you share (some version of) the problematic call graph.
Btw, cdsort is on by default in BOLT for ~5 years now. Likely something is off with the call graphs coming from the compiler.

@MaskRay
Copy link
Member

MaskRay commented Oct 13, 2023

Can we apply the same optimization for cdsort as in #68617? (ie limiting the size of the max chain to 512). Block reordering becomes 50x faster on some instances after the change, without sacrificing the quality. I am happy to test this too, if you share (some version of) the problematic call graph. Btw, cdsort is on by default in BOLT for ~5 years now. Likely something is off with the call graphs coming from the compiler.

Let me try... For the internal program that I saw a timeout (20+min for the total build on a machine that may be slower than my pc). On my pc:

% time timeout 660 zsh /t/0.sh --call-graph-profile-sort=cdsort 
timeout 660 zsh /t/0.sh --call-graph-profile-sort=cdsort  548.21s user 3.63s system 101% cpu 9:02.64 total

While lld with --call-graph-profile-sort=hfsort takes just 7 seconds:

% time zsh /t/0.sh --call-graph-profile-sort=hfsort
zsh /t/0.sh --call-graph-profile-sort=hfsort  14.37s user 2.80s system 238% cpu 7.202 total

MaskRay added a commit to MaskRay/llvm-project that referenced this pull request Oct 19, 2023
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638).
```
   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown
```

Add cl option `cds-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

* -mllvm  -cds-max-chain-size=64: 1.321813
* -mllvm -cds-max-chain-size=128: 2.030425
* -mllvm -cds-max-chain-size=256: 2.927684
* -mllvm -cds-max-chain-size=512: 5.493106
* unlimited: 9 minutes

The rest part takes 6.8s.
MaskRay added a commit that referenced this pull request Oct 22, 2023
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
```
   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown
```

Add cl option `cdsort-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort
--time-trace"
builds, the "Total Sort sections" time is measured as follows:

* -mllvm  -cdsort-max-chain-size=64: 1.321813
* -mllvm -cdsort-max-chain-size=128: 2.030425
* -mllvm -cdsort-max-chain-size=256: 2.927684
* -mllvm -cdsort-max-chain-size=512: 5.493106
* unlimited: 9 minutes

The rest part takes 6.8s.
@spupyrev
Copy link
Contributor Author

Given the latest build time improvements of the algorithm, any concerns to re-enable this default? @MaskRay

@MaskRay
Copy link
Member

MaskRay commented Nov 3, 2023

Given the latest build time improvements of the algorithm, any concerns to re-enable this default? @MaskRay

Yes, you can update the patch to try again... Make sure to update lld/ELF/Options.td (default: hfsort) to cdsort as well.

spupyrev added a commit that referenced this pull request Nov 3, 2023
Edited lld/ELF/Options.td to cdsort as well

CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.

High-level **perf impact** on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in
[0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in
[0.8%..1.5%]

More detailed measurements on the clang binary is at
[here](https://reviews.llvm.org/D152834#4445042)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants