[ELF] Making cdsort default for function reordering #68638

spupyrev · 2023-10-09T22:00:04Z

CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.

High-level perf impact on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%]

More detailed measurements on the clang binary is at here

llvmbot · 2023-10-10T02:25:53Z

@llvm/pr-subscribers-lld-elf

@llvm/pr-subscribers-lld

Author: None (spupyrev)

Changes

CDSort function reordering outperforms the existing default heuristic (
hfsort/C^3) in terms of the performance of generated binaries while
being (almost) as fast. Thus, the suggestion is to change the default.
The speedup is up to 1.5% perf for large front-end binaries, and can be
moderate/neutral for "small" benchmarks.

High-level perf impact on two selected binaries:
clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%]
rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%]

More detailed measurements on the clang binary is at [here](https://reviews.llvm.org/D152834#4445042)

Full diff: https://github.com/llvm/llvm-project/pull/68638.diff

8 Files Affected:

(modified) lld/ELF/Driver.cpp (+1-1)
(modified) lld/docs/ld.lld.1 (+2-2)
(modified) lld/test/ELF/cgprofile-bad-clusters.s (+1-1)
(modified) lld/test/ELF/cgprofile-icf.s (+2-2)
(modified) lld/test/ELF/cgprofile-print.s (+1-4)
(modified) lld/test/ELF/cgprofile-rela.test (+1-1)
(modified) lld/test/ELF/cgprofile-reproduce.s (+1-4)
(modified) lld/test/ELF/cgprofile-txt.s (+3-3)

diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 6272276e94b2d35..e2100d00d54ede6 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -1095,7 +1095,7 @@ static void ltoValidateAllVtablesHaveTypeInfos(opt::InputArgList &args) {
 }
 
 static CGProfileSortKind getCGProfileSortKind(opt::InputArgList &args) {
-  StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "hfsort");
+  StringRef s = args.getLastArgValue(OPT_call_graph_profile_sort, "cdsort");
   if (s == "hfsort")
     return CGProfileSortKind::Hfsort;
   if (s == "cdsort")
diff --git a/lld/docs/ld.lld.1 b/lld/docs/ld.lld.1
index 2e46fc18132f3e0..12b17dd37796d13 100644
--- a/lld/docs/ld.lld.1
+++ b/lld/docs/ld.lld.1
@@ -128,9 +128,9 @@ may be:
 .It Cm none
 Ignore call graph profile.
 .It Cm hfsort
-Use hfsort (default).
+Use hfsort.
 .It Cm cdsort
-Use cdsort.
+Use cdsort (default).
 .El
 .Pp
 .It Fl -color-diagnostics Ns = Ns Ar value
diff --git a/lld/test/ELF/cgprofile-bad-clusters.s b/lld/test/ELF/cgprofile-bad-clusters.s
index c162e981acdd633..88e68bfb7b2c0a1 100644
--- a/lld/test/ELF/cgprofile-bad-clusters.s
+++ b/lld/test/ELF/cgprofile-bad-clusters.s
@@ -10,7 +10,7 @@
 # RUN: echo "F G 6" >> %t.call_graph
 # RUN: echo "G H 5" >> %t.call_graph
 # RUN: echo "H I 4" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort  -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
 
     .section    .text.A,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-icf.s b/lld/test/ELF/cgprofile-icf.s
index a9de5613917cbf1..e28630d0eb30bf0 100644
--- a/lld/test/ELF/cgprofile-icf.s
+++ b/lld/test/ELF/cgprofile-icf.s
@@ -5,9 +5,9 @@
 # RUN: echo "A B 100" > %t.call_graph
 # RUN: echo "A C 40" >> %t.call_graph
 # RUN: echo "C D 61" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t.out -icf=all
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t.out -icf=all
 # RUN: llvm-readobj --symbols %t.out | FileCheck %s
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2.out
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2.out
 # RUN: llvm-readobj --symbols %t2.out | FileCheck %s --check-prefix=NOICF
 
     .section    .text.D,"ax",@progbits
diff --git a/lld/test/ELF/cgprofile-print.s b/lld/test/ELF/cgprofile-print.s
index b103ef5109effbb..d6f15f2926b5729 100644
--- a/lld/test/ELF/cgprofile-print.s
+++ b/lld/test/ELF/cgprofile-print.s
@@ -5,7 +5,7 @@
 # RUN: echo "B C 50" >> %t.call_graph
 # RUN: echo "C D 40" >> %t.call_graph
 # RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
 # RUN: FileCheck %s --input-file %t3
 
 # CHECK: B
@@ -32,6 +32,3 @@ C:
 .globl  D
 D:
  nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-rela.test b/lld/test/ELF/cgprofile-rela.test
index 189f169e65481ee..141dfd4c65b1ea1 100644
--- a/lld/test/ELF/cgprofile-rela.test
+++ b/lld/test/ELF/cgprofile-rela.test
@@ -3,7 +3,7 @@
 # REQUIRES: x86
 
 # RUN: yaml2obj %s -o %t.o
-# RUN: ld.lld %t.o -o %t
+# RUN: ld.lld --call-graph-profile-sort=hfsort %t.o -o %t
 # RUN: llvm-nm --no-sort %t | FileCheck %s
 # RUN: ld.lld --no-call-graph-profile-sort %t.o -o %t
 # RUN: llvm-nm --no-sort %t | FileCheck %s --check-prefix=NO-CG
diff --git a/lld/test/ELF/cgprofile-reproduce.s b/lld/test/ELF/cgprofile-reproduce.s
index b9cb269e4580d78..1b1b36151da99d1 100644
--- a/lld/test/ELF/cgprofile-reproduce.s
+++ b/lld/test/ELF/cgprofile-reproduce.s
@@ -5,7 +5,7 @@
 # RUN: echo "B C 50" >> %t.call_graph
 # RUN: echo "C D 40" >> %t.call_graph
 # RUN: echo "D B 10" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --print-symbol-order=%t3
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2 --call-graph-profile-sort=hfsort --print-symbol-order=%t3
 # RUN: ld.lld -e A %t --symbol-ordering-file %t3 -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
 
@@ -37,6 +37,3 @@ C:
 .globl  D
 D:
  nop
-
-
-
diff --git a/lld/test/ELF/cgprofile-txt.s b/lld/test/ELF/cgprofile-txt.s
index c9194bbbc43cbe0..cf5b17627cfb63c 100644
--- a/lld/test/ELF/cgprofile-txt.s
+++ b/lld/test/ELF/cgprofile-txt.s
@@ -26,12 +26,12 @@
 # RUN: echo "TooManyPreds10 TooManyPreds 11" >> %t.call_graph
 # RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s
-## --call-graph-profile-sort=hfsort is the default.
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
-# RUN: cmp %t2 %t2b
 
 # RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=cdsort -o %t2
 # RUN: llvm-readobj --symbols %t2 | FileCheck %s --check-prefix=CDSORT
+## --call-graph-profile-sort=cdsort is the default.
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2b
+# RUN: cmp %t2 %t2b
 
 # RUN: not ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=sort \
 # RUN:   -o /dev/null 2>&1 | FileCheck %s --check-prefix=UNKNOWN

MaskRay · 2023-10-10T06:48:22Z

lld/test/ELF/cgprofile-bad-clusters.s

@@ -10,7 +10,7 @@
 # RUN: echo "F G 6" >> %t.call_graph
 # RUN: echo "G H 5" >> %t.call_graph
 # RUN: echo "H I 4" >> %t.call_graph
-# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph --call-graph-profile-sort=hfsort  -o %t2


excessive spaces before -o

WenleiHe · 2023-10-12T05:18:42Z

If cdsort is better, after some baking period we should probably remove other alternative call graph layout algorithm (hfsort) in the code base just to keep things simple.

MaskRay · 2023-10-13T05:50:57Z

If cdsort is better, after some baking period we should probably remove other alternative call graph layout algorithm (hfsort) in the code base just to keep things simple.

I am afraid we cannot remove hfsort as the high time complexity of cache-directed sort is a real issue.
In a not-so-large executable, CodeLayout.cpp took 20+ minutes, which is not acceptable. I'm sorry but I'll revert this patch as the cdsort default is too risky now. I'm happy to make more investigation and try making CodeLayout.cpp faster.

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown
   2.8%  12.7Mi    .debug_rnglists
   2.0%  8.88Mi    .rela.dyn
   1.3%  6.09Mi    .symtab
   1.3%  5.72Mi    .rodata
   1.2%  5.58Mi    [54 Others]

The work amount can be measured by the total size of BestSrcChain->Edges and BestDstChain->Edges (repeatedly counted) in llvm/lib/Transforms/Utils/CodeLayout.cpp:mergeChainPairs. After 7 minutes, the total size has increased to 18182463.

The high time complexity of cache-directed sort is a real issue and is not appropriate as the default, at least for now (#68638 (comment)).

spupyrev · 2023-10-13T16:43:45Z

Can we apply the same optimization for cdsort as in #68617? (ie limiting the size of the max chain to 512). Block reordering becomes 50x faster on some instances after the change, without sacrificing the quality. I am happy to test this too, if you share (some version of) the problematic call graph.
Btw, cdsort is on by default in BOLT for ~5 years now. Likely something is off with the call graphs coming from the compiler.

MaskRay · 2023-10-13T22:09:34Z

Can we apply the same optimization for cdsort as in #68617? (ie limiting the size of the max chain to 512). Block reordering becomes 50x faster on some instances after the change, without sacrificing the quality. I am happy to test this too, if you share (some version of) the problematic call graph. Btw, cdsort is on by default in BOLT for ~5 years now. Likely something is off with the call graphs coming from the compiler.

Let me try... For the internal program that I saw a timeout (20+min for the total build on a machine that may be slower than my pc). On my pc:

% time timeout 660 zsh /t/0.sh --call-graph-profile-sort=cdsort 
timeout 660 zsh /t/0.sh --call-graph-profile-sort=cdsort  548.21s user 3.63s system 101% cpu 9:02.64 total

While lld with --call-graph-profile-sort=hfsort takes just 7 seconds:

% time zsh /t/0.sh --call-graph-profile-sort=hfsort
zsh /t/0.sh --call-graph-profile-sort=hfsort  14.37s user 2.80s system 238% cpu 7.202 total

When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cds-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cds-max-chain-size=64: 1.321813 * -mllvm -cds-max-chain-size=128: 2.030425 * -mllvm -cds-max-chain-size=256: 2.927684 * -mllvm -cds-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.

When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cdsort-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cdsort-max-chain-size=64: 1.321813 * -mllvm -cdsort-max-chain-size=128: 2.030425 * -mllvm -cdsort-max-chain-size=256: 2.927684 * -mllvm -cdsort-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.

spupyrev · 2023-10-30T14:38:23Z

Given the latest build time improvements of the algorithm, any concerns to re-enable this default? @MaskRay

MaskRay · 2023-11-03T07:15:56Z

Given the latest build time improvements of the algorithm, any concerns to re-enable this default? @MaskRay

Yes, you can update the patch to try again... Make sure to update lld/ELF/Options.td (default: hfsort) to cdsort as well.

Edited lld/ELF/Options.td to cdsort as well CDSort function reordering outperforms the existing default heuristic ( hfsort/C^3) in terms of the performance of generated binaries while being (almost) as fast. Thus, the suggestion is to change the default. The speedup is up to 1.5% perf for large front-end binaries, and can be moderate/neutral for "small" benchmarks. High-level **perf impact** on two selected binaries: clang-10 binary (built with LTO+AutoFDO/CSSPGO): wins on top of C^3 in [0.3%..0.8%] rocksDB-8 binary (built with LTO+CSSPGO): wins on top of C^3 in [0.8%..1.5%] More detailed measurements on the clang binary is at [here](https://reviews.llvm.org/D152834#4445042)

spupyrev force-pushed the cdsort-linker-default branch 4 times, most recently from f8fc1d0 to 00b33de Compare October 10, 2023 01:37

spupyrev requested a review from MaskRay October 10, 2023 02:24

spupyrev marked this pull request as ready for review October 10, 2023 02:24

llvmbot added lld lld:ELF labels Oct 10, 2023

MaskRay approved these changes Oct 10, 2023

View reviewed changes

spupyrev force-pushed the cdsort-linker-default branch from 00b33de to 86e91d4 Compare October 10, 2023 15:41

[ELF] Making cdsort default for function reordering

8fb3c96

spupyrev force-pushed the cdsort-linker-default branch from 86e91d4 to 8fb3c96 Compare October 10, 2023 16:04

spupyrev merged commit d5c1d73 into llvm:main Oct 10, 2023

spupyrev deleted the cdsort-linker-default branch October 10, 2023 16:09

MaskRay added a commit that referenced this pull request Oct 13, 2023

[ELF] Restore the --call-graph-profile-sort=hfsort default before #68638

60b3e05

The high time complexity of cache-directed sort is a real issue and is not appropriate as the default, at least for now (#68638 (comment)).

MaskRay mentioned this pull request Oct 14, 2023

[CodeLayout] cache-directed sort: limit max chain size #69039

Merged

madhur13490 mentioned this pull request Oct 20, 2023

Revert commit ba8565fbcb975e2d067ce3ae5a7dbaae4953edd3 madhur13490/llvm-project#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ELF] Making cdsort default for function reordering #68638

[ELF] Making cdsort default for function reordering #68638

spupyrev commented Oct 9, 2023 •

edited

Loading

llvmbot commented Oct 10, 2023 •

edited

Loading

MaskRay Oct 10, 2023

WenleiHe commented Oct 12, 2023

MaskRay commented Oct 13, 2023 •

edited

Loading

spupyrev commented Oct 13, 2023

MaskRay commented Oct 13, 2023 •

edited

Loading

spupyrev commented Oct 30, 2023

MaskRay commented Nov 3, 2023

[ELF] Making cdsort default for function reordering #68638

[ELF] Making cdsort default for function reordering #68638

Conversation

spupyrev commented Oct 9, 2023 • edited Loading

llvmbot commented Oct 10, 2023 • edited Loading

MaskRay Oct 10, 2023

Choose a reason for hiding this comment

WenleiHe commented Oct 12, 2023

MaskRay commented Oct 13, 2023 • edited Loading

spupyrev commented Oct 13, 2023

MaskRay commented Oct 13, 2023 • edited Loading

spupyrev commented Oct 30, 2023

MaskRay commented Nov 3, 2023

spupyrev commented Oct 9, 2023 •

edited

Loading

llvmbot commented Oct 10, 2023 •

edited

Loading

MaskRay commented Oct 13, 2023 •

edited

Loading

MaskRay commented Oct 13, 2023 •

edited

Loading