Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make LLVM better at using XMM registers to perform structure moves #35093

Open
pcwalton opened this issue Jul 28, 2016 · 13 comments
Open

Make LLVM better at using XMM registers to perform structure moves #35093

pcwalton opened this issue Jul 28, 2016 · 13 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such E-help-wanted Call for participation: Help is requested to fix this issue. I-compiletime Issue: Problems and improvements with respect to compile times. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_32 Target: x86 processors, 32 bit (like i686-*) O-x86_64 Target: x86-64 processors (like x86_64-*) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@pcwalton
Copy link
Contributor

Here's a small snippet of Servo code from block flow fragmentation:

servo[0x100bf974e] <+6638>:  mov    r10b, byte ptr [rbp - 0x729]
servo[0x100bf9755] <+6645>:  mov    r9, qword ptr [rbp - 0x728]
servo[0x100bf975c] <+6652>:  mov    rax, qword ptr [rbp - 0x720]
servo[0x100bf9763] <+6659>:  mov    rdi, qword ptr [rbp - 0x718]
servo[0x100bf976a] <+6666>:  mov    r8, qword ptr [rbp - 0x6c8]
servo[0x100bf9771] <+6673>:  mov    rdx, qword ptr [rbp - 0x710]
servo[0x100bf9778] <+6680>:  mov    qword ptr [rbp - 0x710], rdx
servo[0x100bf977f] <+6687>:  mov    qword ptr [r13 + 0x78], r8
servo[0x100bf9783] <+6691>:  mov    qword ptr [r13 + 0x80], rdi
servo[0x100bf978a] <+6698>:  mov    qword ptr [r13 + 0x88], rax
servo[0x100bf9791] <+6705>:  mov    qword ptr [r13 + 0x90], r9
servo[0x100bf9798] <+6712>:  mov    byte ptr [r13 + 0x98], r10b
servo[0x100bf979f] <+6719>:  mov    al, byte ptr [rbp - 0x34a]
servo[0x100bf97a5] <+6725>:  mov    byte ptr [r13 + 0x9f], al
servo[0x100bf97ac] <+6732>:  mov    ax, word ptr [rbp - 0x34c]
servo[0x100bf97b3] <+6739>:  mov    word ptr [r13 + 0x9d], ax
servo[0x100bf97bb] <+6747>:  mov    eax, dword ptr [rbp - 0x350]
servo[0x100bf97c1] <+6753>:  mov    dword ptr [r13 + 0x99], eax
servo[0x100bf97c8] <+6760>:  mov    eax, dword ptr [r13 + 0x15c]
servo[0x100bf97cf] <+6767>:  test   ah, 0x6
servo[0x100bf97d2] <+6770>:  je     0x100bf9949               ; <+7145>

I see this all over the place. It should be using XMM registers instead. This is bad because: (a) it clogs up the instruction stream; (b) it's an inefficient way to perform structure moves; (c) it kills tons of registers, resulting in spills elsewhere (notice rax, rdx, rdi, r8, r9, and r10 are all dead for no good reason); (d) it puts pressure on the register allocator, making compile times worse.

Is there some way to get LLVM to emit the right thing here?

cc @eddyb @brson

@pcwalton pcwalton added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation I-compiletime Issue: Problems and improvements with respect to compile times. A-optimization labels Jul 28, 2016
@brson brson added the E-help-wanted Call for participation: Help is requested to fix this issue. label Jul 28, 2016
@brson
Copy link
Contributor

brson commented Jul 28, 2016

Tagging this as help wanted because it's a significant (and kinda obvious) optimization. Unfortunately we don't know how to make LLVM do it. Will require some digging.

@eddyb
Copy link
Member

eddyb commented Jul 28, 2016

@brson It might be impossible in the general case if LLVM can't get some guarantees about bytes that are never touched by the individual copies (i.e. the padding bytes).
Locally, not reading those bytes can inform the optimizer that they are irrelevant, but if pointers to the values are used anywhere else, they have to be marked with the "padding holes" somehow.

@brson
Copy link
Contributor

brson commented Jul 28, 2016

It may also be undesirable in the general case for security reasons.

@pcwalton
Copy link
Contributor Author

@pcwalton
Copy link
Contributor Author

@eefriedman
Copy link
Contributor

servo[0x100bf9771] <+6673>:  mov    rdx, qword ptr [rbp - 0x710]
servo[0x100bf9778] <+6680>:  mov    qword ptr [rbp - 0x710], rdx

Are you sure this is a release build? That looks extremely suspicious.

@eddyb
Copy link
Member

eddyb commented Jul 28, 2016

Looking at the IR for <layout::block::BlockFlow as layout::flow::Flow>::fragment:

  • 718 alloca
    • 42 core::raw::TraitObject (you'd expect these to vanish)
  • 649 store
  • 917 load

Honestly, I have no idea where exactly those assembly instructions come from.
It might be possible to use debug info to correlate to the IR and see what exactly is causing them, but at this point, it just looks like a very large number of temporaries and loads/stores between them.

This part is a potential clue, but I can't find the 6 constant in the function's IR:

servo[0x100bf97c8] <+6760>:  mov    eax, dword ptr [r13 + 0x15c]
servo[0x100bf97cf] <+6767>:  test   ah, 0x6
servo[0x100bf97d2] <+6770>:  je     0x100bf9949               ; <+7145>

@eddyb
Copy link
Member

eddyb commented Jul 28, 2016

@pcwalton I convinced myself this is a release + debuginfo build, but that's not actually true, is it?

This explains where all the code comes from (#[inline(always)]): https://github.com/servo/servo/blob/dfc007e10e8f0815966682e768685f14e55164c2/components/layout/block.rs#L772.

@DemiMarie
Copy link
Contributor

Is this before or after optimization? mem2reg should have optimized this out (and should probably be the first pass run).

@eddyb
Copy link
Member

eddyb commented Jul 29, 2016

@DemiMarie @eefriedman Turns out the IR is from a debug build (my fault, forgot to add --release in the command I gave @pcwalton), but the assembly is from a release build.

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 25, 2017
@nox
Copy link
Contributor

nox commented Apr 2, 2018

What are the current options to teach LLVM about our padding holes? See also #17027 and #6736.

Cc @rust-lang/wg-codegen

@sunfishcode
Copy link
Member

!tbaa.struct is a way to describe padding holes when doing llvm.memcpy etc.

@steveklabnik
Copy link
Member

Triage: not aware of any changes here. This is a pretty abstract bug...

@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2020
@workingjubilee workingjubilee added O-x86 O-x86_64 Target: x86-64 processors (like x86_64-*) labels Feb 20, 2022
@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023
@Noratrieb Noratrieb added O-x86_32 Target: x86 processors, 32 bit (like i686-*) and removed O-x86-all labels Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such E-help-wanted Call for participation: Help is requested to fix this issue. I-compiletime Issue: Problems and improvements with respect to compile times. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_32 Target: x86 processors, 32 bit (like i686-*) O-x86_64 Target: x86-64 processors (like x86_64-*) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests