Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant clearing of top-32 bit before certain calls to PDEP #442

Open
damageboy opened this issue Dec 2, 2019 · 0 comments
Open

Redundant clearing of top-32 bit before certain calls to PDEP #442

damageboy opened this issue Dec 2, 2019 · 0 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@damageboy
Copy link
Contributor

damageboy commented Dec 2, 2019

Repro Repo:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep

Relevant piece of code:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/Program.cs#L40-L47

            var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
            var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
            var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
            var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
            var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
            var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
            var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
            var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);

Generated asm:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/listing.asm#L56-L87

;             var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
00007F06A0BA0833 8BFF                 mov     edi,edi
00007F06A0BA0835 49BB0707070707070707 mov     r11,707070707070707h
00007F06A0BA083F C4C2C3F5FB           pdep    rdi,rdi,r11
;             var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
00007F06A0BA0844 8BC0                 mov     eax,eax
00007F06A0BA0846 C4C2FBF5C3           pdep    rax,rax,r11
;             var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
00007F06A0BA084B 8BD2                 mov     edx,edx
00007F06A0BA084D C4C2EBF5D3           pdep    rdx,rdx,r11
;             var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
00007F06A0BA0852 8BF6                 mov     esi,esi
00007F06A0BA0854 C4C2CBF5F3           pdep    rsi,rsi,r11
;             var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
00007F06A0BA0859 458BC0               mov     r8d,r8d
00007F06A0BA085C C442BBF5C3           pdep    r8,r8,r11
;             var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
00007F06A0BA0861 8BC9                 mov     ecx,ecx
00007F06A0BA0863 C4C2F3F5CB           pdep    rcx,rcx,r11
;             var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
00007F06A0BA0868 458BD2               mov     r10d,r10d
00007F06A0BA086B C442ABF5D3           pdep    r10,r10,r11
;             var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
00007F06A0BA0870 458BC9               mov     r9d,r9d
00007F06A0BA0873 C442B3F5CB           pdep    r9,r9,r11

Issue

In the asm listing, we can clearly see that the upper 32 bits of the source register for PDEP is cleared as part of the implicit cast to ulong.
While this is a sensible behaviour, as PDEP might end up reading from those bits, for cases such as the above, where PDEP is supplied with a constant mask, which "happens" to never read a single bit past the first 32 bits anyway, clearing these top bits seems redundant:

From the Intel docs:

PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in
the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from
the first source operand and deposit them in the destination operand at the corresponding bit locations that are set
in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.

In other words, the amount of toggled bits in the mask controls how many low-order bits are read from the source.

As such, a constant mask with 24 bits toggled in this case: (0x0707070707070707 -> each 0x7 is 0b111 x 8 == 24 bits) means that not a single bit from the upper 32 bits will ever be read by this instruction, so clearing those bits as part of the cast is meaningless.

For this case, it would shave off the 8 mov reg,reg instructions that are currently emmited to clear those bits.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 2, 2019
@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 2, 2019
@BruceForstall BruceForstall added optimization and removed untriaged New issue has not been triaged by the area owner labels Dec 2, 2019
@BruceForstall BruceForstall added this to the Future milestone Dec 2, 2019
@briansull briansull self-assigned this Jan 30, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
MichalStrehovsky pushed a commit to MichalStrehovsky/runtime that referenced this issue Mar 25, 2021
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

5 participants