Redundant clearing of top-32 bit before certain calls to PDEP #442

damageboy · 2019-12-02T16:44:01Z

Repro Repo:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep

Relevant piece of code:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/Program.cs#L40-L47

            var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
            var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
            var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
            var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
            var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
            var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
            var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
            var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);

Generated asm:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/listing.asm#L56-L87

;             var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
00007F06A0BA0833 8BFF                 mov     edi,edi
00007F06A0BA0835 49BB0707070707070707 mov     r11,707070707070707h
00007F06A0BA083F C4C2C3F5FB           pdep    rdi,rdi,r11
;             var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
00007F06A0BA0844 8BC0                 mov     eax,eax
00007F06A0BA0846 C4C2FBF5C3           pdep    rax,rax,r11
;             var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
00007F06A0BA084B 8BD2                 mov     edx,edx
00007F06A0BA084D C4C2EBF5D3           pdep    rdx,rdx,r11
;             var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
00007F06A0BA0852 8BF6                 mov     esi,esi
00007F06A0BA0854 C4C2CBF5F3           pdep    rsi,rsi,r11
;             var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
00007F06A0BA0859 458BC0               mov     r8d,r8d
00007F06A0BA085C C442BBF5C3           pdep    r8,r8,r11
;             var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
00007F06A0BA0861 8BC9                 mov     ecx,ecx
00007F06A0BA0863 C4C2F3F5CB           pdep    rcx,rcx,r11
;             var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
00007F06A0BA0868 458BD2               mov     r10d,r10d
00007F06A0BA086B C442ABF5D3           pdep    r10,r10,r11
;             var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
00007F06A0BA0870 458BC9               mov     r9d,r9d
00007F06A0BA0873 C442B3F5CB           pdep    r9,r9,r11

Issue

In the asm listing, we can clearly see that the upper 32 bits of the source register for PDEP is cleared as part of the implicit cast to ulong.
While this is a sensible behaviour, as PDEP might end up reading from those bits, for cases such as the above, where PDEP is supplied with a constant mask, which "happens" to never read a single bit past the first 32 bits anyway, clearing these top bits seems redundant:

From the Intel docs:

PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in
the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from
the first source operand and deposit them in the destination operand at the corresponding bit locations that are set
in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.

In other words, the amount of toggled bits in the mask controls how many low-order bits are read from the source.

As such, a constant mask with 24 bits toggled in this case: (0x0707070707070707 -> each 0x7 is 0b111 x 8 == 24 bits) means that not a single bit from the upper 32 bits will ever be read by this instruction, so clearing those bits as part of the cast is meaningless.

For this case, it would shave off the 8 mov reg,reg instructions that are currently emmited to clear those bits.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 2, 2019

jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 2, 2019

BruceForstall added optimization and removed untriaged New issue has not been triaged by the area owner labels Dec 2, 2019

BruceForstall added this to the Future milestone Dec 2, 2019

damageboy mentioned this issue Dec 3, 2019

Flaky CSE around pdep mask #477

Open

briansull self-assigned this Jan 30, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

MichalStrehovsky pushed a commit to MichalStrehovsky/runtime that referenced this issue Mar 25, 2021

Fix VS version in error message (dotnet#442)

724e1b7

BruceForstall unassigned briansull Oct 30, 2022

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redundant clearing of top-32 bit before certain calls to PDEP #442

Redundant clearing of top-32 bit before certain calls to PDEP #442

damageboy commented Dec 2, 2019 •

edited by BruceForstall

Loading

Redundant clearing of top-32 bit before certain calls to PDEP #442

Redundant clearing of top-32 bit before certain calls to PDEP #442

Comments

damageboy commented Dec 2, 2019 • edited by BruceForstall Loading

Repro Repo:

Relevant piece of code:

Generated asm:

Issue

damageboy commented Dec 2, 2019 •

edited by BruceForstall

Loading