
[ARM64] Vector128<T>.ExtractMostSignificantBits() + TrailingZeroCount has better alternative than MoveMask emulation #87735

Closed
neon-sunset opened this issue Jun 17, 2023 · 2 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
tenet-performance (Performance related issue)

Comments

@neon-sunset
Contributor

neon-sunset commented Jun 17, 2023

Description

As of now, the BitOperations.TrailingZeroCount(vector.ExtractMostSignificantBits()) pattern used across CoreLib for finding an element index produces sse2neon-style movemask emulation on ARM64.

However, there is a better way, recommended by ARM, to achieve the same outcome.

Regression?

No

Data

Current assembly produced by

public static uint IndexOf(Vector128<byte> vec, byte value)
{
    var mask = Vector128.Equals(vec, Vector128.Create(value));
    return (uint)BitOperations.TrailingZeroCount(
        mask.ExtractMostSignificantBits());
}
G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0008H
            uxtb    w0, w0
            dup     v16.16b, w0
            cmeq    v16.16b, v0.16b, v16.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            mov     v17.16b, v16.16b
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            ext     v16.16b, v16.16b, v16.16b, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1,  LSL #8
            rbit    w0, w0
            clz     w0, w0
 
G_M000_IG03:                ;; offset=0048H
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
RWD00   dq      8080808080808080h, 8080808080808080h
RWD16   dq      00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h

; Total bytes of code 80
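For context, the two ldr q17 loads above fetch the RWD00 and RWD16 constants that drive the emulation: the 0x80 mask isolates each byte's sign bit, the per-byte ushl (with negative shift counts) moves each of those bits into a distinct position, and the addv/ext/orr sequence folds the two halves into the final 16-bit mask.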

This code can be manually rewritten to

public static uint IndexOfOptimized(Vector128<byte> vec, byte value)
{
    // Byte-wise compare yields 0xFF per matching lane; reinterpret as 8 ushort lanes.
    var mask = AdvSimd.CompareEqual(vec, Vector128.Create(value)).AsUInt16();
    // SHRN by 4 narrows each 16-bit lane to 8 bits, leaving 4 mask bits per original byte.
    var res = AdvSimd.ShiftRightLogicalNarrowingLower(mask, 4);
    var matches = res.AsUInt64().ToScalar();
    // Each byte lane contributes 4 bits, so the bit index is divided by 4.
    return (uint)BitOperations.TrailingZeroCount(matches) >> 2;
}

which is more compact and avoids the two vector constant loads:

G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0008H
            uxtb    w0, w0
            dup     v16.16b, w0
            cmeq    v16.16b, v0.16b, v16.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            lsr     w0, w0, #2
 
G_M000_IG03:                ;; offset=0028H
            ldp     fp, lr, [sp], #0x10
            ret     lr

; Total bytes of code 48
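For reference, here is a minimal sketch of how the optimized kernel could be used in a full span scan. The helper name FirstIndexOf and the scalar tail are illustrative only, not existing CoreLib code, and it assumes AdvSimd.IsSupported plus the .NET 7+ Vector128.LoadUnsafe API:

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static int FirstIndexOf(ReadOnlySpan<byte> data, byte value)
{
    ref byte start = ref MemoryMarshal.GetReference(data);
    int i = 0;

    // Vector loop: 16 bytes at a time using the SHRN-based mask.
    for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
    {
        var vec = Vector128.LoadUnsafe(ref start, (nuint)i);
        var eq = AdvSimd.CompareEqual(vec, Vector128.Create(value)).AsUInt16();
        var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq, 4).AsUInt64().ToScalar();
        if (matches != 0)
        {
            // 4 mask bits per byte lane, so divide the bit index by 4.
            return i + (BitOperations.TrailingZeroCount(matches) >> 2);
        }
    }

    // Scalar tail for the remaining bytes.
    for (; i < data.Length; i++)
    {
        if (data[i] == value)
            return i;
    }

    return -1;
}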

Analysis

Given the prevalence of .ExtractMostSignificantBits() in vectorized code and the trend towards using cross-platform vector methods whenever possible, it would be ideal if the idiom described above were recognized and optimized for ARM64 by the compiler.
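In the meantime, one possible workaround is for library code to branch on AdvSimd.IsSupported itself. A minimal sketch, reusing the namespaces from the snippets above; FirstMatchIndex is a hypothetical helper, and it assumes the caller has already checked that at least one lane matched:

static int FirstMatchIndex(Vector128<byte> eqMask)
{
    if (AdvSimd.IsSupported)
    {
        // SHRN-based path: 4 mask bits per byte lane, so divide the bit index by 4.
        ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqMask.AsUInt16(), 4)
                               .AsUInt64().ToScalar();
        return BitOperations.TrailingZeroCount(matches) >> 2;
    }

    // Cross-platform fallback for other targets (movemask + tzcnt).
    return BitOperations.TrailingZeroCount(eqMask.ExtractMostSignificantBits());
}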

@neon-sunset neon-sunset added the tenet-performance Performance related issue label Jun 17, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 17, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 17, 2023
@ghost

ghost commented Jun 17, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo
Member

EgorBo commented Jun 18, 2023

Duplicates #76047

@EgorBo EgorBo closed this as completed Jun 18, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jun 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 19, 2023