
[ARM64] Vector128<T>.ExtractMostSignificantBits() + TrailingZeroCount has better alternative than MoveMask emulation #87735

Closed
neon-sunset opened this issue Jun 17, 2023 · 2 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
tenet-performance (Performance related issue)

Comments

@neon-sunset
Contributor

neon-sunset commented Jun 17, 2023

Description

As of now, the BitOperations.TrailingZeroCount(vector.ExtractMostSignificantBits()) pattern used across CoreLib for finding an element index produces sse2neon-style movemask emulation on ARM64.

However, there is a better way, recommended by ARM, to achieve the same outcome.

Regression?

No

Data

Current assembly produced by

public static uint IndexOf(Vector128<byte> vec, byte value)
{
    var mask = Vector128.Equals(vec, Vector128.Create(value));
    return (uint)BitOperations.TrailingZeroCount(
        mask.ExtractMostSignificantBits());
}
G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0008H
            uxtb    w0, w0
            dup     v16.16b, w0
            cmeq    v16.16b, v0.16b, v16.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            mov     v17.16b, v16.16b
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            ext     v16.16b, v16.16b, v16.16b, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1,  LSL #8
            rbit    w0, w0
            clz     w0, w0
 
G_M000_IG03:                ;; offset=0048H
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
RWD00   dq      8080808080808080h, 8080808080808080h
RWD16   dq      00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h

; Total bytes of code 80
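For context, the two ldr q17 loads above fetch the RWD00 and RWD16 constants that drive the emulation: the 0x80 mask isolates each byte's sign bit, the per-byte ushl (with negative shift counts) moves each of those bits into a distinct position, and the addv/ext/orr sequence folds the two halves into the final 16-bit mask.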

This code can be manually rewritten to

public static uint IndexOfOptimized(Vector128<byte> vec, byte value)
{
    // Byte-wise compare yields 0xFF per matching lane; reinterpret as 8 ushort lanes.
    var mask = AdvSimd.CompareEqual(vec, Vector128.Create(value)).AsUInt16();
    // SHRN by 4 narrows each 16-bit lane to 8 bits, leaving 4 mask bits per original byte.
    var res = AdvSimd.ShiftRightLogicalNarrowingLower(mask, 4);
    var matches = res.AsUInt64().ToScalar();
    // Each byte lane contributes 4 bits, so the bit index is divided by 4.
    return (uint)BitOperations.TrailingZeroCount(matches) >> 2;
}

which is more compact and avoids the two vector constant loads:

G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0008H
            uxtb    w0, w0
            dup     v16.16b, w0
            cmeq    v16.16b, v0.16b, v16.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            lsr     w0, w0, #2
 
G_M000_IG03:                ;; offset=0028H
            ldp     fp, lr, [sp], #0x10
            ret     lr

; Total bytes of code 48
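For reference, here is a minimal sketch of how the optimized kernel could be used in a full span scan. The helper name FirstIndexOf and the scalar tail are illustrative only, not existing CoreLib code, and it assumes AdvSimd.IsSupported plus the .NET 7+ Vector128.LoadUnsafe API:

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static int FirstIndexOf(ReadOnlySpan<byte> data, byte value)
{
    ref byte start = ref MemoryMarshal.GetReference(data);
    int i = 0;

    // Vector loop: 16 bytes at a time using the SHRN-based mask.
    for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
    {
        var vec = Vector128.LoadUnsafe(ref start, (nuint)i);
        var eq = AdvSimd.CompareEqual(vec, Vector128.Create(value)).AsUInt16();
        var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq, 4).AsUInt64().ToScalar();
        if (matches != 0)
        {
            // 4 mask bits per byte lane, so divide the bit index by 4.
            return i + (BitOperations.TrailingZeroCount(matches) >> 2);
        }
    }

    // Scalar tail for the remaining bytes.
    for (; i < data.Length; i++)
    {
        if (data[i] == value)
            return i;
    }

    return -1;
}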

Analysis

Given the prevalence of .ExtractMostSignificantBits() in vectorized code and the trend towards using cross-platform vector methods whenever possible, it would be ideal if the idiom described above were recognized and optimized for ARM64 by the compiler.
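In the meantime, one possible workaround is for library code to branch on AdvSimd.IsSupported itself. A minimal sketch, reusing the namespaces from the snippets above; FirstMatchIndex is a hypothetical helper, and it assumes the caller has already checked that at least one lane matched:

static int FirstMatchIndex(Vector128<byte> eqMask)
{
    if (AdvSimd.IsSupported)
    {
        // SHRN-based path: 4 mask bits per byte lane, so divide the bit index by 4.
        ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqMask.AsUInt16(), 4)
                               .AsUInt64().ToScalar();
        return BitOperations.TrailingZeroCount(matches) >> 2;
    }

    // Cross-platform fallback for other targets (movemask + tzcnt).
    return BitOperations.TrailingZeroCount(eqMask.ExtractMostSignificantBits());
}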

@neon-sunset neon-sunset added the tenet-performance Performance related issue label Jun 17, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 17, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 17, 2023
@ghost

ghost commented Jun 17, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo
Member

EgorBo commented Jun 18, 2023

Duplicates #76047

@EgorBo EgorBo closed this as completed Jun 18, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jun 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 19, 2023