Please provide API and intrinsics for unmanaged memory operations (volatile and atomic operations) to be on par with managed memory operations #4209

Closed
zpodlovics opened this issue May 3, 2015 · 8 comments
Labels: api-needs-work, area-System.Runtime, enhancement

@zpodlovics

Please provide volatile and atomic (interlocked, including increment/decrement/add) operation intrinsics not only for managed memory but also for unmanaged memory (e.g. memory-mapped files, GCHandle.Alloc*, memory regions provided by external libraries/APIs, etc.).

It is important to provide not only the API but also intrinsics: the JIT engine should replace these calls with single CPU instructions.

Today, when high performance GPU devices, RDMA-capable network devices, NVMe-capable devices and zero-copy APIs are available to everybody, the .NET virtual machine should provide support for these high performance abstractions. These abstractions (hardware and/or libraries) expose user-space accessible buffers, command packet queues (ring buffers) and notification variables (doorbells). Sometimes the doorbell variable is just a volatile field in the command packet.

As far as I know the usual implementation looks like this: the high performance device and/or library provides one (or more) preallocated and mapped memory regions to the user-space application, together with one (or more) command queues (ring buffers) and notification variables (doorbells). The application writes the data directly (zero copy) into the provided buffer, creates a command queue packet, and writes to the doorbell variable (which could be a field in the new command packet).
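
For illustration, a minimal C# sketch of that pattern (a sketch only: Marshal.AllocHGlobal stands in for the device/library-mapped region, and the doorbell write uses the pointer-to-managed-ref cast discussed later in this thread):

using System;
using System.Runtime.InteropServices;
using System.Threading;

unsafe class DoorbellSketch
{
    static void Main()
    {
        // Stand-in for a device/library provided, pre-mapped region:
        // a payload buffer followed by a 4-byte doorbell word.
        IntPtr region = Marshal.AllocHGlobal(4096 + sizeof(int));
        byte* payload = (byte*)region;
        int* doorbell = (int*)(payload + 4096);

        // 1. Zero-copy write of the data directly into the mapped buffer.
        for (int i = 0; i < 64; i++)
            payload[i] = (byte)i;

        // 2. Notify: volatile write to the doorbell so the consumer observes
        //    the payload writes before it sees the notification.
        Volatile.Write(ref *doorbell, 1);

        Marshal.FreeHGlobal(region);
    }
}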

High performance applications also use similar zero-copy abstractions (using memory-mapped files for buffers, command queues and variables) for communication between threads and/or processes.

However, accessing these high performance abstractions from .NET is currently limited because of the limited support for unmanaged memory operations. Java already provides these unmanaged memory operations via Unsafe, which may become an official API later.

Volatile Read:

public static IntPtr VolatileRead(ref IntPtr address)
public static UIntPtr VolatileRead(ref UIntPtr address)

Volatile Write:

public static void VolatileWrite(ref IntPtr address,IntPtr value)
public static void VolatileWrite(ref UIntPtr address,UIntPtr value)

Atomic Operations:

public static IntPtr CompareExchange(ref IntPtr location1,IntPtr value,IntPtr comparand)
public static IntPtr Exchange(ref IntPtr location1,IntPtr value)

It would be good to have some kind of typed intrinsics API for volatile read/write and atomic operations, including increment/decrement/add. (The "typed" native pointer IntPtr<'T> idea is based on the F# NativePtr<'T>.)

API Example:

UnmanagedVolatile class (Read operations):

public static byte VolatileReadByte(ref IntPtr<byte> address)
public static double VolatileReadDouble(ref IntPtr<double> address)
public static int16 VolatileReadInt16(ref IntPtr<int16> address)
public static int32 VolatileReadInt32(ref IntPtr<int32> address)
public static int64 VolatileReadInt64(ref IntPtr<int64> address)
public static IntPtr VolatileReadIntPtr(ref IntPtr<IntPtr> address)
public static sbyte VolatileReadSByte(ref IntPtr<sbyte> address)
public static single VolatileReadSingle(ref IntPtr<single> address)
public static uint16 VolatileReadUInt16(ref IntPtr<uint16> address)
public static uint32 VolatileReadUInt32(ref IntPtr<uint32> address)
public static uint64 VolatileReadUInt64(ref IntPtr<uint64> address)

UnmanagedVolatile class (Write operations):

public static void VolatileWriteByte(ref IntPtr<byte> address,byte value)
public static void VolatileWriteDouble(ref IntPtr<double> address,double value)
public static void VolatileWriteInt16(ref IntPtr<int16> address,int16 value)
public static void VolatileWriteInt32(ref IntPtr<int32> address,int32 value)
public static void VolatileWriteInt64(ref IntPtr<int64> address,int64 value)
public static void VolatileWriteIntPtr(ref IntPtr<IntPtr> address,IntPtr value)
public static void VolatileWriteSByte(ref IntPtr<sbyte> address,sbyte value)
public static void VolatileWriteSingle(ref IntPtr<single> address,single value)
public static void VolatileWriteUInt16(ref IntPtr<uint16> address,uint16 value)
public static void VolatileWriteUInt32(ref IntPtr<uint32> address,uint32 value)
public static void VolatileWriteUInt64(ref IntPtr<uint64> address,uint64 value)

UnmanagedInterlocked class:

public static byte AddByte(ref IntPtr<byte> location1,byte value)
public static int16 AddInt16(ref IntPtr<int16> location1,int16 value)
public static int32 AddInt32(ref IntPtr<int32> location1,int32 value)
public static int64 AddInt64(ref IntPtr<int64> location1,int64 value)
public static byte IncrementByte(ref IntPtr<byte> location1)
public static int16 IncrementInt16(ref IntPtr<int16> location1)
public static int32 IncrementInt32(ref IntPtr<int32> location1)
public static int64 IncrementInt64(ref IntPtr<int64> location1)
public static byte DecrementByte(ref IntPtr<byte> location1)
public static int16 DecrementInt16(ref IntPtr<int16> location1)
public static int32 DecrementInt32(ref IntPtr<int32> location1)
public static int64 DecrementInt64(ref IntPtr<int64> location1)
public static byte ExchangeByte(ref IntPtr<byte> address,byte value)
public static double ExchangeDouble(ref IntPtr<double> address,double value)
public static int16 ExchangeInt16(ref IntPtr<int16> address,int16 value)
public static int32 ExchangeInt32(ref IntPtr<int32> address,int32 value)
public static int64 ExchangeInt64(ref IntPtr<int64> address,int64 value)
public static IntPtr ExchangeIntPtr(ref IntPtr<IntPtr> address,IntPtr value)
public static sbyte ExchangeSByte(ref IntPtr<sbyte> address,sbyte value)
public static single ExchangeSingle(ref IntPtr<single> address,single value)
public static uint16 ExchangeUInt16(ref IntPtr<uint16> address,uint16 value)
public static uint32 ExchangeUInt32(ref IntPtr<uint32> address,uint32 value)
public static uint64 ExchangeUInt64(ref IntPtr<uint64> address,uint64 value)
public static byte CompareExchangeByte(ref IntPtr<byte> address,byte value,byte comparand)
public static double CompareExchangeDouble(ref IntPtr<double> address,double value,double comparand)
public static int16 CompareExchangeInt16(ref IntPtr<int16> address,int16 value,int16 comparand)
public static int32 CompareExchangeInt32(ref IntPtr<int32> address,int32 value,int32 comparand)
public static int64 CompareExchangeInt64(ref IntPtr<int64> address,int64 value,int64 comparand)
public static IntPtr CompareExchangeIntPtr(ref IntPtr<IntPtr> address,IntPtr value,IntPtr comparand)
public static sbyte CompareExchangeSByte(ref IntPtr<sbyte> address,sbyte value,sbyte comparand)
public static single CompareExchangeSingle(ref IntPtr<single> address,single value,single comparand)
public static uint16 CompareExchangeUInt16(ref IntPtr<uint16> address,uint16 value,uint16 comparand)
public static uint32 CompareExchangeUInt32(ref IntPtr<uint32> address,uint32 value,uint32 comparand)
public static uint64 CompareExchangeUInt64(ref IntPtr<uint64> address,uint64 value,uint64 comparand)
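
As a rough illustration of the intended semantics (not the intrinsic-backed implementation this issue asks for, and IntPtr<'T> does not exist today, so plain IntPtr is used), such helpers could be sketched as thin wrappers over the existing managed-pointer overloads:

using System;
using System.Threading;

static unsafe class UnmanagedInterlockedSketch
{
    // Intended semantics of the proposed IncrementInt32: atomically increment
    // the 32-bit value at an unmanaged address and return the new value.
    public static int IncrementInt32(IntPtr address)
        => Interlocked.Increment(ref *(int*)address);

    // Intended semantics of the proposed CompareExchangeInt64 on unmanaged memory.
    public static long CompareExchangeInt64(IntPtr address, long value, long comparand)
        => Interlocked.CompareExchange(ref *(long*)address, value, comparand);
}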

However, these operations require proper memory alignment, which may require some extended (aligned) allocation operations on the .NET side too. Currently there is no proper way to do aligned allocation in .NET that stays aligned even if the GC tries to move the data around. Some more or less usable hacks are available, but I am afraid these are far from a proper solution [2] [3] [4]. Please note that proper SIMD support will have similar memory alignment requirements.
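
A common workaround, along the lines of the linked Stack Overflow answers, is to over-allocate unmanaged memory (which the GC never moves) and round the pointer up to the required alignment; a sketch:

using System;
using System.Runtime.InteropServices;

static class AlignedAlloc
{
    // Over-allocate and round the pointer up to 'alignment' (a power of two).
    // The raw pointer must be kept so it can be passed to FreeHGlobal later.
    public static IntPtr Allocate(int size, int alignment, out IntPtr raw)
    {
        raw = Marshal.AllocHGlobal(size + alignment - 1);
        long aligned = ((long)raw + alignment - 1) & ~(long)(alignment - 1);
        return (IntPtr)aligned;
    }
}

This only helps for unmanaged allocations; managed objects can still be relocated by the GC regardless of their initial alignment.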

Please also consider additional memory operation intrinsics in order to support persistent memory architectures, which require a consistent memory state. The following instructions are available to support persistent memory on x86: CLFLUSH, CLFLUSHOPT, CLWB, PCOMMIT [1].

public static void AcquireBarrier()
public static void ReleaseBarrier()
public static void CacheLineFlush(ref IntPtr<IntPtr> address)
public static void CacheLineWriteback(ref IntPtr<IntPtr> address)
public static void CommitToPersistence()

Update: added some syntax highlighting because the original formatting did not show the typed IntPtr<'T>. Added a note about the memory alignment requirement.

Best Regards,
Zoltan

[1] https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf

[2] https://stackoverflow.com/questions/1951290/memory-alignment-of-classes-in-c
[3] https://stackoverflow.com/questions/13413323/allocate-memory-with-16-byte-alignment
[4] https://stackoverflow.com/questions/10239659/is-there-a-way-to-new-a-net-object-aligned-to-64-bytes

@sharwell
Member

sharwell commented May 3, 2015

💡 You have many cases of a parameter ref IntPtr address, which should actually be just IntPtr address.

@zpodlovics
Author

True, thanks for the notice. It was a quick and dirty API example based on the existing managed API style, which uses ref address as an argument.

@MattWhilden
Contributor

Thanks for the suggestion. Just tagging some folks who may be in a better position to comment.

@ericeil who I think did some of our other atomic operations? @BruceForstall from the codegen team. @terrajobst API review?

@jkotas
Member

jkotas commented May 4, 2015

A unique advantage of managed pointers (ref in C#) is that they can point to unmanaged memory and be used in many of the same ways as unmanaged pointers. For example, check this code fragment:

IntPtr p = ...
Interlocked.CompareExchange(ref *(Int32*)p, 1, 0);

Thus the existing volatile and atomic operations defined on managed pointers should work fine for your scenario. The code is not as straightforward as it could be with an unsafe helper method - the unsafe constructs are kept non-trivial in C# and the .NET Core libraries on purpose, to encourage writing safe code.
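
A stand-alone version of that fragment might look like this (a sketch; Marshal.AllocHGlobal stands in for whatever unmanaged memory the device or library provides):

using System;
using System.Runtime.InteropServices;
using System.Threading;

unsafe class Program
{
    static void Main()
    {
        IntPtr p = Marshal.AllocHGlobal(sizeof(int));
        *(int*)p = 0;

        // CAS directly on the unmanaged location through a managed pointer (ref).
        int original = Interlocked.CompareExchange(ref *(int*)p, 1, 0);

        // Volatile read through the same kind of cast.
        int current = Volatile.Read(ref *(int*)p);

        Console.WriteLine($"original={original}, current={current}");
        Marshal.FreeHGlobal(p);
    }
}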

BTW: Take a look at the proposed C# features Ref Returns and Locals and Array Slicing. If they materialize, they should allow writing more of the code that avoids copies and operates on unmanaged memory in safe C#.

For the other part of your proposal - wrappers for the CLFLUSH, CLFLUSHOPT, CLWB, PCOMMIT instructions: they feel pretty specialized, with an unclear path for portability to non-Intel processors. I think they should be an independent NuGet package initially. Would you like to start one? These processor instructions do not look particularly cheap, so it should not be required for them to be JIT intrinsics, to start with at least.

@zpodlovics
Author

Thanks for your help and the suggestion. I have written a small example in C# and it seems to work. It looks like the original issue comes from F#, where the two types int32* and int32& are not equal and there is no conversion function available yet (as far as I know). I have managed to make it work with an inline IL hack, and the generated IL looks the same as the C# version. However, I would like to have some feedback about it. Could you please confirm that the void* -> int32& and int32* -> int32& conversions (in this example) are nothing more than a "type cast" in IL?

The other proposal was only an additional suggestion in case of an API review. I'll check later how it could be implemented.

Is there any way to add intrinsics functionality as extensible plugins to the current JIT? If I remember correctly, the SIMD functionality works as a JIT extension and/or plugin. I would like to replace some (of my own) class method calls with single instructions. For example, I would like to use a single PAUSE instruction as an intrinsic in a tight loop in a low latency environment (use case: high performance inter-thread messaging). It could be implemented as a pinvoke or as executable memory with a delegate [1], but that comes with a huge amount of overhead compared to a single instruction. The other option is to search for the instruction usage in the CoreCLR source and try to use it if possible.

CoreCLR has the following define in the standalone gc sample in
https://github.com/dotnet/coreclr/blob/master/src/gc/sample/gcenv.h#L277

#pragma intrinsic(_mm_pause)
#define YieldProcessor _mm_pause
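
For the spin-loop case, the closest managed stand-in today is Thread.SpinWait, which uses the processor's pause/yield hint internally (though not as a single bare instruction); a minimal sketch:

using System.Threading;

static class SpinExample
{
    // Busy-wait for a flag to become non-zero, yielding execution resources
    // between probes via the PAUSE-based Thread.SpinWait hint.
    public static void WaitFor(ref int flag)
    {
        while (Volatile.Read(ref flag) == 0)
            Thread.SpinWait(1);
    }
}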

I know my request is probably not a generic use case, but I would like to explore the available options.

[1] https://stackoverflow.com/questions/9557293/is-it-possible-to-write-a-jit-compiler-to-native-code-entirely-in-a-managed-n

@jkotas
Member

jkotas commented May 4, 2015

The unmanaged pointer to managed pointer conversions are unsafe casts. They should be close to a no-op in JITed code, assuming optimizations work as expected.

The SIMD APIs are exposed via an independent NuGet package, but the SIMD code generation is built into the JIT. It is not a pluggable extension. The JIT and CLR treat the SIMD NuGet package in a special way.

Customizing code generation in the JIT is a tricky subject. It would make some of the inner workings of the JIT public, and limit what kinds of changes can be done in the JIT in the future. I think that customizing code generation may be a better candidate for an AOT compiler, where the additional abstractions that expose the inner workings are less of a concern.

cc @CarolEidt

@zpodlovics
Author

As most of the SIMD and HW intrinsics are already implemented in CoreCLR (including PDEP/PEXT!), would you please reconsider also supporting a PAUSE intrinsic and the following explicit cache control intrinsics: CLFLUSH, CLFLUSHOPT, CLWB, CLZERO(?), SFENCE (PCOMMIT was deprecated by Intel)? Please note that persistent memory devices are already available on the market, and this is no longer Intel only: all of them (with the exception of CLWB) are also supported by AMD in the Zen microarchitecture (https://support.amd.com/TechDocs/24594.pdf, pages 260, 138, 141, 143).

The PAUSE instruction targets high performance / low latency inter-thread communication: a lot of energy could be saved, at a small latency increase, if these communicating tight loops could use a single PAUSE instruction instead of pure busy spinning.

The CLFLUSH, CLFLUSHOPT, CLWB, CLZERO(?), SFENCE target would be persistent memory support from user space, e.g.: https://qconsf.com/sf2017/system/files/presentation-slides/rethink_nvm.pdf
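
Of these, the fence part at least can be sketched against the hardware intrinsics surface mentioned above (assuming Sse.StoreFence is available in the intrinsics API being referenced); the cache-line flush/write-back instructions are not covered by this sketch:

using System.Runtime.Intrinsics.X86;

static class PersistSketch
{
    // Order prior stores with SFENCE after writing to a persistent-memory
    // mapped buffer; does nothing where SSE is not supported.
    public static void StoreBarrier()
    {
        if (Sse.IsSupported)
            Sse.StoreFence();
    }
}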

cc @fiigii
cc @tannergooding

@msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits added this to the Future milestone Jan 30, 2020
@maryamariyan added the untriaged label Feb 26, 2020
@joperezr added the api-needs-work label and removed the untriaged label Jul 1, 2020
@tannergooding
Member

Closing this. We've already exposed APIs that take IntPtr and UIntPtr for Volatile.* and Interlocked.*

As mentioned by Jan above, taking a pointer and creating a byref should already be free and so Volatile.Write(ref *someByte, value) should already be "optimal" codegen.

@ghost locked as resolved and limited conversation to collaborators Sep 22, 2022