[libc] Partially implement 'rand' for the GPU #66167

jhuber6 · 2023-09-13T03:10:47Z

Summary:
This patch partially implements the rand function on the GPU. This is
partial because the GPU currently doesn't support thread local storage
or static initializers. To implement this on the GPU. I use 1/8th of the
local / shared memory quota to treak the shared memory as thread local
storage. This is done by simply allocating enough storage for each
thread in the block and indexing into this based off of the thread id.
The downside to this is that it does not initialize srand correctly to
be 1 as the standard says, it is also wasteful. In the future we
should figure out a way to support TLS on the GPU so that this can be
completely common and less resource intensive.

llvmbot · 2023-09-13T03:11:55Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Changes

Summary: This patch partially implements the `rand` function on the GPU. This is partial because the GPU currently doesn't support thread local storage or static initializers. To implement this on the GPU. I use 1/8th of the local / shared memory quota to treak the shared memory as thread local storage. This is done by simply allocating enough storage for each thread in the block and indexing into this based off of the thread id. The downside to this is that it does not initialize `srand` correctly to be `1` as the standard says, it is also wasteful. In the future we should figure out a way to support TLS on the GPU so that this can be completely common and less resource intensive.

--
Full diff: https://github.com/llvm/llvm-project/pull/66167.diff

6 Files Affected:

(modified) libc/config/gpu/entrypoints.txt (+2)
(modified) libc/src/__support/GPU/amdgpu/utils.h (+6-4)
(modified) libc/src/stdlib/rand.cpp (+6-4)
(modified) libc/src/stdlib/rand_util.cpp (+6)
(modified) libc/src/stdlib/rand_util.h (+22)
(modified) libc/test/src/stdlib/rand_test.cpp (+3)

diff --git a/libc/config/gpu/entrypoints.txt b/libc/config/gpu/entrypoints.txt
index 0e314c60870c6ae..532c57d27718eb3 100644
--- a/libc/config/gpu/entrypoints.txt
+++ b/libc/config/gpu/entrypoints.txt
@@ -61,6 +61,8 @@ set(TARGET_LIBC_ENTRYPOINTS
     libc.src.stdlib.ldiv
     libc.src.stdlib.llabs
     libc.src.stdlib.lldiv
+    libc.src.stdlib.rand
+    libc.src.stdlib.srand
     libc.src.stdlib.strtod
     libc.src.stdlib.strtof
     libc.src.stdlib.strtol
diff --git a/libc/src/__support/GPU/amdgpu/utils.h b/libc/src/__support/GPU/amdgpu/utils.h
index f2fa84bc4259fb6..c23c9a2e2d985f8 100644
--- a/libc/src/__support/GPU/amdgpu/utils.h
+++ b/libc/src/__support/GPU/amdgpu/utils.h
@@ -21,10 +21,12 @@ namespace gpu {
 constexpr const uint64_t LANE_SIZE = __AMDGCN_WAVEFRONT_SIZE;
 
 /// Type aliases to the address spaces used by the AMDGPU backend.
-template <typename T> using Private = [[clang::opencl_private]] T;
-template <typename T> using Constant = [[clang::opencl_constant]] T;
-template <typename T> using Local = [[clang::opencl_local]] T;
-template <typename T> using Global = [[clang::opencl_global]] T;
+// FIXME: The OpenCL attributes are currently buggy on AMDGPU compared to the
+// direct address space attributes.
+template <typename T> using Private = [[clang::address_space(5)]] T;
+template <typename T> using Constant = [[clang::address_space(4)]] T;
+template <typename T> using Local = [[clang::address_space(3)]] T;
+template <typename T> using Global = [[clang::address_space(1)]] T;
 
 /// Returns the number of workgroups in the 'x' dimension of the grid.
 LIBC_INLINE uint32_t get_num_blocks_x() {
diff --git a/libc/src/stdlib/rand.cpp b/libc/src/stdlib/rand.cpp
index 771944f8b336fb7..038c0e082409e12 100644
--- a/libc/src/stdlib/rand.cpp
+++ b/libc/src/stdlib/rand.cpp
@@ -15,10 +15,12 @@ namespace __llvm_libc {
 // An implementation of the xorshift64star pseudo random number generator. This
 // is a good general purpose generator for most non-cryptographics applications.
 LLVM_LIBC_FUNCTION(int, rand, (void)) {
-  rand_next ^= rand_next >> 12;
-  rand_next ^= rand_next << 25;
-  rand_next ^= rand_next >> 27;
-  return static_cast<int>((rand_next * 0x2545F4914F6CDD1Dul) >> 32) & RAND_MAX;
+  unsigned long x = rand_next;
+  x ^= x >> 12;
+  x ^= x << 25;
+  x ^= x >> 27;
+  rand_next = x;
+  return static_cast<int>((x * 0x2545F4914F6CDD1Dul) >> 32) & RAND_MAX;
 }
 
 } // namespace __llvm_libc
diff --git a/libc/src/stdlib/rand_util.cpp b/libc/src/stdlib/rand_util.cpp
index dac8dca2804e1c2..6e7ec85501af785 100644
--- a/libc/src/stdlib/rand_util.cpp
+++ b/libc/src/stdlib/rand_util.cpp
@@ -11,8 +11,14 @@
 
 namespace __llvm_libc {
 
+#ifdef LIBC_TARGET_ARCH_IS_GPU
+// FIXME: Local GPU memory cannot be initialized so we cannot currently provide
+// a standard compliant default value.
+ThreadLocal<unsigned long> rand_next;
+#else
 // C standard 7.10p2: If 'rand' is called before 'srand' it is to proceed as if
 // the 'srand' function was called with a value of '1'.
 LIBC_THREAD_LOCAL unsigned long rand_next = 1;
+#endif
 
 } // namespace __llvm_libc
diff --git a/libc/src/stdlib/rand_util.h b/libc/src/stdlib/rand_util.h
index 4cbfa1478ef7bb2..281ce8a9a65a6fa 100644
--- a/libc/src/stdlib/rand_util.h
+++ b/libc/src/stdlib/rand_util.h
@@ -9,11 +9,33 @@
 #ifndef LLVM_LIBC_SRC_STDLIB_RAND_UTIL_H
 #define LLVM_LIBC_SRC_STDLIB_RAND_UTIL_H
 
+#include "src/__support/GPU/utils.h"
 #include "src/__support/macros/attributes.h"
 
 namespace __llvm_libc {
 
+#ifdef LIBC_TARGET_ARCH_IS_GPU
+// Implement thread local storage on the GPU using local memory. Each thread
+// gets its slot in the local memory array and is private to the group.
+// TODO: We need to implement the 'thread_local' keyword on the GPU. This is an
+// inefficient and incomplete stand-in until that is done.
+template <typename T> class ThreadLocal {
+private:
+  static constexpr long MAX_THREADS = 1024;
+  [[clang::loader_uninitialized]] static inline gpu::Local<T>
+      storage[MAX_THREADS];
+
+public:
+  LIBC_INLINE operator T() const { return storage[gpu::get_thread_id()]; }
+  LIBC_INLINE void operator=(const T &value) {
+    storage[gpu::get_thread_id()] = value;
+  }
+};
+
+extern ThreadLocal<unsigned long> rand_next;
+#else
 extern LIBC_THREAD_LOCAL unsigned long rand_next;
+#endif
 
 } // namespace __llvm_libc
 
diff --git a/libc/test/src/stdlib/rand_test.cpp b/libc/test/src/stdlib/rand_test.cpp
index 4bebbe37ffbe114..771b0e0f5975911 100644
--- a/libc/test/src/stdlib/rand_test.cpp
+++ b/libc/test/src/stdlib/rand_test.cpp
@@ -23,12 +23,15 @@ TEST(LlvmLibcRandTest, UnsetSeed) {
     vals[i] = val;
   }
 
+  // FIXME: The GPU implementation cannot initialize the seed correctly.
+#ifndef LIBC_TARGET_ARCH_IS_GPU
   // The C standard specifies that if 'srand' is never called it should behave
   // as if 'srand' was called with a value of 1. If we seed the value with 1 we
   // should get the same sequence as the unseeded version.
   __llvm_libc::srand(1);
   for (size_t i = 0; i < 1000; ++i)
     ASSERT_EQ(__llvm_libc::rand(), vals[i]);
+#endif
 }
 
 TEST(LlvmLibcRandTest, SetSeed) {

libc/src/__support/GPU/amdgpu/utils.h

JonChesterfield · 2023-09-25T14:37:01Z

I don't understand the comments about being unable to initialize the memory. Libc could have a per-kernel init function and document that languages that want to integrate libc need to splice that into the kernel prologue. Or we could bite the bullet and support trivial initialisers on shared memory, the blocker there was mostly disagreement about whether one lane in the warp or all of them should do the write.

jhuber6 · 2023-09-25T14:41:55Z

I don't understand the comments about being unable to initialize the memory. Libc could have a per-kernel init function and document that languages that want to integrate libc need to splice that into the kernel prologue. Or we could bite the bullet and support trivial initialisers on shared memory, the blocker there was mostly disagreement about whether one lane in the warp or all of them should do the write.

Pushing stuff to global constructors isn't really desirable for libc, since most of these functions aren't intended to be really used with the start.cpp stuff we use for testing. It's certainly possible to make a big global state struct that someone writes to, but I'd like to avoid needing such things if at all possible. I don't think we really require it right now except for AMD's shortsighted handling of clock frequencies.

I'd love be able to support these non-trivial constructors if possible, we really shouldn't need to call a kernel (which is really slow in general) just to set something to 1.

jdoerfert · 2023-09-25T15:20:21Z

libc/src/stdlib/rand_util.h

+template <typename T> class ThreadLocal {
+private:
+  static constexpr long MAX_THREADS = 1024;
+  [[clang::loader_uninitialized]] static inline gpu::Local<T>


What is gpu::Local? If you use AS(5), you just need one T, not 1024.

gpu::local is an alias to opencl_local which is AS(5) in the backend. Global AS(5) memory doesn't work unless it's fully inlined into each calling kernel, which isn't something we can guarantee, especially because the seed is global state.

I want to allocate thread local variables on the kernel stack, using basically the same compile time allocation scheme as LDS. I'm getting push back on that because applications run out of stack space already and using more of it is considered hazardous.

AS(5) globals should be raising an error in the back end. They definitely aren't correctly lowered.

Why should that cause pushback? We should be able to optimize any of this away in the general case. If we use this to handle thread_local semantics then it's a pay-for-what-you-use type situation I'd think.

jhuber6 · 2023-10-05T02:37:23Z

ping

lntue

Can you double check this error from the linux x86_64 bot:

/var/lib/buildkite-agent/builds/linux-56-7f758798dd-khkmx-1/llvm-project/github-pull-requests/libc/test/src/stdlib/rand_test.cpp:27:2: error: unterminated conditional directive
--
  | #ifndef LIBC_TARGET_ARCH_IS_GPU
  | ^
  | 1 error generated.

jhuber6 · 2023-10-05T13:37:30Z

Can you double check this error from the linux x86_64 bot:

/var/lib/buildkite-agent/builds/linux-56-7f758798dd-khkmx-1/llvm-project/github-pull-requests/libc/test/src/stdlib/rand_test.cpp:27:2: error: unterminated conditional directive
--
  | #ifndef LIBC_TARGET_ARCH_IS_GPU
  | ^
  | 1 error generated.

I think I broke that when I rebased the test. Will fix.

Summary: This patch partially implements the `rand` function on the GPU. This is partial because the GPU currently doesn't support thread local storage or static initializers. To implement this on the GPU. I use 1/8th of the local / shared memory quota to treak the shared memory as thread local storage. This is done by simply allocating enough storage for each thread in the block and indexing into this based off of the thread id. The downside to this is that it does not initialize `srand` correctly to be `1` as the standard says, it is also wasteful. In the future we should figure out a way to support TLS on the GPU so that this can be completely common and less resource intensive.

jhuber6 · 2023-10-17T20:55:34Z

ping

michaelrj-google

LGTM from the libc side.

jhuber6 requested review from a team as code owners September 13, 2023 03:10

llvmbot added backend:AMDGPU libc labels Sep 13, 2023

arsenm reviewed Sep 13, 2023

View reviewed changes

libc/src/__support/GPU/amdgpu/utils.h Outdated Show resolved Hide resolved

jhuber6 force-pushed the GPURand branch 2 times, most recently from 6363761 to 7abe016 Compare September 20, 2023 14:02

jhuber6 requested review from jdoerfert and JonChesterfield September 22, 2023 02:49

jdoerfert reviewed Sep 25, 2023

View reviewed changes

lntue reviewed Oct 5, 2023

View reviewed changes

jhuber6 force-pushed the GPURand branch from bade00e to 970880c Compare October 5, 2023 14:31

michaelrj-google approved these changes Oct 18, 2023

View reviewed changes

jhuber6 merged commit 630037e into llvm:main Oct 19, 2023

madhur13490 mentioned this pull request Oct 20, 2023

Revert commit ba8565fbcb975e2d067ce3ae5a7dbaae4953edd3 madhur13490/llvm-project#3

Closed

banach-space mentioned this pull request Oct 24, 2023

[mlir][vector] Add scalable vectors to tests for vector.contract #70039

Merged

michaelrj-google mentioned this pull request Apr 23, 2024

[libc] Only include GPU utils when targeting GPU #84999

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[libc] Partially implement 'rand' for the GPU #66167

[libc] Partially implement 'rand' for the GPU #66167

jhuber6 commented Sep 13, 2023

llvmbot commented Sep 13, 2023 •

edited

Loading

JonChesterfield commented Sep 25, 2023

jhuber6 commented Sep 25, 2023

jdoerfert Sep 25, 2023

jhuber6 Sep 25, 2023

JonChesterfield Sep 25, 2023

jhuber6 Sep 25, 2023

jhuber6 commented Oct 5, 2023

lntue left a comment

jhuber6 commented Oct 5, 2023

jhuber6 commented Oct 17, 2023

michaelrj-google left a comment

[libc] Partially implement 'rand' for the GPU #66167

[libc] Partially implement 'rand' for the GPU #66167

Conversation

jhuber6 commented Sep 13, 2023

llvmbot commented Sep 13, 2023 • edited Loading

JonChesterfield commented Sep 25, 2023

jhuber6 commented Sep 25, 2023

jdoerfert Sep 25, 2023

Choose a reason for hiding this comment

jhuber6 Sep 25, 2023

Choose a reason for hiding this comment

JonChesterfield Sep 25, 2023

Choose a reason for hiding this comment

jhuber6 Sep 25, 2023

Choose a reason for hiding this comment

jhuber6 commented Oct 5, 2023

lntue left a comment

Choose a reason for hiding this comment

jhuber6 commented Oct 5, 2023

jhuber6 commented Oct 17, 2023

michaelrj-google left a comment

Choose a reason for hiding this comment

llvmbot commented Sep 13, 2023 •

edited

Loading