From f29a880003aa3c7d64b1dc98948e2758bea8133e Mon Sep 17 00:00:00 2001 From: Mark Harris Date: Tue, 2 Apr 2024 03:58:57 +0000 Subject: [PATCH 1/3] Fix ordering / heading levels in readme and python example in guide --- README.md | 70 ++++++++++++++++++++++---------------------- python/docs/guide.md | 10 +++---- 2 files changed, 40 insertions(+), 40 deletions(-) diff --git a/README.md b/README.md index 9ec8cbf47..13d6651a8 100644 --- a/README.md +++ b/README.md @@ -207,37 +207,6 @@ alignment argument. All allocations are required to be aligned to at least 256B. `device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream on which to perform the (de)allocation. -## `cuda_stream_view` and `cuda_stream` - -`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's -purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer, -which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a -`rmm::cuda_stream_view` argument. - -`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides -RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream` -can never represent the CUDA default stream or per-thread default stream; it only ever represents -a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved. - -## `cuda_stream_pool` - -`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to -create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the -stream pool can be faster than creating the streams on the fly. The size of the pool is configurable. -Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of -`rmm::cuda_stream_view` that represent identical CUDA streams. - -### Thread Safety - -All current device memory resources are thread safe unless documented otherwise. More specifically, -calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to -either of these functions from other threads. They are _not_ thread safe with respect to -construction and destruction of the memory resource object. - -Note that a class `thread_safe_resource_adapter` is provided which can be used to adapt a memory -resource that is not thread safe to be thread safe (as described above). This adapter is not needed -with any current RMM device memory resources. - ### Stream-ordered Memory Allocation `rmm::mr::device_memory_resource` is a base class that provides stream-ordered memory allocation. @@ -386,17 +355,48 @@ line of the error comment. } ``` -### Allocators +## `cuda_stream_view` and `cuda_stream` + +`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's +purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer, +which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a +`rmm::cuda_stream_view` argument. + +`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides +RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream` +can never represent the CUDA default stream or per-thread default stream; it only ever represents +a single non-default stream. 
`rmm::cuda_stream` cannot be copied, but can be moved. + +## `cuda_stream_pool` + +`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to +create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the +stream pool can be faster than creating the streams on the fly. The size of the pool is configurable. +Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of +`rmm::cuda_stream_view` that represent identical CUDA streams. + +## Thread Safety + +All current device memory resources are thread safe unless documented otherwise. More specifically, +calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to +either of these functions from other threads. They are _not_ thread safe with respect to +construction and destruction of the memory resource object. + +Note that a class `thread_safe_resource_adapter` is provided which can be used to adapt a memory +resource that is not thread safe to be thread safe (as described above). This adapter is not needed +with any current RMM device memory resources. + +## Allocators C++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object. RMM provides several `Allocator` and `Allocator`-like classes. -#### `polymorphic_allocator` +### `polymorphic_allocator` A [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator). Unlike the standard C++ `Allocator` interface, the `allocate` and `deallocate` functions take a `cuda_stream_view` indicating the stream on which the (de)allocation occurs. -#### `stream_allocator_adaptor` +### `stream_allocator_adaptor` `stream_allocator_adaptor` can be used to adapt a stream-ordered allocator to present a standard `Allocator` interface to consumers that may not be designed to work with a stream-ordered interface. @@ -415,7 +415,7 @@ auto p = adapted.allocate(100); adapted.deallocate(p,100); ``` -#### `thrust_allocator` +### `thrust_allocator` `thrust_allocator` is a device memory allocator that uses the strongly typed `thrust::device_ptr`, making it usable with containers like `thrust::device_vector`. diff --git a/python/docs/guide.md b/python/docs/guide.md index c06135ca8..aee01118a 100644 --- a/python/docs/guide.md +++ b/python/docs/guide.md @@ -181,9 +181,9 @@ You can configure for memory allocations using their by configuring the current allocator. -```python -from rmm.allocators.torch import rmm_torch_allocator -import torch + ```python + >>> from rmm.allocators.torch import rmm_torch_allocator + >>> import torch -torch.cuda.memory.change_current_allocator(rmm_torch_allocator) -``` + >>>torch.cuda.memory.change_current_allocator(rmm_torch_allocator) + ``` From 134a87e29155ea3581dae3f13714aaa3d950b20f Mon Sep 17 00:00:00 2001 From: Mark Harris Date: Wed, 3 Apr 2024 01:45:22 +0000 Subject: [PATCH 2/3] Fix heading levels better --- README.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 13d6651a8..0fe848fea 100644 --- a/README.md +++ b/README.md @@ -207,7 +207,7 @@ alignment argument. All allocations are required to be aligned to at least 256B. `device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream on which to perform the (de)allocation. 
-### Stream-ordered Memory Allocation +## Stream-ordered Memory Allocation `rmm::mr::device_memory_resource` is a base class that provides stream-ordered memory allocation. This allows optimizations such as re-using memory deallocated on the same stream without the @@ -239,16 +239,16 @@ For further information about stream-ordered memory allocation semantics, read Allocator](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/) on the NVIDIA Developer Blog. -### Available Resources +## Available Device Resources RMM provides several `device_memory_resource` derived classes to satisfy various user requirements. For more detailed information about these resources, see their respective documentation. -#### `cuda_memory_resource` +### `cuda_memory_resource` Allocates and frees device memory using `cudaMalloc` and `cudaFree`. -#### `managed_memory_resource` +### `managed_memory_resource` Allocates and frees device memory using `cudaMallocManaged` and `cudaFree`. @@ -256,22 +256,22 @@ Note that `managed_memory_resource` cannot be used with NVIDIA Virtual GPU Softw with virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu). -#### `pool_memory_resource` +### `pool_memory_resource` A coalescing, best-fit pool sub-allocator. -#### `fixed_size_memory_resource` +### `fixed_size_memory_resource` A memory resource that can only allocate a single fixed size. Average allocation and deallocation cost is constant. -#### `binning_memory_resource` +### `binning_memory_resource` Configurable to use multiple upstream memory resources for allocations that fall within different bin sizes. Often configured with multiple bins backed by `fixed_size_memory_resource`s and a single `pool_memory_resource` for allocations larger than the largest bin size. -### Default Resources and Per-device Resources +## Default Resources and Per-device Resources RMM users commonly need to configure a `device_memory_resource` object to use for all allocations where another resource has not explicitly been provided. A common example is configuring a @@ -296,7 +296,7 @@ Accessing and modifying the default resource is done through two functions: `get_current_device_resource()` - For more explicit control, you can use `set_per_device_resource()`, which takes a device ID. -#### Example +### Example ```c++ rmm::mr::cuda_memory_resource cuda_mr; @@ -308,7 +308,7 @@ rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device re rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr` ``` -#### Multiple Devices +### Multiple Devices A `device_memory_resource` should only be used when the active CUDA device is the same device that was active when the `device_memory_resource` was created. Otherwise behavior is undefined. @@ -497,13 +497,13 @@ Similar to `device_memory_resource`, it has two key functions for (de)allocation Unlike `device_memory_resource`, the `host_memory_resource` interface and behavior is identical to `std::pmr::memory_resource`. -### Available Resources +## Available Host Resources -#### `new_delete_resource` +### `new_delete_resource` Uses the global `operator new` and `operator delete` to allocate host memory. -#### `pinned_memory_resource` +### `pinned_memory_resource` Allocates "pinned" host memory using `cuda(Malloc/Free)Host`. 
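[Editor's note] The host resources described in the hunk above mirror the device resource interface minus the stream argument. A minimal usage sketch follows; the header paths and the single-argument `allocate`/`deallocate` calls are assumptions inferred from the class names in this README, not something this patch adds — check the API docs for your RMM version.

```c++
#include <rmm/mr/host/new_delete_resource.hpp>
#include <rmm/mr/host/pinned_memory_resource.hpp>

int main()
{
  // Pageable host memory allocated with the global operator new / operator delete.
  rmm::mr::new_delete_resource pageable_mr;
  void* pageable = pageable_mr.allocate(1024);
  pageable_mr.deallocate(pageable, 1024);

  // Page-locked ("pinned") host memory, typically used to speed up
  // host <-> device copies.
  rmm::mr::pinned_memory_resource pinned_mr;
  void* pinned = pinned_mr.allocate(1024);
  pinned_mr.deallocate(pinned, 1024);

  return 0;
}
```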
@@ -611,7 +611,7 @@ resources are detectable with Compute Sanitizer Memcheck. It may be possible in the future to add support for memory bounds checking with other memory resources using NVTX APIs. -## Using RMM in Python Code +# Using RMM in Python There are two ways to use RMM in Python code: @@ -622,7 +622,7 @@ There are two ways to use RMM in Python code: RMM provides a `MemoryResource` abstraction to control _how_ device memory is allocated in both the above uses. -### DeviceBuffers +## DeviceBuffer A DeviceBuffer represents an **untyped, uninitialized device memory allocation**. DeviceBuffers can be created by providing the @@ -662,7 +662,7 @@ host: array([1., 2., 3.]) ``` -### MemoryResource objects +## MemoryResource objects `MemoryResource` objects are used to configure how device memory allocations are made by RMM. From 7c90edc800b12e8099990eebd6bc2a87377273d3 Mon Sep 17 00:00:00 2001 From: Mark Harris Date: Wed, 3 Apr 2024 01:46:44 +0000 Subject: [PATCH 3/3] Indentation --- python/docs/guide.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/python/docs/guide.md b/python/docs/guide.md index aee01118a..968be8586 100644 --- a/python/docs/guide.md +++ b/python/docs/guide.md @@ -181,9 +181,9 @@ You can configure for memory allocations using their by configuring the current allocator. - ```python - >>> from rmm.allocators.torch import rmm_torch_allocator - >>> import torch +```python +>>> from rmm.allocators.torch import rmm_torch_allocator +>>> import torch - >>>torch.cuda.memory.change_current_allocator(rmm_torch_allocator) - ``` +>>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator) +```
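[Editor's note] As a companion to the guide changes above, a sketch of configuring a `MemoryResource` before allocating might look like the following. The choice of `PoolMemoryResource` and the 1 GiB pool size are illustrative assumptions, not part of this patch series.

```python
>>> import rmm
>>> # Illustrative: back a suballocating pool with a plain CUDA resource
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size=2**30,  # 1 GiB, chosen here only for illustration
... )
>>> rmm.mr.set_current_device_resource(pool)
>>> # Subsequent allocations, e.g. DeviceBuffer, are now served from the pool
>>> buf = rmm.DeviceBuffer(size=100)
>>> buf.size
100
```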