Support resolutions over 2048x2048 in CUDA rasterizer

NVlabs · Sep 30, 2024 · 729261d · 729261d
1 parent f09684d
commit 729261d
Show file tree

Hide file tree

Showing 11 changed files with 116 additions and 46 deletions.
diff --git a/docs/index.html b/docs/index.html
@@ -739,7 +739,7 @@ <h3 id="rasterizing-with-cuda-vs-opengl">Rasterizing with CUDA vs OpenGL</h3>
 <p>Since version 0.3.0, nvdiffrast on PyTorch supports executing the rasterization operation using either CUDA or OpenGL. Earlier versions and the Tensorflow bindings support OpenGL only.</p>
 <p>When rasterization is executed on OpenGL, we use the GPU's graphics pipeline to determine which triangles land on which pixels. GPUs have amazingly efficient hardware for doing this — it is their original <i>raison d'être</i> — and thus it makes sense to exploit it. Unfortunately, some computing environments haven't been designed with this in mind, and it can be difficult to get OpenGL to work correctly and interoperate with CUDA cleanly. On Windows, compatibility is generally good because the GPU drivers required to run CUDA also include OpenGL support. Linux is more complicated, as various drivers can be installed separately and there isn't a standardized way to acquire access to the hardware graphics pipeline.</p>
 <p>Rasterizing in CUDA pretty much reverses these considerations. Compatibility is obviously not an issue on any CUDA-enabled platform. On the other hand, implementing the rasterization process correctly and efficiently on a massively data-parallel programming model is non-trivial. The CUDA rasterizer in nvdiffrast follows the approach described in research paper <em>High-Performance Software Rasterization on GPUs</em> by Laine and Karras, HPG 2011. Our code is based on the paper's publicly released CUDA kernels, with considerable modifications to support current hardware architectures and to match nvdiffrast's needs.</p>
-<p>The maximum resolution supported by the CUDA rasterizer is 2048×2048, and the number of triangles that can be rendered in one batch is limited to around 16 million. Subpixel precision is limited to 4 bits and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors. <em>Note:</em> Prior to version 0.3.2, the output dimensions had to be multiples of 8, but this restriction has been removed.</p>
+<p>The subpixel precision of the CUDA rasterizer is limited to 4 bits, and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors. <em>Note:</em> Restrictions related to output resolution have been removed in version 0.3.3. Although the internal resolution of the CUDA rasterizer remains capped at 2048×2048, nvdiffrast now invokes it automatically multiple times to support higher resolutions.</p>
 <p>It is difficult to predict which rasterizer offers better performance. For complex meshes and high resolutions OpenGL will most likely outperform the CUDA rasterizer, although it has certain overheads that the CUDA rasterizer does not have. For simple meshes and low resolutions the CUDA rasterizer may be faster, but it has its own overheads, too. Measuring the performance on actual data, on the target platform, and in the context of the entire program is the only way to know for sure.</p>
 <p>To run rasterization in CUDA, create a <code>RasterizeCudaContext</code> and supply it to the <code>rasterize()</code> operation. For OpenGL, use a <code>RasterizeGLContext</code> instead. Easy!</p>
 <h3 id="running-on-multiple-gpus">Running on multiple GPUs</h3>

diff --git a/nvdiffrast/__init__.py b/nvdiffrast/__init__.py
@@ -6,4 +6,4 @@
 # distribution of this software and related documentation without an express
 # license agreement from NVIDIA CORPORATION is strictly prohibited.
 
-__version__ = '0.3.2'
+__version__ = '0.3.3'
diff --git a/nvdiffrast/common/cudaraster/CudaRaster.hpp b/nvdiffrast/common/cudaraster/CudaRaster.hpp
@@ -39,7 +39,8 @@ class CudaRaster
 					        CudaRaster				(void);
 					        ~CudaRaster				(void);
 
-    void                    setViewportSize         (int width, int height, int numImages);              // Width and height are internally rounded up to multiples of tile size (8x8) for buffer sizes.
+    void                    setBufferSize           (int width, int height, int numImages);              // Width and height are internally rounded up to multiples of tile size (8x8) for buffer sizes.
+    void                    setViewport             (int width, int height, int offsetX, int offsetY);   // Tiled rendering viewport setup.
     void                    setRenderModeFlags      (unsigned int renderModeFlags);                      // Affects all subsequent calls to drawTriangles(). Defaults to zero.
     void                    deferredClear           (unsigned int clearColor);                           // Clears color and depth buffers during next call to drawTriangles().
     void                    setVertexBuffer         (void* vertices, int numVertices);                   // GPU pointer managed by caller. Vertex positions in clip space as float4 (x, y, z, w).

diff --git a/nvdiffrast/common/cudaraster/impl/Buffer.hpp b/nvdiffrast/common/cudaraster/impl/Buffer.hpp
@@ -21,7 +21,7 @@ class Buffer
 
     void            reset       (size_t bytes);
     void            grow        (size_t bytes);
-    void*           getPtr      (void) { return m_gpuPtr; }
+    void*           getPtr      (size_t offset = 0) { return (void*)(((uintptr_t)m_gpuPtr) + offset); }
     size_t          getSize     (void) const { return m_bytes; }
 
     void            setPtr      (void* ptr) { m_gpuPtr = ptr; }

diff --git a/nvdiffrast/common/cudaraster/impl/CudaRaster.cpp b/nvdiffrast/common/cudaraster/impl/CudaRaster.cpp
@@ -26,9 +26,14 @@ CudaRaster::~CudaRaster()
     delete m_impl;
 }
 
-void CudaRaster::setViewportSize(int width, int height, int numImages)
+void CudaRaster::setBufferSize(int width, int height, int numImages)
 {
-    m_impl->setViewportSize(Vec3i(width, height, numImages));
+    m_impl->setBufferSize(Vec3i(width, height, numImages));
+}
+
+void CudaRaster::setViewport(int width, int height, int offsetX, int offsetY)
+{
+    m_impl->setViewport(Vec2i(width, height), Vec2i(offsetX, offsetY));
 }
 
 void CudaRaster::setRenderModeFlags(U32 flags)

diff --git a/nvdiffrast/common/cudaraster/impl/FineRaster.inl b/nvdiffrast/common/cudaraster/impl/FineRaster.inl
@@ -241,20 +241,20 @@ __device__ __inline__ void fineRasterImpl(const CRParams p)
         }
         else // otherwise => read tile from framebuffer
         {
-            U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
-            U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
-			tileColor[threadIdx.x] = pColor[px + p.widthPixels * py];
-            tileDepth[threadIdx.x] = pDepth[px + p.widthPixels * py];
-            tileColor[threadIdx.x + 32] = pColor[px + p.widthPixels * (py + 4)];
-            tileDepth[threadIdx.x + 32] = pDepth[px + p.widthPixels * (py + 4)];
+            U32* pColor = (U32*)p.colorBuffer + p.strideX * p.strideY * blockIdx.z;
+            U32* pDepth = (U32*)p.depthBuffer + p.strideX * p.strideY * blockIdx.z;
+			tileColor[threadIdx.x] = pColor[px + p.strideX * py];
+            tileDepth[threadIdx.x] = pDepth[px + p.strideX * py];
+            tileColor[threadIdx.x + 32] = pColor[px + p.strideX * (py + 4)];
+            tileDepth[threadIdx.x + 32] = pDepth[px + p.strideX * (py + 4)];
         }
 
         // read peeling inputs if enabled
         if (p.renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling)
         {
-            U32* pPeel = (U32*)p.peelBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
-            tilePeel[threadIdx.x] = pPeel[px + p.widthPixels * py];
-            tilePeel[threadIdx.x + 32] = pPeel[px + p.widthPixels * (py + 4)];
+            U32* pPeel = (U32*)p.peelBuffer + p.strideX * p.strideY * blockIdx.z;
+            tilePeel[threadIdx.x] = pPeel[px + p.strideX * py];
+            tilePeel[threadIdx.x + 32] = pPeel[px + p.strideX * (py + 4)];
         }
 
         U32 tileZMax;
@@ -372,12 +372,12 @@ __device__ __inline__ void fineRasterImpl(const CRParams p)
         {
             int px = (tileX << CR_TILE_LOG2) + (threadIdx.x & (CR_TILE_SIZE - 1));
             int py = (tileY << CR_TILE_LOG2) + (threadIdx.x >> CR_TILE_LOG2);
-            U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
-            U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
-            pColor[px + p.widthPixels * py] = tileColor[threadIdx.x];
-            pDepth[px + p.widthPixels * py] = tileDepth[threadIdx.x];
-            pColor[px + p.widthPixels * (py + 4)] = tileColor[threadIdx.x + 32];
-            pDepth[px + p.widthPixels * (py + 4)] = tileDepth[threadIdx.x + 32];
+            U32* pColor = (U32*)p.colorBuffer + p.strideX * p.strideY * blockIdx.z;
+            U32* pDepth = (U32*)p.depthBuffer + p.strideX * p.strideY * blockIdx.z;
+            pColor[px + p.strideX * py] = tileColor[threadIdx.x];
+            pDepth[px + p.strideX * py] = tileDepth[threadIdx.x];
+            pColor[px + p.strideX * (py + 4)] = tileColor[threadIdx.x + 32];
+            pDepth[px + p.strideX * (py + 4)] = tileDepth[threadIdx.x + 32];
         }
     }
 }

diff --git a/nvdiffrast/common/cudaraster/impl/PrivateDefs.hpp b/nvdiffrast/common/cudaraster/impl/PrivateDefs.hpp
@@ -85,14 +85,19 @@ struct CRParams
     void*       vertexBuffer;       // numVertices * float4(x, y, z, w)
     void*       indexBuffer;        // numTriangles * int3(vi0, vi1, vi2)
 
-    S32         widthPixels;        // Buffer size in pixels. Must be multiple of tile size (8x8).
+    S32         widthPixels;        // Render buffer size in pixels. Must be multiple of tile size (8x8).
     S32         heightPixels;
     S32         widthPixelsVp;      // Viewport size in pixels.
     S32         heightPixelsVp;
     S32         widthBins;          // widthPixels / CR_BIN_SIZE
     S32         heightBins;         // heightPixels / CR_BIN_SIZE
     S32         numBins;            // widthBins * heightBins
 
+    F32         xs;                 // Vertex position adjustments for tiled rendering.
+    F32         ys;
+    F32         xo;
+    F32         yo;
+
     S32         widthTiles;         // widthPixels / CR_TILE_SIZE
     S32         heightTiles;        // heightPixels / CR_TILE_SIZE
     S32         numTiles;           // widthTiles * heightTiles
@@ -130,11 +135,13 @@ struct CRParams
     void*       activeTiles;        // CR_MAXTILES_SQR * (S32 tileIdx)
     void*       tileFirstSeg;       // CR_MAXTILES_SQR * (S32 segIdx), -1 = none
 
-    // Surface buffers.
+    // Surface buffers. Outer tile offset is baked into pointers.
 
     void*       colorBuffer;        // sizePixels.x * sizePixels.y * numImages * U32
     void*       depthBuffer;        // sizePixels.x * sizePixels.y * numImages * U32
     void*       peelBuffer;         // sizePixels.x * sizePixels.y * numImages * U32, only if peeling enabled.
+    S32         strideX;            // horizontal size in pixels
+    S32         strideY;            // vertical stride in pixels
 
     // Per-image parameters for first images are embedded here to avoid extra memcpy for small batches.
 

diff --git a/nvdiffrast/common/cudaraster/impl/RasterImpl.cpp b/nvdiffrast/common/cudaraster/impl/RasterImpl.cpp
@@ -37,8 +37,11 @@ RasterImpl::RasterImpl(void)
     m_bufferSizesReported   (0),
 
     m_numImages             (0),
+    m_bufferSizePixels      (0, 0),
+    m_bufferSizeVp          (0, 0),
     m_sizePixels            (0, 0),
     m_sizeVp                (0, 0),
+    m_offsetPixels          (0, 0),
     m_sizeBins              (0, 0),
     m_numBins               (0),
     m_sizeTiles             (0, 0),
@@ -81,24 +84,40 @@ RasterImpl::~RasterImpl(void)
 
 //------------------------------------------------------------------------
 
-void RasterImpl::setViewportSize(Vec3i size)
+void RasterImpl::setBufferSize(Vec3i size)
 {
-    // Internally round buffer sizes to multiples of 8.
+    // Internal buffer width and height must be divisible by tile size.
+    int w = (size.x + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
+    int h = (size.y + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
+
+    m_bufferSizePixels = Vec2i(w, h);
+    m_bufferSizeVp     = Vec2i(size.x, size.y);
+    m_numImages        = size.z;
+
+    m_colorBuffer.reset(w * h * size.z * sizeof(U32));
+    m_depthBuffer.reset(w * h * size.z * sizeof(U32));
+}
+
+//------------------------------------------------------------------------
+
+void RasterImpl::setViewport(Vec2i size, Vec2i offset)
+{
+    // Offset must be divisible by tile size.
+    NVDR_CHECK((offset.x & (CR_TILE_SIZE - 1)) == 0 && (offset.y & (CR_TILE_SIZE - 1)) == 0, "invalid viewport offset");
+
+    // Round internal viewport size to multiples of tile size.
     int w = (size.x + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
     int h = (size.y + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
 
-    m_numImages     = size.z;
     m_sizePixels    = Vec2i(w, h);
+    m_offsetPixels  = offset;
     m_sizeVp        = Vec2i(size.x, size.y);
     m_sizeTiles.x   = m_sizePixels.x >> CR_TILE_LOG2;
     m_sizeTiles.y   = m_sizePixels.y >> CR_TILE_LOG2;
     m_numTiles      = m_sizeTiles.x * m_sizeTiles.y;
     m_sizeBins.x    = (m_sizeTiles.x + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
     m_sizeBins.y    = (m_sizeTiles.y + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
     m_numBins       = m_sizeBins.x * m_sizeBins.y;
-
-    m_colorBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
-    m_depthBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
 }
 
 void RasterImpl::swapDepthAndPeel(void)
@@ -273,6 +292,11 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
         p.heightBins        = m_sizeBins.y;
         p.numBins           = m_numBins;
 
+        p.xs                = (float)m_bufferSizeVp.x / (float)m_sizeVp.x;
+        p.ys                = (float)m_bufferSizeVp.y / (float)m_sizeVp.y;
+        p.xo                = (float)(m_bufferSizeVp.x - m_sizeVp.x - 2 * m_offsetPixels.x) / (float)m_sizeVp.x;
+        p.yo                = (float)(m_bufferSizeVp.y - m_sizeVp.y - 2 * m_offsetPixels.y) / (float)m_sizeVp.y;
+
         p.widthTiles        = m_sizeTiles.x;
         p.heightTiles       = m_sizeTiles.y;
         p.numTiles          = m_numTiles;
@@ -300,9 +324,12 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
         p.activeTiles       = m_activeTiles.getPtr();
         p.tileFirstSeg      = m_tileFirstSeg.getPtr();
 
-        p.colorBuffer       = m_colorBuffer.getPtr();
-        p.depthBuffer       = m_depthBuffer.getPtr();
-        p.peelBuffer        = (m_renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) ? m_peelBuffer.getPtr() : 0;
+        size_t byteOffset = ((size_t)m_offsetPixels.x + (size_t)m_offsetPixels.y * (size_t)p.strideX) * sizeof(U32);
+        p.colorBuffer       = m_colorBuffer.getPtr(byteOffset);
+        p.depthBuffer       = m_depthBuffer.getPtr(byteOffset);
+        p.peelBuffer        = (m_renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) ? m_peelBuffer.getPtr(byteOffset) : 0;
+        p.strideX           = m_bufferSizePixels.x;
+        p.strideY           = m_bufferSizePixels.y;
 
         memcpy(&p.imageParamsFirst, imageParams, min(m_numImages, CR_EMBED_IMAGE_PARAMS) * sizeof(CRImageParams));
         p.imageParamsExtra  = (CRImageParams*)m_crImageParamsExtra.getPtr();
@@ -315,7 +342,7 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
     dim3 frBlock(32, m_numFineWarpsPerBlock);
     void* args[] = {&p};
 
-    // Launch stages from setup to coarse and copy atomics to host only if this is not a peeling iteration.
+    // Launch stages from setup to coarse and copy atomics to host only if this is not a single-tile peeling iteration.
     if (!peel)
     {
         if (instanceMode)

diff --git a/nvdiffrast/common/cudaraster/impl/RasterImpl.hpp b/nvdiffrast/common/cudaraster/impl/RasterImpl.hpp
@@ -21,7 +21,8 @@ class RasterImpl
 					        RasterImpl				(void);
 					        ~RasterImpl				(void);
 
-    void                    setViewportSize         (Vec3i size);
+    void                    setBufferSize           (Vec3i size);
+    void                    setViewport             (Vec2i size, Vec2i offset);
     void                    setRenderModeFlags      (U32 flags) { m_renderModeFlags = flags; }
     void                    deferredClear           (U32 color) { m_deferredClear = true; m_clearColor = color; }
     void                    setVertexBuffer         (void* ptr, int numVertices) { m_vertexPtr = ptr; m_numVertices = numVertices; } // GPU pointer.
@@ -52,8 +53,11 @@ class RasterImpl
     Buffer                  m_depthBuffer;
     Buffer                  m_peelBuffer;
     int                     m_numImages;
+    Vec2i                   m_bufferSizePixels;     // Internal buffer size.
+    Vec2i                   m_bufferSizeVp;         // Total viewport size.
     Vec2i                   m_sizePixels;           // Internal size at which all computation is done, buffers reserved, etc.
     Vec2i                   m_sizeVp;               // Size to which output will be cropped outside, determines viewport size.
+    Vec2i                   m_offsetPixels;         // Viewport offset for tiled rendering.
     Vec2i                   m_sizeBins;
     S32                     m_numBins;
     Vec2i                   m_sizeTiles;

diff --git a/nvdiffrast/common/cudaraster/impl/TriangleSetup.inl b/nvdiffrast/common/cudaraster/impl/TriangleSetup.inl
@@ -257,6 +257,15 @@ __device__ __inline__ void triangleSetupImpl(const CRParams p)
     float4 v1 = vertexBuffer[vidx.y];
     float4 v2 = vertexBuffer[vidx.z];
 
+    // Adjust vertex positions according to current viewport size and offset.
+
+    v0.x = v0.x * p.xs + v0.w * p.xo;
+    v0.y = v0.y * p.ys + v0.w * p.yo;
+    v1.x = v1.x * p.xs + v1.w * p.xo;
+    v1.y = v1.y * p.ys + v1.w * p.yo;
+    v2.x = v2.x * p.xs + v2.w * p.xo;
+    v2.y = v2.y * p.ys + v2.w * p.yo;
+
     // Outside view frustum => cull.
 
     if (v0.w < fabsf(v0.x) | v0.w < fabsf(v0.y) | v0.w < fabsf(v0.z))