Skip to content

Commit

Permalink
Support resolutions over 2048x2048 in CUDA rasterizer
Browse files Browse the repository at this point in the history
  • Loading branch information
s-laine committed Sep 30, 2024
1 parent f09684d commit 729261d
Show file tree
Hide file tree
Showing 11 changed files with 116 additions and 46 deletions.
2 changes: 1 addition & 1 deletion docs/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -739,7 +739,7 @@ <h3 id="rasterizing-with-cuda-vs-opengl">Rasterizing with CUDA vs OpenGL</h3>
<p>Since version 0.3.0, nvdiffrast on PyTorch supports executing the rasterization operation using either CUDA or OpenGL. Earlier versions and the Tensorflow bindings support OpenGL only.</p>
<p>When rasterization is executed on OpenGL, we use the GPU's graphics pipeline to determine which triangles land on which pixels. GPUs have amazingly efficient hardware for doing this — it is their original <i>raison d'être</i> — and thus it makes sense to exploit it. Unfortunately, some computing environments haven't been designed with this in mind, and it can be difficult to get OpenGL to work correctly and interoperate with CUDA cleanly. On Windows, compatibility is generally good because the GPU drivers required to run CUDA also include OpenGL support. Linux is more complicated, as various drivers can be installed separately and there isn't a standardized way to acquire access to the hardware graphics pipeline.</p>
<p>Rasterizing in CUDA pretty much reverses these considerations. Compatibility is obviously not an issue on any CUDA-enabled platform. On the other hand, implementing the rasterization process correctly and efficiently on a massively data-parallel programming model is non-trivial. The CUDA rasterizer in nvdiffrast follows the approach described in research paper <em>High-Performance Software Rasterization on GPUs</em> by Laine and Karras, HPG 2011. Our code is based on the paper's publicly released CUDA kernels, with considerable modifications to support current hardware architectures and to match nvdiffrast's needs.</p>
<p>The maximum resolution supported by the CUDA rasterizer is 2048×2048, and the number of triangles that can be rendered in one batch is limited to around 16 million. Subpixel precision is limited to 4 bits and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors. <em>Note:</em> Prior to version 0.3.2, the output dimensions had to be multiples of 8, but this restriction has been removed.</p>
<p>The subpixel precision of the CUDA rasterizer is limited to 4 bits, and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors. <em>Note:</em> Restrictions related to output resolution have been removed in version 0.3.3. Although the internal resolution of the CUDA rasterizer remains capped at 2048×2048, nvdiffrast now invokes it automatically multiple times to support higher resolutions.</p>
<p>It is difficult to predict which rasterizer offers better performance. For complex meshes and high resolutions OpenGL will most likely outperform the CUDA rasterizer, although it has certain overheads that the CUDA rasterizer does not have. For simple meshes and low resolutions the CUDA rasterizer may be faster, but it has its own overheads, too. Measuring the performance on actual data, on the target platform, and in the context of the entire program is the only way to know for sure.</p>
<p>To run rasterization in CUDA, create a <code>RasterizeCudaContext</code> and supply it to the <code>rasterize()</code> operation. For OpenGL, use a <code>RasterizeGLContext</code> instead. Easy!</p>
<h3 id="running-on-multiple-gpus">Running on multiple GPUs</h3>
Expand Down
2 changes: 1 addition & 1 deletion nvdiffrast/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,4 @@
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.

__version__ = '0.3.2'
__version__ = '0.3.3'
3 changes: 2 additions & 1 deletion nvdiffrast/common/cudaraster/CudaRaster.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,8 @@ class CudaRaster
CudaRaster (void);
~CudaRaster (void);

void setViewportSize (int width, int height, int numImages); // Width and height are internally rounded up to multiples of tile size (8x8) for buffer sizes.
void setBufferSize (int width, int height, int numImages); // Width and height are internally rounded up to multiples of tile size (8x8) for buffer sizes.
void setViewport (int width, int height, int offsetX, int offsetY); // Tiled rendering viewport setup.
void setRenderModeFlags (unsigned int renderModeFlags); // Affects all subsequent calls to drawTriangles(). Defaults to zero.
void deferredClear (unsigned int clearColor); // Clears color and depth buffers during next call to drawTriangles().
void setVertexBuffer (void* vertices, int numVertices); // GPU pointer managed by caller. Vertex positions in clip space as float4 (x, y, z, w).
Expand Down
2 changes: 1 addition & 1 deletion nvdiffrast/common/cudaraster/impl/Buffer.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ class Buffer

void reset (size_t bytes);
void grow (size_t bytes);
void* getPtr (void) { return m_gpuPtr; }
void* getPtr (size_t offset = 0) { return (void*)(((uintptr_t)m_gpuPtr) + offset); }
size_t getSize (void) const { return m_bytes; }

void setPtr (void* ptr) { m_gpuPtr = ptr; }
Expand Down
9 changes: 7 additions & 2 deletions nvdiffrast/common/cudaraster/impl/CudaRaster.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,14 @@ CudaRaster::~CudaRaster()
delete m_impl;
}

void CudaRaster::setViewportSize(int width, int height, int numImages)
void CudaRaster::setBufferSize(int width, int height, int numImages)
{
m_impl->setViewportSize(Vec3i(width, height, numImages));
m_impl->setBufferSize(Vec3i(width, height, numImages));
}

void CudaRaster::setViewport(int width, int height, int offsetX, int offsetY)
{
m_impl->setViewport(Vec2i(width, height), Vec2i(offsetX, offsetY));
}

void CudaRaster::setRenderModeFlags(U32 flags)
Expand Down
30 changes: 15 additions & 15 deletions nvdiffrast/common/cudaraster/impl/FineRaster.inl
Original file line number Diff line number Diff line change
Expand Up @@ -241,20 +241,20 @@ __device__ __inline__ void fineRasterImpl(const CRParams p)
}
else // otherwise => read tile from framebuffer
{
U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
tileColor[threadIdx.x] = pColor[px + p.widthPixels * py];
tileDepth[threadIdx.x] = pDepth[px + p.widthPixels * py];
tileColor[threadIdx.x + 32] = pColor[px + p.widthPixels * (py + 4)];
tileDepth[threadIdx.x + 32] = pDepth[px + p.widthPixels * (py + 4)];
U32* pColor = (U32*)p.colorBuffer + p.strideX * p.strideY * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.strideX * p.strideY * blockIdx.z;
tileColor[threadIdx.x] = pColor[px + p.strideX * py];
tileDepth[threadIdx.x] = pDepth[px + p.strideX * py];
tileColor[threadIdx.x + 32] = pColor[px + p.strideX * (py + 4)];
tileDepth[threadIdx.x + 32] = pDepth[px + p.strideX * (py + 4)];
}

// read peeling inputs if enabled
if (p.renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling)
{
U32* pPeel = (U32*)p.peelBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
tilePeel[threadIdx.x] = pPeel[px + p.widthPixels * py];
tilePeel[threadIdx.x + 32] = pPeel[px + p.widthPixels * (py + 4)];
U32* pPeel = (U32*)p.peelBuffer + p.strideX * p.strideY * blockIdx.z;
tilePeel[threadIdx.x] = pPeel[px + p.strideX * py];
tilePeel[threadIdx.x + 32] = pPeel[px + p.strideX * (py + 4)];
}

U32 tileZMax;
Expand Down Expand Up @@ -372,12 +372,12 @@ __device__ __inline__ void fineRasterImpl(const CRParams p)
{
int px = (tileX << CR_TILE_LOG2) + (threadIdx.x & (CR_TILE_SIZE - 1));
int py = (tileY << CR_TILE_LOG2) + (threadIdx.x >> CR_TILE_LOG2);
U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
pColor[px + p.widthPixels * py] = tileColor[threadIdx.x];
pDepth[px + p.widthPixels * py] = tileDepth[threadIdx.x];
pColor[px + p.widthPixels * (py + 4)] = tileColor[threadIdx.x + 32];
pDepth[px + p.widthPixels * (py + 4)] = tileDepth[threadIdx.x + 32];
U32* pColor = (U32*)p.colorBuffer + p.strideX * p.strideY * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.strideX * p.strideY * blockIdx.z;
pColor[px + p.strideX * py] = tileColor[threadIdx.x];
pDepth[px + p.strideX * py] = tileDepth[threadIdx.x];
pColor[px + p.strideX * (py + 4)] = tileColor[threadIdx.x + 32];
pDepth[px + p.strideX * (py + 4)] = tileDepth[threadIdx.x + 32];
}
}
}
Expand Down
11 changes: 9 additions & 2 deletions nvdiffrast/common/cudaraster/impl/PrivateDefs.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -85,14 +85,19 @@ struct CRParams
void* vertexBuffer; // numVertices * float4(x, y, z, w)
void* indexBuffer; // numTriangles * int3(vi0, vi1, vi2)

S32 widthPixels; // Buffer size in pixels. Must be multiple of tile size (8x8).
S32 widthPixels; // Render buffer size in pixels. Must be multiple of tile size (8x8).
S32 heightPixels;
S32 widthPixelsVp; // Viewport size in pixels.
S32 heightPixelsVp;
S32 widthBins; // widthPixels / CR_BIN_SIZE
S32 heightBins; // heightPixels / CR_BIN_SIZE
S32 numBins; // widthBins * heightBins

F32 xs; // Vertex position adjustments for tiled rendering.
F32 ys;
F32 xo;
F32 yo;

S32 widthTiles; // widthPixels / CR_TILE_SIZE
S32 heightTiles; // heightPixels / CR_TILE_SIZE
S32 numTiles; // widthTiles * heightTiles
Expand Down Expand Up @@ -130,11 +135,13 @@ struct CRParams
void* activeTiles; // CR_MAXTILES_SQR * (S32 tileIdx)
void* tileFirstSeg; // CR_MAXTILES_SQR * (S32 segIdx), -1 = none

// Surface buffers.
// Surface buffers. Outer tile offset is baked into pointers.

void* colorBuffer; // sizePixels.x * sizePixels.y * numImages * U32
void* depthBuffer; // sizePixels.x * sizePixels.y * numImages * U32
void* peelBuffer; // sizePixels.x * sizePixels.y * numImages * U32, only if peeling enabled.
S32 strideX; // horizontal size in pixels
S32 strideY; // vertical stride in pixels

// Per-image parameters for first images are embedded here to avoid extra memcpy for small batches.

Expand Down
47 changes: 37 additions & 10 deletions nvdiffrast/common/cudaraster/impl/RasterImpl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,11 @@ RasterImpl::RasterImpl(void)
m_bufferSizesReported (0),

m_numImages (0),
m_bufferSizePixels (0, 0),
m_bufferSizeVp (0, 0),
m_sizePixels (0, 0),
m_sizeVp (0, 0),
m_offsetPixels (0, 0),
m_sizeBins (0, 0),
m_numBins (0),
m_sizeTiles (0, 0),
Expand Down Expand Up @@ -81,24 +84,40 @@ RasterImpl::~RasterImpl(void)

//------------------------------------------------------------------------

void RasterImpl::setViewportSize(Vec3i size)
void RasterImpl::setBufferSize(Vec3i size)
{
// Internally round buffer sizes to multiples of 8.
// Internal buffer width and height must be divisible by tile size.
int w = (size.x + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
int h = (size.y + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);

m_bufferSizePixels = Vec2i(w, h);
m_bufferSizeVp = Vec2i(size.x, size.y);
m_numImages = size.z;

m_colorBuffer.reset(w * h * size.z * sizeof(U32));
m_depthBuffer.reset(w * h * size.z * sizeof(U32));
}

//------------------------------------------------------------------------

void RasterImpl::setViewport(Vec2i size, Vec2i offset)
{
// Offset must be divisible by tile size.
NVDR_CHECK((offset.x & (CR_TILE_SIZE - 1)) == 0 && (offset.y & (CR_TILE_SIZE - 1)) == 0, "invalid viewport offset");

// Round internal viewport size to multiples of tile size.
int w = (size.x + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);
int h = (size.y + CR_TILE_SIZE - 1) & (-CR_TILE_SIZE);

m_numImages = size.z;
m_sizePixels = Vec2i(w, h);
m_offsetPixels = offset;
m_sizeVp = Vec2i(size.x, size.y);
m_sizeTiles.x = m_sizePixels.x >> CR_TILE_LOG2;
m_sizeTiles.y = m_sizePixels.y >> CR_TILE_LOG2;
m_numTiles = m_sizeTiles.x * m_sizeTiles.y;
m_sizeBins.x = (m_sizeTiles.x + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
m_sizeBins.y = (m_sizeTiles.y + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
m_numBins = m_sizeBins.x * m_sizeBins.y;

m_colorBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
m_depthBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
}

void RasterImpl::swapDepthAndPeel(void)
Expand Down Expand Up @@ -273,6 +292,11 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
p.heightBins = m_sizeBins.y;
p.numBins = m_numBins;

p.xs = (float)m_bufferSizeVp.x / (float)m_sizeVp.x;
p.ys = (float)m_bufferSizeVp.y / (float)m_sizeVp.y;
p.xo = (float)(m_bufferSizeVp.x - m_sizeVp.x - 2 * m_offsetPixels.x) / (float)m_sizeVp.x;
p.yo = (float)(m_bufferSizeVp.y - m_sizeVp.y - 2 * m_offsetPixels.y) / (float)m_sizeVp.y;

p.widthTiles = m_sizeTiles.x;
p.heightTiles = m_sizeTiles.y;
p.numTiles = m_numTiles;
Expand Down Expand Up @@ -300,9 +324,12 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
p.activeTiles = m_activeTiles.getPtr();
p.tileFirstSeg = m_tileFirstSeg.getPtr();

p.colorBuffer = m_colorBuffer.getPtr();
p.depthBuffer = m_depthBuffer.getPtr();
p.peelBuffer = (m_renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) ? m_peelBuffer.getPtr() : 0;
size_t byteOffset = ((size_t)m_offsetPixels.x + (size_t)m_offsetPixels.y * (size_t)p.strideX) * sizeof(U32);
p.colorBuffer = m_colorBuffer.getPtr(byteOffset);
p.depthBuffer = m_depthBuffer.getPtr(byteOffset);
p.peelBuffer = (m_renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) ? m_peelBuffer.getPtr(byteOffset) : 0;
p.strideX = m_bufferSizePixels.x;
p.strideY = m_bufferSizePixels.y;

memcpy(&p.imageParamsFirst, imageParams, min(m_numImages, CR_EMBED_IMAGE_PARAMS) * sizeof(CRImageParams));
p.imageParamsExtra = (CRImageParams*)m_crImageParamsExtra.getPtr();
Expand All @@ -315,7 +342,7 @@ void RasterImpl::launchStages(bool instanceMode, bool peel, cudaStream_t stream)
dim3 frBlock(32, m_numFineWarpsPerBlock);
void* args[] = {&p};

// Launch stages from setup to coarse and copy atomics to host only if this is not a peeling iteration.
// Launch stages from setup to coarse and copy atomics to host only if this is not a single-tile peeling iteration.
if (!peel)
{
if (instanceMode)
Expand Down
6 changes: 5 additions & 1 deletion nvdiffrast/common/cudaraster/impl/RasterImpl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,8 @@ class RasterImpl
RasterImpl (void);
~RasterImpl (void);

void setViewportSize (Vec3i size);
void setBufferSize (Vec3i size);
void setViewport (Vec2i size, Vec2i offset);
void setRenderModeFlags (U32 flags) { m_renderModeFlags = flags; }
void deferredClear (U32 color) { m_deferredClear = true; m_clearColor = color; }
void setVertexBuffer (void* ptr, int numVertices) { m_vertexPtr = ptr; m_numVertices = numVertices; } // GPU pointer.
Expand Down Expand Up @@ -52,8 +53,11 @@ class RasterImpl
Buffer m_depthBuffer;
Buffer m_peelBuffer;
int m_numImages;
Vec2i m_bufferSizePixels; // Internal buffer size.
Vec2i m_bufferSizeVp; // Total viewport size.
Vec2i m_sizePixels; // Internal size at which all computation is done, buffers reserved, etc.
Vec2i m_sizeVp; // Size to which output will be cropped outside, determines viewport size.
Vec2i m_offsetPixels; // Viewport offset for tiled rendering.
Vec2i m_sizeBins;
S32 m_numBins;
Vec2i m_sizeTiles;
Expand Down
9 changes: 9 additions & 0 deletions nvdiffrast/common/cudaraster/impl/TriangleSetup.inl
Original file line number Diff line number Diff line change
Expand Up @@ -257,6 +257,15 @@ __device__ __inline__ void triangleSetupImpl(const CRParams p)
float4 v1 = vertexBuffer[vidx.y];
float4 v2 = vertexBuffer[vidx.z];

// Adjust vertex positions according to current viewport size and offset.

v0.x = v0.x * p.xs + v0.w * p.xo;
v0.y = v0.y * p.ys + v0.w * p.yo;
v1.x = v1.x * p.xs + v1.w * p.xo;
v1.y = v1.y * p.ys + v1.w * p.yo;
v2.x = v2.x * p.xs + v2.w * p.xo;
v2.y = v2.y * p.ys + v2.w * p.yo;

// Outside view frustum => cull.

if (v0.w < fabsf(v0.x) | v0.w < fabsf(v0.y) | v0.w < fabsf(v0.z))
Expand Down
Loading

0 comments on commit 729261d

Please sign in to comment.