-
Notifications
You must be signed in to change notification settings - Fork 57
OWL for Blackbelts: CUDA Interop and Asynch Launches
This page is intended for advanced OWL users, and goes into some detail on how asynchronous launches and/or CUDA interop can be employed to overlap multiple different launches and/or CUDA kernels.
If you have not done so, you should first read up on what "Variables" and "Launch Params" are, and how they are usually used in more common scenarios.
In the default usage, using launch params is super-simple: define then
device-side struct with all the C++/CUDA types members you want
to use on the device, and create an OWLParams
variable on the host that describes this; then when you are ready to issue
a launch set the variables of that OWLParams
object, and pass it to the owlLaunch2D
call.
Then change the variables for the next launch, launch again, etc.
This is the most common usage of launch parameters, and in most cases will be entirely sufficient. If you do not understand the rest of this page, simply ignore it, and use launch parameters as just described.
The main issue you have with the owlLaunch2D()
based way of launching
OWL is that this method is entirely synchronous with respect to the host: OWL issues
the launch on the device, and waits for it to complete. By the time owlLaunch2D
returns,
the kernel is done, and any buffers it might have written in the raygen program are written.
This is good because it avoids potential pit-falls of reading frame buffers while the
raygen is still working on it, etc ... but has the obvious drawback that you may lose a bit of
performance: after the raygen is finished, the GPU will remain idle until new work has been issued for it.
To avoid this, and keep the GPU busy all the time, you can also use launches in a more advanved way, by exeuting "async" launches that will not wait for the GPU to finish work. You then can - and have to
- manually manage any dependencies among possibly multiple such launches and/or other async operations (such
as CUDA interop via
cudaMemcpyAsync
or "manual" CUDA kernel launches).
As mentioned avbove the name of the launch params in device code is
fixed to optixLaunchParams
, and cannot be changed; but its values
can.
If you do want to launch different raygen programs then you can of
course do so by creating different OWLRayGen
instances, and using
different ones in different launches. If all different raygen programs
use the same global launch params, then this is trivial, and you can
even use the same OWLRayGen
variable for launches with different
raygens.
If you do, however, want the different raygens to use different launch params, then you have the obvious problem that on the device there's only a single variable to write them into. There are, however, two or three differnet ways of doing this.
Maybe the easiest option is to simply create a new struct that imply
contains all possible variants you might want to use as a separate
member. E.g, imaging you have two different raygen programs (say,
lightPass
and renderPass
) with a different struct each:
struct LightPassLP {
int something;
...
};
struct RenderPassLP {
int other;
...
}
Then the easiest way is to create a single struct that contains both,
and use a optixLaunchParams
of that type:
struct MergedLPs {
LightPassLP lightPass;
RenderPassLP renderPass;
}
__constant__ MergedLPs optixLaunchParams;
Then in the raygen program simply only ever read the members you want for that pass:
OPTIX_RAYGEN_PROGRAM(renderPass)()
{
auto &lp = optixLaunchParams.renderPass;
...
}
On the host side, you would create a single OWLParams
struct for the
entire MergedLPs
struct, but would obviously only have to set
those variables that the respective next pass expects.
Intead of a struct
, one can of course also use a union
to "merge"
the different structs. However, one issue that arises then is that the
different unions' members are "aliasing" to the same memory within the
merged struct, and since OWL doesn't have any notion of a "unions" of
variables it cannot resolve this. Consequently, if you create a
OWLParams
struct where two different OWLVarDecls
point to the same
memory offsets, then undefined behvavious may (and in fact, will
happen). I.e., the following code would be invalid:
struct MergedLPs {
union {
LightPassLP lightPass;
RenderPassLP renderPass;
};
}
...
__constant__ MergedLPs optixLaunchParams;
OWLVarDecl lpVars[] = {
{ "lightPass.something", OWL_INT, OWL_OFFSETOF(...) },
...
{ "renderPass.other", OWL_INT, OWL_OFFSETOF(...) },
...
};
OWLParams lp = owlParamsCreate(context,...);
In this code, OWL couldn't possibly know which of the two variables to write to the memory location that they both share.
You can, however, create two different OWLParams
structs, with
each one only using one of the unions:
OWLVarDecl lightPassVars[] = {
{ "lightPass.something", OWL_INT, OWL_OFFSETOF(...) },
...
};
OWLParams lightPassLPs = owlParamsCreate(.. lightPassVars, ..);
OWLVarDecl renderPassVars[] = {
{ "renderPass.something", OWL_INT, OWL_OFFSETOF(...) },
...
};
OWLParams renderPassLPs = owlParamsCreate(.. renderPassVars, ..);
Note: A variant of this model is to declare the OWLParams
directly
on the LightPass
and RenderPass
structs, and make the
optixLaunchParams
variable itself be a union:
struct LightPassLP { ... };
struct RenderPassLP { ... };
__constant__ union {
LightPassLP lightPass;
RenderPassLP renderPass;
} optixLaunchParams;
In that variant the OWLParams
for the two passes can be defined
directly over their respective structs. You would, however, still have
to create two different OWLParams
, one for each launch.
If you have your raygen programs written in different .cu
files, you
can also use a diffrently typed optixLaunchParams
variable in each
such CUDA file; ie, it could be of type RenderPassLP
in
renderPass.cu
containing the renderPass()
raygen program, and be
of LightPassLP
in lightPass.cu
, etc.
This does, in fact, work in exactly the same way as the union option
above, so you would also have to have differnet OWLParam
s for the
different raygens. You would also have to be very(!) careful if you
had other programs (say, a closet hit program) defined in one .cu
file, and called this from the raygen in the other.
The default owlLaunch2D()
call is synchronous, in the sense that it
uploads the params, laucnhes the raygen program, and waits for
completion. However, you do not necessarily have to wait for this
launch to complete: instead of owlLaunch2D()
you can also use
owlAsyncLaunch2D()
in which case OWL uploads the params, and
launches the raygen into a CUDA stream that is associated with the
lp
variable, but will not wait for that raygen to complete. You
can then do something else in the meantime while this launch is
running, and explicitly wait for its completion later on via
owlLaunchSync(lp)
, roughly like this:
OWLParams lp = owlParamsCreate(...);
OWLRayGen rg = ...;
...
owlParamsSet1i(lp, ....);
owlAsyncLaunch2D(rg, size.x, size.y, lp);
...
doSomethingElse();
...
owlLaunchSync(lp);
Note that the owlLaunchSync()
needs to get called with the same lp
variable used during the launch - it is that lp
variable that keeps
track of the launch, the CUDA stream used to execute it, etc.
If you want to issue any other CUDA kernel calls into the same stream
(or do async memcpys in that stream, etc), you can always query the
CUDA stream associated with the launch using
owlParamsGetCudaStream(...)
, and use that for any kernel
calls. E.g., you can do something like this:
owlParamsSetBuffer(lp, "frameBuffer", fb);
owlAsyncLaunch2D(...., lp);
CUstream stream = owlParamsGetCudaStream(lp,0);
myCudaToneMapper<<<..., ..., stream>>>(owlBufferGetPointer(fb,0), ... );
Note that OWL uses a different CUDA stream for each device, so
owlParamsGetCudaStream()
needs to be told for which device you want
to query this stream.
In this example, the CUDA tone mapper kernel gets launched into the same stream as the one used by OWL's raygen launch, so - based on how CUDA streams work - will have an automatic dependency on the render launch, and not launch until that render launch is done.
All the owlLaunchSync(lp)
is doing is waiting for the LPs'
associated stream to complete, so you can issue multiple async
launches into the same stream, and have a single owlLaunchSync()
for
all of them. However, given how CUDAs streams work the different
launches will internally serialize within that thread. For example,
consider the following code in which we have two different raygen
programs (say, for example, tracePhotons()
and renderFrame()
, with
the former tracing photons and splatting them into a buffer, and the
latter ray tracing the result):
OWLParams lp = ...;
...
owlParamsSet1i(lp,"rngSeed", frameID);
owlParamsSetBuffer(lp,"bakedLightBuffer", bakedLightBuffer);
owlsAyncLaunch2D(rgTracePhotons, numPhotons, 1, lp);
owlParamsSet3f(lp,"camera.org", ...);
owlAsyncLaunch2D(rgRenderScene, fbSize.x, fbSize.y, lp);
owlParamsSync(lp);
In that code:
- the first async launch will "freeze" the variables at that point in time, "stage" the uploading of those variables, and issue the launch, but will not wait for it to return.
- the second variable assignment to lp right after that first launch will not have any side effects on the preceiding launch, even though that launch may still be running (in fact, it migth not even have started yet!), because OWL will automatically create the right copies and dependencie.
- the second
owlAynsLaunch()
will then issue a second launch - with a different raygen program, and potentially different variable values, and again return right after the launch is issues, without waiting for it. Since we didn't yet wait on anything, even the first launch may not even have started yet. - since the second launch used the same
lp
variable, it gets issued into the same stream - ie, CUDA will schedule it into that stream, but will not actually launch it until the first launch has completed. - the final
owlParamsSync()
waits for the stream associated withlp
- and since both launches are in that same stream it will not return until both launches have completed.
Based on how CUDA streams work, multiple launches using the same lps are asynchronous with respect to the host (ie, the host issuing the launch call will not wait for that launch to complete), but differnet launches in the same stream will automatically serialize. In the example above makes total sense: you actually do not want the render launch to start until the baking stage is done!
If instead you want to have different launches that are asynchronous
to each other, then you can do that, too, using different OWLParam
instances. Eg, assuming you wnat to issue a path tracer with N samples
per pixel, you could either do have a single launch that iterates 16
times in the raygen program .... or you could instead issue N
different launches with one sample per pixel:
OWLParams lp = ...;
...
for (int sampleID=0;sampleID<numSamplesPerPixel;sampleID++) {
owlParams1i(lp,"sampleID",sampleID);
owlAyncLaunch2D(rgOnePathPerPixel,fbSize.x,fbSize.y,lp);
}
owlLaunchSync(lp);
In this example, you'd probably not see much of a benefit (in fact, it may well be slower...), because though you do issue all those launches asynchronously, because they use the same lp they also use the same stream, and thus, on the device, will autmatically serialize.
However, now consider the following:
const int numParallel = 4;
OWLparams lp[numParallel];
for (...) lp[i] = ....;
...
for (int sampleID=0; .... ) {
owlParams1i(lp[sampleID%numParallel], ...);
owlAsyncLaunch2D(.... lp[sampleID%numParallle]);
}
for (....) owlLaunchSync(...);
Now, in this example we use four different LPs, and each with its own CUDA stream. In this case, the first four launches can actually run fully parallel to each other - the fifth launch uses the same stream as the first, and thus has to wait for the first one to complete - but other launches can run in parallel, and thus make use of any idle SMs the moment they become available.
Where this can be helpful is if you have a huge variation in per-pixel cost: if one pixel is significantly more expensive then others the CUDA thread computing that pixel may take so long to complete that other threads run out of work until it is completed, leading to idle time; but splitting it up into different streams means that different sample IDs can run in parallel in differnet launches.