Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Vulkan] Support uniform buffer object for passing many scalar arguments #7717

Merged
merged 22 commits into from
Apr 11, 2021

Conversation

masahi
Copy link
Member

@masahi masahi commented Mar 22, 2021

We are using push constants to pass scalar arguments to SPIR-V kernels. The vulkan spec ensures that the size of push constants storage is at least 128 bytes. Since we pass each scalar via 64 bit union, that means we can only pass 16 parameters via push constants. Unfortunately, for fused, dynamic input kernel, the number of any_dim and other params can easily go beyond that limit. See for example this crazy kernel that needs 20 scalars to be passed in https://gist.github.com/masahi/ce51c0d51c6115109203b3732f185aab

This PR enables passing parameters beyond push constants limit, using uniform buffer object. In particular, with this PR I was able to run PyTorch MaskRCNN end to end on Vulkan!

Running NMS tests in onnx/test_forward.py serves as a test case.

There are some minor TODOs left, but I'd like to get some feedback. Supporting UBO requires making non trivial changes to both codegen and runtime, but hopefully they are straightforward.

@tqchen @tmoreau89 @jwfromm @ajtulloch

TODO

  • Query platform specific push constant size limit rather than using a hard coded constant
  • Fix segfault on deleting UBO host buffer (Seems I shouldn't allocate/deallocate the host buf explicitly, rather use vkMapMemory/vkUnMapMemory)
  • Add doc comments in ir_builder.h

@masahi masahi marked this pull request as draft March 22, 2021 08:38
@masahi masahi force-pushed the vk-ubo branch 2 times, most recently from 54a534c to 62d19eb Compare April 8, 2021 08:16
@masahi masahi marked this pull request as ready for review April 8, 2021 08:34
@masahi masahi requested a review from tqchen April 8, 2021 08:37
@masahi
Copy link
Member Author

masahi commented Apr 8, 2021

@tqchen This is ready for review.

On the codegen side, UBO is declared in a similar way as push constants, so I made DeclareStorageVariable function to declare both storage type.

On the runtime side, I introduced CreateBuffer function to declare both device storage buffers and uniform buffers. The change should be straightforward.

One potentially controversial point could be the use of runtime function MaxPushConstantsSize() during codegen, to query the HW param and decide if we should use push constants or UBO.

@tqchen
Copy link
Member

tqchen commented Apr 8, 2021

cc @ajtulloch @antinucleon please help to see if you can take a look

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @masahi for these changes, I left a comment on cosmetics

src/target/spirv/ir_builder.h Outdated Show resolved Hide resolved
@tmoreau89 tmoreau89 merged commit 5bc1cec into apache:main Apr 11, 2021
@tmoreau89
Copy link
Contributor

Thanks you @masahi your PR has been merged

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

Sorry that i didn't get a chance to review. Here are some items that needs to be fixed as followup PRs:

  • Let us not use GetMaxPushConstantSize, as it is dynamic across devices, and the code won't behave correctly if we are doing cross compilation(where we compile on a machine that does not have vulkan), we should use a minimum TVM default (e.g. 128 bytes).
  • I see a few places where delete is used, always prefer std::unique_ptr when possible instead of delete

@masahi
Copy link
Member Author

masahi commented Apr 11, 2021

@tqchen

  • I added one delete but I just followed the existing use of delete in

    vkDestroyBuffer(vctx.device, pbuf->buffer, nullptr);
    vkFreeMemory(vctx.device, pbuf->memory, nullptr);
    delete pbuf;
    Since we need to manually delete or free associated vk related resources anyway, I don't mind this delete too much

  • Is there a way to make GetMaxPushConstantSize work for cross compilation use cases? Because for vulkan in particular, we definitely need to query HW related information like this at compile time. For example, the max thread block sizes, warp size, what extentions are available (subgroup ops extensions correspond to warp shuffle ops in CUDA but not all vk devices support them) etc. The situation with warp size is the worst because we don't have warp_size specified for vulkan target https://github.com/apache/tvm/blob/main/src/target/target_kind.cc#L273-L275 but some topi ops uses thread_warp_size

    warp_size = tvm.target.Target.current(False).thread_warp_size
    so right now we cannot compile these ops on vulkan.

I'm ok if we decide to use a fixed max push constant size, but for other HW params we need to have a solution for runtime querying.

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

Thanks @masahi. I agree that the use of delete is minor.

In the case of push constant, we can try to obtain these constant and put them into target. The main problem is the code as it is we are querying on the local machine but we need parameters on the remote. It would be useful to consider de-coupling to avoid such inconsistency(of information obtained by compiler and information in the runtime). e.g. The kernel should declare whether UBO or push constant is being used in its function meta-data, so we do not need to make dependency on the runtime parameters (e.g. the compiler decides whether to use UBO by arg size, and write that to an optional meta data of the shader to enable UBO). e.g. we can add a bitmask flag to VulkanShader::flag to indicate the difference

The runtime simply read that information and run dispatch, this will avoid inconsistency during compile time and runtime. We can then enhance the target registration to include such informations of max push constant per target, while defaults to the minimum value.

After taking a closer look, there are a few more potential things that needs to be fixed:

  • Vulkan runtime have deferred mode and immediate mode that needs to be executed in separate ways(see the two places using push constants), we will need to do the same thing for UBO logic, right now seems that the code will work for immediate mode but not deferred mode
  • Right now the UBO argument buffer simply re-allocates without deleting the original buffer, this could cause memory leaks. We will need to have a logic that re-allocates if necessary(also think about relation to synchronization in deferred case) and use the buffer if the size is large enough.

@masahi
Copy link
Member Author

masahi commented Apr 11, 2021

I confirmed that it also works for deferred mode (by always returning false from UseImmediate()). The relevent changes are

Need to look at the reallocate problem

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

Sorry I was wrong about the re-allocation, seems you are allocating a single UBO per-kernel, so it might be fine. My original thought was along the line of single UBO per runtime.

@masahi
Copy link
Member Author

masahi commented Apr 11, 2021

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

To followup on the deffered execution case:

Because memcpy happens here: https://github.com/masahi/tvm/blob/1a3dbee99c9a2c362373707678d5657e59ea6827/src/runtime/vulkan/vulkan.cc#L1150

Imagine there are two consecutive launch of the same kernel A(uses UBO) that are deferred. Then the memcpy of the second kernel launch would override the value of the first launch(both of which hasn't yet been launched). So we want to move memcpy to the lambda in the immediate and deferred kernels(besides the push constant calls)

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

To followup on the per pipeline UBO, after thinking a bit more about this design. Per pipeline UBO is fine for single threaded case, but can also be problematic under a multi-threaded setting when multiple threads are launching the same kernel A.

To make the runtime thread-safe, we normally needs to divide the data structure into constants(e.g. pipeline) and runtime structure(e.g. staging buffer, streams). The runtime structure part belongs to VulkanThreadEntry that comes with a thread-local copy to avoid threading issue. So we would want to do that, and uses a similar logic as StagingBuffer to create staging buffer for UBO.

Finally, right now seems we relies on being able to have an unified memory(host mapped memory) for UBO. Will all GPU support such kind of memory? Shall we allow explicit data copy into the UBO buffer when such memory is not supported?

@tqchen
Copy link
Member

tqchen commented Apr 11, 2021

Thanks @masahi for great discussions. It seems to me that we would have benefited more reviews and discussion :) Great that we have catched these points for improvements during the discussion after merging. Given it is already merged, we could consider:

  • A0: Revert the PR, and bring another one, do another round of thorough reviews there.
  • A1: Bring a fix PR as soon as possible.

To summarize the items that needs to be addressed:

  • P0: Add the use of UBO as meta-data to shader, so we won't have compiler runtime in-consistentcy problem
  • P1: Solve the deferred execution case when a UBO buffer can be overriden by the followup kernel calls(need to do the copy in deferred stream as well).
  • P2: Solve the multi-thread safety issue when multiple threads calls into UBO at the same time.

@masahi
Copy link
Member Author

masahi commented Apr 11, 2021

ok sent the revert #7821

@tmoreau89
Copy link
Contributor

@masahi @tqchen Thanks for the comments, it appears as though the merge might have been to hasty. I'm good with the decision to go with A0; thanks @masahi for issuing the revert.

tqchen pushed a commit that referenced this pull request Apr 11, 2021
tmoreau89 pushed a commit to tmoreau89/tvm that referenced this pull request Apr 11, 2021
…nts (apache#7717)

* ubo codegen first cut

* begin runtime change for UBO

* allocate and bind ubo

* query memory type for uniform

* refactor

* do not use float64

* trying an approach similar to push constant

* add more log

* do not delete ubo when not using it

* cumsum and nms test working with ubo

* remove log

* cleaning up

* formatting

* revert BufferArgument change

* refactored codegen

* minor fix

* introduce value kind for ubo

* fix cpplint and revert float64 change

* query push constant size using runtime API

* let vkmap/unmap allocate and delete host_buf

* doc update

* fix typo

Co-authored-by: Masahiro Masuda <masahi@[email protected]>
tmoreau89 pushed a commit to tmoreau89/tvm that referenced this pull request Apr 11, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…nts (apache#7717)

* ubo codegen first cut

* begin runtime change for UBO

* allocate and bind ubo

* query memory type for uniform

* refactor

* do not use float64

* trying an approach similar to push constant

* add more log

* do not delete ubo when not using it

* cumsum and nms test working with ubo

* remove log

* cleaning up

* formatting

* revert BufferArgument change

* refactored codegen

* minor fix

* introduce value kind for ubo

* fix cpplint and revert float64 change

* query push constant size using runtime API

* let vkmap/unmap allocate and delete host_buf

* doc update

* fix typo

Co-authored-by: Masahiro Masuda <masahi@[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…nts (apache#7717)

* ubo codegen first cut

* begin runtime change for UBO

* allocate and bind ubo

* query memory type for uniform

* refactor

* do not use float64

* trying an approach similar to push constant

* add more log

* do not delete ubo when not using it

* cumsum and nms test working with ubo

* remove log

* cleaning up

* formatting

* revert BufferArgument change

* refactored codegen

* minor fix

* introduce value kind for ubo

* fix cpplint and revert float64 change

* query push constant size using runtime API

* let vkmap/unmap allocate and delete host_buf

* doc update

* fix typo

Co-authored-by: Masahiro Masuda <masahi@[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
…nts (apache#7717)

* ubo codegen first cut

* begin runtime change for UBO

* allocate and bind ubo

* query memory type for uniform

* refactor

* do not use float64

* trying an approach similar to push constant

* add more log

* do not delete ubo when not using it

* cumsum and nms test working with ubo

* remove log

* cleaning up

* formatting

* revert BufferArgument change

* refactored codegen

* minor fix

* introduce value kind for ubo

* fix cpplint and revert float64 change

* query push constant size using runtime API

* let vkmap/unmap allocate and delete host_buf

* doc update

* fix typo

Co-authored-by: Masahiro Masuda <masahi@[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
…nts (apache#7717)

* ubo codegen first cut

* begin runtime change for UBO

* allocate and bind ubo

* query memory type for uniform

* refactor

* do not use float64

* trying an approach similar to push constant

* add more log

* do not delete ubo when not using it

* cumsum and nms test working with ubo

* remove log

* cleaning up

* formatting

* revert BufferArgument change

* refactored codegen

* minor fix

* introduce value kind for ubo

* fix cpplint and revert float64 change

* query push constant size using runtime API

* let vkmap/unmap allocate and delete host_buf

* doc update

* fix typo

Co-authored-by: Masahiro Masuda <masahi@[email protected]>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants