Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan backend bug in region_allocator #8079

Closed
immortalsalomon opened this issue Feb 8, 2024 · 10 comments · Fixed by #8130
Closed

Vulkan backend bug in region_allocator #8079

immortalsalomon opened this issue Feb 8, 2024 · 10 comments · Fixed by #8130
Assignees
Labels

Comments

@immortalsalomon
Copy link

immortalsalomon commented Feb 8, 2024

Hi all, I think I have found a bug in the region_allocator used by Vulkan.

In short, the bug is created by a call to the can_split(const BlockRegion *block_region, size_t size) method which, instead of checking the size returned by the conform_size(size_t offset, size_t size, size_t alignment, size_t nearest_multiple) method, uses the size set in the MemoryRequest. This can cause a problem if the free region returned by the method find_block_region has the same size as the value returned by the conform_size method. Since the request->size value is used this case is not recognised and leads to the creation of a memory region with a theoretical size of zero. When a new memory region is created, the conform_size method is used to determine its size. The method will not return 0 but the nearest multiple of the properties.nearest_multiple value starting from the actual_alignment value passed to the method. This happens do to this check.

This creates the problem that if all regions in a block are free and the coalesce_block_regions method is called, the size of the resulting region exceeds the size of the block.

image
This also leads to the following error:
image

The fix I made is as follows. I have not tested it yet.
Original:

    BlockRegion *block_region = find_block_region(user_context, request);
    if (block_region == nullptr) {
#ifdef DEBUG_RUNTIME_INTERNAL
        debug(user_context) << "RegionAllocator: Failed to locate region for requested size ("
                            << (int32_t)(request.size) << " bytes)!\n";
#endif
        return nullptr;
    }
    if (can_split(block_region, request->size)) {
#ifdef DEBUG_RUNTIME_INTERNAL
        debug(user_context) << "RegionAllocator: Splitting region of size ( " << (int32_t)(block_region->memory.size) << ") "
                            << "to accomodate requested size (" << (int32_t)(request.size) << " bytes)!\n";
#endif
        split_block_region(user_context, block_region, request.size, request.alignment);
    }

Fixed:

    BlockRegion *block_region = find_block_region(user_context, request);
    if (block_region == nullptr) {
#ifdef DEBUG_RUNTIME_INTERNAL
        debug(user_context) << "RegionAllocator: Failed to locate region for requested size ("
                            << (int32_t)(request.size) << " bytes)!\n";
#endif
        return nullptr;
    }

    // Need to check if empty region is not 0
    actual_size = conform_size(block_region->memory.offset, request.size, actual_alignment, block->memory.properties.nearest_multiple);

    if (can_split(block_region, actual_size)) {
#ifdef DEBUG_RUNTIME_INTERNAL
        debug(user_context) << "RegionAllocator: Splitting region of size ( " << (int32_t)(block_region->memory.size) << ") "
                            << "to accomodate requested size (" << (int32_t)(request.size) << " bytes)!\n";
#endif
        split_block_region(user_context, block_region, request.size, request.alignment);
    }
@derek-gerstmann
Copy link
Contributor

Thanks very much for the bug report! I believe you've found the root cause for this lingering issue that triggers this validation error that I hadn't been able to reproduce very easily. I'll take a look and investigate!

@immortalsalomon
Copy link
Author

immortalsalomon commented Feb 9, 2024

Truly happy to help. You have done an extraordinary job :).

I found another trigger that generates the same validation error in Vulkan. The cause is the value of actual_alignment and the value of properties.nearest_multiple. I am testing my code on two graphics cards. One is an NVidia where the actual_alignment is 16 Bytes and the other is an intel integrated graphics card where the actual_alignment is 64 Bytes. With the former the error is not thrown but with the latter it is. This is because vulkan's vkGetBufferMemoryRequirements() method returns the size value which is a near multiple of 64 and not 32.

image

1° => 63797472 = conform_size(0, 63797448, 64, 32);
2° => 63797504 = conform_size(0, 63797448, 64, 64);

I think there are two possible solutions. The quickest is to change the value of nearest_multiple from 32 to 64 Bytes. The other would be to obtain the actual_alignment during initialisation and adjust the value of properties.nearest_multiple.

@immortalsalomon immortalsalomon closed this as not planned Won't fix, can't repro, duplicate, stale Feb 9, 2024
@abadams
Copy link
Member

abadams commented Feb 9, 2024

Was closing this intentional? Seems like it's still active from the comments so far.

@derek-gerstmann
Copy link
Contributor

Ahh, thanks so much for the detailed report! I've spent some time working on a fix and adding more test coverage. For the device specific alignment and vkGetBufferMemoryRequirements, I agree we should be querying the value during initialization. I'll add that as well.

@derek-gerstmann
Copy link
Contributor

@immortalsalomon I created a new pull request #8087 which should resolve these issues. Please give this a try!

@immortalsalomon
Copy link
Author

immortalsalomon commented Feb 12, 2024

@immortalsalomon I created a new pull request #8087 which should resolve these issues. Please give this a try!

I tried the fixes and they work for the NVidia graphics card (16bytes alignment). For the intel graphics card (64bytes alignment) the fixes created another problem. below is the log:
image
image

@derek-gerstmann
Copy link
Contributor

Hmm, strange! I'll try and set thing up on my laptop with an integrated Intel GPU and see if I can diagnose what's going on. Which graphics card are you using?

@immortalsalomon
Copy link
Author

Hmm, strange! I'll try and set thing up on my laptop with an integrated Intel GPU and see if I can diagnose what's going on. Which graphics card are you using?

Sorry I was in a bit of a hurry yesterday. Unfortunately I don't have time at the moment to check the problem myself. I found a temporary solution by setting the value of the nearest_multiple parameter to 64Bytes.

The specifications of my computer:
CPU: 12th Gen Intel(R) Core(TM) i7-12850HX
RAM: 32GB
Integrated GPU: Intel(R) UHD Graphics
Dedicated GPU: NVIDIA RTX A2000 8GB Laptop GPU

Thank you for investigating. :)

@derek-gerstmann
Copy link
Contributor

@immortalsalomon Thanks for your patience. Please give #8130 a try when you have a chance!

@derek-gerstmann
Copy link
Contributor

Fixed in #8130 and improved in #8452

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants