-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dxvk] Periodically flush handles for long-running queries #3273
Conversation
Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does. While this is misuse of the D3D API and is definitely a bug in the app, DXVK ought to tolerate this situation at least as well as D3D itself apparently does. This change will try to free up Vulkan query handles if they're in the available state, tracking their totals on the DxvkGpuQuery object itself instead. As a bonus, this change should also reduce the CPU time consumed by repeated GetData(...) calls, since it avoids redundantly summing query handle results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for debugging that and coming up with a solution; I think the concept is good but I'll want to do some refactoring on the query code before addressing this.
Do you happen to have an apitrace or something to test the changes with? If not, I can hack a quick test app as well in a few minutes as well, would just be nice to see how bad the problem is in practice.
@@ -78,13 +73,44 @@ namespace dxvk { | |||
|
|||
|
|||
void DxvkGpuQuery::addQueryHandle(const DxvkGpuQueryHandle& handle) { | |||
// Some apps forget to End() their queries; instead of accumulating query | |||
// handles indefinitely, try to clean them up periodically. | |||
if (m_handles.size() >= 256) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to do this on every call? 256 queries as a threshold is a lot for a single query object, and games tend to have thousands of query objects. At the same time, we'll very rarely have more than one Vulkan query per D3D query in well-behaved games, so there won't really be any overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You guessed it in your edit: the reason is to avoid overhead. Checking a deque's size, comparing against a constant, and conditionally branching over a call is one or two orders of magnitude cheaper than a vkGetQueryPoolResults
which will likely only come back with VK_NOT_READY
if run on a query handle so recently allocated.
Yeah, 256 is a pretty big threshold! The vast, vast majority of D3D queries out there in the wild are used responsibly and don't come anywhere near that number, which I picked since I didn't want to spend time seeking out the not-as-well-behaved games which do use multiple Vulkan queries per D3D query to make sure this new codepath wasn't causing some regression somewhere. I chose 256 conservatively to be sure that this would only affect "leaked" D3D queries. (And, let's remember that the old threshold was actually infinity, and while games do tend to have thousands of query objects, I only observed about a dozen being leaked in this manner.)
If m_handles
is to become a small_vector
, it does make sense to make the threshold the same as its preallocated region, however.
|
||
mutable DxvkQueryData m_queryData; | ||
mutable DxvkGpuQueryHandle m_handle; | ||
mutable std::deque<DxvkGpuQueryHandle> m_handles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly dislike this mutable
spam; if methods modify internal state then they shouldn't be const
, especially since doing it like this hides potential thread safety concerns.
If anything, this highlights that the current implementation is bugged (in that accesses to m_ended
do not enforce memory order).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like mutable
either, though I wanted to avoid a pointer to an extra struct that would have just meant yet one more allocation.
Since DxvkGpuQuery::getData()
is part of the "public" (D3D layer facing) API, it seemed appropriate to keep it const
as it is indeed logically const. If that isn't The DXVK Way™, it's easy enough to make getData()
non-const.
I can replace m_ended
with an atomic if you'd like, though this is a separate change and I'd rebase this PR to put that commit deeper than this one.
m_handles.pop_front(); | ||
} | ||
|
||
if (m_handle.queryPool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably easier to turn this whole mess into a small_vector
. The main reason why there are two structures at all was to avoid allocations in the common case (1 query), but we have better ways to deal with this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern that I have here is that the small_vector
's equivalent to pop_front()
(erase(0)
) is O(n)
, which would make the while loop O(n²)
. I suppose this could be mitigated if small_vector::erase()
got a range overload, which could then be called once after the loop finishes; flushHandles()
would be back to O(n)
.
Either way, that's a separate change so I'd be rebasing the PR to add a commit if I do it.
Thank you for the expedient and thorough review!
The best I have is a RenderDoc capture, but I haven't found a way to get it to replay the But, the problem is pretty bad: RSS growth of ~130MiB/minute at 60FPS and a new query pool (and associated GPU buffer mmap) every two seconds or so. |
@CFSworks Just out of curiosity, is this the memory leak that only happens in the Halo 1 campaign? If yes, than it is pretty awesome that this got fixed in dxvk, because the memory leak even happens on Windows and hasn't been fixed there for months, even though the devs are aware of the issue. |
It sounds about right for the timeframe and everything. I was stuck at home sick for most of February and decided to play Halo 1's campaign to pass the time, but was plagued with OOMs after about 30 minutes of play (and progressively-worsening stuttering, probably from Linux's memory management trying to cope with the leak). I haven't tried the other campaigns or multiplayer, so I can't confirm whether my problem was unique only to Halo 1. I suppose another way of confirming that it's the same issue would be to run Halo 1 on Windows but with dxvk instead of D3D11, and then see if it resolves the leak?
Wow, I had no idea that this was affecting the Windows players too. I'm used to things like this being, "well the game itself is technically bugged because it's misusing the Windows API, but Windows itself seems to tolerate it so [dxvk/wine/whatever] should as well" so I never checked my (wrong) assumption that that would be the case here... Guess D3D11 itself doesn't tolerate this query misuse any better, which means dxvk is doing something better than the library it's trying to emulate! :) I'd thought about sending in a report to the devs, pointing out their exact misuse of the API and giving some tips to finding the actual culprit code. Perhaps I/we still should? Maybe the reason it's gone unfixed for so long is they don't know where to look? |
Yeah from your description it's pretty clear that we are indeed talking about the same memory leak, I don't think we have to do any further testing.
Wow, pretty impressive that you still were able to find the issue, even without the Halo source code available, while not even 343I was able to find it with the code available :D Reminds me of the time where this happened in GTA V.
I agree, contacting the devs would be a good idea. Do you want to go ahead? :) P.S. Stuff like this is why I enjoy the Linux gaming ecosystem. Sometimes it feels like a lot of workarounds, but then occasionally we have a gem like this one, where we are able to fix issues that seem unfixable even on Windows. |
Counterintuitively, I've identified more of these kinds of bugs in closed-source software than open-source. My guess is it's because they're actually pretty low-hanging fruit when you have the source and know what you need to look for (i.e. you are a user who is actively being bitten by the bug, not a developer who has only seen reports), so in the open-source world, these things get fixed pretty quickly -- long before I come along to encounter them. Even your GTA V example would fit the pattern: building the app with debug information and then using a profiler would have revealed But then in the closed-source world, you either need a developer to prioritize that issue enough to take the time to nail it down (hard when there's more issues than developers, harder still when the most skilled devs have even more on their plates, and the developers aren't necessarily also users/players so they might not have quite the same personal motivation to fix it)... or you need the tenacity and skillset of someone like t0st. :)
Well, I replied to the 343I employee on the Reddit thread you linked. My hope is this will generate a few upvotes and make this enough of a "squeaky wheel with an easy fix" that someone over there either finally nails it down or shoots me a message to ask for more details. I'd thought about also opening a support ticket but those teams usually don't/can't escalate something directly to engineering, no matter how technical the reporter is -- any other way you think I should be getting their attention besides just Reddit? |
Yeah, low hanging fruit are definitely easier to get fixed in free software.
For now this should be enough. If they still don't bother to reply in a few weeks, we can try to contact them in another way. Personally I am also very interested in how the hell you debugged this, debugging software without source code available always felt like impossible to me. So there is also the alternative strategy of writing up a blogpost for this, reach Hackernews front page, which will surely get the devs attention then. :D (Seriously though, don't feel obliged to write a blog post just because I said this) |
Looks like this might have indeed helped 343I fix the memory leak upstream: The latest patch fixes the problem:
|
Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does.
While this is misuse of the D3D API and is definitely a bug in the app, DXVK ought to tolerate this situation at least as well as D3D itself apparently does. This change will try to free up Vulkan query handles if they're in the available state, tracking their totals on the DxvkGpuQuery object itself instead.
As a bonus, this change should also reduce the CPU time consumed by repeated GetData(...) calls, since it avoids redundantly summing query handle results.