Metal error: Execution of the command buffer was aborted... #602
Comments
I suspect this is caused by #591. Prior to this error, are you seeing command buffers time out? You're not waiting for a semaphore before signaling it, are you? |
FWIW...I am seeing this also occur regularly on an older device (MacBook 2014 NV GPU)...but not on a later device (MacBook Pro 2017 AMD GPU). |
@cdavis5e: I do not see any command buffer timeouts, nor any other Vulkan errors, when this error is reported. When we submit a command buffer using vkQueueSubmit we also add a semaphore to wait on, so I think we are doing everything correctly. What do I need to modify in MoltenVK to restore the old behaviour, so I can verify whether the new method you suggested in #591 is the cause? I can now trigger this crash reproducibly, even from Xcode, when setting prefillMetalCommandBuffers = false. I am seeing this on a MacBook Pro 2017 with a Radeon Pro 560. |
Just a quick note: When I add vkQueueWaitIdle per frame or vkDeviceWaitIdle everything is fine, of course at a much reduced frame rate. So it's either something wrong on our side or in the latest MoltenVK implementation. |
Set For iOS:
For macOS:
|
I mean, are you seeing any Metal errors being reported by the command buffer? You can add some diagnostic code, like in this diff, to do that. That particular error you cited occurs when command buffers fail one too many times, so the system revokes access to the device. |
No, I do not see any Metal errors being reported. I even added an output each time the completion handler is called to see the function is properly called, but no error is received there. The first output I get is an error like this:
When I set
our app works just fine and I no longer observe the issues I described. |
This is what I was looking for. What's happening I suspect is that the semaphore wait value is being incremented before the semaphore signal is scheduled. This means that the wait for the semaphore never wakes up, leading to a timeout. When this happens too many times, Metal revokes access to the device. I need to figure out why this is happening, because I've been seeing it myself. |
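To make the suspected failure mode concrete, here is a toy, single-threaded model of a counter-based event. It is only a sketch under stated assumptions: `ToyEvent`, `ToySemaphore`, `pairedOrder` and `mismatchedOrder` are hypothetical names invented for illustration, and this is not MoltenVK or Metal code. It shows that when the wait side ends up targeting a value that no signal ever uses, the wait can never be satisfied, which in a real driver would surface as a timeout.

```cpp
#include <cstdint>

// Toy model of a counter-based event (hypothetical; not MoltenVK or Metal code).
// A wait on value V is satisfied once the event has been signaled with a value >= V.
struct ToyEvent {
    uint64_t signaledValue = 0;

    void signal(uint64_t value) {
        if (value > signaledValue) signaledValue = value;
    }

    bool waitSatisfied(uint64_t value) const {
        return signaledValue >= value;
    }
};

// Toy semaphore that hands out event values from a shared counter.
struct ToySemaphore {
    ToyEvent event;
    uint64_t counter = 0;
};

// Signal and wait agree on one value, so the wait is satisfied.
bool pairedOrder() {
    ToySemaphore sem;
    uint64_t value = ++sem.counter;   // one value chosen for this signal/wait pair
    sem.event.signal(value);
    return sem.event.waitSatisfied(value);
}

// The counter is bumped again before the signal is issued, so the wait
// targets a value that is never signaled and would eventually time out.
bool mismatchedOrder() {
    ToySemaphore sem;
    uint64_t signalValue = ++sem.counter;  // value the signal will use (1)
    uint64_t waitValue = ++sem.counter;    // wait side increments again (2)
    sem.event.signal(signalValue);
    return sem.event.waitSatisfied(waitValue);
}
```

In this model `pairedOrder()` returns true and `mismatchedOrder()` returns false. A real implementation would also have to deal with multiple threads and with the point at which the command buffer is actually scheduled, which this sketch deliberately ignores.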
I think I know why this is happening. Does enabling |
As mentioned in my previous posting above, the parameter I can run this test in more detail tomorrow, but I did indeed observe that setting But let me test this more tomorrow. |
Are you sure you're not confusing it with |
Ok, I just ran my tests again. With
I can run our App from Xcode and I do not observe any crashes. When running the App from outside Xcode I still get crashes, although the App runs a little longer before crashing. When setting
our App crashes quickly after the App starts and if running from Xcode. So to summarize, |
Any update on this issue? |
As mentioned in #625 this issue is also reproducible for me in dota. If I run dota.sh -vulkan and the initial background image is drawn before the main menu, I almost always see the same command buffer errors as the original reporter and then fence timeouts. You may be able to use public dota as a place to repro this by dropping libMoltenVK.dylib in its game/bin/osx64 folder. For now, I'm going to disable _metalFeatures.events in a local hack. |
If possible, I recommend disabling _metalFeatures.events by default until the issue is resolved. The error shows up on our side on macOS as well as iOS. |
I'm probably going to take a stab at fixing this. But in case I fail to get something by the SDK release, feel free to add an environment variable to disable this by default. |
PR #633 provides an automatic workaround for this issue by providing the environment variable The use of Metal events is therefore disabled by default, unless the |
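As a rough illustration of gating a feature behind an environment variable, the sketch below reads a boolean flag with `std::getenv`. The helper name `envFlagEnabled` is hypothetical, and the parsing rule (any non-zero integer means enabled) is an assumption, not a description of how MoltenVK parses its settings. The variable name used in the usage note, `MVK_ALLOW_METAL_EVENTS`, is the one mentioned later in this thread; that it is the variable introduced by PR #633 is also an assumption here.

```cpp
#include <cstdlib>

// Hypothetical helper: returns true when the named environment variable is
// set to a non-zero integer, and the supplied default when it is unset or empty.
// The parsing rule is an assumption made for this sketch.
bool envFlagEnabled(const char* name, bool defaultValue) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') {
        return defaultValue;
    }
    return std::strtol(raw, nullptr, 10) != 0;
}
```

A caller could then write `bool useEvents = envFlagEnabled("MVK_ALLOW_METAL_EVENTS", false);` so that the feature stays off unless it is explicitly requested.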
I think I've figured out what's wrong. For real, this time. At least in the cases I've observed, this seems to be some sort of weird race between Note that This seems like an important case to handle. The whole point is that the submission shouldn't execute until the RT image has been presented and thus becomes available for rendering. I'll come up with a patch. *In reality, the submission times out and Metal returns an error. When this happens enough, you lose the device. |
Another problem is that there is a case where the semaphore's |
Thank you for your update on this issue. We would be happy to assist you here with testing. |
Hi there. I am using MoltenVK on a large 3D iOS project. When I have _metalFeatures.events = true; |
@danginsburg, @aerofly, @tasku12 Can you retest enabling MoltenVK now logs whether or not it is using There has been a fair bit of work in the past couple of months that has impacted swap chain sync code...and in PR #724 I have streamlined the swap chain sync code. MoltenVK passes exactly the same set of 3,200 CTS semaphore tests with or without using native MTLEvents in the semaphores. Unfortunately, CTS does not exercise swap chain surface presentation very much, so there still may be room for issues that CTS is not catching. @danginsburg I've run against Dota2, but can't see any issues flipping back and forth between using and not using native |
@billhollings |
I wonder if it would help to delay incrementing the semaphore counter until the command buffer is scheduled. |
@cdavis5e I can run this test if you let me know what to modify in the code. @billhollings I ran the above tests on another Mac with an AMD Radeon Pro 580 GPU and 8 GB of RAM. The results are different here and I do observe more 'hangs'. With this setting:
It also seems very unpredictable. This looks a little like a synchronisation issue somewhere, but I do not have enough insight into MoltenVK to check where this might happen. |
The smaller values of Interesting that this metering causes the problem to go away. It would be interesting to see if further reducing the value of This, and @cdavis5e's comment:
lead me to wonder if...depending on the threading model of the app...the All of which lead me to review whether PR #728 adds the The following log entry will appear if that env var is successfully enabled:
@aerofly Can you give it a spin, and report back, please? |
@billhollings Thank you for working on this issue. I tried your new code with all 3 variations of how synchronisation is performed, i.e. MTLEvent, MTLFence and software emulation. Using the mvk-Info I verified that the proper synchronisation was indeed used. I also tried on iOS and Mac OS. My observations are still extremely strange, and crashes and halts happen in various test scenarios. To summarise: Depending on what synchronisation I use (event, fence or emulation), whether I use synchronousQueueSubmit, and how I set maxActiveMetalCommandBuffersPerQueue, I get mixed results, but I still observe crashes and halts on both iOS and Mac OS. Crashes and halts also appear when MTLFence is used. So before I do further extensive testing, I would like to discuss one thought: Could it be that the actual issues I observe (and possibly others as well) are not really related to how you do synchronisation for command buffer submission, but are rather a bug (either on my side or in MoltenVK) in the synchronization between command buffers? Let me quickly outline what our app is doing; I am sure other developers use similar steps. This holds for Mac OS and iOS, since our app runs almost the same code on both platforms. So a frame usually works like this:
If you think this has nothing to do with it, I will proceed with testing and report back the findings. If you think this might be related to the issues, I will try to do tests and enable/disable some rendering steps to see what happens then. We pass Vulkan validation on Windows and Android so far. I am not saying it's not our fault here, but we do not observe issues on other platforms, so I have no idea where to look in our code for issues. |
This patch will do it. |
@cdavis5e Thank you for this small patch. My report is below. @billhollings Some of the 'hangs' and crashes I observed in my previous postings were caused by a real driver issue that happens only on AMD Radeon 570/580 GPUs. I found this rather by accident, because this bug exists on Windows as well. It was confirmed by AMD and caused the driver to crash the whole system. The bug happens in certain situations when using triangle strips with uint16 indices and the primitive restart index. If I change the code so that it does not use the primitive restart index, a lot of 'hangs' and 'crashes' are gone. So with the fix I just described and when using
Setting Setting As for cdavis5e's code change, I did the tests with and without his change, and the results were pretty much the same. However, on iOS I had a more stable frame rate with his change than with the previous version. Maybe others can try his change and see what they observe. To summarise: With the latest MoltenVK version and the config options set as above, I can now confirm our app runs fine on iOS as well as Mac OS. I just would like to point out that |
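For readers who want to try the primitive restart workaround mentioned in the comment above, one possible approach is to expand each indexed triangle strip into a plain triangle list on the CPU, so the restart index never reaches the driver. The sketch below is an assumption about how this could be done, not necessarily what was done in the app discussed here; `stripToTriangleList` and `kRestartIndex` are hypothetical names. It uses 0xFFFF, the restart value for 16-bit indices, and flips every second triangle to preserve winding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Restart marker for 16-bit index buffers.
constexpr uint16_t kRestartIndex = 0xFFFF;

// Hypothetical helper: expands a uint16 triangle strip that uses the primitive
// restart index into an equivalent triangle list without restart markers.
std::vector<uint16_t> stripToTriangleList(const std::vector<uint16_t>& strip) {
    std::vector<uint16_t> triangles;
    std::size_t runStart = 0;

    for (std::size_t i = 0; i <= strip.size(); ++i) {
        const bool atEnd = (i == strip.size());
        if (!atEnd && strip[i] != kRestartIndex) {
            continue;  // still inside the current strip run
        }

        // Emit one triangle per vertex triple in the run [runStart, i).
        for (std::size_t j = runStart; j + 2 < i; ++j) {
            const uint16_t a = strip[j];
            const uint16_t b = strip[j + 1];
            const uint16_t c = strip[j + 2];
            if (a == b || b == c || a == c) {
                continue;  // skip degenerate triangles
            }
            if ((j - runStart) % 2 == 0) {
                triangles.insert(triangles.end(), {a, b, c});
            } else {
                triangles.insert(triangles.end(), {a, c, b});  // keep winding consistent
            }
        }
        runStart = i + 1;  // next run begins after the restart marker
    }
    return triangles;
}
```

The resulting list would be drawn with a triangle-list topology and primitive restart disabled. The trade-off is a larger index buffer, roughly three indices per triangle instead of one.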
We need to figure out |
The other reason I didn't use |
@cdavis5e I understand the need for I can of course run those tests again if something changes in MoltenVK. I think getting this part of MoltenVK into a stable form is a top priority. |
Your point about
Yeah...Apple is typically obscure on explaining fences. The docs define
And the old Metal Programming Guide provides some practical examples. See Listing 13-4 on that page for an example that spans two
Interesting. I'm glad you can work around this by disabling at little cost. But I'd like to understand more about this so we can fix it. What do you mean by "hangs"? Does either of Or do you mean that the app stalls in some other way? |
@billhollings After I circumvented the real driver bug I mentioned before, a 'hang', or better a stall, is caused if the command buffer is aborted. When using Let me run a few tests to see where it actually stalls, but with But let me run more tests to give you a more qualified answer. I will add some log output around the wait and signal statements inside MoltenVK to see where this is the case. |
Unfortunately I cannot easily switch to the newest version of Molten, because I link it statically in my project and have a very old C++ memory manager that globally overrides new and delete. This forces me to replace all the STL containers used in Molten with my own versions that use custom allocators, so that I do not end up with mismatched new and delete calls. Regarding MVK_ALLOW_METAL_EVENTS, I see a really good framerate compared with it being off. However, I am investigating visual artifacts appearing in my game when Metal events are enabled. If metal events are disabled or if I remove the semaphore waits I am seeing visual artifacts in my game. I am not sure if this has to do with improper synchronization in my rendering code or if this is an error with synchronization due to Metal events being on. In the next month or so I might have the bandwidth to get the latest Molten and try with the new fences. If the Molten team has time in the future, it would be beneficial to have an official way to provide a custom allocator for all the STL containers used by the library. |
Were you able to better understand how and where the hangs are occurring when enabling As per my note in the related issue...I'd like to enable |
Sorry, I have not yet performed any tests with By the way we have some Apps live in the AppStore for iOS and MacOS that use the latest MoltenVK and |
That is great news! However...I'm confused as to how this aligns with your earlier comments above:
and
I had taken these to indicate that you were experiencing hangs with Perhaps I'm misunderstanding the situation. Can you clarify, please? The default setting for |
Sorry, my previous posting was a typo, I think there are too many combinations one can test :) Anyway, here is the correct information:
If I set
'Random hangs' means that our App no longer renders after a few seconds. But this is erratic: the hangs occur in various cases, sometimes early, sometimes later. Sometimes our App even runs fine for 1 minute or more. But let me run the tests again to give you more insight into where the 'hangs' occur. Since our renderer is not CPU limited, there is no issue for us to set |
PR #760 now enables |
With one of the last updates to MoltenVK we now observe random fatal crashes with our Apps on Mac OS 10.14, e.g. we get the following error:
Execution of the command buffer was aborted due to an error during execution. Ignored (for causing prior/excessive GPU errors) (IOAF code 4)
This error is new and shows up since one of the last updates that were made to MoltenVK. When using a previous version of MoltenVK ( around April 2019 ), our Apps are working fine again. Vulkan functions themselves do not return any error at all, all we get is the above error numerous times.
It also makes a big difference whether we run our App directly from Xcode or from outside. It also makes a difference if we use
With both options set to 'true' our App runs a little more stable but eventually also crashes.
When running directly from Xcode we actually do NOT observe the crash.
Any idea what's causing this? What was the breaking change that would cause such a crash even though our code base didn't change?