Generic "parent device lost" error/crash after #6092 #6279

Open
ArthurBrussee opened this issue Sep 15, 2024 · 21 comments
Labels
api: vulkan Issues with Vulkan platform: windows Issues with integration with windows type: bug Something isn't working

Comments

@ArthurBrussee

Description
After updating an egui + burn app to wgpu master, I observed random crashes with a generic "validation failed: parent device lost" error. Nothing in particular seems to cause the crash; even just drawing the app while firing off empty submit() calls was enough to trigger it.

After a bisection it seems to come down to this specific commit: ce9c9b7

It doesn't look that suspicious but does touch some raw pointers so idk... I definitely can't tell what's wrong anyway.

I can try to work on a smaller repro than "draw an egui app while another thread fires off submit()" but maybe this already gives enough hints.

Thanks for having a look!

Platform
Vulkan + Windows

@Wumpf
Member

Wumpf commented Sep 15, 2024

I think that's related to

which is about to get fixed by

However, after the above PR this is still a validation error, which it almost certainly shouldn't be; see this comment thread: #6253 (comment)

Anyways, I'm not certain all of this is the case, and device lost itself shouldn't happen in the first place, so if you have more information about your system and an as-minimal-as-reasonably-possible repro, that would be great!

EDIT: Didn't pay enough attention to the fact that this is on wgpu-master only and dismissed the bisect result too quickly. That's quite curious, maybe some object lifetime got messed up 🤔. cc: @teoxoy
Put it on v23 milestone to ensure this gets a look before the next release

@Wumpf Wumpf added type: bug Something isn't working api: vulkan Issues with Vulkan platform: windows Issues with integration with windows labels Sep 15, 2024
@Wumpf Wumpf added this to the v23 milestone Sep 15, 2024
@ArthurBrussee
Author

Thanks for the quick reply! I don't understand the internals here, so it's quite possible this is related to the linked issues, but it's not at startup, so I think it could be different. I didn't do a good job describing some other symptoms:

  • the app runs for a while before crashing
  • this "while" seems random each run
  • the more work I add to the submit queue on the background thread the faster the crash happened.
  • call stack just points to the submit call

So it seems like something race-y perhaps. Will try a single threaded setup at some point.

Otherwise some more specs - 4070 gpu (on Optimus or whatever it's called now), haven't tried other gpus yet.

@ErichDonGubler ErichDonGubler changed the title Generice "parent device lost" error/crash after #6092 Generic "parent device lost" error/crash after #6092 Sep 16, 2024
@jimblandy
Member

Do we have steps to reproduce this?

@hakolao
Contributor

hakolao commented Oct 3, 2024

I'm getting this also, but no idea how to repro. But I too have another thread submitting the queue occasionally.

@hakolao
Contributor

hakolao commented Oct 3, 2024

Error in Queue::submit: Validation Error

Caused by:
  Parent device is lost

Is the exact error message, so it occurs on queue submit, not on surface configure. There's nothing on the stack trace, so dunno how to debug really. Sounds exactly like described above in @ArthurBrussee 's comment. Randomly at runtime.

wgpu 22.1.0
Windows + Vulkan + NVIDIA GeForce RTX 3080 Ti Laptop GPU

@ErichDonGubler
Member

Hmm, having a validation error attributed to a device loss sounds wrong. That sounds like it should be classified as an internal error, rather than a validation error. That probably doesn't really matter WRT the root cause for the OP, though.

We are aware of issues with multi-threaded command submission, and this is likely a symptom. Because this is likely due to raciness, it's hard to comment on how to reproduce it, though. 😅

@hakolao
Contributor

hakolao commented Oct 4, 2024

I managed to capture one crash of mine with debug logs.

[2024-10-04T09:46:15Z ERROR wgpu_hal::vulkan::instance]         objects: (type: DESCRIPTOR_SET, hndl: 0x9ddb900000000a30, name: bind_group)
[2024-10-04T09:46:15Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkFreeDescriptorSets-pDescriptorSets-00309 (0xbfce1114)]
        Validation Error: [ VUID-vkFreeDescriptorSets-pDescriptorSets-00309 ] Object 0: handle = 0x5a00570000000a2f, name = bind_group, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0xbfce1114 | vkFreeDescriptorSets(): pDescriptorSets[0] VkDescriptorSet 0x5a00570000000a2f[bind_group] is in use by VkCommandBuffer 0x16ce6764fc0[Chunks to screen]. The Vulkan spec states: All submitted commands that refer to any element of pDescriptorSets must have completed execution (https://vulkan.lunarg.com/doc/view/1.3.275.0/windows/1.3-extensions/vkspec.html#VUID-vkFreeDescriptorSets-pDescriptorSets-00309)
[2024-10-04T09:46:15Z ERROR wgpu_hal::vulkan::instance]         objects: (type: DESCRIPTOR_SET, hndl: 0x5a00570000000a2f, name: bind_group)
thread 'main' panicked at C:\Users\okko-\.cargo\registry\src\index.crates.io-6f17d22bba15001f\wgpu-22.1.0\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
  Parent device is lost


I'm not sure if this is the exact cause, but it could give some clues.

I've recently updated my wgpu, and I've also recently made some changes to how I create bind groups (not every frame for these particular ones). Some of those might cause this to pop up for me now.

@hakolao
Contributor

hakolao commented Oct 4, 2024

Still thinking that something must have changed, because none of these errors popped up before.

Pretty sure I'm doing some things wrong as well.

@hakolao
Contributor

hakolao commented Oct 4, 2024

#6323

This might be related.

I tried doing submits only from main thread, and this error still keeps happening. My app is so complex, that it's hard to extract a repro step :S

@hakolao
Contributor

hakolao commented Oct 5, 2024

This is a particularly frustrating one, because I don't even always get any validation errors. But every run of my game eventually reaches this error and panics. Wouldn't mind some pointers on how to approach a reproducible report, when the submit error stacktrace contains nothing.

@hakolao
Contributor

hakolao commented Oct 6, 2024

After seeing #6318 I hoped I could catch the error by testing using trunk. Just testing was way too much work... having to fork so many libs... but got a better error message:

thread 'main' panicked at C:\Users\okko-\Programming\wgpu\wgpu-core\src\resource.rs:740:73:
thread 'main' attempted to acquire a snatch lock recursively.
- Currently trying to acquire a write lock at C:\Users\okko-\Programming\wgpu\wgpu-core\src\resource.rs:740:73
   0: std::backtrace_rs::backtrace::dbghelp64::trace
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\..\..\backtrace\src\backtrace\dbghelp64.rs:91
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\..\..\backtrace\src\backtrace\mod.rs:66
   2: std::backtrace::Backtrace::create
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\backtrace.rs:331
   3: std::backtrace::Backtrace::capture
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\backtrace.rs:296
   4: wgpu_core::snatch::LockTrace::enter
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\snatch.rs:87
   5: wgpu_core::snatch::SnatchLock::write
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\snatch.rs:148
   6: wgpu_core::resource::Buffer::destroy
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\resource.rs:740
   7: wgpu_core::device::resource::Device::release_gpu_resources
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\resource.rs:3599
   8: wgpu_core::device::resource::Device::lose
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\resource.rs:3583
   9: wgpu_core::device::resource::Device::handle_hal_error
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\resource.rs:337
  10: wgpu_core::global::Global::queue_submit
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\queue.rs:1271
  11: wgpu::backend::wgpu_core::impl$3::queue_submit
             at C:\Users\okko-\Programming\wgpu\wgpu\src\backend\wgpu_core.rs:2075
  12: wgpu::context::impl$1::queue_submit<wgpu::backend::wgpu_core::ContextWgpuCore>
             at C:\Users\okko-\Programming\wgpu\wgpu\src\context.rs:2101
  13: wgpu::api::queue::Queue::submit
             at C:\Users\okko-\Programming\wgpu\wgpu\src\api\queue.rs:252
...
  41: main
  42: invoke_main
             at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
  43: __scrt_common_main_seh
             at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
  44: BaseThreadInitThunk
  45: RtlUserThreadStart

- Previously acquired a read lock at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\queue.rs:1043:55
   0: std::backtrace_rs::backtrace::dbghelp64::trace
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\..\..\backtrace\src\backtrace\dbghelp64.rs:91
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\..\..\backtrace\src\backtrace\mod.rs:66
   2: std::backtrace::Backtrace::create
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\backtrace.rs:331
   3: std::backtrace::Backtrace::capture
             at /rustc/2bd1e894efde3b6be857ad345914a3b1cea51def\library/std\src\backtrace.rs:296
   4: wgpu_core::snatch::LockTrace::enter
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\snatch.rs:87
   5: wgpu_core::global::Global::queue_submit
             at C:\Users\okko-\Programming\wgpu\wgpu-core\src\device\queue.rs:1043
   6: wgpu::backend::wgpu_core::impl$3::queue_submit
             at C:\Users\okko-\Programming\wgpu\wgpu\src\backend\wgpu_core.rs:2075
   7: wgpu::context::impl$1::queue_submit<wgpu::backend::wgpu_core::ContextWgpuCore>
             at C:\Users\okko-\Programming\wgpu\wgpu\src\context.rs:2101
   8: wgpu::api::queue::Queue::submit
             at C:\Users\okko-\Programming\wgpu\wgpu\src\api\queue.rs:252
...

I don't think this should be happening...

@ErichDonGubler
Member

ErichDonGubler commented Oct 7, 2024

@hakolao: That indicates that wgpu-core is incorrectly trying to acquire guards on a snatch lock in two layers of its call stack. If I were to hazard a guess just by eyeballing code in the call stack, it'd be the following lines:

  • On queue submission, the call to actually perform the submission in HAL has a read guard on the device's snatch lock:
    let snatch_guard = device.snatchable_lock.read();
  • When dropping resources (in this case, a Buffer) after losing the device due to the above HAL error, release_gpu_resources attempts to acquire a conflicting write guard:
    let raw = match self.raw.snatch(&mut device.snatchable_lock.write()) {

Assuming the above is correct, we have two problems:

  1. We're getting a HAL error back from queue submission. Not sure what the cause is yet, and we should probably figure out the next issue first.
  2. When we get a HAL error on queue submission, we crash because of a statically predictable conflict in locking.

This seems like a hazard that would apply to more places than just this one; it's not obvious that we should let go of the device's snatch lock before we call handle_hal_error, and I suspect we've made that same error in other places. CC @teoxoy, who worked on that error conversion and device loss layer.

Ugh!
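To make the locking hazard described above concrete, here is a minimal sketch. It uses `std::sync::RwLock` as a stand-in for wgpu-core's snatch lock (the real type is different and panics on recursive acquisition), and the helper `can_take_write_lock` is purely hypothetical. It only illustrates why a write lock cannot be taken while the same thread still holds a read guard.

```rust
use std::sync::RwLock;

/// Hypothetical helper: reports whether exclusive (write) access is
/// available right now. `try_write` never blocks, so the sketch cannot hang.
fn can_take_write_lock(lock: &RwLock<()>) -> bool {
    lock.try_write().is_ok()
}

fn main() {
    // Stand-in for `device.snatchable_lock`; NOT wgpu-core's real type.
    let snatchable_lock = RwLock::new(());

    // Queue submission holds a read guard across the HAL call.
    let snatch_guard = snatchable_lock.read().unwrap();

    // Error handling that wants a write guard at this point conflicts
    // with the read guard taken higher up the same call stack.
    let while_held = can_take_write_lock(&snatchable_lock);

    // Once the read guard is released, the write lock is available again.
    drop(snatch_guard);
    let after_drop = can_take_write_lock(&snatchable_lock);

    println!("write lock while read guard held: {while_held}");
    println!("write lock after read guard dropped: {after_drop}");
}
```

With a blocking `write()` in place of `try_write()`, the first check would either deadlock or panic, depending on the lock implementation.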

@ErichDonGubler
Member

I bet we can artificially reproduce (2) by forcibly returning a HAL error of some kind, instead of performing the HAL call as normal. That would let us focus on resolving it.

@hakolao
Contributor

hakolao commented Oct 7, 2024

So, I suppose this #6229 causes the lock acquisition problems.

Maybe handle_hal_error could forcibly take the lock. Or the device.lose could be deferred to when the snatch_guard is out of scope.
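A rough sketch of the second idea, deferring the device-loss handling until the read guard is out of scope, might look like the following. Everything here is an assumption for illustration: `Device`, `hal_submit`, `lose`, and the `fail` switch are simplified stand-ins, not wgpu-core's actual code, and `std::sync::RwLock` replaces the real snatch lock. The `fail` flag also shows how a HAL error could be injected artificially to exercise this path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum HalError {
    DeviceLost,
}

/// Simplified stand-in for the device; not wgpu-core's real struct.
struct Device {
    snatchable_lock: RwLock<()>,
    lost: AtomicBool,
}

impl Device {
    fn new() -> Self {
        Device { snatchable_lock: RwLock::new(()), lost: AtomicBool::new(false) }
    }

    /// Hypothetical HAL submit; `fail` forces an error for testing.
    fn hal_submit(&self, fail: bool) -> Result<(), HalError> {
        if fail { Err(HalError::DeviceLost) } else { Ok(()) }
    }

    /// Marks the device lost. Like the destroy path in the backtrace,
    /// it needs the write lock, so no read guard may be alive here.
    fn lose(&self) {
        let _write_guard = self
            .snatchable_lock
            .try_write()
            .expect("snatch lock is still held by the caller");
        self.lost.store(true, Ordering::SeqCst);
    }

    fn queue_submit(&self, fail: bool) -> Result<(), HalError> {
        // The read guard lives only inside this block.
        let result = {
            let _snatch_guard = self.snatchable_lock.read().unwrap();
            self.hal_submit(fail)
        };
        // The guard has been dropped, so taking the write lock is safe.
        if result.is_err() {
            self.lose();
        }
        result
    }
}

fn main() {
    let device = Device::new();
    assert_eq!(device.queue_submit(false), Ok(()));
    assert!(!device.lost.load(Ordering::SeqCst));

    // Inject a HAL error: the device is marked lost without a lock conflict.
    assert_eq!(device.queue_submit(true), Err(HalError::DeviceLost));
    assert!(device.lost.load(Ordering::SeqCst));
    println!("device loss handled after the read guard was released");
}
```

The design choice is to carry the `Result` out of the guarded scope and react to it afterwards, rather than handling the error at the call site where the guard is still live.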

I'm beginning to suspect that my issue could be driver induced. Because I can't find anything wrong, and no longer am seeing any validation errors either. Dunno though...

@hakolao
Contributor

hakolao commented Oct 8, 2024

@ErichDonGubler @teoxoy how likely is it that I am doing something wrong when getting device is lost? Or how likely is it that something is wrong on wgpu side. I read here that for others this error relates to barriers, synchronization or memory, but these are areas that wgpu should handle automatically.

Can't see any validation errors (fixed those) using InstanceFlags::DEBUG | InstanceFlags::VALIDATION.

I began getting these after starting to reuse already created bind groups (chunks that I render, and texture atlas). Nothing else comes to mind that would have changed. And because I've tested that this issue happens already as far back as wgpu 0.19 (didn't bother to go further back), I kinda want to suspect that it's my bug. I also updated my gpu drivers.

I'll keep debugging, but this has been a difficult problem to investigate.

@teoxoy
Member

teoxoy commented Oct 8, 2024

You could try turning on these features:

wgpu/wgpu-hal/Cargo.toml

Lines 96 to 107 in c0fa1bc

# Panic when running into an out-of-memory error (for debugging purposes).
#
# Only affects the d3d12 and vulkan backends.
oom_panic = []
# Panic when running into a device lost error (for debugging purposes).
# Only affects the d3d12 and vulkan backends.
device_lost_panic = []
# Panic when running into an internal error other than out-of-memory and device lost
# (for debugging purposes).
#
# Only affects the d3d12 and vulkan backends.
internal_error_panic = []

and looking at the callstack to pinpoint the vulkan call that's returning the lost error.

@hakolao
Contributor

hakolao commented Oct 8, 2024

Thanks, I'll try those.

I realized I do still have some remaining warnings that I had missed from vulkan validation, so gonna check those also.

@teoxoy
Member

teoxoy commented Oct 8, 2024

Could you share the vulkan validation errors? wgpu's validation should in principle catch issues earlier.

@hakolao
Contributor

hakolao commented Oct 8, 2024

VUID-VkSwapchainCreateInfoKHR-pNext-07781(ERROR / SPEC): msgNum: 1284057537 - Validation Error: [ VUID-VkSwapchainCreateInfoKHR-pNext-07781 ] | MessageID = 0x4c8929c1 | vkCreateSwapchainKHR(): pCreateInfo->imageExtent (width = 1904, height = 984), which is outside the bounds returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR(): currentExtent = (width = 1920, height = 1080), minImageExtent = (width = 1920, height = 1080), maxImageExtent = (width = 1920, height = 1080). The Vulkan spec states: If a VkSwapchainPresentScalingCreateInfoEXT structure was not included in the pNext chain, or it is included and VkSwapchainPresentScalingCreateInfoEXT::scalingBehavior is zero then imageExtent must be between minImageExtent and maxImageExtent, inclusive, where minImageExtent and maxImageExtent are members of the VkSurfaceCapabilitiesKHR structure returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR for the surface (https://vulkan.lunarg.com/doc/view/1.3.290.0/windows/1.3-extensions/vkspec.html#VUID-VkSwapchainCreateInfoKHR-pNext-07781)
    Objects: 0

        [0] 0x2b86eebcac0, type: 6, name: Death Sprites to Sim Input
VUID-vkCmdDispatch-None-08114(ERROR / SPEC): msgNum: 817291879 - Validation Error: [ VUID-vkCmdDispatch-None-08114 ] Object 0: handle = 0xbb041c00000002f8, name = Sim Sand Bind Group, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x30b6e267 | vkCmdDispatch():  the descriptor VkDescriptorSet 0xbb041c00000002f8[Sim Sand Bind Group] [Set 0, Binding 22, Index 1, variable "sand_texture_sampler"] is being used in dispatch but has never been updated via vkUpdateDescriptorSets() or a similar call. The Vulkan spec states: Descriptors in each bound descriptor set, specified via vkCmdBindDescriptorSets, must be valid as described by descriptor validity if they are statically used by the VkPipeline bound to the pipeline bind point used by this command and the bound VkPipeline was not created with VK_PIPELINE_CREATE_DESCRIPTOR_BUFFER_BIT_EXT (https://vulkan.lunarg.com/doc/view/1.3.290.0/windows/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-08114)
    Objects: 1


        [0] 0x2b856333480, type: 6, name: Render Commands
BestPractices-PipelineBarrier-readToReadBarrier(WARN / PERF): msgNum: 49690623 - Validation Performance Warning: [ BestPractices-PipelineBarrier-readToReadBarrier ] Object 0: handle = 0x2b8564a58d0, name = Render Commands, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x2f637ff | vkCmdPipelineBarrier():  [AMD] [NVIDIA] Don't issue read-to-read barriers. Get the resource in the right state the first time you use it.
    Objects: 1

@hakolao
Contributor

hakolao commented Oct 8, 2024

This was an easy fix, and probably an easy one for you to validate as well.

        [0] 0x2b86eebcac0, type: 6, name: Death Sprites to Sim Input
VUID-vkCmdDispatch-None-08114(ERROR / SPEC): msgNum: 817291879 - Validation Error: [ VUID-vkCmdDispatch-None-08114 ] Object 0: handle = 0xbb041c00000002f8, name = Sim Sand Bind Group, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x30b6e267 | vkCmdDispatch():  the descriptor VkDescriptorSet 0xbb041c00000002f8[Sim Sand Bind Group] [Set 0, Binding 22, Index 1, variable "sand_texture_sampler"] is being used in dispatch but has never been updated via vkUpdateDescriptorSets() or a similar call. The Vulkan spec states: Descriptors in each bound descriptor set, specified via vkCmdBindDescriptorSets, must be valid as described by descriptor validity if they are statically used by the VkPipeline bound to the pipeline bind point used by this command and the bound VkPipeline was not created with VK_PIPELINE_CREATE_DESCRIPTOR_BUFFER_BIT_EXT (https://vulkan.lunarg.com/doc/view/1.3.290.0/windows/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-08114)
    Objects: 1

I was passing a single sampler to bind group, but had set count to 256 (that's how many textures I've got in an array). I had made a refactor to reuse samplers (instead of having one per image... silly, right...). But forgot to remove the count from the BindGroupLayoutDescriptor.

I could imagine something like this could potentially crash...
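As an illustration of the kind of check that would catch this mismatch before it reaches the driver, here is a small sketch. The types `LayoutEntry`, `BoundResources`, and `BindingError` and the function `validate` are hypothetical simplifications, not wgpu's real API or its actual validation logic; the only thing taken from the report is the shape of the mistake (a layout `count` of 256 on binding 22 with a single sampler supplied).

```rust
/// Hypothetical, simplified stand-ins; these are not wgpu's real types.
#[derive(Debug, PartialEq)]
enum BindingError {
    CountMismatch { binding: u32, expected: u32, provided: u32 },
}

struct LayoutEntry {
    binding: u32,
    /// `None` means a single resource; `Some(n)` means an array of `n`.
    count: Option<u32>,
}

struct BoundResources {
    binding: u32,
    /// How many resources the bind group actually supplies for this binding.
    provided: u32,
}

/// Checks that every layout entry receives exactly as many resources
/// as its declared count. A missing binding counts as zero resources.
fn validate(layout: &[LayoutEntry], resources: &[BoundResources]) -> Result<(), BindingError> {
    for entry in layout {
        let expected = entry.count.unwrap_or(1);
        let provided = resources
            .iter()
            .find(|r| r.binding == entry.binding)
            .map_or(0, |r| r.provided);
        if provided != expected {
            return Err(BindingError::CountMismatch {
                binding: entry.binding,
                expected,
                provided,
            });
        }
    }
    Ok(())
}

fn main() {
    // The situation from the report: count left at 256, one sampler bound.
    let layout = [LayoutEntry { binding: 22, count: Some(256) }];
    let resources = [BoundResources { binding: 22, provided: 1 }];
    println!("{:?}", validate(&layout, &resources));

    // After removing the stale count, the same bind group passes.
    let fixed_layout = [LayoutEntry { binding: 22, count: None }];
    println!("{:?}", validate(&fixed_layout, &resources));
}
```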

Now I'm only seeing an occasional resize validation error, but the other stuff is just warnings / perf things. Some of which are too much effort for no gain. No device lost for a while... never mind that... :S

@ErichDonGubler
Member

From our maintainer meeting agenda today, we decided that this issue does not need to block the v23 release. We need to eventually fix validation to catch this sort of issue, but it doesn't prevent programs that are correct from running.

@ErichDonGubler ErichDonGubler removed this from the v23 milestone Oct 16, 2024

6 participants