crashes on startup for PRIME enabled hardware thats configured to use the onboard nvidia not the intel #88

Closed
davidbuzz opened this issue Feb 22, 2024 · 15 comments · Fixed by #92
Labels
type: bug Something isn't working

Comments

@davidbuzz

davidbuzz commented Feb 22, 2024

It seems to specifically choose the incorrect (Intel) video card, even though the 'prime-select query' output says 'nvidia'.

Temporary workaround: switch your PRIME configuration to use the weaker 'intel' video card and reboot; then it won't crash.
Run: sudo prime-select intel
reboot

Edit: vulkaninfo

@davidbuzz
Author

Discussed with @kvark on Discord; #84 and #86 were unsuccessful attempts at improving/fixing this issue.

@kvark added the "type: bug" label on Feb 23, 2024
@kvark
Owner

kvark commented Feb 23, 2024

Based on the Discord discussion, even vkcube doesn't work on that setup when using the Intel GPU.
There are a lot of Nvidia-related issues (or instances of a single issue?): NVIDIA/open-gpu-kernel-modules#317, gfx-rs/wgpu#4775, NVIDIA/egl-wayland#72, and others.
Could you share the info about your driver version and X11/Wayland environment?

@davidbuzz
Author

davidbuzz commented Feb 23, 2024

Setup 1:
My default/preferred configuration is 'sudo prime-select nvidia', which results in 'vulkaninfo' showing 3 devices: the Nvidia GPU, the Intel GPU, and the Mesa software renderer (llvmpipe). In this configuration blade/Zed doesn't work, as it keeps incorrectly choosing the Intel hardware.
vkcube --gpu_number 0
[this is the Nvidia hardware] vkcube runs great, but blade and Zed refuse to use this device.
[because 'prime-select query' shows it's using nvidia]
vkcube --gpu_number 1
[this is the Intel hardware] vkcube crashes in this configuration, and blade and Zed crash in this configuration too.

Setup 2:
Doing 'sudo prime-select intel' and rebooting makes vkcube work, but it entirely ignores the Nvidia hardware at that point. 'vulkaninfo' now shows only 2 devices (the Nvidia hardware is no longer in the list, so it has the Intel and the llvmpipe GPUs).
vkcube --gpu_number 0
Selected GPU 0: Intel(R) UHD Graphics (CML GT2), type: IntegratedGpu
[device zero in this configuration is Intel, as the Nvidia device has gone away as a result of 'prime-select intel' above]
vkcube --gpu_number 1
Selected GPU 1: llvmpipe (LLVM 15.0.7, 256 bits), type: Cpu
[device 1 is the Mesa software renderer llvmpipe] and it works too

In this 'Setup 2', Zed and the blade 'bunnymark' example both work great, no issue here. But that's not how most people with a PRIME setup and Nvidia hardware are going to be using it.

Summary:
In both of these configurations blade/Zed appears to try to use the Intel hardware (Adapter "Intel(R) UHD Graphics (CML GT2)"), and obviously it shouldn't.

@davidbuzz
Author

davidbuzz commented Feb 24, 2024

$ uname -a
Linux buzzlap 6.5.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 14:17:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=23.10
DISTRIB_CODENAME=mantic
DISTRIB_DESCRIPTION="Ubuntu 23.10"

$ nvidia-smi

Sat Feb 24 14:20:53 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro T2000 wi...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   76C    P5     8W /  N/A |    796MiB /  3914MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4989      G   /usr/lib/xorg/Xorg                402MiB |
|    0   N/A  N/A      6567      G   /usr/bin/gnome-shell              155MiB |
|    0   N/A  N/A      6594      G   ...mviewer/tv_bin/TeamViewer        1MiB |
|    0   N/A  N/A      7288      G   ...RendererForSitePerProcess       34MiB |
|    0   N/A  N/A     11137      G   ...9/usr/lib/firefox/firefox      199MiB |
+-----------------------------------------------------------------------------+

@davidbuzz
Author

davidbuzz commented Feb 24, 2024

Note: updating the Nvidia driver to 550 didn't magically make it work, but it did change a few things. The number of devices reported in the 'vulkaninfo' output is now 4 for me (one Intel, two Nvidia, and the Mesa driver), and vkcube now seems to run no matter which of the 4 devices I choose.
'sudo prime-select nvidia' was also run after the driver change.
vkcube --gpu_number 0
vkcube --gpu_number 1
vkcube --gpu_number 2
vkcube --gpu_number 3
[ these all run... but blade crashes with a different validation error now]

@davidbuzz
Author

vkcube, without explicitly choosing a device, chooses the Nvidia hardware... let me see...

The ordering of the GPUs output by these two Vulkan commands is different: vulkaninfo reports GPU0 as Intel, while vkcube, when given a GPU number, reports that 'GPU0' is the Nvidia. So their "get a list of Vulkan devices" code is using a different ordering/numbering/indexing?

vulkaninfo --summary | egrep '(GPU|deviceName)'
GPU0:
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = Intel(R) UHD Graphics (CML GT2)
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = Quadro T2000 with Max-Q Design
GPU2:
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = Quadro T2000 with Max-Q Design
GPU3:
    deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
$ vkcube --gpu_number 0
Selected GPU 0: Quadro T2000 with Max-Q Design, type: DiscreteGpu

$ vkcube --gpu_number 1
Selected GPU 1: Quadro T2000 with Max-Q Design, type: DiscreteGpu

$ vkcube --gpu_number 2
Selected GPU 2: Intel(R) UHD Graphics (CML GT2), type: IntegratedGpu

$ vkcube --gpu_number 3
Selected GPU 3: llvmpipe (LLVM 15.0.7, 256 bits), type: Cpu

Is that important?
I think you want to do whatever vkcube is doing, not what vulkaninfo is doing.
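
To make the ordering question concrete, here is a minimal sketch of type-based adapter selection, written against plain data rather than a real Vulkan binding. The `Adapter` struct, the `rank` function, and the preference order (discrete, then integrated, then virtual, then CPU) are assumptions for illustration; this is not blade's code, and it is only a guess at the kind of logic that would make the default pick independent of enumeration order.

```rust
/// Device categories, mirroring the `type:` values vkcube prints.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DeviceType {
    DiscreteGpu,
    IntegratedGpu,
    VirtualGpu,
    Cpu,
    Other,
}

/// Hypothetical stand-in for one enumerated physical device.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Adapter {
    name: &'static str,
    device_type: DeviceType,
}

/// Lower rank means more preferred. The order is an assumption of this sketch.
fn rank(device_type: DeviceType) -> u8 {
    match device_type {
        DeviceType::DiscreteGpu => 0,
        DeviceType::IntegratedGpu => 1,
        DeviceType::VirtualGpu => 2,
        DeviceType::Cpu => 3,
        DeviceType::Other => 4,
    }
}

/// Returns the index of the preferred adapter.
/// `min_by_key` keeps the first minimum, so ties follow enumeration order.
fn pick_adapter(adapters: &[Adapter]) -> Option<usize> {
    adapters
        .iter()
        .enumerate()
        .min_by_key(|(_, a)| rank(a.device_type))
        .map(|(i, _)| i)
}

fn main() {
    // Same order as the `vulkaninfo --summary` output above.
    let adapters = [
        Adapter { name: "Intel(R) UHD Graphics (CML GT2)", device_type: DeviceType::IntegratedGpu },
        Adapter { name: "Quadro T2000 with Max-Q Design", device_type: DeviceType::DiscreteGpu },
        Adapter { name: "Quadro T2000 with Max-Q Design", device_type: DeviceType::DiscreteGpu },
        Adapter { name: "llvmpipe (LLVM 15.0.7, 256 bits)", device_type: DeviceType::Cpu },
    ];
    if let Some(i) = pick_adapter(&adapters) {
        println!("picked GPU{}: {}", i, adapters[i].name);
    }
}
```

With a ranking like this, the choice no longer depends on whether the Intel device happens to be listed first, which is the difference between the two tools' listings.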

@kvark
Owner

kvark commented Feb 27, 2024

Yes, we want to do what vkcube is doing, if possible.

> [ these all run... but blade crashes with a different validation error now]

Please post the exact error.

@davidbuzz
Author

command: blade]$ RUST_LOG=blade_graphics=debug RUST_BACKTRACE=1 cargo run --example bunnymark > buzz.550.bunnymark.validation2.error.txt 2>&1

output:

   Compiling blade-graphics v0.3.0 (/home/buzz/blade/blade-graphics)
   Compiling blade-render v0.2.0 (/home/buzz/blade/blade-render)
   Compiling blade-egui v0.2.0 (/home/buzz/blade/blade-egui)
   Compiling blade v0.2.0 (/home/buzz/blade)
    Finished dev [unoptimized + debuginfo] target(s) in 13.66s
     Running `target/debug/examples/bunnymark`
[2024-03-03T02:55:57Z INFO  blade_graphics::hal::init] Adapter "Intel(R) UHD Graphics (CML GT2)"
[2024-03-03T02:55:57Z INFO  blade_graphics::hal::init] No ray tracing extensions are supported
[2024-03-03T02:55:57Z DEBUG blade_graphics::hal::init] Adapter AdapterCapabilities {
        api_version: 4206847,
        properties: PhysicalDeviceProperties {
            api_version: 4206847,
            driver_version: 96477185,
            vendor_id: 32902,
            device_id: 39876,
            device_type: INTEGRATED_GPU,
            device_name: "Intel(R) UHD Graphics (CML GT2)",
            pipeline_cache_uuid: [
                160,
                145,
                135,
                32,
                32,
                75,
                120,
                124,
                207,
                186,
                129,
                7,
                1,
                126,
                156,
                91,
            ],
            limits: PhysicalDeviceLimits {
                max_image_dimension1_d: 16384,
                max_image_dimension2_d: 16384,
                max_image_dimension3_d: 2048,
                max_image_dimension_cube: 16384,
                max_image_array_layers: 2048,
                max_texel_buffer_elements: 134217728,
                max_uniform_buffer_range: 134217728,
                max_storage_buffer_range: 4294967295,
                max_push_constants_size: 128,
                max_memory_allocation_count: 4294967295,
                max_sampler_allocation_count: 65536,
                buffer_image_granularity: 1,
                sparse_address_space_size: 0,
                max_bound_descriptor_sets: 8,
                max_per_stage_descriptor_samplers: 65535,
                max_per_stage_descriptor_uniform_buffers: 64,
                max_per_stage_descriptor_storage_buffers: 65535,
                max_per_stage_descriptor_sampled_images: 65535,
                max_per_stage_descriptor_storage_images: 65535,
                max_per_stage_descriptor_input_attachments: 64,
                max_per_stage_resources: 4294967295,
                max_descriptor_set_samplers: 393210,
                max_descriptor_set_uniform_buffers: 384,
                max_descriptor_set_uniform_buffers_dynamic: 8,
                max_descriptor_set_storage_buffers: 393210,
                max_descriptor_set_storage_buffers_dynamic: 8,
                max_descriptor_set_sampled_images: 393210,
                max_descriptor_set_storage_images: 393210,
                max_descriptor_set_input_attachments: 256,
                max_vertex_input_attributes: 29,
                max_vertex_input_bindings: 31,
                max_vertex_input_attribute_offset: 2047,
                max_vertex_input_binding_stride: 4095,
                max_vertex_output_components: 128,
                max_tessellation_generation_level: 64,
                max_tessellation_patch_size: 32,
                max_tessellation_control_per_vertex_input_components: 128,
                max_tessellation_control_per_vertex_output_components: 128,
                max_tessellation_control_per_patch_output_components: 128,
                max_tessellation_control_total_output_components: 2048,
                max_tessellation_evaluation_input_components: 128,
                max_tessellation_evaluation_output_components: 128,
                max_geometry_shader_invocations: 32,
                max_geometry_input_components: 128,
                max_geometry_output_components: 128,
                max_geometry_output_vertices: 256,
                max_geometry_total_output_components: 1024,
                max_fragment_input_components: 116,
                max_fragment_output_attachments: 8,
                max_fragment_dual_src_attachments: 1,
                max_fragment_combined_output_resources: 131078,
                max_compute_shared_memory_size: 65536,
                max_compute_work_group_count: [
                    65535,
                    65535,
                    65535,
                ],
                max_compute_work_group_invocations: 1024,
                max_compute_work_group_size: [
                    1024,
                    1024,
                    1024,
                ],
                sub_pixel_precision_bits: 8,
                sub_texel_precision_bits: 8,
                mipmap_precision_bits: 8,
                max_draw_indexed_index_value: 4294967295,
                max_draw_indirect_count: 4294967295,
                max_sampler_lod_bias: 16.0,
                max_sampler_anisotropy: 16.0,
                max_viewports: 16,
                max_viewport_dimensions: [
                    16384,
                    16384,
                ],
                viewport_bounds_range: [
                    -32768.0,
                    32767.0,
                ],
                viewport_sub_pixel_bits: 13,
                min_memory_map_alignment: 4096,
                min_texel_buffer_offset_alignment: 16,
                min_uniform_buffer_offset_alignment: 64,
                min_storage_buffer_offset_alignment: 4,
                min_texel_offset: -8,
                max_texel_offset: 7,
                min_texel_gather_offset: -32,
                max_texel_gather_offset: 31,
                min_interpolation_offset: -0.5,
                max_interpolation_offset: 0.4375,
                sub_pixel_interpolation_offset_bits: 4,
                max_framebuffer_width: 16384,
                max_framebuffer_height: 16384,
                max_framebuffer_layers: 2048,
                framebuffer_color_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                framebuffer_depth_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                framebuffer_stencil_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                framebuffer_no_attachments_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                max_color_attachments: 8,
                sampled_image_color_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                sampled_image_integer_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                sampled_image_depth_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                sampled_image_stencil_sample_counts: TYPE_1 | TYPE_2 | TYPE_4 | TYPE_8 | TYPE_16,
                storage_image_sample_counts: TYPE_1,
                max_sample_mask_words: 1,
                timestamp_compute_and_graphics: 1,
                timestamp_period: 83.333336,
                max_clip_distances: 8,
                max_cull_distances: 8,
                max_combined_clip_and_cull_distances: 8,
                discrete_queue_priorities: 2,
                point_size_range: [
                    0.125,
                    255.875,
                ],
                line_width_range: [
                    0.0,
                    8.0,
                ],
                point_size_granularity: 0.125,
                line_width_granularity: 0.0078125,
                strict_lines: 0,
                standard_sample_locations: 1,
                optimal_buffer_copy_offset_alignment: 128,
                optimal_buffer_copy_row_pitch_alignment: 128,
                non_coherent_atom_size: 64,
            },
            sparse_properties: PhysicalDeviceSparseProperties {
                residency_standard2_d_block_shape: 0,
                residency_standard2_d_multisample_block_shape: 0,
                residency_standard3_d_block_shape: 0,
                residency_aligned_mip_size: 0,
                residency_non_resident_strict: 0,
            },
        },
        queue_family_index: 0,
        layered: false,
        ray_tracing: false,
        buffer_marker: false,
        shader_info: false,
    }
[2024-03-03T02:55:57Z INFO  blade_graphics::hal::resource] Creating texture 0x84c0580000000017 of size 1x1x1 and format Rgba8Unorm, name 'texutre', handle 0
[2024-03-03T02:55:57Z INFO  blade_graphics::hal::resource] Creating buffer 0x95a125000000001a of size 4, name 'staging', handle 1
[2024-03-03T02:55:58Z INFO  blade_graphics::hal::resource] Destroying buffer 0x95a125000000001a, handle 1
SYNC-HAZARD-WRITE-AFTER-WRITE(ERROR / SPEC): msgNum: 1544472022 - Validation Error: [ SYNC-HAZARD-WRITE-AFTER-WRITE ] Object 0: handle = 0xf443490000000006, type = VK_OBJECT_TYPE_IMAGE; | MessageID = 0x5c0ec5d6 | vkCmdPipelineBarrier():  Hazard WRITE_AFTER_WRITE for image barrier 0 VkImage 0xf443490000000006[]. Access info (usage: SYNC_IMAGE_LAYOUT_TRANSITION, prior_usage: SYNC_COLOR_ATTACHMENT_OUTPUT_COLOR_ATTACHMENT_WRITE, write_barriers: 0, command: vkCmdEndRenderingKHR, seq_no: 5, reset_no: 1).
    Objects: 1
        [0] 0xf443490000000006, type: 10, name: NULL
SYNC-HAZARD-WRITE-AFTER-WRITE(ERROR / SPEC): msgNum: 1544472022 - Validation Error: [ SYNC-HAZARD-WRITE-AFTER-WRITE ] Object 0: handle = 0xcb3ee80000000007, type = VK_OBJECT_TYPE_IMAGE; | MessageID = 0x5c0ec5d6 | vkCmdPipelineBarrier():  Hazard WRITE_AFTER_WRITE for image barrier 0 VkImage 0xcb3ee80000000007[]. Access info (usage: SYNC_IMAGE_LAYOUT_TRANSITION, prior_usage: SYNC_COLOR_ATTACHMENT_OUTPUT_COLOR_ATTACHMENT_WRITE, write_barriers: 0, command: vkCmdEndRenderingKHR, seq_no: 5, reset_no: 3).
    Objects: 1
        [0] 0xcb3ee80000000007, type: 10, name: NULL
SYNC-HAZARD-WRITE-AFTER-WRITE(ERROR / SPEC): msgNum: 1544472022 - Validation Error: [ SYNC-HAZARD-WRITE-AFTER-WRITE ] Object 0: handle = 0xead9370000000008, type = VK_OBJECT_TYPE_IMAGE; | MessageID = 0x5c0ec5d6 | vkCmdPipelineBarrier():  Hazard WRITE_AFTER_WRITE for image barrier 0 VkImage 0xead9370000000008[]. Access info (usage: SYNC_IMAGE_LAYOUT_TRANSITION, prior_usage: SYNC_COLOR_ATTACHMENT_OUTPUT_COLOR_ATTACHMENT_WRITE, write_barriers: 0, command: vkCmdEndRenderingKHR, seq_no: 5, reset_no: 3).
    Objects: 1
        [0] 0xead9370000000008, type: 10, name: NULL
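
The raw integers in the adapter dump above are easier to read once unpacked. The sketch below decodes them using the standard Vulkan version packing (major in bits 28..22, minor in bits 21..12, patch in bits 11..0). Applying the same packing to `driver_version` is an assumption: that field is vendor-defined, and it happens to decode sensibly here because the adapter is a Mesa driver. The function name `decode_version` is made up for this example.

```rust
/// Splits a packed Vulkan version number into (major, minor, patch).
/// The top three bits (the "variant") are ignored.
fn decode_version(v: u32) -> (u32, u32, u32) {
    ((v >> 22) & 0x7f, (v >> 12) & 0x3ff, v & 0xfff)
}

fn main() {
    // Values copied from the AdapterCapabilities debug output.
    let (major, minor, patch) = decode_version(4206847);
    println!("api_version    = {major}.{minor}.{patch}");

    // Assumption: the driver uses the same packing for driver_version.
    let (major, minor, patch) = decode_version(96477185);
    println!("driver_version = {major}.{minor}.{patch}");

    // PCI IDs are conventionally written in hex.
    println!("vendor_id      = {:#06x}", 32902u32);
    println!("device_id      = {:#06x}", 39876u32);
}
```

Decoded this way, `vendor_id` is 0x8086 (Intel's PCI vendor ID), which confirms from the numbers alone that the adapter blade picked is the integrated GPU.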

@kvark
Owner

kvark commented Mar 4, 2024

@flukejones

flukejones commented Mar 10, 2024

What is the exact situation here? Is it:

  1. Xorg is used and configured to use dgpu as primary?
  2. A laptop which has a MUX switch?
  3. Or something else? There is a rather large lack of information here.

I had a similar issue with wgpu (gfx-rs/wgpu#4110), which was solved by checking the Mesa version: if it is less than 21.2, it is disabled.

(I wrote and maintain https://gitlab.com/asus-linux/supergfxctl/, so I have a fairly decent understanding of the hardware level but not so much actual use)

@flukejones

@davidbuzz I need to know more about this. I noticed:

  1. You are using Xorg
  2. You say you "configured to use the onboard nvidia not the intel"

This to me implies that you are using xorg-dgpu mode, something that is pretty much a hack and not necessary these days. As a result, I think some incorrect assumptions have been made.

Can you please verify for me that under a Wayland session blade works perfectly fine without the blocking commit? Given that the Linux world is likely going to default to Wayland by the end of the year, if not by this quarter, the resulting kneecapping of everyone because of this one unique and not very well supported use case isn't justified.

@davidbuzz
Author

@flukejones ... it's a nice Dell laptop with both Intel graphics and Nvidia graphics. It's an integration that at the hardware level is called 'Nvidia Optimus', and the software/switcher/etc. is called 'Nvidia Prime'. [Google both of those for more.]
This is one of the available things you can do, i.e. "which video card do I want to use by default, for all apps I launch unless changed":
'sudo prime-select intel'
'sudo prime-select nvidia'

Since the Nvidia card is more powerful, this laptop is always on power, and it does some pretty busy stuff, I keep the Nvidia card active and in use all the time by running the second of those commands and just leaving it like that.

@flukejones

Right. So xorg configured to use nvidia as primary is the entire cause of the issue you had.

Any workaround needs to check just that one thing, not chop everything off at the knees for everybody else.

I suggest you give KDE 6 Wayland a try if you can; it works very, very well on hybrid setups.

@kvark
Owner

kvark commented Mar 12, 2024

@flukejones thanks for your input!
Looks like @davidbuzz's driver is much newer than 21.2 (see vulkaninfo in the issue description):

driverInfo = Mesa 23.2.1-1ubuntu3.1

So it would not make sense to try to port your wgpu PR here. Or at least, it wouldn't help this issue in particular.
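
For reference, the kind of Mesa version gate described above could look roughly like the following. This is a sketch, not the code from the wgpu PR: the function names are made up, and it assumes the `driverInfo` string always has the form "Mesa <major>.<minor>...", as in the value quoted here.

```rust
/// Parses the leading run of ASCII digits in `s`, if there is one.
fn leading_number(s: &str) -> Option<u32> {
    let end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    s[..end].parse().ok()
}

/// Extracts (major, minor) from a string such as "Mesa 23.2.1-1ubuntu3.1".
/// Returns None for anything that does not start with "Mesa ".
fn parse_mesa_version(driver_info: &str) -> Option<(u32, u32)> {
    let rest = driver_info.trim().strip_prefix("Mesa ")?;
    let mut parts = rest.split('.');
    let major = leading_number(parts.next()?)?;
    let minor = leading_number(parts.next()?)?;
    Some((major, minor))
}

/// True only for Mesa drivers older than 21.2; non-Mesa drivers are left alone.
fn is_old_mesa(driver_info: &str) -> bool {
    matches!(parse_mesa_version(driver_info), Some(v) if v < (21, 2))
}

fn main() {
    let info = "Mesa 23.2.1-1ubuntu3.1";
    println!("{:?} -> old: {}", parse_mesa_version(info), is_old_mesa(info));
}
```

Tuples compare lexicographically, so `(23, 2) < (21, 2)` is false and the gate would not trigger for the driver in this issue.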

> Right. So xorg configured to use nvidia as primary is the entire cause of the issue you had.
> Any work around needs to check just that one thing, not chop everything off at the knees for everybody else.

Any idea how to detect this specifically (i.e. without chopping everything off at the knees)?
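
One possible starting point, offered purely as a heuristic sketch and not as the fix that was adopted: treat the problematic case as "the session is X11 and `prime-select query` reports nvidia". Both signals are assumptions here. `XDG_SESSION_TYPE` is set by most login managers but not guaranteed, `prime-select` exists only on Ubuntu-style PRIME setups, and the function name is hypothetical.

```rust
use std::env;
use std::process::Command;

/// Heuristic: X11 session with the Nvidia GPU selected as the PRIME primary.
/// Takes its inputs as plain values so the decision itself is easy to test.
fn is_x11_nvidia_primary(session_type: Option<&str>, prime_query: Option<&str>) -> bool {
    session_type == Some("x11") && prime_query.map(str::trim) == Some("nvidia")
}

fn main() {
    // Unset or non-UTF-8 variable becomes None.
    let session = env::var("XDG_SESSION_TYPE").ok();

    // If prime-select is missing or fails, treat the answer as unknown.
    let prime = Command::new("prime-select")
        .arg("query")
        .output()
        .ok()
        .filter(|out| out.status.success())
        .and_then(|out| String::from_utf8(out.stdout).ok());

    let hit = is_x11_nvidia_primary(session.as_deref(), prime.as_deref());
    println!("x11 with nvidia as primary: {hit}");
}
```

Shelling out from a graphics library is heavy-handed, so a real check would more likely use information the Vulkan or windowing APIs already expose; the point of the sketch is only that the condition is narrow enough to leave Wayland and Intel-primary setups untouched.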

@flukejones

Let's keep new discussion on the linked issue :)


3 participants