Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] MSL: mesh shader initial #2074

Closed
wants to merge 10 commits into from
Closed

Conversation

Try
Copy link
Contributor

@Try Try commented Dec 18, 2022

This is more initial draft to start conversation on how-to.
Mainly need to know opinion on copy-pass and SPIRVCrossDecorationInterfaceMemberIndex.

Workflow:

When cross-compiling, new types gets synthesized:

  1. spvPerVertex all regular varyings + gl_Position
  2. spvPerPrimitive - varyings marked as perprimitiveEXT
  3. using spvMesh_t = mesh<spvPerVertex, spvPerPrimitive, ... >;

gl_PrimitiveTriangleIndicesEXT becomes spvMesh. Affects handling OpSetMeshOutputsEXT.

OpStore to index buffer remapped as 3 calls, in case of triangles, to spvMesh.set_index
There is ugly hack to implement so - see code.

gl_MeshVerticesEXT and varyings are represented as shared memory arrays. One shader is done - they get packed into single struct. This is to work-around of SPIRV-vs-MSL api-differences.
In theory, if shader writes only with [gl_LocalInvocationIndex] array can be removed.

Concerns

Had to change meaning SPIRVCrossDecorationInterfaceMemberIndex. Current implementation doesn't track array-elements - they all have same ID. On vertex/fragment seems to work fine, don't have any tesselation/geometry test-cases.
I'm not sure how to go about it: current workflow with lambdas isn't suitable for

Bugs/TODO:

Haven't tested gl_ClipDistance/gl_CullDistance - dunno how exactly they should work in spirv-cross.

Task(Object) shader in metal is different in metal:
In GLSL EmitMeshTasksEXT is terminator - calling it suppose to halt shader execution.
set_threadgroups_per_grid - doesn't stop shader execution.

Performance

Apple M1 testing(in OpenGothic) shows ~2x fps regression. Can be because shared memory, but don't know for sure.
Asked about this on apple forum: https://developer.apple.com/forums/thread/722047

Apple M3 seem to work as fast as expected

spirv_msl.cpp Outdated
{
auto &execution = get_entry_point();
auto str = to_expression(lhs_expression);
str = str.substr(str.find_first_of('[')+1, str.find_last_of(']') - str.find_first_of('[') - 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code look quite bad, but I don't see better way, atm.
Need some way to access OpAccessChain parameters, or some way to emit shorten chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, small bump to this.
My question basically: is there any way to reach original OpAccessChain with all arguments?
Needed here and for DX mesh-shader as well.
Thanks!

@Try
Copy link
Contributor Author

Try commented Dec 19, 2022

Tested today compilation variant, when instead of shared memory array local variable been used. This happen to be valid for my shaders, since there is 1-to-1 match between gl_LocalInvocationID and vertex.
Shader example: https://shader-playground.timjones.io/641b24c9f6700a03eb9f69414ebbf22b
Still FPS roughly as bad as it was, so probably metal3 mesh implementation is just bad :(

@rcaridade145
Copy link

rcaridade145 commented Dec 21, 2022

@Try Can this be due to the TBDR arch ?
There is a discussion on that here https://forum.beyond3d.com/threads/apple-powervr-tbdr-gpu-architecture-speculation-thread.61873/page-5#post-2171147

I'll leave there some other links i came across reading about this issue:
https://tellusim.com/mesh-shader-emulation/

Unreal Engine - https://blog.imaginationtech.com/powervr-performance-tips-for-unreal-engine-4/ - when developing focusing on PowerVR recomends disabling Early-Z due to how PowerVR handles geometry. Being Apple based on PowerVR i hope some of this helps you.

@Try
Copy link
Contributor Author

Try commented Dec 21, 2022

@rcaridade145

Can this be due to the TBDR arch ?

That depends a lot on what kind of TBDR is it. According to https://blog.imaginationtech.com/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/ PowerVR has fat tiler, with native support for vertex shader.
In theory mesh should work very well for them: just take meshlet and write it to polygon list. Naturally even one small implementation detail can ruin performance.

So, I think best way is to ask them.

disabling Early-Z

Not related. In any case, when renderer is same and only difference is vertex-vs-mesh, then performance should be roughly same.

@HansKristian-Work
Copy link
Contributor

Back from holiday, just ACK-ing that I've seen it and I'll look at it when I have time.

[[mesh]] void main0(uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]], uint3 gl_GlobalInvocationID [[thread_position_in_grid]], spvMesh_t spvMesh)
{
threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2];
_4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID);
Copy link
Contributor

@HansKristian-Work HansKristian-Work Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the equivalent for SetMeshOutputsEXT in Metal? I see set_primitive_count there, but is there no set_vertex_count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to NV extension only set_primitive_count. Also index buffer is more similar to original NV extension(uint index[max_prim*3]).

threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2];
_4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID);
threadgroup_barrier(mem_flags::mem_threadgroup);
for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 2; spvI += spvThreadCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not gonna lie, having to emit workarounds like this is just depressing. Is this even going to give meaningful uplifts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested performance of 2 approaches in my engine:

  1. copy loop
  2. assume that gl_MeshVerticesEXT are always addressed by gl_LocalInvocationIndex and use one thread-local variable.

Both are similar, and very slow, ~2x slower than draw-indexed spam with no culling. Asked them on developer forum: https://developer.apple.com/forums/thread/722047, yet there is no useful answers.

spirv_msl.cpp Outdated
case OpSetMeshOutputsEXT:
{
flush_variable_declaration(builtin_primitive_indices_id);
statement("spvMesh.set_primitive_count(", to_unpacked_expression(ops[1]), ");");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this handle rules where the first invocation of the workgroup wins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't in this draft. Metal spec is unclear about access from multiple threads.
Also emulating vulkan spec here is, unfortunately, another tough workaround: would require something like subgroupElect, but for entire workgroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subgroupElect

I didn't realize, back than, that Vulkan requires uniform control flow. Fixed now.

@HansKristian-Work
Copy link
Contributor

Looking over this, I'm not very excited about the prospect of merging this. The impedance mismatch with the mesh type is a disaster for performance, and I'm not particularly excited about having to maintain painful workarounds like this. The tessellation code is bad enough as it is, with ridiculous heroics, but we are somewhat forced to implement it.

Mesh shaders on the other hands is still deeply in "would be nice" territory. No one can reasonably rely on it any time soon, and if it cannot provide meaningful uplift either, I'd defer it indefinitely.

@Try
Copy link
Contributor Author

Try commented Jan 12, 2023

Looking over this, I'm not very excited about the prospect of merging this.

Yes, I do agree on the fact that current mesh-shader in metal3 is not cross-compilation friendly (and bad overall) and PR can be discarded, if no objections from MoltenVK guys.

On my side still have small hope that only M1-laptops do have performance issues, and desktop Mac can be more performant.
Another small hope: what if they have excellent task shader support, that will be way better and fast (unlikely).

Mesh shaders on the other hands is still deeply in "would be nice" territory.

To be more clear: for my engine I've picked mesh-shader as lesser evil - other gpu-driven approaches, like draw-indirect, seem to be worse.

TLTR: feel free do discard PR

@Try
Copy link
Contributor Author

Try commented Feb 12, 2023

Pushed new take on metal-mesh.

  1. For max_vertices <= num_threads, for loop can be avoided: one thread writes exactly one vertex(or primitive).
    I'm not sure, if this is enough, to claim that translated code is readable and clean.

  2. As for shared memory: tested on M1 and one of contributors tested my engine on M2 max - no measurable impact.
    (note: XCode GPU-tools still have no mesh-shader support, so we can measure only FPS in game)

  3. As for bad performance overall:
    In my game I've made a few test cases:
    a) outdoor
    b) cave in middle of game world
    c) cave in corner of game world

'a' and 'b' shown similar performance, yet 'c' was fastest, when looking away from world and slowest when camera points to world center.
My current theory:

  • based on WWDC 2016 presentation, compute-warp launch is expensive on apple
  • most likely mesh-shader also has huge warp launch cost, and that defeats the feature.
  • they design GPU assuming that every shader will be exactly same as NVidia sample (with task-based culling)
    I've asked on apple forum a few days ago, no response so far.

Can't really test task-shader at this point: cross-compilation is relatively straight, but build sensible culling in task is very difficult.

@Try
Copy link
Contributor Author

Try commented Feb 26, 2023

Submitted initial implementation for Task shader.

Still there are differences. In GLSL EmitMeshTasksEXT is terminator - calling it suppose to halt shader execution.
In Metal: mgp.set_threadgroups_per_grid - doesn't stop shader execution, and behave similar to old gl_TaskCountNV.

For now PR assumes, that EmitMeshTasksEXT is called from main, and generates call+return sequence.

@BeastLe9enD
Copy link

For MoltenVK, this is really interesting, I`m working on a PR (see KhronosGroup/MoltenVK#1845 ) for this at the moment that is using this branch to convert SPIRV mesh shaders to MSL code.

@BeastLe9enD
Copy link

@Try I ran into a problem where the generated mesh shader is invalid and does not compile.
I have the following HLSL code generated with spirv to a mesh shader, I also added the resulting spirv assembly:

struct MSOutput {
    float4 Position: SV_Position;
    [[vk::location(0)]] float3 Color: COLOR0;
};

[NumThreads(1, 1, 1)]
[OutputTopology("triangle")]
void mesh_main(out indices uint3 triangles[1], out vertices MSOutput vertices[3]) {
    SetMeshOutputCounts(3, 1);
    triangles[0] = uint3(0, 1, 2);

    vertices[0].Position = float4(-0.5, 0.5, 0.0, 1.0);
    vertices[0].Color = float3(1.0, 0.0, 0.0);

    vertices[1].Position = float4(0.5, 0.5, 0.0, 1.0);
    vertices[1].Color = float3(0.0, 1.0, 0.0);

    vertices[2].Position = float4(0.0, -0.5, 0.0, 1.0);
    vertices[2].Color = float3(0.0, 0.0, 1.0);
}

When I try to do spirv-cross --msl --msl-version 33000 test.spv, it does not work:

[[mesh]] void mesh_main(uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]], spvMesh_t spvMesh)
{
    threadgroup uint spv_primitive_count;
    threadgroup TaskPayload payload;
    threadgroup float4 v_3[3];
    threadgroup float3 out_var_COLOR0[3];
    _1(spvMesh, spv_primitive_count, gl_Position, out_var_COLOR0, gl_LocalInvocationIndex);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (spv_primitive_count == 0)
    {
        return;
    }
    spvMesh.set_primitive_count(spv_primitive_count);
    for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 3; spvI += spvThreadCount)
    {
        spvPerVertex spvV = {};
        spvV.gl_Position = v_3[spvI];
        spvV.out_var_COLOR0 = out_var_COLOR0[spvI];
        spvMesh.set_vertex(spvI, spvV);
    }
}

The position is stored in threadgroup float4 v_3[3];, but when _1 is invoked, gl_Position is passed instead of v_3, which results in an error. I just put it here as a comment since this PR has not been merged yet.

; SPIR-V
; Version: 1.6
; Generator: Google spiregg; 0
; Bound: 60
; Schema: 0
               OpCapability MeshShadingEXT
               OpExtension "SPV_EXT_mesh_shader"
               OpMemoryModel Logical GLSL450
               OpEntryPoint MeshEXT %mesh_main "mesh_main" %2 %gl_Position %out_var_COLOR0 %payload
               OpExecutionMode %mesh_main LocalSize 1 1 1
               OpExecutionMode %mesh_main OutputTrianglesNV
               OpExecutionMode %mesh_main OutputVertices 3
               OpExecutionMode %mesh_main OutputPrimitivesNV 1
               OpSource HLSL 660
               OpName %TaskPayload "TaskPayload"
               OpName %payload "payload"
               OpName %out_var_COLOR0 "out.var.COLOR0"
               OpName %mesh_main "mesh_main"
               OpName %MSOutput "MSOutput"
               OpMemberName %MSOutput 0 "Position"
               OpMemberName %MSOutput 1 "Color"
               OpDecorate %2 BuiltIn PrimitiveTriangleIndicesEXT
               OpDecorate %gl_Position BuiltIn Position
               OpDecorate %out_var_COLOR0 Location 0
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
     %uint_3 = OpConstant %uint 3
     %uint_0 = OpConstant %uint 0
     %uint_2 = OpConstant %uint 2
     %v3uint = OpTypeVector %uint 3
         %14 = OpConstantComposite %v3uint %uint_0 %uint_1 %uint_2
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
      %float = OpTypeFloat 32
 %float_n0_5 = OpConstant %float -0.5
  %float_0_5 = OpConstant %float 0.5
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
    %v4float = OpTypeVector %float 4
         %23 = OpConstantComposite %v4float %float_n0_5 %float_0_5 %float_0 %float_1
    %v3float = OpTypeVector %float 3
         %25 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %26 = OpConstantComposite %v4float %float_0_5 %float_0_5 %float_0 %float_1
      %int_1 = OpConstant %int 1
         %28 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %29 = OpConstantComposite %v4float %float_0 %float_n0_5 %float_0 %float_1
      %int_2 = OpConstant %int 2
         %31 = OpConstantComposite %v3float %float_0 %float_0 %float_1
%TaskPayload = OpTypeStruct
%_ptr_Workgroup_TaskPayload = OpTypePointer Workgroup %TaskPayload
%_arr_v3uint_uint_1 = OpTypeArray %v3uint %uint_1
%_ptr_Output__arr_v3uint_uint_1 = OpTypePointer Output %_arr_v3uint_uint_1
%_arr_v4float_uint_3 = OpTypeArray %v4float %uint_3
%_ptr_Output__arr_v4float_uint_3 = OpTypePointer Output %_arr_v4float_uint_3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
%_ptr_Output__arr_v3float_uint_3 = OpTypePointer Output %_arr_v3float_uint_3
       %void = OpTypeVoid
         %40 = OpTypeFunction %void
%_ptr_Function__arr_v3uint_uint_1 = OpTypePointer Function %_arr_v3uint_uint_1
   %MSOutput = OpTypeStruct %v4float %v3float
%_arr_MSOutput_uint_3 = OpTypeArray %MSOutput %uint_3
%_ptr_Function__arr_MSOutput_uint_3 = OpTypePointer Function %_arr_MSOutput_uint_3
         %44 = OpTypeFunction %void %_ptr_Function__arr_v3uint_uint_1 %_ptr_Function__arr_MSOutput_uint_3
%_ptr_Output_v3uint = OpTypePointer Output %v3uint
%_ptr_Output_v4float = OpTypePointer Output %v4float
%_ptr_Output_v3float = OpTypePointer Output %v3float
    %payload = OpVariable %_ptr_Workgroup_TaskPayload Workgroup
          %2 = OpVariable %_ptr_Output__arr_v3uint_uint_1 Output
%gl_Position = OpVariable %_ptr_Output__arr_v4float_uint_3 Output
%out_var_COLOR0 = OpVariable %_ptr_Output__arr_v3float_uint_3 Output
         %48 = OpUndef %v3uint
         %49 = OpUndef %MSOutput
  %mesh_main = OpFunction %void None %40
         %50 = OpLabel
               OpSetMeshOutputsEXT %uint_3 %uint_1
         %51 = OpAccessChain %_ptr_Output_v3uint %2 %int_0
               OpStore %51 %14
         %52 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_0
               OpStore %52 %23
         %53 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_0
               OpStore %53 %25
         %54 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_1
               OpStore %54 %26
         %55 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_1
               OpStore %55 %28
         %56 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_2
               OpStore %56 %29
         %57 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_2
               OpStore %57 %31
         %58 = OpCompositeConstruct %_arr_v3uint_uint_1 %48
         %59 = OpCompositeConstruct %_arr_MSOutput_uint_3 %49 %49 %49
               OpReturn
               OpFunctionEnd

@Try
Copy link
Contributor Author

Try commented Mar 25, 2023

Hi, @BeastLe9enD !

I have the following HLSL code generated with spirv to a mesh shader, I also added the resulting spirv assembly

What HLSL compiler been used? The resulting spirv doesn't look correct:

OpEntryPoint MeshEXT %mesh_main "mesh_main" %2 %gl_Position %out_var_COLOR0 %payload
Shader claim to output vec4 gl_Position[], what is not possible in mesh, where gl_MeshVerticesEXT should be used instead.

%TaskPayload = OpTypeStruct
Empty playload struct? Also TaskPayload is in HLSL...

@BeastLe9enD
Copy link

@Try oh yea, true, I missed that! That spirv code does not look right although its running fine on my NVIDIA gpu.

Im using dxcompiler.dll: 1.7 - 1.7.0.3795 (bef540d36) shipped with the newest vulkan sdk.

@BeastLe9enD
Copy link

@Try are u sure this isn't correct? I think DXC just names it %gl_Position instead of %gl_MeshVerticesEXT.
It decorates it with OpDecorate %gl_Position BuiltIn Position, the equivalent glslangValidator does is OpMemberDecorate %gl_MeshPerVertexEXT 0 BuiltIn Position which should be the same imo.

@Try
Copy link
Contributor Author

Try commented Apr 1, 2023

@BeastLe9enD

I think DXC just names it %gl_Position instead of %gl_MeshVerticesEXT.

This is not what provided spiv code shows:

_arr_v4float_uint_3 = OpTypeArray %v4float %uint_3
%_ptr_Output__arr_v4float_uint_3 = OpTypePointer Output %_arr_v4float_uint_3
`%gl_Position = OpVariable %_ptr_Output__arr_v4float_uint_3 Output`

In GLSL, this would be:
out vec3 gl_Position[3], what is not correct, as gl_Position must be member of gl_MeshPerVertexEXT

@BeastLe9enD
Copy link

Is this still a thing? And is there a possibility of merging this in the future?

@Try
Copy link
Contributor Author

Try commented Sep 16, 2023

Is this still a thing?

Can ask you same :D Any news about molten-vk prototype?

On my end not doing much here, as mesh-shader is too broken on apple: multiple complex hack are required to make it compile and even then shader is too slow. When performance feature runs slow - that's very bad

@BeastLe9enD
Copy link

BeastLe9enD commented Sep 16, 2023

Can ask you same :D Any news about molten-vk prototype?

I hope its doing well :D don't know, at least I have plans to finish it this year

multiple complex hack are required to make it compile and even then shader is too slow

you mean mesh shader on apple are slow in general or just the spirv -> msl mapped code is suboptimal?

@Try
Copy link
Contributor Author

Try commented Sep 17, 2023

you mean mesh shader on apple are slow in general or just the spirv -> msl mapped code is suboptimal?

It's hard to say for sure:
XCode has no profiling for mesh/object shaders. Also when I asked, apple provide no useful answer on what good practice is. So, we can only speculate. Few key-points to mention:

  • shared memory usage is bad - in most gpu's it puts limit on how many warps can run in parallel
  • nvidia model of mesh shader is not something what can work on tile-based gpu's, like my MacM1
  • in my project mesh-path show 2x performance regression versus vertex-path
  • no way to reason why it's slow - need profiling tools

@BeastLe9enD
Copy link

shared memory usage is bad - in most gpu's it puts limit on how many warps can run in parallel

fair point, I think we could stop using shared memory if we detect that meshlet data are only written in order like you would do in MSL although it would be really painful to implement into spirv-cross because we need to analyze the spirv code first

nvidia model of mesh shader is not something what can work on tile-based gpu's, like my MacM1

what do you mean by nvidia model of mesh shaders? the mesh shaders how they are implemented in dx12/vulkan in general or the guidlines that nvidia gave for getting decent performance like this ? https://developer.nvidia.com/blog/advanced-api-performance-mesh-shaders/
would be interesting here what apple suggests you to use mesh shaders for....

in my project mesh-path show 2x performance regression versus vertex-path

is your project open source? what are you using mesh shaders for?

no way to reason why it's slow - need profiling tools

yep thats indeed really bad. if you have neither a profiler or at least know what is good practise and whats not, the only real thing you can do is guess and pray that the thing you're doing is good xD

@Try
Copy link
Contributor Author

Try commented Sep 18, 2023

could stop using shared memory

I've tested this path as well (my shader happens to be like so) - no measurable FPS improvement.

what do you mean by nvidia model of mesh shaders?

pre-rasterization pipeline is very different across GPU-vendors. On NV (apparently) any thread in warp can output any part of meshlet. On AMD(RDNA2) - each thread responsible for exactly one vertex+primitive and driver need to emulate NV behavior in many cases. On tile-based GPU every vendor do they own thing; on simple case: replay draw-calls multiple time once per tile - so meshlets do no make any sense.
M1 is tile-based, so driver need to do a lot on apple side to make it at least valid.

would be interesting here what apple suggests you to use mesh shaders for...

https://developer.apple.com/forums/thread/722047 nothing interesting, This is probably working as expected with the M1 GPU; Mesh shaders on M1 are intended to enable use-cases that cannot be expressed as draws

is your project open source? what are you using mesh shaders for?

Yes: https://github.com/Try/OpenGothic/blob/master/shader/materials/main.mesh
use-case is quite simple: hiz+frustum culling.

@zmarlon
Copy link

zmarlon commented Nov 3, 2023

Since Apple has now introduced chips with the M3 generation that support hardware mesh shading, I wanted to find out again whether this branch will be developed further, or whether there are plans to merge this feature into the main branch.
I would like to continue working on the mesh shader implementation in MoltenVK based on this.

@Try
Copy link
Contributor Author

Try commented Nov 4, 2023

M3 generation that support hardware mesh shading

Hm, took them less than forever :) Do you have M3 to test it?

or whether there are plans to merge this feature into the main branch

I've rebased it on current main to resolve merge conflicts, yet not it has sporadic/unreproducable failures in CI.
Let me do clear rewrite..
In general I would like to have full mesh support in my engine and deprecate vertex, so merging is desirable (if mesh-shader really fixed in apple driver&hw).

UPD:
Now CI is green

@zmarlon
Copy link

zmarlon commented Oct 13, 2024

You could also support a sub-set of the task shader. This would still be better, as it would still cover a large number of cases.

@HansKristian-Work
Copy link
Contributor

VK_NV_mesh_shader, I propose we firstly support that extension and then add support for the EXT-variant later?

Only supporting a vendor extension when there is an EXT is a dead-end, and I don't see the point.

What is blocking task shaders from being potentially supported? As mentioned, that would not be conformant, indeed.

@squidbus
Copy link
Contributor

What is blocking task shaders from being potentially supported? As mentioned, that would not be conformant, indeed.

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

@HansKristian-Work
Copy link
Contributor

HansKristian-Work commented Oct 14, 2024

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

Is that the only problem? In practical scenarios, no shader relies on that. If EmitMeshTasksEXT is called in main(), a simple return; after would be correct, and for calls in a function, we can just throw an error. MoltenVK could in theory run spir-v inlining on task shaders to avoid that problem for conformance, but I'm not going to hell and back to try and workaround something that can be worked around at the SPIR-V level.

@squidbus
Copy link
Contributor

squidbus commented Oct 14, 2024

I'm a bit of a late-comer here but it's the main problem I'm aware of from the comment history here, @Try is there anything else blocking this if not terminating in EmitMeshTasksEXT is not an issue?

@zmarlon
Copy link

zmarlon commented Oct 14, 2024

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

Is that the only problem? In practical scenarios, no shader relies on that. If EmitMeshTasksEXT is called in main(), a simple return; after would be correct, and for calls in a function, we can just throw an error. MoltenVK could in theory run spir-v inlining on task shaders to avoid that problem for conformance, but I'm not going to hell and back to try and workaround something that can be worked around at the SPIR-V level.

That sounds like a good idea. If this was implemented, could it actually be merged? @HansKristian-Work @Try

@Try
Copy link
Contributor Author

Try commented Oct 14, 2024

@squidbus @HansKristian-Work

@Try is there anything else blocking this if not terminating in EmitMeshTasksEXT is not an issue?
Is that[EmitMeshTasksEXT ] the only problem?

In principle:

  • EmitMeshTasksEXT
  • gl_ClipDistance/gl_CullDistance are untested (and not much unit tests in general)
  • merge conflicts - can work on it this weekend, if we agree on eventually merge it

@BeastLe9enD
Copy link

Sounds nice. if this is merged, @zmarlon and me can finally finish the implementation on the MoltenVK side.

@HansKristian-Work
Copy link
Contributor

merge conflicts - can work on it this weekend, if we agree on eventually merge it

I wouldn't spend time on this yet before I've committed to supporting it. I need to study the implementation in more detail to see how much of mess mesh shaders end up adding ...

@zmarlon
Copy link

zmarlon commented Oct 15, 2024

The Problem is that we are blocked on the MVK side if this is not getting merged. Do you have any Plans at which time you want to look into it?

@HansKristian-Work
Copy link
Contributor

I'll try to have a look this week if I don't get sidetracked with other stuff ...

BuiltIn(get_decoration(lhs_expression, DecorationBuiltIn)) == BuiltInSampleMask &&
is_array(type))
// Meshlet indices
if (lhs_e != nullptr && lhs_e->loaded_from == builtin_mesh_primitive_indices_id &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place to do it. Leaf functions should receive a plain threadgroup uint3 gl_Indices[MaxPrimitives] array. The lowering to set_index needs to happen in the wrapped main.

Pretty sure you can read the content of that array in SPIR-V and this code would break that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going thru spec again: yep, my understanding that all out variables are write only is not correct.

flush_variable_declaration(builtin_mesh_primitive_indices_id);
statement("if (gl_LocalInvocationIndex == 0)");
begin_scope();
statement("spv_primitive_count = ", to_unpacked_expression(ops[1]), ";");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for the primitive count to be threadgroup when shaders write to it. There can be a threadgroup variable in the wrapped main that is written before the barrier. It can just be plain thread.

Also, it's missing the vertex count. That is useful when copying out vertex data. No need to copy out unused vertices ... This fake builtin can just be a uvec2 really.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spec:

The arguments are taken from the first invocation in each workgroup. Any invocation must execute this instruction no more than once and under uniform control flow. There must not be any control flow path to an output write that is not preceded by this instruction

My read on this: while application has to ensure uniform control flow, it doesn't have to provide same arguments from every invocation, as only first invocation matters.

// Relevant for multiple entry-point modules which might declare unused builtins.
if (!active_input_builtins.get(bi_type) || !interface_variable_exists_in_entry_point(var_id))
return;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just reindentation? Needs to be fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem so. Probably touched a line in this block at some point and clang-format did the rest :|

statement(builtin_type_decl(bi_type), " ", to_expression(var_id), " = ",
to_expression(builtin_invocation_id_id), ".x % ", this->get_entry_point().output_vertices,
";");
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also just looks like a massive reindentation to me.

// GLSL: Once this instruction is called, the workgroup must be terminated immediately, and the mesh shaders are launched.
// TODO: find relieble and clean of terminating shader.
statement("spvMpg.set_threadgroups_per_grid(uint3(", to_unpacked_expression(block.mesh.groups[0]), ", ",
to_unpacked_expression(block.mesh.groups[1]), ", ", to_unpacked_expression(block.mesh.groups[2]), "));");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return; should be called after this. Also needs to check that this is only used in the entry function, otherwise just throw an error saying it's not implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use to have something like this on previous iterations. Having non-trivial behavior make it even worse:
In this stage we can document shader termination in Emit doesn't work.
Otherwise it will be shader termination in Emit sometimes work, but may fail to compile if <...>

}

string quals;
quals = member_location_attribute_qualifier(type, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flat/centroid/sample is only relevant for fragment stage.

{
if (is_builtin)
{
switch (builtin)
Copy link
Contributor

@HansKristian-Work HansKristian-Work Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells like code duplication. Any reason it's done like this?

statement("spvPerVertex spvV = {};");
for (uint32_t index = 0; index < uint32_t(type_vert.member_types.size()); ++index)
{
uint32_t orig_var =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path is a bit too naive for more complex objects. E.g.

#version 450
#extension GL_EXT_mesh_shader : require
layout(max_vertices = 3, max_primitives = 1, triangles) out;
layout(local_size_x = 1) in;

out gl_MeshPerVertexEXT
{
	invariant vec4 gl_Position;
} gl_MeshVerticesEXT[3];

layout(location = 0) out float foos[3][4];

layout(location = 4) out Foo
{
	float bar;
} bars[3];

void main()
{
	SetMeshOutputsEXT(3, 1);
	gl_MeshVerticesEXT[0].gl_Position = vec4(1.0);
	gl_MeshVerticesEXT[1].gl_Position = vec4(1.0);
	gl_MeshVerticesEXT[2].gl_Position = vec4(1.0);
	gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
	foos[0][0] = 10.0;
	foos[1][1] = 20.0;
	foos[2][2] = 20.0;
	bars[0].bar = 4.0;
	bars[1].bar = 5.0;
	bars[2].bar = 6.0;
}
    for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 3; spvI += spvThreadCount)
    {
        spvPerVertex spvV = {};
        spvV.gl_Position = gl_MeshVerticesEXT[spvI].gl_Position;
        spvV.foos_0 = foos[spvI];
        spvV.foos_1 = foos[spvI];
        spvV.foos_2 = foos[spvI];
        spvV.foos_3 = foos[spvI];
        spvV.bars_bar = bars[spvI].bar;
        spvMesh.set_vertex(spvI, spvV);
    }

it doesn't seem to lower arrayed objects.

I feel like there should probably be a way to reuse the existing lambda stuff to lower the output writes. That might be something I have to look into once the rest of the implementation is in an acceptable state.

num_invocaions = mode.workgroup_size.x * mode.workgroup_size.y * mode.workgroup_size.z;
}

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need a separate block.

if (num_invocaions < mode.output_vertices)
{
statement("for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = "
"(gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should loop over spv_vertex_count that was set earlier.

Copy link
Contributor

@HansKristian-Work HansKristian-Work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a local merge conflict resolve and it seemed trivial.

  • My biggest concerns now is high code duplication.
  • Lots of diffs which only seem to be indentation changes which makes it impossible to review.
  • Misc structural issues.
  • Also make sure to rebase and squash. Having 10+ commits with random commit messages isn't helpful. Having several clean commits to go through would be the ideal, but that is asking for a lot of extra work and not a requirement.

Copy link
Contributor Author

@Try Try left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have closer look at weekend

Having 10+ commits with random commit messages isn't helpful.

I'm planning to start new fresh PR: one for mesh and then another for task. This should help to keep scope cleaner and smaller.

// GLSL: Once this instruction is called, the workgroup must be terminated immediately, and the mesh shaders are launched.
// TODO: find relieble and clean of terminating shader.
statement("spvMpg.set_threadgroups_per_grid(uint3(", to_unpacked_expression(block.mesh.groups[0]), ", ",
to_unpacked_expression(block.mesh.groups[1]), ", ", to_unpacked_expression(block.mesh.groups[2]), "));");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use to have something like this on previous iterations. Having non-trivial behavior make it even worse:
In this stage we can document shader termination in Emit doesn't work.
Otherwise it will be shader termination in Emit sometimes work, but may fail to compile if <...>

// Relevant for multiple entry-point modules which might declare unused builtins.
if (!active_input_builtins.get(bi_type) || !interface_variable_exists_in_entry_point(var_id))
return;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem so. Probably touched a line in this block at some point and clang-format did the rest :|

flush_variable_declaration(builtin_mesh_primitive_indices_id);
statement("if (gl_LocalInvocationIndex == 0)");
begin_scope();
statement("spv_primitive_count = ", to_unpacked_expression(ops[1]), ";");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spec:

The arguments are taken from the first invocation in each workgroup. Any invocation must execute this instruction no more than once and under uniform control flow. There must not be any control flow path to an output write that is not preceded by this instruction

My read on this: while application has to ensure uniform control flow, it doesn't have to provide same arguments from every invocation, as only first invocation matters.

BuiltIn(get_decoration(lhs_expression, DecorationBuiltIn)) == BuiltInSampleMask &&
is_array(type))
// Meshlet indices
if (lhs_e != nullptr && lhs_e->loaded_from == builtin_mesh_primitive_indices_id &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going thru spec again: yep, my understanding that all out variables are write only is not correct.

@Try
Copy link
Contributor Author

Try commented Oct 28, 2024

Will have closer look at weekend

Just to inform: still working on new PR for mesh-shader.

@Try
Copy link
Contributor Author

Try commented Oct 28, 2024

Mesh shader: #2400

@HansKristian-Work
Copy link
Contributor

Superseded.

@Try
Copy link
Contributor Author

Try commented Oct 30, 2024

Task shader: #2402

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants