-
Notifications
You must be signed in to change notification settings - Fork 565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] MSL: mesh shader initial #2074
Conversation
spirv_msl.cpp
Outdated
{ | ||
auto &execution = get_entry_point(); | ||
auto str = to_expression(lhs_expression); | ||
str = str.substr(str.find_first_of('[')+1, str.find_last_of(']') - str.find_first_of('[') - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code look quite bad, but I don't see better way, atm.
Need some way to access OpAccessChain parameters, or some way to emit shorten chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, small bump to this.
My question basically: is there any way to reach original OpAccessChain
with all arguments?
Needed here and for DX mesh-shader as well.
Thanks!
Tested today compilation variant, when instead of shared memory array local variable been used. This happen to be valid for my shaders, since there is 1-to-1 match between |
@Try Can this be due to the TBDR arch ? I'll leave there some other links i came across reading about this issue: Unreal Engine - https://blog.imaginationtech.com/powervr-performance-tips-for-unreal-engine-4/ - when developing focusing on PowerVR recomends disabling Early-Z due to how PowerVR handles geometry. Being Apple based on PowerVR i hope some of this helps you. |
That depends a lot on what kind of TBDR is it. According to https://blog.imaginationtech.com/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/ PowerVR has fat tiler, with native support for vertex shader. So, I think best way is to ask them.
Not related. In any case, when renderer is same and only difference is vertex-vs-mesh, then performance should be roughly same. |
Back from holiday, just ACK-ing that I've seen it and I'll look at it when I have time. |
[[mesh]] void main0(uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]], uint3 gl_GlobalInvocationID [[thread_position_in_grid]], spvMesh_t spvMesh) | ||
{ | ||
threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2]; | ||
_4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the equivalent for SetMeshOutputsEXT in Metal? I see set_primitive_count there, but is there no set_vertex_count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to NV extension only set_primitive_count
. Also index buffer is more similar to original NV extension(uint index[max_prim*3]
).
threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2]; | ||
_4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID); | ||
threadgroup_barrier(mem_flags::mem_threadgroup); | ||
for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 2; spvI += spvThreadCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not gonna lie, having to emit workarounds like this is just depressing. Is this even going to give meaningful uplifts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested performance of 2 approaches in my engine:
- copy loop
- assume that
gl_MeshVerticesEXT
are always addressed bygl_LocalInvocationIndex
and use one thread-local variable.
Both are similar, and very slow, ~2x slower than draw-indexed spam with no culling. Asked them on developer forum: https://developer.apple.com/forums/thread/722047, yet there is no useful answers.
spirv_msl.cpp
Outdated
case OpSetMeshOutputsEXT: | ||
{ | ||
flush_variable_declaration(builtin_primitive_indices_id); | ||
statement("spvMesh.set_primitive_count(", to_unpacked_expression(ops[1]), ");"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this handle rules where the first invocation of the workgroup wins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't in this draft. Metal spec is unclear about access from multiple threads.
Also emulating vulkan spec here is, unfortunately, another tough workaround: would require something like subgroupElect
, but for entire workgroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subgroupElect
I didn't realize, back than, that Vulkan requires uniform control flow. Fixed now.
Looking over this, I'm not very excited about the prospect of merging this. The impedance mismatch with the mesh type is a disaster for performance, and I'm not particularly excited about having to maintain painful workarounds like this. The tessellation code is bad enough as it is, with ridiculous heroics, but we are somewhat forced to implement it. Mesh shaders on the other hands is still deeply in "would be nice" territory. No one can reasonably rely on it any time soon, and if it cannot provide meaningful uplift either, I'd defer it indefinitely. |
Yes, I do agree on the fact that current mesh-shader in metal3 is not cross-compilation friendly (and bad overall) and PR can be discarded, if no objections from MoltenVK guys. On my side still have small hope that only M1-laptops do have performance issues, and desktop Mac can be more performant.
To be more clear: for my engine I've picked mesh-shader as lesser evil - other gpu-driven approaches, like draw-indirect, seem to be worse. TLTR: feel free do discard PR |
Pushed new take on metal-mesh.
'a' and 'b' shown similar performance, yet 'c' was fastest, when looking away from world and slowest when camera points to world center.
Can't really test task-shader at this point: cross-compilation is relatively straight, but build sensible culling in task is very difficult. |
Submitted initial implementation for Task shader. Still there are differences. In GLSL For now PR assumes, that |
For MoltenVK, this is really interesting, I`m working on a PR (see KhronosGroup/MoltenVK#1845 ) for this at the moment that is using this branch to convert SPIRV mesh shaders to MSL code. |
@Try I ran into a problem where the generated mesh shader is invalid and does not compile. struct MSOutput {
float4 Position: SV_Position;
[[vk::location(0)]] float3 Color: COLOR0;
};
[NumThreads(1, 1, 1)]
[OutputTopology("triangle")]
void mesh_main(out indices uint3 triangles[1], out vertices MSOutput vertices[3]) {
SetMeshOutputCounts(3, 1);
triangles[0] = uint3(0, 1, 2);
vertices[0].Position = float4(-0.5, 0.5, 0.0, 1.0);
vertices[0].Color = float3(1.0, 0.0, 0.0);
vertices[1].Position = float4(0.5, 0.5, 0.0, 1.0);
vertices[1].Color = float3(0.0, 1.0, 0.0);
vertices[2].Position = float4(0.0, -0.5, 0.0, 1.0);
vertices[2].Color = float3(0.0, 0.0, 1.0);
} When I try to do
The position is stored in
|
Hi, @BeastLe9enD !
What HLSL compiler been used? The resulting spirv doesn't look correct:
|
@Try oh yea, true, I missed that! That spirv code does not look right although its running fine on my NVIDIA gpu. Im using |
@Try are u sure this isn't correct? I think DXC just names it |
This is not what provided spiv code shows:
In GLSL, this would be: |
Is this still a thing? And is there a possibility of merging this in the future? |
Can ask you same :D Any news about molten-vk prototype? On my end not doing much here, as mesh-shader is too broken on apple: multiple complex hack are required to make it compile and even then shader is too slow. When performance feature runs slow - that's very bad |
I hope its doing well :D don't know, at least I have plans to finish it this year
you mean mesh shader on apple are slow in general or just the spirv -> msl mapped code is suboptimal? |
It's hard to say for sure:
|
fair point, I think we could stop using shared memory if we detect that meshlet data are only written in order like you would do in MSL although it would be really painful to implement into spirv-cross because we need to analyze the spirv code first
what do you mean by nvidia model of mesh shaders? the mesh shaders how they are implemented in dx12/vulkan in general or the guidlines that nvidia gave for getting decent performance like this ? https://developer.nvidia.com/blog/advanced-api-performance-mesh-shaders/
is your project open source? what are you using mesh shaders for?
yep thats indeed really bad. if you have neither a profiler or at least know what is good practise and whats not, the only real thing you can do is guess and pray that the thing you're doing is good xD |
I've tested this path as well (my shader happens to be like so) - no measurable FPS improvement.
pre-rasterization pipeline is very different across GPU-vendors. On NV (apparently) any thread in warp can output any part of meshlet. On AMD(RDNA2) - each thread responsible for exactly one vertex+primitive and driver need to emulate NV behavior in many cases. On tile-based GPU every vendor do they own thing; on simple case: replay draw-calls multiple time once per tile - so meshlets do no make any sense.
https://developer.apple.com/forums/thread/722047 nothing interesting,
Yes: https://github.com/Try/OpenGothic/blob/master/shader/materials/main.mesh |
Since Apple has now introduced chips with the M3 generation that support hardware mesh shading, I wanted to find out again whether this branch will be developed further, or whether there are plans to merge this feature into the main branch. |
Hm, took them less than forever :) Do you have M3 to test it?
I've rebased it on current main to resolve merge conflicts, yet not it has sporadic/unreproducable failures in CI. UPD: |
You could also support a sub-set of the task shader. This would still be better, as it would still cover a large number of cases. |
Only supporting a vendor extension when there is an EXT is a dead-end, and I don't see the point. What is blocking task shaders from being potentially supported? As mentioned, that would not be conformant, indeed. |
|
Is that the only problem? In practical scenarios, no shader relies on that. If EmitMeshTasksEXT is called in main(), a simple return; after would be correct, and for calls in a function, we can just throw an error. MoltenVK could in theory run spir-v inlining on task shaders to avoid that problem for conformance, but I'm not going to hell and back to try and workaround something that can be worked around at the SPIR-V level. |
I'm a bit of a late-comer here but it's the main problem I'm aware of from the comment history here, @Try is there anything else blocking this if not terminating in |
That sounds like a good idea. If this was implemented, could it actually be merged? @HansKristian-Work @Try |
In principle:
|
Sounds nice. if this is merged, @zmarlon and me can finally finish the implementation on the MoltenVK side. |
I wouldn't spend time on this yet before I've committed to supporting it. I need to study the implementation in more detail to see how much of mess mesh shaders end up adding ... |
The Problem is that we are blocked on the MVK side if this is not getting merged. Do you have any Plans at which time you want to look into it? |
I'll try to have a look this week if I don't get sidetracked with other stuff ... |
BuiltIn(get_decoration(lhs_expression, DecorationBuiltIn)) == BuiltInSampleMask && | ||
is_array(type)) | ||
// Meshlet indices | ||
if (lhs_e != nullptr && lhs_e->loaded_from == builtin_mesh_primitive_indices_id && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the right place to do it. Leaf functions should receive a plain threadgroup uint3 gl_Indices[MaxPrimitives]
array. The lowering to set_index needs to happen in the wrapped main.
Pretty sure you can read the content of that array in SPIR-V and this code would break that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going thru spec again: yep, my understanding that all out
variables are write only is not correct.
flush_variable_declaration(builtin_mesh_primitive_indices_id); | ||
statement("if (gl_LocalInvocationIndex == 0)"); | ||
begin_scope(); | ||
statement("spv_primitive_count = ", to_unpacked_expression(ops[1]), ";"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for the primitive count to be threadgroup when shaders write to it. There can be a threadgroup variable in the wrapped main that is written before the barrier. It can just be plain thread.
Also, it's missing the vertex count. That is useful when copying out vertex data. No need to copy out unused vertices ... This fake builtin can just be a uvec2 really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In spec:
The arguments are taken from the first invocation in each workgroup. Any invocation must execute this instruction no more than once and under uniform control flow. There must not be any control flow path to an output write that is not preceded by this instruction
My read on this: while application has to ensure uniform control flow, it doesn't have to provide same arguments from every invocation, as only first invocation matters.
// Relevant for multiple entry-point modules which might declare unused builtins. | ||
if (!active_input_builtins.get(bi_type) || !interface_variable_exists_in_entry_point(var_id)) | ||
return; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just reindentation? Needs to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seem so. Probably touched a line in this block at some point and clang-format did the rest :|
statement(builtin_type_decl(bi_type), " ", to_expression(var_id), " = ", | ||
to_expression(builtin_invocation_id_id), ".x % ", this->get_entry_point().output_vertices, | ||
";"); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also just looks like a massive reindentation to me.
// GLSL: Once this instruction is called, the workgroup must be terminated immediately, and the mesh shaders are launched. | ||
// TODO: find relieble and clean of terminating shader. | ||
statement("spvMpg.set_threadgroups_per_grid(uint3(", to_unpacked_expression(block.mesh.groups[0]), ", ", | ||
to_unpacked_expression(block.mesh.groups[1]), ", ", to_unpacked_expression(block.mesh.groups[2]), "));"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return;
should be called after this. Also needs to check that this is only used in the entry function, otherwise just throw an error saying it's not implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use to have something like this on previous iterations. Having non-trivial behavior make it even worse:
In this stage we can document shader termination in Emit doesn't work
.
Otherwise it will be shader termination in Emit sometimes work, but may fail to compile if <...>
} | ||
|
||
string quals; | ||
quals = member_location_attribute_qualifier(type, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flat/centroid/sample is only relevant for fragment stage.
{ | ||
if (is_builtin) | ||
{ | ||
switch (builtin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This smells like code duplication. Any reason it's done like this?
statement("spvPerVertex spvV = {};"); | ||
for (uint32_t index = 0; index < uint32_t(type_vert.member_types.size()); ++index) | ||
{ | ||
uint32_t orig_var = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path is a bit too naive for more complex objects. E.g.
#version 450
#extension GL_EXT_mesh_shader : require
layout(max_vertices = 3, max_primitives = 1, triangles) out;
layout(local_size_x = 1) in;
out gl_MeshPerVertexEXT
{
invariant vec4 gl_Position;
} gl_MeshVerticesEXT[3];
layout(location = 0) out float foos[3][4];
layout(location = 4) out Foo
{
float bar;
} bars[3];
void main()
{
SetMeshOutputsEXT(3, 1);
gl_MeshVerticesEXT[0].gl_Position = vec4(1.0);
gl_MeshVerticesEXT[1].gl_Position = vec4(1.0);
gl_MeshVerticesEXT[2].gl_Position = vec4(1.0);
gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
foos[0][0] = 10.0;
foos[1][1] = 20.0;
foos[2][2] = 20.0;
bars[0].bar = 4.0;
bars[1].bar = 5.0;
bars[2].bar = 6.0;
}
for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 3; spvI += spvThreadCount)
{
spvPerVertex spvV = {};
spvV.gl_Position = gl_MeshVerticesEXT[spvI].gl_Position;
spvV.foos_0 = foos[spvI];
spvV.foos_1 = foos[spvI];
spvV.foos_2 = foos[spvI];
spvV.foos_3 = foos[spvI];
spvV.bars_bar = bars[spvI].bar;
spvMesh.set_vertex(spvI, spvV);
}
it doesn't seem to lower arrayed objects.
I feel like there should probably be a way to reuse the existing lambda stuff to lower the output writes. That might be something I have to look into once the rest of the implementation is in an acceptable state.
num_invocaions = mode.workgroup_size.x * mode.workgroup_size.y * mode.workgroup_size.z; | ||
} | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need a separate block.
if (num_invocaions < mode.output_vertices) | ||
{ | ||
statement("for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = " | ||
"(gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should loop over spv_vertex_count that was set earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a local merge conflict resolve and it seemed trivial.
- My biggest concerns now is high code duplication.
- Lots of diffs which only seem to be indentation changes which makes it impossible to review.
- Misc structural issues.
- Also make sure to rebase and squash. Having 10+ commits with random commit messages isn't helpful. Having several clean commits to go through would be the ideal, but that is asking for a lot of extra work and not a requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have closer look at weekend
Having 10+ commits with random commit messages isn't helpful.
I'm planning to start new fresh PR: one for mesh and then another for task. This should help to keep scope cleaner and smaller.
// GLSL: Once this instruction is called, the workgroup must be terminated immediately, and the mesh shaders are launched. | ||
// TODO: find relieble and clean of terminating shader. | ||
statement("spvMpg.set_threadgroups_per_grid(uint3(", to_unpacked_expression(block.mesh.groups[0]), ", ", | ||
to_unpacked_expression(block.mesh.groups[1]), ", ", to_unpacked_expression(block.mesh.groups[2]), "));"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use to have something like this on previous iterations. Having non-trivial behavior make it even worse:
In this stage we can document shader termination in Emit doesn't work
.
Otherwise it will be shader termination in Emit sometimes work, but may fail to compile if <...>
// Relevant for multiple entry-point modules which might declare unused builtins. | ||
if (!active_input_builtins.get(bi_type) || !interface_variable_exists_in_entry_point(var_id)) | ||
return; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seem so. Probably touched a line in this block at some point and clang-format did the rest :|
flush_variable_declaration(builtin_mesh_primitive_indices_id); | ||
statement("if (gl_LocalInvocationIndex == 0)"); | ||
begin_scope(); | ||
statement("spv_primitive_count = ", to_unpacked_expression(ops[1]), ";"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In spec:
The arguments are taken from the first invocation in each workgroup. Any invocation must execute this instruction no more than once and under uniform control flow. There must not be any control flow path to an output write that is not preceded by this instruction
My read on this: while application has to ensure uniform control flow, it doesn't have to provide same arguments from every invocation, as only first invocation matters.
BuiltIn(get_decoration(lhs_expression, DecorationBuiltIn)) == BuiltInSampleMask && | ||
is_array(type)) | ||
// Meshlet indices | ||
if (lhs_e != nullptr && lhs_e->loaded_from == builtin_mesh_primitive_indices_id && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going thru spec again: yep, my understanding that all out
variables are write only is not correct.
Just to inform: still working on new PR for mesh-shader. |
Mesh shader: #2400 |
Superseded. |
Task shader: #2402 |
This is more initial draft to start conversation on how-to.
Mainly need to know opinion on copy-pass and
SPIRVCrossDecorationInterfaceMemberIndex
.Workflow:
When cross-compiling, new types gets synthesized:
spvPerVertex
all regular varyings + gl_PositionspvPerPrimitive
- varyings marked asperprimitiveEXT
using spvMesh_t = mesh<spvPerVertex, spvPerPrimitive, ... >;
gl_PrimitiveTriangleIndicesEXT
becomesspvMesh
. Affects handlingOpSetMeshOutputsEXT
.OpStore to index buffer remapped as 3 calls, in case of triangles, to
spvMesh.set_index
There is ugly hack to implement so - see code.
gl_MeshVerticesEXT
and varyings are represented as shared memory arrays. One shader is done - they get packed into single struct. This is to work-around of SPIRV-vs-MSL api-differences.In theory, if shader writes only with
[gl_LocalInvocationIndex]
array can be removed.Concerns
Had to change meaning
SPIRVCrossDecorationInterfaceMemberIndex
. Current implementation doesn't track array-elements - they all have same ID. On vertex/fragment seems to work fine, don't have any tesselation/geometry test-cases.I'm not sure how to go about it: current workflow with lambdas isn't suitable for
Bugs/TODO:
Haven't tested
gl_ClipDistance
/gl_CullDistance
- dunno how exactly they should work in spirv-cross.Task(Object) shader in metal is different in metal:
In GLSL
EmitMeshTasksEXT
is terminator - calling it suppose to halt shader execution.set_threadgroups_per_grid
- doesn't stop shader execution.Performance
Apple M1 testing(in OpenGothic) shows ~2x fps regression. Can be because shared memory, but don't know for sure.
Asked about this on apple forum: https://developer.apple.com/forums/thread/722047
Apple M3 seem to work as fast as expected