Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support XFB in MoltenVK #2169

Draft
wants to merge 61 commits into
base: main
Choose a base branch
from
Draft

Support XFB in MoltenVK #2169

wants to merge 61 commits into from

Conversation

gpx1000
Copy link

@gpx1000 gpx1000 commented Jun 19, 2023

Adds support for MoltenVK to use TransformFeedback

cdavis5e and others added 15 commits August 23, 2022 15:26
This is analogous to the existing support for fixing up shader inputs.
It is intended to be used with tessellation to add implicit builtins
that are read from a later stage, despite not being written in an
earlier stage. (Believe it or not, this is in fact legal in Vulkan.)

Helps fix 8 CTS tests under `dEQP-VK.pipeline.*.no_position`. (Eight
other tests work solely by accident without this change.)
MSL: Add a mechanism to fix up shader outputs.

Approved-by: Steven Winston
It is possible in SPIR-V to declare multiple specialization constants
with the same constant ID. The most common cause of this in GLSL is
defining a spec constant, then declaring the workgroup size to use that
spec constant by its ID. But, MSL forbids defining multiple function
constants with the same function constant ID. So, we must only emit one
definition of the actual function constant (with the
`[[function_constant(id)]]` attribute); but we can point the other
variables at this one definition.

Fixes three tests in the Vulkan CTS under
`dEQP-VK.compute.basic.max_local_size_*`.
MSL: Deduplicate function constants.

Approved-by: Steven Winston
Does analysis of outputs and sorts them into buffers. Nothing else yet.
# Conflicts:
#	main.cpp
#	spirv_cross_c.cpp
#	spirv_cross_c.h
#	spirv_msl.cpp
#	spirv_msl.hpp
This only does the bare minimum needed to write XFB data (and not even
that actually). It still needs to calculate the offset in the buffer
where the data need to be written, and primitive types other than points
need to be implemented.
# Conflicts:
#	reference/shaders-msl/comp/local-size-duplicate-spec-id.comp
#	spirv_msl.cpp
@gpx1000 gpx1000 marked this pull request as draft June 20, 2023 00:17
@HansKristian-Work
Copy link
Contributor

Given this is still marked draft, I assume this is still WIP and I'll be notified when it's ready for review?

This is the only type of shader that can even have such outputs. Not
only does this save some work in most cases, it also fixes a problem
with the next patch.
We still rely on it to pass around and collect the output. To avoid
duplicates, only do this if we would not do this normally.
Use the offset from the `XfbOutput` struct instead of querying it again
from the ID.

Use the `member_index` local instead of using `size() - 1` when setting
member decorations.

Don't set the qualified name for builtin block variabless--we handle
those a different way.

Use the member index from the `XfbOutput` when inspecing the original
block type instead of the `member_index` local. This one was a real bug;
honestly, I don't know how it even worked before.
…types.

Make sure they use the `thread` AS and that they have the `packed_`
prefix, if necessary.
Add missing changes from previous patch.
@cdavis5e cdavis5e marked this pull request as ready for review September 19, 2023 01:18
spvXfbBuffer3 spvXfbOutput3 = {};
VertOut _20 = {};
if (any(gl_GlobalInvocationID >= spvStageInputSize))
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks threadgroup_barrier.

if (all(gl_GlobalInvocationID.xy == 0))
{
uint spvWritten = spvStageInputSize.x * spvStageInputSize.y;
atomic_store_explicit(spvXfbCounter1, spvInitOffset1 + sizeof(*spvXfb1) * spvWritten, memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is XFB ordering maintained here? XFB data must be emitted in-order with input primitives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual XFB buffers are indexed by the global invocation ID.

spvXfb3 = reinterpret_cast<device spvXfbBuffer3*>(reinterpret_cast<device char*>(spvXfb3) + spvInitOffset3);
if ((gl_GlobalInvocationID.x % 3u == 2) || gl_GlobalInvocationID.x + 2 < spvStageInputSize.x)
spvXfb3[spvXfbIndex] = spvXfbOutput3;
threadgroup_barrier(mem_flags::mem_device);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a barrier even needed here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure the atomic load of the counter happens-before the atomic update, particularly since Metal doesn't support acquire/release semantics for atomics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different workgroup can touch the counter here though, so not sure threadgroup_barrier is enough to ensure the in-order requirement of XFB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one thread in the entire dispatch is allowed to write the counter.

...But if the first workgroup completes before the others, they may wind up loading the counter after the first thread writes to it. Hmm, this may be tougher than I thought...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reasonable solution is to manage the counter from the outside and only pass the offset (read-only) into the shader.

spvXfbBuffer1 spvXfbOutput1 = {};
spvXfbBuffer2 spvXfbOutput2 = {};
spvXfbBuffer3 spvXfbOutput3 = {};
VertOut _25 = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does vertex output go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly to the transform feedback buffers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this rasterized? Do you re-shade through the normal vertex shader in a second pass, or do you shade from the XFB feedback as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intend to have MoltenVK shade from the XFB data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is broken. Some reasons why:

  • It's possible to only capture a subset of varyings.
  • Robustness rules. XFB buffers can be exhausted, where you still rasterize, but you must stop writing XFB data.

spvXfbBuffer2 spvXfbOutput2 = {};
spvXfbBuffer3 spvXfbOutput3 = {};
VertOut _25 = {};
if (any(gl_GlobalInvocationID >= spvStageInputSize))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about index buffers?

Copy link
Contributor

@HansKristian-Work HansKristian-Work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, this is 58 (!) commits. I'm not reviewing that as-is. Please rebase this into something more digestable.

Also, given the scope of this, please add a PR description that explains the implementation strategy, problem scenarios, which corner cases don't work, etc, also why this feature is even desirable in the first place. Does it pass CTS?

@cdavis5e cdavis5e marked this pull request as draft November 23, 2023 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants