Skip to content

Latest commit

 

History

History
233 lines (151 loc) · 25.3 KB

README.md

File metadata and controls

233 lines (151 loc) · 25.3 KB

vGPU: Vulkan GPU Framework for Graphics and Compute, in Go

IMPORTANT: This is the cogentcore branch that was developed for the Cogent Core framework, which also includes the gosl "go as a shader language" package, which converts Go to the HLSL shader language for compilation on the GPU. Cogent Core has now switched to using WebGPU instead of Vulkan, so this Vulkan-based code is no longer being used or maintained.

The main branch of vgpu is the original v1 version of vGPU, which works with the original v1 version of goki.

Mac Installation prerequisite: https://vulkan.lunarg.com/sdk/home -- download the Vulkan SDK installer for the mac. Unfortunately there does not appear to be a full version of this on homebrew -- the molten-vk package is not enough by itself.

Major design problems with vGPU

In the process of rewriting the WebGPU version in Cogent Core, called gpu, much was improved about the overall design and implementation, particularly in these areas:

  1. vgpu uses shared memory buffers and a separate Memory management system for all values, which requires all values to be sync'd to / from the GPU together, and adds considerable complexity to the overall implementation. Everything uses dynamic offsets by default. Later, an even more complex mechanism for splitting the Storage memory into separate chunks was introduced. In gpu, each Value has its own buffer by default, and dynamic offset is only optionally enabled for specific Values, and the Value directly manages everything -- much simpler and cleaner.

  2. vgpu uses dynamically indexed Textures, which has a limit of 16 on the mac, so considerable complexity was introduced to pack a bunch of images into these 16 images, via the szalloc package. In gpu, we just call Bind to update which current texture is being used -- this is very fast and is the standard way of doing things, so you typically only need a single active texture variable at a time. Major "doh" moment there.

  3. vgpu has a bunch of confusing complexity around the vulkan DescSet concept, which is confusing in relation to the VarSet. In gpu, variable sets are now VarGroup, and you just manage multiple instances using as many Value instances as you want for each Var. The DescSet was also conflated with the texture packing thing, and there are only 3 desc sets supported, so overall that system was just plain bad and unnecessary.

So, if anyone ever did want to actually use this vulkan-based version, it would be good to fix these issues and make it work more like the current Cogent Core gpu version. Basically, you'd start with gpu and just back-port all the vulkan impl from this vgpu version.

But so far, WebGPU is proving to be much simpler and just as performant as Vulkan, with significantly fewer hassels on the main Windows and Mac platforms (because it goes direct to native GPU frameworks there), and it works on the Web, which Vulkan never will, so there isn't much reason to go back to Vulkan at this point.

Overview

vGPU is a Vulkan-based framework for both Graphics and Compute Engine use of GPU hardware, in the Go langauge. It uses the basic cgo-based Go bindings to Vulkan in vulkan-go and was developed starting with the associated example code surrounding that project. Vulkan is a relatively new, essentially universally supported interface to GPU hardware across all types of systems from mobile phones to massive GPU-based compute hardware, and it provides high-performance "bare metal" access to the hardware, for both graphics and computational uses.

Vulkan is very low-level and demands a higher-level framework to manage the complexity and verbosity. While there are many helpful tutorials covering the basic API, many of the tutorials don't provide much of a pathway for how to organize everything at a higher level of abstraction. vGPU represents one attempt that enforces some reasonable choices that enable a significantly simpler programming model, while still providing considerable flexibility and high levels of performance. Everything is a tradeoff, and simplicity definitely was prioritized over performance in a few cases, but in practical use-cases, the performance differences should be minimal.

  • The gosl sub-package provides a way to translate Go code into GPU shader language code for running under vGPU, playing the role of NVIDIA's "cuda" language in other frameworks.

Platforms

  • On desktop (mac, windows, linux), glfw is used for initializing the GPU.
  • Mobile (android, ios)...
    • When developing for Android on macOS, it is critical to set Emulated Performance -> Graphics to Software in the Android Virtual Device Manager (AVD); otherwise, the app will crash on startup. This is because macOS does not support direct access to the underlying hardware GPU in the Android Emulator. You can see more information how to do this in the Android developer documentation. Please note that this issue will not affect end-users of your app, only you while you develop it. Also, due to the typically bad performance of the emulated device GPU on macOS, it is recommended that you use a more modern emulated device than the default Pixel 3a. Finally, you should always test your app on a real mobile device if possible to see what it is actually like.

Selecting a GPU Device

For systems with multiple GPU devices, by default the discrete device is selected, and if multiple of those are present, the one with the most RAM is used. To see what is available and their properties, use:

$ vulkaninfo --summary

The following environment variables can be set to specifically select a particular device by name (deviceName):

  • MESA_VK_DEVICE_SELECT (standard for mesa-based drivers) or VK_DEVICE_SELECT -- for graphics or compute usage.
  • VK_COMPUTE_DEVICE_SELECT -- only used for compute, if present -- will override above, so you can use different GPUs for graphics vs compute.

vPhong and vShape

The vPhong package provides a complete rendering implementation with different pipelines for different materials, and support for 4 different types of light sources based on the classic Blinn-Phong lighting model. See the examples/phong example for how to use it. It does not assume any kind of organization of the rendering elements, and just provides name and index-based access to all the resources needed to render a scene.

vShape generates standard 3D shapes (sphere, cylinder, box, etc), with all the normals and texture coordinates. You can compose shape elements into more complex groups of shapes, programmatically. It separates the calculation of the number of vertex and index elements from actually setting those elements, so you can allocate everything in one pass, and then configure the shape data in a second pass, consistent with the most efficient memory model provided by vgpu. It only has a dependency on the math32 package and could be used for anything.

Basic Elements and Organization

  • GPU represents the hardware Device and maintains global settings, info about the hardware.

    • Device is a logical device and associated Queue info -- each such device can function in parallel.
    • CmdPool manages a command pool and buffer, associated with a specific logical device, for submitting commands to the GPU.
  • System manages multiple vulkan Pipelines and associated variables, variable values, and memory, to accomplish a complete overall rendering / computational job. The Memory with Vars and Values are shared across all pipelines within a System.

    • Pipeline performs a specific chain of operations, using Shader program(s). In a graphics context, each pipeline typically handles a different type of material or other variation in rendering (textured vs. not, transparent vs. solid, etc).
    • Memory manages the memory, organized by Vars variables that are referenced in the shader programs, with each Var having any number of associated values in Values. Vars are organized into Sets that manage their bindings distinctly, and can be updated at different time scales. It has 4 different MemBuff buffers for different types of memory. It is assumed that the sizes of all the Values do not change frequently, so everything is Alloc'd afresh if any size changes. This avoids the need for complex de-fragmentation algorithms, and is maximally efficient, but is not good if sizes change (which is rare in most rendering cases).
  • Image manages a vulkan Image and associated ImageView, including potential host staging buffer (shared as in a Value or owned separately).

  • Texture extends the Image with a Sampler that defines how pixels are accessed in a shader.

  • Framebuffer manages an Image along with a RenderPass configuration for managing a Render target (shared for rendering onto a window Surface or an offscreen RenderFrame)

  • Surface represents the full hardware-managed Images associated with an actual on-screen Window. One can associate a System with a Surface to manage the Swapchain updating for effective double or triple buffering.

  • RenderFrame is an offscreen render target with Framebuffers and a logical device if being used without any Surface -- otherwise it should use the Surface device so images can be copied across them.

  • Unlike most game-oriented GPU setups, vGPU is designed to be used in an event-driven manner where render updates arise from user input or other events, instead of requiring a constant render loop taking place at all times (which can optionally be established too). The event-driven model is vastly more energy efficient for non-game applications.

Memory organization

Memory maintains a host-visible, mapped staging buffer, and a corresponding device-local memory buffer that the GPU uses to compute on (the latter of which is optional for unified memory architectures). Each Value records when it is modified, and a global Sync step efficiently transfers only what has changed. You must allocate and sync update a unique Value for each different value you will need for the entire render pass -- although you can dynamically select which Value to use for each draw command, you cannot in general update the actual data associated with these values during the course of a single rendering pass.

  • Vars variables define the Type and Role of data used in the shaders. There are 3 major categories of Var roles:
    • Vertex and Index represent mesh points etc that provide input to Vertex shader -- these are handled very differently from the others, and must be located in a VertexSet which has a set index of -2. The offsets into allocated Values are updated dynamically for each render Draw command, so you can Bind different Vertex Values as you iterate through objects within a single render pass (again, the underlying vals must be sync'd prior).
    • PushConst are push constants that can only be 128 bytes total that can be directly copied from CPU ram to the GPU via a command -- it is the most high-performance way to update dynamically changing content, such as view matricies or indexes into other data structures. Must be located in PushConstSet set (index -1).
    • Uniform (read-only "constants") and Storage (read-write) data that contain misc other data, e.g., transformation matricies. These are also updated dynamically using dynamic offsets, so you can also call BindDynValue methods to select different such vals as you iterate through objects. The original binding is done automatically in the Memory Config (via BindDynVarsAll) and usually does not need to be redone.
    • Texture vars that provide the raw Image data, the ImageView through which that is accessed, and a Sampler that parametrizes how the pixels are mapped onto coordinates in the Fragment shader. Each texture object is managed as a distinct item in device memory, and they cannot be accessed through a dynamic offset. Thus, a unique descriptor is provided for each texture Value, and your shader should describe them as an array. All such textures must be in place at the start of the render pass, and cannot be updated on the fly during rendering! Thus, you must dynamically bind a uniform variable or push constant to select which texture item from the array to use on a given step of rendering.
      • There is a low maximum number of Texture descriptors (vals) available within one descriptor set on many platforms, including the Mac, only 16, which is enforced via the MaxTexturesPerSet const. There are two (non mutually exclusive) strategies for increasing the number of available textures:
      • Each individual Texture can have up to 128 (again a low limit present on the Mac) layers in a 2d Array of images, in addition to all the separate texture vals being in an Array -- arrays of arrays. Each of the array layers must be the same size -- they are allocated and managed as a unit. The szalloc package provides manager for efficiently allocating images of various sizes to these 16 x 128 (or any N's) groups of layer arrays. This is integrated into the Values value manager and can be engaged by calling AllocTexBySize there. The texture UV coordinates need to be processed by the actual pct size of a given texture relative to the allocated group size -- this is all done in the vphong package and the texture_frag.frag file there can be consulted for a working example.
      • If you allocate more than 16 texture Values, then multiple entire collections of these descriptors will be allocated, as indicated by the NTextureDescs on VarSet and Vars (see that for more info, and the vdraw Draw method for an example). You can use Vars.BindAllTextureValues to bind all texture vals (iterating over NTextureDescs), and System.CmdBindTextureVarIndex to automatically bind the correct set. +, here's some glsl shader code showing how to use the sampler2DArray, as used in the vdraw draw_frag.frag fragment shader code:
#version 450
#extension GL_EXT_nonuniform_qualifier : require

// must use mat4 -- mat3 alignment issues are horrible.
// each mat4 = 64 bytes, so full 128 byte total, but only using mat3.
// pack the tex index into [0][3] of mvp,
// and the fill color into [3][0-3] of uvp
layout(push_constant) uniform Mtxs {
	mat4 mvp;
	mat4 uvp;
};

layout(set = 0, binding = 0) uniform sampler2DArray Tex[]; //
layout(location = 0) in vector2 uv;
layout(location = 0) out vector4 outputColor;

void main() {
	int idx = int(mvp[3][0]);   // packing into unused part of mat4 matrix push constant
	int layer = int(mvp[3][1]);
	outputColor = texture(Tex[idx], vector3(uv,layer)); // layer selection as 3rd dim here
}
  • Vars are accessed at a given location or binding number by the shader code, and these bindings can be organized into logical sets called DescriptorSets (see Graphics Rendering example below). In HLSL, these are specified like this: [[vk::binding(5, 1)]] Texture2D MyTexture2; for descriptor 5 in set 1 (set 0 is the first set).
    • You manage these sets explicitly by calling AddSet or AddVertexSet / AddPushConstSet to create new VarSets, and then add Vars directly to each set.
    • Values represent the values of Vars, with each Value representing a distinct value of a corresponding Var. The ConfigValues call on a given VarSet specifies how many Values to create per each Var within a Set -- this is a shared property of the Set, and should be a consideration in organizing the sets. For example, Sets that are per object should contain one Value per object, etc, while others that are per material would have one Value per material.
    • There is no support for interleaved arrays -- if you want to achieve such a thing, you need to use an array of structs, but the main use-case is for VertexInput, which actually works faster with non-interleaved simple arrays of vector4 points in most cases (e.g., for the points and their normals).
    • You can allocate multiple bindings of all the variables, with each binding being used for a different parallel threaded rendering pass. However, there is only one shared set of Values, so your logic will have to take that into account (e.g., allocating 2x Values per object to allow 2 separate threads to each update and bind their own unique subset of these Values for their own render pass). See discussion under Texture about how this is done in that case. The number of such DescriptorSets is configured in Memory.Vars.NDescs (defaults to 1). Note that these are orthogonal to the number of VarSets -- the terminology is confusing. Various methods take a descIndex to determine which such descriptor set to use -- typically in a threaded swapchain logic you would use the acquired frame index as the descIndex to determine what binding to use.

Naming conventions

  • New returns a new object
  • Init operates on existing object, doing initialization needed for subsequent setting of options
  • Config operates on an existing object and settings, and does everything to get it configured for use.
  • Destroy destroys allocated vulkan objects
  • Alloc is for allocating memory (vs. making a new object)
  • Free is for freeing memory (vs. destroying an object)

Graphics Rendering

See https://developer.nvidia.com/vulkan-shader-resource-binding for a clear description of DescriptorSets etc.

Here's a widely used rendering logic, supported by the Cogent Core Scene (and tbd std Pipeline), and how you should organize the Uniform data into different sets at each level, to optimize the binding overhead:

for each view {
  bind view resources [Set 0]         // camera, environment...
  for each shader type (based on material type: textured, transparent..) {
    bind shader pipeline  
    bind shader resources [Set 1]    // shader control values (maybe none)
    for each specific material {
      bind material resources  [Set 2] // material parameters and textures
      for each object {
        bind object resources  [Set 3] // object transforms
        draw object [VertexInput binding to locations]
        (only finally calls Pipeline here!)
      }
    }
  }
}

It is common practice to use different DescriptorSets for each level in the swapchain, for maintaining high FPS rates by rendering the next frame while the current one is still cooking along itself -- this is the NDescs parameter mentioned above.

Because everything is all packed into big buffers organized by different broad categories, in Memory, we exclusively use the Dynamic mode for Uniform and Storage binding, where the offset into the buffer is specified at the time of the binding call, not in advance in the descriptor set itself. This has only very minor performance implications and makes everything much more flexible and simple: just bind whatever variables you want and that data will be used.

The examples and provided vPhong package retain the Y-is-up coordinate system from OpenGL, which is more "natural" for the physical world, where the Y axis is the height dimension, and up is up, after all. Some of the defaults reflect this choice, but it is easy to use the native Vulkan Y-is-down coordinate system too.

Combining many pipeline renders per RenderPass

The various introductory tutorials all seem to focus on just a single simple render pass with one draw operation, but any realistic scene needs different settings for each object! As noted above, this requires dynamic binding, which is good for Uniforms and Vertex data, but you might not appreciate that this also requires that you pre-allocate and sync up to device memory all the Values that you will need for the entire render pass -- the dynamic binding only selects different offsets into memory buffers, but the actual contents of those buffers should not change during a single render pass (otherwise things will get very slow and lots of bad sync steps might be required, etc). The Memory system makes it easy to allocate, update, and dynamically bind these vals.

Here's some info on the logical issues:

This blog has a particularly clear discussion of the need for Texture arrays for managing textures within a render pass. This is automatically how Texture vars are managed .

GPU Accelerated Compute Engine

See examples/compute1 for a very simple compute shader, and compute.go for Compute* methods specifically useful for this case.

See the gosl repository for a tool that converts Go code into HLSL shader code, so you can effectively run Go on the GPU.

Here's how it works:

  • Each Vulkan Pipeline holds 1 compute shader program, which is equivalent to a kernel in CUDA -- this is the basic unit of computation, accomplishing one parallel sweep of processing across some number of identical data structures.

  • You must organize at the outset your Vars and Values in the System Memory to hold the data structures your shaders operate on. In general, you want to have a single static set of Vars that cover everything you'll need, and different shaders can operate on different subsets of these. You want to minimize the amount of memory transfer.

  • Because the vkQueueSubmit call is by far the most expensive call in Vulkan, you want to minimize those. This means you want to combine as much of your computation into one big Command sequence, with calls to various different Pipeline shaders (which can all be put in one command buffer) that gets submitted once, rather than submitting separate commands for each shader. Ideally this also involves combining memory transfers to / from the GPU in the same command buffer as well.

  • Although rarely used in graphics, the most important tool for synchronizing commands within a single command stream is the vkEvent, which is described a bit in the Khronos Blog. Much of vulkan discussion centers instead around Semaphores, but these are only used for synchronization between different commands --- each of which requires a different vkQueueSubmit (and is therefore suboptimal).

  • Thus, you should create named events in your compute System, and inject calls to set and wait on those events in your command stream.

Gamma Correction (sRGB vs Linear) and Headless / Offscreen Rendering

It is hard to find this info very clearly stated:

  • All internal computation in shaders is done in a linear color space.
  • Image textures are assumed to be sRGB and are automatically converted to linear on upload.
  • Other colors that are passed in should be converted from sRGB to linear (the Phong shaders do this).
  • The Surface automatically converts from Linear to sRGB for actual rendering.
  • A RenderFrame for offscreen / headless rendering must use vk.FormatR8g8b8a8Srgb for the format, in order to get back an image that is automatically converted back to sRGB format.

Mac Platform

To have the mac use the libMoltenVK.dylib installed by brew install molten-vk, you need to change the LDFLAGS here:

github.com/goki/vulkan/vulkan_darwin.go

#cgo darwin LDFLAGS: -L/opt/homebrew/lib -Wl,-rpath,/opt/homebrew/lib -F/Library/Frameworks -framework Cocoa -framework IOKit -framework IOSurface -framework QuartzCore -framework Metal -lMoltenVK -lc++

However it does not find the libvulkan which is not included in molten-vk.

Platform properties

See MACOS.md file for full report of properties on Mac.

These are useful for deciding what kinds of limits are likely to work in practice:

This is a significant constraint! need to work around it.

Note that this constraint is largely irrelevant because each dynamic descriptor can have an unlimited number of offset values used for it.