- Overview
- Running the Program
- Execution Model
- Models
- Scenarios
DxDispatch is a simple command-line executable for launching DirectX 12 compute programs without writing all the C++ boilerplate. The input to the tool is a JSON model that defines resources, dispatchables (compute shaders or DirectML operators), and commands to execute.
This guide is organized as follows:
- The first section (Running the Program) shows basic command-line usage.
- The second section (Execution Model) explains the steps involved in executing a model.
- The third section (Models) covers the model schema with examples.
- The final section (Scenarios) goes through some of the ways you might use the tool.
The most basic usage is to simply pass in the path to a JSON model. For example, this will run one of the examples:
> dxdispatch.exe .\models\dml_reduce.json
Running on 'NVIDIA GeForce RTX 2070 SUPER'
Resource 'input': 1, 2, 3, 4, 5, 6, 7, 8, 9
Resource 'output': 6, 15, 24
There are more options available to you. If you run the executable with no arguments (or `--help`) it will display the available options:
dxdispatch version 0.1.0
Usage:
dxdispatch [OPTION...] <PATH_TO_MODEL_JSON>
-d, --debug Enable D3D and DML debug layers
-b, --benchmark Show benchmarking information
-i, --dispatch_repeat arg [Benchmarking only] The number of times to
repeat each dispatch (default: 128)
-h, --help Print command-line usage help
-a, --adapter arg Substring to match a desired DirectX adapter
(default: )
-s, --show_adapters Show all available DirectX adapters
-q, --direct_queue Use a direct queue/lists (default is compute
queue/lists)
If your machine has multiple adapters you can display them:
> .\dxdispatch.exe -s
NVIDIA GeForce RTX 2070 SUPER
-Version: 27.21.14.5671
-Hardware: true
-Integrated: false
-Dedicated Adapter Memory: 7.82 GB
-Dedicated System Memory: 0 bytes
-Shared System Memory: 15.92 GB
Intel(R) UHD Graphics 630
-Version: 27.20.100.8681
-Hardware: true
-Integrated: true
-Dedicated Adapter Memory: 128.00 MB
-Dedicated System Memory: 0 bytes
-Shared System Memory: 15.92 GB
Microsoft Basic Render Driver
-Version: 10.0.19041.546
-Hardware: false
-Integrated: false
-Dedicated Adapter Memory: 0 bytes
-Dedicated System Memory: 0 bytes
-Shared System Memory: 15.92 GB
Discrete hardware adapters are preferred by default. You can specify a desired adapter by passing part of its name (matched by substring):
> dxdispatch.exe .\models\dml_reduce.json -a Intel
Running on 'Intel(R) UHD Graphics 630'
Resource 'input': 1, 2, 3, 4, 5, 6, 7, 8, 9
Resource 'output': 6, 15, 24
Before going into the model schema, it's important to understand how models are executed: the model abstraction makes it easy to experiment, but it also preserves low-level control and flexibility. The only way to preserve this flexibility is to keep the abstraction close to how D3D12 programs are written. This doc assumes that you're familiar with D3D12 concepts like resources (buffers/textures), command lists, command queues, and barriers.
DxDispatch is an extremely basic command-line-only tool that executes a single model and then exits. The steps involved in executing a model are below. If an error occurs at any point the process will (hopefully) print a helpful message and immediately terminate.
- Load & parse JSON model: the JSON model is converted into a C++ object representation
- Create device: an appropriate DX adapter is selected to perform the work (you have some control over this). The necessary D3D/DML interfaces (devices, command list, queues, fences, etc.) are created.
- Allocate resources: all resources defined in the model are allocated, initialized, and uploaded to GPU-visible memory. Completion of this step includes CPU/GPU synchronization.
- Initialize dispatchables: all dispatchables (i.e. DML ops or HLSL shaders) defined in the model are created, compiled, and initialized. Completion of this step includes CPU/GPU synchronization.
- Execute commands: all commands (e.g. dispatch, print resource) are processed in the order they're defined in the model. Each command is recorded into its own command list, and completion of each command includes CPU/GPU synchronization. All D3D work is done using a single D3D command list and command queue on a single thread. It is currently inefficient to issue multiple dispatch commands back-to-back since CPU/GPU synchronization will occur between each dispatch.
The execution model is imperative, so the order of commands matters and any side effects are permanent for the lifetime of the program. In particular, resource state will not be reinitialized for each dispatch command.
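For example, here is a sketch of a command sequence (using the model format described in the next section; the dispatchable and resource names are hypothetical) in which the second dispatch reads the buffer written by the first. Because resource state persists across commands, swapping the two dispatch commands would change the printed result:
"commands":
[
{ "type": "dispatch", "dispatchable": "scale", "bindings": { "InputTensor": "A", "OutputTensor": "B" } },
{ "type": "dispatch", "dispatchable": "sum", "bindings": { "InputTensor": "B", "OutputTensor": "Out" } },
{ "type": "print", "resource": "Out" }
]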
Each model is a collection of resources, dispatchables, and commands to execute. A model file has a root object with three members:
- Resources: a dictionary that maps each buffer's user-specified name to its definition. These resources will be initialized with the specified values.
- Dispatchables: a dictionary that maps names to dispatchable objects. A dispatchable object is either a DirectML operator or an HLSL compute shader.
- Commands: an array of instructions to run, such as executing dispatchables or printing resources.
Below is a very simple model that adds two buffer resources (using the DirectML API) and prints the result. This example has three resources named 'A', 'B', and 'Out' and a single dispatchable named 'simple_add'. The resource and dispatchable names can be any valid JSON string you like, but they must be unique within their respective dictionaries. You use these names to refer to the appropriate objects in model commands (e.g. binding resources).
{
"resources":
{
"A": { "initialValuesDataType": "FLOAT32", "initialValues": [1,2,3] },
"B": { "initialValuesDataType": "FLOAT32", "initialValues": [5,8,-5] },
"Out": { "initialValuesDataType": "FLOAT32", "initialValues": { "valueCount": 3, "value": 0 } }
},
"dispatchables":
{
"simple_add":
{
"type": "DML_OPERATOR_ELEMENT_WISE_ADD",
"desc":
{
"ATensor": { "DataType": "FLOAT32", "Sizes": [3] },
"BTensor": { "DataType": "FLOAT32", "Sizes": [3] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [3] }
}
}
},
"commands":
[
{
"type": "dispatch",
"dispatchable": "simple_add",
"bindings": { "ATensor": "A", "BTensor": "B", "OutputTensor": "Out" }
},
{ "type": "print", "resource": "Out" }
]
}
Each resource in the resources dictionary maps to a separate D3D resource (specifically a committed resource in a default heap). Textures are not yet implemented, so for now all resources must be buffers. Each resource declaration includes an initializer that is used to set the initial contents of the resource. Keep in mind that initializers don't restrict the usage of the resource in any way: you can initialize a resource with floats, for example, and later bind it as a buffer of UINT32.
All buffer resources have the following fields, though only the first two are required:
Member | JSON Type | Description |
---|---|---|
`initialValues` | Object or array | Determines initial contents of the buffer. |
`initialValuesDataType` | String (DML_TENSOR_DATA_TYPE) | The data type associated with the buffer's initial values. |
`sizeInBytes` | Number (UINT64) | Optional. Will be calculated if omitted. |
The following rules apply:
- The `initialValuesDataType` field must be a string form of DML_TENSOR_DATA_TYPE. You can use a shortened version that omits the `DML_TENSOR_DATA_TYPE_` prefix.
- JSON numbers will be converted to the appropriate data type in C++. You can also use the strings "nan", "inf", and "-inf", but only when the data type is floating-point.
- The buffer's `sizeInBytes` will, if not supplied, be calculated as the number of elements in `initialValues` multiplied by the size of `initialValuesDataType`. If supplied, `sizeInBytes` must be at least as large as the size calculated from the initial values. The total size will be inflated, if necessary, to meet DirectML's 4-byte alignment requirement (see the example after this list).
- The `initialValues` are always written to the start of the buffer; if `sizeInBytes` is larger than the implied size of the initial data then the end will be padded.
- The initial values are written into an upload-heap resource that is then copied to the default-heap buffer resource at startup; any commands that write into a buffer after startup will have a permanent effect on that buffer's contents for the duration of model execution.
The simplest way to initialize a buffer is to fill it with a single repeated value of a specific data type. The example below will write `[0, 0, 0, 0]` into the buffer.
- The `initialValuesDataType` must not be `"UNKNOWN"`.
- The `valueCount` must be larger than 0.
{
"initialValuesDataType": "FLOAT32",
"initialValues": { "valueCount": 4, "value": 0 }
}
You can also initialize a buffer using an array of values, all with the same type:
- The `initialValuesDataType` must not be `"UNKNOWN"`.
- The `initialValues` array size must be larger than 0.
{
"initialValuesDataType": "FLOAT16",
"initialValues": [ 65504, -65504, 4, 3, -4, -6, "nan", "inf", "-inf" ]
}
You can also initialize a buffer using a sequence. The example below will write `[1, 3.5, 6, 8.5]` into the buffer.
- The `initialValuesDataType` must not be `"UNKNOWN"`.
- The `valueCount` must be larger than 0.
{
"initialValuesDataType": "FLOAT32",
"initialValues": { "valueCount": 4, "valueStart": 1, "valueDelta": 2.5 }
}
You can also initialize a buffer using an array of elements with different types and sizes. The primary use for this initializer is recording values for a constant buffer used in an HLSL dispatchable.
- The `initialValuesDataType` must be `"UNKNOWN"`.
- Elements in `initialValues` must be JSON objects with the `type` and `value` fields set. The `name` field is optional and purely for the user (like a comment).
- The `initialValues` array size must be larger than 0.
- You'll want an alignment of 256 bytes if viewing this resource with a CBV, so take advantage of the `sizeInBytes` field!
{
"initialValuesDataType": "UNKNOWN",
"initialValues":
[
{ "name": "elementCount", "type": "UINT32", "value": 6 },
{ "name": "alpha", "type": "FLOAT32", "value": 6 },
{ "name": "beta", "type": "FLOAT32", "value": 2.3 },
],
"sizeInBytes": 256
}
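For reference, a constant buffer initialized like the one above might be consumed by an HLSL dispatchable with a declaration along these lines (this is just a sketch; the field names and types mirror the example and are not required by the tool):
cbuffer constants
{
    uint elementCount; // UINT32 in the initializer
    float alpha;       // FLOAT32 in the initializer
    float beta;        // FLOAT32 in the initializer
};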
Dispatchables are objects that can be written into a D3D command list. The model supports two types of dispatchables: DirectML operators and custom HLSL compute shaders.
The JSON format for defining operators closely mirrors the DirectML API for creating operators: you are filling out a `DML_OPERATOR_DESC` struct that will be used to instantiate an `IDMLOperator` object. Below is an example that creates a dispatchable using `DML_CONVOLUTION_OPERATOR_DESC`:
{
"type": "DML_OPERATOR_CONVOLUTION",
"desc":
{
"InputTensor": { "DataType": "FLOAT32", "Sizes": [1,1,3,3] },
"FilterTensor": { "DataType": "FLOAT32", "Sizes": [1,1,2,2] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [1,1,2,2] },
"Mode": "DML_CONVOLUTION_MODE_CROSS_CORRELATION",
"Direction": "DML_CONVOLUTION_DIRECTION_FORWARD",
"DimensionCount": 2,
"Strides": [1,1],
"Dilations": [1,1],
"StartPadding": [0,0],
"EndPadding": [0,0],
"OutputPadding": [0,0],
"GroupCount": 1
}
}
You must define all fields of the appropriate operator desc unless a field is marked as optional in the API. Optional fields in DirectML operator descs are generally pointers annotated with `_Maybenull_` or `_Field_size_opt_`. If you do not define an optional field in the JSON dispatchable then it will be a nullptr in the C++ definition. Additionally, there are some special parsing rules that make it more convenient to define DML dispatchables: take note of the abbreviated enum values and defaults for `DML_TENSOR_DESC` fields.
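As a small illustration (a sketch with arbitrary tensor sizes), the element-wise identity dispatchable below leaves the operator's optional `ScaleBias` field undefined, so it will be a nullptr when the desc is constructed:
{
"type": "DML_OPERATOR_ELEMENT_WISE_IDENTITY",
"desc":
{
"InputTensor": { "DataType": "FLOAT32", "Sizes": [4] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [4] }
}
}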
NOTE: take care to match the casing of the fields you set. Most of the JSON field names in the model start with a lowercase letter, but fields inside DML desc structs generally start with an uppercase letter.
You can execute custom compute shaders using an HLSL dispatchable. These objects will result in loading and compiling HLSL source at runtime, which can be very useful for prototyping.
The example below shows how to reference the shader in the JSON model:
{
"type": "hlsl",
"sourcePath": "models/add_fp16.hlsl",
"compiler": "dxc",
"compilerArgs":
[
"-T", "cs_6_2",
"-E", "CSMain",
"-D", "NUM_THREADS=6",
"-enable-16bit-types"
]
}
The file `models/add_fp16.hlsl` could, for example, contain the following:
StructuredBuffer<float16_t> inputA;
StructuredBuffer<float16_t> inputB;
RWStructuredBuffer<float16_t> output;
cbuffer constants { uint elementCount; };
[numthreads(NUM_THREADS, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
if (dtid.x < elementCount)
{
output[dtid.x] = inputA[dtid.x] + inputB[dtid.x];
}
}
Notes:
- Only the modern DXC compiler is supported at this time. I may add FXC support later.
- For command-line arguments refer to this page. Pay special attention to how the arguments should be split in the array.
- A compatible root signature will be generated automatically. You do not have control over the root signature and should not declare one inline in the HLSL.
- A single descriptor table will reference all shader resources (buffers). SRVs, UAVs, and CBVs will be created automatically by reflecting the HLSL source and using appropriate views. You have some control over these views when binding (discussed later).
- You may declare shader resources using any type of buffer view, but textures are not supported. Arrays of resources (e.g. `Buffer<float> inputs[2];`), including unbounded arrays, are not yet supported. This is on the backlog though!
- If you declare a resource in HLSL but do not reference it in the shader program then it will likely be optimized away! Binding failures will result if you try to bind a buffer in the model to an unused shader input.
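To tie the pieces together, here is a sketch of resources (the names are hypothetical) that the example shader above could be dispatched against. Note the FLOAT16 buffers and the 256-byte constant buffer holding `elementCount`; the matching dispatch command appears in the next section:
"resources":
{
"A": { "initialValuesDataType": "FLOAT16", "initialValues": [ 1, 2, 3, 4, 5, 6 ] },
"B": { "initialValuesDataType": "FLOAT16", "initialValues": [ 1, 1, 1, 1, 1, 1 ] },
"Out": { "initialValuesDataType": "FLOAT16", "initialValues": { "valueCount": 6, "value": 0 } },
"Constants":
{
"initialValuesDataType": "UNKNOWN",
"initialValues": [ { "name": "elementCount", "type": "UINT32", "value": 6 } ],
"sizeInBytes": 256
}
}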
A dispatch command records a dispatchable into a D3D command list and executes it. Each dispatch command has the following fields:
Member | Type | DML | HLSL | Description |
---|---|---|---|---|
type | String | Required | Required | Must be "dispatch". |
dispatchable | String | Required | Required | Name of the dispatchable object. |
bindings | String, Object, or Array | Required | Required | Maps bind points in the dispatchable to resources in the model. |
threadGroupCount | Array | - | Optional | Number of thread groups in X, Y, and Z dimensions. |
The only difference between DML and HLSL dispatches is that DML ops ignore the `threadGroupCount` field (it defaults to `[1,1,1]` for HLSL if omitted).
Below is an example of a dispatch command for a DML operator:
{
"type": "dispatch",
"dispatchable": "join",
"bindings":
{
"InputTensors": [ "A", "B", "C" ],
"OutputTensor": "Out"
}
}
Below is an example of a dispatch command for an HLSL operator:
{
"type": "dispatch",
"dispatchable": "add",
"threadGroupCount": [ 2, 1, 1 ],
"bindings":
{
"inputA": "A",
"inputB": "B",
"output": "Out",
"constants": "Constants"
}
}
The most important property to discuss in this command is `bindings`, which ties the resources in the model to the bind points in the dispatchable. The `bindings` field is a dictionary that maps the name of a bind point to an array of resource bindings. Bind points in DML operators are the names of `DML_TENSOR_DESC` fields in the operator desc. For example, the join operator has two bind points: `InputTensors` and `OutputTensor`. Each bind point may have one or more resources bound to it. In the case of join, `InputTensors` has a variable number of resources bound to it, which is determined using the value of `InputCount`. The `OutputTensor` bind point must have exactly one resource bound to it.
struct DML_JOIN_OPERATOR_DESC
{
UINT InputCount;
_Field_size_(InputCount) const DML_TENSOR_DESC* InputTensors;
const DML_TENSOR_DESC* OutputTensor;
UINT Axis;
};
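For reference, here is a sketch of how the 'join' dispatchable used above might be defined (the shapes are arbitrary, and it assumes that `_Field_size_` fields like `InputTensors` are written as JSON arrays of tensor descs):
"join":
{
"type": "DML_OPERATOR_JOIN",
"desc":
{
"InputCount": 3,
"InputTensors":
[
{ "DataType": "FLOAT32", "Sizes": [3] },
{ "DataType": "FLOAT32", "Sizes": [3] },
{ "DataType": "FLOAT32", "Sizes": [3] }
],
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [9] },
"Axis": 0
}
}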
Bind points in compute shaders are the shader input names. The following HLSL has 4 bind points: `inputA`, `inputB`, `output`, and `constants`:
StructuredBuffer<float16_t> inputA;
StructuredBuffer<float16_t> inputB;
RWStructuredBuffer<float16_t> output;
cbuffer constants { uint elementCount; };
This command is used to print the contents of a resource to stdout. If the resource lives in a GPU-visible-only heap then it will first be downloaded into a CPU-visible readback heap. Buffers are always printed as a flat 1D view of elements: the data type and number of elements displayed are derived from the resource's initializer. More control over printing may be added in the future.
{
"type": "Print",
"resource": "Out"
}
DirectML "desc" structs use a type enumeration and void*
pair to achieve polymorphism:
DML_OPERATOR_DESC
is used to describe any type of operator (e.g. element-wise identity).DML_TENSOR_DESC
is used to describe any type of tensor (currently only buffers).
These high-level desc structs have no fields except for <enum> Type
and const void* Desc
; their only purpose is to give the API a way to communicate a polymorphic type to DirectML without using IUnknown
or rev'ing structs and APIs for each new subtype that's added. As such, it's usually not useful to retain the exact layout of these structs when representing objects in JSON.
For example, instead of writing the `FusedActivation` member like this:
{
"type": "DML_OPERATOR_ELEMENT_WISE_ADD1",
"desc":
{
"ATensor": { "DataType": "FLOAT32", "Sizes": [3] },
"BTensor": { "DataType": "FLOAT32", "Sizes": [3] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [3] },
"FusedActivation": { "Type": "ACTIVATION_ELU", "Desc": { "Alpha": 0.5 } }
}
}
... it's easier to write the field `Alpha` inline without the extra `Desc` level:
{
"type": "DML_OPERATOR_ELEMENT_WISE_ADD1",
"desc":
{
"ATensor": { "DataType": "FLOAT32", "Sizes": [3] },
"BTensor": { "DataType": "FLOAT32", "Sizes": [3] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [3] },
"FusedActivation": { "Type": "ACTIVATION_ELU", "Alpha": 0.5 }
}
}
When defining DirectML desc structs in JSON, the parser will recognize either of the above formats. However, if the subtype itself has a field named `Type` then it will collide with the `Type` field of the higher-level struct; in this case you must use the longer format and explicitly list fields under the `Desc` member:
{
"Type": "DML_OPERATOR_RANDOM_GENERATOR",
"Desc":
{
"InputTensor": { ... },
"OutputTensor": { ... },
"OutputStateTensor": { ... },
"Type": "DML_RANDOM_GENERATOR_TYPE_PHILOX_4X32_10"
}
}
DirectML enumerations follow a naming convention where values are prefixed by their respective type name. The JSON parser allows initializing enum values by either their full name or their suffix. Below are some examples:
Type | Example Value | Example Value Suffix |
---|---|---|
DML_TENSOR_TYPE | DML_TENSOR_TYPE_BUFFER | BUFFER |
DML_TENSOR_DATA_TYPE | DML_TENSOR_DATA_TYPE_FLOAT32 | FLOAT32 |
DML_REDUCE_FUNCTION | DML_REDUCE_FUNCTION_SUM | SUM |
DML_OPERATOR_TYPE | DML_OPERATOR_GEMM | GEMM |
DML_EXECUTION_FLAGS | DML_EXECUTION_FLAG_DESCRIPTORS_VOLATILE | DESCRIPTORS_VOLATILE |
Take note of the few odd cases that don't follow the usual rule exactly:
- Enum values of type `DML_OPERATOR_TYPE` omit `_TYPE` from their prefix. It's `DML_OPERATOR_GEMM`, not `DML_OPERATOR_TYPE_GEMM`.
- Flag values are singular and omit the "S". It's `DML_EXECUTION_FLAG_NONE`, not `DML_EXECUTION_FLAGS_NONE`.
Since tensor descs are so common, the JSON parser provides default values for most fields.
Field | DML Type | JSON Type | Required | Default |
---|---|---|---|---|
`Type` | DML_TENSOR_TYPE | String | No | DML_TENSOR_TYPE_BUFFER |
`DataType` | DML_TENSOR_DATA_TYPE | String | Yes | - |
`Flags` | DML_TENSOR_FLAGS | String | No | DML_TENSOR_FLAG_NONE |
`DimensionCount` | UINT | Integer | No | Inferred from size of `Sizes` field |
`Sizes` | UINT* | List | Yes | - |
`Strides` | UINT* | List | No | null |
`TotalTensorSizeInBytes` | UINT64 | Integer | No | Inferred (packed) from `Sizes`, `Strides`, and `DataType` |
`GuaranteedBaseOffsetAlignment` | UINT | Integer | No | 0 |
Below is an example of a fully defined `DML_TENSOR_DESC`:
{
"Type": "DML_TENSOR_TYPE_BUFFER",
"Desc":
{
"DataType": "DML_TENSOR_DATA_TYPE_FLOAT32",
"Flags": "DML_TENSOR_FLAG_NONE",
"DimensionCount": 4,
"Sizes": [1,1,2,3],
"Strides": [6,6,3,1],
"TotalTensorSizeInBytes": 24,
"GuaranteedBaseOffsetAlignment": 0
}
}
The following `DML_TENSOR_DESC` in JSON is equivalent to the definition above:
{
"DataType": "FLOAT32",
"Sizes": [1,1,2,3]
}
In the simplest case you provide a resource binding by its name only (e.g. `"inputA": "A"`). However, you also have the option of providing additional information to view a subrange of the resource or reinterpret its type. Below is an example that fills out a binding object with these additional properties:
Binding object:
{
"name": "A",
"format": "R32_FLOAT",
"elementCount": 12,
"elementStride": 4,
"elementOffset": 16
}
Member | Type | DML | HLSL | Description |
---|---|---|---|---|
`name` | String | Required | Required | Name of the model buffer to bind. |
`format` | String (DXGI_FORMAT) | - | Optional | Format to use for the buffer view. |
`elementCount` | Number (UINT32) | - | Optional | Number of elements in the buffer. |
`elementStride` | Number (UINT32) | - | Optional | Size of each element (or structure) in the buffer. |
`elementOffset` | Number (UINT32) | Optional | Optional | Offset (measured in elements) to the first element. |
When binding with only a name the above values default to viewing the portion of the resource that is initialized when declared in the model. So, for example, if you have a 64KB buffer named "A" that is initialized with 3x FLOAT32 values, the default view will have `elementCount=3`, `elementStride=4`, `elementOffset=0`, and `format=R32_FLOAT`. This is generally what you want, but the extra parameters can be useful for experimentation.
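In other words, for that hypothetical buffer the shorthand binding `"inputA": "A"` is equivalent to the fully expanded binding object:
"inputA": { "name": "A", "format": "R32_FLOAT", "elementCount": 3, "elementStride": 4, "elementOffset": 0 }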
Finally, recall that you may bind multiple resources to a bind point. This means a bind point in the model `bindings` may map to a string, an object, or an array of strings/objects. The example below shows advanced binding that mixes objects and strings, including binding different regions of the same resource "A":
{
"type": "dispatch",
"dispatchable": "join",
"bindings":
{
"InputTensors":
[
{ "name": "A", "elementCount": 3, "elementOffset": 0 },
"B",
{ "name": "A", "elementCount": 5, "elementOffset": 4 }
],
"OutputTensor": "Out"
}
}
It's quite easy to make a mistake when creating, compiling, and binding DirectML operators. The error codes returned from an invalid API call are rarely useful enough to diagnose the problem, so these libraries have dedicated debug layers that can be turned on during development and turned off (for efficiency) in deployment. You can enable these debug layers using the `--debug` option when running the program. For example, let's say you call the reduce operator incorrectly with the following dispatchable:
"sum":
{
"type": "DML_OPERATOR_REDUCE",
"desc":
{
"InputTensor": { "DataType": "FLOAT32", "Sizes": [1,1,3,3] },
"OutputTensor": { "DataType": "FLOAT32", "Sizes": [1,1,1,1] },
"Function": "DML_REDUCE_FUNCTION_SUM",
"AxisCount": 1,
"Axes": [3]
}
}
The output shape in the above code is wrong: only axis 3 is indicated as being reduced, so the output sizes should be [1,1,3,1] rather than [1,1,1,1]. If you run the model normally it will simply tell you that creating the operator failed:
> dxdispatch.exe .\models\dml_reduce.json
Running on 'NVIDIA GeForce RTX 2070 SUPER'
ERROR: failed to execute the model.
ERROR creating dispatchable 'sum'
With the debug layers enabled you'll get a helpful message from the DirectML debug layers:
> dxdispatch.exe .\models\dml_reduce.json -d
Running on 'NVIDIA GeForce RTX 2070 SUPER'
DML_OPERATOR_REDUCE: expected OutputTensor.Sizes[2] to be InputTensor.Sizes[2] since dimension 2 is not reduced, but 1 and 3 were provided
ERROR: failed to execute the model.
ERROR creating dispatchable 'sum'
Likewise, you will get output from the D3D debug layers (including GPU-based validation). If you bind a buffer that's too small:
Running on 'NVIDIA GeForce RTX 2070 SUPER'
ERROR while binding resources: dxdispatch.exe!00007FF788C0F7AD: (caller: 00007FF788C95E2D) Exception(1) tid(50dc) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
[DmlDispatchable::Bind(m_device->DML()->GetDeviceRemovedReason())]
Error during binding for dispatchable 'sum': The value of SizeInBytes is too small. 4 was provided, but 36 was expected.
Resource 'input': 1
Resource 'output': 0, 0, 0
The debug layers won't catch everything, but they can be extremely helpful. Keep in mind that running with the `--debug` option will greatly reduce performance, so don't use it with benchmarking!
The `--benchmark` option gives you a simple way to measure the elapsed time of a dispatch. It's usually better to profile using a dedicated tool like PIX, but this option can be useful for quickly scripting multiple executions with different arguments (e.g. problem size).
> dxdispatch.exe .\models\dml_reduce.json -b
Running on 'NVIDIA GeForce RTX 2070 SUPER'
Dispatch 'sum': 0.00671445 ms
Resource 'input': 1, 2, 3, 4, 5, 6, 7, 8, 9
Resource 'output': 6, 15, 24
When `--benchmark` mode is enabled, each dispatch command will record multiple instances of the same dispatchable into the same command list and divide the total execution duration by the number of instances. You can control the number of repeats with `--dispatch_repeat <i>` (the default is shown in the help output above). For example, the following invocation will record the DML reduce operator 1000 times:
> dxdispatch.exe .\models\dml_reduce.json -b -i 1000
Running on 'NVIDIA GeForce RTX 2070 SUPER'
Dispatch 'sum': 0.00562345 ms
Resource 'input': 1, 2, 3, 4, 5, 6, 7, 8, 9
Resource 'output': 6, 15, 24
A larger number of repeats may give a more accurate time. This is a very simplistic benchmarking technique and may be improved in the future.
For a deeper look into performance you'll want to use a dedicated profiling tool like PIX. Hardware vendors also provide their own profiling tools that should be compatible. Using these tools is outside the scope of this guide, but as an example here's a GPU capture with PIX:
You'll notice that resources and dispatches are labeled with the names of objects in the model. This makes it easy to map D3D objects in the capture back to your model objects:
The image below shows some of the occupancy and timing data captured (the dml_reduce.json model was modified to have ~4M elements, since reducing 9 elements is not interesting enough for profiling). As you might expect, reduction has low arithmetic complexity and thus the ALU/FMA pipes are not too busy; however, there is good occupancy on the cores and some degree of L1 cache hits. Reduction is memory-bandwidth bound.
The profiling examples in the previous section used a DML operator, but of course you can profile your custom shaders using the same approach. HLSL dispatchables can also be debugged and modified on the fly in PIX. Make sure to add `-Zi` (or `-Zs`) and `-Od` to your HLSL dispatchable arguments, which will disable optimizations and produce a PDB for your compiled shader. You can then launch your HLSL model, step through it, and even make changes all within PIX:
Below is a modified version of `add_fp16.json` used in the example above. Within PIX you may need to resolve the PDB path (it should be generated in the working directory from which dxdispatch.exe was launched).
"add":
{
"type": "hlsl",
"sourcePath": "models/add_fp16.hlsl",
"compiler": "dxc",
"compilerArgs":
[
"-T", "cs_6_2",
"-E", "CSMain",
"-D", "NUM_THREADS=6",
"-enable-16bit-types",
"-Zs", "-Od"
]
}