diff --git a/appendices/VK_AMDX_shader_enqueue.adoc b/appendices/VK_AMDX_shader_enqueue.adoc index ff0014425..de7501347 100644 --- a/appendices/VK_AMDX_shader_enqueue.adoc +++ b/appendices/VK_AMDX_shader_enqueue.adoc @@ -7,7 +7,7 @@ include::{generated}/meta/{refprefix}VK_AMDX_shader_enqueue.adoc[] === Other Extension Metadata *Last Modified Date*:: - 2021-07-22 + 2024-07-17 *Provisional*:: @@ -30,12 +30,14 @@ between revisions, and before final release.* === Description -This extension adds the ability for developers to enqueue compute shader -workgroups from other compute shaders. +This extension adds the ability for developers to enqueue mesh +and compute shader workgroups from other compute shaders. include::{generated}/interfaces/VK_AMDX_shader_enqueue.adoc[] === Version History + * Revision 2, 2024-07-17 (Tobias Hector) + ** Add mesh nodes * Revision 1, 2021-07-22 (Tobias Hector) ** Initial revision diff --git a/appendices/spirvenv.adoc b/appendices/spirvenv.adoc index 6a9dd214f..752560e9a 100644 --- a/appendices/spirvenv.adoc +++ b/appendices/spirvenv.adoc @@ -2168,10 +2168,18 @@ endif::VK_KHR_maintenance5[] ifdef::VK_AMDX_shader_enqueue[] * [[VUID-{refpage}-ShaderEnqueueAMDX-09191]] The code:ShaderEnqueueAMDX capability must: only be used in shaders with - the code:GLCompute execution model + the code:GLCompute +ifdef::VK_EXT_mesh_shader[] + or code:MeshEXT +endif::VK_EXT_mesh_shader[] + execution model * [[VUID-{refpage}-NodePayloadAMDX-09192]] Variables in the code:NodePayloadAMDX storage class must: only be - declared in the code:GLCompute execution model + declared in the code:GLCompute +ifdef::VK_EXT_mesh_shader[] + or code:MeshEXT +endif::VK_EXT_mesh_shader[] + execution model * [[VUID-{refpage}-maxExecutionGraphShaderPayloadSize-09193]] Variables declared in the code:NodePayloadAMDX storage class must: not be larger than the <libraries must: be either a compute pipeline or an - execution graph pipeline + pname:pLibraryInfo->pLibraries must: be either 
a compute pipeline, + an execution graph pipeline, or a graphics pipeline + * If pname:pLibraryInfo is not `NULL`, each element of + pname:pLibraryInfo->pLibraries that is a compute pipeline + or a graphics pipeline must: have been created with + ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX set + * If the <> feature + is not enabled, and pname:pLibraryInfo->pLibraries is not `NULL`, + pname:pLibraryInfo->pLibraries must: not contain any graphics pipelines +ifdef::VK_EXT_graphics_pipeline_library[] + * Any element of pname:pLibraryInfo->pLibraries identifying a + graphics pipeline must: have been created with + <> +endif::VK_EXT_graphics_pipeline_library[] * [[VUID-VkExecutionGraphPipelineCreateInfoAMDX-None-09134]] There must: be no two nodes in the pipeline that share both the same shader name and index, as specified by @@ -166,6 +183,11 @@ include::{chapters}/commonvalidity/compute_graph_pipeline_create_info_common.ado matches the shader name of any other node in the graph, the size of the output payload must: match the size of the input payload in the matching node + * If pname:flags does not include ename:VK_PIPELINE_CREATE_LIBRARY_BIT_KHR, + and an output payload declared in any shader in the pipeline does not + have a code:PayloadNodeSparseArrayAMDX decoration, there must: be a node + in the graph corresponding to every index from 0 to its + code:PayloadNodeArraySizeAMDX decoration **** include::{generated}/validity/structs/VkExecutionGraphPipelineCreateInfoAMDX.adoc[] @@ -215,6 +237,12 @@ By associating multiple shaders with the same name but different indexes, applications can dynamically select different nodes to execute. Applications must: ensure each node has a unique name and index. +[NOTE] +==== +Shaders with the same name must: be of the same type - e.g. a compute and +graphics shader, or even two compute shaders where one is coalescing and the +other is not, cannot share the same name. 
+==== include::{generated}/validity/structs/VkPipelineShaderStageNodeCreateInfoAMDX.adoc[] -- @@ -227,7 +255,7 @@ graph, call: include::{generated}/api/protos/vkGetExecutionGraphPipelineNodeIndexAMDX.adoc[] - * pname:device is the that pname:executionGraph was created on. + * pname:device is the logical device that pname:executionGraph was created on. * pname:executionGraph is the execution graph pipeline to query the internal node index for. * pname:pNodeInfo is a pointer to a @@ -269,7 +297,7 @@ To query the scratch space required to dispatch an execution graph, call: include::{generated}/api/protos/vkGetExecutionGraphPipelineScratchSizeAMDX.adoc[] - * pname:device is the that pname:executionGraph was created on. + * pname:device is the logical device that pname:executionGraph was created on. * pname:executionGraph is the execution graph pipeline to query the scratch space for. * pname:pSizeInfo is a pointer to a @@ -293,8 +321,18 @@ include::{generated}/api/structs/VkExecutionGraphPipelineScratchSizeAMDX.adoc[] * pname:sType is a elink:VkStructureType value identifying this structure. * pname:pNext is `NULL` or a pointer to a structure extending this structure. - * pname:size indicates the scratch space required for dispatch the queried - execution graph. + * pname:minSize indicates the minimum scratch space required for + dispatching the queried execution graph. + * pname:maxSize indicates the maximum scratch space that can be used for + dispatching the queried execution graph. + * pname:sizeGranularity indicates the granularity at which the scratch space can be + increased from pname:minSize. + +Applications can: use any amount of scratch memory greater than +pname:minSize for dispatching a graph, however only the values equal to pname:minSize ++ an integer multiple of pname:sizeGranularity will be used. +Greater values may: result in higher performance, up to pname:maxSize which indicates the most memory +that an implementation can use effectively. 
include::{generated}/validity/structs/VkExecutionGraphPipelineScratchSizeAMDX.adoc[] -- @@ -309,16 +347,16 @@ include::{generated}/api/protos/vkCmdInitializeGraphScratchMemoryAMDX.adoc[] * pname:commandBuffer is the command buffer into which the command will be recorded. - * pname:scratch is a pointer to the scratch memory to be initialized. + * pname:executionGraph is the execution graph pipeline to initialize the + scratch memory for. + * pname:scratch is the address of scratch memory to be initialized. + * pname:scratchSize is a range in bytes of scratch memory to be initialized. This command must: be called before using pname:scratch to dispatch the currently bound execution graph pipeline. Execution of this command may: modify any memory locations in the range -[pname:scratch,pname:scratch + pname:size), where pname:size is the value -returned in slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:size by -slink:VkExecutionGraphPipelineScratchSizeAMDX for the currently bound -execution graph pipeline. +[pname:scratch,pname:scratch + pname:scratchSize). Accesses to this memory range are performed in the ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT and @@ -327,17 +365,17 @@ ename:VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT access flags. If any portion of pname:scratch is modified by any command other than flink:vkCmdDispatchGraphAMDX, flink:vkCmdDispatchGraphIndirectAMDX, flink:vkCmdDispatchGraphIndirectCountAMDX, or -fname:vkCmdInitializeGraphScratchMemoryAMDX with the same execution graph, +flink:vkCmdInitializeGraphScratchMemoryAMDX with the same execution graph, it must: be reinitialized for the execution graph again before dispatching against it. 
.Valid Usage **** - * [[VUID-vkCmdInitializeGraphScratchMemoryAMDX-scratch-09143]] - pname:scratch must: be the device address of an allocated memory range - at least as large as the value of - slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:size returned by - slink:VkExecutionGraphPipelineScratchSizeAMDX for the currently bound + * pname:scratch must: be the device address of an allocated memory range + at least as large as pname:scratchSize + * pname:scratchSize must: be greater than or equal to + slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:minSize returned by + flink:vkGetExecutionGraphPipelineScratchSizeAMDX for the currently bound execution graph pipeline. * [[VUID-vkCmdInitializeGraphScratchMemoryAMDX-scratch-09144]] pname:scratch must: be a multiple of 64 @@ -363,7 +401,8 @@ include::{generated}/api/protos/vkCmdDispatchGraphAMDX.adoc[] * pname:commandBuffer is the command buffer into which the command will be recorded. - * pname:scratch is a pointer to the scratch memory to be used. + * pname:scratch is the address of scratch memory to be used. + * pname:scratchSize is a range in bytes of scratch memory to be used. * pname:pCountInfo is a host pointer to a slink:VkDispatchGraphCountInfoAMDX structure defining the nodes which will be initially executed. @@ -372,6 +411,11 @@ When this command is executed, the nodes specified in pname:pCountInfo are executed. Nodes executed as part of this command are not implicitly synchronized in any way against each other once they are dispatched. +There are no rasterization order guarantees between separately dispatched +graphics nodes, though individual primitives within a single dispatch do +adhere to rasterization order. +Draw calls executed before or after the execution graph also execute relative to +each graphics node with respect to rasterization order. For this command, all device/host pointers in substructures are treated as host pointers and read only during host execution of this command. 
@@ -379,14 +423,15 @@ Once this command returns, no reference to the original pointers is retained. Execution of this command may: modify any memory locations in the range -[pname:scratch,pname:scratch + pname:size), where pname:size is the value -returned in slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:size by -slink:VkExecutionGraphPipelineScratchSizeAMDX for the currently bound -execution graph pipeline Accesses to this memory range are performed in the +[pname:scratch,pname:scratch + pname:scratchSize). +Accesses to this memory range are performed in the ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT and ename:VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT access flags. +This command <> for mesh nodes similarly to draw commands. + .Valid Usage **** include::{chapters}/commonvalidity/dispatch_graph_common.adoc[] @@ -431,7 +476,8 @@ include::{generated}/api/protos/vkCmdDispatchGraphIndirectAMDX.adoc[] * pname:commandBuffer is the command buffer into which the command will be recorded. - * pname:scratch is a pointer to the scratch memory to be used. + * pname:scratch is the address of scratch memory to be used. + * pname:scratchSize is a range in bytes of scratch memory to be used. * pname:pCountInfo is a host pointer to a slink:VkDispatchGraphCountInfoAMDX structure defining the nodes which will be initially executed. @@ -440,6 +486,11 @@ When this command is executed, the nodes specified in pname:pCountInfo are executed. Nodes executed as part of this command are not implicitly synchronized in any way against each other once they are dispatched. +There are no rasterization order guarantees between separately dispatched +graphics nodes, though individual primitives within a single dispatch do +adhere to rasterization order. +Draw calls executed before or after the execution graph also execute relative to +each graphics node with respect to rasterization order. 
For this command, all device/host pointers in substructures are treated as device pointers and read during device execution of this command. @@ -450,15 +501,15 @@ ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT access flag. Execution of this command may: modify any memory locations in the range -[pname:scratch,pname:scratch + pname:size), where pname:size is the value -returned in slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:size by -slink:VkExecutionGraphPipelineScratchSizeAMDX for the currently bound -execution graph pipeline. +[pname:scratch,pname:scratch + pname:scratchSize). Accesses to this memory range are performed in the ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT and ename:VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT access flags. +This command <> for mesh nodes similarly to draw commands. + .Valid Usage **** include::{chapters}/commonvalidity/dispatch_graph_common.adoc[] @@ -525,7 +576,8 @@ include::{generated}/api/protos/vkCmdDispatchGraphIndirectCountAMDX.adoc[] * pname:commandBuffer is the command buffer into which the command will be recorded. - * pname:scratch is a pointer to the scratch memory to be used. + * pname:scratch is the address of scratch memory to be used. + * pname:scratchSize is a range in bytes of scratch memory to be used. * pname:countInfo is a device address of a slink:VkDispatchGraphCountInfoAMDX structure defining the nodes which will be initially executed. @@ -544,10 +596,7 @@ ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT access flag. 
Execution of this command may: modify any memory locations in the range -[pname:scratch,pname:scratch + pname:size), where pname:size is the value -returned in slink:VkExecutionGraphPipelineScratchSizeAMDX::pname:size by -slink:VkExecutionGraphPipelineScratchSizeAMDX for the currently bound -execution graph pipeline. +[pname:scratch,pname:scratch + pname:scratchSize). Accesses to this memory range are performed in the ename:VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT pipeline stage with the ename:VK_ACCESS_2_SHADER_STORAGE_READ_BIT and @@ -732,3 +781,37 @@ The number of invocations coalesced into a given workgroup in this way can: be queried via the <> built-in. Any values in the payload have no effect on execution. + +ifdef::VK_EXT_mesh_shader[] +[[executiongraphs-meshnodes]] +=== Mesh Nodes + +Graphics pipelines added as nodes to an execution graph are executed in a +manner similar to a flink:vkCmdDrawMeshTasksIndirectEXT, using the same +payloads as compute shaders, but capturing some state from the command buffer. 
+ +[[executiongraphs-meshnodes-statecapture]] +When an execution graph dispatch is recorded into a command buffer, it +captures the following dynamic state for use with draw nodes: + + * `VK_DYNAMIC_STATE_VIEWPORT` + * `VK_DYNAMIC_STATE_SCISSOR` + * `VK_DYNAMIC_STATE_LINE_WIDTH` + * `VK_DYNAMIC_STATE_DEPTH_BIAS` + * `VK_DYNAMIC_STATE_BLEND_CONSTANTS` + * `VK_DYNAMIC_STATE_DEPTH_BOUNDS` +ifdef::VK_VERSION_1_3,VK_EXT_extended_dynamic_state[] + * `VK_DYNAMIC_STATE_VIEWPORT_WITH_COUNT` + * `VK_DYNAMIC_STATE_SCISSOR_WITH_COUNT` +endif::VK_VERSION_1_3,VK_EXT_extended_dynamic_state[] +ifdef::VK_EXT_sample_locations[] + * `VK_DYNAMIC_STATE_SAMPLE_LOCATIONS_EXT` +endif::VK_EXT_sample_locations[] +ifdef::VK_KHR_fragment_shading_rate[] + * `VK_DYNAMIC_STATE_FRAGMENT_SHADING_RATE_KHR` +endif::VK_KHR_fragment_shading_rate[] + +Other state is not captured, and graphics pipelines must: not be created +with other dynamic states when used as a library in an execution graph +pipeline. +endif::VK_EXT_mesh_shader[] \ No newline at end of file diff --git a/chapters/features.adoc b/chapters/features.adoc index eac910c66..eb41d6c3f 100644 --- a/chapters/features.adoc +++ b/chapters/features.adoc @@ -7386,6 +7386,9 @@ This structure describes the following feature: * [[features-shaderEnqueue]] pname:shaderEnqueue indicates whether the implementation supports <>. + * [[features-shaderMeshEnqueue]] pname:shaderMeshEnqueue indicates whether the + implementation supports + <>. :refpage: VkPhysicalDeviceShaderEnqueueFeaturesAMDX include::{chapters}/features.adoc[tag=features] diff --git a/chapters/limits.adoc b/chapters/limits.adoc index e87c17669..44dda0180 100644 --- a/chapters/limits.adoc +++ b/chapters/limits.adoc @@ -4734,6 +4734,13 @@ structure describe the following limits: pname:executionGraphDispatchAddressAlignment specifies the alignment of non-scratch basetype:VkDeviceAddress arguments consumed by graph dispatch commands.
+ * [[limits-maxExecutionGraphWorkgroupCount]] + pname:maxExecutionGraphWorkgroupCount[3] is the maximum number of + local workgroups that a shader can: be dispatched with in X, Y, and Z + dimensions, respectively. + * [[limits-maxExecutionGraphWorkgroups]] + pname:maxExecutionGraphWorkgroups is the total number of + local workgroups that a shader can: be dispatched with. :refpage: VkPhysicalDeviceShaderEnqueuePropertiesAMDX include::{chapters}/limits.adoc[tag=limits_desc] @@ -5445,6 +5452,9 @@ ifdef::VK_AMDX_shader_enqueue[] | code:uint32_t | pname:maxExecutionGraphShaderPayloadSize | `<>` | code:uint32_t | pname:maxExecutionGraphShaderPayloadCount | `<>` | code:uint32_t | pname:executionGraphDispatchAddressAlignment | `<>` +| code:uint32_t | pname:maxExecutionGraphVertexBufferBindings | `<>` +| 3 {times} code:uint32_t | pname:maxExecutionGraphWorkgroupCount | `<>` +| code:uint32_t | pname:maxExecutionGraphWorkgroups | `<>` endif::VK_AMDX_shader_enqueue[] ifdef::VK_EXT_device_generated_commands[] | code:uint32_t | pname:maxIndirectShaderObjectCount | `<>` @@ -5959,6 +5969,9 @@ ifdef::VK_AMDX_shader_enqueue[] | pname:maxExecutionGraphShaderPayloadSize | - | 32768 | min | pname:maxExecutionGraphShaderPayloadCount | - | 256 | min | pname:executionGraphDispatchAddressAlignment | - | 4 | max +| pname:maxExecutionGraphVertexBufferBindings | - | 1024 | min +| pname:maxExecutionGraphWorkgroupCount | - | (65535,65535,65535) | min +| pname:maxExecutionGraphWorkgroups | - | 2^24^-1 | min endif::VK_AMDX_shader_enqueue[] ifdef::VK_NV_extended_sparse_address_space[] | pname:extendedSparseAddressSpaceSize | 0 | pname:sparseAddressSpaceSize | min diff --git a/chapters/pipelines.adoc b/chapters/pipelines.adoc index 0699a10fc..c91023261 100644 --- a/chapters/pipelines.adoc +++ b/chapters/pipelines.adoc @@ -2588,6 +2588,27 @@ ifdef::VK_KHR_line_rasterization,VK_EXT_line_rasterization[] slink:VkPipelineRasterizationLineStateCreateInfoKHR must: be in the range [eq]#[1,256]# 
endif::VK_KHR_line_rasterization,VK_EXT_line_rasterization[] +ifdef::VK_AMDX_shader_enqueue[] + * If <> is not enabled, + shaders specified by pname:pStages must: not declare the + code:ShaderEnqueueAMDX capability + * If pname:flags does not include + ename:VK_PIPELINE_CREATE_LIBRARY_BIT_KHR, shaders specified by + pname:pStages must: not declare the code:ShaderEnqueueAMDX capability + * If any shader stages in pname:pStages declare the code:ShaderEnqueueAMDX + capability, ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX and + ename:VK_PIPELINE_CREATE_2_LIBRARY_BIT_KHR must: be included in + pname:flags + * If ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX is included in + pname:flags, and the pipeline requires + <>, there must: not be a task or vertex shader specified in + pname:pStages + * If ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX is included in + pname:flags, all elements of + slink:VkPipelineLibraryCreateInfoKHR::pname:pLibraries must: have been + created with ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX +endif::VK_AMDX_shader_enqueue[] ifdef::VK_KHR_ray_tracing_pipeline[] * [[VUID-VkGraphicsPipelineCreateInfo-flags-03372]] pname:flags must: not include @@ -3494,14 +3515,25 @@ ifdef::VK_EXT_graphics_pipeline_library[] If the <> feature is not enabled, endif::VK_EXT_graphics_pipeline_library[] +ifdef::VK_AMDX_shader_enqueue[] +ifdef::VK_EXT_graphics_pipeline_library[and if] +ifndef::VK_EXT_graphics_pipeline_library[If] + the <> feature is + not enabled, +endif::VK_AMDX_shader_enqueue[] pname:flags must: not include ename:VK_PIPELINE_CREATE_LIBRARY_BIT_KHR endif::VK_KHR_pipeline_library[] ifdef::VK_EXT_graphics_pipeline_library[] * [[VUID-VkGraphicsPipelineCreateInfo-flags-06608]] - If the pipeline defines, or includes as libraries, all the state subsets - required for a <>, pname:flags must: not include - ename:VK_PIPELINE_CREATE_LIBRARY_BIT_KHR + {empty} +ifdef::VK_AMDX_shader_enqueue[] + If <> is not + enabled, and 
+endif::VK_AMDX_shader_enqueue[] +ifndef::VK_AMDX_shader_enqueue[If] + the pipeline is being created with + <>, + pname:flags must: not include ename:VK_PIPELINE_CREATE_LIBRARY_BIT_KHR * [[VUID-VkGraphicsPipelineCreateInfo-flags-06609]] If pname:flags includes ename:VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT, pipeline @@ -4849,6 +4881,10 @@ ifdef::VK_EXT_legacy_dithering[] ename:VK_RENDERING_ENABLE_LEGACY_DITHERING_BIT_EXT. endif::VK_EXT_legacy_dithering[] endif::VK_VERSION_1_3,VK_KHR_dynamic_rendering[] +ifdef::VK_AMDX_shader_enqueue[] + * ename:VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX specifies that + the pipeline will be used in an <> +endif::VK_AMDX_shader_enqueue[] It is valid to set both ename:VK_PIPELINE_CREATE_2_ALLOW_DERIVATIVES_BIT_KHR and ename:VK_PIPELINE_CREATE_2_DERIVATIVE_BIT_KHR. diff --git a/proposals/VK_AMDX_shader_enqueue.adoc b/proposals/VK_AMDX_shader_enqueue.adoc index 81657ce19..efd0a8e4f 100644 --- a/proposals/VK_AMDX_shader_enqueue.adoc +++ b/proposals/VK_AMDX_shader_enqueue.adoc @@ -7,7 +7,7 @@ :refpage: https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/ :sectnums: -This extension adds the ability for developers to enqueue compute workgroups from a shader. +This extension adds the ability for developers to enqueue mesh pipelines and compute shader workgroups from other compute shaders. ## Problem Statement @@ -59,7 +59,7 @@ VkResult vkCreateExecutionGraphPipelinesAMDX( VkDevice device, VkPipelineCache pipelineCache, uint32_t createInfoCount, - const VkExecutionGraphPipelineCreateInfoAMDX* pCreateInfos, + const VkExecutionGraphPipelineCreateInfoAMDX* pCreateInfos, const VkAllocationCallbacks* pAllocator, VkPipeline* pPipelines); @@ -77,10 +77,16 @@ typedef struct VkExecutionGraphPipelineCreateInfoAMDX { ---- Shaders defined by `pStages` and any pipelines in `pLibraryInfo->pLibraries` define the possible nodes of the graph. -The linkage between nodes however is defined wholly in shader code. 
+The linkage between nodes however is defined wholly in shader code, though may be overridden by specialization constants in many cases. Shaders in `pStages` must be in the `GLCompute` execution model, and may have the *CoalescingAMDX* execution mode. -Pipelines in `pLibraries` can be compute pipelines or other graph pipelines created with the `VK_PIPELINE_CREATE_LIBRARY_BIT_KHR` flag bit. + +Pipelines in `pLibraries` can be compute pipelines, graphics pipelines, or other execution graph pipelines. Compute and graphics pipelines must be created with the `VK_PIPELINE_CREATE_2_LIBRARY_BIT_KHR` and `VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX` flag bits. Execution graph pipelines used as libraries must be created with the `VK_PIPELINE_CREATE_2_LIBRARY_BIT_KHR` flag bit. + +[source,c] +---- +VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX = 0x100000000ULL +---- Each shader in an execution graph is associated with a name and an index, which are used to identify the target shader when dispatching a payload. The `VkPipelineShaderStageNodeCreateInfoAMDX` provides options for specifying how the shader is specified with regards to its entry point name and index, and can be chained to the link:{refpage}VkPipelineShaderStageCreateInfo.html[VkPipelineShaderStageCreateInfo] structure. @@ -109,6 +115,36 @@ Allowing the index to be set dynamically lets applications stream shaders in and Shaders with the same name and different indexes must consume identical payloads and have the same execution model. Shaders with the same name in an execution graph pipeline must have unique indexes. +When dispatching from another shader, any declared input payload for the dispatched node must be less than or equal to the size of the output payload in the dispatching node. +Additionally, if an input payload is declared in the dispatched shader, the input and output payloads must specify members with the same decorations at the same offsets. 
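The two payload rules above (the dispatched node's input payload is no larger than the dispatching node's output payload, and shared members sit at the same offsets) can be pictured with host-side C structs. This is only a rough analogy, assuming C struct layout stands in for the explicit offset decorations a shader would carry; `ProducerPayload`, `ConsumerPayload`, `payload_fits`, and the member names are hypothetical and not part of the extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the output payload written by the dispatching node. */
typedef struct ProducerPayload {
    uint32_t dispatchGrid[3]; /* offset 0  */
    uint32_t materialId;      /* offset 12 */
    float    lodBias;         /* offset 16, not read by the consumer */
} ProducerPayload;

/* Hypothetical mirror of the input payload declared by the dispatched node. */
typedef struct ConsumerPayload {
    uint32_t dispatchGrid[3]; /* offset 0  */
    uint32_t materialId;      /* offset 12 */
} ConsumerPayload;

/* Members that both sides declare must live at identical offsets. */
static_assert(offsetof(ConsumerPayload, dispatchGrid) ==
              offsetof(ProducerPayload, dispatchGrid),
              "dispatchGrid offsets differ");
static_assert(offsetof(ConsumerPayload, materialId) ==
              offsetof(ProducerPayload, materialId),
              "materialId offsets differ");

/* Size rule: the input payload must not exceed the output payload. */
static int payload_fits(size_t inputSize, size_t outputSize)
{
    return inputSize <= outputSize;
}
```

Here `payload_fits(sizeof(ConsumerPayload), sizeof(ProducerPayload))` holds, while swapping the arguments does not, which matches the direction of the size rule.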
+ + +##### Graphics Pipeline State + +When adding a graphics pipeline to an execution graph pipeline, applications must specify a graphics pipeline with a complete set of state, and the `VK_PIPELINE_CREATE_2_LIBRARY_BIT_KHR` and `VK_PIPELINE_CREATE_2_EXECUTION_GRAPH_BIT_AMDX` flags set. +Graphics pipelines must only include mesh shaders; vertex shader pipelines or mesh pipelines with task shaders are not supported. +When creating such a graphics pipeline from libraries as an interaction with link:{refpage}VK_EXT_graphics_pipeline_library.html[VK_EXT_graphics_pipeline_library], those libraries must also have been created with those flags. + +For graphics pipelines defined in this way, only the following dynamic state is allowed: + + * `VK_DYNAMIC_STATE_VIEWPORT` + * `VK_DYNAMIC_STATE_SCISSOR` + * `VK_DYNAMIC_STATE_LINE_WIDTH` + * `VK_DYNAMIC_STATE_DEPTH_BIAS` + * `VK_DYNAMIC_STATE_BLEND_CONSTANTS` + * `VK_DYNAMIC_STATE_DEPTH_BOUNDS` + * `VK_DYNAMIC_STATE_VIEWPORT_WITH_COUNT` + * `VK_DYNAMIC_STATE_SCISSOR_WITH_COUNT` + * `VK_DYNAMIC_STATE_SAMPLE_LOCATIONS_EXT` + * `VK_DYNAMIC_STATE_FRAGMENT_SHADING_RATE_KHR` + +When these dynamic states are specified, this state is captured from the command buffer state at the point the execution graph is dispatched, and applies to all nodes that have that state set dynamically executed as part of that dispatch. +All graphics pipelines in an execution graph must use the same set of dynamic states. +Applications can dynamically choose any other state at runtime by selecting between pipelines with different state when dispatching, but the underlying pipelines must be created statically. + +When included as a library in an execution graph pipeline, the node is defined by the first shader in the graphics pipeline. + + #### Scratch Memory Implementations may need scratch memory to manage dispatch queues or similar when executing a pipeline graph, and this is explicitly managed by the application. 
@@ -118,18 +154,23 @@ Implementations may need scratch memory to manage dispatch queues or similar whe typedef struct VkExecutionGraphPipelineScratchSizeAMDX { VkStructureType sType; void* pNext; - VkDeviceSize size; + VkDeviceSize minSize; + VkDeviceSize maxSize; + VkDeviceSize sizeGranularity; } VkExecutionGraphPipelineScratchSizeAMDX; VkResult vkGetExecutionGraphPipelineScratchSizeAMDX( - VkDevice device, - VkPipeline executionGraph, - VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo); + VkDevice device, + VkPipeline executionGraph, + VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo); ---- -Applications can query the required amount of scratch memory required for a given pipeline, and the address of a buffer of that size must be provided when calling `vkCmdDispatchGraphAMDX`. +Applications can query the required amount of scratch memory for a given pipeline, and the address of a buffer of that size must be provided when calling `vkCmdDispatchGraphAMDX`. The amount of scratch memory needed by a given pipeline is related to the number and size of payloads across the whole graph; while the exact relationship is implementation dependent, reducing the number of unique nodes (different name string) and size of payloads can reduce scratch memory consumption. +A range of sizes is returned by the implementation; any size between `minSize` and `maxSize` can be used, though the actual memory consumed will be snapped to `minSize` + an integer multiple of `sizeGranularity`. +Choosing any value less than the maximum size will reduce memory pressure but will likely result in degraded performance.
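The snapping rule can be sketched as a small host-side helper. This is a minimal sketch, not part of the extension: `ScratchSizeRange` and `usable_scratch_size` are hypothetical names, the struct only mirrors the three size fields of `VkExecutionGraphPipelineScratchSizeAMDX`, and it assumes that "snapped" means rounding down, since an implementation cannot use more memory than the application provides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the size fields in
 * VkExecutionGraphPipelineScratchSizeAMDX (VkDeviceSize is 64 bits wide). */
typedef struct ScratchSizeRange {
    uint64_t minSize;
    uint64_t maxSize;
    uint64_t sizeGranularity;
} ScratchSizeRange;

/* Returns how much of `budget` bytes would actually be used as scratch
 * memory, or 0 if the budget is below the minimum and no dispatch is
 * possible. Assumption: snapping rounds down to
 * minSize + k * sizeGranularity. */
static uint64_t usable_scratch_size(const ScratchSizeRange *range,
                                    uint64_t budget)
{
    assert(range->minSize <= range->maxSize);

    if (budget < range->minSize) {
        return 0; /* below the minimum: cannot dispatch with this buffer */
    }
    if (budget > range->maxSize) {
        budget = range->maxSize; /* memory beyond maxSize brings no benefit */
    }
    if (range->sizeGranularity == 0) {
        return range->minSize; /* defensive: avoid dividing by zero */
    }

    uint64_t steps = (budget - range->minSize) / range->sizeGranularity;
    return range->minSize + steps * range->sizeGranularity;
}
```

With a range of `minSize = 1024`, `maxSize = 4096`, and `sizeGranularity = 256`, a 1500-byte budget yields 1280 usable bytes, so an application could allocate 1280 bytes instead and lose nothing.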
+ Buffers created for this purpose must use the new buffer usage flags: [source,c] @@ -144,11 +185,13 @@ Scratch memory needs to be initialized against a graph pipeline before it can be ---- void vkCmdInitializeGraphScratchMemoryAMDX( VkCommandBuffer commandBuffer, - VkDeviceAddress scratch); + VkPipeline executionGraph, + VkDeviceAddress scratch, + VkDeviceSize scratchSize); ---- -This command initializes it for the currently bound execution graph pipeline. -Scratch memory will need to be re-initialized if it is going to be reused with a different execution graph pipeline, but can be used with the same pipeline repeatedly without re-initialization. +This command initializes it for the execution graph pipeline `executionGraph` with the specified `scratchSize`. +Scratch memory will need to be re-initialized if it is going to be re-used with a different execution graph pipeline, but can be used with the same pipeline repeatedly without re-initialization. Scratch memory initialization can be synchronized using the compute pipeline stage `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` and shader write access flag `VK_ACCESS_SHADER_WRITE_BIT`. @@ -174,20 +217,23 @@ typedef struct VkDispatchGraphCountInfoAMDX { void vkCmdDispatchGraphAMDX( VkCommandBuffer commandBuffer, VkDeviceAddress scratch, + VkDeviceSize scratchSize, const VkDispatchGraphCountInfoAMDX* pCountInfo); void vkCmdDispatchGraphIndirectAMDX( VkCommandBuffer commandBuffer, VkDeviceAddress scratch, + VkDeviceSize scratchSize, const VkDispatchGraphCountInfoAMDX* pCountInfo); void vkCmdDispatchGraphIndirectCountAMDX( VkCommandBuffer commandBuffer, VkDeviceAddress scratch, + VkDeviceSize scratchSize, VkDeviceAddress countInfo); ---- -Each of the above commands enqueues an array of nodes in the bound execution graph pipeline with separate payloads, according to the contents of the `VkDispatchGraphCountInfoAMDX` and `VkDispatchGraphInfoAMDX` structures. 
+Each of the above commands enqueues payloads for an array of nodes in the bound execution graph pipeline, according to the contents of the `VkDispatchGraphCountInfoAMDX` and `VkDispatchGraphInfoAMDX` structures. `vkCmdDispatchGraphAMDX` takes all of its arguments from the host pointers. `VkDispatchGraphCountInfoAMDX::infos.hostAddress` is a pointer to an array of `VkDispatchGraphInfoAMDX` structures, @@ -200,18 +246,23 @@ with stride equal to `VkDispatchGraphCountInfoAMDX::stride` and `VkDispatchGraph Data consumed via a device address must be from buffers created with the `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT` and `VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT` flags. `payloads` is a pointer to a linear array of payloads in memory, with a stride equal to `payloadStride`. `payloadCount` may be `0`. -`scratch` may be used by the implementation to hold temporary data during graph execution, and can be synchronized using the compute pipeline stage and shader write access flags. +The `scratchSize` bytes of memory starting at the `scratch` address may be used by the implementation to hold temporary data during graph execution, and can be synchronized using the compute pipeline stage and shader write access flags. These dispatch commands must not be called in protected command buffers or secondary command buffers. -If a selected node does not include a `StaticNumWorkgroupsAMDX` or `CoalescingAMDX` declaration, the first part of each element of `payloads` must be a `VkDispatchIndirectCommand` structure, indicating the number of workgroups to dispatch in each dimension. -If an input payload variable in `NodePayloadAMDX` storage class is defined in the shader, its structure type *must* include link:{refpage}VkDispatchIndirectCommand.html[VkDispatchIndirectCommand] in its first 12 bytes.
+The size of the payload provided for each dispatched node must be at least as large as the *NodePayloadAMDX* declaration in the node, and the layout of the payload data in memory will be interpreted as it is laid out in the selected node's shader, including any member decorations. +In particular, this means for nodes that consume indirect parameters from the payload, those parameters must be provided in the correct location as specified in the shader. +For example, for a compute shader that does not include a `StaticNumWorkgroupsAMDX` or `CoalescingAMDX` declaration, each dispatch will consume a payload structure containing a member decorated with *PayloadDispatchIndirectAMDX* that indicates the number of workgroups to dispatch in each dimension. + +Node payload members must be _explicitly laid out_ with offset and array stride decorations, both in the input and output. -If that node does not include a `MaxNumWorkgroupsAMDX` declaration, it is assumed that the node may be dispatched with a grid size up to `VkPhysicalDeviceLimits::maxComputeWorkGroupCount`. +* If the dispatched shader uses `GLCompute` or `MeshEXT` `Execution Model`, then it is allowed to not specify the input payload. + In this case, the payload is defined implicitly as follows: +** If the `StaticNumWorkgroupsAMDX` or `CoalescingAMDX` execution modes are specified, the payload is empty. +** Otherwise, the payload is a structure with a single member that is a vector of three 32-bit unsigned integers. -If that node does not include a `CoalescingAMDX` declaration, all data in the payload is broadcast to all workgroups dispatched in this way. -If that node includes a `CoalescingAMDX` declaration, data in the payload will be consumed by exactly one workgroup. -There is no guarantee of how payloads will be consumed by `CoalescingAMDX` nodes. 
+Payloads are always read (including built-in values) according to the input payload definition - the output payload definition must have the same size as the expected input, but does not otherwise need to match. +Applications must take care to ensure that values are where they expect them. The `nodeIndex` is a unique integer identifier identifying a specific shader name and shader index (defined by `VkPipelineShaderStageNodeCreateInfoAMDX`) added to the executable graph pipeline. `vkGetExecutionGraphPipelineNodeIndexAMDX` can be used to query the identifier for a given node: @@ -221,7 +272,7 @@ The `nodeIndex` is a unique integer identifier identifying a specific shader nam VkResult vkGetExecutionGraphPipelineNodeIndexAMDX( VkDevice device, VkPipeline executionGraph, - const VkPipelineShaderStageNodeCreateInfoAMDX* pNodeInfo, + const VkPipelineShaderStageNodeCreateInfoAMDX* pNodeInfo, uint32_t* pNodeIndex); ---- @@ -258,23 +309,28 @@ typedef VkPhysicalDeviceShaderEnqueuePropertiesAMDX { uint32_t maxExecutionGraphShaderPayloadSize; uint32_t maxExecutionGraphShaderPayloadCount; uint32_t executionGraphDispatchAddressAlignment; + uint32_t maxExecutionGraphWorkgroupCount[3]; + uint32_t maxExecutionGraphWorkgroups; } VkPhysicalDeviceShaderEnqueuePropertiesAMDX; ---- Each limit is defined as follows: * `maxExecutionGraphDepth` defines the maximum node chain length in the graph, and must be at least 32. - The dispatched node is at depth 1 and the node enqueued by it is at depth 2, and so on. + A node that is dispatched with an API command is at depth 1 and the node that receives a payload from it is at depth 2, and so on. If a node uses tail recursion, each recursive call increases the depth by 1 as well. * `maxExecutionGraphShaderOutputNodes` specifies the maximum number of unique nodes that can be dispatched from a single shader, and must be at least 256. 
* `maxExecutionGraphShaderPayloadSize` specifies the maximum total size of payload declarations in a shader, and must be at least 32KB. * `maxExecutionGraphShaderPayloadCount` specifies the maximum number of output payloads that can be initialized in a single workgroup, and must be at least 256. * `executionGraphDispatchAddressAlignment` specifies the alignment of non-scratch `VkDeviceAddress` arguments consumed by graph dispatch commands, and must be no more than 4 bytes. + * `maxExecutionGraphWorkgroupCount[3]` describes the maximum number of local workgroups that a shader can be dispatched with, + and must be at least (65535, 65535, 65535) for the X, Y, and Z dimensions, respectively. + * `maxExecutionGraphWorkgroups` describes the total number of local workgroups that a shader can be dispatched with and must be at least 16777215. #### Features -The following new feature is added to Vulkan: +The following new features are added to Vulkan: [source,c] ---- @@ -282,10 +338,12 @@ typedef VkPhysicalDeviceShaderEnqueueFeaturesAMDX { VkStructureType sType; void* pNext; VkBool32 shaderEnqueue; + VkBool32 shaderMeshEnqueue; } VkPhysicalDeviceShaderEnqueueFeaturesAMDX; ---- -The `shaderEnqueue` feature enables all functionality in this extension. +The `shaderEnqueue` feature enables the ability to enqueue compute shader workgroups from other compute shaders. +The `shaderMeshEnqueue` feature enables the ability to enqueue mesh nodes in an execution graph. ### SPIR-V Changes @@ -305,15 +363,12 @@ A new storage class is added: |==== 2+^.^| Storage Class | Enabling Capabilities | 5068 | *NodePayloadAMDX* + -Input payload from a node dispatch. + -In the *GLCompute* execution model with the *CoalescingAMDX* execution mode, it is visible across all functions in all invocations in a workgroup; otherwise it is visible across all functions in all invocations in a dispatch. + -Variables declared with this storage class are read-write, and must not have initializers. 
-| *ShaderEnqueueAMDX* -| 5076 | *NodeOutputPayloadAMDX* + -Output payload to be used for dispatch. + -Variables declared with this storage class are read-write, must not have initializers, and must be initialized with *OpInitializeNodePayloadsAMDX* before they are accessed. + -Once initialized, a variable declared with this storage class is visible to all invocations in the declared _Scope_. + -Valid in *GLCompute* execution models. +Storage for Node Payloads. + + + +Variables declared with *OpVariable* in the *GLCompute* execution model with the *CoalescingAMDX* execution mode are visible across all invocations within a workgroup; and other variables declared with *OpVariable* in this storage class are visible across all invocations within a node dispatch. +Variables declared with this storage class are readable and writable, and must not have initializers. + + + +Pointers to this storage class are also used to point to payloads allocated and enqueued for other nodes. | *ShaderEnqueueAMDX* |==== @@ -331,12 +386,22 @@ Must not be declared alongside *StaticNumWorkgroupsAMDX* or *MaxNumWorkgroupsAMD 3+| |*ShaderEnqueueAMDX* | 5071 | *MaxNodeRecursionAMDX* + -Maximum number of times a node can enqueue itself. +Maximum number of times a node can enqueue payloads for itself. 3+| __ + _Number of recursions_ |*ShaderEnqueueAMDX* +| 5070 | *IsApiEntryAMDX* + +Indicates whether the shader can be dispatched directly by the client API or not. (GLCompute and MeshEXT execution models only) + + + +_Is Entry_ is a scalar Boolean value, with a value of *true* indicating that it can be dispatched from the API, and *false* indicating that it cannot. +If not specified, defaults to *true*. + + + +Must be set to *false* if *SharesInputWithAMDX* is specified. +3+| __ + +_Is Entry_ +|*ShaderEnqueueAMDX* | 5072 | *StaticNumWorkgroupsAMDX* + -Statically declare the number of workgroups dispatched for this shader, instead of obeying an API- or payload-specified value. 
Values are reflected in the NumWorkgroups built-in value. (GLCompute only) + +Statically declare the number of workgroups dispatched for this shader, instead of obeying an API- or payload-specified value. (GLCompute and MeshEXT only) + + Must not be declared alongside *CoalescingAMDX* or *MaxNumWorkgroupsAMDX*. | __ + @@ -347,7 +412,7 @@ _y size_ _z size_ |*ShaderEnqueueAMDX* | 5077 | *MaxNumWorkgroupsAMDX* + -Declare the maximum number of workgroups dispatched for this shader. Dispatches must not exceed this value (GLCompute only) + +Declare the maximum number of workgroups dispatched for this shader. Dispatches must not exceed this value (GLCompute and MeshEXT only) + + Must not be declared alongside *CoalescingAMDX* or *StaticNumWorkgroupsAMDX*. | __ + @@ -358,8 +423,19 @@ _y size_ _z size_ |*ShaderEnqueueAMDX* | 5073 | *ShaderIndexAMDX* + -Declare the node index for this shader. (GLCompute only) 3+| __ + +Declare the node index for this shader. (GLCompute and MeshEXT only) 3+| __ + +_Shader Index_ +|*ShaderEnqueueAMDX* +| 5102 | *SharesInputWithAMDX* + +Declare that this shader is paired with another node, such that it will be dispatched with the same input payload when the identified node is dispatched. + +_Node Name_ and _Shader Index_ indicate the node that the input will be shared with. + + + +_Node Name_ must be an *OpConstantStringAMDX* or *OpSpecConstantStringAMDX* instruction. +| + +_Node Name_ +| __ + _Shader Index_ +| |*ShaderEnqueueAMDX* |==== @@ -367,27 +443,39 @@ A shader module declaring `ShaderEnqueueAMDX` capability must only be used in ex `vkCreateExecutionGraphPipelinesAMDX` command. `MaxNodeRecursionAMDX` must be specified if a shader re-enqueues itself, which takes place if that shader -initializes and finalizes a payload for the same node _name_ and _index_. Other forms of recursion are not allowed. +allocates and enqueues a payload for the same node _name_ and _index_. Other forms of recursion are not allowed. 
An application must not dispatch the shader with a number of workgroups in any dimension greater than the values specified by `MaxNumWorkgroupsAMDX`. `StaticNumWorkgroupsAMDX` allows the declaration of the number of workgroups to dispatch to be coded into the shader itself, which can be useful for optimizing some algorithms. When a compute shader is dispatched using existing `vkCmdDispatchGraph*` commands, the workgroup counts specified there are overridden. When enqueuing such shaders with a payload, these arguments will not be consumed from the payload before application-specified data begins. -The values of `MaxNumWorkgroupsAMDX` and `StaticNumWorkgroupsAMDX` must be less than or equal to `link:{refpage}VkPhysicalDeviceLimits.html[VkPhysicalDeviceLimits]::maxComputeWorkGroupCount`. +The values of `MaxNumWorkgroupsAMDX` and `StaticNumWorkgroupsAMDX` must be less than or equal to `link:{refpage}VkPhysicalDeviceShaderEnqueuePropertiesAMDX.html[VkPhysicalDeviceShaderEnqueuePropertiesAMDX]::maxExecutionGraphWorkgroupCount`. + +The product of the X, Y, and Z values of `MaxNumWorkgroupsAMDX` and `StaticNumWorkgroupsAMDX` must be less than or equal to `link:{refpage}VkPhysicalDeviceShaderEnqueuePropertiesAMDX.html[VkPhysicalDeviceShaderEnqueuePropertiesAMDX]::maxExecutionGraphWorkgroups`. The arguments to each of these execution modes must be a constant 32-bit integer value, and may be supplied via specialization constants. -When a *GLCompute* shader is being used in an execution graph, `NumWorkgroups` must not be used. +When a *GLCompute* or *MeshEXT* shader is being used in an execution graph, `NumWorkgroups` must not be used. 
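The two limits on `MaxNumWorkgroupsAMDX` and `StaticNumWorkgroupsAMDX` described above (a per-dimension bound and a bound on the X*Y*Z product) can be checked on the host roughly as follows. This is a minimal sketch, not part of the proposal: `GraphWorkgroupLimits` and `workgroup_counts_within_limits` are hypothetical names, and the struct only mirrors the `maxExecutionGraphWorkgroupCount` and `maxExecutionGraphWorkgroups` members, which an application would read from `VkPhysicalDeviceShaderEnqueuePropertiesAMDX`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two new limits; a real application would copy
 * these from VkPhysicalDeviceShaderEnqueuePropertiesAMDX. */
typedef struct GraphWorkgroupLimits {
    uint32_t maxWorkgroupCount[3]; /* maxExecutionGraphWorkgroupCount */
    uint32_t maxWorkgroups;        /* maxExecutionGraphWorkgroups */
} GraphWorkgroupLimits;

/* Returns true if (x, y, z) respects both the per-dimension limit and the
 * limit on the product of the three dimensions. */
static bool workgroup_counts_within_limits(const GraphWorkgroupLimits *limits,
                                           uint32_t x, uint32_t y, uint32_t z)
{
    if (x > limits->maxWorkgroupCount[0] ||
        y > limits->maxWorkgroupCount[1] ||
        z > limits->maxWorkgroupCount[2]) {
        return false;
    }

    /* A zero-sized dimension gives a product of zero, which is always
     * within the total limit. */
    if (z == 0) {
        return true;
    }

    /* x * y always fits in 64 bits. Dividing the limit by z instead of
     * multiplying by z avoids overflow for large inputs. */
    uint64_t xy = (uint64_t)x * (uint64_t)y;
    return xy <= (uint64_t)limits->maxWorkgroups / z;
}
```

With the minimum required limits of (65535, 65535, 65535) and 16777215, a grid of (256, 256, 255) passes, while (256, 256, 256) fails the product limit even though each dimension is individually in range.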
When *CoalescingAMDX* is used, it has the following effects on a compute shader's inputs and outputs: - The `WorkgroupId` built-in is always `(0,0,0)` - NB: This affects related built-ins like `GlobalInvocationId` - So similar to `StaticNumWorkgroupsAMDX`, no dispatch size is consumed from the payload-specified - - The input in the `NodePayloadAMDX` storage class must have a type of *OpTypeArray* or *OpTypeRuntimeArray*. + - The input in the `NodePayloadAMDX` storage class must have a type of `OpTypeNodePayloadArrayAMDX`. - This input must be decorated with `NodeMaxPayloadsAMDX`, indicating the number of payloads that can be received. - - The number of payloads received is provided in the `CoalescedInputCountAMDX` built-in. - - If *OpTypeArray* is used, that input's array length must be equal to the size indicated by the `NodeMaxPayloadsAMDX` decoration. + - The number of payloads received can be queried through `OpNodePayloadArrayLengthAMDX` + +When *SharesInputWithAMDX* is declared, the node will be dispatched whenever the node identified by it is dispatched, with the same input payload. +The following limitations apply for sharing nodes in this way: + + - Nodes must only share with a node that does not declare *SharesInputWithAMDX* + - No more than 256 nodes in a graph can share the same input (including the base node) + - Applications must not directly dispatch any node with the *SharesInputWithAMDX* execution mode. + - Input payloads must be decorated with _NonWritable_ if *SharesInputWithAMDX* is declared. + - Emitting a payload to a shared node multiplies all of the payload resources by the number of shared nodes, as they count against values in `VkPhysicalDeviceShaderEnqueuePropertiesAMDX`. + +If *IsApiEntryAMDX* is set to *false*, `vkCmdDispatchGraph*` commands must not reference this node. 
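The *SharesInputWithAMDX* limitations listed above can be expressed as a small validation pass over a graph description. The sketch below is illustrative only: `NodeSharingInfo`, `NO_SHARED_INPUT`, and `validate_input_sharing` are hypothetical names, and the node table is an assumed host-side summary rather than anything defined by the extension.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_SHARED_INPUT UINT32_MAX      /* node does not declare SharesInputWithAMDX */
#define MAX_NODES_SHARING_INPUT 256u    /* includes the base node */

/* Hypothetical per-node summary of the relevant execution modes. */
typedef struct NodeSharingInfo {
    uint32_t sharesInputWith;  /* index of the base node, or NO_SHARED_INPUT */
    bool     isApiEntry;       /* IsApiEntryAMDX value */
    bool     inputNonWritable; /* input payload decorated NonWritable */
} NodeSharingInfo;

/* Returns true if every sharing relationship obeys the listed limitations. */
static bool validate_input_sharing(const NodeSharingInfo *nodes, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t base = nodes[i].sharesInputWith;
        if (base == NO_SHARED_INPUT) {
            continue;
        }
        /* The base node must exist (assumption: a dangling reference is
         * treated as invalid) and must not itself share its input. */
        if (base >= count || nodes[base].sharesInputWith != NO_SHARED_INPUT) {
            return false;
        }
        /* Sharing nodes cannot be API entry points and must not write
         * to their input payload. */
        if (nodes[i].isApiEntry || !nodes[i].inputNonWritable) {
            return false;
        }
    }

    /* No more than 256 nodes may share one input, counting the base node. */
    for (size_t b = 0; b < count; ++b) {
        size_t group = 1;
        for (size_t i = 0; i < count; ++i) {
            if (nodes[i].sharesInputWith == b) {
                ++group;
            }
        }
        if (group > MAX_NODES_SHARING_INPUT) {
            return false;
        }
    }
    return true;
}
```

The quadratic group count keeps the sketch short; a real validator would more likely accumulate counts per base node in a single pass.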
New decorations are added: @@ -395,103 +483,203 @@ New decorations are added: |==== 2+^.^| Decoration | Extra Operands | Enabling Capabilities | 5020 | *NodeMaxPayloadsAMDX* + -Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class. + - + -Variables in the *NodeOutputPayloadAMDX* storage class must have this decoration. -If such a variable is decorated, the operand indicates the maximum number of payloads in the array + -as well as the maximum number of payloads that can be allocated by a single workgroup for this output. + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX*. + + -Variables in the *NodePayloadAMDX* storage class must have this decoration if the *CoalescingAMDX* execution mode is specified, otherwise they must not. -If such a variable is decorated, the operand indicates the maximum number of payloads in the array. + +*OpTypeNodePayloadArrayAMDX* must have this decoration. +The operand indicates the maximum number of payloads that can be in the array, and the maximum number of payloads that can be enqueued with this type. | __ + _Max number of payloads_ |*ShaderEnqueueAMDX* + | 5019 | *NodeSharesPayloadLimitsWithAMDX* + -Decorates a variable in the *NodeOutputPayloadAMDX* storage class to indicate that it shares output resources with _Payload Array_ when dispatched. + +Decorates an *OpTypeNodePayloadArrayAMDX* declaration to indicate that payloads of this type share output resources with _Payload Type_ when allocated. + + -Without the decoration, each variable's resources are separately allocated against the output limits; by using the decoration only the limit of _Payload Array_ is considered. -Applications must still ensure that at runtime the actual usage does not exceed these limits, as this decoration only relaxes static validation. 
+Without the decoration, each type's resources are separately allocated against the output limits; by using the decoration only the limits of _Payload Type_ are considered. +Applications must still ensure that at runtime the actual usage does not exceed these limits, as this decoration only modifies static validation. + + -Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* storage class, -_Payload Array_ must be a different variable in the *NodeOutputPayloadAMDX* storage class, and -_Payload Array_ must not be itself decorated with *NodeSharesPayloadLimitsWithAMDX*. + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX* declaration, +_Payload Type_ must be a different *OpTypeNodePayloadArrayAMDX* declaration, and +_Payload Type_ must not be itself decorated with *NodeSharesPayloadLimitsWithAMDX*. + + -It is only necessary to decorate one variable to indicate sharing between two node outputs. -Multiple variables can be decorated with the same _Payload Array_ to indicate sharing across multiple node outputs. +It is only necessary to decorate one *OpTypeNodePayloadArrayAMDX* declaration to indicate sharing between two node outputs. +Multiple variables can be decorated with the same _Payload Type_ to indicate sharing across multiple node outputs. | __ + -_Payload Array_ +_Payload Type_ |*ShaderEnqueueAMDX* + | 5091 | *PayloadNodeNameAMDX* + -Decorates a variable in the *NodeOutputPayloadAMDX* storage class to indicate that the payloads in the array +Decorates an *OpTypeNodePayloadArrayAMDX* declaration to indicate that the payloads in the array will be enqueued for the shader with _Node Name_. + + -Must only be used to decorate a variable that is initialized by *OpInitializeNodePayloadsAMDX*. -| _Literal_ + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX* declaration. + + +_Node Name_ must be an *OpConstantStringAMDX* or *OpSpecConstantStringAMDX* instruction.
+| __ + _Node Name_ |*ShaderEnqueueAMDX* -| 5078 | *TrackFinishWritingAMDX* + -Decorates a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class to indicate that a payload that is first -enqueued and then accessed in a receiving shader, will be used with *OpFinishWritingNodePayloadAMDX* instruction. + + +| 5098 | *PayloadNodeBaseIndexAMDX* + +Decorates an *OpTypeNodePayloadArrayAMDX* declaration to indicate a base index that +will be added to the _Node Index_ when allocating payloads of this type. +If not specified, it is equivalent to specifying a value of 0. + + + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX* declaration. +| __ + +_Base Index_ +|*ShaderEnqueueAMDX* + +| 5099 | *PayloadNodeSparseArrayAMDX* + +Decorates an *OpTypeNodePayloadArrayAMDX* declaration to indicate that nodes at some node indexes may not exist in the execution graph pipeline and cannot be used to allocate payloads. + + + +If not specified, all node indexes between 0 and the *PayloadNodeArraySizeAMDX* value must be valid nodes in the graph. + + -Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class. + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX* declaration. +| +|*ShaderEnqueueAMDX* + +| 5100 | *PayloadNodeArraySizeAMDX* + +Decorates an *OpTypeNodePayloadArrayAMDX* declaration to indicate the maximum node index that can be used when allocating payloads of this type, including the base index offset in *PayloadNodeBaseIndexAMDX* decoration (if present). +If not specified, the node array is considered unbounded. + + -Must not be used to decorate a variable in the *NodePayloadAMDX* storage class if the shader uses *CoalescingAMDX* execution mode. + +Must only be used to decorate an *OpTypeNodePayloadArrayAMDX* declaration. + + + +If *PayloadNodeSparseArrayAMDX* is not set to *true* for a type initialized by *OpAllocateNodePayloadsAMDX*, this must be specified. 
+| __ + +_Array Size_ +|*ShaderEnqueueAMDX* + +| 5078 | *TrackFinishWritingAMDX* + +Decorates a structure to indicate that when used as a payload it can be written to and works with the *OpFinishWritingNodePayloadAMDX* instruction. + + -If a variable in *NodeOutputPayloadAMDX* storage class is decorated, then a matching variable with *NodePayloadAMDX* storage class -in the receiving shader must be decorated as well. + +Must only be used to decorate a structure type declaration. + + -If a variable in *NodePayloadAMDX* storage class is decorated, then a matching variable with *NodeOutputPayloadAMDX* storage class -in the enqueuing shader must be decorated as well. + +If the payload enqueued for a node is using a structure decorated with this value, the input payload in the *NodePayloadAMDX* storage class in the receiving node must use a structure decorated with it as well. | |*ShaderEnqueueAMDX* -|==== -This allows more control over the `maxExecutionGraphShaderPayloadSize` limit, and can be useful when a shader may output some large number of payloads but to potentially different nodes. +| 5105 | *PayloadDispatchIndirectAMDX* + +Indicates the dispatch indirect arguments describing the number of workgroups to dispatch in a payload. +Must only be used with *OpMemberDecorate* to decorate the member of a structure. -Two new built-ins are provided: +Must decorate a structure member with a type of *OpTypeInt* or *OpTypeVector* with two or three components. +The integer type or the type of the vector component must be an *OpTypeInt* with up to 32-bit _Width_ and 0 _Signedness_. +If a single integer is used, the Y and Z dispatch indirect arguments are assumed to be 1. +If a vector of two components is used, the Z dispatch indirect argument is assumed to be 1. 
+| +|*ShaderEnqueueAMDX* +|==== + +The following new built-ins are provided: [cols="1,10,8",options="header"] |==== 2+^.^| BuiltIn | Enabling Capabilities +| 5021 | *RemainingRecursionLevelsAMDX* + +The number of times this node can still enqueue payloads for itself. + +Is equal to 0 if at the leaf or if the node is not recursive at all. +|*ShaderEnqueueAMDX* | 5073 | *ShaderIndexAMDX* + Index assigned to the current shader. |*ShaderEnqueueAMDX* -| 5021 | *CoalescedInputCountAMDX* + -Number of valid inputs in the *NodePayloadAMDX* storage class array when using the *CoalescingAMDX* Execution Mode. (GLCompute only) -|*ShaderEnqueueAMDX* |==== -The business of actually allocating and enqueuing payloads is done by *OpInitializeNodePayloadsAMDX*: +If the `Execution Model` is `GLCompute` or `MeshEXT`, and neither the `StaticNumWorkgroupsAMDX` or `CoalescingAMDX` execution modes are specified, if an input payload is specified it must include a member with the *PayloadDispatchIndirectAMDX* decoration, indicating the number of workgroups to dispatch in each dimension. -[cols="1,2,2,2,2,2"] +New constant instructions are added to allow specialization of string variables, which are used for linkage between shaders. + +[cols="4*1"] +|====== +3+|[[OpConstantStringAMDX]]*OpConstantStringAMDX* + + + +Declare a new string specialization constant. + + + +_String_ is the value of the constant. + + + +Unlike *OpString*, this is a semantically meaningful instruction and cannot be safely removed from a module. +1+|Capability: + +*ShaderEnqueueAMDX* +| 3 + variable | 5103 +| _Result _ +| _Literal_ + +_String_ +|====== + +[cols="4*1"] |====== -5+|[[OpInitializeNodePayloadsAMDX]]*OpInitializeNodePayloadsAMDX* + +3+|[[OpSpecConstantStringAMDX]]*OpSpecConstantStringAMDX* + + + +Declare a new string specialization constant. + + + +_String_ is the default value of the constant. + + -Allocate payloads in memory and make them accessible through the _Payload Array_ variable. 
-The payloads are enqueued for the node shader identified by the _Node Index_ and _Node Name_ in the decoration -*PayloadNodeNameAMDX* on the _Payload Array_ variable. + +Unlike *OpString*, this is a semantically meaningful instruction and cannot be safely removed from a module. + + -_Payload Array_ variable must be an *OpTypePointer* with a _Storage Class_ of _OutputNodePayloadAMDX_, and a _Type_ of *OpTypeArray* with an _Element Type_ of *OpTypeStruct*. + +This instruction can be specialized to become an *OpConstantStringAMDX* instruction. + + -The array pointed to by _Payload Array_ variable must have _Payload Count_ elements. + +See _Specialization_. +1+|Capability: + +*ShaderEnqueueAMDX* +| 3 + variable | 5104 +| _Result <id>_ +| _Literal_ + +_String_ +|====== + + +A new payload type is defined that can be allocated dynamically and then enqueued for a node: + +[cols="4*1",width="100%"] +|===== +3+|[[OpTypeNodePayloadArrayAMDX]]*OpTypeNodePayloadArrayAMDX* + + +Declare a new payload array type. Its length is not known at compile time. + + +_Payload Type_ is the type of each payload in the array. + + + See <<OpNodePayloadArrayLengthAMDX>> for getting the length of an array of this type. + + +A payload array can be allocated by either *OpAllocateNodePayloadsAMDX* to be enqueued as an output, or via *OpVariable* in the *NodePayloadAMDX* storage class to be consumed as an input. + + +Can be dereferenced using an access chain in the same way as *OpTypeRuntimeArray* or *OpTypeArray*. +1+|Capability: + +*Shader* +| 3 | 5076 +| _Result <id>_ +| _<id>_ + +_Payload Type_ +|===== + +Decorations on this type indicate which node this type will be dispatched to and how it consumes resources. +Once a payload array type has been declared and all relevant decorations specified, payloads of that type can be allocated using: + +[cols="6*2,4"] +|====== +6+|[[OpAllocateNodePayloadsAMDX]]*OpAllocateNodePayloadsAMDX* + + +Allocates payloads for a node to be later enqueued via *OpEnqueueNodePayloadsAMDX*.
+ + + +_Result Type_ must be an *OpTypePointer* to an *OpTypeNodePayloadArrayAMDX* in the *NodePayloadAMDX* storage class. + + + +The payloads are allocated for the node identified by the _Node Name_ in the *PayloadNodeNameAMDX* decoration on _Result Type_, +with an index equal to the sum of its *PayloadNodeBaseIndexAMDX* decoration (if present) and _Node Index_. + Payloads are allocated for the _Scope_ indicated by _Visibility_, and are visible to all invocations in that _Scope_. + + -_Payload Count_ is the number of payloads to initialize in the _Payload Array_. + +_Payload Count_ is the number of payloads to allocate in the resulting array. + -_Payload Count_ must be less than or equal to the *NodeMaxPayloadsAMDX* decoration on the _Payload Array_ variable. + +Behavior is undefined if _Payload Count_ is greater than the *NodeMaxPayloadsAMDX* decoration on _Result Type_. + + _Payload Count_ and _Node Index_ must be dynamically uniform within the scope identified by _Visibility_. + + _Visibility_ must only be either _Invocation_ or _Workgroup_. + + -This instruction must be called in uniform control flow. + -This instruction must not be called on a _Payload Array_ variable that has previously been initialized. +This instruction must be called in uniform control flow within the same workgroup. 1+|Capability: + *ShaderEnqueueAMDX* -| 5 | 5090 +| 6 | 5074 | __ + -_Payload Array_ +_Result Type_ +| _Result_ __ | _Scope _ + _Visibility_ | __ + @@ -500,29 +688,22 @@ _Payload Count_ _Node Index_ |====== - -Once a payload element is initialized, it will be enqueued to workgroups in the corresponding shader after the calling shader has written all of its values. +Once a payload array is allocated, it can be enqueued to the identified node by calling *OpEnqueueNodePayloadsAMDX*. Enqueues are performed in the same manner as the `vkCmdDispatchGraph*` API commands. 
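The index arithmetic and payload count rule stated for *OpAllocateNodePayloadsAMDX* above can be modelled on the host as a small helper. This is a sketch under assumptions: `PayloadArrayTypeInfo` and `resolve_allocation` are hypothetical names, the struct only records the *PayloadNodeBaseIndexAMDX* and *NodeMaxPayloadsAMDX* decoration values, and the wrap-around guard is an added precaution rather than a rule taken from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the decorations on a payload array type. */
typedef struct PayloadArrayTypeInfo {
    uint32_t baseIndex;   /* PayloadNodeBaseIndexAMDX, 0 if not specified */
    uint32_t maxPayloads; /* NodeMaxPayloadsAMDX */
} PayloadArrayTypeInfo;

/* Computes the index of the node that receives the payloads. Returns false
 * if the allocation would exceed NodeMaxPayloadsAMDX (undefined behavior in
 * the shader) or if the index sum would wrap around 32 bits (assumption). */
static bool resolve_allocation(const PayloadArrayTypeInfo *type,
                               uint32_t nodeIndex, uint32_t payloadCount,
                               uint32_t *targetIndex)
{
    if (payloadCount > type->maxPayloads) {
        return false;
    }
    if (nodeIndex > UINT32_MAX - type->baseIndex) {
        return false;
    }
    /* Target node index = base index decoration + Node Index operand. */
    *targetIndex = type->baseIndex + nodeIndex;
    return true;
}
```

For a type with a base index of 4 and a maximum of 8 payloads, allocating 8 payloads with a _Node Index_ of 3 targets node index 7, while requesting 9 payloads is rejected.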
-If the node enqueued has the `CoalescingAMDX` execution mode, there is no guarantee what set of payloads are visible to the same workgroup. +If the node receiving the payloads has the `CoalescingAMDX` execution mode, there is no guarantee what set of payloads are visible to the same workgroup. -The shader must not enqueue payloads to a shader with the same name as this shader unless the index identifies this shader and `MaxNodeRecursionAMDX` is declared with a sufficient depth. +The shader must not enqueue payloads to a shader with the same name as this shader unless the index identifies this node and `MaxNodeRecursionAMDX` is declared with a sufficient depth. Shaders with the same name and different indexes can each recurse independently. - -A shader can explicitly specify that it is done writing to outputs (allowing the enqueue to happen sooner) by calling *OpFinalizeNodePayloadsAMDX*: - [cols="3,1,1"] |====== -2+|[[OpFinalizeNodePayloadsAMDX]]*OpFinalizeNodePayloadsAMDX* + +2+|[[OpEnqueueNodePayloadsAMDX]]*OpEnqueueNodePayloadsAMDX* + + -Optionally indicates that all accesses to an array of output payloads have completed. +Enqueues a previously allocated payload array for execution by its node. + + -_Payload Array_ is a payload array previously initialized by *OpInitializeNodePayloadsAMDX*. +_Payload Array_ is a pointer to a payload array that was previously allocated by *OpAllocateNodePayloadsAMDX*. + + -This instruction must be called in uniform control flow. - + -_Payload Array_ must be an *OpTypePointer* with a _Storage Class_ of _OutputNodePayloadAMDX_, and a _Type_ of *OpTypeArray* or *OpTypeRuntimeArray* with an _Element Type_ of *OpTypeStruct*. -_Payload Array_ must not have been previously finalized by *OpFinalizeNodePayloadsAMDX*. +This instruction must be called in uniform control flow within the workgroup. 
1+|Capability: + *ShaderEnqueueAMDX* | 2 | 5075 @@ -532,22 +713,84 @@ _Payload Array_ Once this has been called, accessing any element of _Payload Array_ is undefined behavior. +The length of _Payload Array_ can be queried at any point by calling: + +[cols="2*1,3*2",width="100%"] +|===== +4+|[[OpNodePayloadArrayLengthAMDX]]*OpNodePayloadArrayLengthAMDX* + + + +Query the length of a payload array. Must only be used with input payload arrays or allocated output payload arrays. + + + +_Result_ will be equal to the _Payload Count_ value used to allocate _Payload Array_, or to the number of received payloads if the shader is using *CoalescingAMDX* execution mode. Otherwise, _Result_ will be 1. + + + +_Result Type_ must be an *OpTypeInt* with 32-bit _Width_ and 0 _Signedness_. + + + +_Payload Array_ is a pointer to a payload array previously allocated by *OpAllocateNodePayloadsAMDX*, or declared via *OpVariable* in the *NodePayloadAMDX* storage class as an input. +1+|<>: + +*Shader* +| 4 | 5090 +| __ + +_Result Type_ +| _Result _ +| __ + +_Payload Array_ +|===== + +Before allocating payloads, applications can determine whether allocating payloads is possible for a particular node index: + +- If a payload type is decorated with *PayloadNodeSparseArrayAMDX*, applications can determine whether a node exists at a particular index. +- If a payload type is decorated with *PayloadNodeNameAMDX* that matches the current node, applications can determine whether a node at a particular index has reached its max recursion depth. +- In all other cases, the payload can be allocated. + +[cols="1,2,2,2,2,2"] +|====== +5+|[[OpIsNodePayloadValidAMDX]]*OpIsNodePayloadValidAMDX* + + + +Check if the node payload identified by the _Node Name_ in the *PayloadNodeNameAMDX* decoration, +with an index equal to the sum of its *PayloadNodeBaseIndexAMDX* decoration (if present) and _Node Index_ +can be allocated. 
+ + + +_Result_ is equal to *OpConstantTrue* if the payload is valid and can be allocated, *OpConstantFalse* otherwise. + + + +_Result Type_ must be *OpTypeBool*. + + + +_Payload Type_ must be an *OpTypeNodePayloadArrayAMDX* declaration. + + + +_NodeIndex_ must be less than the value specified by the *PayloadNodeArraySizeAMDX* decoration if specified. +1+|Capability: + +*ShaderEnqueueAMDX* +| 5 | 5101 +| __ + +_Result Type_ +| _Result_ __ +| __ + +_Payload Type_ +| __ + +_Node Index_ +|====== + +Payloads enqueued in this way will be provided to the node through the *NodePayloadAMDX* storage class in the shader. +These payloads can be read by the receiving node, but also can be written for a limited amount of communication between multiple workgroups enqueued for the same node. +It is a data race if one workgroup writes to a particular element of the payload and another workgroup accesses it in any way, with one exception; once all nodes have finished writing, it is safe for the last node to read those values. +Workgroups can indicate that they have finished writing to the payload by calling: + [cols="3,1,1,1,1"] |====== 4+|[[OpFinishWritingNodePayloadAMDX]]*OpFinishWritingNodePayloadAMDX* + + -Optionally indicates that all writes to the input payload by the current workgroup have completed. +Optionally indicates that all writes to the input payload by the current workgroup have completed. + + + +_Result_ is equal to *OpConstantTrue* if all workgroups that can access this payload have called this function. + + -Returns `true` when all workgroups that can access this payload have called this function. - Must not be called if the shader is using *CoalescingAMDX* execution mode, -or if the shader was dispatched with a `vkCmdDispatchGraph*` command, rather than enqueued from another shader. - -Must not be called if the input payload is not decorated with *TrackFinishWritingAMDX*. - -_Result Type_ must be *OpTypeBool*. 
+or if the shader was dispatched with a `vkCmdDispatchGraph*` client API command, +rather than enqueued from another shader. + + + +Must not be called if the input payload is not decorated with *TrackFinishWritingAMDX*. + + -_Payload_ is a variable in the *NodePayloadAMDX* storage class. +_Result Type_ must be *OpTypeBool*. + + + +_Payload_ must be the result of an *OpVariable* in the *NodePayloadAMDX* storage class. 1+|Capability: + *ShaderEnqueueAMDX* | 4 | 5078 @@ -563,20 +806,13 @@ Once this has been called for a given payload, writing values into that payload ## Issues -### RESOLVED: For compute nodes, can the input payload be modified? If so what sees that modification? - -Yes, input payloads are writable and *OpFinishWritingNodePayloadAMDX* instruction is provided to indicate that all -workgroups that share the same payload have finished writing to it. - -Limitations apply to this functionality. Please refer to the instruction's specification. - - -### UNRESOLVED: Do we need input from the application to tune the scratch allocation? +### How does this extension interact with device groups? -For now no, more research is required to determine what information would be actually useful to know. +It works the same as any other dispatch commands - work is replicated to all devices unless applications split the work themselves. +There is no automatic scheduling between devices. +### What dynamic state should be allowed? -### PROPOSED: How does this extension interact with device groups? +Proposed: Support a subset of dynamic state. -It works the same as any other dispatch commands - work is replicated to all devices unless applications split the work themselves. -There is no automatic scheduling between devices. +For now, this specification exposes basic "value" state - primarily things where there is only a value to modify rather than a mode switch or state enable. 
diff --git a/xml/vk.xml b/xml/vk.xml index 00d97f14c..8d108d08d 100644 --- a/xml/vk.xml +++ b/xml/vk.xml @@ -8994,11 +8994,14 @@ typedef void* MTLSharedEvent_id; uint32_t maxExecutionGraphShaderPayloadSize uint32_t maxExecutionGraphShaderPayloadCount uint32_t executionGraphDispatchAddressAlignment + uint32_t maxExecutionGraphWorkgroupCount[3] + uint32_t maxExecutionGraphWorkgroups VkStructureType sType void* pNext VkBool32 shaderEnqueue + VkBool32 shaderMeshEnqueue VkStructureType sType @@ -9020,7 +9023,9 @@ typedef void* MTLSharedEvent_id; VkStructureType sType void* pNext - VkDeviceSize size + VkDeviceSize minSize + VkDeviceSize maxSize + VkDeviceSize sizeGranularity uint32_t nodeIndex @@ -15713,13 +15718,13 @@ typedef void* MTLSharedEvent_id; VkResult vkGetExecutionGraphPipelineScratchSizeAMDX VkDevice device VkPipeline executionGraph - VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo + VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo VkResult vkGetExecutionGraphPipelineNodeIndexAMDX VkDevice device VkPipeline executionGraph - const VkPipelineShaderStageNodeCreateInfoAMDX* pNodeInfo + const VkPipelineShaderStageNodeCreateInfoAMDX* pNodeInfo uint32_t* pNodeIndex @@ -15731,27 +15736,32 @@ typedef void* MTLSharedEvent_id; const VkAllocationCallbacks* pAllocator VkPipeline* pPipelines - + void vkCmdInitializeGraphScratchMemoryAMDX VkCommandBuffer commandBuffer + VkPipeline executionGraph VkDeviceAddress scratch + VkDeviceSize scratchSize - + void vkCmdDispatchGraphAMDX VkCommandBuffer commandBuffer VkDeviceAddress scratch - const VkDispatchGraphCountInfoAMDX* pCountInfo + VkDeviceSize scratchSize + const VkDispatchGraphCountInfoAMDX* pCountInfo - + void vkCmdDispatchGraphIndirectAMDX VkCommandBuffer commandBuffer VkDeviceAddress scratch - const VkDispatchGraphCountInfoAMDX* pCountInfo + VkDeviceSize scratchSize + const VkDispatchGraphCountInfoAMDX* pCountInfo - + void vkCmdDispatchGraphIndirectCountAMDX VkCommandBuffer commandBuffer VkDeviceAddress scratch + 
VkDeviceSize scratchSize VkDeviceAddress countInfo @@ -19260,10 +19270,10 @@ typedef void* MTLSharedEvent_id; - + - - + + @@ -19290,7 +19300,11 @@ typedef void* MTLSharedEvent_id; - + + + + + @@ -24065,7 +24079,6 @@ typedef void* MTLSharedEvent_id; -