Bindless support #36

Open
Try opened this issue Jun 19, 2022 · 3 comments
Try commented Jun 19, 2022

Bindless is quite messy in every API, so we need to design a nice top-level API with a reasonable underlying implementation.

GLSL

GLSL is the main language in Tempest, so a dedicated section is a must. GLSL offers 2 ways:

  1. Unbound array of descriptors. - nice and easy to use
  2. Device address. - not portable to metal; hard to track hazards
layout(binding = 0) uniform sampler2D tex[]; // unbound array of textures
layout(binding = 1) uniform sampler2D img[]; // another unbound array of textures
layout(binding = 2, std140) readonly buffer Input { // own binding; 1 is already taken by img[]
  vec4 val[];
  } ssbo[]; // unbound array of buffers

Engine-side

std::vector<const Tempest::Texture2d*> ptex(tex.size());
for(size_t i=0; i<tex.size(); ++i)
  ptex[i] = &tex[i];
auto desc = device.descriptors(pso);
desc.set(0,ptex); // taking vector or c-array

This doesn't fit the engine perfectly: support for samplers and (non-combined) textures needs to be added on top of it.

Vulkan

Caps-list:

VkPhysicalDeviceDescriptorIndexingFeatures::runtimeDescriptorArray; // support for unbound array declaration (tex[])
// Support of nonuniformEXT, per resource-type 
VkPhysicalDeviceDescriptorIndexingFeatures::shaderUniformBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderSampledImageArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageImageArrayNonUniformIndexing;

VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT can be used (in theory), but only for the very last binding in a descriptor set, which doesn't fit the GLSL side.
Alternatively, it's sufficient to use VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT_EXT with a very large descriptor array. The size of the array has to be defined in C++ upfront, at VkDescriptorSetLayout creation.
The current implementation of Tempest can recreate VkDescriptorSetLayout and VkDescriptorSet on the fly if the preallocated array is not big enough. But this also requires recreating VkPipeline at runtime, based on the descriptor set size, which is hard to implement without extra performance cost.
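One way to keep the number of layout/pipeline recreations low is to grow the preallocated array geometrically. The sketch below is only an illustration of that idea: `nextDescriptorCapacity` is a hypothetical helper, not part of Tempest or Vulkan, and `deviceLimit` is assumed to be whichever update-after-bind limit applies to the resource type.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: picks the array size to bake into a new
// VkDescriptorSetLayout once `required` descriptors no longer fit.
// Returns 0 when the request cannot be satisfied within `deviceLimit`.
inline uint32_t nextDescriptorCapacity(uint32_t current, uint32_t required, uint32_t deviceLimit) {
  if(required > deviceLimit)
    return 0;
  if(required <= current)
    return current; // still fits, no recreation needed
  uint64_t cap = std::max<uint64_t>(current, 1);
  while(cap < required)
    cap *= 2; // geometric growth: recreations stay logarithmic in array size
  return uint32_t(std::min<uint64_t>(cap, deviceLimit));
}
```

A capacity that comes back unchanged means the existing layout, descriptor set and pipeline can be reused as they are.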

VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is useless by itself, but the spec defines special behavior for this type of descriptor:

... layouts which may be much higher than the pre-existing limits. The old limits only count descriptors in non-updateAfterBind descriptor set layouts, and the new limits count descriptors in all descriptor set layouts in the pipeline layout.

maxUpdateAfterBindDescriptorsInAllPools = 500,000+ // Eh, probably can't do anything sensible about it
maxPerStageUpdateAfterBindResources   = 500,000+

maxPerStageDescriptorUpdateAfterBindSamplers = 500,000+
maxPerStageDescriptorUpdateAfterBindUniformBuffers = 12+
maxPerStageDescriptorUpdateAfterBindStorageBuffers = 500,000+
maxPerStageDescriptorUpdateAfterBindSampledImages = 500,000+
maxPerStageDescriptorUpdateAfterBindStorageImages = 500,000+
maxPerStageDescriptorUpdateAfterBindAccelerationStructures = 500,000+

maxDescriptorSetUpdateAfterBindSamplers = 500,000+
maxDescriptorSetUpdateAfterBindUniformBuffers = 72+ // n × PerStage
maxDescriptorSetUpdateAfterBindStorageBuffers = 500,000+
maxDescriptorSetUpdateAfterBindSampledImages = 500,000+
maxDescriptorSetUpdateAfterBindStorageImages = 500,000+
maxDescriptorSetUpdateAfterBindAccelerationStructures = 500,000+

Naturally, as there is only a single descriptor set, we can just take the min of the PerStage and DescriptorSet limits.
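A minimal sketch of that min, in plain C++ so it does not depend on the Vulkan headers. `LimitPair` and `effectiveLimit` are hypothetical names; also clamping by maxPerStageUpdateAfterBindResources is an assumption added here, since that limit caps the total across resource types.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical pair of limits for one resource type, e.g.
// maxPerStageDescriptorUpdateAfterBindSampledImages and
// maxDescriptorSetUpdateAfterBindSampledImages.
struct LimitPair {
  uint32_t perStage;
  uint32_t perSet;
};

// With a single descriptor set, the usable array size is bounded by both
// limits. `perStageResources` (assumption) additionally caps the result by
// maxPerStageUpdateAfterBindResources.
inline uint32_t effectiveLimit(const LimitPair& l, uint32_t perStageResources) {
  return std::min({l.perStage, l.perSet, perStageResources});
}
```

The same function can be applied per resource type (samplers, storage buffers, sampled images, storage images) with the values queried from the device.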

Other limits of concern (obsolete):

VkPhysicalDeviceLimits::maxPerStageDescriptorSamplers = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers = 12+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers = 4+;
VkPhysicalDeviceLimits::maxPerStageDescriptorSampledImages = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageImages = 4+;
VkPhysicalDeviceLimits::maxPerStageResources = 128+;

VkPhysicalDeviceLimits::maxDescriptorSetSamplers = 96+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers = 72+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers = 24+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetSampledImages = 96+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetStorageImages = 24+; // n × PerStage

With such limits, reallocation has to manage the per-stage, per-resource and per-set limits somehow.

DirectX12

Note: Tempest uses spirv-cross to generate HLSL, but the produced HLSL is not valid:

// error: more than one unbounded resource (ssbo and tex) in space 0
ByteAddressBuffer         ssbo[]        : register(t1, space0);
Texture2D<float4>         tex[]         : register(t0, space0);
SamplerState             _tex_sampler[] : register(s0, space0);
RWTexture2D<unorm float4> ret           : register(u2, space0);

Apparently spirv-cross follows the VARIABLE_DESCRIPTOR_COUNT workflow. This maps directly to
D3D12_DESCRIPTOR_RANGE::NumDescriptors = -1, with the same limitation of only one runtime array per set. In theory, this can be worked around by instrumenting the SPIR-V:
OpDecorate %tex DescriptorSet 0 -> OpDecorate %tex DescriptorSet UNIQ_SPACE
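A rough sketch of such instrumentation on a SPIR-V binary held as a vector of 32-bit words. The opcode and decoration values (OpDecorate = 71, DescriptorSet = 34) and the 5-word header come from the SPIR-V specification; `remapDescriptorSet` is a hypothetical helper, and the stream is assumed to already be in host endianness. Which ids to remap, and how the root signature is kept in sync, is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kSpirvMagic              = 0x07230203;
constexpr uint32_t kOpDecorate              = 71;
constexpr uint32_t kDecorationDescriptorSet = 34;

// Rewrites `OpDecorate %targetId DescriptorSet <old>` so that it uses `newSet`.
// Returns true if a matching decoration was found and patched.
inline bool remapDescriptorSet(std::vector<uint32_t>& code, uint32_t targetId, uint32_t newSet) {
  const size_t headerSize = 5; // magic, version, generator, bound, schema
  if(code.size() < headerSize || code[0] != kSpirvMagic)
    return false;
  bool patched = false;
  for(size_t i = headerSize; i < code.size(); ) {
    const uint32_t opcode    = code[i] & 0xFFFFu; // low 16 bits
    const uint32_t wordCount = code[i] >> 16;     // high 16 bits
    if(wordCount == 0 || i + wordCount > code.size())
      return patched; // malformed stream, stop scanning
    // OpDecorate layout: [header, target id, decoration, literal]
    if(opcode == kOpDecorate && wordCount == 4 &&
       code[i+1] == targetId && code[i+2] == kDecorationDescriptorSet) {
      code[i+3] = newSet;
      patched = true;
    }
    i += wordCount;
  }
  return patched;
}
```

Calling this once per unbounded array, each time with a different `newSet`, would give every array its own register space in the generated HLSL.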

Limits:

| Resources Available to the Pipeline | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Feature levels | 11.0+ | 11.0+ | 11.1+ |
| Maximum number of descriptors in a CBV/SRV/UAV heap used for rendering | 1,000,000 | 1,000,000 | 1,000,000+ |
| Maximum number of CBV in all descriptor tables per shader stage | 14 | 14 | full heap |
| Maximum number of SRV in all descriptor tables per shader stage | 128 | full heap | full heap |
| Maximum number of UAV in all descriptor tables per shader stage | 64 for feature levels 11.1+, 8 for feature level 11 | 64 | full heap |
| Maximum number of Samplers in all descriptor tables per shader stage | 16 | 2048 | 2048 |

ID3D12GraphicsCommandList::SetDescriptorHeaps
Only one descriptor heap of each type can be set at one time, which means a maximum of 2 heaps (one sampler, one CBV/SRV/UAV) can be set at one time.
DX12 is a bit awkward, because the limit is shared by all descriptor types except samplers. Probably can "just" split the heap into equal partitions.
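The equal split can be sketched as plain offset arithmetic over the CBV/SRV/UAV heap. `HeapPartition` and `heapPartition` are hypothetical names used only for illustration; a real implementation would turn the offset into a descriptor handle using the device's descriptor increment size.

```cpp
#include <cstdint>

// Hypothetical description of one slice of a shader-visible heap,
// measured in descriptors (not bytes).
struct HeapPartition {
  uint32_t offset;
  uint32_t size;
};

// Splits `heapSize` descriptors into `count` equal partitions and returns
// partition number `index`. A zero-sized partition signals invalid input.
inline HeapPartition heapPartition(uint32_t heapSize, uint32_t count, uint32_t index) {
  if(count == 0 || index >= count)
    return {0, 0};
  const uint32_t size = heapSize / count; // any remainder is left unused
  return {index * size, size};
}
```

For example, a 1,000,000-descriptor heap split in two yields partitions of 500,000 descriptors starting at offsets 0 and 500,000.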

Metal [3]

Limits (per-app resources available at any given time are):

| Resources Available to the Pipeline | Tier 1 (iOS) | Tier 1 | Tier 2 |
|---|---|---|---|
| Buffers (and TLASes) | 31 | 64 | 500,000 |
| Textures | 31 | 128 | 500,000 |
| Samplers | 16 | 16 | 2048 |

For both tiers, the maximum number of argument buffer entries in each function argument table is 8.

*Writable textures aren’t supported within an argument buffer.
Tier 1 argument buffers can’t be accessed through pointer indexing, nor can they include pointers to other argument buffers.
Tier 2 argument buffers can be accessed through pointer indexing, as shown in the following example.

T1 argument buffers are practically the same as descriptor sets in Vulkan and offer nothing useful here.
T2 allows pointer indexing and can be leveraged for a bindless array.

Sources:
https://gist.github.com/DethRaid/0171f3cfcce51950ee4ef96c64f59617
https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_descriptor_range
https://learn.microsoft.com/en-us/windows/win32/direct3d12/hardware-support?redirectedfrom=MSDN
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers
https://developer.apple.com/documentation/metal/buffers/managing_groups_of_resources_with_argument_buffers

GLSL

Unbound array of descriptors has 2 meanings:
Base spec:
uniform sampler2D tex[] -> OpTypeArray %8 %uint_1
The size of the array depends on the highest index that has been used in the code.

GL_EXT_nonuniform_qualifier:
May work the same as the base spec if no runtime index is in use; otherwise:
uniform sampler2D tex[] -> OpTypeRuntimeArray %8 // legal only if driver supports descriptor-indexing

Engine side

[wip]
Generally, a Metal-like model is a good middle ground:

maxUAV      = 500'000; // ssbo + tlas + imageStore
maxTextures = 500'000;
maxSamplers = 2048;
// can skip maxUbo - hard in vulkan and not very useful
// combined image consumes both Texture and Samplers limits

In DX, UAV/Tex can be achieved by splitting the heap in 2 parts.
In Vulkan, UAV is probably the min across all applicable resources.

Try added a commit that referenced this issue Mar 27, 2023
Try added a commit that referenced this issue Mar 28, 2023
Try added a commit that referenced this issue Mar 28, 2023
Try added a commit that referenced this issue Mar 28, 2023

Try commented Mar 28, 2023

TODO, for DX12:

  • handle case when only sampler is in descriptor-set (pDescriptorHeaps[0]==nullptr)

Try added a commit that referenced this issue Mar 29, 2023
Try added a commit that referenced this issue Mar 30, 2023
Try added a commit that referenced this issue Jul 15, 2023
Try added a commit that referenced this issue Aug 16, 2023
Try added a commit that referenced this issue Aug 17, 2023

Try commented Apr 22, 2024

error: number of textures with read_write access exceeds maximum supported (8)

Apparently undocumented. MoltenVK allows 500k if argument buffer tier 2 is supported (why?), and 8 otherwise.

Try added a commit that referenced this issue Apr 29, 2024

Try commented Jul 2, 2024

New Mac/iOS feature to track residency of resources:
https://developer.apple.com/documentation/metal/resource_fundamentals/simplifying_gpu_resource_management_with_residency_sets?language=objc

According to Apple:
You don’t need to call the following methods for any allocation in a residency set that you associate with the command buffer: useResource, useHeap
