Bindless and Descriptors

So far the continuing interwebs discussion on what is broken with GL and how GL should evolve has been quite useful, so I am going to continue here with thoughts on bindless and managing descriptors. Very interested to see what others think about these conclusions. Some of my estimations could be wrong; please correct them.

Web sources referenced to write this: AMD's Sea Islands ISA | AMD's Sea Islands Register Set | AMD's Southern Islands Programming Guide | NVIDIA PTX Guide | NVIDIA nvdisasm | NVIDIA Instruction Set Reference | NVIDIA Maxwell Tuning Guide.

Fetching of Constants
As can be seen in both NVIDIA nvdisasm and the NVIDIA ISA docs, constant buffers with a compile-time immediate binding slot (X) and offset (Y) have an important instruction fast path: "c[X][Y]". Constants can replace a register in opcodes. This avoids an extra instruction to load the constant into a register and reduces register pressure (thereby increasing warp occupancy and the ability to hide latency). The "MOV" instruction can be used to move immediate-indexed constants into registers. Note that PTX shows "MOV" supporting 64-bit moves. For loading using dynamic addresses, the ISA doc describes the "LDC" instruction to load constants into registers, which is documented as "LDU" in the PTX guide. The constant load supports [32bitRegister+32bitImmediate] and [64bitPairOfRegisters+32bitImmediate] addressing modes. As can be seen in the Maxwell tuning guide under "Instruction Scheduling", NVIDIA's latest hardware maintains the ability to dual issue to different functional units, like a math op and a memory op in the same cycle (important later).
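
To make the distinction concrete, here is a hedged GLSL sketch (stored as a C string, as a loader might embed it); the block name, member names, and the codegen comments are my assumptions based on the docs above, not guaranteed compiler output:

/* Hypothetical GLSL fragment, stored as a C string for embedding in a loader.
   Comments describe the codegen one would expect on NVIDIA hardware given the
   ISA documentation referenced above; actual output depends on the compiler. */
static const char *constant_fetch_example =
    "#version 430\n"
    "layout(std140, binding = 0) uniform Constants {\n"
    "    vec4 params[64];\n"
    "};\n"
    "uniform int dynamicIndex;\n"
    "out vec4 color;\n"
    "void main() {\n"
    /* params[3]: binding slot and offset are compile-time immediates, so the
       constant can feed an opcode directly via the c[X][Y] operand path. */
    "    vec4 a = params[3];\n"
    /* params[dynamicIndex]: address is only known at run time, so the compiler
       is expected to emit an LDC-style load of the constant into a register. */
    "    vec4 b = params[dynamicIndex];\n"
    "    color = a + b;\n"
    "}\n";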

As can be seen in the AMD ISA doc, GCN supports two methods of block fetching scalars (or constants): S_LOAD_DWORD*, which uses 2 scalars for a 64-bit base address, or S_BUFFER_LOAD_DWORD*, which uses 4 scalars for a buffer descriptor. The S_LOAD_DWORD* instructions support three addressing modes: [64bitBase+8bitUnsignedImmediate*4], [64bitBase+32bitUnsignedImmediate] (SQ_SRC_LITERAL), and [64bitBase+ScalarRegister]. The S_BUFFER_LOAD_DWORD* instructions add, on top of this, support for clamping out-of-bounds accesses (since the buffer descriptor provides the size of the buffer: num_records*stride).

Dynamic Address Modes: 32-bit Handle vs 32-bit Index
Handle addressing: [base+handle]
Index addressing: [base+index*stride]
Given dynamic indexing into resource descriptor tables on GCN, if the application loads a dynamic "index" in a shader at run-time to access a table, a hidden scalar shift or multiply instruction (by the stride) is required to produce the address. This applies to both NV and AMD; neither has a free shift on addressing for scalar/constant loads. Hopefully this makes it clear that it is better to use "handles", as they avoid the extra instruction.
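
A minimal sketch of the difference, in plain C modeling the scalar address math (the function names and the 32-byte stride are just illustrative assumptions):

#include <stdint.h>

/* Toy model of the scalar address math behind a descriptor fetch.
   "base" is the start of a descriptor table; each entry is "stride" bytes
   (for example, 32 bytes for a paired texture+sampler descriptor). */

/* Index addressing: the shader gets an index and must scale it by the stride
   at run time -> one extra hidden shift/multiply instruction. */
static uint64_t addr_from_index(uint64_t base, uint32_t index, uint32_t stride)
{
    return base + (uint64_t)index * stride; /* hidden scalar multiply/shift */
}

/* Handle addressing: whoever writes the constant buffer bakes the scaling in
   up front, so the shader just adds. */
static uint32_t make_handle(uint32_t index, uint32_t stride)
{
    return index * stride; /* done once, outside the shader */
}

static uint64_t addr_from_handle(uint64_t base, uint32_t handle)
{
    return base + handle; /* matches the [base+handle] addressing mode */
}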

Resources Used by Texture Instructions
The AMD docs are explicit: a texture fetch using floating point offsets takes two immediates, one which provides the scalar register index for the texture resource descriptor, and the other which provides the scalar register index for the sampler resource descriptor.

NVIDIA PTX docs describe the evolution of NVIDIA hardware. At first NVIDIA had two texture access modes: "unified" which supported using 128 bind points of {texture,sampler} pairs (GL mode), and "independent" which supported using 128 bind points for {texture} and 16 bind points for {sampler} (DX mode). With "sm_20" (Fermi), NVIDIA added an extra mode for "unified" or paired {texture,sampler} access: the "indirect texture access" mode (GL bindless) which takes a register handle instead of a bind point index.

GCN SGPR Initialization
As can be seen in the AMD programming guide under "SGPR Initialization", up to 16 scalars can be pre-loaded before shader start. This provides the ability to have up to 8 64-bit base addresses for S_LOAD_DWORD* minus whatever pre-loaded scalars are needed for something else.

That should be enough background to start reasoning about implementations by estimating instructions required...

NVIDIA GL Non-bindless
TEX 7bitImmediateTextureIndex

NVIDIA GL Bindless From Constant Buffer
MOV handle, c[immediateBufferBinding][immediateHandleOffset]
TEX handle
This is leveraging hardware fast paths, including the c[X][Y] constant-as-register path. The expectation is that "handle" register usage is temporary, and the extra MOV gets dual issued, so it is effectively almost free. This mode could even be used to reduce CPU overhead in a non-bindless API like DX11 and prior, by replacing the building of command buffers with the building of handles in constant buffers (a lot less work, since the larger hardware descriptors only need to change at resource streaming frequency).
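
As a rough illustration of that idea, here is a hedged host-side sketch using GL_ARB_bindless_texture; the function and variable names around the GL calls are made up, error handling and residency lifetime management are omitted, and GLEW is assumed only as a convenient loader:

#include <GL/glew.h>

/* Write a 64-bit {texture,sampler} handle into a constant buffer the draw
   already reads, instead of issuing per-draw bind calls. */
static void write_texture_handle(GLuint ubo, GLintptr offsetInUbo,
                                 GLuint texture, GLuint sampler)
{
    /* One handle for the {texture,sampler} pair. */
    GLuint64 handle = glGetTextureSamplerHandleARB(texture, sampler);
    glMakeTextureHandleResidentARB(handle);

    /* 8 bytes written into the constant buffer; the shader then pulls the
       handle through the MOV c[X][Y] path described above. */
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, offsetInUbo, sizeof(handle), &handle);
}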

NVIDIA With Resource Tables and Constant Buffers
MOV tableBaseAddress, c[immediateTableBufferBinding][immediateOffsetToTableBaseAddress]
LDC handle, [tableBaseAddress+immediateOffsetToTableEntry]
TEX handle
Showing the GL style paired {texture,sampler} above. It looks like extra register usage and an extra memory operation for the indirection through the table. The expectation is that, since constant buffers already have existing usage for fixed function, and since a resource-table-based API is likely to support multiple resource tables bound per draw (something like UE4 probably wants 4 texture tables per draw, each updated at a different frequency: global, per view, per material, per mesh), there are not enough binding slots to use the optimized c[X][Y] path for both resources and constants. Adding separate textures and samplers into this mix compounds the number of indirections, especially if samplers need to be in a separate table (now maybe 8 tables per draw), and there is the mystery of how to build the final "handle" for the TEX instruction from two separate handles.
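
Here is a hedged CPU-side model of the two addressing chains being compared; the struct and field names are made up, and the handle widths are only placeholders:

#include <stdint.h>

typedef struct {
    uint64_t tableBaseAddress;   /* pointer to a table of handles */
    /* ... other per-draw constants ... */
} DrawConstantsWithTable;

typedef struct {
    uint64_t textureHandle;      /* handle stored directly in the constants */
    /* ... other per-draw constants ... */
} DrawConstantsWithHandle;

/* Table path: constants -> table base (MOV) -> handle (LDC) -> TEX */
static uint64_t resolve_via_table(const DrawConstantsWithTable *cb,
                                  uint32_t offsetToTableEntry)
{
    const char *table = (const char *)(uintptr_t)cb->tableBaseAddress;
    return *(const uint64_t *)(table + offsetToTableEntry); /* extra memory op */
}

/* Bindless path: constants -> handle (MOV) -> TEX */
static uint64_t resolve_via_handle(const DrawConstantsWithHandle *cb)
{
    return cb->textureHandle;
}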

The existing GL path looks to me like a much better match for NVIDIA hardware than the other options.

AMD GL Non-bindless
S_LOAD_DWORD* textureAndSamplerDescriptor
SAMPLE_*
A single S_LOAD_DWORD* instruction fetches both the texture and sampler descriptor. The expectation is that the binding table is set up before shader start via SGPR Initialization, using 2 of the 16 scalars. Loads of the majority of {texture,sampler} pairs are 32-byte loads (16-byte texture descriptor plus 16-byte sampler descriptor). {Texture arrays, cubemaps, 3D textures} need 32-byte texture descriptors, so these use 64-byte loads and waste 4 scalars.

AMD DX Non-bindless
S_LOAD_DWORD* textureDescriptor
S_LOAD_DWORD* samplerDescriptor (sometimes)
SAMPLE_*
The first texture takes two S_LOAD_DWORD* instructions. Assuming the sampler scalars don't get repurposed by the compiler, a later texture which uses the same sampler will not need the extra S_LOAD_DWORD* instruction.

Thus far for non-bindless on GCN, the GL way minimizes the number of scalar instructions, while the DX way minimizes the size of the block loads. Seems like the GL way ends up better if instruction issue bound. When wave occupancy gets low (as is common with complex shaders), GCN can only multi-issue if enough waves can be scheduled, since each functional unit needs a different wavefront in a given clock cycle (a single wavefront cannot multi-issue). Extra scalar operations can eat into ALU throughput. Perhaps the DX way ends up better if K$ throughput bound, but would that happen in practice? The AMD ISA docs describe the K$ as 16 bytes wide (which is why 64-byte block loads only need 4-scalar alignment in the scalar register file). A 16-byte load would have a throughput of 1 clock, a 32-byte load 2 clocks, and a 64-byte load 4 clocks. With modern ALUop:TEXop ratios of 16:1 or so, some fraction of an extra clock for GL might not matter. The last issue would be K$ utilization, with some possible duplication of samplers. The K$ is 16KB. If a shader uses 16 2D textures and 4 samplers: DX = 320B or about 2% of the K$, GL = 512B or about 3% of the K$. Does that matter?
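
For reference, the K$ footprint arithmetic spelled out, assuming the descriptor sizes described above (this is just the calculation from the paragraph, not measured data):

#include <stdio.h>

int main(void)
{
    const int numTextures = 16, numSamplers = 4;
    const int texDescBytes = 16, smpDescBytes = 16; /* 2D texture and sampler */
    const int kcacheBytes  = 16 * 1024;             /* 16KB K$ */

    /* DX style: descriptors stored separately, samplers shared. */
    int dxBytes = numTextures * texDescBytes + numSamplers * smpDescBytes;
    /* GL style: each texture pairs with its own copy of a sampler descriptor. */
    int glBytes = numTextures * (texDescBytes + smpDescBytes);

    printf("DX: %d bytes (%.1f%% of K$)\n", dxBytes, 100.0 * dxBytes / kcacheBytes);
    printf("GL: %d bytes (%.1f%% of K$)\n", glBytes, 100.0 * glBytes / kcacheBytes);
    return 0; /* prints DX: 320 bytes (2.0%), GL: 512 bytes (3.1%) */
}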

AMD GL Bindless
S_LOAD_DWORD* upTo16Constants (sometimes)
S_LOAD_DWORD* textureAndSamplerDescriptor
SAMPLE_*
The indirection, or the fetching of a bindless handle, gets amortized into block loads of constants. If GL moved, for example, to a 32-bit bindless handle, then the cost of the indirection could be as little as 1/16 of a scalar issue slot and 1/4 of a scalar K$ cycle (one S_LOAD_DWORDX16 pulls in 16 handles with a single instruction, and its 64-byte load takes 4 K$ clocks, so 4/16 = 1/4 of a clock per handle).

AMD Resource Tables With Separate Textures And Samplers
S_LOAD_DWORD* upTo8TablePointers (sometimes)
S_LOAD_DWORD* textureDescriptor
S_LOAD_DWORD* samplerDescriptor (sometimes)
SAMPLE_*
Using the combination of multi-frequency updates (global, per view, per material, per mesh) with separate tables per object type (constant buffers, texture resource tables, sampler resource tables, etc.) easily exhausts the maximum of 16 scalars supplied by SGPR Initialization on GCN (a max of 8 64-bit pointers). So table pointers likely end up needing an extra indirection to load in that case. An API which packs descriptors with constants could leverage SGPR Initialization to avoid that indirection.
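
The pointer counting, made explicit as a hedged compile-time sketch (the frequency and table-kind breakdown is my assumption, following the UE4-style example above):

/* Each table or constant buffer base is a 64-bit pointer = 2 SGPRs, and GCN
   pre-loads at most 16 SGPRs at shader start. */
enum { SGPRS_PER_POINTER = 2, SGPR_INIT_BUDGET = 16 };

enum { FREQ_GLOBAL, FREQ_PER_VIEW, FREQ_PER_MATERIAL, FREQ_PER_MESH, NUM_FREQS };
enum { KIND_CONSTANTS, KIND_TEXTURE_TABLE, KIND_SAMPLER_TABLE, NUM_KINDS };

/* 4 frequencies x 3 kinds = 12 pointers = 24 SGPRs, well past the 16-SGPR
   initialization budget, hence the extra indirection. */
enum { SGPRS_NEEDED = NUM_FREQS * NUM_KINDS * SGPRS_PER_POINTER };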

Given the options what API design makes the most sense?

Some of my Conclusions
(0.) I do not want designs which compromise the performance or functionality of the 100+ GB/s GPU vendors: AMD and NVIDIA.

(1.) The GL combined {texture, sampler} still looks like a good design on both AMD and NVIDIA. It looks to be the fastest design on NVIDIA, and possibly the fastest or a performance-matching design on AMD as well.

(2.) GL bindless should be changed from 64-bit to 32-bit handles.

(3.) The GL bindless design could be extended to support all resource types {buffers, images, and textures} with handles in constant buffers. NVIDIA PTX guide shows "indirect surface access" (bindless surfaces) on sm_20 (Fermi) and above. So this should be possible on the two vendors which currently support bindless.

(4.) Resource descriptors in huge global tables (one for buffers, one for images, and one for textures) could be updated only at resource streaming frequency (not that often). This way, if an engine decides to repack constant buffers per draw with different resources, it only needs 4 bytes per resource (the handle) instead of up to 48 bytes (on GCN, for things like array textures with a sampler); see the sketch after this list.

(5.) There is great utility to the developer in not needing to deal with the differences in hardware descriptor sizes. Using 32-bit handles instead provides a common, almost minimal size. The alternative, developers attempting to maintain different buffer layouts in memory to account for different hardware descriptor sizes, would be quite complex.

(6.) I currently believe the best path in dealing with the mix of both fully bindless and non-or-partial-bindless hardware is to support the non-fully-bindless cases like the PS3: make legacy binding and vertex fetch work as fast as possible. This would be in contrast to attempting to find some middle-ground design which would cripple the fully bindless hardware and make the non-bindless hardware look like something it is not. Likely this evolution would look like bindless becoming the standard interface, plus an optional extension set which provides {fixed function vertex fetch, bindings for DX11 style separate samplers and textures}.
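
To tie conclusions (2.) through (4.) together, a hedged sketch of what the data layout could look like; all struct and field names are hypothetical, the 48-byte entry is the worst case mentioned above, and a real driver would own the actual descriptor layouts:

#include <stdint.h>

/* One large table per resource type, updated only at streaming frequency. */
typedef struct { uint32_t words[12]; } TextureSamplerDescriptor; /* up to 48B */
typedef struct { uint32_t words[4];  } BufferDescriptor;

typedef struct {
    TextureSamplerDescriptor textures[64 * 1024];
    BufferDescriptor         buffers [64 * 1024];
} GlobalDescriptorTables;

/* Per-draw constants carry 4-byte handles into the global tables, so
   repacking a draw costs 4 bytes per resource instead of up to 48. */
typedef struct {
    uint32_t diffuseTexture;
    uint32_t normalTexture;
    uint32_t materialBuffer;
    float    otherConstants[8];
} PerDrawConstants;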
