OpenGL 4.4 Quick Reference Card
ARB_bindless_texture Spec
Core concept, build 64-bit handles of textures and samplers CPU-side,
GLuint64 handle = glGetTextureSamplerHandleARB(textureObj, samplerObj);
Make the combination resident (too bad there is no batch version of this),
glMakeTextureHandleResidentARB(handle);
Then get the handle into a uvec2 GPU-side, typecast to a sampler type and use,
uvec2 handle;
vec4 value = texture(sampler2D(handle), texCoord);
The uvec2 handles can easily be passed between vertex and pixel shader stage, or could be stored in a uniform, or could be fetched from a texture (or buffer), etc.
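Putting the pieces together, a minimal CPU-side sketch might look like the following (the tex, smp, and handleUbo object names and the UBO layout are assumptions for illustration, not part of the spec),
// Sketch: assumes texture object tex, sampler object smp, and a UBO handleUbo
// whose first 8 bytes the shader reads back as a uvec2.
GLuint64 handle = glGetTextureSamplerHandleARB(tex, smp);
glMakeTextureHandleResidentARB(handle); // required before any shader access
// Split into two dwords, assuming .x holds the low 32 bits (packUint2x32 convention).
GLuint bits[2] = { (GLuint)(handle & 0xFFFFFFFFu), (GLuint)(handle >> 32) };
glBindBuffer(GL_UNIFORM_BUFFER, handleUbo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(bits), bits);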
How Would This Map to the Hardware?
Traditionally, bound graphics resources have been a great optimization at the cost of flexibility compared to pointers. Starting with NVIDIA, using the online PTX reference as a guide: the PTX docs describe two ways to get at textures, unified and independent. Unified provides access to 128 pairs of {texture, sampler} via one index (matches GL), while independent provides two indexes but only a 4-bit index for the sampler (matches DX). The texture index is an opaque texref handle, which suggests a translation to direct immediates inside opcodes. PTX 3.1 adds support for indirect texture access for sm_20 (Fermi), and only for unified mode (GL style): "In indirect access, operand a is a .u64 register holding the address of a .texref variable".
Looks like ARB_bindless_texture is a direct match for PTX indirect unified texture access. This hints that a side effect of using bindless textures is that two precious 32-bit registers are tied up for each texture handle (a Fermi SM has 32768 registers shared across up to 1536 resident threads, so only about 21 registers per thread at maximum occupancy; each handle then eats almost 10% of the register budget in that case). A texture array, in comparison, only requires one extra register to index the layer.
What about AMD GCN?
Using AMD Southern Islands ISA as a guide. AMD's GCN uses a separate vector and scalar register file. The scalar register file is used for many things including branching, predication, constants, and resource descriptors. In order to make loading of values into the scalar register file efficient, AMD supports block loads of scalars via S_LOAD_DWORDX* and S_BUFFER_LOAD_DWORDX* instructions. For buffer and texture access, opcodes use immediate indexes to scalar registers. Each index provides the base register in a series of registers which provide the complete description of the resource. Texture resource descriptors take 4 to 8 scalar registers (8 if using texture arrays, cubemaps, 3D, or MSAA), and sampler resources take 4. A single block scalar load can fetch 16 scalars, or up to four resource descriptors.
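Back-of-the-envelope, in C struct terms (sizes only; this is not the actual bit layout of GCN descriptors),
#include <stdint.h>
typedef struct { uint32_t dw[4]; } SamplerDesc;    // S#: 4 dwords = 128 bits
typedef struct { uint32_t dw[8]; } ImageDesc;      // T#: 4 or 8 dwords (8 for arrays/cube/3D/MSAA)
typedef struct { SamplerDesc s[4]; } SamplerBlock; // 16 dwords: one S_LOAD_DWORDX16 fills it
_Static_assert(sizeof(SamplerBlock) == 16 * sizeof(uint32_t), "four S# fit in one 16-dword block load");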
Note that ARB_bindless_texture requires bindless texture handles to be dynamically uniform (all invocations must access the same resource). NV_bindless_texture didn't have this restriction, but AMD's hardware effectively requires uniform access (the opcodes take an immediate index to scalar registers).
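In GLSL terms the restriction looks roughly like this (hypothetical UBO and varying names; only the dynamically uniform indexing pattern is the point),
#version 440
#extension GL_ARB_bindless_texture : require
layout(std140, binding = 0) uniform Handles { uvec2 handles[256]; }; // 64-bit handles as uvec2
flat in int materialId; // must be dynamically uniform for ARB_bindless_texture
in vec2 texCoord;
out vec4 color;
void main()
{
    // Legal here only if materialId ends up the same across the draw;
    // a genuinely divergent per-fragment index was fine under NV_bindless_texture
    // but gives undefined results under the ARB extension.
    color = texture(sampler2D(handles[materialId]), texCoord);
}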
The optimal path for traditional non-bindless resources would involve one block scalar load per four samplers. Assuming a bindless resource's 64-bit handle sits in a pair of scalar registers, the bindless path would use an S_LOAD_DWORDX* to fetch one resource descriptor at a time. Guessing that bindless handles passed from VS to PS would end up in LDS: a DS_READ_B32 would be required to fetch the handle into vector registers, then two V_READLANE_B32 instructions would be required to move the 64-bit handle into a scalar register pair.
What I'd Rather Have
A simple usage case: pass a material index from the vertex shader to the pixel shader. The material index is a 32-bit (or smaller) offset into a buffer (so as not to waste vertex output space). The offset is aligned and safe for scalar block loads, and at that offset sit the resource descriptors themselves. Except this won't work with ARB_bindless_texture. ARB_bindless_texture instead requires the material index to be an offset into a buffer full of 64-bit handles pointing at the descriptors, which would require scattered sparse reads to gather. That seems quite a bit more expensive, or rather prohibitively expensive, in comparison.
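For contrast, here is roughly what the ARB_bindless_texture version of that material-index path forces the shader to do (the Material struct and its member names are made up for illustration); every texture in the material is a separate 64-bit handle that the hardware still has to chase before it ever reaches a descriptor,
#version 440
#extension GL_ARB_bindless_texture : require
// The buffer holds handles (pointers to descriptors), not the descriptors themselves.
struct Material { uvec2 albedoTex; uvec2 normalTex; uvec2 specTex; uvec2 pad; };
layout(std430, binding = 0) buffer Materials { Material materials[]; };
flat in int materialId; // the 32-bit index passed down from the VS
in vec2 texCoord;
out vec4 color;
void main()
{
    Material m = materials[materialId];
    vec3 albedo = texture(sampler2D(m.albedoTex), texCoord).rgb;
    vec3 normal = texture(sampler2D(m.normalTex), texCoord).rgb;
    float spec  = texture(sampler2D(m.specTex),   texCoord).r;
    color = vec4(albedo * spec + normal * 0.001, 1.0); // placeholder combine, just to use all three fetches
}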