Unquestionably, GL bindless is well suited to NVIDIA hardware. After all, NVIDIA wrote the extension that ARB_bindless_texture was based on. Over time I have slowly come to understand how it is also a great match for AMD's GCN hardware...
Quick GCN Hardware Review
Each non-buffer texture in GL can be accessed by either a floating point offset or an integer texel offset. On GCN the integer texel offset case requires only a 16-byte or 32-byte descriptor, depending on texture type. The floating point offset case requires an extra 16-byte sampler descriptor. GCN has the ability to block load descriptors into the scalar register file 4, 8, 16, 32, or 64 bytes at a time with a single instruction.
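To make the sizes above concrete, here is a minimal C sketch of the arithmetic. The sizes come from the paragraph above; the names (TEX_DESC_SMALL, smallest_block_load, descriptor_bytes) are made up for illustration and are not part of any driver or API.

```c
#include <stddef.h>

/* Descriptor sizes in bytes, as described in the text. Names are illustrative. */
enum {
    TEX_DESC_SMALL = 16, /* texture descriptor, smaller texture types */
    TEX_DESC_LARGE = 32, /* texture descriptor, larger texture types  */
    SAMPLER_DESC   = 16  /* sampler descriptor                        */
};

/* Smallest single scalar block load (4, 8, 16, 32 or 64 bytes) that covers
   `bytes`. Returns 0 if one load is not enough. */
static size_t smallest_block_load(size_t bytes)
{
    static const size_t sizes[] = { 4, 8, 16, 32, 64 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
        if (sizes[i] >= bytes)
            return sizes[i];
    return 0;
}

/* Descriptor bytes one texture access needs: the texture descriptor, plus
   the sampler descriptor when accessed with a floating point offset. */
static size_t descriptor_bytes(size_t tex_desc_bytes, int uses_sampler)
{
    return tex_desc_bytes + (uses_sampler ? (size_t)SAMPLER_DESC : 0);
}
```

For example, smallest_block_load(descriptor_bytes(TEX_DESC_LARGE, 1)) covers 48 bytes of descriptor data with one 64-byte load, while the 16-byte texture descriptor plus sampler fits a 32-byte load exactly.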
GL vs DX With Regards to Sampler Style
The DX style of separate samplers and textures on GCN would require two block load instructions for the first texture accessed with a new sampler (one load for the texture descriptor, the other for the sampler descriptor). Each additional texture using the same sampler would require just one block load instruction.
The GL style of a combined sampler and texture on GCN requires only one block load instruction for any texture. With the texture and sampler descriptors sitting side by side in memory, both can be fetched together with a single 32-byte or 64-byte scalar block load instruction. As long as the K$ (the scalar data cache) is not a limiter, the GL case could be faster (fewer instructions). This could matter more as wavefront occupancy goes down (fewer options to multi-issue scalar, vector, and memory instructions).
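The instruction counts described above can be written down as a tiny C sketch. It assumes a sampler descriptor stays in scalar registers once loaded, and that every texture is accessed with a sampler; the function names are hypothetical.

```c
/* DX style: one block load per texture descriptor, plus one block load
   for each distinct sampler descriptor the shader uses. */
static unsigned dx_style_loads(unsigned textures, unsigned distinct_samplers)
{
    return textures + distinct_samplers;
}

/* GL style: texture and sampler descriptors are adjacent in memory,
   so each texture costs exactly one (wider) block load. */
static unsigned gl_style_loads(unsigned textures)
{
    return textures;
}
```

With four textures sharing one sampler this gives 5 loads for the DX style against 4 for the GL style. The gap grows with the number of distinct samplers, while the GL style pays with wider loads and more K$ traffic.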
A Bindless Indirection can be an Optimization?
Let's look in detail at a situation that would be common for a Clustered Forward based engine. Typically objects are drawn with a material shader, and this material shader will use a selection of new textures in combination with some textures shared with the prior shader, such as per-view textures (to fetch lights, etc). If the graphics API builds a unique buffer of descriptors for each draw call, then there is extra CPU overhead to copy all the per-view texture descriptors into each per-draw chunk of the buffer. Another option would be an API which enables keeping a unique cached table of descriptors per material in a buffer (the table changes per draw). However in that case, after each table change during rendering, parts of the K$ need to be reloaded for the per-view descriptors, which might already exist in the cache at a different address. Using the indirection of GL bindless could in theory reduce CPU setup (less to build), and also give a better chance of keeping a warmed K$ between draw calls (as the much larger descriptors shared between shaders stay at the same address).
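As a rough illustration of the CPU side of this argument, here is a toy C model of the bytes written per frame. It is only a sketch: the function names and parameters are hypothetical, it ignores the small per-draw handles the bindless path would still write, and it says nothing about real driver behavior.

```c
#include <stddef.h>

/* Unique descriptor buffer per draw: every draw copies the per-view
   descriptors again, alongside its own material descriptors. */
static size_t per_draw_copy_bytes(size_t draws, size_t per_view_descs,
                                  size_t material_descs, size_t desc_bytes)
{
    return draws * (per_view_descs + material_descs) * desc_bytes;
}

/* Shared per-view descriptors: written once at a fixed address, so each
   draw only pays for its material descriptors. */
static size_t shared_per_view_copy_bytes(size_t draws, size_t per_view_descs,
                                         size_t material_descs, size_t desc_bytes)
{
    return per_view_descs * desc_bytes + draws * material_descs * desc_bytes;
}
```

With made-up numbers of 1000 draws, 4 per-view and 4 material descriptors of 32 bytes each, the first case writes 256000 bytes and the second 128128 bytes. The fixed address in the second case is also what gives the K$ a chance to stay warm between draws.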
Using Bindless in a "Binding" Way
Another way of using bindless, which was recently brought to my attention, is to simply replace glBindTextures() with glProgramUniformHandleui64vARB(). The latter sets the traditional GL sampler bind points with a collection of bindless handles, and in theory can remove the driver overhead and validation of the legacy glBindTextures() path. This also in theory gives the driver the option of skipping the bindless indirection if that indirection would be slower on some kinds of hardware. Lastly, this option is 100% compatible with engines which need to support GPUs which have no ARB_bindless_texture support.
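A minimal C sketch of that replacement follows. The function pointer type matches the signature of glProgramUniformHandleui64vARB from ARB_bindless_texture; the helper set_sampler_handles and the local GL typedefs are hypothetical stand-ins so the snippet compiles without GL headers. It assumes the handles were already created and made resident, and that the samplers form one uniform array starting at first_location.

```c
#include <stdint.h>

/* Local stand-ins for the GL typedefs; drop these when including real GL headers. */
typedef unsigned int GLuint;
typedef int          GLint;
typedef int          GLsizei;
typedef uint64_t     GLuint64;

/* Same signature as glProgramUniformHandleui64vARB. */
typedef void (*ProgramUniformHandlesFn)(GLuint program, GLint location,
                                        GLsizei count, const GLuint64 *values);

/* Hypothetical helper: sets `count` sampler uniforms from bindless handles.
   Returns 1 if the handle path was used, 0 if the caller should fall back
   to glBindTextures() (extension missing, or nothing to set). */
static int set_sampler_handles(ProgramUniformHandlesFn set_handles,
                               GLuint program, GLint first_location,
                               GLsizei count, const GLuint64 *handles)
{
    if (!set_handles || !handles || count <= 0)
        return 0;
    set_handles(program, first_location, count, handles);
    return 1;
}
```

An engine would pass the loaded glProgramUniformHandleui64vARB entry point (or NULL when ARB_bindless_texture is unavailable) and keep its existing glBindTextures() path for the 0 return, which is what keeps this approach compatible with older GPUs.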