Josh Barczak continues the bindless chain letter. Here is another reply in the chain:
Thanks for the reply, this has already turned up some very good points.
The CodeXLAnalyzer results do not look optimized? In theory they only need 4 dwords/descriptor there and could coalesce into larger block loads. It would also be interesting to see whether interleaving the s_* with the v_* instructions before the first s_waitcnt would be faster or not. Yes, it is slightly counterintuitive because it might seem like less latency hiding in one shader, however it might get more efficient scheduling at run-time.
Re pre-loading the single sampler into the up-to-16 preloaded scalars: I think it is much more important to place the multi-frequency constant buffer addresses in those preload scalars, because that fully removes one level of indirection.
I am also guessing that preloading future K$ data into L2 prior to shader execution is an important optimization. It is something which would come for free if the GPU frontend is building legacy-style binding tables via DMA into L2 ahead of shader execution. Pre-warming caches is something I'm still thinking about in the context of bindless.
As for cache granularity, you have some good points. According to the GCN Crash Course (slide 23), K$ is 64 bytes/line but only transfers 16 bytes/clock. Developers looking to optimize constant buffers would pack data 16 32-bit values at a time. I don't see this being a problem in my design suggestion for handles: just tightly pack them with the other constant data. For any design where descriptors end up getting randomly accessed, this favors the GL combined {texture, sampler} design (no extra random access for samplers). This brings up two issues which should be thought through better in any adjustments to the GL bindless design:
(a.) Ideally the texture descriptor location in a global table would be up to the developer. This way they could pack two {16-byte texture descriptor, 16-byte sampler descriptor} pairs which get accessed in the same material into one 64-byte line for the common case of non-array 2D textures. This wastes nothing.
(b.) Fixed worst case 64-byte packing {32-byte texture descriptor, 16-byte sampler descriptor, 16-byte padding} does not look like a good option since in the common case, 32-bytes/line would get wasted.
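The arithmetic behind (a.) and (b.) can be checked with a small sketch. It only uses the sizes quoted above (64-byte K$ line, 16-byte sampler descriptor, 16-byte or 32-byte texture descriptor); the function names are made up for illustration and do not model any real driver.

```python
LINE_BYTES = 64          # K$ line size quoted above
SAMPLER_BYTES = 16       # sampler descriptor size used above
TEX_COMMON_BYTES = 16    # non-array 2D texture descriptor, per case (a.)
TEX_WORST_BYTES = 32     # larger texture descriptor, per case (b.)


def wasted_in_fixed_slot(tex_bytes):
    """Bytes unused when every pair gets a whole 64-byte line, as in case (b.)."""
    return LINE_BYTES - (tex_bytes + SAMPLER_BYTES)


def wasted_when_packed(tex_sizes):
    """Bytes unused in the lines needed to hold tightly packed pairs, as in case (a.)."""
    used = sum(t + SAMPLER_BYTES for t in tex_sizes)
    lines = -(-used // LINE_BYTES)  # ceiling division
    return lines * LINE_BYTES - used


print(wasted_in_fixed_slot(TEX_COMMON_BYTES))                    # 32
print(wasted_when_packed([TEX_COMMON_BYTES, TEX_COMMON_BYTES]))  # 0
```

The first result is the 32 bytes/line lost in the common case under fixed worst-case packing; the second is the developer-packed case where two pairs fill one line exactly.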
The workaround for these problems is to expose something which is a global descriptor table. The API would provide functionality similar to the following: ChangeTextureDescriptor(indexIntoTable, textureDescription), GetTextureHandle(indexIntoTable). Then to work around case (b.), have GetIndexGranularity(textureDescription). Default granularity would be one index, each index taking the optimal 32-bytes. Then {texture arrays, 3D textures, cubemaps, etc} would have a granularity of 2 indexes (or 64-bytes).
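A rough CPU-side model of that table API might look like the following. This is only a sketch of the idea, not a proposal for actual signatures: the `TextureDescription` type, the `kind` strings, the choice of a byte offset as the handle, and the rule that 2-index descriptors start on an even index (so they do not straddle a 64-byte line) are all assumptions added for illustration.

```python
from dataclasses import dataclass

INDEX_BYTES = 32  # one index = {16-byte texture, 16-byte sampler}, as described above


@dataclass(frozen=True)
class TextureDescription:
    kind: str  # hypothetical values: "2d", "2d_array", "3d", "cube"


def get_index_granularity(desc):
    """One index for plain 2D textures, two (64 bytes) for everything else."""
    return 1 if desc.kind == "2d" else 2


class GlobalDescriptorTable:
    def __init__(self, num_indices):
        self.slots = [None] * num_indices

    def change_texture_descriptor(self, index, desc):
        span = get_index_granularity(desc)
        if index % span != 0:
            # Assumption: larger descriptors stay aligned to their own size.
            raise ValueError("index must be a multiple of the granularity")
        if index < 0 or index + span > len(self.slots):
            raise IndexError("descriptor does not fit in the table")
        for i in range(index, index + span):
            self.slots[i] = desc

    def get_texture_handle(self, index):
        """Assumed handle encoding: byte offset from the table base."""
        return index * INDEX_BYTES


table = GlobalDescriptorTable(8)
table.change_texture_descriptor(0, TextureDescription("2d"))
table.change_texture_descriptor(2, TextureDescription("cube"))
print(table.get_texture_handle(2))  # 64
```

The point of the sketch is that the developer picks the indices, so textures used together in one material can be placed next to each other.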
Given the option between 64-bit handles and 32-bit handles: the 64-bit handle case does not require keeping the global descriptor table base address in two scalars which are in the up-to-16 preload. However, this results in a lot of waste in constant buffers, because the driver could just keep the global descriptor table in the lower 32 bits of the virtual address space. The 32-bit handle case could in theory hold the full virtual address in that situation, but S_LOAD_DWORD requires a 64-bit base address. If it is possible to use "FLAT_SRC_LO:FLAT_SRC_HI" (104:105) as the 64-bit base address, then this avoids keeping and clearing two scalar registers (for a zero base address). If un-pre-loaded scalars are all zero at shader start time, that would be an interesting case as well. Removing the global descriptor base address has an advantage in that one more constant buffer can be packed into the scalar pre-load. I believe that is a useful optimization.
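To put numbers on the constant buffer waste, here is a small sketch comparing the footprint of the two handle sizes against the 64-byte K$ line. The count of 16 handles is an arbitrary example, and treating a 32-bit handle as a byte offset added to a 64-bit base is an assumption for illustration, not a description of how any driver encodes handles.

```python
LINE_BYTES = 64  # K$ line size quoted earlier


def handle_footprint(num_handles, handle_bits):
    """Return (bytes, 64-byte lines) needed to store the handles in a constant buffer."""
    bytes_used = num_handles * handle_bits // 8
    lines = -(-bytes_used // LINE_BYTES)  # ceiling division
    return bytes_used, lines


def descriptor_address(table_base, handle32):
    """Assumed scheme: a 32-bit handle is a byte offset from a 64-bit table base."""
    assert 0 <= handle32 < 2**32
    return table_base + handle32


print(handle_footprint(16, 32))          # (64, 1)
print(handle_footprint(16, 64))          # (128, 2)
print(hex(descriptor_address(0, 0x40)))  # 0x40
```

With a zero base address the 32-bit handle is already the descriptor address, which is the case where a free source of a zero 64-bit base would save two scalars.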
16 Samplers Are NOT Enough
Reply to Andrew Lauritzen's comment on samplers: "4 bits would probably cover >99% of usage, especially now with min LOD in shader". To avoid confusion, there is a typo there, it should be "min LOD in texture". I disagree. 16 samplers are not enough. This also covers some of the reasons the combined {texture,sampler} design is quite useful. Sampler properties:
(a.) Anisotropy is an important per-texture performance/quality setting. Some textures in a material need high anisotropy, others need less.
(b.) LOD bias is an important per-texture performance/quality setting. Very important given temporal super-sampling (UE4 TAA) or spatial super-sampling (SGSSAA). Some textures greatly benefit visually from LOD bias, others do not.
(c.) Max LOD can be important for lightmaps to avoid bleeding?
(d.) Combine those with the typical per-sampler stuff like wrap mode, etc.
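To see how quickly these settings multiply, here is a toy count. The specific value sets are hypothetical and picked only for illustration; they are not measurements from any engine or title.

```python
from itertools import product

# Hypothetical per-texture choices, kept deliberately small.
ANISOTROPY = (1, 4, 16)
LOD_BIAS = (-1.0, -0.5, 0.0)
MAX_LOD_CLAMPED = (False, True)
WRAP = ("repeat", "clamp")

distinct = set(product(ANISOTROPY, LOD_BIAS, MAX_LOD_CLAMPED, WRAP))
print(len(distinct))       # 36
print(len(distinct) > 16)  # True
```

Even with only two or three options per property, the number of possible combinations is more than twice what a 4-bit sampler index can address.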