More Bindless - Continuing the bindless chain blog run, great to think through this stuff in more detail...
"What I’m getting at is that two 4x loads and one 8x load might end up having the same cost with enough occupancy" - If there was no advantage to 8x or 16x block loads, then either an engineer would leave it out of the design, or it is in there for future chips which might have more than 16B/clk of transfer out of K$ (since lines are 64B). Guessing the 16B/clk of transfer is really a limit of the scalar register file write ports. Guessing the advantage of larger than 4x block loads might be for cases like, issue an 16x block load, then issue a longer latency operation like a texture fetch. Meaning the 16x block load can continue writing to the register file in parallel with the texture fetch which at the same time only needs a read from the register file. If instead the code issued four 4x reads, it would delay the texture fetch.
"ideal is probably to allow the texture/sampler descriptors to be freely mixed" - The end game of what you are suggesting ends up being that all vendors directly expose their own low-level API to their unique hardware design. Then let the developers decide how to best use it. I do like that idea, I just don't believe I could convince the vendors to do that. Vendors already have that option given the ability to write extensions. Still need some common base which is portable and fast on all vendors for those who don't have the ability to target all hardware and who want to target future hardware which they don't know about yet.
"Today’s bind model basically has the app providing arrays of pointers which get chased to build contiguous descriptor blocks. Instead, the app could just provide a contiguous descriptor block" - GCN is special in that descriptors are loaded into the scalar register file. Think about the other possible GPU design options which don't involve loading descriptors into a shader register file. All of those possibilities take either an index or offset or pointer of a descriptor in the texture instruction. Which means that the hardware is doing the indirection already even when you don't use bindless. The indirection is free in that case. If there is some performance issue to worry about it might be the difference of loading a constant index/offset/pointer into a register vs using an immediate in the texture instruction instead. If the constant access can be used in place of a register in the opcode, or dual issued loaded for free, or the shader is not ALU bound, then it would not matter.
"Binding is just moving the table pointer." - Again the idea here, not including GCN, is to leverage the legacy path of immediate index textures. So every state change gets a fresh mini-table built inside the giant table. Often devs would just ring buffer these mini-tables inside the giant table. Those looking to actually remove CPU overhead, given a large enough giant table size, would just cache those mini-tables and not rebuild them (for the most part, a majority of what is drawn in one frame was drawn in the prior frame). A given texture descriptor might be duplicated in multiple mini-tables (different combinations of textures per table, or even different orderings). This design requires keeping track of all those descriptor copies and patching all of them on resource streaming. This I'm not wild about.
"What I’m getting at is that two 4x loads and one 8x load might end up having the same cost with enough occupancy" - If there was no advantage to 8x or 16x block loads, then either an engineer would leave it out of the design, or it is in there for future chips which might have more than 16B/clk of transfer out of K$ (since lines are 64B). Guessing the 16B/clk of transfer is really a limit of the scalar register file write ports. Guessing the advantage of larger than 4x block loads might be for cases like, issue an 16x block load, then issue a longer latency operation like a texture fetch. Meaning the 16x block load can continue writing to the register file in parallel with the texture fetch which at the same time only needs a read from the register file. If instead the code issued four 4x reads, it would delay the texture fetch.
"ideal is probably to allow the texture/sampler descriptors to be freely mixed" - The end game of what you are suggesting ends up being that all vendors directly expose their own low-level API to their unique hardware design. Then let the developers decide how to best use it. I do like that idea, I just don't believe I could convince the vendors to do that. Vendors already have that option given the ability to write extensions. Still need some common base which is portable and fast on all vendors for those who don't have the ability to target all hardware and who want to target future hardware which they don't know about yet.
"Today’s bind model basically has the app providing arrays of pointers which get chased to build contiguous descriptor blocks. Instead, the app could just provide a contiguous descriptor block" - GCN is special in that descriptors are loaded into the scalar register file. Think about the other possible GPU design options which don't involve loading descriptors into a shader register file. All of those possibilities take either an index or offset or pointer of a descriptor in the texture instruction. Which means that the hardware is doing the indirection already even when you don't use bindless. The indirection is free in that case. If there is some performance issue to worry about it might be the difference of loading a constant index/offset/pointer into a register vs using an immediate in the texture instruction instead. If the constant access can be used in place of a register in the opcode, or dual issued loaded for free, or the shader is not ALU bound, then it would not matter.
"Binding is just moving the table pointer." - Again the idea here, not including GCN, is to leverage the legacy path of immediate index textures. So every state change gets a fresh mini-table built inside the giant table. Often devs would just ring buffer these mini-tables inside the giant table. Those looking to actually remove CPU overhead, given a large enough giant table size, would just cache those mini-tables and not rebuild them (for the most part, a majority of what is drawn in one frame was drawn in the prior frame). A given texture descriptor might be duplicated in multiple mini-tables (different combinations of textures per table, or even different orderings). This design requires keeping track of all those descriptor copies and patching all of them on resource streaming. This I'm not wild about.