Bindless Blog Chain Cached

Josh Barczak continues the bindless chain letter. Here is another reply in the chain:

Thanks for the reply, this has already turned up some very good points.

The CodeXLAnalyzer results do not appear to be optimized. In theory they only need 4 dwords per descriptor there and could coalesce into larger block loads. It would also be interesting to see whether interleaving the s_* instructions with the v_* instructions before the first s_waitcnt is faster or not. This is slightly counterintuitive, because it might seem to give less latency hiding within one shader, but it might get more efficient scheduling at run time.

Re pre-loading the single sampler into the up-to-16 preloaded scalars: I think it is much more important to place the multi-frequency constant buffer addresses in those preload scalars, because that fully removes one level of indirection.

I am also guessing that preloading future K$ data into L2 prior to shader execution is an important optimization. It would come for free if the GPU frontend builds legacy-style binding tables into L2 via DMA ahead of shader execution. Pre-warming caches is something I'm still thinking about in the context of bindless.

As for cache granularity, you have some good points. According to the GCN Crash Course (slide 23), K$ is 64 bytes/line but only transfers 16 bytes/clock. Developers looking to optimize constant buffers would pack data in groups of 16 32-bit values (one 64-byte line) at a time. I don't see this being a problem in my design suggestion for handles: just tightly pack them with the other constant data. Any design where descriptors end up getting randomly accessed favors the GL combined {texture, sampler} design (no extra random access for samplers). This brings up two issues which should be thought through better in any adjustments to the GL bindless design:

(a.) Ideally, texture descriptor location in a global table would be up to the developer. This way they could pack two {16-byte texture descriptor, 16-byte sampler descriptor} pairs which get accessed in the same material into one 64-byte line for the common case of non-array 2D textures. This wastes nothing.

(b.) Fixed worst case 64-byte packing {32-byte texture descriptor, 16-byte sampler descriptor, 16-byte padding} does not look like a good option since in the common case, 32-bytes/line would get wasted.
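The arithmetic behind (a.) and (b.) can be checked with a small sketch. It uses only the sizes quoted above (64-byte K$ line, 16-byte common-case texture descriptor, 16-byte sampler descriptor); the function and constant names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # K$ line size quoted from the GCN Crash Course


def wasted_bytes_per_line(slot_bytes, used_bytes):
    """Bytes left unused in one 64-byte line filled with fixed-size slots.

    slot_bytes: size reserved per {texture, sampler} pair in the table.
    used_bytes: bytes of that slot actually holding descriptor data.
    """
    slots_per_line = CACHE_LINE_BYTES // slot_bytes
    return CACHE_LINE_BYTES - slots_per_line * used_bytes


# Common case: 16-byte texture descriptor + 16-byte sampler descriptor.
COMMON_PAIR_BYTES = 16 + 16

# (a.) Developer-controlled placement: two 32-byte pairs share one line.
waste_a = wasted_bytes_per_line(slot_bytes=32, used_bytes=COMMON_PAIR_BYTES)

# (b.) Fixed worst-case 64-byte slot holding a single common-case pair.
waste_b = wasted_bytes_per_line(slot_bytes=64, used_bytes=COMMON_PAIR_BYTES)

print("case (a.) wasted bytes/line:", waste_a)
print("case (b.) wasted bytes/line:", waste_b)
```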

The workaround for these problems is to expose something which is a global descriptor table. The API would provide functionality similar to the following: ChangeTextureDescriptor(indexIntoTable, textureDescription), GetTextureHandle(indexIntoTable). Then to work around case (b.), have GetIndexGranularity(textureDescription). Default granularity would be one index, each index taking the optimal 32-bytes. Then {texture arrays, 3D textures, cubemaps, etc} would have a granularity of 2 indexes (or 64-bytes).
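To make the proposed interface concrete, here is a minimal sketch of such a table. It is a toy model and not any real driver API: the snake_case names mirror the three functions suggested above, the `kind` labels are hypothetical, and the rule that a two-index entry must start on an even index (so it never straddles a 64-byte line) is my own assumption.

```python
from dataclasses import dataclass

INDEX_BYTES = 32  # default granularity: one index = 32 bytes

# Texture kinds assumed to need the larger descriptor (two indexes, 64 bytes).
LARGE_KINDS = {"2d_array", "3d", "cube"}


@dataclass(frozen=True)
class TextureDescription:
    kind: str       # hypothetical labels: "2d", "2d_array", "3d", "cube"
    name: str = ""


def get_index_granularity(desc):
    """Number of consecutive table indexes the description occupies."""
    return 2 if desc.kind in LARGE_KINDS else 1


class GlobalDescriptorTable:
    def __init__(self, num_indexes):
        self.slots = [None] * num_indexes

    def change_texture_descriptor(self, index, desc):
        """Write a descriptor at a developer-chosen index."""
        n = get_index_granularity(desc)
        if index % n != 0:
            # Assumption: align large entries so they stay in one 64-byte line.
            raise ValueError("index must be a multiple of the granularity")
        if index < 0 or index + n > len(self.slots):
            raise IndexError("descriptor does not fit in the table")
        for i in range(index, index + n):
            self.slots[i] = desc

    def get_texture_handle(self, index):
        """In this sketch the handle is simply the table index."""
        if self.slots[index] is None:
            raise KeyError("no descriptor at this index")
        return index


table = GlobalDescriptorTable(8)
# Two 2D textures from the same material share one 64-byte line.
table.change_texture_descriptor(0, TextureDescription("2d", "albedo"))
table.change_texture_descriptor(1, TextureDescription("2d", "normal"))
# A cubemap takes two indexes (64 bytes), starting on an even index.
table.change_texture_descriptor(2, TextureDescription("cube", "env"))

handle = table.get_texture_handle(2)
byte_offset = handle * INDEX_BYTES  # offset of the descriptor within the table
```

Because the developer picks the index, descriptors used together can be placed next to each other, which is the point of case (a.).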

Given the option between 64-bit handles and 32-bit handles: the 64-bit handle case does not require keeping the global descriptor table base address in two of the up-to-16 preloaded scalars. However, it wastes a lot of constant buffer space, because the driver could just keep the global descriptor table in the lower 32 bits of the virtual address space. The 32-bit handle case could in theory hold the virtual address directly in that case, but S_LOAD_DWORD requires a 64-bit base address. If it is possible to use "FLAT_SRC_LO:FLAT_SRC_HI" (104:105) as the 64-bit base address, then this avoids keeping and clearing two scalar registers (for a zero base address). If scalars that are not preloaded are all zero at shader start time, that would be an interesting case as well. Removing the global descriptor base address has the advantage that one more constant buffer can be packed into the scalar preload, which I believe is a useful optimization.
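The trade-off can be put in numbers with a rough sketch. It models a 64-bit handle as a full virtual address and a 32-bit handle as an address in the lower 4 GiB added to a 64-bit base (zero in the case discussed). The function names and the count of 16 handles per material are assumptions for illustration only.

```python
CACHE_LINE_BYTES = 64  # K$ line size


def resolve_64bit_handle(handle):
    """A 64-bit handle is already the full virtual address."""
    return handle & 0xFFFFFFFFFFFFFFFF


def resolve_32bit_handle(handle, base=0):
    """A 32-bit handle still needs a 64-bit base for the scalar load.

    With the table kept in the lower 32 bits of the address space the
    base is zero, but it still has to come from somewhere.
    """
    return (base + (handle & 0xFFFFFFFF)) & 0xFFFFFFFFFFFFFFFF


def handle_footprint_bytes(num_handles, handle_bits):
    """Constant buffer bytes consumed by the handles themselves."""
    return num_handles * handle_bits // 8


NUM_HANDLES = 16  # hypothetical number of texture handles in one material

bytes_64 = handle_footprint_bytes(NUM_HANDLES, 64)
bytes_32 = handle_footprint_bytes(NUM_HANDLES, 32)
print("64-bit handles:", bytes_64, "bytes,", bytes_64 // CACHE_LINE_BYTES, "K$ lines")
print("32-bit handles:", bytes_32, "bytes,", bytes_32 // CACHE_LINE_BYTES, "K$ line")
```

Both forms resolve to the same address when the descriptor lives below 4 GiB; the 64-bit form just spends twice the constant buffer space on upper halves that are always zero.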

16 Samplers Are NOT Enough
Reply to Andrew Lauritzen's comment on samplers: "4 bits would probably cover >99% of usage, especially now with min LOD in shader". To avoid confusion: there is a typo in the quote, and it should read "min LOD in texture". I disagree. 16 samplers are not enough. This also covers some of the reasons the combined {texture, sampler} design is quite useful. Sampler properties:

(a.) Anisotropy is an important per-texture performance/quality setting. Some textures in a material need high anisotropy, others need less.

(b.) LOD bias is an important per-texture performance/quality setting. It is very important given temporal super-sampling (UE4 TAA) or spatial super-sampling (SGSSAA). Some textures benefit greatly from LOD bias visually, others do not.

(c.) Max LOD can be important for lightmaps to avoid bleeding?

(d.) Combine those with the typical per-sampler stuff like wrap mode, etc.
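A quick count shows how fast these properties multiply. The specific value sets below are illustrative assumptions, not measurements from any engine, and a real title would use only a subset of the combinations. The point is only that the space of distinct sampler states is far larger than 16.

```python
from itertools import product

# Illustrative value sets (assumptions, not taken from any real title).
anisotropy_levels = [1, 2, 4, 8, 16]
lod_biases = [-1.0, -0.5, 0.0]
max_lod_clamped = [False, True]           # e.g. clamped for lightmaps
wrap_modes = ["repeat", "clamp", "mirror"]

combos = list(product(anisotropy_levels, lod_biases, max_lod_clamped, wrap_modes))
num_states = len(combos)

# Bits needed to index every distinct sampler state.
bits_needed = (num_states - 1).bit_length()

print("distinct sampler states:", num_states)
print("index bits needed:", bits_needed, "(a 4-bit index covers 16)")
```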
