Channel: Timothy Lottes

Random Thoughts on TS vs Alternatives

In the context of: Joshua Barczak - Thoughts on Texel Shaders.

Here is where my mind gets stuck when I think about any kind of TS-like stage...

Looking at a single compute unit in GCN,

256 KB of vector registers
16 KB of L1

If a shader occupies 32 waves (relatively good occupancy out of 40 possible), that is a tiny 512 bytes of L1 cache on average per wave. Dividing that out across the 64 lanes of a wave leaves just 8 bytes on average per invocation in the L1 cache. It is interesting to think about these 8 bytes in the context of how many textures a fragment (or pixel) shader invocation accesses in its lifetime. The ratio of vector registers to L1 cache is 16:1. This working-state-to-cache ratio is a strong indication that data lifetime in the L1 cache is typically very short: L1 serves to collect coherence in a relatively small temporal window, and the SIMD lockstep execution of a full wave guarantees the tight timing requirements. This suggests one likely could not cache TS stage results in L1. L2 is also relatively tiny in comparison to the amount of vector register state of the full machine...
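The per-invocation arithmetic above can be checked directly; all inputs are the GCN compute-unit numbers quoted in the text:

```python
# Ballpark L1 cache per invocation on one GCN compute unit (numbers from the text).
vector_regs_bytes = 256 * 1024   # 256 KB of vector registers per CU
l1_bytes = 16 * 1024             # 16 KB of L1 per CU
waves = 32                       # assumed occupancy (out of 40 possible)
lanes_per_wave = 64

l1_per_wave = l1_bytes / waves                     # 512 bytes per wave
l1_per_invocation = l1_per_wave / lanes_per_wave   # 8 bytes per invocation
reg_to_l1_ratio = vector_regs_bytes / l1_bytes     # 16:1

print(l1_per_wave, l1_per_invocation, reg_to_l1_ratio)  # 512.0 8.0 16.0
```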

Going back to R9 Nano ratios: 16 ops to 2 bytes to 1 texture fetch. The "op" in this context is a vector instruction (1 FMA instruction provides 2 flops). Let's work with the assumption of a balanced shader using those numbers. Say a shader uses 256 vector operations; it then has capacity for 16 texture fetches, and let's assume those 16 fetches are batched into 4 sets of 4 fetches. Simplify scheduling to exact round robin. Then simplify further by assuming 5 waves can magically always be scheduled (enough active waves to keep 5 function units busy: scalar, vector, memory, export, etc). Then simplify to an average texture return latency of 384 cycles (made that up). Given vector ops take 4 clocks, we can ballpark shader runtime as,

4 clocks per op * 256 operations * 5 waves interleaved + 4 batches of fetch * 384 cycles of latency
= 5120 + 1536
= 6.6 thousand cycles of run-time
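As a sanity check on the arithmetic (all inputs are the made-up numbers from the setup above):

```python
# Made-up balanced-shader runtime estimate from the text.
clocks_per_op = 4
ops = 256
waves_interleaved = 5
fetch_batches = 4
fetch_latency = 384   # cycles, invented average

alu_cycles = clocks_per_op * ops * waves_interleaved   # 5120
latency_cycles = fetch_batches * fetch_latency         # 1536
total = alu_cycles + latency_cycles
print(total)  # 6656, i.e. ~6.6 thousand cycles
```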

This made-up example is just to point out that adding a TS stage serves as an amplifier on the amount of latency a texture miss can take to service. Instead of pulling from memory, the shader waiting on the miss now waits on another program. Assuming TS dumps results to L2 (which auto-backs to memory),

Dump out arguments for TS shading request
Schedule new TS job
Wait until machine has resources available to run scheduled work (free wave)
Wait for shader execution to finish filling in the missing cache lines
Send back an ack that the TS job is finished
etc
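The steps above can be put into a toy latency model. Every overhead number here is invented; the point is only that a TS miss stacks scheduling overhead and a full shader execution on top of what would otherwise be a plain memory fetch:

```python
# Invented latencies, in cycles; only the relative structure matters.
memory_fetch = 384     # plain texture miss serviced from memory (earlier estimate)
dump_args = 50         # dump out arguments for the TS shading request
schedule_job = 100     # schedule the TS job, wait for a free wave
ts_shader_run = 6656   # the TS program itself (ballpark runtime from earlier)
ack = 50               # send back the ack that the TS job finished

plain_miss = memory_fetch
ts_miss = dump_args + schedule_job + ts_shader_run + ack
print(ts_miss, ts_miss / plain_miss)  # 6856 cycles, roughly 18x a plain miss
```

And this is before compounding: if the TS shader itself misses on another procedural texture, the whole chain nests again.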

If a TS shader can access a procedural texture, in theory that TS shader could also miss, resulting in a compounding amount of latency. The 16:1 ratio of vector registers to L1 cache hints at another problem: the shader has a huge amount of state. Any attempt to save out wave state and later restore it (for a wave which needs to sleep for many 1000's or maybe many 10000's of cycles while a TS shader services a miss) is likely to use more bandwidth for the save/restore than the shader itself would use to fetch textures running without a TS stage. Ultimately this suggests it would be better to service expected TS misses long before a shader runs, instead of attempting to service them reactively while a shader is running...
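To see why save/restore looks worse than just fetching, compare bytes moved per wave. The register numbers come from the occupancy example earlier; the bytes-per-texel figure is an assumption of mine, not from the text:

```python
# Per-wave register state if 32 waves split 256 KB of vector registers.
wave_state = (256 * 1024) // 32          # 8192 bytes of registers per wave
save_restore_traffic = 2 * wave_state    # write out, then read back: 16 KB

# Texture traffic for the balanced shader: 16 fetches per invocation,
# 64 lanes, assuming ~4 bytes per fetched texel (assumption, not from the text).
texture_traffic = 16 * 64 * 4            # 4096 bytes per wave

print(save_restore_traffic, texture_traffic)  # 16384 4096
```

Under these assumptions, one save/restore round trip alone moves 4x the data of all the texture fetches the shader would have done.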

The majority of visual coherence is temporal, not spatial. Comparing compression ratios of video to still image provides an idea of the magnitude. It might be more powerful to engineer around enabling temporal coherence instead of just very limited spatial coherence. This suggests the optimal end game caches all the way through to DRAM in some kind of view-independent parameterization, to enable some amount of reuse across frames in the common case. This could also be a major stepping stone in decoupling shading rate from both refresh rate and screen resolution. Suggesting again a pipeline which caches what would be TS results across frames...

Gut feeling based on a tremendous amount of hand waving is pointing to something which doesn't actually need any new hardware, something which can be done quite well on existing GCN GPUs for example. A unique virtual shading cache, shaded in the same 8x8 texel tiles one might imagine for TS shaders, but in this case shaded async in CS instead. With a background mechanism which actively prunes and expands the tree structure of the cache based on the needs of view visibility. Each 8x8 tile carries a high precision {scale, translation, quaternion}, paired with a compressed texture feeding a 3D object-space displacement, providing texel world-space position for rigid bodies or pre-fabs. Skinned objects perhaps have an additional per-tile bone list, per-tile base weights, and a compressed texture feeding per-texel modifications to the base weights. Lots of shading complexity is factored out into per-tile work. For example, with traditional lights, one can cull lights to fully in/out of shadow to skip shadow sampling. Each frame can classify the highest priority tiles which need update, then shade them: tiles with actively changing shadow, tiles reflecting quickly changing specular, etc.
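The per-tile payload described above might look something like the following. This is a speculative sketch: the field names, the texture-handle representation, and the budget-based selection function are my own; only the {scale, translation, quaternion}, displacement texture, optional skinning data, and priority-driven update come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ShadingTile:
    """One 8x8 texel tile in the hypothetical virtual shading cache."""
    # High-precision rigid transform: gives texel world-space position
    # for rigid bodies or pre-fabs.
    scale: float
    translation: Tuple[float, float, float]
    quaternion: Tuple[float, float, float, float]
    # Handle to a compressed texture feeding 3D object-space displacement
    # (integer handle is an assumption for this sketch).
    displacement_texture: int
    # Skinned objects only: per-tile bone list and base weights, plus a
    # compressed texture of per-texel modifications to the base weights.
    bone_list: Optional[List[int]] = None
    base_weights: Optional[List[float]] = None
    weight_delta_texture: Optional[int] = None
    # Per-frame update priority (actively changing shadow, fast specular, ...).
    update_priority: float = 0.0

def select_tiles_for_update(tiles: List[ShadingTile], budget: int) -> List[ShadingTile]:
    """Each frame: classify by priority and shade only the top tiles within budget."""
    return sorted(tiles, key=lambda t: t.update_priority, reverse=True)[:budget]
```

A background job would then walk the cache tree, prune tiles no longer visible, expand where visibility demands more resolution, and feed `select_tiles_for_update` each frame.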
