Continuing on the TS Blog Conversation Chain

Re Joshua Barczak - Texel Shader Discussion...

"I’m suggesting that the calling wave be re-used to service the TS misses (if any), so instead of waiting for scheduling and execution, it can jump into a TS and do the execution itself."

I'm going to attempt to digest what actually building this on something similar to current hardware would look like, and see where the pitfalls would be. Basically the PS stage shader gets recompiled to include conditional TS execution. This would roughly look like the following (a code sketch follows the list of steps),

(1.) Do some special IMAGE_LOADS which set a bit on miss in a wave bitmask stored in a pair of SGPRs.
(2.) Do standard independent ALU work to help hide latency.
(3.) Do S_WAITCNT to wait for IMAGE_LOADS to return.
(4.) Check if the bitmask in the SGPRs is non-zero, and if so do a wave-coherent branch to TS execution (this needs to activate inactive lanes).

Continuing with TS execution,

(5.) Loop while bitmask is non-zero.
(6.) Find first one bit.
(7.) Start the TS shader wave-wide, servicing the miss corresponding to the lane with the one bit.
(8.) Use the TEX return {x,y} to get an 8x8 tile coordinate to re-generate and {z} for mip level.
(9.) Do TS work and write results directly back into L1.
(10.) When (5.) ends, re-issue IMAGE_LOADS.
(11.) Do S_WAITCNT to wait for loads to return.
(12.) For any invocations which missed before but now hit, save off their results to other registers.
(13.) Check again if bitmask in SGPR is non-zero, if so go back to (5.).
(14.) Branch back to PS execution (which needs to re-disable the lanes which were inactive before (4.)).
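
To make the shape of this concrete, here is a rough sketch of the embedded TS servicing loop as CUDA-style code, with 32-wide warp intrinsics (__ballot_sync, __ffs, __shfl_sync) standing in for the 64-wide wave bitmask held in a pair of SGPRs. TileKey, imageLoadOrMiss(), and texelShaderTile() are made-up placeholders for the {x,y,mip} miss return, the special IMAGE_LOAD, and the TS body; none of this is a real instruction set or API.

struct TileKey { int x, y, mip; };

// Placeholders (declared only): the special IMAGE_LOAD which either returns
// the texel or the {x,y,mip} of the missing 8x8 tile, and the TS body which
// regenerates one tile and writes it back toward L1.
__device__ bool imageLoadOrMiss(float2 uv, float4* texel, TileKey* missKey);
__device__ void texelShaderTile(TileKey key);

__device__ float4 sampleWithEmbeddedTS(float2 uv)
{
    float4 texel;
    TileKey missKey;

    // (1.)-(3.) Issue the load (independent ALU work would sit here), then wait.
    bool hit = imageLoadOrMiss(uv, &texel, &missKey);

    // (4.) Wave-coherent miss check: the ballot is the warp analogue of the
    // miss bitmask held in a pair of SGPRs.
    unsigned missMask = __ballot_sync(0xFFFFFFFFu, !hit);

    while (missMask != 0u)                        // (13.) outer re-check
    {
        unsigned toService = missMask;
        while (toService != 0u)                   // (5.) loop over set bits
        {
            int lane = __ffs(toService) - 1;      // (6.) find first one bit

            // (7.)-(8.) Broadcast that lane's {x,y,mip} so the whole warp
            // regenerates the 8x8 tile it missed on.
            TileKey key;
            key.x   = __shfl_sync(0xFFFFFFFFu, missKey.x,   lane);
            key.y   = __shfl_sync(0xFFFFFFFFu, missKey.y,   lane);
            key.mip = __shfl_sync(0xFFFFFFFFu, missKey.mip, lane);

            texelShaderTile(key);                 // (9.) TS work

            toService &= ~(1u << lane);
        }

        // (10.)-(12.) Re-issue loads only for lanes that still need data;
        // lanes that already hit keep their results.
        if (!hit)
            hit = imageLoadOrMiss(uv, &texel, &missKey);
        missMask = __ballot_sync(0xFFFFFFFFu, !hit);
    }

    return texel;                                 // (14.) back to the PS body
}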

This kind of design has a bunch of issues, getting into a few of them,

(A.) Step (10.) has no post-load ALU before the S_WAITCNT, so it hides less of its own latency (even though it will hit in the cache).

(B.) Need to assume the texture can miss at the point where the wave is already at peak register usage in the shader, which implies the total VGPR allocation has to cover the PS peak plus whatever the TS needs. Given how often PS work is VGPR-pressure limited without any TS pass compiled in, this is quite scary. Cannot afford to save out the PS registers. Also cannot afford to build hardware to dynamically allocate registers at run-time just for TS (deadlock issues, worst case problem of too many waves attempting to run TS code paths at the same time, etc). So VGPR usage would be a real problem with TS embedded in PS.
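
To put rough, purely illustrative numbers on that (the VGPR counts below are hypothetical; the 256-VGPR-per-lane budget and the wave-per-SIMD cap are GCN-style rules of thumb):

// Illustrative only: hypothetical VGPR counts for a PS with and without an
// embedded TS, plus the GCN-style rule of thumb that a SIMD keeps roughly
// min(10, 256 / VGPRs-per-lane) waves resident.
#include <algorithm>
#include <cstdio>

static int wavesPerSimd(int vgprsPerLane)
{
    return std::min(10, 256 / vgprsPerLane);
}

int main()
{
    const int psAlone  = 48;       // hypothetical PS peak VGPR usage
    const int psPlusTs = 48 + 40;  // plus a hypothetical 40-VGPR TS body
    std::printf("PS alone  : %d waves/SIMD\n", wavesPerSimd(psAlone));   // 5
    std::printf("PS with TS: %d waves/SIMD\n", wavesPerSimd(psPlusTs));  // 2
}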

(C.) Need a hardware change to ensure TS results stay in pinned cache lines until the first access which serves a given invocation has finished. This way the IMAGE_LOAD in (10.) is ensured a hit, giving some guaranteed forward progress. There is a real problem that 8x8 tiles generated early in the (5.) loop might normally be evicted by the time all the data has been generated.

(D.) Consider random access at 64-bit/texel (aka fetching from 64 different 8x8 tiles) where all the loads miss. That's 64*8*8*8 bytes (32KB), or double the size of the L1 cache. There are multiple major, possibly terminal, design problems related to this, and it is what forces the wasteful (12.) step: the design needs to support the possibility that it cannot service all texture loads for 64 invocations in one pass.
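
Spelling out that worst-case arithmetic (the 16KB figure is the GCN-style vector L1 size):

// Worst case: every lane of a 64-wide wave misses a distinct 8x8 tile of a
// 64-bit/texel format, so the freshly generated data alone is double a
// 16KB vector L1.
constexpr int kLanes          = 64;
constexpr int kTexelsPerTile  = 8 * 8;
constexpr int kBytesPerTexel  = 8;
constexpr int kWorstCaseBytes = kLanes * kTexelsPerTile * kBytesPerTexel; // 32768
static_assert(kWorstCaseBytes == 2 * 16 * 1024, "double a 16KB L1");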

(E.) The TS embedded in PS option would lead to some extreme cases, like multiple waves missing on the same 8x8 tiles and possibly attempting to regenerate the same tiles in parallel.

(F.) The TS embedded in PS option would result in extreme variation in PS execution time, which would require more buffering in the in-order fixed-function ROP pipeline.


So my gut feeling is that this isn't practical.

I feel like many of these ideas fall into a similar design trap: borrowing the concept of "call and return" from CPUs. Devs have decades of experience solving problems by leaning on a "stack", depending on what seems like "free" saving of state and later restoring of state. That idea only applies to hardware which has tiny register files and massive caches. GPUs are the opposite: there is no room in any cache for saving working state. And the working state of a kernel is massive in comparison to the bandwidth used to fetch inputs and write outputs, so one never wants the working set to go off-chip. Any time anyone builds a GPU-based API which has a "return" or a "join", it is an immediate red flag for me. GPUs instead require "fire and forget" solutions: things that are "stackless", things which look more like message passing where a job never waits for the return. The message needs to include something which triggers whatever ultimately consumes the data, or the consumer is pre-scheduled to run and waits on some kind of signal which gates its launch.
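
A minimal sketch of what I mean by fire-and-forget message passing, again as CUDA-style code with made-up names (MessageQueue, TileRequest): the producer just appends a message and retires without waiting, and a separately scheduled consumer pass drains the queue with fresh waves, so no caller state ever has to be saved.

struct TileRequest { int x, y, mip; };

struct MessageQueue
{
    TileRequest* items;   // pre-allocated message storage
    unsigned*    count;   // device-side atomic append counter
};

// Producer side: post the message describing the work, then move on.
// The producing wave never blocks on the result.
__device__ void fireAndForget(MessageQueue q, TileRequest msg)
{
    unsigned slot = atomicAdd(q.count, 1u);
    q.items[slot] = msg;
}

// Consumer side: launched later (or pre-scheduled and gated on a signal),
// it walks the queue with fresh waves and regenerates the requested tiles.
__global__ void consumeTileRequests(MessageQueue q)
{
    unsigned n = *q.count;
    for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        TileRequest r = q.items[i];
        // ... regenerate the 8x8 tile described by r ...
        (void)r;
    }
}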
