For years now I have found that nearly everything I work on can be made better by leveraging ISA features which are not always exposed in all the graphics APIs. For example, I am currently working on a project which could use a combination of the following,
(1.) From AMD_shader_trinary_minmax, max3(). Direct access to the max of three values in a single V_MAX3_F32 operation. If the GPU has 3 read ports on the register file for FMA, might as well take advantage of that for min/max/median. AMD's DX driver shader compiler automatically optimizes these cases, for example "min(x,min(y,z))" gets transformed to "min3(x,y,z)".
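For reference, the three-input operations can be written as a portable fallback in plain C. This is only a sketch of the semantics, not how any driver implements them: min2() and max2() are helper names I made up, and NaN handling is ignored.

```c
/* Two-input helpers (hypothetical names, NaN behavior not modeled). */
float min2(float a, float b) { return a < b ? a : b; }
float max2(float a, float b) { return a > b ? a : b; }

/* Three-input forms. On hardware with a 3-operand min/max/median these
   would each be one instruction instead of two or four. */
float min3(float x, float y, float z) { return min2(x, min2(y, z)); }
float max3(float x, float y, float z) { return max2(x, max2(y, z)); }

/* Median of three: clamp z to the range [min(x,y), max(x,y)]. */
float mid3(float x, float y, float z) {
    return max2(min2(x, y), min2(max2(x, y), z));
}
```

The fallback for mid3() costs four two-input operations, which is where a single three-input instruction pays off the most.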
(2.) Direct exposure of V_SIN_F32 and V_COS_F32, which have a range of +/- 512 PI and take normalized input. Avoids an extra V_MUL_F32 and V_FRACT_F32 per operation. Nearly all the time I use sin() or cos() I'm in range (no need for V_FRACT_F32). Nearly all the time I'm in the {0 to 1} range for 360 degrees, and need to scale by 2 PI only so code generation can later scale back by 1/(2 PI). A portable fallback for machines without V_SIN_F32 and V_COS_F32 like functionality looks like,
float sinNormalized(float x) { return sin(x * 2.0 * PI); }
float cosNormalized(float x) { return cos(x * 2.0 * PI); }
(3.) Branching if any or all of the lanes in the SIMD vector want to do something. Massively important tool to avoid divergence. For example in a full screen triangle, if any pixel needs the more complex path, just have the full SIMD vector only do the complex path instead of divergently processing both complex and simple. The API can be quite simple,
bool anyInvocations(bool x)
bool allInvocations(bool x)
Example of how these could map in GCN (these scalar instructions execute in parallel with vector instructions, so low cost),
// S_CMP_NEQ_U64 x,0
// S_CBRANCH_SCCNZ
if(anyInvocations(x)) { }
// S_CMP_EQ_U64 x,-1
// S_CBRANCH_SCCNZ
if(allInvocations(x)) { }
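To make the semantics concrete, here is a small CPU-side model of the two votes in plain C. It is a sketch under assumptions: a 64-lane wave, one condition bit per lane packed into a 64-bit mask, and an explicit mask of active lanes (the "exec" parameter, the LaneMask type, and pickPath() are names I introduced, they are not part of the proposed API above).

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per lane of a 64-wide wave (assumed width). */
typedef uint64_t LaneMask;

/* True if at least one active lane has its condition bit set. */
bool anyInvocations(LaneMask cond, LaneMask exec) {
    return (cond & exec) != 0;
}

/* True if every active lane has its condition bit set. */
bool allInvocations(LaneMask cond, LaneMask exec) {
    return (cond & exec) == exec;
}

enum { PATH_SIMPLE = 0, PATH_COMPLEX = 1 };

/* Hypothetical helper: the whole wave takes one path, so no divergence.
   If any lane needs the complex path, every lane runs it. */
int pickPath(LaneMask needsComplex, LaneMask exec) {
    return anyInvocations(needsComplex, exec) ? PATH_COMPLEX : PATH_SIMPLE;
}
```

With all 64 lanes active the "exec" mask is all ones, and the two votes reduce to the compare against 0 and compare against -1 in the GCN mapping above.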
(4.) Quad swizzle for fragment shaders for cross-invocation communication is super useful. Given a 2x2 fragment quad as follows,
01
23
These functions would be quite useful (they map to DS_SWIZZLE_B32 in GCN),
// Swap value horizontally.
type quadSwizzle1032(type x)
// Swap value vertically.
type quadSwizzle2301(type x)
For example one could simultaneously write out the results of a fragment shader to the standard full screen pass and write out the 1/2 x 1/2 resolution next smaller mip level at the same time using an extra image store. Just use the following to do a 2x2 box filter in the shader,
boxFilterColor = quadSwizzle1032(color) + color;
boxFilterColor += quadSwizzle2301(boxFilterColor);
// Scale the 4-sample sum to get the average.
boxFilterColor *= 0.25;
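The two-step reduction can be checked with a small CPU-side model in plain C. This is a sketch, not shader code: the Quad struct (one float per fragment of the 2x2 quad), quadAdd(), and quadBoxFilter() are names I made up for illustration, and the final scale by 0.25 is my assumption that the box filter should produce an average rather than a sum.

```c
/* One value per fragment of a 2x2 quad, indexed as
     0 1
     2 3                                              */
typedef struct { float lane[4]; } Quad;

/* Swap value horizontally: lane 0<->1, lane 2<->3. */
Quad quadSwizzle1032(Quad x) {
    Quad r = {{ x.lane[1], x.lane[0], x.lane[3], x.lane[2] }};
    return r;
}

/* Swap value vertically: lane 0<->2, lane 1<->3. */
Quad quadSwizzle2301(Quad x) {
    Quad r = {{ x.lane[2], x.lane[3], x.lane[0], x.lane[1] }};
    return r;
}

/* Per-lane add, standing in for the vector ALU. */
Quad quadAdd(Quad a, Quad b) {
    Quad r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Horizontal pair sum, then vertical pair sum, then average.
   Every lane ends up holding the same 2x2 box-filtered value. */
Quad quadBoxFilter(Quad color) {
    Quad sum = quadAdd(quadSwizzle1032(color), color);
    sum = quadAdd(quadSwizzle2301(sum), sum);
    for (int i = 0; i < 4; i++) sum.lane[i] *= 0.25f;
    return sum;
}
```

Since all four lanes hold the same result after the second step, any one of them (say lane 0) can do the image store for the half resolution mip.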