Possible to do the 64-bit single pass method from OIT in GL4.x with 32-bit atomics by packing {16-bit depth, 16-bit color channel} in 32-bits, where color channel is on a Bayer grid. Would require a demosaic.
Benjamin 'BeRo' Rosseaux posted an example of hybrid atomic loop weighted blended order independent transparency implementation which mixes the 2 pass algorithm from the "OIT in GL4" with depth weighted blending for the tail.
If rastering in compute, might be possible to do a single pass version of "OIT in GL4" with only 32-bit atomics by packing {16-bit depth, RG},{16-bit depth, BD} into a pair of 32-bit values. Where RGBD encodes HDR. The trick is leveraging {even,odd} pairs of invocations (aka, threads) in compute to do atomics to a pair of 32-bit values where the pair is 64-bit aligned. API does not ensure that the pair of 32-bit atomics happens atomically, but in practice I believe desktop hardware (AMD/NV) will do that anyway. Clearly doing the pair of 32-bit atomics in one invocation in serial won't work.
I still prefer stochatic methods with spatial+temporal post filtering. One such method with a K entry array per pixel, is to use a per pixel atomic to grab array index then store out {depth, color, alpha} packed into a 64-bit value. Post process does an in-register sorting network to sort, reduces to one {color,alpha}, then onto filtering, then later composite with opaque. With only K bins per pixel, it is possible to overflow, but there is a biased method to avoid overflow: stochastically do more agressive dropping of the fragment if "alpha < threshold(gl_FragCoord,K_index)". Where the dither pattern is based on gl_FragCoord, and the K_index is read (only do the atomic if the fragment is not dropped). The threshold progressively increases as K_index approaches the max K. Just a very rough front to back sort could help this out a bit (could amortize the sorting to a multipass algorithm, only doing a bit of the sort per frame). This stochatic method could be interesting if "depth" for the fragment is stochastically set to somewhere in the probability of the volume which a billboard represents. With proper noise filtering could remove the billboard order change pop problem. Also with NVIDIA's NV_shader_thread_group extension it might be possible to reduce a per fragment atomic to a 2x2 fragment quad atomic for this kind of algorithm (but not the "OIT in GL4" algorithm).
Benjamin 'BeRo' Rosseaux posted an example of hybrid atomic loop weighted blended order independent transparency implementation which mixes the 2 pass algorithm from the "OIT in GL4" with depth weighted blending for the tail.
If rastering in compute, might be possible to do a single pass version of "OIT in GL4" with only 32-bit atomics by packing {16-bit depth, RG},{16-bit depth, BD} into a pair of 32-bit values. Where RGBD encodes HDR. The trick is leveraging {even,odd} pairs of invocations (aka, threads) in compute to do atomics to a pair of 32-bit values where the pair is 64-bit aligned. API does not ensure that the pair of 32-bit atomics happens atomically, but in practice I believe desktop hardware (AMD/NV) will do that anyway. Clearly doing the pair of 32-bit atomics in one invocation in serial won't work.
I still prefer stochatic methods with spatial+temporal post filtering. One such method with a K entry array per pixel, is to use a per pixel atomic to grab array index then store out {depth, color, alpha} packed into a 64-bit value. Post process does an in-register sorting network to sort, reduces to one {color,alpha}, then onto filtering, then later composite with opaque. With only K bins per pixel, it is possible to overflow, but there is a biased method to avoid overflow: stochastically do more agressive dropping of the fragment if "alpha < threshold(gl_FragCoord,K_index)". Where the dither pattern is based on gl_FragCoord, and the K_index is read (only do the atomic if the fragment is not dropped). The threshold progressively increases as K_index approaches the max K. Just a very rough front to back sort could help this out a bit (could amortize the sorting to a multipass algorithm, only doing a bit of the sort per frame). This stochatic method could be interesting if "depth" for the fragment is stochastically set to somewhere in the probability of the volume which a billboard represents. With proper noise filtering could remove the billboard order change pop problem. Also with NVIDIA's NV_shader_thread_group extension it might be possible to reduce a per fragment atomic to a 2x2 fragment quad atomic for this kind of algorithm (but not the "OIT in GL4" algorithm).