Been quite distracted for a few days, but I did manage to throw something together for an OpenGL compute merge sort. Sorting 3840 keys per block, I'm at around 1.3B to 1.7B 32-bit keys/sec on a 560 Ti (depending on how I factor out the graphics-to-compute context switch time). So block-wise sorting is looking good, and I have not optimized yet (haha, or validated yet other than visually, so hopefully the results are not in error).
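Since the block sort has only been checked visually so far, a CPU-side check is cheap insurance. The following is a minimal sketch, not the actual test harness: it assumes each block sorts its own contiguous slice of 3840 keys in place, and that both the input and the GPU output have been read back into Python lists. The name `blocks_are_sorted` is hypothetical.

```python
BLOCK_KEYS = 3840  # keys per block, as stated in the post


def blocks_are_sorted(keys_in, keys_out, block_keys=BLOCK_KEYS):
    """Return True if every block of keys_out is the sorted version of the
    same block of keys_in (checks both ordering and that no key was lost).

    Assumption: block i of the output covers the same index range as
    block i of the input.
    """
    if len(keys_in) != len(keys_out):
        return False
    for start in range(0, len(keys_in), block_keys):
        src = keys_in[start:start + block_keys]
        dst = keys_out[start:start + block_keys]
        # Comparing against sorted(src) catches misordering and dropped or
        # duplicated keys in one step.
        if list(dst) != sorted(src):
            return False
    return True
```

Feeding it random 32-bit keys plus a few edge cases (all equal keys, already sorted, reverse sorted, a partial last block) should flush out most block-level bugs.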
Global merging of the sorted blocks is currently failing (there's an error in the code) and running approximately 3x slower than it should (I'm at 18% of bandwidth saturation per pass now). I'll fix this later.
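For debugging the global merge, a slow CPU reference to diff against can help. This is only a sketch under an assumption the post does not state: that each global pass is a plain two-way merge of adjacent sorted runs, doubling the run length every pass. The function name `global_merge_passes` is hypothetical.

```python
from heapq import merge


def global_merge_passes(keys, run_len):
    """Merge adjacent sorted runs pairwise until a single run remains.

    keys    : sequence already sorted within each run of run_len keys
    run_len : length of the initial sorted runs (e.g. 3840)
    Returns (fully_sorted_keys, number_of_passes).
    """
    keys = list(keys)
    passes = 0
    while run_len < len(keys):
        merged = []
        # Each pass walks the whole array once, merging runs two at a time.
        for start in range(0, len(keys), 2 * run_len):
            left = keys[start:start + run_len]
            right = keys[start + run_len:start + 2 * run_len]
            merged.extend(merge(left, right))
        keys = merged
        run_len *= 2
        passes += 1
    return keys, passes
```

Dumping the GPU buffer after each pass and comparing it with the matching intermediate `keys` here narrows a failure down to a specific pass.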
After measuring performance, I expect that once the global merge is fixed and optimized, the total cost for around 128K keys will be under 0.25 ms on the 560 Ti, split roughly 50% in the block sort and 50% in the global merge passes. This simple merge sort is going to work just fine for what I need.
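The estimate can be sanity-checked with back-of-the-envelope arithmetic. The sketch below leans on several assumptions that are not in the post: "128K" is read as 131072 keys, each global pass is a two-way merge that reads and writes every key once, peak memory bandwidth on the 560 Ti is taken as roughly 128 GB/s (spec-sheet figure), and "3x off" is read as an optimized pass reaching about 54% of peak.

```python
import math

KEYS = 128 * 1024    # "128K" read as 131072 keys (assumption)
BLOCK_KEYS = 3840    # keys per block, from the post
BYTES_PER_KEY = 4    # 32-bit keys, from the post
PEAK_BW = 128e9      # bytes/s, assumed spec-sheet bandwidth for a 560 Ti


def block_sort_ms(keys, keys_per_sec):
    """Time for the block sort at a given measured throughput."""
    return keys / keys_per_sec * 1e3


def merge_ms(keys, bw_fraction, peak_bw=PEAK_BW):
    """Time for all global merge passes at a fraction of peak bandwidth.

    Assumes two-way merging, so passes = ceil(log2(number of blocks)).
    """
    blocks = math.ceil(keys / BLOCK_KEYS)
    passes = math.ceil(math.log2(blocks))
    bytes_per_pass = 2 * keys * BYTES_PER_KEY  # one read + one write per key
    return passes * bytes_per_pass / (bw_fraction * peak_bw) * 1e3


if __name__ == "__main__":
    print("block sort ms:", block_sort_ms(KEYS, 1.7e9), "to", block_sort_ms(KEYS, 1.3e9))
    print("merge ms at 18% of peak:", merge_ms(KEYS, 0.18))
    print("merge ms at 54% of peak:", merge_ms(KEYS, 0.54))
```

Under these assumptions the block sort comes to roughly 0.08 to 0.10 ms, and six merge passes come to about 0.27 ms at 18% of peak or about 0.09 ms at 54%. That puts the optimized total a bit under 0.2 ms with a split close to even, which is in line with the expectation above.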