Channel: Timothy Lottes

On Killing WIN32?

Many years ago I was a dedicated reader of Ars, but it slowly transitioned into something a little too biased for my taste, so now I avoid it. Thanks to Twitter, though, it is still possible to get sucked into a highly controversial article: "Tim Sweeney claims that Microsoft will remove Win32, destroy Steam".

My take on this is quite simple. Everyone in this industry who has lived long enough to have programmed in the C64 era has witnessed a universal truth on every mass-market platform: the freedom and access the user or programmer has to the computer is reduced annually, at a rate which roughly scales with the complexity of the software and hardware.

The emergent macro-level behavior is undeniable. Human nature is undeniable. It is possible to continuously limit freedom as long as it is done slowly enough that each micro-level regression of freedom falls under the instantaneous tolerance required to act on it. Translated: humans are lazy, humans adapt fast, and humans don't live long. Each new generation lacks the larger perspective of the last, and starts ignorant of what has been lost.

The reason why computers and freedom are so important is that computers are on a course toward deeper and deeper integration with our lives. I believe ultimately humans will transcend the limits of our biology, blurring the lines between mind and machine. It seems rather important at that apex to have the individual freedoms we have today, the privacy of our thoughts, and so on.

In the short term, as a developer, I'm also somewhat concerned about whether the infants who will grow up to replace the generation I started in will have the same opportunities I had: the same access to the hardware, the freedom to implement their dreams, and, if they choose, the ability to make a living doing so in a free market, controlling their own destiny and selling their own product, without a larger controlling interest gating that process.

WIN32 is one such manifestation of that freedom.

There are some very obvious trends in the industry, specifically in the layers of complexity being introduced in hardware or software. For example, virtualization in hardware mixed with more attempts to sandbox software. Or the increased distance one has to the display hardware: look at VR, where you as an application developer are locked out of the display and have to pass through a graphics API interop layer which does a significant amount of extra processing in a separate process. Or perhaps the "service-ification" of software into subscription models. Or perhaps the HDR standard removing your ability to control tone-mapping. Or perhaps it is just the complexity of the API which makes it no longer practical to do what was done before, even if it is still actually possible.

Following the trends to their natural conclusion perhaps paints a different picture for system APIs like WIN32. They don't go away per se; they just get virtualized behind so many layers that it becomes impossible to gain the advantages those APIs had when they were direct. That is one of the important freedoms which is eventually lost.

One of the best examples of this phenomenon is how the new generation perceives old arcade games. Specifically, as games with incorrect color (CRT gamma around 2.5 being presented as sRGB without conversion), giant exactly-square pixels (which never happened on CRTs), dropped frames (arcades had crystal-clear, no-jitter, v-synced animation), high-latency input due to emulation in a browser for example (arcade input was instant in contrast), more latency due to swap-chains added in the program (arcade hardware generated images on scan-out), added high-latency displays (HDTVs and their +100 milliseconds, vs instant CRTs), and poor button and joystick quality (arcade controls are a completely different experience). Everything which made arcades awesome was lost in the emulation translation.

Returning to the article, I don't believe there is any risk of WIN32 being instantly deprecated, because if that were to happen, it would be a macro-level event well beyond the tolerance level required to trigger action. The real risk is the continued slow extinction.

Vulkan - How to Deal With the Layouts of Presentable Images

Continuing my posts on building a Vulkan-based Compute-Generates-Graphics engine from scratch (no headers, no libraries, no debug tools, no using existing working code)...

Interesting to Google something and already get hits for Vulkan questions on Stack Overflow: "How to Deal With the Layouts of Presentable Images". It turns out one of the frustrating aspects of Vulkan is the WSI, or presentation interface. Three specific things make this a pain in the butt, quoting from the Vulkan spec.

(1.) "Use of a presentable image must occur only after the image is returned by vkAcquireNextImageKHR, and before it is presented by vkQueuePresentKHR. This includes transitioning the image layout and rendering commands."

(2.) "The order in which images are acquired is implementation-dependent. Images may be acquired in a seemingly random order that is not a simple round-robin."

(3.) "Let n be the total number of images in the swapchain, m be the value of VkSurfaceCapabilitiesKHR::minImageCount, and a be the number of presentable images that the application has currently acquired (i.e. images acquired with vkAcquireNextImageKHR, but not yet presented with vkQueuePresentKHR). vkAcquireNextImageKHR can always succeed if a<=n-m at the time vkAcquireNextImageKHR is called. vkAcquireNextImageKHR should not be called if a>n-m ..."

The last part (3.) roughly translates into not being guaranteed the ability to acquire all images at any one time. Putting all these constraints together means the following are impossible,

(A.) No way to robustly pre-transition all images into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before entering normal run-time conditions. Instead you have to special-case the transition the first time acquire returns a given index. IMO this adds unnecessary complexity for absolutely no benefit, and makes it really easy to introduce bugs. I've seen online Vulkan examples violate rule (1.).

(B.) No way to ensure a simplified round-robin order even in cases where it is physically impossible to get anything other than round-robin (such as full-screen flip with v-sync on and a 2 deep swap chain).

Working Around the Problem
This problem infuriates me personally because of all the time wasted adding complexity for no benefit. Likewise, being forced to double buffer instead of rendering to the front buffer is a large waste of time for a regression in latency. Since my engine is command buffer replay based (no command buffers are generated after init-time), I ended up needing 8 baked command buffer permutations.

(1.) Even frame pre-acquire.
(2.) Odd frame pre-acquire.
(3.) Even frame post-acquire image index 0.
(4.) Even frame post-acquire image index 1.
(5.) Odd frame post-acquire image index 0.
(6.) Odd frame post-acquire image index 1.
(7.) Transition from UNDEFINED to PRESENT_SRC for image index 0.
(8.) Transition from UNDEFINED to PRESENT_SRC for image index 1.

The workaround I have for needing to special-case transitions on the first acquire of a given index is to run the transition-from-UNDEFINED command buffer instead of the one which normally draws into the frame. So there is a possibility of randomly seeing one extra black frame after init time. IMO this is all throw-away code anyway once I can get some kind of front-buffer access.

Bugs
Interesting to look back at the bugs I had to deal with en route to getting the basic example of a compute shader rendering into a back-buffer. Really only 2 bugs. In one I forgot to call vkGetDeviceQueue(), which was trivial to find and fix. The other was that when creating the swap chain I accidentally set imageExtent.width to the height and left imageExtent.height at zero. No amount of type-checking would ever help in finding that bug. I didn't see any errors, so it took a while of re-inspecting the code to see what I had screwed up.

In hindsight, after knowing what to do, using Vulkan was actually quite easy.

Blink Mechanic for Fast View Switch for VR

As seen in the SIGGRAPH Realtime section for Bound for PS4 PSVR, around 42 minutes into this video. Great to see someone making good use of the "blink mechanic" to quickly switch view in VR. The scene quickly transitions to black to simulate the eyelid closing, followed by fading back into a new view, simulating the eyelid opening.

My recommendation for the ideal VR display interface used this mechanic: specifically, "blink" to hide the transition of exclusive ownership of the Display between a "Launcher / VR Portal" application and the game. The advantages of an exclusive app-owned Display for VR on PC would have been tremendous. For instance, it then becomes possible to,

(1.) Fully saturate the GPU. No more massive sections of GPU idle time.

(2.) Render directly to the post-warped final image for a 2x reduction in shaded samples for Compute generated Graphics Non-Triangle based rendering, and pixel perfect image quality.

(3.) Factor out shading to Async Compute, and only generate the view right before v-sync. Rendering just in time is better than time-warp: no more incorrect transparency, no more doubling of visual error for dynamic objects which are moving differently than the head camera tracking.

(4.) Race the beam for the ultimate in low latency.

(5.) Great MGPU scaling (app owned display cuts MGPU transfer cost by 4x).

(6.) Have any GPU programming API, even compute APIs, work well in VR without complex cross-process interop.

(7.) Etc.

Ultimately no one in the PC space implemented this, and thus all my R&D on the ultimate VR experience got locked out and blocked by external-process VR compositors, pushing me personally out of VR and back to flat 3D, where I can still actually push the limits, with good frame delivery, without artifacts, and with perfect image quality.

Vulkan From Scratch Part 2

Continuing posting when I find some time to work on the "from-scratch" Vulkan engine...

Review From Last Time
Bringing up on Windows first this time, will get to Linux later. Got basic system interface without non-system libraries on Windows. Still have {gamepad, audio, usb} interfaces to bring up. Got basic Vulkan window rendering full-screen compute PSO into back-buffer. Switched to always using UNDEFINED layout for source for back buffer to remove a permutation in the command buffer baking. Have new simplified rapid prototyping working.

Mechanics of Batch File
No "make" system, just a simple shell script to run the development cycle per platform. I include the source of all the helper programs inside the single source file, then comment them out in the shell script unless I need to recompile them, using #define's to control what actually gets compiled. There are two helpers. "glsl.exe" takes the single source file and prefixes it with "#version 450" for use as GLSL shader source, outputting "tmp.comp". Then "head.exe" converts the "comp.spv" output of "glslangValidator.exe" into a header file. The argument to both is the shader ID, for which I use a two-digit number. The shell script compiles shaders, cleans up temporaries, compiles the program (aka "rom.exe"), runs the program, and depending on the exit code relaunches or quits. The script below has only one shader currently,

@echo off
@rem cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DGLSL_ /Feglsl.exe /Tprom.c
@rem cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DHEAD_ /Fehead.exe /Tprom.c
:loop
glsl.exe 00
glslangValidator.exe -V tmp.comp
head.exe 00
del tmp.comp
del comp.spv
cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DGAME_ /Ferom.exe /Tprom.c
rom.exe
if /i %ERRORLEVEL% equ 0 goto :eof
goto :loop

Mechanics of Graphics Abstraction Interface
This should be thought of as work in progress as it will change during bring up. I'm posting this because it shows just how simple a Compute-Generates-Graphics style engine can be in Vulkan. There are exactly 2 Descriptor Sets, one for the even and one for the odd frames. They contain everything, so no need to ever think about "binding" anything.

// ROUGH IDEA OF API

// Initialize the Vulkan interface.
// Inputs are,
// (1.) Descriptor pool setup.
// (2.) Descriptor set layout.
{ S_ VkDescriptorPoolSize count[1]={
{ VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,2 } };
S_ VkDescriptorSetLayoutBinding bind[1]={
{ 0,VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,1,VK_SHADER_STAGE_COMPUTE_BIT } };
GfxInit(count,1,bind,1); }

// Used to compile PSOs (this eventually will be parallelized when necessary).
// Specialization constants.
// This example passes in the size of the frame buffer.
F4 con[]={ F4_(wndR->x), F4_(wndR->y) };
// Include the shader source files.
#include "s00.h"
// Generate PSOs.
U8 pso[1];
pso[0]=GfxPso(s00_,sizeof(s00_),con,sizeof(con)>>2);

// Build the baked command buffers.
// Leaving a lot out here, will return later when this is cleaned up more.
U8 cmd=GfxBegin(cmdIdx);
...
GfxEnd(cmd);

// Loop forever replaying the groups of {even, odd} command buffers.
GfxLoop();

Specialization Constants
Absolutely great feature to have in the API. Set up constants which can have overrides applied at PSO compile-time. Enables factoring evaluation of various expressions out to compile-time, setting array sizes, etc. Perfect for my setup, as I can just pass in the frame size (and later other important things). It translates to this in the current dummy bring-up shader,

 // Part of what I use for cleaner types...
#define F4 float
#define F4x2 vec2
#define F4x3 vec3
#define F4x4 vec4
...

// Specialization constants with default values.
layout(constant_id=0) const F4 SCRX_=1920.0;
layout(constant_id=1) const F4 SCRY_=1080.0;

// Bind "everything", which thus far is just the back buffer.
layout (set=0,binding=0,rgba8) writeonly uniform image2D img[1];

// Showing a specific shader, in this case the dummy shader for bring up.
#ifdef S00_
layout (local_size_x=16,local_size_y=16) in;
void main() { imageStore(img[0],S4x2(gl_GlobalInvocationID.xy),
F4x4(F4(gl_GlobalInvocationID.x)*(1.0/SCRX_),F4(gl_GlobalInvocationID.y)*(1.0/SCRY_),1.0,1.0)); }
#endif

What's Next
Setting up a double-buffered SSBO for the even/odd frames of game state. Each frame reads from the prior frame's game state and builds the new frame's game state. Everything is updated GPU-side via compute dispatches. Yes, this burns a whole wave to do some scalar processing once in a while (like computing the new player position and view-matrix, etc.), but in practice the scalar work is in the noise in terms of run-time, so it doesn't matter.

Then on to simple CPU->GPU uploads (for late latched IO) and GPU->CPU downloads (for saving current state, etc)...

Uber Shader Unrolling

Looking at running a compute only pass, no graphics waves to contend with on the machine, so it becomes relatively easy to think about occupancy. Target 4 waves per SIMD via a 4 wave work-group (one work-group per SIMD unit). That provides 4 work-groups sharing a Compute Unit (CU) and L1 cache. This is only 16/40 occupancy, but enough in theory to maintain a good amount of multi-issue of waves to functional units. Each wave gets 64 VGPRs, each work-group gets 16KB LDS (16 32-bit words/invocation on average).

In this fixed context, one can leverage compile-time unrolling to manage variable register allocation for different sub-shaders in an uber-shader. Unrolling as in running more than one instance of the shader in the uber-shader at a given time.

Unroll 2 in parallel = 32 VGPRs, 8 words LDS
Unroll 3 in parallel = 21 VGPRs, 5 words LDS
Unroll 4 in parallel = 16 VGPRs, 4 words LDS

But it doesn't have to be this fixed. One can have a variable blending of N parallel instances, meaning as register usage starts to drain from one instance, start ramping up the next. This also enables instances to share intermediate computations.

This more "pinned task" model with unrolling would in theory, in some cases (maybe really short shaders like particle blending), allow better utilization of the machine than separate kernel launches for everything. During shader start, as the shader ramps up, the VGPRs allocated to it are under-utilized. During ramp-down towards exit, VGPRs are also under-utilized. Unrolling can blend the fill and drain.

Clearly there is also a question of unrolling out of instruction cache.

GPU Parking Lot

Push Model
Perhaps the GPU parking lot, aka the register file waiting on long-latency returns, is a side effect of not having the ability to issue a load which pushes data to a different SIMD unit's register file? If loads could be issued and returned somewhere else, one could possibly split a problem into 2 components: the part figuring out how to route memory traffic, and the part consuming the memory traffic. No call and return, thus no parking of state after loads.

Transistor Count Thoughts

Wikipedia's Transistor Count Page
Really interesting page on Wikipedia. Amazing how many of the original cache-free Acorn RISC Machines will fit in the transistor budget of modern processors. The rest of this post is some high level thinking about the compromises required to scale ALU density upwards by simplifying and shrinking core size down to something sized like an ARM2.

_____________________ARM2 ~ _______30,000 transistors ~ ______1 ARM2
____________________80386 ~ ______275,000 transistors ~ ______9 ARM2
__________________Pentium ~ ____3,100,000 transistors ~ ____103 ARM2
____________1st Pentium 4 ~ ___42,000,000 transistors ~ __1,400 ARM2
_____________________Cell ~ __241,000,000 transistors ~ __8,033 ARM2
_____________Apple A8 SOC ~ 2,000,000,000 transistors ~ _66,666 ARM2
22-core Xeon Broadwell-E5 ~ 7,200,000,000 transistors ~ 240,000 ARM2
________________AMD FuryX ~ 8,900,000,000 transistors ~ 296,666 ARM2


Dividing the 512 GB/s of external bandwidth of the FuryX across a variable number of ARM2-sized cores clocked at 1 GHz suggests that, as on-chip ALU count scales beyond what a GPU can do, cores must mostly consume on-chip generated data. Also, given that GPU on-chip routing networks typically have only some small integer multiple of off-chip bandwidth, this suggests not the classic GPU formula for production and consumption of on-chip data (meaning not routing through some coherent L2, but rather neighbor-to-neighbor, or very localized).

___1 ARM2 ~ 512 GB/s ~ 512 B/op
__16 ARM2 ~ _32 GB/s ~ _32 B/op
_256 ARM2 ~ __2 GB/s ~ __2 B/op
__4K ARM2 ~ 128 MB/s ~ __8 op/B <--- FuryX is a 4K core GPU
_64K ARM2 ~ __8 MB/s ~ 128 op/B
256K ARM2 ~ __2 MB/s ~ 512 op/B


Below, looking at this from another perspective: taking 6 G transistors for SRAM cells, dividing them into N cores, and looking at the limit of SRAM bytes per core (counting nothing beyond 6 transistors per bit in this approximation). If one wanted to scale to massive numbers of simple small cores, the amount of on-chip memory per core would be tiny. This suggests that sharing of instruction RAMs and on-chip memories becomes one of the major design challenges.

___1 ~ 128 MB
__16 ~ __8 MB
_256 ~ 512 KB
__4K ~ _32 KB
_64K ~ __2 KB
256K ~ 512 B


This next table looks at 64K cores clocked at 1 GHz running at 64 frames/second, or roughly 1024 Gop/frame. It then divides that 1024 Gop/frame by the number of instructions fetched from off-chip memory per frame, providing a rough idea of the level of instruction reuse required. The table tops out at 4 G instructions: at a 16-bit instruction width, fetching 4 G instructions per frame at 64 frames/second would fully consume the 512 GB/s of off-chip bandwidth.

__4 G instructions ~ __256 usage average/instruction ~ ___full usage of off-chip bandwidth
128 M instructions ~ __8 K usage average/instruction ~ ___1/32 usage of off-chip bandwidth
__4 M instructions ~ 256 K usage average/instruction ~ _1/1024 usage of off-chip bandwidth
128 K instructions ~ __8 M usage average/instruction ~ 1/32768 usage of off-chip bandwidth


Suggests that the majority of program workflow must traverse similar code paths, whether through SIMD, looping, or something else. Another important aspect of this problem is random access (unique) vs broadcast (same) for filling instruction RAMs. Start by assuming instruction RAMs are not shared across cores, and that cores run random programs (non-SIMD).

4 Ginst/frame / 64 Kcores = 64 Kinst/core/frame ~ ___full usage of off-chip bandwidth
_____________________________8 Kinst/core/frame ~ ____1/8 usage of off-chip bandwidth
_____________________________1 Kinst/core/frame ~ ___1/64 usage of off-chip bandwidth


If broadcast is not used, programs need to be pinned to a given core across multiple frames. This suggests that as cores/chip increases, broadcast update of on-chip RAMs becomes critical. Meaning even when supporting unique control paths per core, the window of code must be the same across many cores.

These tiny, rough estimations of large-scale effects paint a very clear picture of why GPUs are SIMD machines with SIMD units clustered around shared memories. I'm personally interested in figuring out what comes after the GPU, meaning what does to the GPU what the GPU did to the CPU in terms of ALU density on a chip. This post talks in terms of the classic model of an ALU connected to a memory, fetching operands and sending results back into memory. Perhaps we are at the point where the next form of scaling requires leaving that model behind?

Thinking "Clearly" About 4K

The real advantage of this console "upgrade cycle" is that now a developer should be able to produce a good 1080p @ 60 Hz game without losing pixel quality.

However, what is likely to happen instead is that 1st-party titles under 4K marketing pressure will try for 4K, accepting a reduction of pixel quality, and get stuck at 30 Hz. Games built around 30 Hz won't scale to 60 Hz due to CPU loads, etc., so we don't get 60 Hz this round in the general case. Instead games will generally offer some amount of extra super-sampling for 1080p on upgrade consoles.

Beating the System
If I was developing a console game for the "upgrade generation", I'd take advantage of 4K by dropping render resolution to 960x540 (yeah) in the game, then using a stylized CRT up-sampler. The advantage of 4K: 540p shader-scaled to 2160p has 4 display lines per rendered line, so it is possible to have the scan-line effect without losing as much brightness (three mostly solid lines with one 50%-darker gap line is only a 1/8 drop in total brightness). 540p@60Hz on a PS4 Pro is 135 Kflop/pixel, or roughly a 4.6x scaling of perf/pixel over a 1080p@30Hz PS4 title. That would be a transformative visual upgrade.

Maths
CONCLUSIONS ON 4K FOR CONSOLE "UPGRADE CYCLE"
=============================================

Majority of devs will render way under 4K,
adopting up-scaling temporal AA to get to 4K,
and will stay with 30Hz.

The why,

Xbox 360 was a 720p@30Hz target with 9 Kflop/pix.
Many developers did less than 720p rendering on the 360.

Xbox One is a 1080p@30Hz target with 19 Kflop/pix.
So roughly double Xbox 360 perf/pix, a transformative change.
Many developers continue to do less than 1080p rendering on Xbox One.

PS4 is a 1080p@30Hz target with 29 Kflop/pix.
I'm going to take this as the baseline Kflop/pix requirement
for a current generation visual standard.

PS4 PRO in my opinion is a good 1080p@60Hz target with 34 Kflop/pix,
as that maintains the PS4's perf/pixel standard.
To hit 4K@30Hz is 17 Kflop/pix,
which is a downgrade to an Xbox One perf/pix level.

Xbox Scorpio, using a 6.0 Tflop number, at 4K@30Hz is 24 Kflop/pix,
which does *not* hit the PS4 perf/pix quality standard,
but does maintain an Xbox One perf/pix level.
So as with 360 and XB1, devs will continue to render under native resolution.


RAW DATA
========

Using "Google Supplied" numbers for Tflops,

360 -> 0.24 Tflops [Xbox 360    ]
XB1 -> 1.2  Tflops [Xbox One    ]
PS4 -> 1.8  Tflops [PS4         ]
PRO -> 4.2  Tflops [PS4 Pro     ]
SCO -> 6.0  Tflops [Xbox Scorpio]

And looking at flop/pix in units of 1000,

                                     360   XB1   PS4   PRO   SCO
=================================  ===== ===== ===== ===== =====
 960 x  540 @  30 Hz =  16 Mpix/s     15    77   116   270   386
1280 x  720 @  30 Hz =  28 Mpix/s      9    43    65   152   217
 960 x  540 @  60 Hz =  31 Mpix/s      8  _39_    58   135   193
1280 x  720 @  60 Hz =  55 Mpix/s      4    22    33    76   109
 960 x  540 @ 120 Hz =  62 Mpix/s      4    19    29    68    96
1920 x 1080 @  30 Hz =  62 Mpix/s      4    19  _29_    68    96
1280 x  720 @ 120 Hz = 111 Mpix/s      2    11    16    38    54
1920 x 1080 @  60 Hz = 124 Mpix/s      2    10    14  _34_  _48_
1920 x 1080 @ 120 Hz = 249 Mpix/s      1     5     7    17    24
3840 x 2160 @  30 Hz = 249 Mpix/s      1     5     7    17    24
3840 x 2160 @  60 Hz = 498 Mpix/s      0     2     4     8    12
3840 x 2160 @ 120 Hz = 995 Mpix/s      0     1     2     4     6



Parallel Noise Generation

Re a related Twitter Post ...

Concerning making tile-able textures for grain or noise, and getting various desired properties: my preference is towards algorithms which parallelize trivially. The shadertoy referenced in the tweet generates a noise pattern by starting out with some poor non-random noise, then applying filters (ending with a high-pass) to transform it into something which is pleasing to the eye. I'd advise always using the technique I outlined in this GPUOpen post, which remaps the texture to a perfect distribution of values (see the follow-up post as well) while maintaining its original form (this applies to both techniques in this post).

A second technique I've leveraged in the past is to work with one {x,y} coordinate per grain position, distributed in a regular grid array of grains. Start with the perfect honeycomb distribution (begin from a regular grid, where every other row is shifted to the left or right, and give the grid the proper aspect ratio for a honeycomb),

_x_x_x_x_x_x_x_x
x_x_x_x_x_x_x_x_
_x_x_x_x_x_x_x_x
x_x_x_x_x_x_x_x_
_x_x_x_x_x_x_x_x
x_x_x_x_x_x_x_x_
_x_x_x_x_x_x_x_x
x_x_x_x_x_x_x_x_


Then permute grain position by some function (which could be a noise function with various distributions based on frequency, or perhaps some kind of clustered rotation of points by nearest cluster, etc.). This process typically results in an undesired look. Then apply various passes on the array where the position of each grain is filtered against the positions of its pre-filtered neighbors (only dependent on the prior pass). The point is to re-shape the array into something which has a more visually pleasing feel. The filter can work with a hex neighborhood (neighbors depend on whether the pixel is on an even or odd row),

_x_x_ ... ab_ ... _ab
x_x_x ... cde ... cde
_x_x_ ... ef_ ... _ef


This could be something as easy as relaxing the position of the point (push the point towards being equidistant from its neighbors, but not so much that it resets to a honeycomb). After getting grains distributed as desired, one can use the {x,y} coordinates directly, or transform back into an image of grain (which could be a different-resolution image).

T4K


This post is just me using Blogger as an active notepad to paper-design an FPGA soft-core for a many-core machine. Hopefully I'll update it once in a while. The aim is to see what can be built in the budget of one BRAM per core on Xilinx FPGAs, thinking through pipelines and physical implementation in CLBs. The machine target is 600 32-bit integer cores at 300 to 500 MHz, with 4KB/core of on-chip local RAM (for instructions and data), '25-bit * 18-bit + 48-bit' DSP-based ALUs, and a Hoplite-based on-chip router for message passing.

Update Log
2016/10/02 : Returning to the mechanics of code, shift and bit array stuff.

2016/10/01 : Updating ISA, etc. Traditionally an instruction encapsulates {source values, operation, destination}, and the instruction data flows down the CPU pipeline. I'm working towards a different kind of ISA, where the opcode instead describes what to do in this clock: the routing between the various data pipelines in the CPU core. This could be a horrible idea (racing to fail); it certainly is for assembly readability! Effectively I'm attempting to describe {branch, alu, mov, mem} access all in a single 32-bit opcode, attempting to push IPC closer to 3 or 4 instead of around 1.

2016/10/01 : Beginning. Yanked the prior post because my implementation estimate was a fail; the revised version appears here.

Notes

==============
FPGA NOTES
==============

===============
CORE BUDGET
===============
1 BRAM (32-bit x 1024 entry, with 2 ports both which can read or write)
2 DSPs
400 LUTs (50 slices, 8 LUTs per slice)


==============
LUTs / MUX
==============
1 LUT = 2:1 MUX x2 (get two of these)
1 LUT = 4:1 MUX
2 LUT = 8:1 MUX
4 LUT = 16:1 MUX
8 LUT = 32:1 MUX


===================================
LUTs / GENERAL PURPOSE FUNCTION
===================================
1 LUT = 5:1 x2 (get two of these sharing same 5 input bits)
1 LUT = 6:1
2 LUT = 13:1
4 LUT = 27:1


===============
SLICE RULES
===============
Carry chain for slice can start at LUT 0 or LUT 4
Distributed RAM granularity is half a slice (starting at LUT 0 or LUT 4)


========================
DSP PATTERN DETECTOR
========================
From docs, "use of the pattern detector leads to a moderate speed reduction
due to the extra logic on the pattern detect path"
Suggests not using this to check for zero
So branch on signed or unsigned only?
Wouldn't be able to use this for both saturation checks and zero check anyway


====================
FUNCTIONAL UNITS
====================

============================
CURRENT LUT BUDGET USAGE
============================
LUTs   %  Usage
====  ==  =====
  28   7  Program counter
  12   3  Return stack
 107  27  BRAM variable bit-width windows (includes adr^imm)
  64  16  Register file
----  --  -----
 211  53  Total


===================
PROGRAM COUNTER
===================
10-bit program counter (PC)
Only lower 8-bits of PC increment on linear execution
Requires only an 8-bit PC+1 computation (one slice)

Aim to minimize critical path getting next address to BRAM
Only one level of LUT to compute next address
All inputs registered at end of prior clock
Followed by 8-bit add to compute possible PC for next clock

PC function inputs per output bit (map to 13:1 function at 2 LUTs/bit)
Bits  Meaning
====  =======
   1  Next PC if not branching (computed in prior clock)
   1  Top of return stack
   1  Immediate absolute branch address
   1  PSP P output register (computed branch target in prior clock)
   1  PSP P output register sign bit (for conditional branch)
   8  Up to 8 bits from instruction opcode to decode

LUTs  Usage
====  =====
  20  13:1 function for next 10-bit PC computation including instruction decode
   8  PC+1 adder for 8 lower bits of PC
----  -----
  28  Total (7% of 400 LUT/BRAM budget)


================
RETURN STACK
================
Going to plan on a dedicated return stack for now
Only need single port for return stack (either call/push, or return/pop)
Hardware background,
Distributed RAM works in 4 LUT granularity
1 LUT provides 2x SPRAM32 (single port 32x1 RAM)
Writes are synchronous on clock edge
Reads are async
8 LUTs for a 32 x 16-bit return stack data

Not using everything
Padding to 4 LUT granularity
Only need 10-bits out of 16-bits (6-bits free for other state)
Likely going to keep only 4-bit top of stack address register
Could use other 16 entries for run function on new message?

Todo
Control inputs, adder input, etc

LUTs  Usage
====  =====
   8  32 entry x 16-bit return stack
   4  4-bit top of stack pointer
   ?
----  -----
  12  Total (3% of 400 LUT/BRAM budget)


=================================================
ADDRESS REGISTER XOR INSTEAD OF ADD IMMEDIATE
=================================================
Planning on [address ^ immediate] addressing
This removes an adder from the design

XOR is the same as [address + immediate] for an n-bit immediate
When lower n-bits of address are zero
Means data must be aligned to the nearest pow2 of maximum immediate offset

Using the following terms,
ggggggoooo
g = group bits (address bits choose the group of data)
o = offset bits (address bits are zero, immediate chooses element in group)

For bits in the address which are not cleared (i.e. the group bits),
Setting bits in the immediate results in accessing a neighbor group
Regardless of the starting group in the address register
It is possible to roll through all aligned groups
But ordering is different based on starting group address
Example of group bits for address crossed with immediate
00 01 10 11
+-------------
00 | 00 01 10 11
01 | 01 00 11 10
10 | 10 11 00 01
11 | 11 10 01 00
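The same behavior as a software sketch (a hypothetical model, not the hardware): when the low n bits of the base are zero, XOR is identical to ADD for any n-bit immediate, and unaligned group bits roll through neighbor groups in the permuted order of the table above.

```python
# Software sketch of [address ^ immediate] addressing.

def xor_addr(base, imm):
    return base ^ imm

# Aligned base: XOR and ADD agree for every immediate within the group.
base = 0b1011_0000                    # low 4 bits clear, group bits = 1011
for imm in range(16):
    assert xor_addr(base, imm) == base + imm

# 2-bit group-crossing table: XOR visits every group, order depends on
# the starting group (matches the table above).
group_table = [[a ^ b for b in range(4)] for a in range(4)]
assert group_table[0b01] == [0b01, 0b00, 0b11, 0b10]
assert group_table[0b10] == [0b10, 0b11, 0b00, 0b01]
```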


===================================
BRAM VARIABLE BIT-WIDTH WINDOWS
===================================
Trying to support transparent pack/unpack of variable bit-widths from BRAM
Want zero impact to ISA, no special instructions
Instead dividing address range into windows of different bit-widths
Each address range addresses at a multiple of the bit-width
Effectively the high bits of address choose the bit-width

Store path limited to {8,16,32}-bit
Only using BRAM byte write mask to avoid any {read, modify, write}

Fixed signed vs unsigned configuration
32-bit doesn't matter
16-bit going to go with signed (needed for vector or audio)
8-bit unsigned (not so sure about that)
4-bit unsigned for sure (sprites?)

BRAMs always in 32-bit port mode,
fedcba9876543210
================
.xxxxxxxxxx00000 - requires 10-bit address

Address register,
fedcba9876543210
================
00....xxxxxxxxxx - 1024 x 32-bit
01...xxxxxxxxxxx - 2048 x 16-bit
10..xxxxxxxxxxxx - 4096 x 8-bit
11.xxxxxxxxxxxxx - 8192 x 4-bit (supported for read only)

Address register value to BRAM address translation
This needs to include XORing the immediate
Uses a 13:1 function for each bit,
Bits Meaning
==== =======
4 Address shifted left {0,1,2,3} bits
4 Immediate shifted left {0,1,2,3} bits
2 The 'fe' address bits
3 Up to 3 bits from instruction opcode to decode
Should select if use immediate, etc

Permutations (showing address and byte write mask for store),
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - 32-bit adr=00....xxxxxxxxxx write=1111
................bbbbbbbbbbbbbbbb - 16-bit adr=01...xxxxxxxxxx0 write=0011
cccccccccccccccc................ - 16-bit adr=01...xxxxxxxxxx1 write=1100
........................dddddddd - 8-bit adr=10..xxxxxxxxxx00 write=0001
................eeeeeeee........ - 8-bit adr=10..xxxxxxxxxx01 write=0010
........ffffffff................ - 8-bit adr=10..xxxxxxxxxx10 write=0100
gggggggg........................ - 8-bit adr=10..xxxxxxxxxx11 write=1000
............................hhhh - 4-bit adr=11.xxxxxxxxxx000
........................iiii.... - 4-bit adr=11.xxxxxxxxxx001
....................jjjj........ - 4-bit adr=11.xxxxxxxxxx010
................kkkk............ - 4-bit adr=11.xxxxxxxxxx011
............llll................ - 4-bit adr=11.xxxxxxxxxx100
........mmmm.................... - 4-bit adr=11.xxxxxxxxxx101
....nnnn........................ - 4-bit adr=11.xxxxxxxxxx110
oooo............................ - 4-bit adr=11.xxxxxxxxxx111
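The permutations above can be modeled in software as follows (a hypothetical Python sketch with my own helper names; the hardware does the translation as the 13:1 per-bit function described above, and the write mask only exists for the {8,16,32}-bit store path):

```python
# Hypothetical model of the variable bit-width window decode.
# High 2 bits of the 16-bit address register select element width.
# Returns (10-bit BRAM word address, element bit width, shift in word).

def decode_window(adr16):
    mode = adr16 >> 14                  # the 'fe' bits select the window
    width = 32 >> mode                  # 00->32, 01->16, 10->8, 11->4
    elements_per_word = 32 // width
    index = adr16 & 0x3FFF              # element index within the window
    word = (index // elements_per_word) & 0x3FF
    shift = (index % elements_per_word) * width
    return word, width, shift

def byte_write_mask(width, shift):
    # Byte write enable for the {8,16,32}-bit store path (4-bit is read only).
    return ((1 << (width // 8)) - 1) << (shift // 8)

assert decode_window(0x0000) == (0, 32, 0)
assert decode_window((0b01 << 14) | 1) == (0, 16, 16)   # write=1100 row
assert byte_write_mask(16, 16) == 0b1100
assert byte_write_mask(8, 8) == 0b0010
```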

Shift value for store,
Requires 3:1 MUX per bit, 32 LUTs

Generate write enable for store,
Requires same 4-bits per function,
2 lower address bits
2 upper address bits
2 LUTs (5:1 function sharing inputs, 2 outputs per LUT)

Unpack after load,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
<---------------bbbbbbbbbbbbbbbb - sign extended
<---------------cccccccccccccccc - sign extended
000000000000000000000000dddddddd
000000000000000000000000eeeeeeee
000000000000000000000000ffffffff
000000000000000000000000gggggggg
0000000000000000000000000000hhhh
0000000000000000000000000000iiii
0000000000000000000000000000jjjj
0000000000000000000000000000kkkk
0000000000000000000000000000llll
0000000000000000000000000000mmmm
0000000000000000000000000000nnnn
0000000000000000000000000000oooo
================================
xxxxxxxxxxxxxxxx................ - 4:1 MUX/bit (a bit, top b bit, top c bit, 0)
................xxxxxxxx........ - 4:1 MUX/bit ({a,b,c} bit, 0)
........................xxxx.... - 8:1 MUX/bit ({a,b,...g} bit, 0)
............................xxxx - 15:1 MUX/bit ({a,b,...o} bit)
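The same unpack as a software sketch (zero-extend for 8/4-bit, sign-extend for 16-bit, per the fixed configuration above; helper names are mine):

```python
# Sketch of the load unpack path: extract the addressed element from the
# 32-bit BRAM word, sign-extending only the 16-bit case.

def unpack(word32, width, shift):
    v = (word32 >> shift) & ((1 << width) - 1)
    if width == 16 and v & 0x8000:      # 16-bit loads are signed
        v -= 1 << 16
    return v

assert unpack(0xFFFF0000, 16, 16) == -1      # sign extended
assert unpack(0x000000F0, 4, 4) == 0xF       # zero extended
assert unpack(0x12345678, 8, 8) == 0x56
```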

Unpack if done with MUX control logic computed in prior clock,
This registers 4-bits extra, to reduce LUT cost on MUX
Register 2-bits for top 24-bit MUX control
Logic function (3:1 sharing inputs), one LUT
up lo 1 0
== === = =
00 xxx | 0 0
01 xx0 | 0 1
01 xx1 | 1 0
10 xxx | 1 1
11 xxx | 1 1
Register 3-bits for 2nd lowest 4-bit MUX control
Logic function (4:1 sharing inputs), rounds to 2 LUTs
up lo 2 1 0
== == = = =
00 xxx | 0 0 0
01 xx0 | 0 0 1
01 xx1 | 0 1 0
10 x00 | 0 1 1
10 x01 | 1 0 0
10 x10 | 1 0 1
10 x11 | 1 1 0
11 xxx | 1 1 1
Register 4-bits for lower 4-bit MUX control
Logic function (5:1 sharing inputs), 2 LUTs
Skipping as pattern is obvious ...

LUTs Usage
==== =====
20 Address register value to BRAM address translation 2 LUTs x 10-bits
32 Shift value for store
2 Generate write enable
5 Unpack MUX control logic (done on clock computing address for BRAM)
24 Unpack MUX for 24-bits
8 Unpack MUX for 4-bits (higher nibble)
16 Unpack MUX for 4-bits (lower nibble)
---- -----
107 Total (27% of 400 LUT/BRAM budget)


=================
REGISTER FILE
=================
Using the smallest possible, but assuming the need for 1 write and 3 read ports,
8 LUTs (one SLICEM) yields one Quad-port 32 entry x 4-bit RAM

Register file write sources,
DSP P output register
BRAM load
What else?
TODO?
Need to fold these choices into BRAM unpack logic?
Or is BRAM load also forwarded into DSP inputs without first going to reg file?

Register file read sources,
Address register
DSP A,B,C inputs
What else?

LUTs Usage
==== =====
64 Total (16% of 400 LUT/BRAM budget)


=======
DSP
=======
Not attempting to use all features of DSP
Pre-add 'a+d' tossed because of extra pipeline stage

DSP is effectively modal,
If multiply is enabled then there are only 2 options,
'p = c+(a*b)'
'p = c-(a*b)' <- is multiply subtract worth it? (assuming yes for now)
Otherwise,
'p = c OP (a:b)'


===================
MESSAGE PASSING
===================
Todo


=================================

ISA / CODE GENERATION DETAILS

=================================
Todo

=======
ISA
=======
Going to get messy for now.
Attempting first to describe everything which needs instruction control.
This will overflow a 32-bit instruction.
In the process, culling options which are less needed.
In the hope of eventually fitting everything.

Source input data which can be accessed each clock,
Register file loads (from prior clock),
32-bit x
32-bit y
32-bit z
Instruction immediate for this instruction,
10-bit i
BRAM last unpacked fetch (from prior clock),
32-bit f
DSP output (from prior clock),
32|48-bit p

Sink output data,
DSP inputs,
18-bit b
24-bit a (could be up to 30-bit, but no LUTs for that)
32-bit c (could be up to 48-bit, but no LUTs for that)
?-bit control bits (todo)
BRAM output word
32-bit o
BRAM memory address before translation
16-bit m
BRAM access control bits
Register file entries for each port
5-bit immediate s (store port)
5-bit immediate t
5-bit immediate u
5-bit immediate v
Register file write value
32-bit w

Alphabet usage,
abcdefghijklmnopqrstuvwxyz
==========================
abc....................... DSP inputs
...............p.......... DSP output
........i................. immediate
............m............. BRAM memory address
.....f.................... BRAM fetched value
..............o........... BRAM output word
..................s....... Register file store port
...................tuv.... Register file read ports
......................w... Register file write value
.......................xyz Register file reads

Tracking how much opcode overload,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
bbb............................. Branching
...rr........................... BRAM operation
.....sssstttuuuuvvvv............ Reg file ports
....................??.......... Not enough space for DSP control
......................iiiiiiiiii Trying for fixed 10-bit immediate
================================
Notes,
Definitely need smarter opcode encoding

Immediate,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
......................iiiiiiiiii Trying for fixed 10-bit immediate
================================
Notes,
Wanted 10-bit to hit full BRAM absolute address

DSP control,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
Notes,
Lots of control bits to get correct in here

DSP inputs,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
................................ Input a 24-bits
................................ a=signExtend(i)
................................ a=y[32:18]
................................ a=f[32:18]
................................ a=f
................................ a=y
................................ a=z
================================
................................ Input b 18-bits
...............................0 b=i
..............................?? b=y
..............................?? b=f
================================
................................ Input c 32-bits
...............................? c=f
...............................? c=z
================================
Notes,
Might be able to have 'a' be modal based on DSP control (for mul vs a:b cases)
Or maybe merge for some cases?
The a=signExtend(i) case is needed for signed (a:b)=i
Should c=z be z or something else
Believe f probably should be an input into b for '(a:b) op c' case
Culled, a=i, as i is too small for top bits of a:b, and mul is commutative
Culled, b=p, as it is better to just forward in these cases
Culled, c=p, hoping forwarding if control bits covers this

Branch control,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
.............................??? No branch
.............................??? Return
.............................??? Call to p
.............................??? Switch to/from MessageHandler/Program
.............................??? Conditional jump if p<0
.............................??? Conditional jump if p>=0
.............................??? Call
.............................??? Jump
================================
Notes,
Likely not getting branch control under 3-bits
At least without multiple instruction forms
In theory this enables "free" branching
Won't have p==0, as don't want to turn on pattern detector
Work through computed branch targets cases again ...

Register file ports,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
.................ssss........... Store to any register
.....................ttt........ BRAM address registers (limited to first 8 for space)
........................uuuuvvvv Load from any register
================================
Notes,
This is the most trouble in opcode encoding size
Dropping to 16 entries from 32
Assuming we want a free context switch to handle messages
First 16 used during normal execution
Last 16 used during message handling

BRAM operation bits,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
..............................00 f=[x]
..............................01 f=[x^i]
..............................10 [x]=p
..............................11 [x^i]=p
================================
Notes,
This is all that is needed to keep the one BRAM port busy
Direct mapping to register port t (1st read port) to save size
Only supporting storing from DSP output p
If one is going to store, store when generated
If need to store later just load back into the DSP (nop)
Need separate [x] case because imm may be used for something else
Culled, nop (just going to burn power for load regardless if needed)
Culled, f=[i], [i]=p, because [x^i] can load x=0


===================
SHIFTING ISSUES
===================
Needs more thought ...

Have the following pipeline options built in the DSP
(25-bit a * 18-bit b) + 48-bit c
((25-bit a * 18-bit b) + 48-bit c) << 17

Usage cases,
Address math,
This is 'base + index * stride', so use 'a*b+c'
Index limited to 16 M
Base not limited
Bitfield,
Handled via the variable bit-width BRAM access
Bit arrays,
Expanded to later section
Divides
Needs some thought ...

Shifter will do at most 24-bit integers,
Top bits get sign-extended (likely not the opcode space for unsigned)
Using 24 to keep with quad LUT alignment

Shifts for 16-bit,
__for_shift_>>__ __for_shift_<<__
1111111111111111 0000000000000000
fedcba9876543210 fedcba9876543210 mul <<>>
================ ================ ======== == ==
................ fedcba9876543210 00000001 0 10
...............f edcba9876543210. 00000002 1 f
..............fe dcba9876543210.. 00000004 2 e
.............fed cba9876543210... 00000008 3 d
............fedc ba9876543210.... 00000010 4 c
...........fedcb a9876543210..... 00000020 5 b
..........fedcba 9876543210...... 00000040 6 a
.........fedcba9 876543210....... 00000080 7 9
........fedcba98 76543210........ 00000100 8 8
.......fedcba987 6543210......... 00000200 9 7
......fedcba9876 543210.......... 00000400 a 6
.....fedcba98765 43210........... 00000800 b 5
....fedcba987654 3210............ 00001000 c 4
...fedcba9876543 210............. 00002000 d 3
..fedcba98765432 10.............. 00004000 e 2
.fedcba987654321 0............... 00008000 f 1
fedcba9876543210 ................ 00010000 10 0
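In software terms, what the table encodes is that one multiply by 2^n produces both shift directions at once — the left shift in the low 16 bits of the product and the complementary right shift in the high 16 bits (a sketch, with Python ints standing in for the DSP product):

```python
# Sketch: 16-bit shifts implemented with the DSP multiplier, per the
# table above. Multiplying by 2**n yields 'x << n' in the low half of
# the product and 'x >> (16 - n)' in the high half.

def shift_via_mul(x16, n):
    p = x16 * (1 << n)          # one multiply covers both directions
    left = p & 0xFFFF           # x << n (low 16 bits)
    right = p >> 16             # x >> (16 - n) (high 16 bits)
    return left, right

x = 0b1010_0000_0000_0001
left, right = shift_via_mul(x, 4)
assert left == (x << 4) & 0xFFFF
assert right == x >> 12
```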


==============
BIT ARRAYS
==============
Maybe best to just use the 8-bit BRAM window
Keeps with-in the range of immediate for AND mask

Emulation without special hardware,
Algorithms,
Extract lowest bit set .................... x & -x
Get mask up to lowest bit set ............. x ^ (x - 1)
Reset lowest bit set ...................... x & (x - 1)
=========================================== ============
Fill from lowest clear bit ................ x & (x + 1)
Isolate lowest clear bit and complement ... ~x & (x + 1)
Mask from lowest clear bit ................ x ^ (x + 1)
Mask from trailing zeros .................. ~x & (x - 1)
=========================================== ============
Isolate lowest clear bit .................. x | ~(x + 1)
Set lowest clear bit ...................... x | (x + 1)
Fill from lowest set bit .................. x | (x - 1)
Isolate lowest set bit and complement ..... ~x | (x - 1)
Inverse mask from trailing ones ........... ~x | (x + 1)
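These identities can be verified exhaustively in software, e.g. over all 8-bit values (the reference function here is mine, a bit-scan for checking):

```python
# Exhaustive check of a few of the low-bit identities above over
# 8-bit values (the hardware would just use the DSP add and logic op).
MASK = 0xFF

def lowest_set_bit(x):
    # independent reference: scan from bit 0
    for i in range(8):
        if (x >> i) & 1:
            return 1 << i
    return 0

for x in range(1, 256):
    lsb = lowest_set_bit(x)
    assert x & (-x & MASK) == lsb            # extract lowest bit set
    assert x ^ (x - 1) == (lsb << 1) - 1     # mask up to lowest bit set
    assert x & (x - 1) == x - lsb            # reset lowest bit set
```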

Common algorithms,
Bit insert
Bit extract
Population count
Output is 3 bits
Requires 8:1 function, or 2 LUTs/bit, 6 LUTs in hardware
Count leading zeros
6 LUTs in hardware
Count trailing zeros
6 LUTs in hardware
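Software reference versions of the three counts (for checking any future hardware; the LUT costs above are the hardware estimates):

```python
# Reference implementations for 8-bit population count, count leading
# zeros, and count trailing zeros (x == 0 returns 8 for clz/ctz).

def popcount8(x):
    return bin(x & 0xFF).count('1')

def clz8(x):
    return 8 - (x & 0xFF).bit_length()

def ctz8(x):
    x &= 0xFF
    return (x & -x).bit_length() - 1 if x else 8
```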

Todo,
Look through De Bruijn Sequence based stuff again


Related Material
Bit Hacks
Bit Manipulation Instruction Sets
A Fast Carry Chain Adder for Virtex-5 FPGAs
GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator
On Hard Adders and Carry Chains in FPGAs
Principles of FPGA IP Interconnect

TK4 - Try 2


Update Log
2016/10/03 : Initial posting. Most of the pipelining figured out. Working through DSP input and operation details. Have to think through data and return stack usage cases, and decide if the data register file should just be removed and replaced with indexed fetches from the data stack.

2016/10/02 : Trying a different design path. Concerned that the key to a core which enables easily factored code, and thus well-compressed code in limited memory, is not having to optimize around a CPU pipeline from the perspective of a thread of execution. So in this try, I'm working through a paper implementation of a core which round-robins through 4 threads for a 4-stage pipeline (talked about in this prior post). Maintaining variable bit-width address windows, and other things from the prior post.

Notes

================
FORTH HYBRID
================
Dual stack machine with register file
Core functional units,

16-bit x 8-entry return stack (1 port)
32-bit x 8-entry data stack (1 port)
32-bit x 8-entry register file (2 ports, port 0 for DSP input, port 1 for BRAM address)
32-bit x 1024-entry BRAM (2 ports, port 0 for instruction fetch, port 1 for data)


===================
EXECUTION MODEL
===================
4 threads/core of execution running with guaranteed round-robin scheduling
Instructions are VLIW style with a fixed logical ordered set of operations

Order Operation
===== =========
1st mux inputs for DSP ............ uses loads from prior instruction
2nd DSP execution .................
3rd load/store to REG and BRAM .... can store DSP result
4th branch ........................ can branch to DSP result


=====================
PHYSICAL PIPELINE
=====================
Designing under the following constraints,

Loads from RAMs are not used until next stage
Each stage does only one LUT or ADD both of which can be vertically chained
DSP is fully pipelined

Outline,

DSP DSP DSP DSP DSP DAT ADR BLK BLK
Stage A B MUL C ADD P REG REG OUT RAM PC INS
===== === === === === === === === === === === ===
0 lut lut @
1 mul reg lut
2 add lut add
3 lut @! lut @! lut @
----- --- --- --- --- --- --- --- --- --- --- ---

DSP A B .... DSP input a and b arguments
DSP MUL .... DSP mul stage
DSP C ...... DSP input c argument
DSP ADD .... DSP add/op stage
DAT REG .... Register file data load/store
ADR REG .... Register file address register to BRAM address translation
BLK OUT .... BRAM data construct write value
BLK RAM .... BRAM data load/store
PC ......... Update program counter
INS ........ Fetch next instruction


============================
CURRENT LUT BUDGET USAGE
============================
Budget is 400 LUTs/core, adding as design is roughed out,

LUTs % usage
==== === =====
32 8 register file (4x 8-LUT SLICEM 32-entry x 8-bit 2 port RAM)
16 4 data stack (2x 8-LUT SLICEM 32-entry x 16-bit 1 port RAM)
8 2 return stack ( 8-LUT SLICEM 32-entry x 16-bit 1 port RAM)
---- --- -----
54 DSP b input
DSP a input
DSP c input
18 5 BRAM address generation
34 BRAM output generation
38 10 program counter
==== === =====
200 total


========
TODO
========
Make sure to register all inputs required to generate a b and c


===========================
DSP B INPUT
===========================
Need to also unpack BRAM load options so this gets expensive
Placement in pipeline,

stage action
===== ======
0 LUT DSP b input
1
2 register pre-translated address for stage 0 of next cycle
3 register pre-translated address for stage 0 of next cycle

The b input expanded with unpack options, and control bits,

fedcba9876543210 n LUT input count
================ = ===============
<-----iiiiiiiiii i 1-bit
tttttttttttttttt t 1-bit
dddddddddddddddd d 1-bit
================
aaaaaaaaaaaaaaaa f 4-bits for MSB 8-bits of output
bbbbbbbbbbbbbbbb 7-bits for 2nd LSB 4-bits of output
cccccccccccccccc 15-bits for LSB 4-bits of output
00000000dddddddd
00000000eeeeeeee
00000000ffffffff
00000000gggggggg
000000000000hhhh
000000000000iiii
000000000000jjjj
000000000000kkkk
000000000000llll
000000000000mmmm
000000000000nnnn
000000000000oooo
================
xxxxxxxxxxxxxxxx needs 2-bit opcode control
xxxxxxxxxxxxxxxx needs 2-bit MSB of pre-translate address
xxxxxxxx........ needs 1-bit LSB of pre-translate address
........xxxx.... needs 2-bit LSB of pre-translate address
............xxxx needs 3-bit LSB of pre-translate address
================
xxxxxxxx........ 12:1 function (2 LUT/bit) x 8-bit = 16 LUT
........xxxx.... 16:1 function (4 LUT/bit) x 4-bit = 16 LUT
............xxxx 25:1 function (4 LUT/bit) x 4-bit = 16 LUT

LUT area estimate,

LUTs usage
==== =====
48 generate b
6 to register 5-bits x 2 stages of pre-translate address (rounded up)
---- -----
54 total


===========================
DSP A INPUT
===========================
Placement in pipeline,

stage action
===== ======
0 LUT DSP a input
1
2
3

LUT area estimate,

LUTs usage
==== =====
---- -----
total


===========================
DSP C INPUT
===========================
Placement in pipeline,

stage action
===== ======
0 LUT DSP c input
1 register c
2
3

LUT area estimate,

LUTs usage
==== =====
---- -----
total


===========================
BRAM ADDRESS GENERATION
===========================
Supports the feature of variable-bit width windows into the 4KB of ram
Placement in pipeline,

stage action
===== ======
0 fetch base address from register file
1 optionally XOR immediate
2 translate into BRAM address
3

Implementation requires XOR control to be single bit in opcode

BRAMs always in 32-bit port mode,

fedcba9876543210
================
.xxxxxxxxxx00000 - requires 10-bit address

Address register,

fedcba9876543210 access
================ ======
00....xxxxxxxxxx 1024 x 32-bit
01...xxxxxxxxxxx 2048 x 16-bit
10..xxxxxxxxxxxx 4096 x 8-bit
11.xxxxxxxxxxxxx 8192 x 4-bit (supported for read only)

Address register value to BRAM address translation
Uses a 6:1 function for each bit,

bits meaning
==== =======
4 address shifted left {0,1,2,3} bits
2 the 'fe' address bits

LUT area estimate,

LUTs usage
==== =====
8 optional XOR (16-bits at 2-bits per LUT), rounding up for ending register
10 translate (10-bits x 1 LUT)
---- -----
18 total


===========================
BRAM OUTPUT GENERATION
===========================
Shifts DSP p output for store, and compute byte write mask
Placement in pipeline,

stage action
===== ======
0
1
2
3 LUT new output here

Permutations (showing address and byte write mask for store),

11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - 32-bit adr=00....xxxxxxxxxx write=1111
................bbbbbbbbbbbbbbbb - 16-bit adr=01...xxxxxxxxxx0 write=0011
cccccccccccccccc................ - 16-bit adr=01...xxxxxxxxxx1 write=1100
........................dddddddd - 8-bit adr=10..xxxxxxxxxx00 write=0001
................eeeeeeee........ - 8-bit adr=10..xxxxxxxxxx01 write=0010
........ffffffff................ - 8-bit adr=10..xxxxxxxxxx10 write=0100
gggggggg........................ - 8-bit adr=10..xxxxxxxxxx11 write=1000
............................hhhh - 4-bit adr=11.xxxxxxxxxx000
........................iiii.... - 4-bit adr=11.xxxxxxxxxx001
....................jjjj........ - 4-bit adr=11.xxxxxxxxxx010
................kkkk............ - 4-bit adr=11.xxxxxxxxxx011
............llll................ - 4-bit adr=11.xxxxxxxxxx100
........mmmm.................... - 4-bit adr=11.xxxxxxxxxx101
....nnnn........................ - 4-bit adr=11.xxxxxxxxxx110
oooo............................ - 4-bit adr=11.xxxxxxxxxx111

Shift value for store,
Requires 3:1 MUX per bit, 32 LUTs

Generate write enable for store,
Requires same 4-bits per function,
2 lower address bits
2 upper address bits
2 LUTs (5:1 function sharing inputs, 2 outputs per LUT)

LUT area estimate,

LUTs usage
==== =====
32 shift value for store
2 generate write enable
---- -----
34 total


===================
PROGRAM COUNTER
===================
10-bit program counter (PC)
Only lower 8-bits of PC increment on linear execution
Requires only an 8-bit PC+1 computation (one slice)

Placement in pipeline,

stage action
===== ======
0 register
1 register
2 increment PC
3 LUT new PC based on DSP p output and instruction opcode

PC function inputs per output bit (map to 13:1 function at 2 LUTs/bit),

bits meaning
==== =======
1 next PC if not branching (computed in prior stage)
1 top of return stack
1 immediate absolute branch address
1 DSP P output register (computed branch target in prior clock)
1 DSP P output register sign bit (for conditional branch)
3 bits from instruction opcode

LUT area estimate,

LUTs usage
==== =====
20 13:1 function for next 10-bit PC computation including instruction decode
8 PC+1 adder for 8 lower bits of PC
10 for 2 stage registers (2-bits/LUT)
---- -----
38 total


========================
INSTRUCTION PIPELINE
========================
Todo, remember to count cost to pipeline opcode bits through stages



=========

NOTES

=========

=================================================
ADDRESS REGISTER XOR INSTEAD OF ADD IMMEDIATE
=================================================
Planning on [address ^ immediate] addressing
This removes an adder from the design

XOR is the same as [address + immediate] for an n-bit immediate
When lower n-bits of address are zero
Means data must be aligned to the nearest pow2 of maximum immediate offset

Using the following terms,
ggggggoooo
g = group bits (address bits choose the group of data)
o = offset bits (address bits are zero, immediate chooses element in group)

For bits in the address which are not cleared (i.e. the group bits),
Setting bits in the immediate results in accessing a neighbor group
Regardless of the starting group in the address register
It is possible to roll through all aligned groups
But ordering is different based on starting group address
Example of group bits for address crossed with immediate

00 01 10 11
+-------------
00 | 00 01 10 11
01 | 01 00 11 10
10 | 10 11 00 01
11 | 11 10 01 00


===================================
BRAM VARIABLE BIT-WIDTH WINDOWS
===================================
Trying to support transparent pack/unpack of variable bit-widths from BRAM
Want zero impact to ISA, no special instructions
Instead dividing address range into windows of different bit-widths
Each address range addresses at a multiple of the bit-width
Effectively the high bits of address choose the bit-width

Store path limited to {8,16,32}-bit
Only using BRAM byte write mask to avoid any {read, modify, write}

Fixed signed vs unsigned configuration,

size choice
====== ======
32-bit signed (but doesn't matter)
16-bit going to go with signed (needed for vector or audio)
8-bit unsigned (keeps implementation simple)
4-bit unsigned for sure (sprites?)


==========================================
WORKING THROUGH OPTIONS DSP OPERATIONS
==========================================
Opcode forms,

p = c op ((a << 16) + unsigned(b))
p = c + (a * b)
p = c - (a * b)

Where op can be the following,

and .....
nand ....
nor .....
not .....
or ......
xnor ....
xor .....

Where the following can also be applied,

extra set c bit -1 to 1 (for rounding)
(a * b) can be forced to zero (nop)
((a << 16) + unsigned(b)) can be forced to zero (nop)
((a << 16) + unsigned(b)) can be forced to all ones
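A hypothetical software model of those opcode forms (unary 'not' and the rounding/forcing modifiers are omitted; helper names are mine):

```python
# Sketch of the three DSP opcode forms listed above:
#   p = c op ((a << 16) + unsigned(b))
#   p = c + (a * b)
#   p = c - (a * b)
import operator

OPS = {'and': operator.and_, 'or': operator.or_, 'xor': operator.xor,
       'nand': lambda x, y: ~(x & y), 'nor': lambda x, y: ~(x | y),
       'xnor': lambda x, y: ~(x ^ y)}

def dsp(op, a, b, c, mul=False, sub=False):
    if mul:
        return c - a * b if sub else c + a * b
    ab = (a << 16) + (b & 0xFFFF)       # join a:b, b treated as unsigned
    return OPS[op](c, ab)

assert dsp('xor', 0, 0xFFFF, 0) == 0xFFFF
assert dsp('and', 3, 5, 100, mul=True) == 115
assert dsp('and', 3, 5, 100, mul=True, sub=True) == 85
```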


===============================================
WORKING THROUGH OPTIONS FOR A,B,C DSP INPUT
===============================================
DSP inputs (as they appear in the core),

24-bit a
16-bit b
40-bit c

Possible inputs,

10-bit immediate
16-bit top of return stack
32-bit top of data stack
32-bit register file load (from prior instruction)
32-bit BRAM load (from prior instruction)
40-bit DSP p output


====================
FAST ABS MIN MAX
====================
Simple design exercise to think through DSP issues

These need to work on the 40-bit accumulator without precision loss
So using multiply stage is out

Min and max, where a is the accumulator, and b is the limit,

min(a, b) = ((a - b) & ((a - b) < 0 ? ~0 : 0)) + b
max(a, b) = ((a - b) & ((a - b) < 0 ? 0 : ~0)) + b
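A quick software check of those two identities (Python ints standing in for the 40-bit accumulator):

```python
# Check of the branchless min/max identities above.

def bmin(a, b):
    d = a - b
    return (d & (~0 if d < 0 else 0)) + b   # keep d when a < b, else 0

def bmax(a, b):
    d = a - b
    return (d & (0 if d < 0 else ~0)) + b   # keep d when a >= b, else 0

for a in (-5, -1, 0, 3, 100):
    for b in (-7, 0, 3, 42):
        assert bmin(a, b) == min(a, b)
        assert bmax(a, b) == max(a, b)
```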

Want to be able to do the following,

acc -= b;
acc = acc < 0 ? acc : 0; // want to fold this into prior operation
acc += b;

Have to either LUT or register p in stage 3,
Could LUT p to zero if signed or unsigned based on control bit
This works out to 2-bits/LUT (pair of 5:1 functions with same input)
So 20 LUTs total (same as just registering)
Plus likely need to decode control and enable from opcode in prior pass

inputs
------
2 p bits
1 p sign bit (might want the overflow sign bit?)
1 enable bit
1 signed or unsigned control bit

This enables min and max to work in 2 instructions without branching

Absolute value,

abs(a) = max(a, -a)

Does this make the case for,

expanding data stack to 40-bit (to match accumulator)?
reducing accumulator to 36-bit, or even 32-bit?

Operation,

push copy of acc; // want to fold into start of next op
acc += acc; acc = acc < 0 ? 0 : acc;
acc -= pop;

Using top of data stack for DSP input means it must be pre-registered
That register could be 40-bit until it gets actually stored on stack
Want a bit which marks if should consume data stack

Ideally push to happen before the first add (included in that opcode)
Stage 0 : must save top to stack RAM
Stage 1 : set top to p
Todo, think through when c is computed again

Data stack top is going to be expensive
40-bit : minimum 80 LUTs
32-bit : minimum 64 LUTs

Time to rethink ...

Related
Avnet AES-KU040-DB-G (XCKU040 Based Dev Board)

TK4 - Try 3


Update Log
2016/10/04 : Trying dropping the data stack (won't fit, gets expensive with adders if doing indexed access). Have something figured out for DSP inputs. Rethinking the ISA, have a placeholder for now to see that things fit in the worst case. Thinking the estimated 81% LUT usage so far will be too much, perhaps drop the accumulator down from 40-bits to 36-bits or 32-bits? The problem I see with this design is that due to load/store being at the end of the pipeline, after a store, the next instruction will be starved of inputs. Next try (4) will likely focus on a {load/store, mux input, mul, add} ordered pipeline, which will attempt to make {load,dsp}, {store,return}, {dsp,call/jmp} VLIW instruction usage work well, but will likely require a Forth-like address register (to avoid 2 dependent loads before feeding the DSP).

Notes

==================
CORE RESOURCES
==================
Core functional units,

16-bit x 8-entry return stack (1 port)
32-bit x 8-entry register file (2 ports, port 0 for DSP input, port 1 for BRAM address)
32-bit x 1024-entry BRAM (2 ports, port 0 for instruction fetch, port 1 for data)


===================
EXECUTION MODEL
===================
4 threads/core of execution running with guaranteed round-robin scheduling
Instructions are VLIW style with a fixed logical ordered set of operations

Order Operation
===== =========
1st mux inputs for DSP ............ uses loads from prior instruction
2nd DSP execution .................
3rd load/store to REG and BRAM .... can store DSP result
4th branch ........................ can branch to DSP result


=====================
PHYSICAL PIPELINE
=====================
Designing under the following constraints,

Loads from RAMs are not used until next stage
Each stage does only one LUT or ADD both of which can be vertically chained
DSP is fully pipelined

Outline,

DSP DSP DSP DSP DSP DAT ADR BLK BLK
Stage A B MUL C ADD P REG REG OUT RAM PC INS
===== === === === === === === === === === === ===
0 lut lut @
1 mul reg lut
2 add lut add
3 lut @! lut @! lut @
----- --- --- --- --- --- --- --- --- --- --- ---

DSP A B .... DSP input a and b arguments
DSP MUL .... DSP mul stage
DSP C ...... DSP input c argument
DSP ADD .... DSP add/op stage
DAT REG .... Register file data load/store
ADR REG .... Register file address register to BRAM address translation
BLK OUT .... BRAM data construct write value
BLK RAM .... BRAM data load/store
PC ......... Update program counter
INS ........ Fetch next instruction


=======
ISA
=======
Alphabet usage,
a ... 24-bit DSP input (sign extended)
b ... 16-bit DSP input (sign extended)
c ... 48-bit DSP input, accumulator
d ... 32-bit data register index
f ... 32-bit BRAM fetched value
i ... 10-bit immediate
j ... 10-bit program counter
m ... 2-bit BRAM memory mode
p ... 48-bit DSP output
s ... 3-bit BRAM address base register index
w ... 32-bit BRAM write value

Encoding,
11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
......................iiiiiiiiii Immediate 10-bits
================================
.................mmsss.......... BRAM control
.................00............. f=[s]
.................01............. f=[s^i]
.................10............. [s]=w
.................11............. [s^i]=w
...................sss.......... Register file address register index
================================
...oooooaabbccddd............... DSP control
...ooooo........................ Opcode (todo)
........aa...................... DSP a input choice
..........bb.................... DSP b input choice
............cc.................. DSP c input choice
..............ddd............... Register file data register index
================================
ggg............................. PC control
???............................. No branch
???............................. Return
???............................. Call immediate
???............................. Call to p from end of prior instruction
???............................. Conditional jump immediate if p<0 from end of prior instruction
???............................. Conditional jump immediate if p>=0 from end of prior instruction
???............................. Jump immediate
???............................. Jump to p from prior instruction


============================
CURRENT LUT BUDGET USAGE
============================
Budget is 400 LUTs/core, adding as design is roughed out,

LUTs % usage
==== === =====
32 8 register file (4x 8-LUT SLICEM 32-entry x 8-bit 2 port RAM)
8 2 return stack ( 8-LUT SLICEM 32-entry x 16-bit 1 port RAM)
---- --- -----
20 5 DSP p output modifier
102 26 DSP c input
48 12 DSP b input
24 6 DSP a input
18 5 BRAM address generation
34 9 BRAM output generation
38 10 program counter
==== === =====
324 81 total


=====================
DSP BIT ALIGNMENT
=====================
DSP hardware does p=op(join(a,b),c) or p=c+mul(a,b) or p=c-mul(a,b)
25-bit a
18-bit b
48-bit c
48-bit p

Designing for the following,
222222222222222211111111111111110000000000000000
fedcba9876543210fedcba9876543210fedcba9876543210
================================================
s........................ 25th bit is sign extended
.aaaaaaaaaaaaaaaaaaaaaaaa using only 24-bits of a
================================================
................00 2 lower bits are set to zero
bbbbbbbbbbbbbbbb.. 16-bits of b shifted left by 2
================================================
aaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbb00 extent of join(a,b) input (40-bit effective)
================================================
........................................00 2 lower bits are set to zero
........................................1. option to set to one for rounding
........cccccccccccccccccccccccccccccccc.. 32-bits of c shifted left by 2
cccccccc.................................. maintaining extra 8-bits of p
================================================

The reason for the 2-bit padding,
Can reuse the join(a,b) cases as a>>16 for the multiply input


=======================
DSP INPUT CHALLENGE
=======================
Need to support both
BRAM unpack in the LUT
DSP a and b inputs separate
DSP a and b inputs joined {MSB a, b LSB}
Which forces needing unpack options with >>16 (not going to fit)

Solving this by
Only supporting unpack in c and b
The {MSB a, b LSB} input is only going to support p (accumulator feedback)
The c input is reused for the non-accumulator input
This works out because only subtract in the {MSB a, b LSB} cases is non-commutative
The mul cases will still use c as an accumulator
Not supporting unpacking in the a input


=========================
DSP P OUTPUT MODIFIER
=========================
Placement in pipeline,

stage action
===== ======
0
1
2
3 modify p post DSP for input for next instruction

Can conditionally zero p if signed or unsigned,

inputs for pair of 5:1 functions in a LUT
=========================================
2 p bits
1 p sign bit (might want the overflow sign bit?)
1 enable bit
1 signed or unsigned control bit

LUT area estimate,

LUTs usage
==== =====
20 p output modifier


===========================
DSP C INPUT
===========================
Need to also unpack BRAM load options so this gets expensive
Placement in pipeline,

stage action
===== ======
0 LUT DSP c input
1 register c
2 register pre-translated address for stage 0 of next cycle
3 register pre-translated address for stage 0 of next cycle

The c input expanded with unpack options, and control bits,

76543210fedcba9876543210fedcba9876543210 n LUT input count
======================================== = ===============
<-----------------------------iiiiiiiiii i 1-bit
pppppppppppppppppppppppppppppppppppppppp p 1-bit
<-------dddddddddddddddddddddddddddddddd d 1-bit
========================================
<-------aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa f 4-bits for MSB 8-bits of output
<-----------------------bbbbbbbbbbbbbbbb 7-bits for 2nd LSB 4-bits of output
<-----------------------cccccccccccccccc 15-bits for LSB 4-bits of output
00000000000000000000000000000000dddddddd
00000000000000000000000000000000eeeeeeee
00000000000000000000000000000000ffffffff
00000000000000000000000000000000gggggggg
000000000000000000000000000000000000hhhh
000000000000000000000000000000000000iiii
000000000000000000000000000000000000jjjj
000000000000000000000000000000000000kkkk
000000000000000000000000000000000000llll
000000000000000000000000000000000000mmmm
000000000000000000000000000000000000nnnn
000000000000000000000000000000000000oooo
========================================
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx needs 2-bit opcode control
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx needs 2-bit MSB of pre-translate address
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx........ needs 1-bit LSB of pre-translate address
................................xxxx.... needs 2-bit LSB of pre-translate address
....................................xxxx needs 3-bit LSB of pre-translate address
========================================
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx........ 12:1 function (2 LUT/bit) x 32-bit = 64 LUT
................................xxxx.... 15:1 function (4 LUT/bit) x 4-bit = 16 LUT
....................................xxxx 24:1 function (4 LUT/bit) x 4-bit = 16 LUT

LUT area estimate,

LUTs usage
==== =====
96 generate c
6 to register 5-bits x 2 stages of pre-translate address (rounded up)
---- -----
102 total


===========================
DSP B INPUT
===========================
Less expensive than c because of fewer bits
Placement in pipeline,

stage action
===== ======
0 LUT DSP b input
1
2
3

The b input expanded with unpack options, and control bits,

fedcba9876543210 n LUT input count
================ = ===============
<-----iiiiiiiiii i 1-bit
pppppppppppppppp p 1-bit
dddddddddddddddd d 1-bit
================
aaaaaaaaaaaaaaaa f 4-bits for MSB 8-bits of output
bbbbbbbbbbbbbbbb 7-bits for 2nd LSB 4-bits of output
cccccccccccccccc 15-bits for LSB 4-bits of output
00000000dddddddd
00000000eeeeeeee
00000000ffffffff
00000000gggggggg
000000000000hhhh
000000000000iiii
000000000000jjjj
000000000000kkkk
000000000000llll
000000000000mmmm
000000000000nnnn
000000000000oooo
================
xxxxxxxxxxxxxxxx needs 2-bit opcode control
xxxxxxxxxxxxxxxx needs 2-bit MSB of pre-translate address
xxxxxxxx........ needs 1-bit LSB of pre-translate address
........xxxx.... needs 2-bit LSB of pre-translate address
............xxxx needs 3-bit LSB of pre-translate address
================
xxxxxxxx........ 12:1 function (2 LUT/bit) x 8-bit = 16 LUT
........xxxx.... 16:1 function (4 LUT/bit) x 4-bit = 16 LUT
............xxxx 25:1 function (4 LUT/bit) x 4-bit = 16 LUT

LUT area estimate,

LUTs usage
==== =====
48 generate b


===========================
DSP A INPUT
===========================
No need to unpack for this input
Placement in pipeline,

stage action
===== ======
0 LUT DSP a input
1
2
3

Inputs can use simple 4:1 MUX,

76543210fedcba9876543210
========================
pppppppppppppppppppppppp this is p>>16
<-------------iiiiiiiiii
pppppppppppppppppppppppp
dddddddddddddddddddddddd

LUT area estimate,

LUTs usage
==== =====
24 generate a


===========================
BRAM ADDRESS GENERATION
===========================
Supports the feature of variable bit-width windows into the 4 KB of RAM
Placement in pipeline,

stage action
===== ======
0 fetch base address from register file
1 optionally XOR immediate
2 translate into BRAM address
3

Implementation requires XOR control to be single bit in opcode

BRAMs always in 32-bit port mode,

fedcba9876543210
================
.xxxxxxxxxx00000 - requires 10-bit address

Address register,

fedcba9876543210 access
================ ======
00....xxxxxxxxxx 1024 x 32-bit
01...xxxxxxxxxxx 2048 x 16-bit
10..xxxxxxxxxxxx 4096 x 8-bit
11.xxxxxxxxxxxxx 8192 x 4-bit (supported for read only)

Address register value to BRAM address translation
Uses a 6:1 function for each bit,

bits meaning
==== =======
4 address shifted left {0,1,2,3} bits
2 the 'fe' address bits
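
The translation above can be sanity checked with a functional model (Python sketch; the helper name and return shape are illustrative, not part of the design). It assumes the 'fe' bits select the element width and the low bits of the address register index elements within the window,

```python
def bram_translate(areg):
    # 'fe' bits of the 16-bit address register select the element width:
    # 00 = 32-bit, 01 = 16-bit, 10 = 8-bit, 11 = 4-bit (read only)
    mode = (areg >> 14) & 0b11
    width = 32 >> mode                        # element width in bits
    index = areg & ((1 << (10 + mode)) - 1)   # element index in the window
    word = index >> mode                      # 10-bit BRAM word address
    elem = index & ((1 << mode) - 1)          # element position in the word
    return word, elem, width

# 1024 x 32-bit window: word address is the index itself
assert bram_translate(0x0005) == (5, 0, 32)
# 2048 x 16-bit window: halfword 3 lives in word 1, upper half
assert bram_translate((0b01 << 14) | 3) == (1, 1, 16)
# 8192 x 4-bit window: nibble 8191 is nibble 7 of word 1023
assert bram_translate((0b11 << 14) | 0x1FFF) == (1023, 7, 4)
```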

LUT area estimate,

LUTs usage
==== =====
8 optional XOR (16-bits at 2-bits per LUT), rounding up for ending register
10 translate (10-bits x 1 LUT)
---- -----
18 total


===========================
BRAM OUTPUT GENERATION
===========================
Shifts DSP p output for store, and compute byte write mask
Placement in pipeline,

stage action
===== ======
0
1
2
3 LUT new output here

Permutations (showing address and byte write mask for store),

11111111111111110000000000000000
fedcba9876543210fedcba9876543210
================================
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - 32-bit adr=00....xxxxxxxxxx write=1111
................bbbbbbbbbbbbbbbb - 16-bit adr=01...xxxxxxxxxx0 write=0011
cccccccccccccccc................ - 16-bit adr=01...xxxxxxxxxx1 write=1100
........................dddddddd - 8-bit adr=10..xxxxxxxxxx00 write=0001
................eeeeeeee........ - 8-bit adr=10..xxxxxxxxxx01 write=0010
........ffffffff................ - 8-bit adr=10..xxxxxxxxxx10 write=0100
gggggggg........................ - 8-bit adr=10..xxxxxxxxxx11 write=1000
............................hhhh - 4-bit adr=11.xxxxxxxxxx000
........................iiii.... - 4-bit adr=11.xxxxxxxxxx001
....................jjjj........ - 4-bit adr=11.xxxxxxxxxx010
................kkkk............ - 4-bit adr=11.xxxxxxxxxx011
............llll................ - 4-bit adr=11.xxxxxxxxxx100
........mmmm.................... - 4-bit adr=11.xxxxxxxxxx101
....nnnn........................ - 4-bit adr=11.xxxxxxxxxx110
oooo............................ - 4-bit adr=11.xxxxxxxxxx111

Shift value for store,
Requires 3:1 MUX per bit, 32 LUTs

Generate write enable for store,
Requires same 4-bits per function,
2 lower address bits
2 upper address bits
2 LUTs (5:1 function sharing inputs, 2 outputs per LUT)
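
The store path above can be sketched functionally (Python; illustrative names, not an implementation): shift the value into its lane and derive the 4-bit byte write mask from the address mode and low address bits,

```python
def store_lanes(areg, value):
    # 'fe' bits select element width; 4-bit windows are read only,
    # so stores only cover the {32,16,8}-bit modes
    mode = (areg >> 14) & 0b11
    assert mode != 0b11, "4-bit windows are read only"
    width = 32 >> mode                           # 32, 16, or 8 bits
    elem = areg & ((1 << mode) - 1)              # element position in the word
    shifted = (value & ((1 << width) - 1)) << (elem * width)
    nbytes = width // 8
    mask = ((1 << nbytes) - 1) << (elem * nbytes)
    return shifted, mask

assert store_lanes(0x0000, 0x12345678) == (0x12345678, 0b1111)      # 32-bit
assert store_lanes((0b01 << 14) | 1, 0xBEEF) == (0xBEEF0000, 0b1100)  # high half
assert store_lanes((0b10 << 14) | 2, 0xAB) == (0x00AB0000, 0b0100)    # byte 2
```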

LUT area estimate,

LUTs usage
==== =====
32 shift value for store
2 generate write enable
---- -----
34 total


===================
PROGRAM COUNTER
===================
10-bit program counter (PC)
Only lower 8-bits of PC increment on linear execution
Requires only an 8-bit PC+1 computation (one slice)

Placement in pipeline,

stage action
===== ======
0 register
1 register
2 increment PC
3 LUT new PC based on DSP p output and instruction opcode

PC function inputs per output bit (map to 13:1 function at 2 LUTs/bit),

bits meaning
==== =======
1 next PC if not branching (computed in prior stage)
1 top of return stack
1 immediate absolute branch address
1 DSP P output register (computed branch target in prior clock)
1 DSP P output register sign bit (for conditional branch)
3 bits from instruction opcode

LUT area estimate,

LUTs usage
==== =====
20 13:1 function for next 10-bit PC computation including instruction decode
8 PC+1 adder for 8 lower bits of PC
10 for 2 stage registers (2-bits/LUT)
---- -----
38 total


========================
INSTRUCTION PIPELINE
========================
Todo, remember to count cost to pipeline opcode bits through stages


===============

EXTRA NOTES

===============

=================================================
IDEA LIST
=================================================
Want to be able to use address register fetch for DSP source

Would be nice to get to 16-bit immediate to be able to do self modify immediates




=================================================
ADDRESS REGISTER XOR INSTEAD OF ADD IMMEDIATE
=================================================
Planning on [address ^ immediate] addressing
This removes an adder from the design

XOR is the same as [address + immediate] for an n-bit immediate
When lower n-bits of address are zero
Means data must be aligned to the nearest pow2 of maximum immediate offset

Using the following terms,
ggggggoooo
g = group bits (address bits choose the group of data)
o = offset bits (address bits are zero, immediate chooses element in group)

For bits in address which are not cleared (ie the group bits),
Setting bits in the immediate results in accessing a neighbor group
Regardless of the starting group in the address register
It is possible to roll through all aligned groups
But ordering is different based on starting group address
Example of group bits for address crossed with immediate

00 01 10 11
+-------------
00 | 00 01 10 11
01 | 01 00 11 10
10 | 10 11 00 01
11 | 11 10 01 00
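
Both the XOR/ADD equivalence and the group table are easy to check in a few lines (Python),

```python
# XOR equals ADD when the base address has its low n offset bits clear
base = 0b1010_0000                    # 4 offset bits, all zero
for imm in range(16):
    assert (base ^ imm) == (base + imm)

# Group bits of the address crossed with group bits of the immediate
# reproduce the table above
table = [[g ^ i for i in range(4)] for g in range(4)]
assert table == [[0, 1, 2, 3],
                 [1, 0, 3, 2],
                 [2, 3, 0, 1],
                 [3, 2, 1, 0]]
```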


===================================
BRAM VARIABLE BIT-WIDTH WINDOWS
===================================
Trying to support transparent pack/unpack of variable bit-widths from BRAM
Want zero impact to ISA, no special instructions
Instead dividing address range into windows of different bit-widths
Each address range addresses at a multiple of the bit-width
Effectively the high bits of address choose the bit-width

Store path limited to {8,16,32}-bit
Only using BRAM byte write mask to avoid any {read, modify, write}

Fixed signed vs unsigned configuration,

size choice
====== ======
32-bit signed (but doesn't matter)
16-bit going to go with signed (needed for vector or audio)
8-bit unsigned (keeps implementation simple)
4-bit unsigned for sure (sprites?)


==========================================
WORKING THROUGH OPTIONS DSP OPERATIONS
==========================================
Opcode forms,

p = c op ((a << 16) + unsigned(b))
p = c + (a * b)
p = c - (a * b)

Where op can be the following,

and .....
nand ....
nor .....
not .....
or ......
xnor ....
xor .....

Where the following can also be applied,

extra set c bit -1 to 1 (for rounding)
(a * b) can be forced to zero (nop)
((a << 16) + unsigned(b)) can be forced to zero (nop)
((a << 16) + unsigned(b)) can be forced to all ones
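
A behavioral sketch of the opcode forms above (Python; the names, the `OPS` table, and the treatment of `not` are illustrative, not the actual encoding),

```python
MASK48 = (1 << 48) - 1

def join(a, b):
    # (a << 16) + unsigned(b), confined to the 48-bit datapath
    return ((a << 16) + (b & 0xFFFF)) & MASK48

OPS = {
    'and':  lambda a, b, c: c & join(a, b),
    'or':   lambda a, b, c: c | join(a, b),
    'xor':  lambda a, b, c: c ^ join(a, b),
    'nand': lambda a, b, c: ~(c & join(a, b)) & MASK48,
    'nor':  lambda a, b, c: ~(c | join(a, b)) & MASK48,
    'xnor': lambda a, b, c: ~(c ^ join(a, b)) & MASK48,
    'not':  lambda a, b, c: ~join(a, b) & MASK48,    # c unused here
    'madd': lambda a, b, c: (c + a * b) & MASK48,    # p = c + (a * b)
    'msub': lambda a, b, c: (c - a * b) & MASK48,    # p = c - (a * b)
}

assert OPS['madd'](3, 4, 10) == 22
assert OPS['and'](1, 0, 1 << 16) == 1 << 16     # join(1, 0) == 1 << 16
assert OPS['msub'](1, 1, 0) == MASK48           # 48-bit wraparound of -1
```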


===============================================
WORKING THROUGH OPTIONS FOR A,B,C DSP INPUT
===============================================
DSP inputs (as they appear in the core),

24-bit a
16-bit b
40-bit c

Possible inputs,

10-bit immediate
16-bit top of return stack
32-bit register file load (from prior instruction)
32-bit BRAM load (from prior instruction)
40-bit DSP p output


====================
FAST ABS MIN MAX
====================
These need to work on the N-bit accumulator without precision loss
So using multiply stage is out

Min and max, where a is the accumulator, and b is the limit,

min(a, b) = ((a - b) & ((a - b) < 0 ? ~0 : 0)) + b
max(a, b) = ((a - b) & ((a - b) < 0 ? 0 : ~0)) + b
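
The branchless forms above are easy to verify directly (Python sketch; `~0` is written as -1, which is all ones in two's complement),

```python
def bmin(a, b):
    d = a - b
    return (d & (-1 if d < 0 else 0)) + b

def bmax(a, b):
    d = a - b
    return (d & (0 if d < 0 else -1)) + b

for a in range(-4, 5):
    for b in range(-4, 5):
        assert bmin(a, b) == min(a, b)
        assert bmax(a, b) == max(a, b)
```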

Want to be able to do the following,

acc -= b;
acc = acc < 0 ? acc : 0; // want to fold this into prior operation
acc += b;

Have to either LUT or register p in stage 3,
Could LUT p to zero if signed or unsigned based on control bit
This works out to 2-bits/LUT (pair of 5:1 functions with same input)
So 20 LUTs total (same as just registering)
Plus likely need to decode control and enable from opcode in prior pass

inputs
------
2 p bits
1 p sign bit (might want the overflow sign bit?)
1 enable bit
1 signed or unsigned control bit

This enables min and max to work in 2 instructions without branching

Absolute value this way is not friendly for accumulator,

abs(a) = max(a, -a)

Instead leverage the "free" branching (single cycle abs),

...; if(acc>=0) goto then; // add branch to end of prior op
acc = -acc; // works since acc is now in a:b for 1ops and 2ops
then:

Related Material
Avnet AES-KU040-DB-G (XCKU040 Based Dev Board)
Bit Hacks
Bit Manipulation Instruction Sets
A Fast Carry Chain Adder for Virtex-5 FPGAs
GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator
On Hard Adders and Carry Chains in FPGAs
Principles of FPGA IP Interconnect

Epiphany-V Taped Out

Forth Hardware Thoughts

James Bowman's FPGA based J1 : Site | PDF | Presentation | Forth Source
Chuck Moore : Arithmetic | Instruction Set | Ether Forth | Problem Oriented Language

GA144
GreenArrays
144 cores
9216 18-bit words of memory
21.3 mm^2 area on 180 nm process
0.65 watts at peak
666 MHz peak instruction rate

At 180 nm, roughly 20 GA144s would fit in large GPU area: 144 cores * 20x = 2880 cores
At 180 nm, roughly 380 GA144s would fit in large GPU 250 watt budget: 144 cores * 380x = 54,720 cores
At 28 nm, assuming 40x smaller area than 180 nm, in large GPU die: 144 cores * 20x * 40x = 115,200 cores
115,200 cores * 64 words/core = 7,372,800 18-bit words of memory
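
The scaling arithmetic of this thought experiment checks out (Python),

```python
cores = 144                              # cores per GA144
assert cores * 20 == 2880                # 20 chips of large-GPU area at 180 nm
assert cores * 380 == 54720              # 380 chips in a 250 watt budget
assert cores * 20 * 40 == 115200         # 40x area scaling from 180 nm to 28 nm
assert 115200 * 64 == 7372800            # 64 18-bit words per core
assert 115200 // 18 == 6400              # 18 +* ops per 38-bit multiply result
```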

GA144 runs async, but has a peak instruction rate which is roughly 3x higher than GPUs of the 180 nm era (based on wikipedia numbers). The point of this thought experiment was to roughly imagine how a forth based machine would scale in an alternative timeline where they had been commercially successful. Seems possible to scale to over 100 K cores on 28 nm. These forth cores don't directly compare to GPU cores. For example, GA144 38-bit multiply result takes 18 +* operations: 115,200/18 = 6400 multiplies/clock, and forth designed around rational math instead of floating point. Seems possible that in terms of raw arithmetic, the forth machine would be competitive, if problems were solved in a "parallel forth" way. However, in terms of programmable logic, the forth machine would likely be over an order of magnitude faster. Modern machines tend to use area and pipelining to make expensive operations (like multiply add) run fast, while GA144 effectively micro-codes them, keeping low area and much higher throughput for inexpensive operations.

The imaginary scaled GA144 memory capacity looks possible for a high ALU/MEM ratio. Note GA144 only has 64 words of memory per core. Working this from a different perspective, the Epiphany-V has 64 MB of on-chip memory. That 64 MB divided across 256 K forth-sized cores is again only 256 bytes per core (or 64 32-bit words/core). Point being, if one wanted to scale to massive counts of simple cores, memory/core has to be tiny.

Which brings up the ultimate question: is it possible to practically leverage the order of magnitude increase in performance for simple operations, when one needs to deconstruct every problem into such small tasks?

Technical Evaluation of Traditional vs New "HDR" Encoding Crossed With Display Capability

Conclusion first. There is no technical justification for using the new HDR signal standards for HDR. Classic 8-bit/channel or 10-bit/channel Gamma 2.2 output is more than adequate for the full range of current and future consumer displays, and still offers substantial advantages in performance (less bandwidth, lower shader cost, no hidden encoding passes) and observed visual quality (no per-display tone-mapper variation). The best option for a game developer for HDR TV output is to skip the "HDR API" option, and just use the classic non-HDR output.

Remember, the TV's physical contrast ratio doesn't change based on the signal type!

All that is needed to target HDR on HDR TVs with classic non-HDR outputs is to drop Average Picture Level by around 1-stop (as HDR LCD TVs can be 1-stop brighter than classic LDR TVs) by dropping exposure before the existing in-game tone-mapper. It really is just that easy.



WRGB OLED HDR TV Capability
After changing user display settings to the optimal non-factory choices for best calibration, and then calibrating a 2015 LG OLED display, I get the following calibration curve for traditional non-HDR output,

[figure omitted: measured calibration curve]

Measured Facts,

(1.) Display pre-calibration has visible grey level color bias which changes based on intensity.
(2.) Display pre-calibration has lower accuracy than a traditional 8-bit/channel signal.
(3.) Darks are significantly out of calibration (many 8-bit steps off).

WRGB OLED dark accuracy issues have been confirmed by external sources like HDTVTest.co.uk: "Unfortunately there were more near-black woes on the 55EG920V. According to LG, the above-black handling on its 2015 OLEDs is done at lower than 8-bit gradation, which explains various dark-scene phenomenons ... While careful calibration at the lower end could attenuate some of these issues, the only surefire method to make all these above-black artefacts disappear was – ironically – to lower [Brightness] and crush some shadow detail.".

As of 2016, LG ships its OLED TVs with default settings which crush shadow detail in an attempt to work around display problems with artifacts in near-black tonality, quoting the 2016 August 10th, LG OLED65E6V 4K HDR TV Review from HDTVTest, "LG OLED owners probably don’t realise that the default [Brightness] value of '50' crushed some shadow detail in HDR mode. We do have access to a couple of proprietary HDR-mastered patterns that let us verify this, but if you own the 4K Blu-ray disc of The Revenant, go to timecode 00:19:56 (where Hugh is reassuring his son at the campsite) and raise the [Brightness] on your LG OLED TV from '50', and you should be able to see how much shadow detail you were missing before.".

Display panel user settings can also push the TV far out of calibration, quoting the same 2016 August 10th HDTVTest OLED review, "One word of caution regarding the [Colour Management System] on the LG OLED65E6 – the [Saturation], [Tint] and [Luminance] controls are very potent. From our testing, even one wrong click or two would introduce significant artefacts in the picture, and adjusting one specific colour would paradoxically affect the calibrated greyscale, necessitating multiple calibration runs.".

A core problem with these WRGB OLED based TVs is that they go significantly out of calibration in a matter of X hours of usage. The extent of this can be seen clearly in the HDTVTest pre-calibration RGB Balance graphs,

[figure omitted: HDTVTest pre-calibration RGB Balance graphs]

Quoting the 2016 April 1st, LG OLED65G6P HDR OLED TV Review on HDTVTest, "Case in point: we originally calibrated one of the units and added some targeted touch-ups at 70% stimulus, only to find that, 60 hours later, the same adjustments that had gained us totally flat grayscale at the time of calibration were no longer ideal.". Also, "We calibrated several G6s in both the ISF Night and ISF Day modes, and found that with about 200-250 hours run-in time on each unit, the grayscale tracking was consistently red deficient and had a green tint as a result (relative to a totally accurate reference). We witnessed the same inaccuracy on the European Panasonic CZ950/CZ952 OLED (which of course also uses a WRGB panel from LG Display).".

The source of this accuracy problem with OLEDs, quoting the 2016 January Panasonic TX-65CZ952B OLED TV Review from HDTVTest, "OLEDs do not have inherently good uniformity, so periodically, the electrical current flowing through the pixels must be measured and offset to produce a picture with uniform light distribution.".


HDR LCD With Local Dimming Capability
Facts,

(1.) LCD native contrast ratios without local dimming range around 900:1 to 5000:1 (roughly 10 to 12 stops).
(2.) With a black screen and LED backlight off, contrast ratio is limited by screen reflection and backlight bleed from neighbor dimming zones.
(3.) One LED's local dimming zone spans tens of thousands of pixels.

Grabbing measured numbers from my prior GDC presentation on Advanced Techniques and Optimization of HDR Color Pipelines, best case contrast ratio for a 1% ambient reflection screen in a 0.05 nit ambient level room (a room so dark it is lit only by the screen itself) is roughly 20 stops. This number gives an estimate for the observed uniformity error range: 20 stops (LED off) - 12 stops (LED on) = 8 stops.

Practically speaking, the dark accuracy errors introduced by local dimming can range up to 8 stops in an absolute dark room. To place this in perspective, this error is larger in contrast than what is reproducible on a typical photographic print on paper.

As described by HDTVTest's review of the Panasonic TX-65DX902B from August 2016, " ... the dimming algorithm in HDR mode could be too aggressive even with [Automatic Backlight Control] set to the lowest value of 'Min', spoiling the original creative intent of the movie. One such instance was during the opening space sequence in The Martian: the Viera TX-65DX902B was darkening several dark patches excessively, resulting in a blotchy 'reverse clouding' effect. The DX900′s sharply-defined backlight algorithm also had a tendency to show up the FALD grid structure of the television. As the title 'The Martian' appeared on the aforementioned space scene, the bright letters were accompanied by square-shaped haloing/blooming against the dark backdrop. The same phenomenon could be observed in timecode 00:19:56 of The Revenant where Hugh is reassuring his son Hawk at night, with the silhouettes being displayed in rectangular halos.".

The only way to get acceptable image quality out of an HDR LCD is to disable local dimming, at which point the HDR LCD is limited on average to an 11-stop contrast ratio.


Signal vs Display Capability
Possible technical arguments for a new "HDR" signal encoding for displays are as follows,

(1.) HDR TVs would require a new signal encoding because traditional signal encoding would not be technically capable of image reproduction without artifacts.

(2.) HDR TVs would have such high contrast ratios and brightness that traditional display relative encoding would not be sufficient.

Let's see if any of these are technically valid,

Claim (1.)
For a classic Gamma 2.2 signal, the contrast ratio of the darkest step to white is the following,
 8-bit = 1 / power(1/255,  2.2) to 1 = around  196965:1 (over 17 stops)
10-bit = 1 / power(1/1023, 2.2) to 1 = around 4185298:1 (over 21 stops)
Note that prior to HDR displays there were already true non-dithered 10-bit/channel panels, so a 10-bit panel is assumed below. HDR LED back-lit LCDs have consumer power-limited peaks of around 1500 nits for the low APL (Average Picture Level) scenes typical of HDR. Best case observable contrast for a 1500 nit panel in the darkest 0.05 nit room, with black levels limited only by 1% ambient screen reflection, is roughly 21 stops. So 10-bit Gamma 2.2 can reproduce that contrast ratio just fine. And if there is still any concern here, just moving to Gamma 2.4 pushes the signal's peak contrast ratio to roughly 24 stops.
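
The signal contrast numbers above can be reproduced directly (Python; the helper name is illustrative),

```python
from math import log2

def darkest_step_contrast(bits, gamma=2.2):
    # contrast of the darkest non-zero code to white for an n-bit gamma signal
    steps = (1 << bits) - 1
    ratio = steps ** gamma          # == 1 / (1/steps) ** gamma
    return ratio, log2(ratio)

r8, s8 = darkest_step_contrast(8)       # around 196965:1, over 17 stops
r10, s10 = darkest_step_contrast(10)    # around 4185298:1, almost 22 stops
assert abs(r8 / 196965 - 1) < 1e-3 and s8 > 17
assert abs(r10 / 4185298 - 1) < 1e-3 and s10 > 21
# Gamma 2.4 pushes the 10-bit signal range to roughly 24 stops
assert 23.9 < darkest_step_contrast(10, 2.4)[1] < 24.1
```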

Note the above calculation assumed that the display was actually capable of accurately reproducing those kinds of contrast ratios. As can be seen from the above sections on measured TV accuracy and error, no TV comes close to accurate reproduction. Even an 8-bit per channel Gamma 2.2 signal has higher accuracy than either an HDR OLED or an HDR LCD.

The typical argument for PQ HDR encoding boils down to better perceptual distribution of the steps of the signal. Effectively PQ redistributes steps of the signal to the darks (the area that the HDR OLED TVs by default just clip to zero anyway). Now many PC panels (like the 300+ nit laptop panel I'm using right now) are actually only 6-bit panels natively. For a laptop, often the GPU is actually temporally dithering the 8-bit/channel output signal to 6-bit, and the LCD's relatively slow switching time makes this imperceptible at 60 Hz.

A 1500 nit panel is 5 times brighter than my 6-bit 300 nit laptop, and 10-bits is 16 times the encoding range. Point being, temporal dithering more than works for standard Gamma 2.2 encoding to remove any possible perceptible banding without introducing visually perceptible noise on HDR displays.

Let's look at this from some numbers,
Step 1 == 1 / power(1/1023, 2.2) = 4185298 : 1 (21.9969 stops)
Step 2 == 1 / power(1/1022, 2.2) = 4176303 : 1 (21.9938 stops)
----
21.9969 stops - 21.9938 stops = 0.003 stops contrast difference
When temporal dithering between step 1 and 2, the change in the signal is only 0.003 stops over a 60th or 144th of a second (depending on display refresh rate). Well under the ability of a human to perceive any change with a suitable temporal dither pattern, even if the display could accurately change to exact pixel levels that fast.
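
The 0.003-stop figure follows directly from the ratio of the two step contrasts (Python),

```python
from math import log2

# contrast difference in stops between the two deepest non-zero codes of a
# 10-bit Gamma 2.2 signal, i.e. the dither step discussed above
diff_stops = 2.2 * (log2(1023) - log2(1022))
assert 0.0030 < diff_stops < 0.0032
```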

So it should be rather clear by now that claim (1.) is false, and that traditional Gamma 2.2 signals are more than good enough for practical HDR image reproduction.

Claim (2.)
The last argument reduces to a claim that HDR TVs are providing something new in terms of observable contrast ratio. Specifically, that the physically observable contrast ratio will vary so much between HDR TVs that the TV must now tone-map content itself to compensate for these differences.

Let's look again at some measured facts, starting with LDR displays,

(1.) Around 1000:1 contrast ratio is typical for PC displays (10 stops).
(2.) Top contrast LDR plasma displays had upwards of 40000:1 contrast (15 stops).

Here are some measured numbers of PC panels from TFTCentral,

[table omitted: TFTCentral PC panel measurements]

The Plasma TV contrast numbers deserve a detailed tangent. According to HT Labs Measurements on a Pioneer Kuro 150fd Plasma HDTV, the measured white/black windowed pattern contrast ratio was 44160:1 (roughly 15 stops). This translates into what is expected when the display is not doing any global dimming, as would be typical with HDR content. The measured contrast ratio with power limited global dimming was 18220:1 (roughly 14 stops). These really good plasma TVs have 3 stops higher contrast than the best LCD native contrast ratios.

So not counting any viewing environment effects on observed contrast ratio, a DVD or BluRay movie played back on a classic LDR display had to work across a 5-stop contrast difference without any in-display tone-mapping. Somehow this worked just fine.

Now enter HDR displays. The UHD Alliance Ultra HD Premium Cert requirements are over 1000 nits for LCDs and over 540 nits for OLEDs. LED local dimming LCDs have black levels limited by screen reflection when LEDs off, just like OLEDs. The difference being that LED local dimming zones bleed out rectangular artifacts. But for the sake of argument we can ignore black level for differences in peak contrast ratio, and focus on peak white level. In that case the average peak HDR display difference is only really around 1 stop (HDR LCDs over one stop above the cert limit don't exist for consumers). Point being HDR displays have smaller contrast variation than traditional non-HDR consumer displays, and HDR LCDs on average add around 1-stop extra brightness compared to existing bright non-HDR displays.

Now let's return to observable contrast, and properly account for the effects of room ambient level on display black level. Grabbing numbers from my prior mentioned GDC Presentation where I physically measured ambient level around my house in all the different viewing conditions from day to night: there is a 15-stop variation in ambient level (which results in a 15-stop variation in observed black level on the display). This means quite literally that if you play a BluRay or DVD movie on the range of classic LDR displays, depending on night/day room conditions and room lighting, the observed contrast can roughly range from 4 stops for poor reflectance screens in bright conditions to 15 stops for a good plasma display in an ultra-dark room. Also it is easy to take a great LDR plasma display in a dark room and have many stops more contrast than the brightest consumer HDR displays in an afternoon lit room.

So claim (2.) doesn't hold water. The variation of observable contrast on video playback was massive prior to HDR, and HDR TVs have a perceptually negligible effect on that. There is no technical justification for needing to switch to an in-display tone-mapper for HDR video playback.


Extra Costs for HDR for Games
Use of the new HDR signals strips away the game developer's ability to control tone-mapping. This is a show stopper. For example, all the interesting ideas presented on c0de517e: Tone Mapping and Local Adaption, like using game-side knowledge about shadows and lighting to locally adjust tone-mapping per pixel, simply cannot be applied when the display does its own tone-mapping. This also opens up the content to even more TV/display-induced artifacts.

Here are some comments from HDTVTest reviews talking about the variation of tone-mapping between HDR TVs, "the OLED65E6′s limited peak brightness also resulted in clipped highlight detail with HDR10 content. We played the skydiving sequence (Chapter 18) from the Ultra HD Blu-ray of Kingsman: The Secret Service split using a HDFury Integral device (kindly loaned by friendly and knowledgeable custom installer Ricky Jennings of Kalibrate Limited) to the 65E6 and other 1000+-nit LED LCDs, and the E6V (screen on the right) blew out the sun’s outline earlier than its LCD-based rivals" and "Once we got 4K HDR Ultra HD Blu-ray up and working, we compared the HDR presentation on the LG OLED55E6V side-by-side against a calibrated Panasonic DX900 LED LCD, the most accurate consumer-grade TV so far in terms of PQ EOTF and Rec2020 tracking. And straight away we could see that colours didn’t look right on the OLED: the sandy desert in Mad Max adopted an orangey tint, giving off a cartoony feel (even though some viewers may prefer this richly saturated look); while skin tones in The Martian appeared ruddier than usual even during scenes on Earth. Furthermore, the E6′s default [Brightness] setting of '50' crushed a not insignificant amount of shadow detail, requiring a few upward clicks to bring black floor in line with the Panasonic DX9" and "Unfortunately even our best efforts couldn’t restore the clipping of bright coloured highlights, particularly red-hued ones. On the LG OLED65G6V, the explosions during the storm sequence in Mad Max: Fury Road (timecode 00:28:29) evidently contained less detail than on a rival Ultra HD Premium LED television".

Some HDR signals resort to chroma sub-sampling in order to stay under cable bandwidth limits given resolution and frame rate growth. Chroma sub-sampling is unacceptable for typical game rendered graphics and game UIs. The alternative will be to return to dithered 8-bit per channel even for some HDR cases.

If HDR support hits the desktop, it is likely to incur a 64-bit/pixel tax, doubling the bandwidth required for access compared to classic 32-bit/pixel modes. Also depending on OS/hardware support, there could be hidden full screen passes for pre-display transforms. A single 4K 64-bit/pixel transform on a 128 GB/s GPU is over 1 ms. HDR is set to add a large tax on display operations (like compositing or overlay).

Encoding to/from PQ is expensive in VALU cost. Let's look at some code below. Note as I presented in my GDC Presentation, if a developer attempts to do the PQ transform in a 3D lookup table in combination with color grading, the accuracy is under 8 bits/channel given any practical 3D texture size. HDR signals encoded that way would have higher signal error than classic LDR output at a lower bit-depth.

================================================================
GENERAL GAMMA TRANSFORMS
================================================================
CONVERSION
----------
3 LOG (x4) + 3 MUL + 3 EXP (x4) = 27 ops

c = pow(c, immediate);

v_log_f32 v0, v0
v_log_f32 v1, v1
v_log_f32 v2, v2
v_mul_f32 v0, imm, v0
v_mul_f32 v1, imm, v1
v_mul_f32 v2, imm, v2
v_exp_f32 v0, v0
v_exp_f32 v1, v1
v_exp_f32 v2, v2


================================================================
PQ TRANSFORMS
================================================================
PQ FROM LINEAR
--------------
19 + 15 * 4 = 79 operations

// Input {0 to 1}, output {0 to 1}.
float PqFromLinear(float x) {
float m1 = 0.1593017578125;
float m2 = 78.84375;
float c1 = 0.8359375;
float c2 = 18.8515625;
float c3 = 18.6875;
float p = pow( x, m1 );
return pow((c2 * p + c1) / (c3 * p + 1.0), m2); }

v_log_f32 v0, v0 // 000000000028: 7E004300
v_log_f32 v1, v1 // 00000000002C: 7E024301
v_log_f32 v2, v2 // 000000000030: 7E044302
v_mul_f32 v0, 0x3e232000, v0 // 000000000034: 0A0000FF 3E232000
v_mul_f32 v1, 0x3e232000, v1 // 00000000003C: 0A0202FF 3E232000
v_mul_f32 v2, 0x3e232000, v2 // 000000000044: 0A0404FF 3E232000
v_exp_f32 v0, v0 // 00000000004C: 7E004100
v_exp_f32 v1, v1 // 000000000050: 7E024101
v_exp_f32 v2, v2 // 000000000054: 7E044102
v_mov_b32 v3, 0x3f560000 // 000000000058: 7E0602FF 3F560000
s_mov_b32 s0, 0x4196d000 // 000000000060: BE8000FF 4196D000
v_mad_f32 v4, v0, s0, v3 // 000000000068: D1C10004 040C0100
s_mov_b32 s1, 0x41958000 // 000000000070: BE8100FF 41958000
v_mad_f32 v0, v0, s1, 1.0 // 000000000078: D1C10000 03C80300
v_mad_f32 v5, v1, s0, v3 // 000000000080: D1C10005 040C0101
v_mad_f32 v1, v1, s1, 1.0 // 000000000088: D1C10001 03C80301
v_mac_f32 v3, s0, v2 // 000000000090: 2C060400
v_mad_f32 v2, v2, s1, 1.0 // 000000000094: D1C10002 03C80302
v_rcp_f32 v2, v2 // 00000000009C: 7E044502
v_mul_f32 v2, v3, v2 // 0000000000A0: 0A040503
v_log_f32 v2, v2 // 0000000000A4: 7E044302
v_mul_f32 v2, 0x429db000, v2 // 0000000000A8: 0A0404FF 429DB000
v_exp_f32 v2, v2 // 0000000000B0: 7E044102
v_rcp_f32 v0, v0 // 0000000000B4: 7E004500
v_mul_f32 v0, v4, v0 // 0000000000B8: 0A000104
v_rcp_f32 v1, v1 // 0000000000BC: 7E024501
v_mul_f32 v1, v5, v1 // 0000000000C0: 0A020305
v_log_f32 v0, v0 // 0000000000C4: 7E004300
v_log_f32 v1, v1 // 0000000000C8: 7E024301
v_mul_f32 v0, 0x429db000, v0 // 0000000000CC: 0A0000FF 429DB000
v_mul_f32 v1, 0x429db000, v1 // 0000000000D4: 0A0202FF 429DB000
v_exp_f32 v0, v0 // 0000000000DC: 7E004100
v_exp_f32 v1, v1

SymbOS : 8-bit OS Awesome Sauce

Possible Directional Routing Hoplite Variant?

Thinking about minimal grid based routing. Two things I don't like about the Hoplite,

(1.) Full chip return paths.
(2.) Route length not proportional to 2D locality.

Like the simplified router and crossbar. Wondering if there is a way to improve the routing by adjusting the fixed directions and removing the full chip return paths. Came up with this idea, but not sure if it is deadlock free yet. Likely this is well researched and has some proper name, but I'm not well learned in this area.
+--->+--->+--->+--->+--->+--->+--->+
^    |    ^    |    ^    |    ^    |
|    V    |    V    |    V    |    V
+<---+<---+<---+<---+<---+<---+<---+
^    |    ^    |    ^    |    ^    |
|    V    |    V    |    V    |    V
+--->+--->+--->+--->+--->+--->+--->+
^    |    ^    |    ^    |    ^    |
|    V    |    V    |    V    |    V
+<---+<---+<---+<---+<---+<---+<---+
^    |    ^    |    ^    |    ^    |
|    V    |    V    |    V    |    V
+--->+--->+--->+--->+--->+--->+--->+
^    |    ^    |    ^    |    ^    |
|    V    |    V    |    V    |    V
+<---+<---+<---+<---+<---+<---+<---+

Hoplite is only right and down, with a full chip return to loop around. This is right-only on even rows, left-only on odd rows, then up-only on even columns, down-only on odd columns. Requires grid of cores to be a multiple of 2 in each dimension. Enables a message to quickly turn around.

Haven't fully thought through the routing logic, but the general idea is that if a packet needs to move in a direction not supported on a row or column, it gets routed on the other axis so it can switch direction.

Atomic Scatter-Only Gather-Free Machines

GPUs are built around texture caches, and caches are mostly built around servicing loads, because loads typically dominate memory traffic. So after stripping the caches out of a highly parallel machine, perhaps gathering data to a centralized location for processing, and then scattering it out again, is no longer the best model?

One possible alternative would be to switch to a scatter-centric design. On-chip memory gets divided across all the cores. Each core contains the required remote procedure (RP) functions to interact with the data associated with the core. Programs are composed of fire-and-forget message passing. A message contains the arguments to the RP and the index/address of the RP to execute. The model is return-free, and the RP only has access to the arguments in the message and the local memory of the core.

This is conceptually similar to taking the GPU's global atomic without return, and making it fully programmable.

This brings up a new challenge: in order to fully load the machine, data needs to be evenly distributed across the cores based on the amount of RP access. Conceptually in this model, each core is a bank of distributed memory, and a mid-range FPGA might have upwards of 1024 banks (each one a BRAM). Need to ensure an algorithm doesn't camp on one bank of memory.

Likewise if any data is duplicated across cores for the sake of higher throughput, one might want to build something into the routing logic which takes the first found compatible core that can service the RP. Also message broadcast with variable 2D locality would be very important for data amplification.

Instruction Fetch Optimization

Been reading the Artix-7 FPGAs Data Sheet: DC and AC Switching Characteristics to better understand requirements for higher clock FPGA design. Seems as if the only point in using the BRAM output register is if the output would have otherwise just gone to a CLB's output register. Registering at the BRAM instead would remove a large delay. Also looks like if doing a load from CLBRAM, there isn't enough time for a level of LUT between the fetch and setting DSP inputs without adding another pipeline stage, but it may be possible to have the CLBRAM address on clock N, then route the CLBRAM output to the DSP input registers on clock N+1. This doesn't appear to be possible with BRAM (almost double the clock-to-output delay); in that case might as well put in the LUT because it will require another pipeline stage anyway (compared to CLBRAM with no LUT). Still learning...

Possible Instruction Fetch Timing Optimization?
On the critical path for instruction fetch, seems like there might only be time enough for one level of LUT to generate the BRAM address for the next instruction, sourcing from the fetched instruction and registers from the prior clock. A possible optimized setup might be as described below (using ' to mark the value for the next clock). For a 10-bit program counter (enough for one BRAM), this takes 20 LUTs for the address generation. Then if the program counter only advances in the lower 8 bits, the two adders take 16 LUTs total. Plus some extra overhead for feedback.
// some non-code below (enough to describe the idea)
// generate address for next fetch
// correct for not always having the correct adrNext
// absolute call/branch must be even address, so "adr&1" is the correct next address
// both, imm(inst) and decode(inst), are direct routed (no LUT)
adr'={
imm(inst), // immediate absolute call/branch address always aligned to the first of 2 words
adrRet, // return address
retNext, // if prior was a return, this is the return+1
adrNext, // if continuing this instruction, the incremented address from prior
adr&1, // if prior was an immediate, inline next value
}[choose(decode(inst),feedback)]; // 2 LUTs/bit (13:1 function)

// feed back mux choice for next clock corrections
feedback'=feedback(inst);

// instruction fetch
inst'=bram[adr'];

// not based on adr' because that would be an extra LUT level
adrNext'=adr+1;
retNext'=adrRet+1;

Notes from Attempting to Understand FPGA Timing Limits

I'm using the table I built below as a quick reference to think though timing while working on design. Reference from last time, Artix-7 FPGAs Data Sheet: DC and AC Switching Characteristics. Working from the Speed Grade -3 (fastest) numbers in ns below.

==========
TIMING
==========
Simplified drawing below
Where phase of stages would in practice be different
Where regions are not to scale
_______________                 _______________
               \_______________/               \_______________/               \
 _______________________________
(____________STAGE_0____________)_______________________________
               .                (____________STAGE_1____________)
               .                               .
               .                               .
    |<- 0.45 ->|<- 0.31 ->|         |<- 0.64 ->|
        setup      hold                 delay

   BRAM address register            BRAM registered read stable

My mental model is that the 'delay' eats against the 'setup' for the next stage
Each stage ends in a register (D flip-flop)
The cumulative post register, combinatorial, and net 'delay' from any logic on the wire
must not violate the 'setup' time for the stage's register
The 'hold' affects the limits of the phase of the stage (offset from positive clock edge)

Timing limits from my (possibly flawed) understanding of the doc,

term meaning
==== =======
BRAM block ram
CRAM CLB distributed ram

number meaning
======= =======
0.07 ns CLB 6:1 LUT setup time
0.59 ns CLB complex setup time (larger than 6:1, carry)
0.12 ns CLB 6:1 LUT hold time
0.08 ns CLB complex hold time
------- -------
0.47 ns CLB 5:1x2 LUT registered delay (2 reg/LUT)
0.40 ns CLB registered delay otherwise (1 reg/LUT)
------- -------
0.10 ns CLB 6:1 LUT combinatorial delay (no register)
0.27 ns CLB 5:1x2 LUT combinatorial delay
0.68 ns CLB maximum combinatorial delay for complex (larger than 6:1, carry)
------- -------
0.26 ns DSP a setup time
0.33 ns DSP b setup time
0.17 ns DSP c setup time
0.12 ns DSP a hold time
0.15 ns DSP b hold time
0.17 ns DSP c hold time
0.33 ns DSP p registered output delay
------- -------
0.45 ns BRAM address setup time
0.31 ns BRAM address hold time
0.64 ns BRAM registered read delay before stable time
------- -------
0.27 ns CRAM address setup time
0.69 ns CRAM address setup time with post-load carry or mux usage
0.55 ns CRAM address hold time
0.18 ns CRAM address hold time with post-load carry or mux usage
0.10 ns CRAM read registered output delay (same as 6:1 LUT combinatorial delay)
0.98 ns CRAM write registered output delay (CRAM 64-entry or smaller, etc)
2.10 ns CRAM minimum clock period for writes (limits to 476 MHz?)

DSP and Rounding Notes

Posting a few more notes while in the background I continue to work towards the next design try.

=====================
DSP FUNCTIONALITY
=====================
Some of the Xilinx 7 series DSP functionality and base pipeline (where p as input is from clk-1),

clk-2 clk-1 clk
===== ======== ===========
c= p{op}=p
c= p{op}=(p>>17)
c= p{op}=c
c= p=c
a=,b= m=a*b p=m
a=,b= m=a*b p+=m
a=,b= m=a*b p-=m
a=,b= m=a*b p=(p>>17)+m
a=,b= m=a*b p=(p>>17)-m
a=,b= m=a*b,c= p=c+m
a=,b= m=a*b,c= p=c-m
a=,b= t=a:b p{op}=t
a=,b= t=a:b p=(p>>17){op}t
a=,b= t=a:b,c= p=c{op}t

For an 18-bit word accumulator based machine, it is possible to fast min/max
Uses the internal DSP result forwarding paths

// signed min and max in 3 cycles
p=min(p,c); -> p-=c; p= (p>>17)&p; p+=c;
p=max(p,c); -> p-=c; p=~(p>>17)&p; p+=c;

Forwarding is going to be useful for going to one thread and mitigating pipelining

// branch free, where x and y are immediates in 2 cycles
p=(p< 0)?x:y; -> p= (p>>17)&(x-y); p+=y;
p=(p>=0)?x:y; -> p=~(p>>17)&(x-y); p+=y;


====================
DSP AND ROUNDING
====================
Didn't see how to efficiently implement "round half to even"
However "round half away from zero" seems easy to implement
Logic for rounding (hopefully I got that right),

if(x>=0) x++; // cin in second example
x+=(1<<(n-1))-1; // c input in second example
x>>=n;

The DSP has CIN (carry in) support
Which can take either the inverted sign of P
(dsp output feed back in the next cycle),
or the inverted sign of a*b
DSP can do the following {multiply, round, shift}
at a throughput of 0.5 clocks,

p=a*b+cin+c; // throughput of one clock
p=c+(p>>17); // gets dedicated forwarding path

Common usage case,

// divide by a constant via multiply by reciprocal,
// or multiply by {0 to 131071} representing {0.0 to nearly 1.0}
a=number; // up to 25-bit signed number on 7 series DSP
b=fraction; // second argument is 18-bits signed
p=a*b+cin+65535;
p=p>>17;

For reference
Full table for 3-bit signed numbers a*b multiply
with round away from zero before shift right by 2,

a b float int binary a*b+cin +round output
------------------------ ------------------------------------------
-4 * -1.00 = 4.00 -> 4 100 * 100 = 010001 -> 0010010 -> 0100 = 4 (overflows)
-4 * -0.75 = 3.00 -> 3 100 * 101 = 001101 -> 0001110 -> 0011 = 3
-4 * -0.50 = 2.00 -> 2 100 * 110 = 001001 -> 0001010 -> 0010 = 2
-4 * -0.25 = 1.00 -> 1 100 * 111 = 000101 -> 0000110 -> 0001 = 1
-4 * 0.00 = -0.00 -> 0 100 * 000 = 000001 -> 0000010 -> 0000 = 0
-4 * 0.25 = -1.00 -> -1 100 * 001 = 111100 -> 1111101 -> 1111 = -1
-4 * 0.50 = -2.00 -> -2 100 * 010 = 111000 -> 1111001 -> 1110 = -2
-4 * 0.75 = -3.00 -> -3 100 * 011 = 110100 -> 1110101 -> 1101 = -3
------------------------ ------------------------------------------
-3 * -1.00 = 3.00 -> 3 101 * 100 = 001101 -> 0001110 -> 0011 = 3
-3 * -0.75 = 2.25 -> 2 101 * 101 = 001010 -> 0001011 -> 0010 = 2
-3 * -0.50 = 1.50 -> 2 101 * 110 = 000111 -> 0001000 -> 0010 = 2
-3 * -0.25 = 0.75 -> 1 101 * 111 = 000100 -> 0000101 -> 0001 = 1
-3 * 0.00 = -0.00 -> 0 101 * 000 = 000001 -> 0000010 -> 0000 = 0
-3 * 0.25 = -0.75 -> -1 101 * 001 = 111101 -> 1111110 -> 1111 = -1
-3 * 0.50 = -1.50 -> -2 101 * 010 = 111010 -> 1111011 -> 1110 = -2
-3 * 0.75 = -2.25 -> -2 101 * 011 = 110111 -> 1111000 -> 1110 = -2
------------------------ ------------------------------------------
-2 * -1.00 = 2.00 -> 2 110 * 100 = 001001 -> 0001010 -> 0010 = 2
-2 * -0.75 = 1.50 -> 2 110 * 101 = 000111 -> 0001000 -> 0010 = 2
-2 * -0.50 = 1.00 -> 1 110 * 110 = 000101 -> 0000110 -> 0001 = 1
-2 * -0.25 = 0.50 -> 1 110 * 111 = 000011 -> 0000100 -> 0001 = 1
-2 * 0.00 = -0.00 -> 0 110 * 000 = 000001 -> 0000010 -> 0000 = 0
-2 * 0.25 = -0.50 -> -1 110 * 001 = 111110 -> 1111111 -> 1111 = -1
-2 * 0.50 = -1.00 -> -1 110 * 010 = 111100 -> 1111101 -> 1111 = -1
-2 * 0.75 = -1.50 -> -2 110 * 011 = 111010 -> 1111011 -> 1110 = -2
------------------------ ------------------------------------------
-1 * -1.00 = 1.00 -> 1 111 * 100 = 000101 -> 0000110 -> 0001 = 1
-1 * -0.75 = 0.75 -> 1 111 * 101 = 000100 -> 0000101 -> 0001 = 1
-1 * -0.50 = 0.50 -> 1 111 * 110 = 000011 -> 0000100 -> 0001 = 1
-1 * -0.25 = 0.25 -> 0 111 * 111 = 000010 -> 0000011 -> 0000 = 0
-1 * 0.00 = -0.00 -> 0 111 * 000 = 000001 -> 0000010 -> 0000 = 0
-1 * 0.25 = -0.25 -> 0 111 * 001 = 111111 -> 0000000 -> 0000 = 0
-1 * 0.50 = -0.50 -> -1 111 * 010 = 111110 -> 1111111 -> 1111 = -1
-1 * 0.75 = -0.75 -> -1 111 * 011 = 111101 -> 1111110 -> 1111 = -1
------------------------ ------------------------------------------
0 * -1.00 = -0.00 -> 0 000 * 100 = 000001 -> 0000010 -> 0000 = 0
0 * -0.75 = -0.00 -> 0 000 * 101 = 000001 -> 0000010 -> 0000 = 0
0 * -0.50 = -0.00 -> 0 000 * 110 = 000001 -> 0000010 -> 0000 = 0
0 * -0.25 = -0.00 -> 0 000 * 111 = 000001 -> 0000010 -> 0000 = 0
0 * 0.00 = 0.00 -> 0 000 * 000 = 000001 -> 0000010 -> 0000 = 0
0 * 0.25 = 0.00 -> 0 000 * 001 = 000001 -> 0000010 -> 0000 = 0
0 * 0.50 = 0.00 -> 0 000 * 010 = 000001 -> 0000010 -> 0000 = 0
0 * 0.75 = 0.00 -> 0 000 * 011 = 000001 -> 0000010 -> 0000 = 0
------------------------ ------------------------------------------
1 * -1.00 = -1.00 -> -1 001 * 100 = 111100 -> 1111101 -> 1111 = -1
1 * -0.75 = -0.75 -> -1 001 * 101 = 111101 -> 1111110 -> 1111 = -1
1 * -0.50 = -0.50 -> -1 001 * 110 = 111110 -> 1111111 -> 1111 = -1
1 * -0.25 = -0.25 -> 0 001 * 111 = 111111 -> 0000000 -> 0000 = 0
1 * 0.00 = 0.00 -> 0 001 * 000 = 000001 -> 0000010 -> 0000 = 0
1 * 0.25 = 0.25 -> 0 001 * 001 = 000010 -> 0000011 -> 0000 = 0
1 * 0.50 = 0.50 -> 1 001 * 010 = 000011 -> 0000100 -> 0001 = 1
1 * 0.75 = 0.75 -> 1 001 * 011 = 000100 -> 0000101 -> 0001 = 1
------------------------ ------------------------------------------
2 * -1.00 = -2.00 -> -2 010 * 100 = 111000 -> 1111001 -> 1110 = -2
2 * -0.75 = -1.50 -> -2 010 * 101 = 111010 -> 1111011 -> 1110 = -2
2 * -0.50 = -1.00 -> -1 010 * 110 = 111100 -> 1111101 -> 1111 = -1
2 * -0.25 = -0.50 -> -1 010 * 111 = 111110 -> 1111111 -> 1111 = -1
2 * 0.00 = 0.00 -> 0 010 * 000 = 000001 -> 0000010 -> 0000 = 0
2 * 0.25 = 0.50 -> 1 010 * 001 = 000011 -> 0000100 -> 0001 = 1
2 * 0.50 = 1.00 -> 1 010 * 010 = 000101 -> 0000110 -> 0001 = 1
2 * 0.75 = 1.50 -> 2 010 * 011 = 000111 -> 0001000 -> 0010 = 2
------------------------ ------------------------------------------
3 * -1.00 = -3.00 -> -3 011 * 100 = 110100 -> 1110101 -> 1101 = -3
3 * -0.75 = -2.25 -> -2 011 * 101 = 110111 -> 1111000 -> 1110 = -2
3 * -0.50 = -1.50 -> -2 011 * 110 = 111010 -> 1111011 -> 1110 = -2
3 * -0.25 = -0.75 -> -1 011 * 111 = 111101 -> 1111110 -> 1111 = -1
3 * 0.00 = 0.00 -> 0 011 * 000 = 000001 -> 0000010 -> 0000 = 0
3 * 0.25 = 0.75 -> 1 011 * 001 = 000100 -> 0000101 -> 0001 = 1
3 * 0.50 = 1.50 -> 2 011 * 010 = 000111 -> 0001000 -> 0010 = 2
3 * 0.75 = 2.25 -> 2 011 * 011 = 001010 -> 0001011 -> 0010 = 2

Variation on Branching Design - Return Only

Thoughts related to Instruction Fetch Optimization, a post which talked about only auto-incrementing the lower 8-bits of the program counter, having even-only branch addresses to remove an ADDer delay...

Return Only ISA
The point here is to be able to know the next program counter, even with a branch, a clock cycle ahead to remove instruction fetch latency from a larger RAM. This works with absolute branching (no relative branching). The minimal ISA has the following,

(1.) Push immediate absolute address on return stack.
(2.) Push computed absolute address from register on return stack.
(3.) 1-bit flag in ISA (on all instructions) to return after 1 cycle delay.

So the return bit causes the top of the return stack to be registered for fetch. Assume an optimization here where a push immediate with a return gets transformed into just setting that registered value. Bunch of usage cases,
// RETURN
opcode, ret; // ... register for future return
opcode; // ........ branch delay slot
// ................ actual return

// COMPUTED JUMP
push reg, ret; // ... push jump address from register and register for future return
opcode; // .......... branch delay slot
// .................. actual jump

// JUMP
push imm, ret; // ... push jump address and register for future return
opcode; // .......... branch delay slot
// .................. actual jump

// CAN PUSH ADDRESS EARLIER
push imm; // ...... push jump address to later branch to
opcode; // ........ some code
opcode; // ........ some code
opcode, ret; // ... register for future return
opcode; // ........ branch delay slot
// ................ actual jump

// SERIES OF JUMPS
push imm; // ........ push addresses for series of branches in reverse order
push imm; // ........ push addresses for series of branches in reverse order
push imm, ret; // ... push first jump address and register for future return
opcode; // .......... branch delay slot
// .................. start first jump, later rets continue to other jumps

// SERIES OF JUMPS VERSION 2
push imm; // ........ push 4th jump address
push imm; // ........ push 3rd jump address
push imm, ret; // ... push 1st jump address and register for future return
push imm; // ........ push 2nd jump in branch delay slot
// .................. start first jump, later rets continue to other jumps

// CALL
push imm, ret; // ... push call address and register for future return
push imm; // ........ push return address in delay slot
// .................. actual call here

// SERIES OF CALLS
push imm; // ........ push 3rd call address
push imm; // ........ push 2nd call address
push imm, ret; // ... push 1st call address and flag for future return
push imm; // ........ push final call return address in delay slot
// .................. start first call, later rets continue without returning back in between
No need for call/jump right after logical return (which would have a delay slot), because that can always be factored to a "series" case.

Another option would be a co-return bit in addition to the return bit. The co-return would swap the top of the return stack with the program counter + 1. This would remove the need to push the final return address for a single call, but won't help with a series of calls (cannot use the co-return, as need reverse order).

Simplified Vulkan Rapid Prototyping

Nothing simple about using Vulkan, so this title is a little misleading ...
Trying something new for my next Vulkan based at-home prototyping effort, building from scratch for 64-bit machines only. Building a simplified version of my prior rapid prototyping system. This version, on code change, actually re-compiles and restarts the program instead of reloading a DLL. My theory is that restart time is going to be lower than the time it takes to recompile shaders. I'm not concerned with re-filling the GPU with baked data because I don't ever use much, and never have much non-runtime-regeneratable state either. The program is required, somewhat like a "save snapshot" game emulator, to be able to instantly restart to where it was running before (at the time of the last snapshot). This has some interesting advantages, like error handling becomes trivial: just exit the program and restart! For correct handling of things like VK_ERROR_DEVICE_LOST or VK_ERROR_SURFACE_LOST_KHR, just exit. No need to have two binaries (one for development, one for release), as I never use debug.

Details
I've got only one source file, with #defines to enable keeping both GLSL and C code in the same file. Also I've got no includes, to optimize for compile time. Notice on Windows, "vulkan.h" ultimately includes "windows.h", for example to get HWND and HINSTANCE types, so sans rolling your own version of the headers, the compile dips into the massive platform include tree. Re-rolling only what I need from the Vulkan headers is quite frankly a nightmare of work due to Vulkan verbosity, but should be mostly over soon. I've also in the process made an un-type-safe (yeah) version of the Vulkan API, returning to base system types, so I never have to bother with silly compile warnings. All handles are just 64-bit pointers, etc. It works great. I got beyond type-safety bugs from birth, being brought up on assembly first. The bugs I have now are more like, "the last time I worked on this was a month ago, and I forgot to call vkGetDeviceQueue(), but already wrote code out-of-order using the queue handle". As any programmer, out of habit, I first blamed the driver, and ultimately realized that I was the idiot instead.

Part of the motivation for this design is out of laziness. Since Vulkan requires SPIR-V input, and I work in GLSL, I need to call "glslangValidator.exe" to convert my GLSL into SPIR-V, and I sure didn't feel like writing a complex system to be spawning processes from inside my app. So I have a shell script per platform which does, {compile shaders, convert SPIR-V binaries to headers which are included in the program, recompile the program, launch program, then repeat}.

Engine design is trivial as well, just setting up baked command buffers and then replaying them until exit. Everything compute based, and dispatch indirect based to manage variability. No graphics makes using Vulkan quite easy relatively speaking, no graphics state, no render passes, trivial transitions.

I'm debating on if to eventually release basic source for this project or not. On one hand it is a good example of Windows/Linux Vulkan app from scratch. On the other hand, my code is very much in shorthand which looks alien to other humans (likely the inverse of how C++ looks totally alien to me). For example, the following (which might get wrapped poorly by the browser) is my implementation of everything I need for printf style debugging writing to terminal.


///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//=============================================================================================================================
//
// [KON] CONSOLE MESSAGE SYSTEM
//
//-----------------------------------------------------------------------------------------------------------------------------
// A background message thread which handles printing.
// This works around the problem of slow console print on Windows.
// This also allows single point to override to stream to file, etc.
// Multiple threads can send messages simultaneously.
// It would be faster to queue messages per thread, but this isn't about speed, but rather mostly debug.
// Merging per message gives proper idea of sequence of events across threads.
// This has spin waits in case of overflow panic, so set limits so overflow panic never happens.
//=============================================================================================================================
// Defaults, must be a power of two.
// Number of characters in ring.
#ifndef KON_BUF_MAX
#define KON_BUF_MAX 32768
#endif
// Number of messages in ring.
#ifndef KON_SIZE_MAX
#define KON_SIZE_MAX 1024
#endif
// Maximum message size for macro message generation.
#ifndef KON_CHR_MAX
#define KON_CHR_MAX 1024
#endif
//-----------------------------------------------------------------------------------------------------------------------------
typedef struct {
A_(64) U1 buf[KON_BUF_MAX*2]; // Buffer for messages, double size for overflow.
A_(64) U4 size[KON_SIZE_MAX]; // Size of messages.
A_(64) U8 atomReserved[1]; // Amount reserved: packed {MSB 32-bit buffer bytes, LSB 32-bit size count}.
U8 atom[1]; // Next: packed {MSB 32-bit buffer offset, LSB 32-bit size offset}.
C2 write; // Function to write to console (adr,size).
U4 frame; // Updated +1 every time the writer goes to sleep (used for drain).
} KonT;
S_ KonT kon_[1];
#define konR TR_(KonT,kon_)
#define konV TV_(KonT,kon_)
//-----------------------------------------------------------------------------------------------------------------------------
// Begin KON_CHR_MAX macro message.
#define K_ { U1 konMsg[KON_CHR_MAX]; U1R konPtr=U1R_(konMsg)
// Ends.
#define KON_MSG KonWrite(konMsg,U4_(U8_(konPtr)-U8_(konMsg)))
#define KE_ KON_MSG; }
#define KW_ KON_MSG; KonWake(); }
#define KD_ KON_MSG; KonDrain(); }
//-----------------------------------------------------------------------------------------------------------------------------
#define KN_ konPtr[0]='\n'; konPtr++
// Ends with newline.
#define KNE_ KN_; KE_
#define KNW_ KN_; KW_
#define KND_ KN_; KD_
//-----------------------------------------------------------------------------------------------------------------------------
// Append numbers.
#define KH_(a) konPtr=Hex(konPtr,a)
#define KU1_(a) konPtr=HexU1(konPtr,a)
#define KU2_(a) konPtr=HexU2(konPtr,a)
#define KU4_(a) konPtr=HexU4(konPtr,a)
#define KU8_(a) konPtr=HexU8(konPtr,a)
#define KS1_(a) konPtr=HexS1(konPtr,a)
#define KS2_(a) konPtr=HexS2(konPtr,a)
#define KS4_(a) konPtr=HexS4(konPtr,a)
#define KS8_(a) konPtr=HexS8(konPtr,a)
//-----------------------------------------------------------------------------------------------------------------------------
// Append decimal.
#define KDec1_(a) konPtr=Dec1(konPtr,a)
#define KDec2_(a) konPtr=Dec2(konPtr,a)
#define KDec3_(a) konPtr=Dec3(konPtr,a)
//-----------------------------------------------------------------------------------------------------------------------------
// Append raw data.
#define KR_(a,b) do { U4 konSiz=U4_(b); CopyU1(konPtr,U1R_(a),konSiz); konPtr+=konSiz; } while(0)
// Append character.
#define KC_(a) konPtr[0]=U1_(a); konPtr++
// Append zero terminated compile time immediate C-string.
#define KZ_(a) CopyU1(konPtr,Z_(a)-1); konPtr+=sizeof(a)-1
// Append non-compile time immediate C-string.
#define KZZ_(a) KR_(a,ZeroLen(U1R_(a)))
//-----------------------------------------------------------------------------------------------------------------------------
// Quick message for debug.
#define KQ_(a) K_; KZ_(a); KD_
//-----------------------------------------------------------------------------------------------------------------------------
// Quick fixed-point decimal (value in thousandths).
#define KDec2Dot3_(a) KDec2_(a/1000); KC_('.'); KDec3_(a%1000)
#define KDec3Dot3_(a) KDec3_(a/1000); KC_('.'); KDec3_(a%1000)
//=============================================================================================================================
S_ void KonWake(void) { SigSet(SIG_KON); }
//-----------------------------------------------------------------------------------------------------------------------------
// Unpack components from atom.
I_ U4 KonSize(U8 atom) { return U4_(atom); }
I_ U4 KonBuf(U8 atom) { return U4_(atom>>U8_(32)); }
//-----------------------------------------------------------------------------------------------------------------------------
// Unpack components from atom and mask.
I_ U4 KonMaskSize(U8 atom) { return KonSize(atom)&(KON_SIZE_MAX-1); }
I_ U4 KonMaskBuf(U8 atom) { return KonBuf(atom)&(KON_BUF_MAX-1); }
//-----------------------------------------------------------------------------------------------------------------------------
// Reserve space to write message.
I_ U8 KonReserve(U4 bytes) { return AtomAddU8(konV->atomReserved,(U8_(bytes)<<32)+1); }
//-----------------------------------------------------------------------------------------------------------------------------
// Release space reservation.
S_ void KonRelease(U4 bytes,U4 msgs) { AtomAddU8(konV->atomReserved,(-(U8_(bytes)<<32))+(-U8_(msgs))); }
//-----------------------------------------------------------------------------------------------------------------------------
// Check if reservation under limits.
S_ U4 KonOk(U8 atom) { return (KonSize(atom)<KON_SIZE_MAX)&&(KonBuf(atom)<KON_BUF_MAX); }
//-----------------------------------------------------------------------------------------------------------------------------
// Get atom for next message.
I_ U8 KonNext(U4 bytes) { return AtomAddU8(konV->atom,(U8_(bytes)<<32)+1); }
//-----------------------------------------------------------------------------------------------------------------------------
// Copy in message.
S_ void KonCopy(U8 atom,U1R adr,U4 bytes) { CopyU1(konR->buf+KonMaskBuf(atom),adr,bytes);
AtomSwapU4(konV->size+KonMaskSize(atom),bytes); }
//-----------------------------------------------------------------------------------------------------------------------------
// Used for debug busy wait until message is displayed.
S_ void KonDrain(void) { U4 f=konV->frame; while(f==konV->frame) { SigSet(SIG_KON); ThrYield(); } }
//-----------------------------------------------------------------------------------------------------------------------------
// Write message to console.
S_ void KonWrite(U1R adr,U4 bytes) { while(1) {
if(KonOk(KonReserve(bytes))) { KonCopy(KonNext(bytes),adr,bytes); return; }
KonRelease(bytes,1); KonWake(); ThrYield(); } }
//=============================================================================================================================
// Background thread which sends messages to the actual console.
S_ U8 KonThread(U8 unused) { U4 bufOffset=0; U4 sizeOffset=0;
while(1) { U4 bytes=0; U4 msgs=0; SigWait(SIG_KON,1000); SigReset(SIG_KON);
while(1) { U4 size=konV->size[sizeOffset]; bytes+=size;
// If not zero need to force clear before adjusting free counts, to mark as unused entry.
if(size) konV->size[sizeOffset]=0;
// Force write if would wrap, or found zero size message.
if(((bufOffset+bytes)>=KON_BUF_MAX)||(size==0)) {
konV->write(U8_(konR->buf+bufOffset),bytes);
bufOffset=(bufOffset+bytes)&(KON_BUF_MAX-1);
KonRelease(bytes,msgs); bytes=0; msgs=0;
// If hit zero size break (zero size means rest of messages are empty).
if(size==0) break; }
msgs++; sizeOffset=(sizeOffset+1)&(KON_SIZE_MAX-1); }
// Only advance frame after draining, so KonDrain observes the flush.
BarC(); konV->frame++; } }
//-----------------------------------------------------------------------------------------------------------------------------
S_ void KonInit(void) { konR->write=C2_(ConWrite); ThrOpen(KonThread,THR_KON); }
