It's been a while since I've had any time to work on this project. Renamed, as I've made some major design changes...
Axed
Dropping the source-less programming part. Source-less worked great for x86 programming, but there were a few things I didn't like about it. Little issues, like needing two words for each common forth construct ("CALL WORD" instead of just "WORD"). Larger issues, like requiring a separate editor to generate complex data or non-x86 machine code.
From the Ashes
Pulled the old work into a new bootloader, this time rapid-prototyping in just NASM. Eventually I'll have to rewrite the bootloader in whatever source/language I end up creating. Switched back to 64-bit long mode (see why below), running with 1GB pages since long mode forces page tables.
Re-Thinking The Forth Dictionary
Time off from this project brought forth some new ideas addressing the original motivations for going source-less: (a.) I don't like linear searches through large dictionaries at interpreter time (the simple forth implementation), (b.) I don't like how hashing (the complex forth implementation) burns a bunch of ALU time, (c.) I really don't like how dynamic hashing can leave 75% of the cache unused, and (d.) I still want full 32-bit values in source without needing multiple source tokens.
Solving (d.) is easy: just switch to 64-bit source tokens. Source is now 2x the size, but 32-bit immediates become trivial, and words can now use 10 characters instead of 5. Source is a prefetchable linear read, so the "more memory for more simplicity" trade is worth it to me.
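For concreteness, here is one plausible layout for such a token; the 4-bit tag and the exact field split are my assumptions rather than anything settled above:

#include <stdint.h>

typedef uint64_t token;

#define TOKEN_TAG(t)  ((t) & 0xF)             /* 4-bit token kind      */
#define TOKEN_NAME(t) ((t) >> 4)              /* 60-bit encoded string */
#define TOKEN_IMM(t)  ((uint32_t)((t) >> 4))  /* 32-bit immediate      */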
Now, Can Hashing Be Made Better?
Solving (b.): how about factoring everything in the hashing algorithm except the final "AND" into the string encoding itself? The tokens encode strings as 60-bit integers; I just need to switch from a generic 6 bits per character to an encoding in which each character affects all bits.
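A minimal sketch of what that buys at lookup time, assuming power-of-two table sizes: the encoded name itself serves as the hash, so indexing collapses to a single AND,

#include <stdint.h>

static inline uint32_t dict_index(uint64_t name60, uint64_t table_size)
{
    /* table_size must be a power of two; the encoded string is
       already well-mixed, so no further hashing is needed. */
    return (uint32_t)(name60 & (table_size - 1));
}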
One thing I tried is a reversible cellular automaton (CA) like Wolfram's Rule 30R. Strings in tokens are kept in memory in the form produced by N steps of the CA; to display a string, the editor simply runs the CA in reverse.
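A hedged sketch of one Rule 30R step on a 64-bit word with wraparound neighbors (restricting to the 60-bit payload, and the choice of N, are details I'm glossing over). Rule 30R is the second-order reversible form of Rule 30: the next state is Rule 30 of the current state XORed with the previous state, so the same recurrence run backward recovers the original characters.

#include <stdint.h>

static inline uint64_t rol64(uint64_t x) { return (x << 1) | (x >> 63); }
static inline uint64_t ror64(uint64_t x) { return (x >> 1) | (x << 63); }

/* Elementary Rule 30 on a circular 64-cell row:
   next cell = left XOR (center OR right). */
static inline uint64_t rule30(uint64_t x)
{
    return rol64(x) ^ (x | ror64(x));
}

/* One forward step of the reversible variant:
   next = rule30(cur) XOR prev. */
static void rule30r_step(uint64_t *prev, uint64_t *cur)
{
    uint64_t next = rule30(*cur) ^ *prev;
    *prev = *cur;
    *cur  = next;
}

/* Reversal is the same recurrence run backward:
   prev = rule30(cur) XOR next. */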
To test the theory I first grabbed an English dictionary, hashed it into {256, 512, 1K, 2K, ..., 512K, 1M} entry tables, and tracked the {min, max, average} table bin counts. Comparing the reversible CA against just the 64-bit finalizer from MurmurHash3 (not reversible, but a good baseline): both have similar results.
I also tested against a dictionary of all possible {1,2,3}-character strings (including symbols and numbers). Results are again similar.
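The test harness is simple; roughly the following, with encode() standing in for whichever candidate encoding is being measured:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

extern uint64_t encode(const char *word);  /* encoding under test */

static void bin_stats(const char **words, size_t nwords, uint32_t table_size)
{
    uint32_t *bins = calloc(table_size, sizeof *bins);
    uint32_t min = UINT32_MAX, max = 0;

    /* Bucket every word with an AND (power-of-two table size). */
    for (size_t i = 0; i < nwords; i++)
        bins[encode(words[i]) & (table_size - 1)]++;

    for (uint32_t b = 0; b < table_size; b++) {
        if (bins[b] < min) min = bins[b];
        if (bins[b] > max) max = bins[b];
    }
    printf("%u bins: min %u, max %u, avg %.3f\n",
           table_size, min, max, (double)nwords / table_size);
    free(bins);
}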
Second, I tried a recursive interval encoding: each character is encoded within the interval of its parent character. The first character starts with the full 0 to 2^60-1 interval. Character 63 is a special terminal character (it signals that no more characters are encoded). Character 63 gets a smaller interval, allowing the other characters to get extended intervals designed to spread out the value when ANDed with a power-of-two mask. This works similarly in theory to arithmetic encoding, except with constant probabilities. The multipliers used to encode each character (from the 1st character to the last) are listed below, with a sketch of the encoder after the list,
64*64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63*63,
64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63,
64*64*64*64*64*64*64 +63*63*63*63*63*63,
64*64*64*64*64*64 +63*63*63*63*63,
64*64*64*64*64 +63*63*63*63,
64*64*64*64 +63*63*63,
64*64*64 +63*63,
64*64 +63,
64 +1,
1,
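And the encoder sketch, under my reading of the scheme (the exact terminal-character handling is my interpretation): characters 0..62 are scaled by their position's multiplier, and a string shorter than 10 characters spends the terminal value 63 once at the first unused position,

#include <stdint.h>

static const uint64_t MUL[10] = {
    64ull*64*64*64*64*64*64*64*64 + 63ull*63*63*63*63*63*63*63,
    64ull*64*64*64*64*64*64*64    + 63ull*63*63*63*63*63*63,
    64ull*64*64*64*64*64*64       + 63ull*63*63*63*63*63,
    64ull*64*64*64*64*64          + 63ull*63*63*63*63,
    64ull*64*64*64*64             + 63ull*63*63*63,
    64ull*64*64*64                + 63ull*63*63,
    64ull*64*64                   + 63ull*63,
    64ull*64                      + 63,
    64                            + 1,
    1,
};

/* chars[] holds values 0..62; len is at most 10. */
static uint64_t encode_interval(const uint8_t *chars, int len)
{
    uint64_t v = 0;
    for (int i = 0; i < len; i++)
        v += chars[i] * MUL[i];
    if (len < 10)
        v += 63 * MUL[len];  /* terminal character */
    return v;                /* fits in 60 bits    */
}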
This evenly distributes each character's effect across all bits. The synthetic all-{1,2,3}-length-strings case performs better with this encoding; the English dictionary does not see much improvement.
A fourth option would be to just use arithmetic encoding, or predictive arithmetic encoding, on the string. I suspect this would do for the English dictionary what the fixed interval encoder above did for the all-possible-strings case. However, I didn't try it: I don't have any great way to get symbol probabilities for source that doesn't exist yet, and the current results are probably good enough.
Hash Cache Utilization
Solving (c.), hash cache utilization: how about a software hash cache? Source tends to have a small working set of words, even if the dictionary gets quite large. Now that hashing to different table sizes is free via AND, I can split the hash table into two parts: a 256-entry (or larger) table which will effectively always be full and resident in hardware cache, and a much larger table for software misses, which still wastes 75% of each hardware cacheline. The software hash cache contains the following per entry,
{4-byte data, 4-byte larger table address, 8-byte string}
The larger table is inclusive. Eviction from the small table uses the stored larger-table address, avoiding a search (or probe) for the right entry.
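Roughly, a lookup under this scheme might look like the following; the names and promotion policy are my own, and collision handling in the larger table is elided:

#include <stdint.h>

typedef struct {
    uint32_t data;       /* 4-byte data                          */
    uint32_t large_idx;  /* 4-byte address into the larger table */
    uint64_t name;       /* 8-byte encoded string                */
} entry;                 /* 16 bytes: 4 entries per 64-byte line */

static entry    small_tab[256];  /* always-hot software cache    */
static entry   *large_tab;       /* power-of-two size, inclusive */
static uint64_t large_mask;

static uint32_t lookup(uint64_t name)
{
    entry *s = &small_tab[name & 0xFF];
    if (s->name == name)             /* hot path: one cacheline   */
        return s->data;

    uint32_t li = (uint32_t)(name & large_mask);
    entry *l = &large_tab[li];       /* software miss             */
    /* ...probing on collision elided... */

    /* Promote into the small table.  The larger table is
       inclusive, and each entry remembers its larger-table
       address, so evicting the old entry is a direct write-back
       rather than a search. */
    large_tab[s->large_idx] = *s;
    *s = *l;
    s->large_idx = li;
    return s->data;
}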
Moving On
This should have the right combination of being simple enough and fast enough to avoid triggering the "something is wrong" re-design impulse again.
The high-level view I have of the OS is effectively an ultra-high-powered micro-controller that boots into a forth editor, like a C64 boots into BASIC...