## Memory Efficient Hash Tables

and Pseudorandom Ordering

## What makes a hash table good?

A hash table is a data struture commonly used to associate keys with values to implement dictionaries or to test if a key is part of a set of keys to implement set operations. The best hash table is the one that enables these operations at the lowest cost. Cost in computing essentially is two-dimensionsional, time and space. Keys or key-value pairs are stored at a pseudorandom position determined first by the hash of the key and then either storing all the entries that map to the same position in a secondary data structure (this method is called chaining) or by a probing strategy that finds another free position. The first step creats a cache miss. A cache miss is when data, that cannot be found in any of the caches of the CPU, is requested. The time this takes is around 300 CPU cycles and typically determines the performance of a hash table. It cannot be avoided due to the pseudorandom nature of hashing. The second step however occurs more frequently the fuller the table becomes, it is a function of the load factor α. The load factor is defined as the number of elements stored in the table divided by the total number of positions available in the table. Therefore, for a given hash function and and collision resolution scheme, the larger table is also faster because it has to resolve the less collisions, and therefore less cache misses.

### The most memory efficient datastructure for associations

The hash table with the best memory efficiency is simply the one with the highest load factor, (it can even exceed 100% memory efficiency by using key compression with compact hashing ). A hash table like that does still provide O(1) lookups, just very slow.

### The best trade-off between memory usage and speed

I argue that the best hash table is not the fastest or the most memory efficient, but the one with the best trade-off between space and time. Asymptotically the best trade-off is the one that has the lowest product of the the two. Imagine building a big server that does nothing but searching entries in a hash table. Say the hash table is very large so that the offset of the price for the CPU and the other hardware will be dwarfed by the cost of the RAM. The server will have a finite life-time, therefore the cost of a single lookup into the hash table will be directly proportional to the product of the time it takes to access an element and the amount of space the hash table needs per element. Let's call this concept the space-time.

If the space-time seems too abstract let's take a common task we all probably do from time to time: Finding hash collisions by brute force. By first computing as many hashes as fit in memory and then - using a hash table - comparing to all of them at the same time with a single lookup for each newly computed hash. Say we can neglect the time to compute a hash, then the performance can be expressed as the number of comparisons that can be done per unit of time. That number is, not incidentally, proportional to the reciprocal of the space-time.

## The `patchmap`

The `patchmap` is my attempt at achieving this magical smallest
space-time. Recently I was frustrated with the performance of
`std::unordered_map` and before I could even test other hash tables
I had come up with an idea of how to save space that I needed to try.
Let's first look at how it compares, then at the algorithm, more about the
motivation later.

### Performance compared to other hash tables

I designed a small benchmark to be as fair and comprehensive as possible. Only random keys can produce comparable memory access patterns across different hash table implementations using different hash functions and probing strategies, keep in mind however that this masks bad hash functions. For the following benchmark 2²⁶ random 64 bit keys were associated with arbitrary 64 bit values. The average memory consumption and time for insertion, lookup and deletion per key-value pair can be found in the plots below for the different hash tables.

Figure 1: Average performance for successful and unsuccessful lookups, insertion and deletion of key-value pairs for selected hash table implementations stated in terms of time and memory consumption.

patchmap: |
khash: × | |

bytell: + | google::sparse_hash_map: ○ | |

google::dense_hash_map: ⬟ | ska::flat_hash_map: △ | |

std::unordered_map: ◇ | sparsepp: ◻ | |

Judy array: ◆ | F14ValueMap: ▲ | |

chaining+sorting: |
robin_hood::unordered_map: ▽ | |

absl::flat_hash_map: ⬠ |
tsl::sparse_hash_map: ★ | |

emilib2::HashMap: ▩ |

The best hash-table implementations define a lower bound of the best achievable space-time trade-off. I chose to fit c • ( 1+(α-1)¯¹ ) as this is the asymptotic complexity of double hashing, which is one of the best probing strategies in terms of asymptotic complexity and it seemed to fit well. Given this curve of achievable trade-offs we want to choose the one where the product of space and time requirement is the lowest. The solid black line indicates the line the hash table implementations need to be below on the plot of memory and speed if they want to beat the apparently achievable best space-time trade-off. The hash tables that have an average memory efficiency of around 2-√2 = 0.585786... will achieve the lowest space-time given this trade-off curve. Coincidentially, but not surprisingly modern hash tables tend to cluster around this memory efficiency.

Figure 2: Average cost for insertion and lookup in the `patchmap` as it is being filled up with 2²⁸ pseudo-random keys. Insertion (in red) shows a strongly quadratic behavior and the function 1+(α/(1-α)²+α)/4 is indicated with a solid black line. Lookup (in blue) is almost linear in α and shows a steep increase only for load factors close to 1. The theoretically derived complexity bound for lookup 1+log₂(log₂((2-α)/(2-2α))+1) is indicated with dashed black line.

### Conclusion

As can be seen from the plots there are some hash tables with significantly suboptimal performance characteristics, but the best space-time trade-off is heavily contested and depends on the use case.
`F14ValueMap`, `bytell`, `patchmap`, `tsl::sparse_map`, `robin_hood::unordered_map` and `absl::flat_map` all exhibit similar performance characteristics.
At that level of difference compiler versions or memory fragmentation can already make the difference. For lookups `bytell` achieves the lowest space-time, closely followed by the `patchmap`, then `absl::flat_map`, `F14ValueMap` and `robin_hood::unordered_map`.
Depending on your constraints you might want to use `ska::flat_hash_map` for speed, it actually is the fastest hash table for lookups, just like Malte Skarupke claims, while `google::dense_hash_table` is a very close second.
Insertions and deletions are typically much less frequent than lookups, this is why they are not part of the ranking, they should still be reasonably fast for a good hash table however.
If you want a fast hash table with higher memory efficiency you might want to take a look at `tsl::sparse_map` and especially `patchmap`, which achieves the highest performance for a memory efficiency between 0.6 and 0.87.
The fastest hash table in the very high memory efficiency regime is `google::sparse_hash_map` at 0.88, but it can be beat by using a hash table combining chaining, a very high load factor and pseudorandom ordering, indicated with a `std::unordered_map` is forced to perform sub-par because the C++ standard requires all iterators except to elements that are directly deleted to stay valid for all hash table operations. This necessitates chaining instead of probing and prohibits reordering schemes.

All in all this shows that pseudorandom ordering is an an interesting technique for achieving highly memory efficient hash tables, while also being competitive at medium memory efficiencies.
Below 50% memory efficiency the overheads associated with this technique make it unsuitable, as collisions become the exception rather than the norm and the reordering does not pay off.

## The algorithm

A sorted array is a compact data structure, it takes up just as much space as absolutely necessary. It also allows for relatively fast acceses on the order of O(log(log(n))) if the keys are uniformly distributed. Insertions and deletions end up being O(n). While this might be good or good enough for some applications, especially when the number of keys is small, it is not what we want from a hash table in general. There is no general compact data structure to my knowledge that allows asymptotically constant lookups, insertions and deletions, but what happens if we allow a sorted array to be more wasteful?

If there is more memory available some positions in the sorted array can be left empty. The sorted values will form clusters or "patches" which will be sorted internally and with respect to each other, in between there will be positions that are not set. For insertions this means that the reordering required is much more localised and the cost only a function of the load factor, not the total size. For lookups, because there are more positions than entries to store, the offset from the position determined by linear interpolation can be minimised. This in turn means that an interpolation search will be able to find the values much quicker, in fact for a given load factor α<1 the number of steps that an interpolation search needs will be independent of n ( number of elements stored in the table ) and doubly logarithmic as a function of α.

But what if the keys we want to store are not uniform? This is where `patchmap` comes in. By sorting the keys by their hash value the order becomes pseudorandom. If the hash function is good then the values that it produces will be uniform even if the keys themselves are not. Therefore by combining a good hash function with an order given by the hash values a data structure can be created that stores keys or key-value pairs in a very memory efficient way and that interpolates between a (randomly) sorted array and a hash table.

### Example: Inserting into a patch

To know which buckets in the table are set and which are not one could reserve a special value like the google implementations do for an empty key. But I have decided to use a binary mask that indicates whether an entry is set or not, because I do not want to impose any restrictions on the keys that can be stored in the hash table. If there were one value of the key that was impossible, like say the `nullptr` it could be used to indicate an empty position, like is being done in the hash table implementations of google.

When inserting into the `patchmap` it goes something like this:

- map the hash value of the key to the range of the table
- look at the mask to find the closest free bucket to this position
- insert the key-value pair there
- restore the order by sorting by the hash value with a single insertion-sort step

#### The `patchmap` before insertion

`key ...__257936__...`

hash ...__455678__...

mask ...--++++++--...

hash ...__455678__...

mask ...--++++++--...

#### First step of inseting the key **1** with hash value of **5**

`key ...__2579361__...`

hash ...__4556787__...

mask ...--+++++++--...

hash ...__4556787__...

mask ...--+++++++--...

#### The order has been restored in a single insertion sort step

`key ...__2579136__...`

hash ...__4556778__...

mask ...--+++++++--...

hash ...__4556778__...

mask ...--+++++++--...

### Competing with `google::sparse_hash_map`

The `patchmap` achieves a good speed/space trade-off, but it cannot beat googles sparse hash map at very high memory efficiency although the comparison to it and `sparsepp` seemed not even very fair in the first place. `sparsepp` and `google::sparse_hash_map` minimize the peak memory consumption instead of the average memory consumption. Each time the other hash tables (including the `patchmap`) have to grow they reserve a new contiguous block of memory, copy the values from the old table, that has become too small, into the new table, therefore taking up at least two times more memory during each grow operation. To make the comparison fair I used a hashed array tree to generate the point for the `patchmap` with the highest memory efficiency. A hashed array tree allows the hash table to grow virtually in place, similar to what realloc allows, but at some additional cost, however keeping non copy-constructible data intact.

Still the `patchmap` performs worse at the memory efficiency that `google::sparse_hash_map` achieves for 64-bit keys.

This is because at high load factors large patches have a very much increased probability of growing bigger because they occupy more positions in the array where values are being mapped to randomly, leading to an asymptotically quadratic behavior. If we really want to beat `google::sparse_hash_map` for memory efficiency we can work around that and instead make a chaining hash table. Entries that hash to the same hash value are going to be stored in a single contiguous array which is being sorted by another hash value, again using the pseudorandom ordering trick. Chaining makes the distribution of the sorted patches more even, as there is no feedback mechanism at work. Interpolation search gives very fast access times and the memory overhead is very little if these arrays are large. At a load factor between 64 and 128 the memory efficiency of `google::sparse_hash_map` can be beat, while not giving up on performance completely, see the

### Tombstones can make things faster but they also create zombies

The `patchmap` seems to fare worst in comparison to the other hash
tables for deletions. This is probably because I went a different route there.
Usually when deleting an element from a hash table elements are not actually
removed from the table, they are only marked as deleted. This technique is
known as tombstoning. It is done mostly because this is the fastest way to
delete elements, but also because the other ways often are much harder to
implement. This is not the case in the `patchmap`. There deleting works
just like insertion, but in reverse. Deleted but not dead entries, let's
call them zombie entries, can accumulate and if we don't pay close
attention they can take over the table. Such a worst case scenario can be
triggered by repeatedly inserting and deleting elements while keeping the
table at the same size. Some implementations are immune and others have taken
specific precautions to mitigate this issue, usually by rehashing the table
as soon as the zombies accumulate too much. This usage pattern or others that
generate many zombies are unusual but not impossible, therefore a small to
medium slowdown is perfectly acceptable, a severe slowdown might however
become an unforeseen problem at some point.

hash table | slowdown | rehasing |
---|---|---|

patchmap | no | none |

bytell | no | none |

std::unordered_map | no | none |

flat_hash_map | no | none |

F14ValueMap | no | none |

sparsepp | minor | frequent |

google::sparse_hash_map | minor | frequent |

google::dense_hash_map | severe | rare |

khash | severe | occasional |

## Additional points

### How to make a good hash function?

I had been thinking about good hash functions for a long time so it was not hard to combine two multiplications and a binary rotation to get a hash function that had a very even output. This is similar to what Malte Skarupke ended up doing starting with a single multiplication with the golden ratio, aka fibonacci hashing and composing it with a bitshift. Galois Field multiplications are a great building block too, but my laptop is a little old and therefore struggles with carry-less multiplications.

### The backstory

Recently I wanted to compute a large number of large histograms.
The fastest way of computing histograms involves hash tables.
As c++ is my favorite programming language I had chosen
`std::unordered_map` as the backend.
Everything was going smoothly just as I had imagined but then when I was
running the code on my laptop instead of the servers at my institute
everything froze.
The memory required was around 8 times of what I had anticipated.
It turns out I was triggering close to the worst case for the
`std::unordered_map` as I was mapping 16-bit values to 16-bit values
on a 64-bit machine.
The `std::unordered_map` is a chaining hash table which means that for
each pair of 16-bit values it needs to store a 64-bit pointer,
this creates an especially large overhead in this case.
I ended up sorting the values first and then computing the histograms from
the sorted arrays, which takes much longer, but uses the minimum amount of
memory.
But I still had more memory available, I was just using 6 GB out of the 16 GB
I had in my laptop.
Would it not be great if there was a data structure that could interpolate
between the two solutions? Or at least if there was a hash table that was more
memory efficient?

And before I even found out that there were indeed hash tables that are way more efficient I had an idea that I needed to try.

The idea is as has been hinted at to interpolate between a sorted array and a hash table. That is mapping the keys linearly to the hash table and resolving collisions based on the numerical value of the key should they occur. This works well if the keys are uniformly distributed and in this case has the added effect of essentially sorting the keys in O(1) time and O(1) memory, not in place however as for this to work efficiently there is some fractional extra space needed, at least 5% extra would be good.

### Comparing to other approaches

This approach might sound very similar to the one suggested in "Ordered Hash Tables" (01 January 1974) O. Amble D. E. Knuth but there is a mayor difference in that the hash tables there are ordered first by the hash of the keys and then by the value of the key. If the entries are however only ordered by the hash value there exists a global order, which all of a sudden allows advanced search strategies like interpolation search and binary search.

In fact the approach is much more similar to robin hood hashing with linear probing because if two keys collide at the same position the key whose hash value is lower will also have been displaced more and will therefore displace the key with the larger hash value, therefore keeping up the same order. There has however been no hash table using robin hood hashing with linear probing that takes advantage of this fact.

### Availability

https://github.com/1ykos/ordered_patch_map

[ C++17, open source, beta stadium, let me know if you want to use it, I will try to make it work for you! ]

### Publication

For a more detailed description of the inner workings of the `patchmap` you can have a look at (open-access):

### Miscellaneous improvements and future improvements

Bijective hash functions

Hash functions with minimal output size that do not introduce additional collisions are necessarily bijective. Bijective functions are just the way to go when creating a hash function. They can be composed arbitrarily and they can also be inverted. They can be made to be arbitrarily hard to invert. Exponentiation by squaring on a prime field for example is used in cryptography, it is computationally very expensive to invert, yet it is bijective. It's a common misconception to think that the security of a cryptographic hash function depends on hash collisions. For a hash table however good mixing is all that matters, being easily invertible can even be an advantage.

Store permutated data

Lookup is the most frequent operation in a hash table, therefore it imperative that it is fast. In the patchmap, because the data is ordered by the hash of the key, for each lookup the hash of the key needs to be compared to several other hashes. To speed up this comparison the hashes could be stored alongside the data, like in some other hash tables. Given a bijective hash function and its reciprocal however this space can be saved and instead of storing hash and data we can just store the hash. The data can be computed from the hash.

Resizing allocations

Some allocators are able to resize the allocations, similar to
`realloc`, but no C++ other than `dlib::allocator` expose that
feature. I experimented using resizing and the time to grow and shrink the
table can be reduced, but there are lots of corner cases which need to be
handeled correctly, and that's why this feature is not in the current version
of the patchmap.

lazy rehashing

Resizing without relocating is especially beneficial when there is a way to
craft the hash reduction so that only a fraction of the data actually needs
to move to a new location. The only hash table that I know of that manages
to make use of this ingenious idea is `khash`. When the table size is
a factor of 2, the hash reduction can be as simple as take n bits of the hash.
When growing the table this means that only half of the keys in the table
need to be moved, making the grow operation even cheaper. If only there was a
way to generalise this trick to arbitrary hash table sizes!

compact hashing

The insertion scheme of the patchmap minimises the offset of the hash of the
key to the position, mapped to the range of available buckets.
This means that roughly `floor(log2(n))` bits of the key can be guessed
from the where the key is stored.
Only storing the differences between the hash and the hash that would be
expected at a given position could save `floor(log2(n))` per key.
At least in theory this is how compact hashing could be adapted to the
patchmap. The devil is in the details, mainly in the alignment.