Hash table performance and memory efficiency

Memory Efficient Hash Tables
and Pseudorandom Ordering

What makes a hash table good?

A hash table is a data struture commonly used to associate keys with values to implement dictionaries or to test if a key is part of a set of keys to implement set operations. The best hash table is the one that enables these operations at the lowest cost. Cost in computing essentially is two-dimensionsional, time and space. Keys or key-value pairs are stored at a pseudorandom position determined first by the hash of the key and then either storing all the entries that map to the same position in a secondary data structure (this method is called chaining) or by a probing strategy that finds another free position. The first step creats a cache miss. A cache miss is when data, that cannot be found in any of the caches of the CPU, is requested. The time this takes is around 300 CPU cycles and typically determines the performance of a hash table. It cannot be avoided due to the pseudorandom nature of hashing. The second step however occurs more frequently the fuller the table becomes, it is a function of the load factor α. The load factor is defined as the number of elements stored in the table divided by the total number of positions available in the table. Therefore, for a given hash function and and collision resolution scheme, the larger table is also faster because it has to resolve the less collisions, and therefore less cache misses.

The most memory efficient datastructure for associations

The hash table with the best memory efficiency is simply the one with the highest load factor, (it can even exceed 100% memory efficiency by using key compression with compact hashing ). A hash table like that does still provide O(1) lookups, just very slow.

The best trade-off between memory usage and speed

I argue that the best hash table is not the fastest or the most memory efficient, but the one with the best trade-off between space and time. Asymptotically the best trade-off is the one that has the lowest product of the the two. Imagine building a big server that does nothing but searching entries in a hash table. Say the hash table is very large so that the offset of the price for the CPU and the other hardware will be dwarfed by the cost of the RAM. The server will have a finite life-time, therefore the cost of a single lookup into the hash table will be directly proportional to the product of the time it takes to access an element and the amount of space the hash table needs per element. Let's call this concept the space-time.

If the space-time seems too abstract let's take a common task we all probably do from time to time: Finding hash collisions by brute force. By first computing as many hashes as fit in memory and then - using a hash table - comparing to all of them at the same time with a single lookup for each newly computed hash. Say we can neglect the time to compute a hash, then the performance can be expressed as the number of comparisons that can be done per unit of time. That number is, not incidentally, proportional to the reciprocal of the space-time.

The `patchmap`

The patchmap is my attempt at achieving this magical smallest space-time. Recently I was frustrated with the performance of std::unordered_map and before I could even test other hash tables I had come up with an idea of how to save space that I needed to try. Let's first look at how it compares, then at the algorithm, more about the motivation later.

Performance compared to other hash tables

I designed a small benchmark to be as fair and comprehensive as possible. Only random keys can produce comparable memory access patterns across different hash table implementations using different hash functions and probing strategies, keep in mind however that this masks bad hash functions. For the following benchmark 2²⁶ random 64 bit keys were associated with arbitrary 64 bit values. The average memory consumption and time for insertion, lookup and deletion per key-value pair can be found in the plots below for the different hash tables.

Figure 1: Average performance for successful and unsuccessful lookups, insertion and deletion of key-value pairs for selected hash table implementations stated in terms of time and memory consumption. patchmap: • khash: × bytell: + google::sparse_hash_map: ○ google::dense_hash_map: ⬟ ska::flat_hash_map: △ std::unordered_map: ◇ sparsepp: ◻ Judy array: ◆ F14ValueMap: ▲ chaining+sorting: • robin_hood::unordered_map: ▽ absl::flat_hash_map: ⬠ tsl::sparse_hash_map: ★ emilib2::HashMap: ▩

The best hash-table implementations define a lower bound of the best achievable space-time trade-off. I chose to fit c • ( 1+(α-1)¯¹ ) as this is the asymptotic complexity of double hashing, which is one of the best probing strategies in terms of asymptotic complexity and it seemed to fit well. Given this curve of achievable trade-offs we want to choose the one where the product of space and time requirement is the lowest. The solid black line indicates the line the hash table implementations need to be below on the plot of memory and speed if they want to beat the apparently achievable best space-time trade-off. The hash tables that have an average memory efficiency of around 2-√2 = 0.585786... will achieve the lowest space-time given this trade-off curve. Coincidentially, but not surprisingly modern hash tables tend to cluster around this memory efficiency.

Figure 2: Average cost for insertion and lookup in the patchmap as it is being filled up with 2²⁸ pseudo-random keys. Insertion (in red) shows a strongly quadratic behavior and the function 1+(α/(1-α)²+α)/4 is indicated with a solid black line. Lookup (in blue) is almost linear in α and shows a steep increase only for load factors close to 1. The theoretically derived complexity bound for lookup 1+log₂(log₂((2-α)/(2-2α))+1) is indicated with dashed black line.

Conclusion

As can be seen from the plots there are some hash tables with significantly suboptimal performance characteristics, but the best space-time trade-off is heavily contested and depends on the use case. F14ValueMap, bytell, patchmap, tsl::sparse_map, robin_hood::unordered_map and absl::flat_map all exhibit similar performance characteristics. At that level of difference compiler versions or memory fragmentation can already make the difference. For lookups bytell achieves the lowest space-time, closely followed by the patchmap, then absl::flat_map, F14ValueMap and robin_hood::unordered_map. Depending on your constraints you might want to use ska::flat_hash_map for speed, it actually is the fastest hash table for lookups, just like Malte Skarupke claims, while google::dense_hash_table is a very close second. Insertions and deletions are typically much less frequent than lookups, this is why they are not part of the ranking, they should still be reasonably fast for a good hash table however. If you want a fast hash table with higher memory efficiency you might want to take a look at tsl::sparse_map and especially patchmap, which achieves the highest performance for a memory efficiency between 0.6 and 0.87. The fastest hash table in the very high memory efficiency regime is google::sparse_hash_map at 0.88, but it can be beat by using a hash table combining chaining, a very high load factor and pseudorandom ordering, indicated with a green dot at 0.95, more on that here. A Judy array is good for medium to small datasets, but the asymptotic properties are catching up to it at 2²⁷ keys and it will only get worse. The std::unordered_map is forced to perform sub-par because the C++ standard requires all iterators except to elements that are directly deleted to stay valid for all hash table operations. This necessitates chaining instead of probing and prohibits reordering schemes.
All in all this shows that pseudorandom ordering is an an interesting technique for achieving highly memory efficient hash tables, while also being competitive at medium memory efficiencies. Below 50% memory efficiency the overheads associated with this technique make it unsuitable, as collisions become the exception rather than the norm and the reordering does not pay off.

The algorithm

A sorted array is a compact data structure, it takes up just as much space as absolutely necessary. It also allows for relatively fast acceses on the order of O(log(log(n))) if the keys are uniformly distributed. Insertions and deletions end up being O(n). While this might be good or good enough for some applications, especially when the number of keys is small, it is not what we want from a hash table in general. There is no general compact data structure to my knowledge that allows asymptotically constant lookups, insertions and deletions, but what happens if we allow a sorted array to be more wasteful?

If there is more memory available some positions in the sorted array can be left empty. The sorted values will form clusters or "patches" which will be sorted internally and with respect to each other, in between there will be positions that are not set. For insertions this means that the reordering required is much more localised and the cost only a function of the load factor, not the total size. For lookups, because there are more positions than entries to store, the offset from the position determined by linear interpolation can be minimised. This in turn means that an interpolation search will be able to find the values much quicker, in fact for a given load factor α<1 the number of steps that an interpolation search needs will be independent of n ( number of elements stored in the table ) and doubly logarithmic as a function of α.

But what if the keys we want to store are not uniform? This is where patchmap comes in. By sorting the keys by their hash value the order becomes pseudorandom. If the hash function is good then the values that it produces will be uniform even if the keys themselves are not. Therefore by combining a good hash function with an order given by the hash values a data structure can be created that stores keys or key-value pairs in a very memory efficient way and that interpolates between a (randomly) sorted array and a hash table.

Example: Inserting into a patch

To know which buckets in the table are set and which are not one could reserve a special value like the google implementations do for an empty key. But I have decided to use a binary mask that indicates whether an entry is set or not, because I do not want to impose any restrictions on the keys that can be stored in the hash table. If there was one value of the key that was impossible, like say the nullptr it could be used to indicate an empty position, like is being done in the hash table implementations of google.

When inserting into the patchmap it goes something like this:

map the hash value of the key to the range of the table
look at the mask to find the closest free bucket to this position
insert the key-value pair there
restore the order by sorting by the hash value with a single insertion-sort step

The `patchmap` before insertion

key ...__257936__... hash ...__455678__... mask ...--++++++--...

First step of inseting the key 1 with hash value 7

key ...__2579361__... hash ...__4556787__... mask ...--+++++++--...

The order has been restored in a single insertion sort step

key ...__2579136__... hash ...__4556778__... mask ...--+++++++--...

Competing with `google::sparse_hash_map`

The patchmap achieves a good speed/space trade-off, but it cannot beat googles sparse hash map at very high memory efficiency although the comparison to it and sparsepp seemed not even very fair in the first place. sparsepp and google::sparse_hash_map minimize the peak memory consumption instead of the average memory consumption. Each time the other hash tables (including the patchmap) have to grow they reserve a new contiguous block of memory, copy the values from the old table, that has become too small, into the new table, therefore taking up at least two times more memory during each grow operation. To make the comparison fair I used a hashed array tree to generate the point for the patchmap with the highest memory efficiency. A hashed array tree allows the hash table to grow virtually in place, similar to what realloc allows, but at some additional cost, however keeping non copy-constructible data intact.
Still the patchmap performs worse at the memory efficiency that google::sparse_hash_map achieves for 64-bit keys.
This is because at high load factors large patches have a very much increased probability of growing bigger because they occupy more positions in the array where values are being mapped to randomly, leading to an asymptotically quadratic behavior. If we really want to beat google::sparse_hash_map for memory efficiency we can work around that and instead make a chaining hash table. Entries that hash to the same hash value are going to be stored in a single contiguous array which is being sorted by another hash value, again using the pseudorandom ordering trick. Chaining makes the distribution of the sorted patches more even, as there is no feedback mechanism at work. Interpolation search gives very fast access times and the memory overhead is very little if these arrays are large. At a load factor between 64 and 128 the memory efficiency of google::sparse_hash_map can be beat, while not giving up on performance completely, see the green dot in figure 1.

Tombstones can make things faster but they also create zombies

The patchmap seems to fare worst in comparison to the other hash tables for deletions. This is probably because I went a different route there. Usually when deleting an element from a hash table elements are not actually removed from the table, they are only marked as deleted. This technique is known as tombstoning. It is done mostly because this is the fastest way to delete elements, but also because the other ways often are much harder to implement. This is not the case in the patchmap. There deleting works just like insertion, but in reverse. Deleted but not dead entries, let's call them zombie entries, can accumulate and if we don't pay close attention they can take over the table. Such a worst case scenario can be triggered by repeatedly inserting and deleting elements while keeping the table at the same size. Some implementations are immune and others have taken specific precautions to mitigate this issue, usually by rehashing the table as soon as the zombies accumulate too much. This usage pattern or others that generate many zombies are unusual but not impossible, therefore a small to medium slowdown is perfectly acceptable, a severe slowdown might however become an unforeseen problem at some point.

hash table	slowdown	rehasing
`patchmap`	no	none
`bytell`	no	none
`std::unordered_map`	no	none
`flat_hash_map`	no	none
`F14ValueMap`	no	none
`sparsepp`	minor	frequent
`google::sparse_hash_map`	minor	frequent
`google::dense_hash_map`	severe	rare
`khash`	severe	occasional

Additional points

How to make a good hash function?

I had been thinking about good hash functions for a long time so it was not hard to combine two multiplications and a binary rotation to get a hash function that had a very even output. This is similar to what Malte Skarupke ended up doing starting with a single multiplication with the golden ratio, aka fibonacci hashing and composing it with a bitshift. Galois Field multiplications are a great building block too, but my laptop is a little old and therefore struggles with carry-less multiplications.

The backstory

Recently I wanted to compute a large number of large histograms. The fastest way of computing histograms involves hash tables. As c++ is my favorite programming language I had chosen std::unordered_map as the backend. Everything was going smoothly just as I had imagined but then when I was running the code on my laptop instead of the servers at my institute everything froze. The memory required was around 8 times of what I had anticipated. It turns out I was triggering close to the worst case for the std::unordered_map as I was mapping 16-bit values to 16-bit values on a 64-bit machine. The std::unordered_map is a chaining hash table which means that for each pair of 16-bit values it needs to store a 64-bit pointer, this creates an especially large overhead in this case. I ended up sorting the values first and then computing the histograms from the sorted arrays, which takes much longer, but uses the minimum amount of memory. But I still had more memory available, I was just using 6 GB out of the 16 GB I had in my laptop. Would it not be great if there was a data structure that could interpolate between the two solutions? Or at least if there was a hash table that was more memory efficient?

And before I even found out that there were indeed hash tables that are way more efficient I had an idea that I needed to try.

The idea is as has been hinted at to interpolate between a sorted array and a hash table. That is mapping the keys linearly to the hash table and resolving collisions based on the numerical value of the key should they occur. This works well if the keys are uniformly distributed and in this case has the added effect of essentially sorting the keys in O(1) time and O(1) memory, not in place however as for this to work efficiently there is some fractional extra space needed, at least 5% extra would be good.

Comparing to other approaches

This approach might sound very similar to the one suggested in "Ordered Hash Tables" (01 January 1974) O. Amble D. E. Knuth but there is a mayor difference in that the hash tables there are ordered first by the hash of the keys and then by the value of the key. If the entries are however only ordered by the hash value there exists a global order, which all of a sudden allows advanced search strategies like interpolation search and binary search.

In fact the approach is much more similar to robin hood hashing with linear probing because if two keys collide at the same position the key whose hash value is lower will also have been displaced more and will therefore displace the key with the larger hash value, therefore keeping up the same order. There has however been no hash table using robin hood hashing with linear probing that takes advantage of this fact.

Availability

https://github.com/1ykos/ordered_patch_map

[ C++17, open source, beta stadium, let me know if you want to use it, I will try to make it work for you! ]

Publication

For a more detailed description of the inner workings of the patchmap you can have a look at (open-access):

INFOCOMP article 581

Miscellaneous improvements and future improvements

Bijective hash functions

Hash functions with minimal output size that do not introduce additional collisions are necessarily bijective. Bijective functions are just the way to go when creating a hash function. They can be composed arbitrarily and they can also be inverted. They can be made to be arbitrarily hard to invert. Exponentiation by squaring on a prime field for example is used in cryptography, it is computationally very expensive to invert, yet it is bijective. It's a common misconception to think that the security of a cryptographic hash function depends on hash collisions. For a hash table however good mixing is all that matters, being easily invertible can even be an advantage.

Store permutated data

Lookup is the most frequent operation in a hash table, therefore it imperative that it is fast. In the patchmap, because the data is ordered by the hash of the key, for each lookup the hash of the key needs to be compared to several other hashes. To speed up this comparison the hashes could be stored alongside the data, like in some other hash tables. Given a bijective hash function and its reciprocal however this space can be saved and instead of storing hash and data we can just store the hash. The data can be computed from the hash.

Resizing allocations

Some allocators are able to resize the allocations, similar to realloc, but no C++ other than dlib::allocator expose that feature. I experimented using resizing and the time to grow and shrink the table can be reduced, but there are lots of corner cases which need to be handeled correctly, and that's why this feature is not in the current version of the patchmap.

lazy rehashing

Resizing without relocating is especially beneficial when there is a way to craft the hash reduction so that only a fraction of the data actually needs to move to a new location. The only hash table that I know of that manages to make use of this ingenious idea is khash. When the table size is a factor of 2, the hash reduction can be as simple as take n bits of the hash. When growing the table this means that only half of the keys in the table need to be moved, making the grow operation even cheaper. If only there was a way to generalise this trick to arbitrary hash table sizes!

compact hashing

The insertion scheme of the patchmap minimises the offset of the hash of the key to the position, mapped to the range of available buckets. This means that roughly floor(log2(n)) bits of the key can be guessed from the where the key is stored. Only storing the differences between the hash and the hash that would be expected at a given position could save floor(log2(n)) per key. At least in theory this is how compact hashing could be adapted to the patchmap. The devil is in the details, mainly in the alignment.