During the development of the Concurrency Kit hash set and hash table, detailed microbenchmarks were used to measure latency and variability of various operations in relation to various open-source hash table implementations. For read-mostly workloads, the implementation was at least twice as fast than Google Dense Hash on reference machines even though it provides stronger forward-progress guarantees for concurrent workloads. For example, it is lock-free for readers and wait-free for writers in single-writer many-reader use-cases. However, a recent use-case required the hash table implementations to handle delete-heavy workloads. As many open-addressing schemes, the implementation failed miserably in this workload due to tombstone accumulation. The strength of the collision resolution mechanism would very quickly lead to the complete elimination of probe sequence termination markers (empty slots) in favor of tombstones. Google Dense Hash performed in a stable manner with the delete-heavy workloads at the cost of a higher operation completion floor for more favorable workloads (due to increased clumping of the quadratic probing it uses). The performance gap for delete-heavy workloads has since then been closed with some trade-offs.
Disclaimer and Benchmark Environment
The fellow engineer at AppNexus who had the delete-heavy use-case wrote a self-contained program that executes long-lived interactions with the hash table. A variant of this benchmark was used to generate this data. The source-code is not available, but I urge readers to implement their own benchmarks and share their results. The use-case here involves data in which value objects can be derived from key pointers. Pointer packing was enabled for the hash set. I do hope to work on a much more detailed hash table benchmark suite in the future that addresses not only operation latency distribution, but longer-lived dynamic workloads.</p>
This analysis is by no means complete, and typically I would use more granular value distribution metrics for these benchmarks. In addition to this, gettimeofday is used in this benchmark and as such, includes vsyscall overhead. Values were normalized to ease human consumption of the graphs. If anything, though, this benchmark does show-case the promising performance properties of the Concurrency Kit hash table implementations, even in the absence of concurrent workloads that require fast lock-free progress guarantees. It also highlights the performance advantages to the excellent Google Dense Hash.
These are done on RHEL5 machines (glibc-2.5-81) with Linux kernel 2.6.32-100.0.19.el5. The processor is an Intel Xeon CPU L5640 at 2.27GHz. The benchmarks were run with SCHED_FIFO policy at highest priority and strong affinity to core 2 on the system.
These graphs represent data from a simple get-put workload. Keys are 40-byte randomly generated sequences and the benchmark will simply check if the key exists in the hash set, and if not, will insert it. This benchmark is run until 200M entries have accumulated. Hash function overhead has been factored out, and time has been normalized to nanoseconds per-operation. No delete operations are executed in this workload.
Cost of Get Operation
Cost of Put Operation
These graphs represent data from a delete-heavy workload. After adding 10M unique entries (get to put workload) to the hash table, every new entry will delete the oldest entry inserted. Hash function overhead has been factored out, and time has been normalized to nanoseconds per-operation. The X-axis represents number of executed transactions (get to put to delete) at increments of 10M transactions. The slot bound string in ck_hs indicates that the CK_HS_MODE_DELETE hint has been passed to enable stronger probe bound guarantees, and GC indicates a variant in which the writer executed a full-table GC operation (ck_hs_gc) every 300M transactions. Amortized GC (which is less bursty but also sub-optimal) is possible, but has not been measured in this benchmark. GC time was factored into delete time for simplicity in comparison.
Additional probing mechanisms are being investigated along with in-band probe sequence bounds to remove fast-path memory pressure.
Thanks to Bruno Keymolen and Riley Berton for their feedback and creation of the self-contained benchmark.