Final Writeup

Team Members: Benson Qiu and Tommy Hung


Summary

We implemented a parallel hashmap with concurrent cuckoo hashing optimizations for read-heavy workloads, using algorithmic improvements described by Fan et al. in MemC3 [1] and by Li et al. [2]. Based on our benchmarks, our fastest implementation achieved a 12x speedup for concurrent reads over a single-threaded C++11 std::unordered_map while maintaining 90% space efficiency, measured on two Intel Xeon E5-2620 v3 processors.


Motivation

Read-Heavy Workload: MemC3 by Fan et al. improves on Memcached, an in-memory key-value store commonly used as a caching layer for API calls and database requests. Memcached workloads are primarily read heavy, with real-world read-to-write ratios of 30:1.

Space Efficiency: Some open addressing schemes such as linear probing degrade in performance as hashmaps become denser. The number of keys searched increases linearly with space efficiency for successful searches (key exists) and superlinearly for unsuccessful searches (key does not exist). At 90% space efficiency, an unsuccessful search using basic linear probing would search 55 keys on average.
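
As a point of reference (the standard textbook approximation, not a number taken from our own measurements), linear probing at load factor α is expected to examine about

$$ E[\text{probes}_{\text{successful}}] \approx \tfrac{1}{2}\Big(1 + \tfrac{1}{1-\alpha}\Big), \qquad E[\text{probes}_{\text{unsuccessful}}] \approx \tfrac{1}{2}\Big(1 + \tfrac{1}{(1-\alpha)^{2}}\Big), $$

so the cost of an unsuccessful search grows with the square of 1/(1-α) as the table fills.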

Figure 1. Expected number of keys searched by linear probing at varying space efficiencies.

Background

Figure 2. Cuckoo Hashing.

Cuckoo hashing is a scheme for resolving hash collisions in a hashmap. The idea is to use two hash functions instead of one, so that every key has two candidate buckets it can reside in. If all slots in both buckets are full, a random slot within the two buckets is chosen, the existing key in that slot is evicted and reinserted into its alternate bucket, and the new key takes its place. Lookups are guaranteed to be constant time because a key can only live in one of its two buckets, so we search at most 2*NUM_SLOTS_PER_BUCKET slots for a key.
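
To make the eviction loop concrete, here is a minimal, sequential sketch of cuckoo insertion with random eviction. All names, sizes, hash functions, and the displacement bound are illustrative assumptions rather than code from our implementation.

#include <cstddef>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

const int NUM_SLOTS_PER_BUCKET = 4;
const int MAX_DISPLACEMENTS = 500;      // give up after this many kicks (a real table would rehash/grow)

struct Slot   { bool occupied = false; std::string key, val; };
struct Bucket { Slot slots[NUM_SLOTS_PER_BUCKET]; };

std::vector<Bucket> buckets(1 << 16);   // table size chosen arbitrarily for the sketch

size_t hash1(const std::string& key);   // assumed: two independent hash functions
size_t hash2(const std::string& key);

bool basic_cuckoo_insert(std::string key, std::string val) {
    for (int n = 0; n < MAX_DISPLACEMENTS; n++) {
        size_t b[2] = { hash1(key) % buckets.size(), hash2(key) % buckets.size() };
        // Use an empty slot in either candidate bucket if one exists.
        for (size_t bi : b) {
            for (Slot& s : buckets[bi].slots) {
                if (!s.occupied) {
                    s.occupied = true;
                    s.key = key;
                    s.val = val;
                    return true;
                }
            }
        }
        // Both buckets are full: evict a random slot and reinsert the displaced
        // pair on the next iteration. (For simplicity this retries both of the
        // displaced key's buckets; a tighter version would go straight to its
        // alternate bucket.)
        Slot& victim = buckets[b[rand() % 2]].slots[rand() % NUM_SLOTS_PER_BUCKET];
        std::swap(key, victim.key);
        std::swap(val, victim.val);
    }
    return false;   // insertion failed after too many displacements
}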


Algorithm / Approach

Figure 3. Concurrent Cuckoo Hashing With Optimizations.

// Data structures used for optimistic concurrent cuckoo hashing.
// T is the mapped value type; these declarations live inside a hashmap class
// template parameterized on T.

struct HashPointer;   // forward declarations so the excerpt reads top-down
struct HashEntry;

// Flat array of slots; each bucket consists of 4 consecutive slots.
HashPointer *m_table;

// 1-byte tag plus an 8-byte pointer (16 bytes after padding), so a bucket of
// 4 slots fits in a single 64-byte cache line.
struct HashPointer {
    unsigned char tag;   // short hash of the key, checked before following ptr
    HashEntry* ptr;      // points to the full key-value pair
};

// Heap-allocated key-value pair; only dereferenced when the tags match.
struct HashEntry {
    std::string key;
    T val;
};

For our project, we implemented several improvements to the basic cuckoo hashing scheme, with an emphasis on maximizing concurrent read performance for read-heavy workloads. Each bucket in the hashmap has four slots, where each slot contains a HashPointer struct. The HashPointer struct contains a tag and a pointer to a HashEntry struct that holds the actual key-value pair. (Note: The algorithm described below focuses on a multiple reader, single writer workload. We did further optimizations to handle the multiple writer case.)


Lock Striping: We maintain a key-version counter for each bucket. Whenever an insertion thread is about to displace a key, it atomically increments the relevant counter using __sync_fetch_and_add. The insertion thread increments the counter again after it is done modifying the key, so each displacement increments the counter by 2. A lookup thread snapshots its corresponding key-version counter twice, once before it starts reading and again after it has read both buckets. If either snapshot is odd, or if the two snapshots differ, the key may have been modified during the read, so the lookup thread waits and retries. Lock striping allows for concurrent, lock-free reads.
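
The following is a minimal sketch of this protocol. The counter array size, the choice to index counters by the key's hash, and the helper names are illustrative assumptions; only __sync_fetch_and_add and the odd/even convention come directly from the description above.

#include <string>

const int NUM_KEY_VERSION_COUNTERS = 8192;            // assumed stripe count
volatile unsigned int key_version[NUM_KEY_VERSION_COUNTERS];

// Assumed helpers (declarations only): map a key to its counter, and scan the
// key's two candidate buckets for a match.
unsigned int counter_index(const std::string& key);
bool search_both_buckets(const std::string& key, std::string* out);

// Writer: bracket each key displacement with two atomic increments, so the
// counter is odd exactly while the key is being moved.
void begin_displacement(const std::string& key) {
    __sync_fetch_and_add(&key_version[counter_index(key)], 1);
}
void end_displacement(const std::string& key) {
    __sync_fetch_and_add(&key_version[counter_index(key)], 1);
}

// Reader: snapshot the counter before and after reading both buckets, and
// retry if a displacement was in progress or completed during the read.
bool optimistic_read(const std::string& key, std::string* out) {
    while (true) {
        unsigned int v1 = key_version[counter_index(key)];
        if (v1 & 1) continue;                       // writer mid-displacement
        bool found = search_both_buckets(key, out);
        unsigned int v2 = key_version[counter_index(key)];
        if (v1 == v2) return found;                 // read was consistent
    }
}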


Tag-Based Lookup/Insertion: When a key is inserted into the hashmap, the key is hashed to a 1-byte tag value that is stored in the HashPointer struct. When performing a key lookup, we apply the tag hash to the key and only dereference the HashPointer's HashEntry* ptr field if the tags match (i.e. tag_hash(current_key) == hash_pointer.tag). The key idea behind using tags is to reduce pointer dereferences and to maximize cache locality. We only dereference a pointer when we find matching tags, which for a non-matching key happens only about 1/256 of the time with an 8-bit tag. The HashPointer struct requires only 16 bytes of memory, so we can store all 4 slots of a bucket on a 64-byte cache line, thus maximizing cache locality.
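
As a sketch, the lookup within one bucket looks roughly like the function below. It is written as a member function of the hashmap class template, so the HashPointer and HashEntry structs above (and the value type T) are in scope; tag_hash() and NUM_SLOTS_PER_BUCKET are assumed names.

const int NUM_SLOTS_PER_BUCKET = 4;

unsigned char tag_hash(const std::string& key);   // assumed: hashes a key to 1 byte

// Scan one bucket (4 consecutive HashPointer slots) for a key.
HashEntry* find_in_bucket(HashPointer* bucket, const std::string& key) {
    unsigned char tag = tag_hash(key);
    for (int i = 0; i < NUM_SLOTS_PER_BUCKET; i++) {
        // Compare the 1-byte tags first; only follow ptr (a likely cache miss)
        // when the tags match, then confirm with a full key comparison.
        if (bucket[i].tag == tag && bucket[i].ptr != nullptr &&
            bucket[i].ptr->key == key) {
            return bucket[i].ptr;
        }
    }
    return nullptr;   // not here; the caller also checks the key's other bucket
}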


Figure 4. Cache miss rate measured on an Intel Xeon E5-2680 processor using perf, with 24 threads reading 10M elements. Exact results vary across machines, but there are generally fewer L3 cache misses with the tag optimization than without.

Separate Path Discovery and Path Eviction: A typical cuckoo hashing insertion scheme may involve a series of insertions, evictions, and re-insertions. We search for the entire eviction path first before applying a sequence of key displacements. Because the path search step does not displace any keys, other threads can read without retry conflicts during this step. In the path eviction step, we move keys backwards along the cuckoo path so that each displacement only affects one key at a time. Separating the search and eviction steps reduces the "floating time" of keys, minimizing the number of retries for concurrent readers. To evaluate the impact of floating time on performance, we measured the average path discovery and average path eviction times on 1M insertions, which were 0.35 and 0.19 seconds, respectively. This optimization reduced the floating time from 0.35+0.19 seconds to 0.19 seconds, a 65% decrease.
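
Sketched below is the resulting two-phase insert. SlotRef, find_cuckoo_path(), key_at(), move_entry(), and write_new_entry() are hypothetical helper names; begin_displacement() and end_displacement() are the version-counter increments from the lock striping sketch above.

#include <cstddef>
#include <string>
#include <vector>

struct SlotRef { size_t bucket; int slot; };   // identifies one slot in the table

// Phase 1 (read-only): find slots path[0], ..., path[n-1] such that path[n-1]
// is empty and the key in each path[i] can legally move into path[i+1].
std::vector<SlotRef> find_cuckoo_path(const std::string& key);

std::string key_at(SlotRef s);                                    // assumed helpers
void move_entry(SlotRef from, SlotRef to);                        // copies tag + pointer
void write_new_entry(SlotRef s, const std::string& key, const std::string& val);
void begin_displacement(const std::string& key);                  // version counter ++
void end_displacement(const std::string& key);                    // version counter ++

bool two_phase_insert(const std::string& key, const std::string& val) {
    std::vector<SlotRef> path = find_cuckoo_path(key);   // no keys displaced yet
    if (path.empty()) return false;                      // table too full

    // Phase 2: walk the path backwards so at most one key is "floating"
    // (removed from one slot but not yet visible in the next) at any instant.
    for (size_t i = path.size() - 1; i > 0; i--) {
        std::string moving_key = key_at(path[i - 1]);
        begin_displacement(moving_key);        // counter becomes odd
        move_entry(path[i - 1], path[i]);      // destination is empty by construction
        end_displacement(moving_key);          // counter even again
    }
    write_new_entry(path[0], key, val);        // finally place the new key at the head
    return true;
}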


Space Efficiency: Using 4 slots per bucket with tag-based lookups and insertions allows us to achieve 90% space efficiency without significant degradation in performance. Other open addressing schemes such as linear probing and basic cuckoo hashing tend to degrade in performance after about 50% space efficiency.


After implementing the optimizations above for a multiple reader, single writer workload, we wanted to create a general purpose concurrent hashmap that performed well for multiple writers.


Lock Later: After separating path discovery and path eviction for insertions, concurrent writers can search for paths in parallel because path search does not make modifications to the data structure. This reduces the critical region to just the path eviction step, although insertion threads must verify during the eviction step that their paths are still valid.
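
A rough sketch of this write path follows, reusing SlotRef and find_cuckoo_path() from the insert sketch above. path_still_valid() and apply_evictions() are assumed helpers, and the single std::mutex is only one possible way to serialize the eviction phase.

#include <mutex>
#include <string>
#include <vector>

bool path_still_valid(const std::vector<SlotRef>& path);
void apply_evictions(const std::vector<SlotRef>& path,
                     const std::string& key, const std::string& val);

std::mutex write_lock;   // protects only the eviction phase

bool lock_later_insert(const std::string& key, const std::string& val) {
    while (true) {
        // Path discovery runs outside the lock, so multiple writers can
        // search for eviction paths in parallel.
        std::vector<SlotRef> path = find_cuckoo_path(key);
        if (path.empty()) return false;

        std::lock_guard<std::mutex> guard(write_lock);
        // Another writer may have changed the table between discovery and
        // locking, so verify the path before evicting along it.
        if (!path_still_valid(path)) continue;   // stale path: rediscover
        apply_evictions(path, key, val);         // phase 2 as sketched above
        return true;
    }
}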


Results

We measured our hashmap implementations by benchmarking raw read and write throughput (operations per second). We used 10M strings, converted from randomly generated integers, as key and value inputs. All tests were performed on two Intel Xeon E5-2620 v3 processors.


Figure 5. Randomized read throughput with 24 threads for parallel implementations.
90% space efficiency was used for all cuckoo hashing schemes.

Our first optimistic cuckoo implementation (with lock striping) achieved 19M reads per second with 24 threads, which is slightly faster than both fine grained (locking over each bucket) separate chaining and fine grained cuckoo hashing. We achieved a significant further increase in performance when we implemented the tag optimization. Recall that a HashPointer struct in a parallel cuckoo hashmap with tags requires 16 bytes, so a 4-way set associative bucket fits entirely into a 64-byte cache line. Also, a 1-byte tag can take on 256 different values, so a non-matching key triggers a pointer dereference only about 1/256 of the time. With tags, we achieve 38.2M reads per second on Latedays, a 12x speedup for concurrent reads over a single-threaded C++11 std::unordered_map. The lock later optimization did not yield any noticeable speedup for reads, which is expected because lock later only targets concurrent writes.


Figure 6. Randomized read throughput with 24 threads at varying space efficiencies.

As we made the hashmaps denser by increasing space efficiency, the sequential cuckoo, fine grained cuckoo, and optimistic cuckoo implementations decreased in performance. This is because read operations on denser hashmaps search through more slots, on average, before the correct key-value pair is found. Recall that a reader thread may need to search up to 2*NUM_SLOTS_PER_BUCKET slots, but will generally search fewer slots in sparser hashmaps. Optimistic cuckoo hashing with the tag optimization did not show any significant performance degradation even as the space efficiency of the hashmap increased to 90%. This is likely due to the tag optimization, which maximizes cache locality and minimizes the number of pointer dereferences. Optimistic cuckoo hashing with both the tag and lock later optimizations (not shown in Figure 6) had nearly identical results.


Figure 7. Read throughput under high contention: all threads read a single randomly chosen key, so every access targets the same bucket.

We measured read throughput under high contention by performing concurrent read operations on the same key, which always hashes to the same bucket. Performance for all hashmap implementations decreased in the multithreaded case compared to the single-threaded case. Even under high contention, the optimistic cuckoo implementations remained faster than both fine grained cuckoo hashing and fine grained separate chaining.


Figure 8. Speed of various hash functions on strings.

We tested different hash functions because the overhead of hashing keys was a significant performance bottleneck. We found that hashlittle2, which produces two 32-bit hash values (64 bits total) in a single pass over the key, provided the highest throughput. The throughput for hashlittle2 in the graph was multiplied by 2 to account for it generating two 32-bit hash values in a single function call.
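
For reference, hashlittle2 from Bob Jenkins' lookup3.c returns both 32-bit hashes through in/out parameters, so a single call can produce both candidate bucket indices. The seed values and the modulo reduction below are our own illustrative choices.

#include <cstddef>
#include <cstdint>
#include <string>

// Declaration from lookup3.c (extern "C" assuming it is compiled as C):
// pc and pb carry seed values in and the two 32-bit hash results out.
extern "C" void hashlittle2(const void *key, size_t length,
                            uint32_t *pc, uint32_t *pb);

// One call yields both candidate bucket indices for a key.
void candidate_buckets(const std::string& key, size_t num_buckets,
                       size_t* b1, size_t* b2) {
    uint32_t h1 = 0, h2 = 0;   // arbitrary seeds
    hashlittle2(key.data(), key.size(), &h1, &h2);
    *b1 = h1 % num_buckets;
    *b2 = h2 % num_buckets;
}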


Figure 9. Performance of interleaved read/write requests on hashmaps pre-populated to 80% space efficiency.

We measured interleaved read and write throughput on dense hashmaps. All hashmaps were pre-populated by performing write operations until 80% of the slots in the hashmap were occupied. We then measured performance on the pre-populated hashmaps with interleaved read and write requests, ranging from 90% reads (and 10% writes) to 100% reads. The key observation is that the lock later optimization (blue line in Figure 9) is faster than optimistic cuckoo hashing without the lock later optimization (green and yellow lines) because lock later speeds up write throughput.


Comparison to Fan et al.

Figure 10. Comparison of our read performance to Fan et al.

Our code achieves almost the same performance as the reference paper under similar settings. Fan et al.'s implementation ran on an Intel Xeon L5640, while ours ran on an Intel Xeon E5-2680 (we did not use Latedays for this test).


Extensions / Future Work

We were able to implement everything mentioned in the project proposal, which was based on work by Fan et al. We also implemented the lock later optimization to increase write throughput. With our remaining time, we explored the feasibility of several other optimizations described by Li et al. for increasing write throughput.


BFS Path Search: Our current path discovery scheme behaves like DFS. Using BFS instead of DFS for path discovery finds the shortest eviction path, which shrinks the critical region for the path eviction step. Ignoring insertions where no evictions were needed because an empty slot was found immediately, we measured an average DFS eviction path length of 4.7 when inserting elements into an optimistic cuckoo hashmap at 90% space efficiency. BFS reduces the average eviction path length to a logarithmic factor. However, a BFS-based path search would incur significantly more overhead during the path discovery phase.


Fine Grained Locks for Writes: Our current implementation (Optimistic Cuckoo + Tag + Lock Later) reduces the critical section of writes by separating path discovery and path eviction, but we still have a region with a coarse grained lock. An optimization would be to use fine grained locks over each bucket. We did a proof of concept for this optimization by implementing a fine grained version of basic cuckoo hashing and noticed a significant improvement in concurrent write throughput. We feel that this would be an effective optimization to implement if we had more time.


Acknowledgements

Many of the optimizations implemented for this project were borrowed from Fan et al. [1] and Li et al. [2]

Our hash function was written by Bob Jenkins and is publicly available online.


References

[1] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. NSDI 2013. http://www.cs.cmu.edu/~dga/papers/memc3-nsdi2013.pdf
[2] Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. EuroSys 2014. http://www.cs.princeton.edu/~xl/Publications_files/cuckoo-eurosys14.pdf

Work Done By Each Student

Equal work was performed by both project members.