Can we implement cuckoo hashing using 1 table?

问题

I found about Cuckoo Hash tables and they seem pretty good.
But most examples code I found implement this using 2 tables.
This seems to me wrong because the 2 tables may be in different memory pages and we have overhead of fetching random addresses and have no real locality.
Is it not possible to use 1 array instead of 2?
Is it perhaps not possible to detect when an element has already been kicked out 2 times and is time for resizing?

回答1:

To answer the confusion in the comments: no this is not language specific. If you're thinking about memory locality and want to ensure the two tables are close, then a single allocation is the way to go (however you allocate). In java this may look as follows:

class TwoTables {
    private static final int SIZE_TABLE_FIRST = 11, SIZE_TABLE_SECOND = 29;

    public TwoTables() {
        m_buffer = new int[SIZE_TABLE_FIRST + SIZE_TABLE_SECOND];
    }

    // consider similar setters...

    public int getFirst(int key) {
        return m_buffer[toIndex(hashFirst(key), SIZE_TABLE_FIRST, 0)];
    }

    public int getSecond(int key) {
        return m_buffer[toIndex(hashSecond(key), SIZE_TABLE_SECOND, SIZE_TABLE_FIRST)];
    }

    private static int toIndex(int hash, int mod, int offset) {
        return hash % mod + offset;
    }

    private static int hashFirst(int key) { return ...; }
    private static int hashSecond(int key) { return ...; }

    private final int[] m_buffer;
}

If this performs better than accessing into two separate arrays is dependant on your JVM however: just think about the JIT being able to merge two small allocations into a single larger one on the fly - without you having to perform any index-magic.

回答2:

You can definitely do a cuckoo hashtable with a single hash table; that is, where the two positions for each object are simply positions within a single hash table.

The only small problem to be solved is how to decide during the cuckoo eviction loop which of the two positions to use for an evicted key. Of course, you can just try one position and use the other one if the first one was the same as the actual position. It should be possible to use SIMD to compute both hashes in parallel, so the cost of this strategy might be small.

However, if you want to guarantee a single hash computation during the cuckoo loop, there is a simple solution: instead of using H₀(k) and H₁(k) as the two positions, use H₀(k) and H₀(k) xor H₁(k). (If H₁ is independent from H₀, then so is H₀ xor H₁, so the xor does not affect the distribution of hash values.) With this modification, you can always find "the other position" of k by xor'ing the current position with H₁(k), so only a single hash computation is needed in the loop.

While that allows you to use a single hash table, and may even simplify the code, there is not a lot of evidence that it improves the operation of the algorithm. In my limited testing, it seems to increase the number of loop iterations by 40-50%. (Although it needs to be emphasized that in the vast majority of cases, a new key can be inserted into the table without entering the loop at all, so the increased number of loops is hardly noticeable in the actual execution time.)

回答3:

Well, all forms of hashing are murder on caches.

Anyways you can easily combine the two into a single table. But then how do you tell whether you're on your first hash function or the second? The options are add that as metadata to every bucket, or else figure it out by running the first hash function, seeing whether you got the current location, and running the second only if you were on the first. That either requires extra space, or running more hash functions.

Splitting the table into 2 solves that problem more efficiently. And statistically you need the same number of buckets to store the same number of things whether or not the table has been split. So your whole hash table becomes smaller.