Cuckoo hashing for sketching sets

Sign up for the new posts via the RSS feed.

Below I show a neat application of perfect hashing, which is one of my favorite (cluster of) algorithms. Amazingly, we use it to obtain a purely information-theoretic (rather than algorithmic) statement.

Suppose we have a finite universe $U$ of size $n$ and a $k$-element subset of it $S \subseteq U$ with $k \ll n$. How many bits do we need to encode it? The obvious answer is $\log_2 \binom{n}{k} = \Theta(k \cdot \log(n / k))$.
Can we, however, improve this bound if we allow some approximation?

Even if $n = 2k$, it is not difficult to show the lower bound of $k \cdot \log_2(1 / \delta)$ bits if we allow to be wrong when answering queries “does $x$ belong to $S$?” with probability at most $\delta$ (hint: $\varepsilon$-nets). Can we match this lower bound?

One approach that does not quite work is to hash each element of $S$ to an $l$-bit string using a sufficiently good hash function $h \colon U \to \{0, 1\}^l$, and, when checking if $x$ lies in $S$, compute $h(x)$ and check if this value is among the hashes of $S$. To see why it does not work, let us analyze it: if $x \notin S$, then the probability that $h(x)$ coincides with at least one hash of an element of $S$ is around $k \cdot 2^{-l}$. To make the latter less than $\delta$, we need to take $l = \log_2(k / \delta)$ yielding the overall bound of $k \cdot \log_2(k / \delta)$ falling short of the desired size.

To get the optimal size, we need to avoid using the union bound in the above argument. In order to accomplish this, let us use perfect hashing on top of the above hashing scheme! It is convenient to use a particular approach to perfect hashing called Cuckoo hashing. In short, there is a way to generate two simple hash functions $h_1, h_2 \in U \to [m]$ for $m = O(k)$ and place the elements of our set $S$ into $m$ bins without collisions so that for every $x \in S$, the element $x$ is placed either in $h_1(x)$ or in $h_2(x)$. Now, to encode our set $S$, we build a Cuckoo hash table for it, and then for each of the $m$ bins, we either store one bit indicating that it’s empty, or store an $l$-bit hash of an element that is placed into it. Now we can set $l = \log_2(2 / \delta)$, since we compare the hash of a query to merely two hashes, instead of $k$. This gives the overall size $m + k \cdot \log_2 (2 / \delta) = k \cdot (\log_2(1 / \delta) + O(1))$, which is optimal up to a low-order term. Of course, the encoding should include $h_1$, $h_2$ and $h$, but it turns out they can be taken to be sufficiently simple so that their size does not really matter.

Two remarks are in order. First, in this context people usually bring up Bloom filters. However, they require space, which is $1.44$ times bigger, and, arguably, they are more mysterious (if technically simple). Second, one may naturally wonder why anyone would care about distinguishing bounds like $k \cdot \log_2 (1 / \delta)$ and $k \cdot \log_2(k / \delta)$. In my opinion, there are two answers to this. First, it is just a cool application of perfect hashing (an obligatory link to one of my favorite comic strips). Second, compressing sets is actually important in practice and constant factors do matter, for instance when we are aiming to transfer the set over the network.

Update Kasper Green Larsen observed that we can combine the naive and not-quite-working solutions to obtain the optimal bound. Namely, by hashing everything to $\log_2(k / \delta)$ bits, we effectively reduce the universe size to $n’ = k / \delta$. Then, the naive encoding takes $\log_2 \binom{n’}{k} \approx H(\delta) \cdot n’ = H(\delta) \cdot k / \delta \approx
k \cdot \log_2 (1 / \delta)$ bits.