 # Lullaby of Cape Cod

English blog of Ilya Razenshteyn

# Cuckoo hashing for sketching sets

Below I show a neat application of perfect hashing, which is one of my favorite (cluster of) algorithms. Amazingly, we use it to obtain a purely information-theoretic (rather than algorithmic) statement.

Suppose we have a finite universe $U$ of size $n$ and a $k$-element subset of it $S \subseteq U$ with $k \ll n$. How many bits do we need to encode it? The obvious answer is $\log_2 \binom{n}{k} = \Theta(k \cdot \log(n / k))$.
Can we, however, improve this bound if we allow some approximation?

Even if $n = 2k$, it is not difficult to show the lower bound of $k \cdot \log_2(1 / \delta)$ bits if we allow to be wrong when answering queries “does $x$ belong to $S$?” with probability at most $\delta$ (hint: $\varepsilon$-nets). Can we match this lower bound?

One approach that does not quite work is to hash each element of $S$ to an $l$-bit string using a sufficiently good hash function $h \colon U \to \{0, 1\}^l$, and, when checking if $x$ lies in $S$, compute $h(x)$ and check if this value is among the hashes of $S$. To see why it does not work, let us analyze it: if $x \notin S$, then the probability that $h(x)$ coincides with at least one hash of an element of $S$ is around $k \cdot 2^{-l}$. To make the latter less than $\delta$, we need to take $l = \log_2(k / \delta)$ yielding the overall bound of $k \cdot \log_2(k / \delta)$ falling short of the desired size.

To get the optimal size, we need to avoid using the union bound in the above argument. In order to accomplish this, let us use perfect hashing on top of the above hashing scheme! It is convenient to use a particular approach to perfect hashing called Cuckoo hashing. In short, there is a way to generate two simple hash functions $h_1, h_2 \in U \to [m]$ for $m = O(k)$ and place the elements of our set $S$ into $m$ bins without collisions so that for every $x \in S$, the element $x$ is placed either in $h_1(x)$ or in $h_2(x)$. Now, to encode our set $S$, we build a Cuckoo hash table for it, and then for each of the $m$ bins, we either store one bit indicating that it’s empty, or store an $l$-bit hash of an element that is placed into it. Now we can set $l = \log_2(2 / \delta)$, since we compare the hash of a query to merely two hashes, instead of $k$. This gives the overall size $m + k \cdot \log_2 (2 / \delta) = k \cdot (\log_2(1 / \delta) + O(1))$, which is optimal up to a low-order term. Of course, the encoding should include $h_1$, $h_2$ and $h$, but it turns out they can be taken to be sufficiently simple so that their size does not really matter.

Two remarks are in order. First, in this context people usually bring up Bloom filters. However, they require space, which is $1.44$ times bigger, and, arguably, they are more mysterious (if technically simple). Second, one may naturally wonder why anyone would care about distinguishing bounds like $k \cdot \log_2 (1 / \delta)$ and $k \cdot \log_2(k / \delta)$. In my opinion, there are two answers to this. First, it is just a cool application of perfect hashing (an obligatory link to one of my favorite comic strips). Second, compressing sets is actually important in practice and constant factors do matter, for instance when we are aiming to transfer the set over the network.

Update Kasper Green Larsen observed that we can combine the naive and not-quite-working solutions to obtain the optimal bound. Namely, by hashing everything to $\log_2(k / \delta)$ bits, we effectively reduce the universe size to $n’ = k / \delta$. Then, the naive encoding takes $\log_2 \binom{n’}{k} \approx H(\delta) \cdot n’ = H(\delta) \cdot k / \delta \approx k \cdot \log_2 (1 / \delta)$ bits.

2019

# STOC 2019 workshop “Data Science through a Geometric Lens”

I would like to bring up a STOC 2019 workshop Data Science through a Geometric Lens (co-organized with Sanjoy Dasgupta and Cyrus Rashchtian). It will happen on Sunday at 9am. We have a stellar line-up of speakers:

Some of the topics are a bit unusual for the STOC audience, and we hope you will be inspired by fresh ideas. Please do attend and bring your friends!

2019

# STOC 2018 talk of Erik Waingarten

At STOC 2018, Erik Waingarten (who is starting his fourth year of PhD at Columbia) gave a beautiful talk on the joint work regarding nearest neighbor search in general metric spaces that I briefly described before. Even though the talk involves some math, it is certifiably understandable by non-theorists!

2018

# ICM 2018 survey

Together with Alex Andoni and Piotr Indyk, we have just uploaded on arXiv a survey on (the theory of) the high-dimensional nearest neighbor search problem (NNS), which accompanies Piotr’s talk at the upcoming International Congress of Mathematicians in Rio de Janeiro. At the same congress, Assaf Naor will give a plenary talk about dimension reduction; the corresponding paper will hopefully appear online soon.

In the survey, we touch upon both classic and new topics. Let me briefly go over the sections:

1. In the introduction, we define and motivate the problem and put it in the historical context.
2. We describe two classic frameworks based on dimension reduction and locality-sensitive hashing (LSH). In addition to that, we list known LSH families.
3. We survey some deterministic and Las Vegas data structures for the problem.
4. We cover recent line of work on the data-dependent LSH (where we tailor hash families to a dataset at hand), which constitutes the bulk of my thesis. Besides, we cover a beautiful and underappreciated algorithm for NNS over the $\ell_\infty$ metric.
5. We go over recent exciting developments for the closest-pair problem (an off-line version of NNS, which makes it somewhat easier), which are based on the polynomial method and fast matrix multiplication.
6. Finally, we talk about NNS for metrics beyond $\ell_1$, $\ell_2$ and $\ell_\infty$. There are three main tools on this front: metric embeddings (deterministic and randomized), NNS data structures for direct sums of “tractable” metric spaces, and (since very recently) randomized partitions obtained via estimates on nonlinear spectral gaps (featured in the previous post).

We are looking forward to your feedback!

2018

# STOC 2018 talks and posters

I’m heading out to Los Angeles to attend STOC 2018 (and, more generally, TheoryFest). My co-authors will present two papers, which I was lucky to be involved with:

Data-Dependent Hashing via Nonlinear Spectral Gaps (joint with Alex Andoni, Assaf Naor, Sasho Nikolov and Erik Waingarten). Paper, poster. In this paper, we develop a new technique for randomized partitioning of general metric spaces, which is based on estimates for nonlinear spectral gaps (see here for a thorough overview). These random partitions imply very efficient data structures in the cell-probe model for the high-dimensional approximate nearest neighbor search problem (ANN) (see here for a survey) over general metric spaces. With more work, we can get true time-efficient data structures for several cases of interest: most notably, in the forthcoming paper, we obtain the ANN algorithm for a general $d$-dimensional normed space with approximation $2^{\widetilde{O}(\sqrt{\log d})}$ (the previous best result was the trivial $O(\sqrt{d})$ bound, which readily follows from the classic John’s theorem). This paper is the culmination of a line of work (“nearest neighbor search for general distances”), which I’m involved with for three years (see also our previous paper, where we essentially settle the case of symmetric normed spaces). The connection with spectral gaps we discovered is, in my opinion, a true gem, which we had been overlooking for years. I will blog about it in a greater detail later.

Nonlinear Dimension Reduction via Outer Bi-Lipschitz Extensions (joint with Sepideh Mahabadi, Konstantin Makarychev and Yury Makarychev). Paper, poster. We develop a new geometric primitive, an outer bi-Lipschitz extension, and use it to solve two open problems raised by Elkin, Filtzer and Neiman (posed here and here). In both cases, we obtain new dimension reduction results, which are inherently nonlinear. 2018

# Beyond John’s ellipsoid

Here is the video of my talk from the Learning and Optimization workshop, which was a part of Microsoft TechFest.

This covers (parts of) our very fresh work (joint with Alex Andoni, Assaf Naor, Sasho Nikolov and Erik Waingarten), which is currently under submission. In short, we show the first Approximate Nearest Neighbor (ANN) search1 algorithm for general $d$-dimensional normed spaces with approximation $d^{o(1)}$. The only previously known bound is the trivial $O(\sqrt{d})$, which immediately follows from John’s theorem and any good ANN data structure for the Euclidean distance.

A couple of minor glitches:

1. The Wasserstein-$p$ distance is not a norm for $p > 1$;
2. In the last couple of slides, I started writing $D(\varepsilon)$ instead of $R(\varepsilon)$, but it is meant to be the same.

1 For some flavor of the problem, see this classic, or – in case you are more adventurous – my PhD thesis.

2018

Let $K$ be a convex body in $\mathbb{R}^d$ symmetric around the origin. Let $\mathcal{D}_1$ be a distribution supported on $K$, and let $\mathcal{D}_2$ be a distribution supported outside of $2K$. Can one always find a direction $v$ such that given projections of $\mathrm{poly}(d)$ i.i.d. samples from $\mathcal{D}_i$ on $v$, one can recover $i$ with high probability?
For example, suppose that $K = [-1, 1]^d$ is a cube. Then, each point in the support of $\mathcal{D}_2$ has at least one coordinate whose absolute value is large. It means that there exists a coordinate such that with probability at least $1/d$ over $\mathcal{D}_2$ it is outside of $[-2, 2]$. But then, using this coordinate, one can distinguish $\mathcal{D}_1$ and $\mathcal{D}_2$ using only $O(d)$ samples.