sublinear solutions. The good news is that vast improvements are possible in both theory
and practice, once we settle for approximate answers. In practice, approximate answers
prove to be sufficient even if one looks for exact answers: for many natural datasets, there are only a few entries that are close to being an optimal answer (false positives), and thus we can efficiently filter them out.
The above is just an example of how high-dimensional geometry emerges in big data
questions. In general, the geometry is not necessarily Euclidean, and indeed many interesting
questions appear for distances such as the Hamming, earthmover, and edit (Levenshtein) distances, to name just a few. This geometric perspective confers a number of
benefits. First, it brings in our geometric intuition to argue about sets of objects as sets of
points. Second, the geometric perspective allows us to bring in powerful tools from areas such as metric embeddings and functional analysis.
In this chapter, we survey the techniques of high-dimensional computational
geometry through the prism of the NNS problem. From a broader perspective, NNS is just
one example of the questions arising when dealing with massive datasets. Yet these questions
often benefit from common solution concepts that already surface when considering the NNS
problem. In fact, oftentimes, NNS itself serves as a building block for the solutions of the
other questions. Hence, we focus on NNS to have a consistent storyline.
We will start by looking at the aforementioned setting: high-dimensional Euclidean
space, denoted by $\mathbb{R}^d$, where $d$ is the dimension. This is perhaps the most basic high-
dimensional setting. In general, we expect to see a trade-off between the complexity of
the geometric structure and the algorithmic efficiency. We now give the exact definition of
the approximate nearest neighbor problem.
Definition 7.1 ($c$-approximate nearest neighbor) Given a set $P$ of $n$ points in a $d$-dimensional space $\mathbb{R}^d$, and an approximation factor $c > 1$, construct a data structure such that, given any query point $q$, the data structure returns some point $p \in P$ such that $d(q, p) \le c \cdot \min_{p^* \in P} d(q, p^*)$.
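To make the guarantee concrete, the following is a minimal Python sketch (the helper names and the use of Euclidean distance here are illustrative assumptions, not part of the definition). A brute-force linear scan returns an exact nearest neighbor and hence trivially meets the guarantee for any $c > 1$, though at a cost of $O(nd)$ time per query; the second function checks whether an arbitrary candidate answer satisfies the $c$-approximate condition.

```python
import numpy as np

def exact_nn(P, q):
    """Linear scan: the exact nearest neighbor of q among the rows of P.
    A valid (but slow) answer for c-approximate NN, taking O(n*d) time per query."""
    dists = np.linalg.norm(P - q, axis=1)      # d(q, p) for every p in P
    return P[np.argmin(dists)]

def is_c_approximate(P, q, p, c):
    """Checks the guarantee of Definition 7.1: d(q, p) <= c * min_{p* in P} d(q, p*)."""
    d_qp = np.linalg.norm(q - p)
    d_opt = np.linalg.norm(P - q, axis=1).min()
    return d_qp <= c * d_opt
```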
7.2 Dimension Reduction
If high dimensionality presents a problem, why not reduce it? Indeed, the high-dimensional
geometry 101 technique is to map the points into a space of smaller dimension k, while
preserving the distances. The most classic result is the Johnson–Lindenstrauss (JL) lemma
[66]. It says that a projection onto a random $k$-dimensional subspace suffices, provided $k$ is large enough.
Lemma 7.1 (JL [66]) Let $A$ be the projection onto a random $k$-dimensional subspace of $\mathbb{R}^d$, scaled by $\sqrt{d/k}$. Then, for any fixed $\epsilon > 0$ and points $x, y \in \mathbb{R}^d$, we have that
$$\Pr_A\left[\frac{\|Ax - Ay\|}{\|x - y\|} \in (1 - \epsilon, 1 + \epsilon)\right] \ge 1 - e^{-\Omega(\epsilon^2 k)}.$$
An important consequence is that we can argue about distances among a set of $n$ points. In particular, setting $k = \Theta(C\epsilon^{-2} \log n)$, we obtain a probability of preserving a fixed distance as high as $1 - 1/n^C$, for an arbitrarily large constant $C$. By a union bound over the $n^2$ pairwise difference vectors, all pairwise distances among the $n$ points are preserved with probability at least $1 - n^{-C+2}$.
In fact, this is the most common usage of the lemma.
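This usage translates directly into a short numerical sketch. The version below uses a $k \times d$ matrix with i.i.d. Gaussian entries scaled by $1/\sqrt{k}$, a standard substitute for the exact random-subspace projection of Lemma 7.1 that enjoys the same guarantee; the dataset, the constant $C$, and the value of $\epsilon$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 10_000                # many points in a very high-dimensional space
eps, C = 0.2, 4.0                  # distortion parameter and union-bound constant
k = int(np.ceil(C * eps**-2 * np.log(n)))   # k = Theta(C * eps^-2 * log n)

X = rng.normal(size=(n, d))        # illustrative dataset

# Oblivious JL map: i.i.d. Gaussian entries, scaled by 1/sqrt(k).
A = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ A.T                        # each point mapped from R^d down to R^k

# Empirically check that pairwise distances are preserved up to a (1 +/- eps) factor.
i, j = rng.integers(0, n, size=500), rng.integers(0, n, size=500)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
keep = orig > 0                    # skip pairs where the two indices coincide
ratios = proj[keep] / orig[keep]
print(k, ratios.min(), ratios.max())   # ratios concentrate in (1 - eps, 1 + eps)
```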
Another crucial aspect of the above map is that it is oblivious, that is, independent of
the set of points. The alternative, a nonoblivious map, would have to first look at the n
points before constructing the actual map A. The advantage of an oblivious map is that we
can apply it to a new point (sometimes called out-of-sample extension).
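Continuing the sketch above (with the same hypothetical names $A$, $X$, and $Y$), obliviousness simply means that the stored matrix $A$ is applied, unchanged, to any point that arrives later:

```python
q = rng.normal(size=d)             # a new query point, unseen when A was drawn
q_low = A @ q                      # out-of-sample extension: same map, no recomputation

# With high probability, the distances from q to the dataset are preserved as well,
# so the (approximate) nearest neighbor can be searched for in R^k instead of R^d.
nearest_idx = np.argmin(np.linalg.norm(Y - q_low, axis=1))
```

A nonoblivious map would not allow this: it would have to be recomputed, or at least extended, once the new point is known.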