The curse of dimensionality, where the time to query the nearest neighbor scales exponentially with the number of dimensions, seems to be fundamental to the nearest-neighbor problem.
For intuition: imagine we sort points on a 1D line. To find the nearest neighbor of a query point, you only have to look in two directions, left and right. In a 2D plane, there are four quadrants to check; in 3D, eight octants. The number of directions to check doubles with every added dimension.
Every known exact solution that uses a reasonable amount of space takes query time exponential in the dimension, so we have to resort to approximation.
The curse of dimensionality is a problem because the natural representation of data is high-dimensional. Document similarity, for example, naturally operates on vectors with one dimension per distinct word. Since we can’t solve problems in high dimensions well, we have to employ dimensionality reduction: representing points in high-dimensional space as points in a lower-dimensional space, preserving interpoint distances as much as possible, so that our calculations in the lower-dimensional space aren’t far off.
First, consider objects that have no notion of distance beyond “equal” and “not equal.” How do we represent them with fewer bits while preserving the equality relation?
One way is to hash each item to a 32-bit value, but that takes 32 bits of memory per item. To do better, hash each item and then take the result mod 2, so each fingerprint takes a single bit (0 or 1).
The properties of this mapping are:

- If $x = y$, then $h(x) = h(y)$ always, since hashing is deterministic.
- If $x \neq y$, then $h(x) = h(y)$ with probability $1/2$, since unequal items land on the same bit half the time.

So a single bit has a 50% false-positive rate for equality. For better accuracy, we use more hash functions.
If we repeat the experiment with $k$ independent hash functions, two unequal items agree on all $k$ bits with probability only $2^{-k}$. To achieve an error rate of $\epsilon$, we need just $k = \log_2(1/\epsilon)$ bits per item.
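A minimal sketch of this fingerprinting scheme in Python; salting one strong hash with the trial index is my stand-in for $k$ independent hash functions:

```python
import hashlib

def one_bit_hashes(item: str, k: int) -> list[int]:
    """k independent 1-bit fingerprints of an item."""
    # Salting SHA-256 with the trial index simulates k independent
    # hash functions; keep only the lowest bit of each digest.
    return [
        hashlib.sha256(f"{i}:{item}".encode()).digest()[0] & 1
        for i in range(k)
    ]

def probably_equal(a: str, b: str, k: int = 10) -> bool:
    # Equal items always agree; unequal items agree on all k bits
    # with probability 2**-k (about 0.1% for k = 10).
    return one_bit_hashes(a, k) == one_bit_hashes(b, k)
```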
The fingerprinting subroutine above approximately preserves a 0-1 function on object pairs (equality). What about distances between object pairs? What’s the analog of a hash for Euclidean distance?
The high-level idea is random projection, which leads to the Johnson-Lindenstrauss (JL) transform: if the only thing we care about is Euclidean distance, we can assume the number of dimensions is only in the hundreds.
Assume the data points live in $\mathbb{R}^n$ for some large $n$. The projection maps a point $x$ to the single number

$$f(x) = \langle r, x \rangle = \sum_{i=1}^{n} r_i x_i.$$

The random numbers $r_1, \dots, r_n$ are drawn independently from the standard Gaussian $N(0, 1)$. The inner product of a fixed vector with this random vector is then itself a Gaussian random variable, for reasons we’ll see next.
Gaussian distributions are useful for dimensionality reduction because of a few properties:

- They are symmetric around their mean.
- They are closed under addition: the sum of independent Gaussians is again Gaussian. Suppose $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ are independent; then $X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Neither point is necessarily interesting on its own. Together, though, they mean the sum of independent scaled Gaussians is a Gaussian whose mean and variance we can compute exactly.
Returning to the random projection function $f(x) = \langle r, x \rangle$: fix two points $x, y \in \mathbb{R}^n$. Thus,

$$f(x) - f(y) = \langle r, x - y \rangle = \sum_{i=1}^{n} r_i (x_i - y_i).$$

Since each $r_i \sim N(0, 1)$, each term $r_i (x_i - y_i)$ is distributed as $N(0, (x_i - y_i)^2)$, and by closure under addition,

$$f(x) - f(y) \sim N\!\left(0, \sum_{i=1}^{n} (x_i - y_i)^2\right) = N\!\left(0, \|x - y\|^2\right).$$

The variance thus is the square of the distance between the two points, so we next need a way to recover that distance from projections stored in a lower-dimensional space. Since the variance of a mean-zero random variable $Z$ is $\mathbf{E}[Z^2]$, we have

$$\mathbf{E}\!\left[(f(x) - f(y))^2\right] = \|x - y\|^2.$$

Thus, the squared difference of the two projections is an unbiased estimator of the squared interpoint distance.
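A quick numerical check of this claim in NumPy (a sketch; the dimension, seed, and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000                                  # original dimension
x, y = rng.normal(size=n), rng.normal(size=n)
true_sq_dist = np.sum((x - y) ** 2)

# Each row of R is one random projection vector r ~ N(0, I_n).
trials = 20_000
R = rng.normal(size=(trials, n))
estimates = (R @ (x - y)) ** 2             # (f(x) - f(y))^2 per trial

print(true_sq_dist)        # true ||x - y||^2
print(estimates.mean())    # close to it: the estimator is unbiased
print(estimates.std())     # spread comparable to the mean: one sample is noisy
```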
Using this random projection reduces the number of dimensions from $n$ to $1$, but as the check above shows, a single sample of the estimator is very noisy; we need to drive the variance down. To do this, we’ll use independent trials. Instead of picking a single vector $r$, pick $d$ independent vectors $r^{(1)}, \dots, r^{(d)}$ and map $x$ to the $d$-dimensional point $(\langle r^{(1)}, x \rangle, \dots, \langle r^{(d)}, x \rangle)$. We next need to know how large $d$ must be.
This is the Johnson-Lindenstrauss (JL) transform. The JL transform, for domain and range dimensions $n$ and $d$, is the random matrix $A \in \mathbb{R}^{d \times n}$ whose entries are drawn i.i.d. from $N(0, 1)$. This matrix defines a mapping from $\mathbb{R}^n$ to $\mathbb{R}^d$ via $x \mapsto \frac{1}{\sqrt{d}} A x$. For a fixed pair $x, y$ of points, the $j$th coordinate of $\frac{1}{\sqrt{d}} A (x - y)$ is an independent copy of the single projection above, scaled by $1/\sqrt{d}$. Thus, each term of $\|\frac{1}{\sqrt{d}} A (x - y)\|^2$ is the unbiased estimator from before divided by $d$, so the sum averages $d$ independent estimates and shrinks the variance by a factor of $d$. With $d = \Theta\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, the squared distance is preserved to within a $(1 \pm \epsilon)$ factor with probability at least $1 - \delta$; a union bound over all pairs of $m$ points gives $d = \Theta\!\left(\frac{\log m}{\epsilon^2}\right)$. Thus, as long as the only property we need to keep is interpoint Euclidean
distances, doing the computation on the $d$-dimensional projections is nearly as good as doing it on the original points.
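Here’s what the transform might look like in NumPy (a sketch; the function name and parameter choices are illustrative):

```python
import numpy as np

def jl_transform(points: np.ndarray, d: int, seed: int = 0) -> np.ndarray:
    """Project rows of `points` from R^n down to R^d via x -> Ax / sqrt(d)."""
    rng = np.random.default_rng(seed)
    n = points.shape[1]
    A = rng.normal(size=(d, n))            # i.i.d. N(0, 1) entries
    return (points @ A.T) / np.sqrt(d)

# 50 points in 5,000 dimensions, projected down to 400.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5_000))
Y = jl_transform(X, d=400)

print(np.linalg.norm(X[0] - X[1]))   # original distance
print(np.linalg.norm(Y[0] - Y[1]))   # close: preserved up to (1 ± eps)
```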
Some optimizations add structure to $A$ so the matrix-vector product $Ax$ can be computed with the fast Fourier transform; this is called the “fast JL transform”.
However, this only brings the number of dimensions down to around the hundreds. We need better.
AltaVista needed to filter search results to remove near-duplicate documents. They used Jaccard similarity for this, which for two sets $A$ and $B$ is

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
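In code, Jaccard similarity is a one-liner over Python sets (the word sets here are a toy stand-in for real document representations):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0      # convention for two empty sets
    return len(a & b) / len(a | b)

doc1 = {"the", "quick", "brown", "fox"}
doc2 = {"the", "quick", "brown", "dog"}
print(jaccard(doc1, doc2))   # 3 shared / 5 total = 0.6
```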
We’d like to do something similar to the JL transform, replacing Euclidean distance between points with Jaccard similarity between sets.
The random projection for sets is the MinHash routine: pick a random hash function $h$ (idealized as a random permutation of the universe), and map each set $S$ to $\min_{x \in S} h(x)$.
This gives a simple unbiased estimate of Jaccard similarity.
Consider some arbitrary pair of sets $A$ and $B$, where $h$ assigns a distinct random value to each element of $A \cup B$. The element achieving the minimum over $A \cup B$ is equally likely to be any of its elements, and the two minima coincide exactly when that element lies in $A \cap B$. So:

$$\Pr[\mathrm{MinHash}(A) = \mathrm{MinHash}(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B).$$

We can do the same trick as before, averaging $k$ independent MinHash collision indicators to drive down the variance of the estimate.
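A sketch of the estimator (again salting a single strong hash to simulate independent hash functions; the names are mine):

```python
import hashlib

def minhash(s: set[str], seed: int) -> int:
    """Map a set to the minimum of a salted 64-bit hash over its elements."""
    return min(
        int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
        for x in s
    )

def estimate_jaccard(a: set[str], b: set[str], k: int = 200) -> float:
    # Each collision indicator is 1 with probability J(A, B), so the
    # average of k independent indicators is an unbiased estimate.
    return sum(minhash(a, i) == minhash(b, i) for i in range(k)) / k

doc1 = {"the", "quick", "brown", "fox"}
doc2 = {"the", "quick", "brown", "dog"}
print(estimate_jaccard(doc1, doc2))   # ≈ 0.6
```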
One problem remains: an ordinary hash function scatters near-duplicates into different buckets, so we can’t just hash all objects into buckets and then do interpoint search within each bucket to filter out duplicates. This is what Locality Sensitive Hashing (LSH) is for. It’s a way to hash near neighbors into the same bucket, so the bucket-then-filter technique works.
A locality sensitive hash is a hash function $h$ for which the collision probability $\Pr[h(x) = h(y)]$ is high when $x$ and $y$ are close and low when they are far apart. The MinHash routine is exactly such a function: its collision probability equals the Jaccard similarity of the two sets.
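A minimal sketch of the bucketing step, using MinHash as the locality sensitive hash (function and variable names are mine):

```python
import hashlib
from collections import defaultdict

def minhash(s: set[str], seed: int = 0) -> int:
    return min(
        int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
        for x in s
    )

def lsh_buckets(docs: dict[str, set[str]]) -> dict[int, list[str]]:
    """Group documents by MinHash value.

    Two documents share a bucket with probability equal to their Jaccard
    similarity, so near-duplicates can be found by comparing only within
    buckets rather than across all pairs.
    """
    buckets: dict[int, list[str]] = defaultdict(list)
    for name, words in docs.items():
        buckets[minhash(words)].append(name)
    return dict(buckets)
```

Real systems typically combine several MinHash values into each bucket key to sharpen the collision probability curve, but the single-hash version already shows the idea.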