Similarity Search
Similarity search is the basic problem of organizing a dataset so that similar items can be found quickly.
There are two common framings of this question:
- All the data is present, and one wants to find pairs or sets of similar items within the dataset.
- There is a reference dataset that one has plenty of time to preprocess cleverly, but when we are given a new datapoint, we want to quickly return a similar datapoint from the reference dataset.
There are many real-world applications:
- Similar documents, web pages, or genomes; deduplication; plagiarism detection; finding similar genes or chemicals.
- Collaborative filtering: finding similar products or users based on their characteristics.
- Machine learning / classification (e.g., nearest-neighbor classifiers).
- Combining datasets.
- Fast labeling.
Measures of Similarity
Jaccard Similarity
Jaccard similarity, denoted by $J(A, B)$, is a measure of similarity between two sets $A$ and $B$:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

(The corresponding Jaccard distance, $1 - J(A, B)$, is a distance metric.)
We can represent sets as binary indicator vectors as well, with one coordinate per possible element.
Jaccard similarity works well in practice for sparse data. For example, representing documents as the multiset of words they contain pairs well with Jaccard similarity, as does finding similar customers (e.g., at Amazon or Netflix).
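As a concrete illustration, here is a minimal Python implementation of Jaccard similarity on word sets (the example documents are made up):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0                    # convention: two empty sets are identical
    return len(a & b) / len(a | b)

doc1 = set("the quick brown fox".split())
doc2 = set("the lazy brown dog".split())
print(jaccard(doc1, doc2))            # 2 shared words / 6 distinct words = 0.333...
```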
Euclidean Distance / $\ell_2$ Distance, and $\ell_p$ Distance
Given datapoints $x, y \in \mathbb{R}^d$, the Euclidean ($\ell_2$) distance metric is:

$$\|x - y\|_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

More generally, the $\ell_p$ distance is $\|x - y\|_p = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$, with $p = 1$ giving the Manhattan distance.
Euclidean distance is also rotationally invariant: its value does not change when an arbitrary rotation is applied to the data.
This is a pretty good metric, but it requires that your axes mean something, i.e., that the coordinates are on comparable scales.
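A quick sketch of these distances using NumPy (the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(np.linalg.norm(x - y))          # Euclidean (l2) distance: 5.0
print(np.linalg.norm(x - y, ord=1))   # l1 (Manhattan) distance: 7.0
print(np.linalg.norm(x - y, ord=3))   # general l_p distance, here p = 3
```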
Other similarity metrics
Other measures include cosine similarity (the cosine of the angle between two vectors), edit distance between strings, etc.
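For instance, cosine similarity has a one-line implementation (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; ranges from -1 to 1."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```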
The Relationships Between Metrics, and Metric Embeddings
Given a set of points and a distance metric $d$, is it possible to map the points into a set of points in $\mathbb{R}^k$, such that the distance between points is approximated by the Euclidean distance?

Formally, given a distance metric $d$ and a set of points $x_1, \ldots, x_n$, is it possible to map the points to a set $y_1, \ldots, y_n \in \mathbb{R}^k$, where for all $i, j$, $d(x_i, x_j) \approx \|y_i - y_j\|_2$?
This is called a "metric embedding": an embedding into $\mathbb{R}^k$ under the Euclidean distance.
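One standard way to compute such an embedding from a distance matrix is classical multidimensional scaling (MDS); the notes don't commit to a particular method, so the NumPy sketch below is only meant to show concretely what an embedding looks like:

```python
import numpy as np

def classical_mds(D, k):
    """Embed an n x n matrix of pairwise distances D into R^k (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    L = np.maximum(eigvals[top], 0.0)       # clip tiny negative eigenvalues
    return eigvecs[:, top] * np.sqrt(L)     # the n x k embedding y_1, ..., y_n

# Points on a line, described only by their pairwise distances.
x = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(x[:, None] - x[None, :])
Y = classical_mds(D, k=1)
print(np.abs(Y[:, None, 0] - Y[None, :, 0]) - D)  # residuals are ~0 here
```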
A Data Structure for Similarity Search: KD-trees
A KD-tree allows one to quickly find the closest point in a dataset to a given query point. It works pretty well as long as the dimensionality of the space isn't too large. A KD-tree is like a binary search tree that partitions space.
Each internal node $v$ will contain the index of a dimension, $j$, and a value $m$.
- When building the tree at a node $v$, choose a dimension $j$ from $\{1, \ldots, d\}$.
- Fix $m$ as the median of the $j$-th coordinates of the points in the current set $S$. Store dimension $j$ and median $m$ at node $v$. Partition $S$ into $S_{\leq}$ and $S_{>}$, according to whether the $j$-th coordinate of each point exceeds $m$.
- Make two children of $v$ from $S_{\leq}$ and $S_{>}$, and recurse on both children.
The tree will have size linear in the initial set of points, and depth $O(\log n)$, since the median split halves the point set at each level (a short sketch of the construction appears below).
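A minimal sketch of this construction in Python (the node layout and the depth-cycling choice of $j$ are illustrative assumptions):

```python
import numpy as np

class Node:
    """A KD-tree node: internal nodes store a split (j, m); leaves store a point."""
    def __init__(self, dim=None, median=None, left=None, right=None, point=None):
        self.dim, self.median = dim, median
        self.left, self.right = left, right
        self.point = point

def build(points, depth=0):
    """Recursively build a KD-tree over an (n, d) array of points."""
    if len(points) == 1:
        return Node(point=points[0])
    j = depth % points.shape[1]             # cycle through dimensions by depth
    m = np.median(points[:, j])             # split at the median of coordinate j
    left, right = points[points[:, j] <= m], points[points[:, j] > m]
    if len(left) == 0 or len(right) == 0:   # all j-th coordinates equal; stop splitting
        return Node(point=points[0])
    return Node(j, m, build(left, depth + 1), build(right, depth + 1))

tree = build(np.random.default_rng(0).random((1000, 2)))
```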
To find the closest point in the tree to a given query point $q$, we first traverse the tree to find the leaf where $q$ would end up. We then backtrack through the tree, at each juncture asking whether the closest point to $q$ could have ended up down the other path.
In low dimensions, the answer will usually be no, and the search will be efficient. In higher dimensions, however, the answer is more often "maybe," which forces the search to explore both subtrees and makes it less efficient.
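In practice one would typically use an off-the-shelf implementation; for example, SciPy's `cKDTree` supports exactly this kind of nearest-neighbor query:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))     # reference dataset in R^3
tree = cKDTree(points)               # preprocess the reference set once

query = rng.random(3)
dist, idx = tree.query(query, k=1)   # nearest neighbor: distance and index
print(points[idx], dist)
```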
The Curse of Dimensionality
Many algorithms have running times that scale exponentially with the dimension. This is because high-dimensional spaces lack the geometric structure our low-dimensional intuition relies on: they contain huge numbers of points that are all roughly the same distance from one another.
Example: What is the largest number of points that fit in $d$-dimensional space such that all pairwise distances are approximately equal (say, within a factor of $1.1$ of each other)?

If $d = 1$: at most 2 points. If $d = 2$: at most 3 points. If $d = 100$: at least several thousand.
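A quick experiment (with arbitrary sample sizes) makes this distance concentration visible: as $d$ grows, the relative spread of pairwise distances between random points shrinks toward zero.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [1, 2, 10, 100, 1000]:
    x = rng.standard_normal((500, d))     # 500 random Gaussian points in R^d
    dists = pdist(x)                      # all pairwise Euclidean distances
    spread = dists.std() / dists.mean()   # relative spread of the distances
    print(f"d={d:5d}  relative spread of pairwise distances: {spread:.3f}")
```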