Prev: property-preserving-lossy-compression Next: curse-of-dimensionality
Similarity search involves the basic problem of organizing a dataset so that similar items can be found quickly.
There are two common framings of this question:
All the data is present, and one wants to find pairs or sets of similar items from within the dataset.
There is a reference dataset that one has plenty of time to process cleverly, but when were are given a new datapoint, we want to quickly return a similar datapoint from the reference dataset.
There are many real world applications for this:
Jaccard similarity, denoted by
We can represent sets as vectors as well.
Jaccard similarity works well in practice for sparse data. For example, representing documents in terms of the multiset of words they contain works well with Jaccard similarity. As well, for similar customers (e.g. Amazon/Netflix) it works well.
Given datapoints in
As well, Euclidean distance is rotationally invariant (the value does not change when arbitrary rotations are applied to it).
This is a pretty good metric, but this requires that your axes mean something.
Other metrics include “cosine similarity”, the similarity of two vectors, edit distance, etc.
Given a set of points, and a distance metric
Formally, given a distance metric
This is called “metric embedding”, an embedding in
A KD-tree allows one to quickly find the closest point in a dataset
to a given query point. It works pretty well as long as the
dimensionality of the space isn’t
Each node will contain the index of a dimension,
The tree will have linear size to the initial set of points, and have
a depth of
To find the closest node to a given node
In low dimensions, the answer will be no, and the search will be efficient. However, for higher dimensions, this answer is more likely than not maybe, which makes the search less efficient.
Many algorithms have a running time that scales exponentially with the dimension. This is because higher dimensional spaces lack geometry. They have lots and lots of points where the points have roughly the same distance.
Example: What is the largest number of points that fit in
If
Prev: property-preserving-lossy-compression Next: curse-of-dimensionality