Prev: regularization Next: how-pca-works
PCA is used to visualize the data in a lower-dimensional space, to understand sources of variability in the data, and to understand correlations between different coordinates of data points.
Take this table, in which four friends rate four foods (kale, taco bell, sashimi, and pop tarts) on a scale of 1 to 10.
| | kale | taco bell | sashimi | pop tarts |
|---|---|---|---|---|
| Alice | 10 | 1 | 2 | 7 |
| Bob | 7 | 2 | 1 | 10 |
| Carolyn | 2 | 9 | 7 | 3 |
| Dave | 3 | 6 | 10 | 2 |
There are 4 data points (one per friend), each a vector in $\mathbb{R}^4$ (one coordinate per food). We can represent the data points as vectors:
$$x_{\text{Alice}} = (10, 1, 2, 7), \quad x_{\text{Bob}} = (7, 2, 1, 10), \quad x_{\text{Carolyn}} = (2, 9, 7, 3), \quad x_{\text{Dave}} = (3, 6, 10, 2).$$
The average of all data points is the vector:
$$\bar{x} = \tfrac{1}{4}\left(x_{\text{Alice}} + x_{\text{Bob}} + x_{\text{Carolyn}} + x_{\text{Dave}}\right) = (5.5, 4.5, 5, 5.5).$$
Each data point can then be approximately expressed as
$$x_i \approx \bar{x} + a_i v_1 + b_i v_2,$$
where the vectors $v_1$ and $v_2$ are fixed directions shared by the whole data set, and $a_i$ and $b_i$ are scalars specific to data point $i$. So to calculate Alice's scores, you would compute $\bar{x} + a_{\text{Alice}} v_1 + b_{\text{Alice}} v_2$.
This representation is useful both for plotting the data and for interpreting it.
If we look at $v_1$, its coordinates for kale and pop tarts have one sign and its coordinates for taco bell and sashimi the opposite sign. Since we know that kale and pop tarts are vegetarian, and taco bell and sashimi are not, one guess could be that $v_1$ measures how much a person prefers vegetarian food, with $a_i$ recording where friend $i$ falls along that axis. As well, the sign pattern of $v_2$'s coordinates may suggest a second axis of taste along which the friends vary. Once normalized to unit length, $v_1$ and $v_2$ are precisely the kind of vectors that PCA computes automatically from the data.
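This decomposition can be sketched with numpy. Note that the vectors returned by the SVD are whatever directions best fit the data numerically; they need not coincide with the hand-interpreted vectors discussed above.

```python
import numpy as np

# Rows: Alice, Bob, Carolyn, Dave; columns: kale, taco bell, sashimi, pop tarts.
X = np.array([[10, 1, 2, 7],
              [7, 2, 1, 10],
              [2, 9, 7, 3],
              [3, 6, 10, 2]], dtype=float)

xbar = X.mean(axis=0)            # average data point: (5.5, 4.5, 5, 5.5)
C = X - xbar                     # centered data

# The SVD of the centered data yields the best-fitting directions.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
v1, v2 = Vt[0], Vt[1]            # top two directions (unit vectors)

# Approximate each friend as xbar + a_i * v1 + b_i * v2.
a = C @ v1                       # each friend's coefficient along v1
b = C @ v2                       # each friend's coefficient along v2
approx = xbar + np.outer(a, v1) + np.outer(b, v2)
err = np.linalg.norm(X - approx)  # Frobenius norm of the residual
```

The residual `err` equals the square root of the sum of the squared discarded singular values, so two shared directions already capture most of the variation in this table.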
The goal of PCA is to approximately express each data point as a linear combination of a small, shared set of vectors. For each choice of the target dimension $k$, PCA finds orthonormal vectors $v_1, \ldots, v_k$ such that every data point is well approximated as $x_i \approx \bar{x} + \sum_{j=1}^{k} a_{ij} v_j$ for suitable scalars $a_{ij}$.
This is similar in spirit to the Johnson-Lindenstrauss (JL) dimensionality reduction technique. There are a few differences: JL chooses its directions at random, oblivious to the data, while PCA computes its directions from the data itself; and JL aims to approximately preserve all pairwise distances, while PCA aims to minimize the error of approximating the points within a low-dimensional subspace.
Linear regression and PCA both compute a best-fitting line (or, more generally, subspace) for a set of points, but they treat the coordinates differently. In linear regression, one distinguished coordinate (the label) is modeled as a function of the others; in PCA, all coordinates are treated equally, and no coordinate is singled out as depending on the rest.
As well, linear regression and PCA use different definitions of best fit. Linear regression minimizes the sum of squared vertical distances between the line and the data points, because in linear regression the coordinate corresponding to the labels is the important one: errors are measured only in the label's direction. In PCA, the sum of squared perpendicular distances is minimized instead, reflecting the fact that all coordinates play a symmetric role, which makes Euclidean distance the most appropriate error measure.
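The two definitions of best fit can be compared directly. In this sketch, the toy data points are made up for illustration, and numpy's SVD stands in for one-dimensional PCA:

```python
import numpy as np

# A toy 2-d data set.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 1.0, 2.0])

# Linear regression: minimize squared vertical distances.
slope_ols = np.polyfit(x, y, 1)[0]

# PCA: minimize squared perpendicular distances, i.e. take the
# top principal component of the centered point cloud.
P = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(P, full_matrices=False)
v = Vt[0]
slope_pca = v[1] / v[0]          # ratio is unaffected if the SVD negates v

print(slope_ols, slope_pca)      # 0.6 vs roughly 0.618: the two lines differ
```

Even on four points, the vertical-distance and perpendicular-distance criteria give visibly different lines.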
First, preprocess the data by centering the points around the origin: replace each $x_i$ by $x_i - \bar{x}$, so that the average of the shifted points is the all-zeros vector. After PCA has been run, the shift can be undone. Centering makes the linear algebra easier to apply.
It is also important to scale each coordinate, so that the units in which a coordinate is measured do not affect the best-fit line. One common approach is to take the centered points and divide each coordinate by its standard deviation, so that every coordinate has unit variance. If this isn't done, then the line of best fit changes with the choice of unit, say between kilometers and miles, even though the underlying data is the same, just measured in different units.
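This standardization step can be sketched as follows; the resulting z-scores are identical whether a coordinate was recorded in kilometers or miles:

```python
import numpy as np

def standardize(X):
    """Center each column, then scale it to unit standard deviation."""
    Z = X - X.mean(axis=0)
    return Z / Z.std(axis=0)

rng = np.random.default_rng(0)
X_km = rng.normal(size=(50, 3)) * [1.0, 10.0, 100.0]   # coordinates in mixed units
X_mi = X_km.copy()
X_mi[:, 0] *= 0.621371          # same data, first coordinate converted to miles

# After standardization the unit change disappears entirely,
# so the best-fit line (and PCA) is unaffected by it.
assert np.allclose(standardize(X_km), standardize(X_mi))
```

Multiplying a column by a positive constant rescales both the column and its standard deviation by the same factor, which is why the z-scores cancel the unit out.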
The best-fit line for a dataset, in the PCA sense, is the line through the origin that minimizes the sum of squared Euclidean distances between the points and the line. Two aspects of this definition deserve emphasis. First, the line is required to pass through the origin; this is harmless because the data has already been centered. Second, the point-to-line distances are squared before being added up. By the Pythagorean theorem, minimizing the sum of squared distances to a line through the origin is equivalent to maximizing the sum of squared projection lengths onto it:
$$v^{*} = \mathop{\arg\max}_{v : \|v\| = 1} \sum_{i=1}^{n} \langle x_i, v \rangle^2.$$
For centered data, this is exactly the direction along which the data has maximum variance.
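The equivalence between minimizing squared perpendicular distances and maximizing squared projections is just the Pythagorean theorem, and it can be checked numerically (the data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
X -= X.mean(axis=0)                     # centered data

v = rng.normal(size=5)
v /= np.linalg.norm(v)                  # a candidate unit direction

proj_sq = (X @ v) ** 2                  # squared projection lengths onto the line
dist_sq = np.linalg.norm(X - np.outer(X @ v, v), axis=1) ** 2  # squared distances

# ||x_i||^2 = <x_i, v>^2 + dist(x_i, line)^2 for every point, so the
# direction maximizing total proj_sq also minimizes total dist_sq.
assert np.allclose(proj_sq + dist_sq, np.linalg.norm(X, axis=1) ** 2)
```

Since the left-hand side of the identity is fixed by the data, the two optimization problems trade off exactly.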
Maximizing variance is useful for PCA because it preserves clusters that are dissimilar from each other in the higher-dimensional space. PCA should be used when the variance in the data is thought to carry useful information; if the variance is just noise, PCA won't be useful.
For more than one dimension, the story is similar. The PCA objective function becomes: choose orthonormal vectors $v_1, \ldots, v_k$ maximizing
$$\sum_{i=1}^{n} \sum_{j=1}^{k} \langle x_i, v_j \rangle^2.$$
Recall that vectors $v_1, \ldots, v_k$ are orthonormal if each has unit length and every pair of them is orthogonal. The span of a collection $v_1, \ldots, v_k$ of orthonormal vectors is the set of all of their linear combinations, which is a $k$-dimensional subspace. Orthonormal vectors also make projections easy to express: the projection of a point $x_i$ onto the span of $v_1, \ldots, v_k$ is
$$\sum_{j=1}^{k} \langle x_i, v_j \rangle v_j,$$
and the squared length of this projection is $\sum_{j=1}^{k} \langle x_i, v_j \rangle^2$. Combining the last two equations, we can state the objective of PCA for general $k$: choose the $k$-dimensional subspace through the origin that maximizes the sum of the squared projection lengths of the data points. The right-hand side of the objective above is exactly this total squared projection length. The resulting vectors $v_1, \ldots, v_k$ are called the top $k$ principal components of the data. More formally, they are
$$\mathop{\arg\max}_{\substack{v_1, \ldots, v_k \\ \text{orthonormal}}} \; \sum_{i=1}^{n} \sum_{j=1}^{k} \langle x_i, v_j \rangle^2.$$
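Assuming the standard linear-algebra fact that this objective is solved by the top $k$ right singular vectors of the centered data matrix, the general-$k$ case can be sketched in numpy:

```python
import numpy as np

def top_k_components(X, k):
    """Top-k principal components of the data X (one point per row)."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:k]                        # k orthonormal rows

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
V = top_k_components(X, 2)

# The components are orthonormal...
assert np.allclose(V @ V.T, np.eye(2))

# ...and the objective value they achieve equals the sum of the
# top-k squared singular values of the centered data matrix.
C = X - X.mean(axis=0)
s = np.linalg.svd(C, compute_uv=False)
assert np.allclose(((C @ V.T) ** 2).sum(), (s[:2] ** 2).sum())
```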
PCA is commonly used for data visualization. One typically takes $k = 2$ (or $k = 3$) and proceeds as follows:

1. Perform PCA to get the top $k$ principal components $v_1, \ldots, v_k$.
2. For each data point $x_i$, compute its coordinates $a_{i1} = \langle x_i, v_1 \rangle, \ldots, a_{ik} = \langle x_i, v_k \rangle$ with respect to the components.
3. Plot the point $x_i$ at $(a_{i1}, \ldots, a_{ik})$ in $\mathbb{R}^k$.
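The recipe above amounts to a few lines of numpy (the helper name `pca_2d_coords` is made up; the plotting call itself is omitted):

```python
import numpy as np

def pca_2d_coords(X):
    """Map each row of X to its (<x_i, v1>, <x_i, v2>) coordinates."""
    C = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    return C @ Vt[:2].T                  # n x 2 matrix, ready to scatter-plot

# The food-ratings table from earlier.
X = np.array([[10, 1, 2, 7],
              [7, 2, 1, 10],
              [2, 9, 7, 3],
              [3, 6, 10, 2]], dtype=float)
coords = pca_2d_coords(X)
# coords[i] is the 2-d point to plot for friend i, e.g. with
# matplotlib's plt.scatter(coords[:, 0], coords[:, 1]).
```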
One way to think about PCA is as a method for approximating data points as linear combinations of the same $k$ vectors. Both the projections $a_{ij}$ and the components $v_j$ themselves can help with interpreting the data.

The data points with the largest and smallest projections $\langle x_i, v_1 \rangle$ on the first principal component may suggest a potential meaning for the component. If the data points at each of the two ends are clustered together, they may have something in common that is illuminating. Plotting all points according to their projections on the top components, as in the visualization recipe above, can likewise reveal meaningful clusters.

The coordinates of a principal component itself can also be helpful: the coordinates with the largest magnitudes indicate which of the original features the component weights most heavily.
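These interpretation heuristics can be sketched on the food-ratings example:

```python
import numpy as np

names = ["Alice", "Bob", "Carolyn", "Dave"]
foods = ["kale", "taco bell", "sashimi", "pop tarts"]
X = np.array([[10, 1, 2, 7],
              [7, 2, 1, 10],
              [2, 9, 7, 3],
              [3, 6, 10, 2]], dtype=float)

C = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(C, full_matrices=False)
v1 = Vt[0]                               # first principal component

proj = C @ v1
# Which friends sit at the two extremes of the first component?
extremes = (names[int(np.argmin(proj))], names[int(np.argmax(proj))])

# Which foods does v1 weight most heavily (largest-magnitude coordinates)?
heavy = [foods[i] for i in np.argsort(-np.abs(v1))[:2]]
```

Inspecting `extremes` and `heavy` is the programmatic version of eyeballing the two ends of the component and its dominant coordinates.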
PCA can be used to depict the genomes of Europeans in two dimensions. In one well-known study, 1,387 Europeans were analyzed at roughly 200,000 SNPs (single-nucleotide polymorphisms, positions in the genome that commonly vary between people). So with an $n \times d$ data matrix, where $n = 1387$ people and $d \approx 200{,}000$ SNP coordinates, the top two principal components were computed and each person was plotted by their two projections. Remarkably, the resulting plot closely resembles the map of Europe: genomes roughly encode geographic origin. Interestingly, this holds for Europe but would not be expected for, say, America, whose population reflects more recent immigration trends.
Another famous application of PCA is the Eigenfaces project. The data points are a collection of images of faces, all framed the same way, under the same lighting conditions. Thus, each image can be viewed as a vector with one coordinate per pixel intensity, and PCA can be applied to these vectors. It turns out that a modest number of principal components approximates the images well, so in this representation faces are approximately low-dimensional, which is useful for tasks like face recognition.
There are a few common reasons PCA fails to give useful results:

- Scaling/normalization was messed up. PCA is sensitive to different scalings and normalizations of the coordinates, and getting good results from PCA involves choosing an appropriate scaling for the different data coordinates.
- Non-linear structure. PCA finds linear structure in the data, so if the data has low-dimensional but non-linear structure, PCA will not find anything useful.
- Non-orthogonal structure. PCA forces each successive component to be orthogonal to the previous ones, so if the meaningful directions in the data are not orthogonal to one another, the later principal components can fail to correspond to them.
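A concrete illustration of the non-linear failure mode: points evenly spaced on a circle have a natural one-dimensional description (the angle), yet PCA finds no preferred direction at all:

```python
import numpy as np

# 100 points spaced evenly around the unit circle: intrinsically
# one-dimensional (parameterized by angle), but not along any line.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
# Both singular values are (essentially) equal: no direction explains
# more variance than any other, so PCA cannot usefully reduce to k = 1.
print(s)
```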