If you have a dataset and a class of models you are trying to fit, regularization is a technique for expressing a preference among the models in the chosen class, so that the fitting procedure returns a model that balances this preference against how well it fits the data.
As an example, imagine preferring models whose coefficients are small, or models that use only a few of the available features; the regularizer expresses such a preference as a penalty added to the training objective.
Many common forms of regularization are viewed as prioritizing
different notions of “simple” models.
Some regularizers are implicit: the algorithm used to find a model that fits the data, such as stochastic gradient descent, itself prefers some hypotheses over others.
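As a minimal sketch of what an explicit regularizer looks like (the toy data, the dimensions, and the choice of an l2 penalty here are illustrative assumptions, not taken from the text above): the penalty is simply added to the data-fitting objective, and a knob `lam` controls how strongly the preference is weighted against the fit.

```python
import numpy as np

# Toy setup: 20 datapoints in 50 dimensions, i.e. more parameters than data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
w_true = rng.normal(size=50)
y = X @ w_true + 0.1 * rng.normal(size=20)

# Regularized least squares: minimize ||X w - y||^2 + lam * ||w||^2.
# The penalty term encodes a preference for small-norm ("simple") models;
# lam trades off that preference against fitting the data.
lam = 1.0
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ y)
```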
Why regularize at all? If your model class is small relative to the amount of data, you may not need to; regularization matters when the class is rich relative to the data. One viewpoint is to always work in the regime where your models have more parameters than you have data points, and let the regularizer choose among the many models that fit.
Another viewpoint is that models with more features are simply more expressive; hence, it's wasteful to restrict training to models that are small or low-dimensional.
Of course, it depends – if your underlying problem is very simple, neither more data nor more dimensions really help – but for more complicated problems, they help a lot.
Thus, we need ways to increase the dimensionality of the data, and then learn a linear classifier or regression in the larger-dimensional space.
One way to increase the dimensionality of a datapoint x = (x_1, ..., x_d) is the quadratic embedding, which maps x to the vector of all coordinates x_i together with all pairwise products x_i x_j.
A linear function in this quadratic space is more expressive than a linear function in the original space: it can model any function that is quadratic in the original coordinates, such as decision boundaries given by circles or ellipses.
You can train linear classifiers in such polynomial spaces by explicitly mapping the data into the new space and proceeding as usual. It is also possible to do this without ever writing out the higher-dimensional embedding of the data; this is known as kernelization.
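As a concrete sketch (using scikit-learn's PolynomialFeatures and LogisticRegression, which are my choice of tooling rather than anything prescribed above): embed a toy 2-D dataset into the degree-2 polynomial space and train an ordinary linear classifier on the embedded features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Toy 2-D dataset that no linear classifier in the original space can fit:
# the label marks whether a point lies inside a circle.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Degree-2 embedding: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2).
phi = PolynomialFeatures(degree=2, include_bias=False)
X_quad = phi.fit_transform(X)

# A linear classifier in the quadratic space can represent the circular
# boundary x1^2 + x2^2 = 0.5.
clf = LogisticRegression(max_iter=1000).fit(X_quad, y)
print(clf.score(X_quad, y))  # typically very close to 1.0
```

The kernelized route would instead use something like sklearn.svm.SVC(kernel="poly", degree=2), which trains a linear classifier in the same quadratic space without ever materializing the embedded vectors.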
A different method is a random linear map, in the spirit of the Johnson-Lindenstrauss (JL) transformation but mapping from fewer to more dimensions: multiply each datapoint by a fixed random matrix with more rows than columns.
This embedding increases the dimensionality of the data, but on its own it adds no expressive power: a linear function of the embedded data composes with the linear embedding to give just a linear function of the original space.
To increase the expressivity of linear functions in this space, you need a non-linear operation applied to each coordinate of the embedded vector.
This is what a neural net does: a linear map into a higher-dimensional space followed by a coordinatewise non-linearity, possibly repeated over several layers.
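Here is a minimal numpy sketch of that recipe (the output dimension of 200 and the choice of ReLU are arbitrary, illustrative assumptions): a fixed random linear map up to more dimensions, a coordinatewise non-linearity, and then a linear classifier on top.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same kind of toy dataset as before: points labeled by a circular boundary.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Random linear map from 2 up to 200 dimensions, followed by a
# coordinatewise non-linearity (ReLU).  Without the non-linearity, a
# linear model trained on X @ W would still be a linear function of X.
W = rng.normal(size=(2, 200))
b = rng.normal(size=200)
H = np.maximum(X @ W + b, 0.0)

# A linear classifier on these non-linear random features behaves like a
# one-hidden-layer network whose first layer is random and untrained.
clf = LogisticRegression(max_iter=1000).fit(H, y)
print(clf.score(H, y))  # noticeably better than any linear fit on raw X
```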
Whether to pick the polynomial embedding or the random one depends on whether you want a basis-dependent embedding. If the features of the input space are individually meaningful, real quantities like mass or light, then a polynomial embedding makes sense: each new feature is a product of original features, so the interpretability of the model is preserved. If your features are not individually meaningful, or were chosen arbitrarily, then the random projection, which is rotationally invariant, may make more sense.
Under the Bayesian viewpoint, the true model underlying the data is drawn from some known prior distribution. Given this prior and the observed data, you can evaluate how probable any candidate model is, and regularization corresponds to the prior: penalizing a model is the same as assigning it low prior probability.
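To make this concrete, here is the standard calculation, written under the assumption of Gaussian observation noise with variance σ² and an independent Gaussian prior with variance τ² on each weight (these modeling choices are mine; the text above does not fix them):

$$
-\log p(w \mid \text{data}) \;=\; \frac{1}{2\sigma^2}\sum_i \bigl(y_i - w^\top x_i\bigr)^2 \;+\; \frac{1}{2\tau^2}\,\lVert w\rVert_2^2 \;+\; \text{const},
$$

so the most probable model under the posterior is exactly the minimizer of the squared loss plus an ℓ2 penalty with weight λ = σ²/τ²; a Laplace prior would give an ℓ1 penalty instead.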
The frequentist approach to justifying regularization is to argue that if the true model has a specific property, then regularizing will allow you to recover a good approximation to it.
For a linear model, one such property is sparsity: the true weight vector has only a few nonzero coordinates. We can then design a regularizer that prefers sparser models, which lets us learn a good model even when the amount of available data is far less than what would be required to learn a dense linear model.
If we prefer sparse models, the most natural regularizer is to penalize vectors that are non-sparse. This is the ℓ0 "norm," which counts the number of nonzero coordinates of the weight vector. The problem with this is that it is not convex, and optimizing over it directly is computationally intractable.
In practice, the ℓ1 norm, the sum of the absolute values of the coordinates, is used instead as a convex surrogate. This is better because it is amenable to gradient descent and other optimization approaches (see the sketch below), while still encouraging sparse solutions.
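As an illustration of that point, here is a sketch of proximal gradient descent (ISTA) for the ℓ1-penalized least-squares objective; the function name, step-size choice, and iteration count are my own illustrative choices.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=500):
    """Minimize (1/2) * ||X w - y||^2 + lam * ||w||_1 by proximal gradient
    descent (ISTA): a gradient step on the smooth squared-loss term,
    followed by soft-thresholding, the proximal operator of the l1 penalty."""
    _, d = X.shape
    w = np.zeros(d)
    eta = 1.0 / np.linalg.norm(X, 2) ** 2      # step size 1/L with L = ||X||_2^2
    for _ in range(steps):
        grad = X.T @ (X @ w - y)               # gradient of the smooth part
        z = w - eta * grad
        w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # soft-threshold
    return w
```

The soft-thresholding step is what pushes small coordinates exactly to zero, which is why the ℓ1 penalty produces genuinely sparse solutions.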
This line of reasoning leads to the following kind of guarantee, stated informally: if the true linear model in d dimensions is k-sparse (has at most k nonzero coefficients), then under suitable conditions on the data, ℓ1-regularized regression recovers a good approximation to it from a number of samples that grows roughly like k log d, rather than like d for a dense model.
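A quick numerical illustration of this kind of guarantee (a sketch with arbitrarily chosen dimensions, using scikit-learn's Lasso as one off-the-shelf implementation of ℓ1-regularized regression):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 5                  # far fewer samples than dimensions

# True model: a k-sparse weight vector in d dimensions.
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)

X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

# l1-regularized regression on the under-determined problem.
w_hat = Lasso(alpha=0.01, max_iter=10000).fit(X, y).coef_

print(np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))  # typically small
print(int(np.sum(np.abs(w_hat) > 1e-3)))                        # typically close to k
```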