Regularization


The Context and Intuition behind Regularization

If you have a dataset and a class of models you are trying to fit, regularization is a technique for encoding preferences over the models in that class, and for returning a model that balances those preferences against how well it fits the data.

As an example, imagine \(n\) points \(x_1,\dots,x_n \in \Bbb R^d\) with \(n\) labels \(y_1,\dots,y_n \in \Bbb R\). Take this \(l_2\)-regularized least-squares objective function corresponding to a model \(a \in \Bbb R^d\):

\[f(a) = \displaystyle \sum_{i=1}^n((a,x_i) - y_i)^2 + \lambda \vert a \vert_2^2\]

\(\vert a \vert_2^2\) denotes the squared Euclidean length of the vector \(a\), and \(\lambda\) dictates how strongly models with small \(l_2\) norm should be prioritized. As \(\lambda\) increases, models \(a\) with smaller norm are more heavily prioritized.
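As a minimal sketch (in Python with NumPy, which this post does not otherwise use), here is one way to minimize this objective directly: setting its gradient to zero gives the closed-form minimizer \(a = (X^\top X + \lambda I)^{-1} X^\top y\), where the rows of \(X\) are the \(x_i\). The function name and the toy data are my own.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i ((a, x_i) - y_i)^2 + lam * ||a||_2^2.

    X: (n, d) matrix whose rows are the data points x_i.
    y: (n,) vector of labels.
    lam: regularization strength lambda >= 0.
    """
    d = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) a = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# As lambda grows, the returned coefficients shrink toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
a_true = rng.normal(size=10)
y = X @ a_true
for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge_fit(X, y, lam)))
```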

Many common forms of regularization can be viewed as prioritizing different notions of “simple” models. \(l_2\) regularization is common, as are sparsity-inducing regularizers, which favor models with sparser coefficients.

Some regularizers are implicit: the algorithm you use to find a model that fits the data, like stochastic gradient descent, prefers some hypotheses to others.

Why regularize at all? If \(n \gg d\), then just use ERM (empirical risk minimization). Otherwise, if \(d \approx n\) or larger, ERM will not generalize: it will overfit the data. You need some way to make your model generalize, and regularization is one way of encoding additional information, or preferences about the form of the true underlying model, that can ameliorate the overfitting.

Increasing Dimensionality: the Polynomial Embedding and Random Non-Linear Features

One viewpoint is to always work in the regime \(n \gg d\), so that there is enough data relative to the model's complexity for ERM to generalize.

Another viewpoint is that models with more features are simply more expressive; hence, it is wasteful to only train models where \(n \gg d\). You should keep adding features until the dimensionality grows from \(d\) to some \(d'\) with \(d' \approx n\). The intuition is that for most problems the true model is complex, and capturing more dimensions of it will only aid your model.

Of course, it depends – if your underlying problem is very simple, neither more data nor more dimensions really help – but for more complicated problems, they help a lot.

Thus, we need to know how to increase dimensionality, after which we can learn a linear classifier or regression in the higher-dimensional space.

The Polynomial Embedding

One way to increase the dimensionality of a datapoint \(x = (x_1,\dots,x_d)\) is to add a number of polynomial features. For example, adding \(\begin{pmatrix}d+1 \\ 2 \end{pmatrix} = d(d+1)/2\) coordinates, corresponding to the products of all pairs of coordinates of the original data, including each coordinate with itself. This is a quadratic embedding. This embedding, \(f : \Bbb R^d \to \Bbb R^{d + d(d+1) / 2}\), is defined as follows:

\[f(x) = (x_1,\dots,x_d,\;x_1^2,x_1 x_2,x_1 x_3,\dots,x_{d-1} x_d,x_d^2)\]

A linear function in this quadratic space is more expressive than a linear function in the original space. For example, a linear function in the new space can model functions such as \(g(x) = (x_1 + x_2 + x_{17})^2\).

You can train linear classifiers in polynomial spaces by mapping the data into the new space and proceeding as usual. It is also possible to do this without ever writing out the higher-dimensional embedding of the data; this is called kernelization.
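Here is a minimal sketch of the explicit route: the quadratic embedding followed by ordinary least squares in the embedded space. The helper name and toy target are my own.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_embed(X):
    """Map each row x in R^d to (x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_d^2),
    i.e. R^d -> R^{d + d(d+1)/2}."""
    n, d = X.shape
    pair_feats = [X[:, i] * X[:, j]
                  for i, j in combinations_with_replacement(range(d), 2)]
    return np.column_stack([X] + pair_feats)

# A linear model in the embedded space can fit, e.g., g(x) = (x_1 + x_2)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1]) ** 2
Z = quadratic_embed(X)                      # shape (200, 5 + 15)
a, *_ = np.linalg.lstsq(Z, y, rcond=None)   # ordinary least squares in the new space
print(np.max(np.abs(Z @ a - y)))            # essentially zero: an exact fit
```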

A Random Projection plus Nonlinearity

A different method is reminiscent of the Johnson-Lindenstrauss (JL) transformation, but maps from fewer to more dimensions. To build the mapping \(f : \Bbb R^d \to \Bbb R^{d'}\), choose a \(d' \times d\) matrix \(M\) with each entry drawn independently from a standard Gaussian, and define the transformation \(f(x) = Mx\).

This embedding increases the dimensionality of the data, but if all the data are mapped this way, the two linear maps compose: a linear function of \(Mx\) is still just a linear function of the original \(x\), so nothing is gained in expressivity.

To increase the expressivity of linear functions in this space, you need a non-linear operation applied to each coordinate of \(Mx\): squaring each coordinate, taking the absolute value, the square root, etc.

This is essentially what a layer of a neural network computes, except that a neural network also trains the matrix \(M\) rather than leaving it random.
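A minimal sketch of this construction: a random Gaussian matrix, a coordinate-wise nonlinearity (here the absolute value, one of the options mentioned above), and then a linear model on the resulting features. The function name and toy target are my own.

```python
import numpy as np

def random_features(X, d_prime, rng):
    """Map x in R^d to a nonlinearity applied coordinate-wise to Mx,
    where M is a (d_prime x d) matrix of i.i.d. standard Gaussian entries."""
    d = X.shape[1]
    M = rng.normal(size=(d_prime, d))
    return np.abs(X @ M.T)   # without the nonlinearity this would still be linear in x

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.abs(X[:, 0] - X[:, 1])            # a target no linear function of x can fit well
Z = random_features(X, d_prime=200, rng=rng)
a, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.mean((Z @ a - y) ** 2))         # much smaller than the error of a linear fit on X
```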

Comparing Polynomial Embedding vs Random Projection + Nonlinearity

The choice between the two depends on whether you want a basis-dependent embedding. If the features of the input space are individually meaningful – if your model has features that describe the real world, like mass or light – then a polynomial embedding makes sense, since it preserves the interpretability of the new features. If your features are not individually meaningful, or were chosen arbitrarily, then the random projection, which is rotationally invariant, might make more sense.

Regularization: The Bayesian and Frequentist viewpoints

The Bayesian View and \(l_2\) Regularization

Under the Bayesian view, the true model underlying the data is assumed to be drawn from some known prior distribution. Given this prior and a noise model, you can evaluate the posterior probability of any candidate model, and the regularized objective arises as its negative log; in particular, \(l_2\) regularization corresponds to a Gaussian prior on the model.
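As a sketch of that correspondence (the Gaussian prior \(a \sim \mathcal N(0, \tau^2 I)\) and Gaussian label noise of variance \(\sigma^2\) are assumptions, not stated above), maximizing the posterior is the same as minimizing the \(l_2\)-regularized objective:

\[\begin{aligned}
\hat a_{\text{MAP}} &= \arg\max_a \; p(a)\prod_{i=1}^n p(y_i \mid x_i, a) \\
&= \arg\min_a \; \sum_{i=1}^n \frac{((a,x_i) - y_i)^2}{2\sigma^2} + \frac{\vert a \vert_2^2}{2\tau^2} \\
&= \arg\min_a \; \sum_{i=1}^n ((a,x_i) - y_i)^2 + \lambda \vert a \vert_2^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{aligned}\]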

The Frequentist View and \(l_0\) Regularization

The frequentist approach to justifying regularization is to argue that if the true model has a specific property, then regularization will allow you to recover a good approximation to the true model.

For example, the true linear model might be sparse. We can then design a regularizer that prefers sparser models, allowing us to recover a good model even when the amount of available data is significantly less than what would be required to learn a dense linear model.

\(l_0\) regularization

If we prefer sparse models, the most natural regularizer is one that penalizes vectors that are not sparse. This is the \(l_0\) regularizer: \(\vert a \vert_0\) counts the number of nonzero coordinates of \(a\). In regularized least squares, the objective function would be:

\[f(a) = \displaystyle \sum_{i=1}^n((a,x_i) - y_i)^2 + \lambda \vert a \vert_0\]

The problem with this is that the \(l_0\) “norm” is highly discontinuous: an arbitrarily small change to \(a\) can change \(\vert a \vert_0\) by 1, so the objective is not amenable to gradient-based optimization.

\(l_1\) as a computationally tractable proxy for \(l_0\)

In practice, the \(l_1\) norm is used as a proxy for the \(l_0\) norm. The \(l_1\) norm of a vector is the sum of the absolute values of its coordinates, so it is continuous and convex (piecewise linear). The objective looks like this:

\[f(a) = \displaystyle \sum_{i=1}^n((a,x_i) - y_i)^2 + \lambda \vert a \vert_1\]

This is better because it is amenable to gradient descent and other convex optimization approaches.
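A minimal sketch of one such approach, proximal gradient descent (often called ISTA), applied to this \(l_1\)-regularized objective; the step size rule and iteration count are illustrative choices, not taken from the text.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize sum_i ((a, x_i) - y_i)^2 + lam * ||a||_1 by proximal gradient descent."""
    n, d = X.shape
    a = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ a - y)                # gradient of the least-squares term
        a = soft_threshold(a - step * grad, step * lam)
    return a
```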

This leads to the following proposition:

Given \(n\) independent Gaussian vectors \(x_1,\dots,x_n \in \Bbb R^d\), consider labels \(y_i = (a,x_i)\) for some vector \(a\) with \(\vert a \vert_0 = s\). With high probability, the minimizer of the \(l_1\)-regularized objective function will be the vector \(a\), provided that \(n > c \cdot s \log d\) for some absolute constant \(c\).
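A quick experiment in the spirit of this proposition, using scikit-learn's Lasso as the \(l_1\)-regularized solver. The specific \(d\), \(s\), \(n\), and regularization strength are illustrative choices, and scikit-learn's alpha is a rescaling of the \(\lambda\) above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s, n = 1000, 5, 200                      # n comfortably above s * log(d), roughly 35

X = rng.normal(size=(n, d))                 # independent Gaussian data points as rows
a_true = np.zeros(d)
a_true[rng.choice(d, size=s, replace=False)] = rng.choice([-1.0, 1.0], size=s)
y = X @ a_true                              # noiseless labels y_i = (a, x_i)

# Note: sklearn's Lasso minimizes (1/2n) * ||y - Xa||^2 + alpha * ||a||_1,
# so its alpha is a rescaled version of the lambda in the objective above.
model = Lasso(alpha=0.001, fit_intercept=False, max_iter=50_000).fit(X, y)

recovered = np.flatnonzero(np.abs(model.coef_) > 0.1)
# With high probability the support of a_true is recovered.
print(np.array_equal(recovered, np.flatnonzero(a_true)))
```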
