Regularization on linear models
The Stochastic Gradient Descent (SGD) algorithm finds the optimal weights {w_j} of the model by minimizing the error between the true and the predicted values on the N training samples. Taking the squared error as the error measure, this reads:
$$E(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where ŷ_i are the predicted values and y_i the real values to be predicted; we have N samples, and each sample has n dimensions.
Regularization consists of adding a term to the previous equation and minimizing the regularized error instead:
$$E_{reg}(w) = E(w) + \alpha R(w)$$
Here E(w) is the unregularized error from the previous equation, the parameter α quantifies the amount of regularization, and R(w) is the regularization term, which depends on the regression coefficients.
There are two types of weight constraints usually considered:
- L2 regularization as the sum of the squares of the coefficients: $R(w) = \sum_{j=1}^{n} w_j^2$
- L1 regularization as the sum of the absolute values of the coefficients: $R(w) = \sum_{j=1}^{n} |w_j|$
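As a rough illustration of these definitions, the following sketch computes the two penalties and adds them to an unregularized error. It assumes a plain linear model with predictions `X @ w` and a squared-error loss; the function names, the toy data, and the symbol `alpha` for the regularization parameter are choices made for this example, not taken from the text.
```python
import numpy as np


def squared_error(w, X, y):
    """Sum of squared differences between the true values y and the predictions X @ w."""
    residuals = y - X @ w
    return float(np.sum(residuals ** 2))


def l2_penalty(w):
    """L2 regularization term: sum of the squares of the coefficients."""
    return float(np.sum(w ** 2))


def l1_penalty(w):
    """L1 regularization term: sum of the absolute values of the coefficients."""
    return float(np.sum(np.abs(w)))


def regularized_error(w, X, y, alpha, penalty):
    """Unregularized error plus alpha times the chosen penalty R(w)."""
    return squared_error(w, X, y) + alpha * penalty(w)


# Toy data (hypothetical): N = 3 samples, n = 2 dimensions.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25])

print("unregularized:", squared_error(w, X, y))
print("with L2 term: ", regularized_error(w, X, y, alpha=0.1, penalty=l2_penalty))
print("with L1 term: ", regularized_error(w, X, y, alpha=0.1, penalty=l1_penalty))
```
With `alpha` set to 0, both regularized errors reduce to the unregularized one. Increasing `alpha` makes large coefficients more costly, whichever penalty is used.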
The constraint on the coefficients introduced by the regularization term R(w) helps prevent the model from overfitting the training data. The coefficients become tied together by the regularization and can no longer follow the predictors as closely. Each type of regularization has its own characteristics and gives rise to a different variation of the SGD algorithm, which we now introduce: