L1 regularization and Lasso
L1 regularization usually entails some loss of the model's predictive power.
One of the properties of L1 regularization is that it forces the smallest weights to 0, thereby reducing the number of features the model takes into account. This is desirable when the number of features (n) is large compared to the number of samples (N), which makes L1 better suited to datasets with many features.
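This feature-pruning effect is easy to observe in practice. The following minimal sketch (using scikit-learn purely for illustration; the dataset and alpha value are arbitrary) fits an L1-penalized and an L2-penalized linear model on the same data with many features and counts how many coefficients each drives to exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Many features (n=100) relative to samples (N=50), only 10 informative
X, y = make_regression(n_samples=50, n_features=100, n_informative=10,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))  # most weights pruned
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```

The L1 model typically zeroes out the bulk of the uninformative features, while the L2 model merely shrinks them toward zero.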
Linear regression with an L1 regularization term is known as the Least Absolute Shrinkage and Selection Operator (Lasso); the model can still be trained with the Stochastic Gradient Descent (SGD) algorithm.
In both cases (L1 or L2), the hyper-parameters of the model are as follows (see the sketch after this list):
- The learning rate of the SGD algorithm
- A parameter to tune the amount of regularization added to the model
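A minimal scikit-learn sketch showing where these two hyper-parameters appear when training an L1-regularized linear model with SGD (the dataset and parameter values are arbitrary, chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

model = SGDRegressor(
    penalty='l1',              # L1 regularization term (Lasso-style)
    alpha=1e-4,                # amount of regularization added to the model
    learning_rate='constant',
    eta0=0.01,                 # learning rate of the SGD algorithm
    max_iter=1000,
)
model.fit(X, y)
```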
A third type of regularization, called ElasticNet, consists of adding both an L1 and an L2 regularization term to the model. This combines the best of both regularization schemes at the expense of an extra hyper-parameter.
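That extra hyper-parameter is the mixing ratio between the two penalty terms. A minimal sketch with scikit-learn's ElasticNet (again for illustration only; values are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=50, random_state=0)

model = ElasticNet(
    alpha=0.1,        # overall amount of regularization
    l1_ratio=0.5,     # the extra hyper-parameter: 1.0 = pure L1, 0.0 = pure L2
).fit(X, y)
```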
In other contexts, although experts have differing opinions on which type of regularization is more effective (https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization), the consensus seems to favor L2 over L1 regularization.
Both L1 and L2 regularization are available in Amazon ML, while ElasticNet is not. The amount of regularization is limited to three values: mild (10⁻⁶), medium (10⁻⁴), and aggressive (10⁻²).
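These settings can be passed when creating a model programmatically. A hedged sketch using boto3's Amazon ML client (the model and data source IDs are placeholders; the regularization amount is passed as a string through the CreateMLModel Parameters map):

```python
import boto3

client = boto3.client('machinelearning')

client.create_ml_model(
    MLModelId='ml-example-model',                # placeholder ID
    MLModelName='L1 regularized model',
    MLModelType='REGRESSION',
    Parameters={
        # mild L1 regularization; '1e-4' would be medium, '1e-2' aggressive
        'sgd.l1RegularizationAmount': '1e-06',
    },
    TrainingDataSourceId='ds-example-training',  # placeholder data source ID
)
```

Using `sgd.l2RegularizationAmount` instead selects L2 regularization; there is no parameter combining both, consistent with ElasticNet not being offered.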