An overfitted model—like the polynomial in the preceding example—tends to have very large weights. To prevent this, a penalty term, Ω, can be added to the objective function to drive the weights closer to the origin. Thus, the penalty term should be a function of the norm of the weights. The strength of the penalty is controlled by multiplying it by a hyperparameter, α, so our objective function becomes E(w) + αΩ(w). The most popular penalty terms are:
- L2 regularization: The penalty term is given by Ω(w) = ½‖w‖₂² = ½ Σᵢ wᵢ². In the regression literature, this is called ridge regression.
- L1 regularization: The penalty term is given by Ω(w) = ‖w‖₁ = Σᵢ |wᵢ|. This is called lasso regression.
L1 regularization leads to a sparse solution; that is, it sets many of the weights to zero and thus acts as a good feature-selection method for regression problems.
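To make the sparsity effect concrete, the following NumPy sketch (illustrative only; the synthetic data, the α value, and the ISTA solver are assumptions, not from the text) fits both penalties on data where only the first two of ten features are informative. Ridge has a closed-form solution and shrinks all weights without zeroing them, while lasso, solved here with a simple proximal-gradient (ISTA) loop, sets the irrelevant weights exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: 10 features, only the first 2 are informative.
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=n)

alpha = 5.0  # illustrative regularization strength

# Ridge (L2): closed form for min ||Xw - y||^2 + alpha * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Lasso (L1): proximal gradient (ISTA) for min 0.5*||Xw - y||^2 + alpha*||w||_1
def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink toward zero, clip at zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth gradient
w_lasso = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w_lasso - y)
    w_lasso = soft_threshold(w_lasso - grad / L, alpha / L)

# Ridge keeps every weight nonzero; lasso typically keeps only the
# informative ones, acting as a feature selector.
print("ridge nonzeros:", int(np.sum(np.abs(w_ridge) > 1e-6)))
print("lasso nonzeros:", int(np.sum(np.abs(w_lasso) > 1e-6)))
```

Printing the two weight vectors shows the qualitative difference: ridge leaves small but nonzero values on all ten coordinates, whereas lasso recovers a sparse solution concentrated on the informative features.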