Overftting is typically a result of using too complex models, and we need to choose a proper model complexity to achieve the optimal bias-variance tradeoff.

Cross-validation and Regularization are two key techniques in machine learning to prevent overfitting.

1. Cross-validation: How do we pick the best data points for Training and the best points for Testing? When we’re not told which data should be used for Training and for Testing, we can use Cross Validation to figure out which is which in an unbiased way. i.e Cross-validation is a step in the process of building a machine-learning model which helps us ensure that our models fit the data accurately and also ensures that we do not overfit. Cross-validation is used to tune hyperparameters, to strike a balance between bias and variance, and to select the optimal model. We choose dev and test sets to reflect data you expect to get in the future and want to do well on.

Types: In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

  • A model is trained using k-1 of the folds as training data;
  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
../_images/grid_search_cross_validation.png

Another type of cross-validation is stratified k-fold. Stratified k-fold cross-validation is a variation of k-fold cross-validation that is used when the dataset is imbalanced. In an imbalanced dataset, the number of samples in each class is not equal, and this can cause issues in traditional k-fold cross-validation where the data is randomly split into k subsets. In stratified k-fold cross-validation, the dataset is divided into k-folds such that each fold has the same proportion of samples from each class as the overall dataset. This ensures that each fold is representative of the overall distribution of classes in the dataset. i.e it preserves the class distribution.

2. Regularization One very common way to deal with overfitting the Training Data is to use a collection of techniques called Regularization. Regularization is basically any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. No free lunch theorem states that there is no universally best model, or best regularizor. It is recommended that we must choose a form of regularization that is well suited to the particular task we want to solve. i.e Specific regularization tactics will vary according to the learning algorithm under consideration.

Any algorithm trained by optimizing a mathematically well-defined objective function can be regularized by adding to it a penalty norm on the parameters of the model/algorithm. Amount of reguralization can also be controlled via hyper-parameters. Tree-based ensemble algorithms can be regularized by controlling the tradeoff between the learning rate and the number of trees used. Neural networks (and even some tree-based algorithms) are often trained using a regularization technique called dropout. All of these methods differ in the exact mechanics of how they work. But they are all share the same end goal of improving a learning algorithm’s ability to generalize to unseen data.

Example: For linear models, a good approach is to constrain the model by reducing its degrees of freedom. The two most common methods of regularization are Lasso (or L1) regularization, and Ridge (or L2) regularization. Ridge Regression, regularization term (equal to equivalent to the sum of the squares of the magnitude of coefficients) is added to the cost function during training as follows:

\begin{equation*} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta}^{T} \cdot x^{(i)} - y^{(i)})^{2} + \lambda\sum_{j=1}^{n}\theta_{j}^{2}. \end{equation*}

This forces the learning algorithm to not only fit the data but keep the weights as small as possible. Lasso regression also performs coefficient minimization, but instead of squaring the magnitudes of the coefficients, it takes the true values of coefficients.

\begin{equation*} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta}^{T} \cdot x^{(i)} - y^{(i)})^{2} + \lambda\sum_{j=1}^{n}|\theta_{j}|, \end{equation*}
why regularization is slower slope and not higher? - Cross Validated

In the figure above, applying regulization forces model to have a smaller slope and so it doesn’t fit the Training data perfectly, however it does a better job with the testing Data. In other words, by increasing the bias a little, we decreased the variance a lot.

References:

  1. https://scikit-learn.org/stable/modules/cross_validation.html
  2. https://cs229.stanford.edu/notes2022fall/main_notes.pdf