Understanding Generalisation in Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
In supervised machine learning we solve problems like image classification: we learn a concept by showing our learning algorithm a set of examples, with the hope that it will be able to classify future unseen examples correctly.
How do we build such a classifier? Well, the general recipe is as follows:
Learning = Representation + Evaluation + Optimisation
We start by choosing a formal language to represent the hypothesis space of the learner; this determines the set of classifiers it can possibly learn. Then we pick an objective function to evaluate the performance of a candidate classifier. Finally, we need a search method for finding the best-performing classifier. The overall efficiency of our learner depends heavily on the choice of optimisation method.
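As a toy illustration of this recipe (a minimal sketch, not taken from any of the references below), the snippet here uses a linear model as the representation, mean squared error as the evaluation, and plain gradient descent as the optimisation:

```python
import numpy as np

# Learning = Representation + Evaluation + Optimisation
# Representation: linear models h_theta(x) = theta0 + theta1 * x
# Evaluation:     mean squared error on the training set
# Optimisation:   batch gradient descent on theta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 0.5 + rng.normal(scale=0.1, size=100)   # made-up "ground truth" plus noise

theta = np.zeros(2)                                    # [intercept, slope]

def predict(theta, X):
    return theta[0] + theta[1] * X

def mse(theta, X, y):                                  # the evaluation / objective function
    return np.mean((predict(theta, X) - y) ** 2)

lr = 0.1
for _ in range(500):                                   # the optimisation loop
    err = predict(theta, X) - y
    grad = np.array([2 * err.mean(), 2 * (err * X).mean()])
    theta -= lr * grad

print("learned theta:", theta, "training MSE:", mse(theta, X, y))
```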
During training we evaluate performance by plotting a “learning curve”, which shows the error rate (measured on a test set) as a function of the number of training examples. A good learning algorithm achieves an error rate of (almost) zero as you show it more and more examples. The rate at which the curve approaches zero depends on how difficult the learning task is.
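As a sketch of how such a curve can be produced in practice (assuming scikit-learn is available, with the classifier and synthetic data set chosen arbitrarily for illustration), `sklearn.model_selection.learning_curve` does the bookkeeping for us:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data standing in for a real task.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Held-out accuracy as a function of the number of training examples.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

test_error = 1.0 - test_scores.mean(axis=1)   # error rate = 1 - accuracy
plt.plot(train_sizes, test_error, marker="o")
plt.xlabel("number of training examples")
plt.ylabel("test error rate")
plt.title("Learning curve")
plt.show()
```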
Training error and Generalisation:
The catch with the canonical machine learning problem is that we don't really care about minimising this objective on the given data set:
$$
\underset{\theta}{\operatorname{minimize}} \sum_{i=1}^m \ell\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)
$$
What we really care about is how well our function will generalize to new examples that we didn't use to train the system (but which are drawn from the "same distribution" as the examples we used for training).
During training, if we pick really flexible models (e.g. higher-degree polynomials), chances are they will overfit. In practice this means they achieve very low loss on the training data, but produce functions we don't expect to generalize well.
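A quick way to see this in a few lines (a minimal sketch using numpy's `polyfit`; the data-generating function is made up for illustration) is to fit polynomials of increasing degree to the same small noisy sample and compare training and test error:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)  # noisy "ground truth"

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)        # more flexible = higher degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# The degree-15 fit typically drives training error near zero while test error blows up.
```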
The major goal of machine learning is to generalize beyond the examples seen during training. Memorisation is not a good strategy for learners, because we typically see only a fraction of the examples we will encounter in real-world settings. Luckily induction works pretty well: by making certain assumptions, ML algorithms are able to extract knowledge from a small number of examples.
According to the “no free lunch” theorem, no learner can beat random guessing over all possible functions to be learned.
In the classification problem there exists a “ground truth” function 'f' specifying the correct classification of every possible data point. 'f' is initially unknown, and our learner's job is to figure out this function. We assume we are given 'n' data points drawn independently and identically distributed (“i.i.d.”) from a distribution 'D', along with corresponding labels that represent the ground truth. We need to come up with a learning algorithm that uses these data points as clues and outputs a prediction function 'g', which is our best guess as to what the actual 'f' might be.
But the challenging bit is that we want to generalize beyond the training examples, so that our learned function performs well on future unseen examples too. So what “learning” really means is that we take the given examples and extrapolate, so that 'g' coincides with 'f'.
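To make this setup concrete, here is a small sketch with a made-up ground truth 'f' and distribution 'D': we draw i.i.d. labelled samples, learn a very simple guess 'g', and estimate how often 'g' disagrees with 'f' on fresh points.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(X):                        # hypothetical ground truth: label is sign of x1 + x2
    return (X[:, 0] + X[:, 1] > 0).astype(int)

def draw(n):                     # i.i.d. samples from the (here: uniform) distribution D
    X = rng.uniform(-1, 1, size=(n, 2))
    return X, f(X)

X_train, y_train = draw(200)

# A very simple learner g: a linear threshold fit by least squares.
A = np.c_[X_train, np.ones(len(X_train))]
w, *_ = np.linalg.lstsq(A, 2 * y_train - 1, rcond=None)
g = lambda X: (np.c_[X, np.ones(len(X))] @ w > 0).astype(int)

# Estimate how often g disagrees with f on new samples from D.
X_new, y_new = draw(10000)
print("estimated disagreement with f:", np.mean(g(X_new) != y_new))
```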
"The ability to perform well on previously unobserved inputs is called generalization".
How hard is this objective? Well, generally speaking, the more data we have, the better our chance of figuring out the ground truth function 'f'. Our success also depends on the number of functions that could be the ground truth (the fewer, the better). But again, machine learning is not just about optimization, where we reduce the training error. What we really care about is the generalization error (or test error), defined as "the expected value of the error on a new input, i.e. the probability that g disagrees with the ground truth f on the label of a random sample" [2][3].
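In symbols, the definition quoted above reads (for the 0-1 loss, the expected error on a new input is exactly the probability of disagreement):

$$
\operatorname{err}_D(g) \;=\; \Pr_{x \sim D}\big[\, g(x) \neq f(x) \,\big] \;=\; \mathbb{E}_{x \sim D}\big[\, \mathbb{1}\{ g(x) \neq f(x) \} \,\big]
$$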
As the model becomes more complex, the training loss always decreases; the generalization loss decreases to a point, then starts to increase.
Although it is difficult to quantify the true generalization error (i.e., the error of the algorithm over the complete distribution of possible examples), we can approximate it by holdout cross-validation. The basic idea is to split the data set into a training set and a holdout set, train the algorithm on the training set, and evaluate it on the holdout set.
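A minimal sketch of this split-train-evaluate loop (assuming scikit-learn; the model, data set and 70/30 split are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Split into a training set and a holdout set.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the training set, evaluate on the holdout set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("holdout accuracy :", model.score(X_hold, y_hold))   # proxy for generalization
```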
Computational Learning Theory:
Folks who are deep into ML theory try to answer some of the big philosophical questions, e.g.:
“What does it mean to learn?”, “Can a computer learn?”, “Are there algorithms that can solve certain learning problems in polynomial time?”
While it's hard to answer these completely, it is interesting to ask whether we can use some theory to determine the efficiency and power of machine learning models.
PAC-learnability:
PAC learning can be seen as the computational-theory counterpart for machine learning, and goes back to 1984, when Leslie Valiant formalised the idea of learning problems that are feasible (solvable in polynomial time).
One of the aims of learning theory is to predict what learning rates are achievable in a given learning task. The PAC (Probably Approximately Correct) model was one attempt to come up with such a theory, aiming to characterise the best worst-case learning rate. It has been criticised for not capturing the actual behaviour of many learning algorithms in practical settings [1]. Still, just as computational complexity theory is useful, PAC and its various subsequent variations try to do the same for machine learning.
A problem is PAC-learnable if there is an algorithm A which, for any distribution D and any concept c, when given enough independently drawn samples, will with high probability produce a hypothesis whose error is small.
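Stated slightly more formally (standard PAC notation, in the spirit of [5]): for every ε, δ ∈ (0, 1), given a number of i.i.d. samples polynomial in 1/ε and 1/δ, the algorithm A outputs a hypothesis h such that

$$
\Pr\big[\operatorname{err}_D(h) \le \varepsilon\big] \;\ge\; 1 - \delta
$$

where err_D(h) is the probability (over D) that h disagrees with the concept c. The ε is the "approximately correct" part, and the 1 − δ is the "probably" part.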
We have the option to alter the hypothesis space (i.e. the set of functions that the learning algorithm is allowed to select as the solution). Our challenge here is to choose a model with the right 'capacity': we pick machine learning models that match the complexity of the task to be solved and of the training data available. For example, a quadratic model will generalize well if the underlying function is quadratic. For deep learning models, the capacity also depends on the optimization algorithm we use, and since the theoretical understanding of nonconvex optimization is still in its early stages, their capacity is hard to determine.
To better understand generalization, we can decompose the generalization error into bias and variance, as discussed in the previous post.
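For reference, the standard decomposition (stated for squared-error loss) looks like this, where the expectation is over different training sets and σ² is the irreducible noise in the labels:

$$
\mathbb{E}\big[(y - \hat{g}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{g}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{g}(x) - \mathbb{E}[\hat{g}(x)]\big)^2\Big]}_{\text{variance}} \;+\; \sigma^2
$$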
Overall, we have to strike a balance: achieve low training error while at the same time keeping the gap between training and test error (the generalization gap) small. Keep in mind that measuring generalization error this way is a proxy, since we know neither the ground truth 'f' nor the distribution 'D' (that we talked about earlier).
REFERENCES:
[1] https://web.math.princeton.edu/~rvan/tri201106.pdf
[2] https://www.deeplearningbook.org/contents/ml.html
[3] https://web.stanford.edu/class/cs168/l/l5.pdf
[4] https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
[5] https://jeremykun.com/2014/01/02/probably-approximately-correct-a-formal-theory-of-learning/