Literature
ISL: 1, 2.1, 2.2, 4.1
Statistical learning
Term used by the ISL book authors
Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
Predictive inference
The goal is to make predictions.
The quantity of interest is the prediction \(\hat{Y}\) of \(Y\) for given predictors \(X\). It is obtained by evaluating the estimated function \(\hat{f}\) at \(X\):
\[\hat{Y} = \hat{f}(X)\]
Assume that the estimated function \(\hat{f}\) and the predictors \(X\) are fixed. Then the only source of variation comes from the random error \(\varepsilon\).
The variance of a prediction is
\[V(\hat{Y}) := E\left( (Y-\hat{Y})^2 \right) = \underbrace{\left( f(X)-\hat{f}(X)\right)^2}_{\text{Reducible error}} + \underbrace{V(\varepsilon)}_{\text{Irreducible error}}\]
Prove the equality!
\[E\left( (Y-\hat{Y})^2 \right) = E\left( \left( f(X) + \varepsilon -\hat{f}(X)\right)^2\right) = \left( f(X)-\hat{f}(X)\right)^2 + 2\left( f(X)-\hat{f}(X)\right)E(\varepsilon) + E(\varepsilon^2) = \left( f(X)-\hat{f}(X)\right)^2 + V(\varepsilon),\]
where the cross term vanishes because \(f(X)-\hat{f}(X)\) is fixed and \(E(\varepsilon)=0\), and \(E(\varepsilon^2)=V(\varepsilon)\) for the same reason.
Supervised learning offers techniques for estimating the function \(f\) with the aim of minimising the reducible error.
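As a minimal sketch of this idea (the linear model and the built-in cars data are choices for illustration, not prescribed by the text), one way to obtain \(\hat{f}\) and a prediction \(\hat{Y}\) in R:

```r
# Sketch: estimate f from training data and form the prediction Y-hat
fit  <- lm(dist ~ speed, data = cars)         # f-hat, estimated from data
yhat <- predict(fit, data.frame(speed = 21))  # Y-hat = f-hat(X) at X = 21
```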
Parametric inference
The goal is to understand the association between \(Y\) and \(X_1,\ldots,X_p\), e.g.
- Which predictors have a strong association with \(Y\)?
- What relationship describes this association (increasing/decreasing, linear/nonlinear)?
In parametric inference, the quantity of interest can be a parameter within a function, or a function among a set of candidate functions.
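For a parametric model such as linear regression, those quantities are the fitted coefficients; a minimal sketch in R (again using the built-in cars data as an illustrative choice):

```r
# Sketch: in parametric inference the parameters are of interest themselves
fit <- lm(dist ~ speed, data = cars)
coef(summary(fit))  # estimates, standard errors, t- and p-values per predictor
```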
The ISL book uses:
Prediction - Statistical learning is applied to build a model for making predictions.
Inference - Statistical learning is applied to learn something about the world.
I use predictive inference vs parametric inference because inference is required for both goals.
According to the Cambridge dictionary, inference is a guess that you make or an opinion that you form based on the information that you have.
General approach for supervised learning
Training data
We learn about the unknown function \(f\) from our training data
\[\left\{ (x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n) \right\},\]
where \(x_{ij}\) is the observed value of predictor \(j\) for the \(i\)th observation, \(i=1,\ldots,n\) and \(j=1,\ldots,p\), and \(y_i\) is the observed response value for the \(i\)th observation.
Notations
For a vector, I write \(x_i = (x_{i1},\ldots,x_{ip})\)
\(x_i\) is an observation of \(X\).
\(X\) is usually treated as a fixed variable (the alternative would be to treat it as a random variable), or as a random variable with negligible error.
\(y_i|x_i\) is an observation of the random variable \(Y|X\).
It is common to drop the conditioning notation (assuming it is understood by the reader), and write \(Y\) instead of \(Y|X\) and \(y_i\) instead of \(y_i|x_i\)
Objective
Find a function \(\hat{f}\) such that \(Y\approx \hat{f}(X)\) for any observation \((X,Y)\)
Residual Sum of Squares
The variance of predictions of observed data \((x_1,y_1),\ldots,(x_n,y_n)\) can be estimated from the Residual Sum of Squares
\[RSS = \sum_{i=1}^{n} \left(y_i-\hat{f}(x_i)\right)^2 = \sum_{i=1}^{n} e_i^2\]
Note that the RSS must be divided by a suitable factor to become an estimate of the prediction variance. The suitable factor depends on the degrees of freedom associated with the algorithm used.
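A minimal sketch of this in R, assuming a linear model: lm has \(n-p-1\) residual degrees of freedom, so dividing the RSS by df.residual(fit) gives the usual estimate of the error variance (the model and data are illustrative choices):

```r
# Sketch: RSS divided by the residual degrees of freedom estimates
# the error variance for a linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative model
rss <- sum(residuals(fit)^2)             # residual sum of squares
rss / df.residual(fit)                   # RSS / (n - p - 1)
```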
Regression versus Classification
In general, variables are either quantitative (numerical) or qualitative (categorical).
Statistical learning about the unknown function \(f\) is termed
- regression, when the response variable \(Y\) is quantitative, and
- classification, when the response variable \(Y\) is categorical.
The design of algorithms for statistical learning and ways to measure performance differ between regression and classification models.
Model accuracy for regression
For regression the quality of a fitted model can be measured by the variance of predictions.
Assuming equal variance for all predictions, this variance can be estimated by the Mean Squared Error (MSE)
\[MSE = \frac{1}{n} \sum_{i=1}^{n}\left( y_i-\hat{f}(x_i)\right)^2\]
This can be done based on the
- training data, i.e. the data that were used to estimate the function, or
- test data, i.e. a data set held out from the start and not used for the estimation.
In general, \(MSE_{train} < MSE_{test}\)
Therefore, using the \(MSE_{test}\) reduces the risk of being over-confident in the model's performance or, if the \(MSE\) is used for model selection, the risk of overfitting the model.
Cross-validation is a method for estimating the \(MSE_{test}\) using the training data. We go through this in a later lecture.
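A minimal sketch of the train/test comparison in R, with an arbitrary split of the built-in mtcars data (the model and the split are illustrative assumptions):

```r
# Sketch: training MSE versus test MSE for a simple linear model
set.seed(1)
idx   <- sample(nrow(mtcars), size = 22)  # roughly 2/3 for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)
mse_train <- mean((train$mpg - predict(fit, train))^2)
mse_test  <- mean((test$mpg  - predict(fit, test))^2)
c(train = mse_train, test = mse_test)
```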
Predictive variance
The variance of the difference between the actual value \(Y|x_0\) and the prediction at the point \(x_0\), i.e. the predictive (or prediction) error, is
\[V\left(Y|x_0-\hat{f}(x_0)\right):=E\left( \left(Y|x_0-\hat{f}(x_0)\right)^2\right) = \underbrace{V\left(\hat{f}(x_0)\right)}_A + \underbrace{\left( E\left[\hat{f}(x_0)\right]-f(x_0) \right)^2 }_B+ \underbrace{V(\varepsilon)}_C \]
A is the variance of the estimated function. It indicates how much the estimated function would vary if we had selected a different training data set.
B is the squared bias of the estimated function. It indicates whether there is a systematic error in the prediction at the point \(x_0\).
C is the variance of the random error \(\varepsilon\) (the error itself has expected value zero). This term is the irreducible error.
Bias-Variance trade-off
In statistical learning, one must find a balance between low predictive variance and low bias. This trade-off is well illustrated by Figure 2.9 in ISL.
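A small simulation sketch of the decomposition: the true function, noise level, sample size and model below are all assumptions chosen for illustration. Refitting the model to many simulated training sets gives estimates of terms A and B at a point \(x_0\), while C is known by construction:

```r
# Sketch: estimate variance (A) and squared bias (B) at x0 by simulation
set.seed(1)
f  <- function(x) sin(2 * x)  # assumed true function
x0 <- 1.5
preds <- replicate(500, {
  x <- runif(50, 0, 3)
  y <- f(x) + rnorm(50, sd = 0.3)
  fit <- lm(y ~ poly(x, 2))   # a fairly inflexible model
  predict(fit, newdata = data.frame(x = x0))
})
c(A = var(preds), B = (mean(preds) - f(x0))^2, C = 0.3^2)
```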
Model accuracy for classification
There are several types of errors in classification models.
Sensitivity and specificity
Assume a binary classification model, where the true states are \(Y\in \{True,False\}\) and the predicted states are \(\hat{f} \in \{+,-\}\). The two types of errors, and the two types of correct predictions, can be summarised in a confusion matrix:

|                | Predicted \(+\) | Predicted \(-\) |
|----------------|-----------------|-----------------|
| \(Y = True\)   | TP              | FN              |
| \(Y = False\)  | FP              | TN              |

- Erroneous predictions: FP is the frequency of False Positives, FN the frequency of False Negatives
- Correct predictions: TP is the frequency of True Positives, TN the frequency of True Negatives
Let \(n = TP+FN+FP+TN\) be the total number of observations in the data set. Useful performance metrics:
- Accuracy: \(\frac{TN+TP}{n}\), the overall proportion of correct predictions
- Sensitivity (true positive rate): \(\frac{TP}{TP+FN}\)
- Specificity (true negative rate): \(\frac{TN}{TN+FP}\)
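A minimal sketch of these metrics in R; the labels below are made up purely for illustration:

```r
# Sketch: confusion matrix and metrics from true and predicted classes
truth <- factor(c("True", "True", "True", "False", "False", "False"))
pred  <- factor(c("+", "-", "+", "+", "-", "-"))

cm <- table(truth, pred)  # rows: true state, columns: prediction
TP <- cm["True", "+"];  FN <- cm["True", "-"]
FP <- cm["False", "+"]; TN <- cm["False", "-"]

c(accuracy    = (TP + TN) / sum(cm),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP))
```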
If we want to distinguish and find a balance between the two types of errors we can use:
Receiver Operating Characteristic (ROC) curve
A classification model makes its classification by comparing a scoring variable \(S(X)\) to a threshold \(t\). The ROC methodology can help modellers choose a threshold that takes the trade-off between the two types of errors into account, or compare classifiers without choosing a threshold.
For each choice of \(t\), derive the specificity and sensitivity.
The ROC curve plots sensitivity against 1-specificity over the range of threshold values.
Here we exemplify the ROC methodology with the data set Binary Classification Prediction for type of Breast Cancer, downloaded from Kaggle. The response variable diagnosis describes a cancer as Malignant (M) or Benign (B). We illustrate using the mean radius and the mean compactness as predictors, one per classifier.
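The output below was generated with the pROC package. A minimal sketch of the kind of R code involved, where the file name breast_cancer.csv and the column names radius_mean and compactness_mean are assumptions for illustration:

```r
# Sketch: ROC analysis with pROC; file and column names are assumed
library(pROC)
bc <- read.csv("breast_cancer.csv")

roc_radius      <- roc(bc$diagnosis, bc$radius_mean)       # classifier 1
roc_compactness <- roc(bc$diagnosis, bc$compactness_mean)  # classifier 2

# Specificity and sensitivity for the first candidate thresholds:
head(coords(roc_radius, "all"))
head(coords(roc_compactness, "all"))
```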
- Mean radius as predictor

Setting levels: control = B, case = M
Setting direction: controls < cases

| threshold | specificity | sensitivity |
|-----------|-------------|-------------|
| -Inf      | 0.000000000 | 1           |
| 7.3360    | 0.002801120 | 1           |
| 7.7100    | 0.005602241 | 1           |
| 7.7445    | 0.008403361 | 1           |
| 7.9780    | 0.011204482 | 1           |
| 8.2075    | 0.014005602 | 1           |
- Mean compactness as predictor

Setting levels: control = B, case = M
Setting direction: controls < cases

| threshold | specificity | sensitivity |
|-----------|-------------|-------------|
| -Inf      | 0.000000000 | 1           |
| 0.021410  | 0.002801120 | 1           |
| 0.024970  | 0.005602241 | 1           |
| 0.026625  | 0.008403361 | 1           |
| 0.028955  | 0.011204482 | 1           |
| 0.031640  | 0.014005602 | 1           |
- ROC plot with both classifiers
A model whose curve lies above the 50:50 diagonal is better than tossing a coin.
The closer the curve gets to the upper-left corner, the better the classification model.
The measure Area Under the Curve (AUC) offers a way to compare the performance of classifiers without selecting a cutoff value.
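Continuing the sketch above, pROC provides auc() for this comparison, and the two curves can be overlaid in one plot:

```r
# Sketch: compare the two classifiers without choosing a threshold
plot(roc_radius)                     # ROC curve for mean radius
lines(roc_compactness, col = "red")  # overlay the mean compactness curve
auc(roc_radius)
auc(roc_compactness)
```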
Parametric vs non-parametric approaches
Models for statistical learning can be
- parametric, where prediction boils down to estimating parameters within a defined model structure, or
- non-parametric, where the model predicts by an algorithm responding to the training data, without estimating specific parameters.
It is useful to discuss the advantages and disadvantages of the two approaches: parametric models are typically easier to estimate and interpret, but fit poorly if the assumed form of \(f\) is far from the truth, while non-parametric models are more flexible, but require more data and are more prone to overfitting.
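A minimal sketch contrasting the two approaches in R, with a hand-rolled k-nearest-neighbours average standing in for a non-parametric method (the data, k and the prediction point are illustrative choices):

```r
# Parametric: lm estimates two parameters (intercept and slope)
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Non-parametric: predict by averaging the k nearest training responses
knn_predict <- function(x0, x, y, k = 5) {
  mean(y[order(abs(x - x0))[1:k]])
}

c(lm  = unname(predict(fit_lm, data.frame(wt = 3))),
  knn = knn_predict(3, mtcars$wt, mtcars$mpg))
```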
Comparison to unsupervised learning
In principle, unsupervised learning has inputs \(x_1,\ldots,x_n\) but no supervising response variable \(Y\).
The goals of unsupervised learning can be to find groups of similar observations (clustering) or to find a lower-dimensional representation that preserves most of the information in the data (dimension reduction).
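A minimal sketch of both goals in R, on the built-in iris measurements with the species labels ignored (so there is no response variable):

```r
# Sketch: clustering and dimension reduction without a response
X  <- iris[, 1:4]               # four numeric measurements only
km <- kmeans(X, centers = 3)    # clustering: partition into 3 groups
pc <- prcomp(X, scale. = TRUE)  # dimension reduction: principal components
table(km$cluster)
summary(pc)
```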
Study questions
What is the quantity of interest in predictive inference?
What are the quantities of interest in parametric inference?
Is there a difference between estimation and prediction? If so, what is the difference?
What is meant by training data in statistical learning?
Break the expression for the predictive variance \(V(\hat{Y}|x_0)\) into parts and describe which sources of variation are represented by the different parts.
What is the Residual Sum of Squares and what is it used for?
What are the differences between regression and classification when it comes to the response variable and how predictive errors are expressed quantitatively?
Explain the differences between categorical, continuous and discrete variables!
What are sensitivity and specificity, and what are they used for?
How can a ROC curve analysis be used to compare the performance of two classifiers?
What is the main difference between a parametric and non-parametric model for supervised learning?
Account for the differences between supervised and unsupervised statistical learning.