Lecture Model selection and Shrinkage methods
BERN02
Literature
ISL: 6.1 and 5.2
Best subset selection
Performance metrics
Residual Sum of Squares
\[RSS = \sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \]
Total Sum of Squares
\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\]
Coefficient of determination (“R squared”) \(R^2\)
\[R^2 = 1-\frac{RSS}{TSS}\]
\(R^2\) never decreases when more parameters are added to the model. When comparing models of different sizes we can therefore introduce a penalty on the number of parameters \(d\) and instead use the adjusted \(R^2\)
\[\text{Adjusted } R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\]
Akaike’s Information Criterion
\[AIC = -2 \ln(L)+2d\]
where \(L\) is the maximised likelihood and \(d\) is the number of parameters.
Validation set error \(MSE_{test}\) or Cross-validation error \(MSE_{CV}\)
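As an illustration, here is a minimal Python sketch (not from the lecture) that computes RSS, TSS, \(R^2\), adjusted \(R^2\), a Gaussian AIC and a validation-set MSE for an ordinary least-squares fit. The data and variable names are made up for the example.

```python
# Sketch: goodness-of-fit metrics for an OLS fit; data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

fit = LinearRegression().fit(X_train, y_train)
resid = y_train - fit.predict(X_train)

d = X_train.shape[1]            # number of predictors
n_train = X_train.shape[0]
RSS = np.sum(resid ** 2)
TSS = np.sum((y_train - y_train.mean()) ** 2)
R2 = 1 - RSS / TSS
adj_R2 = 1 - (RSS / (n_train - d - 1)) / (TSS / (n_train - 1))

# Gaussian log-likelihood at the MLE, so AIC = -2 ln(L) + 2d
# (up to how one counts the intercept and the error variance among the parameters)
logL = -n_train / 2 * (np.log(2 * np.pi * RSS / n_train) + 1)
AIC = -2 * logL + 2 * d

MSE_test = np.mean((y_test - fit.predict(X_test)) ** 2)
print(R2, adj_R2, AIC, MSE_test)
```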
Other measures mentioned in the ISL book but that we will not dive more into:
Mallows' \(C_p\)
\[C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)\]
Bayesian Information Criterion
\[BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right)\]
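For completeness, a small sketch of the \(C_p\) and BIC formulas above as code; RSS, \(n\), \(d\) and the error-variance estimate \(\hat{\sigma}^2\) are assumed to come from a fitted model, and the numbers below are placeholders.

```python
# Sketch: Mallows' C_p and BIC on the scaled form used above.
# The inputs (RSS, n, d, sigma2_hat) are placeholders for illustration.
import numpy as np

def cp_bic(RSS, n, d, sigma2_hat):
    Cp = (RSS + 2 * d * sigma2_hat) / n
    BIC = (RSS + np.log(n) * d * sigma2_hat) / n
    return Cp, BIC

print(cp_bic(RSS=50.0, n=100, d=3, sigma2_hat=0.6))
```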
The importance of visualising data and the model
Goodness-of-fit metrics alone are not enough to evaluate a model.
This is illustrated by Anscombe's quartet: four data sets with the same descriptive measures (sample means, correlation coefficient, estimated regression line, coefficient of determination) but with very different patterns when plotted.
In addition, if hypothesis testing is done, one also has to check the model assumptions, e.g. with residual analysis.
Since our focus here is on building a predictive model, that is less important.
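A minimal plotting sketch along these lines, assuming matplotlib and scikit-learn are available; the data are simulated only to illustrate the idea of looking at the data together with the fitted model.

```python
# Sketch: plot the data together with the fitted regression line, the kind of
# check that summary statistics alone (as in Anscombe's quartet) cannot replace.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=60)
y = 1 + 0.8 * x + rng.normal(scale=2.0, size=60)

fit = LinearRegression().fit(x.reshape(-1, 1), y)
grid = np.linspace(0, 10, 100).reshape(-1, 1)

plt.scatter(x, y, label="data")
plt.plot(grid, fit.predict(grid), label="fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```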
Model selection
Best subset selection
Forward selection (a code sketch follows below)
Backward selection
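The sketch below illustrates forward selection in Python, scoring each candidate model by cross-validated MSE. This is a simplification of the algorithm in ISL, where the variable added at each step is the one that most reduces the training RSS and model sizes are then compared with CV or an adjusted criterion; data and settings are made up.

```python
# Sketch: greedy forward selection scored by cross-validated MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 120, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=n)

selected, remaining = [], list(range(p))
path = []                                        # (variables, CV MSE) per step

while remaining:
    scores = []
    for j in remaining:
        cols = selected + [j]
        mse = -cross_val_score(LinearRegression(), X[:, cols], y,
                               scoring="neg_mean_squared_error", cv=5).mean()
        scores.append((mse, j))
    best_mse, best_j = min(scores)               # variable that helps the most
    selected.append(best_j)
    remaining.remove(best_j)
    path.append((list(selected), best_mse))

# Pick the model size with the smallest cross-validated MSE
best_vars, best_mse = min(path, key=lambda t: t[1])
print(best_vars, best_mse)
```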
Shrinkage methods
A penalty on large coefficient values is added to the objective function.
Ridge regression
The objective function is the RSS plus a shrinkage penalty based on the L2 norm:
\[\sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda \sum_{j=1}^{p}\beta_j^2\]
L2 norm
\[\|\beta\|_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\]
\(\lambda \geq 0\) is a tuning parameter; it can be selected using cross-validation.
Here it is important to standardise the independent variables, using the formula
\[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
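A minimal ridge sketch in scikit-learn, where the penalty parameter \(\lambda\) is called alpha; the predictors are standardised in a pipeline and the penalty is chosen by cross-validation over an illustrative grid. Data and settings are made up.

```python
# Sketch: ridge regression with standardised predictors and
# the tuning parameter chosen by cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p)) * rng.uniform(0.5, 5.0, size=p)   # predictors on different scales
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

# StandardScaler also centres each column; with an unpenalised intercept this
# only shifts the intercept, and the scaling matches the formula in the notes.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5),
)
model.fit(X, y)

print(model.named_steps["ridgecv"].alpha_)     # selected tuning parameter
print(model.named_steps["ridgecv"].coef_[:5])  # shrunken coefficients
```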
Lasso regression
Same as ridge regression, but the objective function is the RSS plus a shrinkage penalty based on the L1 norm:
\[\sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda \sum_{j=1}^{p}|\beta_j|\]
L1 norm
\[\|\beta\|_1 = \sum_{j=1}^p|\beta_j|\]
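A corresponding lasso sketch; with the L1 penalty many coefficients are driven exactly to zero, so the lasso also performs variable selection. Data and settings are illustrative.

```python
# Sketch: lasso with the tuning parameter chosen by cross-validation;
# note how many coefficients end up exactly zero.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # only two true predictors

model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

coef = model.named_steps["lassocv"].coef_
print(model.named_steps["lassocv"].alpha_)           # selected tuning parameter
print(np.sum(coef != 0), "non-zero coefficients out of", p)
```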
Study questions
When can model selection be needed?
Describe the algorithms for best subset selection, forward selection and backward selection
Describe at least three performance metrics that can be used in model selection and compare them to each other
What are ridge regression and lasso regression, and why can they be useful when there are many predictor variables?