Lecture Model selection and Shrinkage methods

BERN02

Author

Ullrika Sahlin

Literature

ISL: 6.1 and 5.2

Best subset selection

Performance metrics

Residual Sum of Squares

\[RSS = \sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \]

Total Sum of Squares

\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\]

Coefficient of determination (“R squared”) \(R^2\)

\[R^2 = 1-\frac{RSS}{TSS}\]

\(R^2\) never decreases when more parameters are added, so it tends to favour the largest model. When comparing models with different numbers of parameters \(d\) we can therefore introduce a penalty for model size and instead use the Adjusted \(R^2\)

\[\text{Adjusted } R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\]
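
As a small illustration, here is a sketch (in Python, with a simulated dataset standing in for real data) of how RSS, TSS, \(R^2\) and adjusted \(R^2\) can be computed from a fitted linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: n observations, d predictors (hypothetical example)
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = 2 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

fit = LinearRegression().fit(X, y)
y_hat = fit.predict(X)

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
print(r2, adj_r2)
```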

Akaike’s Information Criterion

\[AIC = -2 \ln(L)+2d\]
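
For a least-squares fit, statsmodels reports the maximised log-likelihood and the AIC directly; a minimal sketch, reusing the simulated X and y from the sketch above (note that texts differ slightly in how the number of parameters \(d\) is counted):

```python
import statsmodels.api as sm

# OLS fit with an intercept; X and y are the simulated data from the sketch above
ols_fit = sm.OLS(y, sm.add_constant(X)).fit()
print(ols_fit.llf)  # ln(L), the maximised Gaussian log-likelihood
print(ols_fit.aic)  # -2 ln(L) + 2d, with d = number of estimated regression coefficients
```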

Validation set error \(MSE_{test}\) or Cross-validation error \(MSE_{CV}\)
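
A cross-validation estimate \(MSE_{CV}\) of the test error can be obtained with scikit-learn; a minimal sketch on the same simulated data (the choice of 10 folds is arbitrary):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 10-fold cross-validation; scikit-learn reports negative MSE, so flip the sign
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          cv=10, scoring="neg_mean_squared_error").mean()
print(cv_mse)
```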

Other measures mentioned in the ISL book, which we will not go into further:

Mallows’ \(C_p\)

\[C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)\]

Bayesian Information Criterion

\[BIC = \frac{1}{n}(RSS + \log(n)\,d\,\hat{\sigma}^2)\]

The importance of visualising data and the model

Performance metrics of goodness-of-fit are not, on their own, enough to evaluate a model.

This is illustrated by Anscombe’s quartet: four data sets with the same descriptive measures (sample means, correlation coefficient, estimated regression line, coefficient of determination) but with very different patterns when plotted.

In addition, if hypothesis testing is done, one also has to check model assumptions e.g. with residual analysis.

Here our focus is on building a predictive model, so that is less important.
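
As a quick way to see this, seaborn ships Anscombe’s quartet as a built-in example dataset; a minimal plotting sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet: four data sets with (almost) identical summary statistics
anscombe = sns.load_dataset("anscombe")
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, height=2.5)
plt.show()
```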

Model selection

  • Best subset selection

  • Forward selection

  • Backward selection
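
As an illustration of the stepwise idea, here is a sketch of forward selection using scikit-learn’s SequentialFeatureSelector, scored by cross-validated MSE (the simulated data and the number of predictors to select are assumptions):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward selection: add one predictor at a time, keeping the addition that
# improves cross-validated MSE the most, until two predictors are selected
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2,
    direction="forward", scoring="neg_mean_squared_error", cv=5)
selector.fit(X, y)  # same simulated X, y as in the earlier sketches
print(selector.get_support())  # boolean mask of the selected predictors
```

Setting direction="backward" gives backward selection instead; best subset selection would require fitting all \(2^p\) candidate models.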

Shrinkage methods

These methods add a penalty for large coefficient values to the objective function.

Ridge regression

The objective function is the RSS with the L2-norm as a basis for the shrinkage penalty

\[\sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda \sum_{j=1}^{p}\beta_j^2\]

L2 norm

\[\|(\beta_1,\ldots,\beta_p)\|_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\]

\(\lambda\) is a tuning parameter. One can select \(\lambda\) by using cross-validation.

Here it is important to standardise the independent variables, since the penalty makes the coefficient estimates depend on the scale of each predictor, e.g. using this formula

\[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
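
A sketch of ridge regression in scikit-learn, standardising the predictors and selecting \(\lambda\) (called alpha in scikit-learn) by cross-validation; the grid of candidate values is an arbitrary assumption:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# Standardise the predictors, then fit ridge regression over a grid of
# penalty values and keep the one with the best cross-validation score
lambdas = np.logspace(-3, 3, 50)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=lambdas))
ridge.fit(X, y)  # same simulated X, y as in the earlier sketches
print(ridge.named_steps["ridgecv"].alpha_)  # selected penalty
print(ridge.named_steps["ridgecv"].coef_)   # shrunken coefficients
```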

Lasso regression

Same as Ridge regression, but the objective function is the RSS with the L1-norm as a basis for the shrinkage penalty.

\[\sum_{i=1}^{n} \left( y_i-\beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda \sum_{j=1}^{p}|\beta_j|\]

L1 norm

\[\|(\beta_1,\ldots,\beta_p)\|_1 = \sum_{j=1}^p|\beta_j|\]
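
The corresponding sketch for the lasso: LassoCV chooses \(\lambda\) by cross-validation over its own grid of values, and unlike ridge it typically sets some coefficients exactly to zero:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Standardised predictors again; LassoCV builds its own grid of penalty values
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
lasso.fit(X, y)  # same simulated X, y as in the earlier sketches
print(lasso.named_steps["lassocv"].alpha_)  # selected penalty
print(lasso.named_steps["lassocv"].coef_)   # note the coefficients that are exactly zero
```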

Study questions

  1. When can model selection be needed?

  2. Describe the algorithms for best subset selection, forward selection and backward selection.

  3. Describe at least three performance metrics that can be used in model selection and compare them to each other.

  4. What are ridge regression and lasso regression, and why can they be useful when there are many predictor variables?