Exercise Model selection

BERN02

Author

Ullrika Sahlin

Format

Work individually or in pairs. If you work in a pair, each person hands in their own report; state at the beginning of the file you hand in whom you have worked with.

Time needed

4 hours

Grades

Pass or Fail

Programming language

You can use the language you prefer, but I recommend using R or Python. Create a report using a Jupyter notebook or Quarto.

Cross validation

We are going to work with the air pollution data set ‘pollution_cleaneddata.csv’ and use resampling methods to evaluate the predictive performance of alternative model choices.

Load the data ‘pollution_cleaneddata.csv’
Specify a polynomial regression model with degree \(p\) that predicts mortality as a function of the average July temperature in degrees F.

\[y_i = \beta_0 + \beta_1x_{i}+ \beta_2x_{i}^2 + \ldots + \beta_px_{i}^p+\varepsilon_i\]
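As a starting point, the model above could be fitted by least squares roughly as follows. This is only a sketch: the data are synthetic stand-ins (the column names in the CSV file are not given here), and the names `temp`, `mort` and `fit_polynomial` are made up for the illustration. `Polynomial.fit` from NumPy rescales \(x\) internally, which keeps the fit numerically stable for higher degrees.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the course data):
# July temperature in degrees F and a made-up mortality response.
n = 60
temp = rng.uniform(60, 85, size=n)
mort = 900 + 3.0 * (temp - 72) + 0.2 * (temp - 72) ** 2 + rng.normal(0, 40, size=n)


def fit_polynomial(x, y, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bp*x^p."""
    return Polynomial.fit(x, y, deg=degree)


model = fit_polynomial(temp, mort, degree=2)

# Coefficients beta_0, ..., beta_p on the original x scale (lowest order first).
beta = model.convert().coef

# Fitted values and the mean square error on the data used for fitting.
pred = model(temp)
mse = float(np.mean((mort - pred) ** 2))
print("coefficients:", beta)
print("MSE on the fitted data:", mse)
```

With the real data you would replace `temp` and `mort` by the corresponding columns of the data set.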

Validation set approach

Set a seed for the random number generator.
Split the data into two equal-sized sets, one for training and one for testing.
For polynomial models with degree 1 up to 4, derive the mean square error of prediction for the training and testing data sets, respectively.

Present the results in a plot with polynomial degree on the x-axis and Mean Square Error on the y-axis (example provided below).

Judging from the graph you just generated, which polynomial degree would you recommend if the goal is to have a small variance of new predictions?
Repeat the procedure ten times but with other random seeds. Do you get similar results for other seeds?
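The steps of the validation set approach could be sketched as below. Again the data are synthetic stand-ins, and `validation_set_mse` is a hypothetical helper written for this illustration, not something provided by the course. The plot is left out; the dictionary returned for one seed contains the numbers you would put on the y-axis.

```python
import numpy as np
from numpy.polynomial import Polynomial


def validation_set_mse(x, y, degrees, seed):
    """Return {degree: (training MSE, testing MSE)} for one random 50/50 split."""
    rng = np.random.default_rng(seed)   # seed the random number generator
    idx = rng.permutation(len(x))       # shuffle the row indices
    half = len(x) // 2
    train, test = idx[:half], idx[half:]
    out = {}
    for p in degrees:
        model = Polynomial.fit(x[train], y[train], deg=p)
        mse_train = float(np.mean((y[train] - model(x[train])) ** 2))
        mse_test = float(np.mean((y[test] - model(x[test])) ** 2))
        out[p] = (mse_train, mse_test)
    return out


# Synthetic stand-in data (NOT the course data).
data_rng = np.random.default_rng(0)
temp = data_rng.uniform(60, 85, size=60)
mort = 900 + 3.0 * (temp - 72) + 0.2 * (temp - 72) ** 2 + data_rng.normal(0, 40, size=60)

degrees = [1, 2, 3, 4]

# Repeat the whole procedure for ten different seeds.
results_by_seed = {seed: validation_set_mse(temp, mort, degrees, seed)
                   for seed in range(10)}

for seed, res in results_by_seed.items():
    best = min(degrees, key=lambda p: res[p][1])  # degree with the lowest testing MSE
    print(f"seed {seed}: lowest testing MSE at degree {best}")
```

The training error can only go down as the degree increases, since the models are nested, so it is the testing error that tells you something about new predictions.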

K-fold cross-validation

Select a polynomial degree, e.g. \(p=2\). Estimate the variance of predictions using the K-fold cross-validation approach where you hold out \(K=3\) sets.
Repeat the procedure ten times but with other random seeds. Do you get similar results for other seeds?
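One common way to write the K-fold estimate is as the average of the mean square errors obtained on the \(K\) held-out sets, \(CV_{(K)} = \frac{1}{K}\sum_{k=1}^{K} MSE_k\). A minimal sketch of this, assuming the same kind of synthetic stand-in data as before and a hypothetical helper `kfold_mse`:

```python
import numpy as np
from numpy.polynomial import Polynomial


def kfold_mse(x, y, degree, k, seed):
    """K-fold cross-validation estimate of the mean square error of prediction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))      # shuffle before splitting into folds
    folds = np.array_split(idx, k)     # k index sets of (nearly) equal size
    fold_mse = []
    for i in range(k):
        test = folds[i]                # hold out fold i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Polynomial.fit(x[train], y[train], deg=degree)
        fold_mse.append(np.mean((y[test] - model(x[test])) ** 2))
    return float(np.mean(fold_mse))    # average over the k held-out sets


# Synthetic stand-in data (NOT the course data).
data_rng = np.random.default_rng(0)
temp = data_rng.uniform(60, 85, size=60)
mort = 900 + 3.0 * (temp - 72) + 0.2 * (temp - 72) ** 2 + data_rng.normal(0, 40, size=60)

# Degree p = 2, K = 3, repeated for ten different seeds.
cv_estimates = [kfold_mse(temp, mort, degree=2, k=3, seed=s) for s in range(10)]
print("mean over seeds:", np.mean(cv_estimates))
print("sd over seeds:  ", np.std(cv_estimates, ddof=1))
```

Comparing the spread of `cv_estimates` with the spread of the testing MSE from the validation set approach is one way to answer the question about sensitivity to the seed.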

The Bootstrap

Use the bootstrap to estimate the standard error of the slope in the Poisson regression of bird counts over time.

Load the data set ‘bird_count.csv’

Retrieve the Poisson model that you fitted with maximum likelihood earlier in the course.

Let \(Y|x\) be the number of birds counted in year \(x\).

\[Y|x_i \sim Po(\lambda(x_i))\]

The \(\log\) of the intensity is a linear model with an intercept \(\beta_0\) and a slope \(\beta_1\), with year as the predictor \(x\).

\[\log(\lambda(x_i)) = \beta_0 + \beta_1x_i\]

Use the bootstrap to approximate the standard error of the estimate of the slope parameter.


Call:
glm(formula = count ~ yr, family = poisson, data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 67.17921   46.70014   1.439    0.150
yr          -0.03244    0.02329  -1.393    0.164

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 21.069  on 12  degrees of freedom
Residual deviance: 19.128  on 11  degrees of freedom
AIC: 73.365

Number of Fisher Scoring iterations: 4

When taking the standard deviation of the bootstrap sample of the slope parameter I get 0.034212.
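The bootstrap procedure could be sketched as follows. Several things here are assumptions made for the illustration: the counts are synthetic (not ‘bird_count.csv’), the resampling is a nonparametric bootstrap of (year, count) pairs, which is only one of several possible bootstrap schemes, and `fit_poisson` is a hypothetical helper that maximises the Poisson likelihood with Newton's method instead of calling `glm`. The year is centred before fitting; this changes the intercept but not the slope.

```python
import numpy as np


def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Maximum likelihood fit of log(lambda) = b0 + b1*x; returns array [b0, b1]."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.array([np.log(np.mean(y)), 0.0])  # start at the intercept-only fit
    for _ in range(n_iter):
        lam = np.exp(X @ beta)
        grad = X.T @ (y - lam)                  # score vector
        hess = X.T @ (X * lam[:, None])         # Fisher information
        step = np.linalg.solve(hess, grad)      # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(2024)

# Synthetic stand-in data (NOT the course data): 13 hypothetical survey years.
yr = np.arange(2000, 2013)
yr_c = yr - yr.mean()                           # centred year
counts = rng.poisson(np.exp(2.0 - 0.03 * yr_c))

beta_hat = fit_poisson(yr_c, counts)            # fit to the original sample

B = 1000                                        # number of bootstrap samples
n = len(counts)
boot_slopes = np.empty(B)
for b in range(B):
    i = rng.integers(0, n, size=n)              # resample rows with replacement
    boot_slopes[b] = fit_poisson(yr_c[i], counts[i])[1]

# Bootstrap standard error: standard deviation of the bootstrap slopes.
se_boot = float(boot_slopes.std(ddof=1))
print("slope estimate:", beta_hat[1])
print("bootstrap standard error of the slope:", se_boot)
```

Because the data are synthetic, the numbers printed here are not comparable with the output above; with the real data, compare your bootstrap standard error with the `Std. Error` reported for `yr`.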

Submit lab report on Canvas

Write your code so that it is clearly documented and readable for someone other than yourself. We recommend integrating sections of code (R or Python) with sections of text using Markdown. For example:

Python in Jupyter Notebook on Google Colab (or installed on your own computer)
R in Quarto on posit.cloud (or with R and RStudio installed on your computer)

Write your name and date in the heading of the report and, if applicable, the name of your collaborator.
Save your report as a PDF.
Upload the report in the assignment Exercise: Resampling on Canvas.