Lecture Linear regression and other alternatives

BERN02

Author

Ullrika Sahlin

Literature

ISL: 3.1, 3.2, 3.3, 7.6, 7.3, 7.4

Parametric approach to linear regression

Ordinary Least Squares (OLS) regression

  1. We select a linear model

\[f(x)=\beta_0+\beta_1 x_1 + \ldots + \beta_p x_p\]

  2. We estimate the parameters as those minimising the residual sum of squares (RSS).

Let \(\beta=(\beta_0,\ldots,\beta_p)\) and

\[Q(\beta):= Q(\beta_0,\ldots,\beta_p) = \sum_{i=1}^{n} \left(y_i - (\beta_0+\beta_1 x_{i1} + \ldots + \beta_p x_{ip})\right)^2\] then

\[\hat{\beta} = \underset{\beta}{\mathrm{argmin}} \ Q(\beta)\]

Note

Simple regression - a model with one independent variable, i.e. \(p=1\)

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
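The two closed-form estimators above can be computed directly. A minimal sketch in Python with NumPy (the lecture's own output comes from R; the data vectors here are made up for illustration):

```python
import numpy as np

# Hypothetical data, roughly linear
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates for simple regression (p = 1)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
```

The same estimates would be returned by any OLS routine, e.g. `np.polyfit(x, y, 1)`.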

Multiple regression - a model with several independent variables, i.e. \(p>1\)

Parameters estimated by matrix calculations, not shown here

Air pollution data set

Here we illustrate regression with an air pollution data set from a paper on ridge regression. Source: McDonald, G.C. and Schwing, R.C. (1973) ‘Instabilities of regression estimates relating air pollution to mortality’, Technometrics, vol.15, 463-482.

https://lib.stat.cmu.edu/datasets/pollution

Variables in order:

- PREC: Average annual precipitation in inches
- JANT: Average January temperature in degrees F
- JULT: Same for July
- OVR65: % of 1960 SMSA population aged 65 or older
- POPN: Average household size
- EDUC: Median school years completed by those over 22
- HOUS: % of housing units which are sound & with all facilities
- DENS: Population per sq. mile in urbanized areas, 1960
- NONW: % non-white population in urbanized areas, 1960
- WWDRK: % employed in white collar occupations
- POOR: % of families with income < $3000
- HC: Relative hydrocarbon pollution potential
- NOX: Same for nitric oxides
- SO2: Same for sulphur dioxide
- HUMID: Annual average % relative humidity at 1pm
- MORT: Total age-adjusted mortality rate per 100,000

Rows: 60 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): PREC, JANT, JULT, OVR65, POPN, EDUC, HOUS, DENS, NONW, WWDRK, POOR...


A simple linear regression on the association between proportion of poor families and mortality, together with a 95% confidence region for the line.

                 2.5 %     97.5 %
(Intercept) 798.542504 905.725804
POOR          2.554194   9.721911

Bias and precision

We want estimates or predictions to be unbiased and have high precision (i.e. low variance).

Statistical errors in simple linear regression

Under the assumptions of independent and identically distributed model errors, \(\varepsilon_i\), the variance of the errors can be estimated as

\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n e_i^2}{n-2}\]

where \(e_i=y_i-\hat{\beta}_0-\hat{\beta}_1 x_i\) are the residuals.
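This estimator can be checked numerically. A sketch in Python with SciPy on made-up data (assumed here for illustration; the lecture itself uses R): `scipy.stats.linregress` reports the standard error of the slope, which is \(\sqrt{\hat{\sigma}^2 / \sum (x_i-\bar{x})^2}\), so the two quantities can be cross-checked.

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([0.5, 1.1, 1.9, 3.0, 4.2, 5.1, 6.3, 7.0])
y = np.array([1.2, 2.1, 2.8, 4.5, 5.9, 6.8, 8.4, 9.1])
n = len(x)

res = stats.linregress(x, y)
e = y - (res.intercept + res.slope * x)   # residuals
sigma2_hat = np.sum(e ** 2) / (n - 2)     # estimate of the error variance
```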

The variance of the slope is

\[V(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\]

If the sample size \(n\) is sufficiently large, then plugging in the estimate \(\hat{\sigma}^2\) for \(\sigma^2\) we can derive a frequentist two-sided \((1-\alpha)\) confidence interval as

\[I_{\beta_1}: \hat{\beta}_1 \pm \lambda_{\alpha}\cdot \sqrt{\hat{V}(\hat{\beta}_1)}\]

The value of the quantile \(\lambda_{\alpha}\) is chosen according to the desired level of confidence. In the book, the authors use 2 and refer to the resulting interval as a 95% confidence interval.

Quantiles from a normal distribution for a two-sided confidence interval

Confidence level   \(\alpha/2\)   \(\lambda_{\alpha}\)
99%                0.5%           2.58
95%                2.5%           1.96
90%                5%             1.64
80%                10%            1.28
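The tabulated values are upper \(\alpha/2\) quantiles of the standard normal distribution and can be reproduced with any statistics library. A sketch using SciPy (assumed available; the lecture itself uses R, where `qnorm()` gives the same values):

```python
from scipy import stats

# lambda_alpha is the upper alpha/2 quantile of N(0, 1)
lambdas = {conf: stats.norm.ppf(1 - (1 - conf) / 2)
           for conf in (0.99, 0.95, 0.90, 0.80)}
for conf, lam in lambdas.items():
    print(f"{conf:.0%}: lambda = {lam:.2f}")
```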

The variance for the intercept is

\[V(\hat{\beta}_0) = \sigma^2\left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \right]\]

The variance of the expected value of the response variable given that \(x=x_0\) is

\[V(\hat{\mu}(x_0)) = V(\hat{\beta}_0+\hat{\beta}_1x_0) = \sigma^2\left[ \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \right]\]

A two-sided confidence interval for the expected value of the response given that \(x=x_0\) is

\[\hat{\mu}(x_0) \pm \lambda_{\alpha}\cdot \sqrt{\hat{V}(\hat{\mu}(x_0))}\]

The variance of a predicted value of the response variable given that \(x=x_0\) is

\[V(\hat{y}(x_0)) = V(\hat{\beta}_0+\hat{\beta}_1x_0+\varepsilon) = \sigma^2\left[ \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} +1\right] \]

A two-sided prediction interval for a new observation of the response given that \(x=x_0\) is

\[\hat{y}(x_0) \pm \lambda_{\alpha}\cdot \sqrt{\hat{V}(\hat{y}(x_0))}\]
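The two variance formulas differ by exactly \(\sigma^2\): the prediction variance adds the noise of a new observation on top of the uncertainty in the estimated mean. A numerical sketch in Python with made-up data (illustrative only):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.3, 2.9, 4.2, 5.8, 7.1, 8.4])
n = len(x)

res = stats.linregress(x, y)
sigma2 = np.sum((y - res.intercept - res.slope * x) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)

x0 = 3.5
v_mean = sigma2 * (1 / n + (x0 - x.mean()) ** 2 / Sxx)      # variance of the fitted mean
v_pred = sigma2 * (1 / n + (x0 - x.mean()) ** 2 / Sxx + 1)  # variance of a new observation
```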

Make sure you understand the difference between the variance of the estimate of an expected value and the variance of a prediction of a new observation.

Non-parametric approach to regression

Seek an estimate of the unknown function \(f\) that gets close to the data points without being too rough or wiggly.

Note

Non-parametric approaches require more observations compared to parametric approaches

Local regression (LOESS)

Local regression is a non-parametric method that fits a separate weighted regression at each point in a grid over the predictor space.

  1. Select a value of the x-variable, \(x=x_0\)

  2. Select the \(k\) nearest points to \(x_0\)

  3. Assign a weight \(K_{i0}\) to each point

  4. Fit a weighted simple OLS regression by minimising

\[\sum_{i=1}^{n} K_{i0} (y_i-\beta_0-\beta_1x_{i})^2\]

  5. The fitted value at \(x_0\) is \[\hat{y}|x_0 = \hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0\]

  6. Repeat for different values of \(x_0\)
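The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the exact LOESS implementation used in R or in ISL: the tricube weight function and \(k=15\) neighbours are assumptions made here for concreteness, and the data are simulated.

```python
import numpy as np

# Simulated data from a linear relationship with small noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 60))
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.1, 60)

def loess_point(x0, x, y, k=15):
    # Steps 2-3: find the k nearest neighbours and give them tricube weights
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    u = d[idx] / d[idx].max()
    w = (1.0 - u ** 3) ** 3
    # Step 4: weighted simple OLS via the weighted normal equations
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    # Step 5: fitted value at x0
    return beta[0] + beta[1] * x0

fit = loess_point(5.0, x, y)
```

Step 6 amounts to calling `loess_point` over a grid of \(x_0\) values and joining the fitted values into a curve.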

Figure 7.9 from ISL

Spline regression

Fit a line in each region of the predictor space defined by knots, requiring continuity at each knot.
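One way to fit such a piecewise-linear spline is with a truncated power basis, where continuity at the knot holds by construction. A sketch in Python with NumPy on simulated data (the knot location and coefficients are made-up illustrations):

```python
import numpy as np

# Simulated data: continuous piecewise-linear truth, slope change of 1.5 at the knot
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, 80))
knot = 5.0
y = 1.0 + 0.8 * x + 1.5 * np.maximum(x - knot, 0.0) + rng.normal(0.0, 0.1, 80)

# Truncated power basis for a linear spline with one knot:
# columns 1, x, (x - knot)_+  -- the fitted function is continuous at the knot
X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The third coefficient is the change in slope at the knot; the slope left of the knot is `beta[1]` and right of it `beta[1] + beta[2]`.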

Study questions

  1. What is the objective function for estimating parameters in OLS regression?

  2. What assumptions are made about residuals in OLS regression?

  3. Describe a way to estimate the variance of model errors in OLS regression.

  4. What is the difference between bias and precision of an estimate or prediction?

  5. What assumptions are made to construct a frequentist confidence interval for a parameter in OLS regression?

  6. What is the formula for a frequentist confidence interval for the expected value of the response in OLS regression?

  7. What is the objective function for estimating parameters in Local regression?

  8. What is the main difference between regression with splines and local regression?