library(readr)
library(dplyr)
library(ggplot2)
The R-version of what we did in Introduction to useful functions in Excel
MVEN10
Preparations
Load packages
Load data and save the variable to an object called x
= as_tibble(read_csv("../data/breast-cancer.csv"))%>% select(radius_mean) df
Descriptive statistics
summary(df)
radius_mean
Min. : 6.981
1st Qu.:11.700
Median :13.370
Mean :14.127
3rd Qu.:15.780
Max. :28.110
= df$radius_mean
x quantile(x,probs=0.25)
25%
11.7
median(x)
[1] 13.37
quantile(x,probs=0.95)
95%
20.576
mean(x)
[1] 14.12729
sd(x)
[1] 3.524049
min(x)
[1] 6.981
max(x)
[1] 28.11
length(x)
[1] 569
Calculate the three summary statistics described in the green area of the sheet.
- The third quartile in the sample, P75
quantile(x,probs=0.75)
75%
15.78
- The 5% quantile (or 5th percentile), P05
quantile(x,probs=0.05)
5%
9.5292
- The coefficient of variation is the ratio between the sample standard deviation and the sample mean
sd(x)/mean(x)*100
[1] 24.94497
Histogram
hist(x)
%>%
df ggplot(aes(x=x))+
geom_histogram()
%>%
df ggplot(aes(x=x))+
geom_histogram(binwidth = 2.5)
%>%
df ggplot(aes(x=x))+
geom_density()
Probability functions
The probability functions follow the principles of combining p, d, q and r with the name (or short name) of the probability distributions.
What to calculate | R-function |
---|---|
CDF | pnorm |
dnorm | |
quantile | qnorm |
random draw | rnorm |
Calculate the probability that a normally distributed variable with mean 14 and standard deviation 3.5 is less than 10
pnorm(10,mean=14,sd=3.5)
[1] 0.126549
Find the 95% quantile in the same distribution
qnorm(0.95,mean=14,sd=3.5)
[1] 19.75699
Calculate the probability that an exponentially distributed variable with mean 14 is less than 10
pexp(10,rate=1/14)
[1] 0.5104583
Type a question mark before the function to see the help text ?pexp
Plot probability distributions
=14
m=3.5
sdata.frame(pp=ppoints(100)) %>%
mutate(x=qnorm(pp,m,s)) %>%
mutate(d=dnorm(x,m,s)) %>%
ggplot(aes(x=x,y=d))+
geom_line()+
xlab('value')+
ylab('density')
If you feel you have the time or do another time:
Copy the sheet and refine the grid by using pp-values from 0.001 to 0.999.
=14
m=3.5
sdata.frame(pp=ppoints(1000)) %>%
mutate(x=qnorm(pp,m,s)) %>%
mutate(d=dnorm(x,m,s)) %>%
ggplot(aes(x=x,y=d))+
geom_line()+
xlab('value')+
ylab('density')
Random sampling
All sample generators start with a random number between 0 and 1. This is also a sample from a uniform distribution.
runif(1)
[1] 0.9704786
Type a function that generates a uniform random number in the interval 1 to 6.
runif(1,min=1,max=6)
[1] 5.595887
A random draw from a probability distribution can be generated by the inverse method. - Draws pp-values from a uniform distribution between 0 and 1 - Transform them into quantiles of the target distribution
Generates random draws from a normal distribution using the inverse method
qnorm(runif(1),m,s)
[1] 13.14799
This is already implemented as a function
rnorm(1,m,s)
[1] 18.37606
Draw from a beta distribution with parameters \(\alpha=2\) and \(\beta=8\)
rbeta(1,2,8)
[1] 0.2157914
Compare descriptive statistics against theoretical values
Wow - now we can generate data where we know the true value on parameters and all theoretical probabilities and quantiles, and compare with what we get when deriving descriptive statistics from the random sample.
This sheet generates a random sample of size 20 from a beta distribution.
rbeta(n=20,2,8)
[1] 0.21436125 0.24865058 0.07527012 0.05475960 0.17566076 0.05934510
[7] 0.25065048 0.05348027 0.33044344 0.23641477 0.08971576 0.17476356
[13] 0.19743103 0.15513189 0.45292185 0.03770521 0.27509703 0.04703781
[19] 0.19567211 0.18022758
A beta distribution has two parameters \(\alpha\) and \(\beta\)
The expected value of a beta distributed variable is \(\frac{\alpha}{\alpha+\beta}\)
Compare the calculated sample average to the theoretical expected value
=2
alpha=8
beta/(alpha+beta) alpha
[1] 0.2
mean(rbeta(n=20,alpha,beta))
[1] 0.1860472
We can also derive the theoretical quantile, let us say the P95.
Compare the quantile from the sample with the quantile calculated from the inverse probability distribution function
qbeta(0.95,alpha,beta)
[1] 0.4291355
quantile(rbeta(n=20,alpha,beta),probs=0.95)
95%
0.4128793
- Which of them has the smallest difference? Why do you think it is like that?
If you feel you have the time or do another time:
Explore what happens with the difference between theoretical and statistical values when you increase sample size from 20 to a high number (close to 1000)?
Below I wrte a script where sample size is controlled at one place. The P95 is approximated fairly well by the sampling when I use \(n=10 000\).
=2
alpha=8
beta=10000
n/(alpha+beta) alpha
[1] 0.2
mean(rbeta(n=n,alpha,beta))
[1] 0.200594
qbeta(0.95,alpha,beta)
[1] 0.4291355
quantile(rbeta(n=n,alpha,beta),probs=0.95)
95%
0.4327183
Let us visualise the convergence of the approximation of the mean and 95th percentile of the beta distribution using Monte Carlo simulation.
The code below defines a function which calculates the mean and P95 after every increase of the sample size and plots the convergence.
<- function(n){
plot_conv =cummean(rbeta(n=n,alpha,beta))
sample_mean=unlist(lapply(1:n,function(iter){
sample_P95quantile(rbeta(n=iter,alpha,beta),probs=0.95)}))
data.frame(values=c(sample_mean,sample_P95),n=rep(1:n,2), statistic=rep(c("mean","P95"),each=n)) %>%
ggplot(aes(x=n,y=values,color=statistic))+
geom_line()+
geom_hline(yintercept = alpha/(alpha+beta)) +
geom_hline(yintercept = qbeta(0.95,alpha,beta))
}
We start with \(n=10\)
plot_conv(n=10)
..increase to \(n=100\)
plot_conv(n=100)
..increase to \(n=1000\)
plot_conv(n=10^3)
..and finally \(n=10000\)
plot_conv(n=10^4)