---
title: "Lab 6: Probability and Sampling"
author: "Fahd Alhazmi"
output:
  html_document:
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
We will talk more about probability distributions in R. The reason why we need to talk about probability and sampling is simple: descriptive statistics (such as the mean and correlation) are derived from a given sample. However, in almost all areas of science, we are interested in estimating those quantities for the general population and not just for the sample at hand. In other words, we need to use inferential tools that allow us to generalize from a given sample to the larger population.
Hopefully you already know some basics of probability from the lectures. Here, I want to introduce how to use the power of simulations to calculate different probabilities.
Throughout this course, we will deal with only five main probability distributions: the binomial distribution, the normal distribution, the chi-squared distribution, the t-distribution, and the F-distribution.
# The binomial distribution
Let's go back to coin flips. You have N = 2 coins, and those coins are fair. What is the probability of getting exactly one head? Let's enumerate the possible outcomes, writing H for heads and T for tails: HH, HT, TH, TT. You get exactly one head in two out of the four possible outcomes, so the probability is 2/4 = 1/2.
Now let's do 3 coins. What is the probability of getting exactly one head? We first enumerate the outcomes: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT. We have 8 possible outcomes, of which only 3 have exactly one head. So the probability is 3/8.
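Rather than enumerating by hand, we can let R do the counting. Here is a quick sketch using `expand.grid` (the column names `flip1` through `flip3` are just illustrative labels), coding each head as 1 and each tail as 0:
```{r}
# enumerate all 2^3 outcomes of 3 coin flips (1 = head, 0 = tail)
outcomes <- expand.grid(flip1 = 0:1, flip2 = 0:1, flip3 = 0:1)
heads_per_outcome <- rowSums(outcomes)
# proportion of outcomes with exactly one head
mean(heads_per_outcome == 1)
```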
The binomial distribution works out those probabilities for us: we only need the number of trials (N: how many events?) and the success probability (P: what is the probability of success on each event?). With those two, we can calculate the probability of any outcome.
Let's do this using code. In order to get the probability of getting `x=1` head, from an experiment of `size=2` trials, in which the probability of getting a head on any given trial is `p=1/2`, we can simply use `dbinom`:
```{r}
dbinom(x=1, size= 2, prob = 1/2 )
```
The probability of getting `x=1` head, from an experiment of `size=3` trials in which the probability of getting a head in any given trial is `p=1/2` will be:
```{r}
dbinom(x = 1, size = 3, prob=1/2)
```
The probability of getting 3 heads will be 1/8, since HHH is the only such outcome, or
```{r}
dbinom(x = 3, size = 3, prob=1/2)
```
The probability of 2 heads (and one tail: HHT, HTH, THH) will be 3/8, or
```{r}
dbinom(x = 2, size = 3, prob=1/2)
```
One final example: let's do an experiment with `size=6` coins, and I want to calculate the probability of getting `x=5` heads:
```{r}
dbinom(x = 5, size = 6, prob=1/2)
```
And what is the probability of getting `x=5` heads OR `x=2` heads? You simply add the two according to the addition rule:
```{r}
dbinom(x = 5, size = 6, prob=1/2) + dbinom(x = 2, size = 6, prob=1/2)
```
So the `size` argument is the number of trials, while the `x` argument is the number of successes (here, heads).
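Note that `dbinom` is vectorized, so we can ask for the whole distribution at once. As a sanity check, the probabilities across all possible values of `x` should sum to 1:
```{r}
dbinom(x = 0:3, size = 3, prob = 1/2)  # 1/8, 3/8, 3/8, 1/8
sum(dbinom(x = 0:3, size = 3, prob = 1/2))
```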
## "at least or at most"
Now, let's shift the question a bit. I have 3 coins, and I want to calculate the probability of getting at least 1 head. Let's first enumerate the possibilities again:
HHH, HHT, HTH, THH, TTH, THT, HTT, TTT
And this time, I want to count how many of those have at least one H. Since 7 out of the 8 outcomes have at least one head, the probability is 7/8. To compute this with the binomial distribution, you will need to use the summation rule: the probability of getting 1 head + the probability of getting 2 heads + the probability of getting 3 heads, or:
```{r}
dbinom(x = 1, size = 3, prob=1/2) + dbinom(x = 2, size = 3, prob=1/2) + dbinom(x = 3, size = 3, prob=1/2)
```
which is also equivalent to saying: the probability of getting at least one head is:
1 - (the probability of getting 0 heads)
```{r}
1-dbinom(x = 0, size = 3, prob=1/2)
```
Before we leave the binomial distribution, I want to discuss `pbinom`, which calculates the probability of getting `q` or fewer successes. For example, we already calculated the probability of getting at least 1 head in `size=3` coin flips, which means we have to sum the probabilities of 1, 2, and 3 heads (as we did before):
```{r}
pbinom(q = 0, size = 3, prob=1/2, lower.tail = FALSE)
```
Here, `lower.tail = FALSE` means that we want to sum the probabilities strictly to the right of `q=0` heads. If we set `lower.tail = TRUE` and `q=1`, it will calculate the probability of getting 1 head or fewer:
```{r}
pbinom(q = 1, size = 3, prob=1/2, lower.tail = TRUE)
```
which we can confirm by:
```{r}
dbinom(x = 0, size = 3, prob=1/2) + dbinom(x = 1, size = 3, prob=1/2)
```
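One more sanity check: the two tails are complements, so the `lower.tail = FALSE` result is just 1 minus the `lower.tail = TRUE` result at the same `q`:
```{r}
pbinom(q = 1, size = 3, prob = 1/2, lower.tail = FALSE)
1 - pbinom(q = 1, size = 3, prob = 1/2, lower.tail = TRUE)
```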
# Binomial to the Normal
With large N (i.e., more trials), the binomial distribution actually approximates the normal distribution. To see this in action, we first introduce `rbinom`, which draws random samples from the given distribution. Let's say I have a fair coin (`prob=1/2`) and I want R to simulate `n=10` coin flips; then we can simply write:
```{r}
rbinom(n=10, size=1, prob=1/2)
```
`size` here refers to the number of trials in each experiment. `size=1` simply means we are only recording the outcome of a single coin flip. If we change the size to 20, R will simulate 10 experiments of 20 coin flips each and print the number of heads in each experiment:
```{r}
rbinom(n=10, size = 20, prob = 1/2 )
```
Now you might see something like 10 heads in the first experiment, 12 in the second, and so on (your numbers will differ, since the draws are random). To illustrate why this approaches a normal distribution as you increase `n`, you can make a histogram with `n=10`:
```{r}
hist(rbinom( n = 10, size = 20, prob = 1/2 ))
```
and another one with `n=1000`:
```{r}
hist(rbinom( n = 1000, size = 20, prob = 1/2 ))
```
The point here is that as you collect more samples from a binomial distribution, the histogram starts to approximate the normal distribution.
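We can make the comparison explicit by overlaying a normal curve on the simulated histogram. A rough sketch, using the fact that a binomial with `size` trials and success probability `p` has mean `size * p` and standard deviation `sqrt(size * p * (1 - p))`:
```{r}
flips <- rbinom(n = 1000, size = 20, prob = 1/2)
# plot on the density scale so the normal curve is comparable
hist(flips, freq = FALSE)
curve(dnorm(x, mean = 20 * 1/2, sd = sqrt(20 * 1/2 * 1/2)),
      add = TRUE, col = "red")
```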
# The Normal Distribution
Now it is time to talk about the normal distribution, the most important distribution in statistics and probably in life. The normal distribution (or "bell curve") can be described by only two numbers: the mean and the standard deviation. Similar to the binomial distribution, we have `dnorm()`, `pnorm()`, `qnorm()` (which I have been ignoring so far), and `rnorm()`. The big difference between the normal distribution and the binomial distribution is that the normal distribution is continuous, not discrete. In the binomial distribution, we can't calculate the probability of getting 1.3 heads in 3 coin flips (you can technically ask R, but it will warn you about the non-integer `x` and return 0). In the normal distribution, however, we can calculate the probability of any range of values once we know the mean and the standard deviation.
Let's assume we have a normal distribution with `mean=0` and `sd=1`. First, we can make a nice picture by computing the probability density at each of 100 values between -4 and 4:
```{r}
library(ggplot2)
x <- seq(-4, 4, length = 100)  # a grid of values from -4 to 4
hx <- dnorm(x)                 # standard normal density at each value
mydf <- data.frame(x, hx)
ggplot(data = mydf, aes(x = x, y = hx)) + geom_line()
```
Now that we have a normal distribution in front of us, we can calculate the probability of any range of values (by calculating the size of the area under the curve). Let's calculate P(X < 0): what proportion of the area lies below 0?
```{r}
pnorm(0, mean = 0, sd = 1, lower.tail=TRUE)
```
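To see exactly what `pnorm` measured here, we can shade the corresponding area under the curve, reusing the `mydf` data frame from above (a sketch; the fill color is arbitrary):
```{r}
# shade the area below 0, i.e., P(X < 0)
ggplot(data = mydf, aes(x = x, y = hx)) +
  geom_line() +
  geom_area(data = subset(mydf, x < 0), fill = "skyblue")
```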
Oh, that means half of the area is below 0. Now we can calculate the proportion of area above 1 (and notice `lower.tail=FALSE`):
```{r}
pnorm(1, mean = 0, sd = 1, lower.tail=FALSE)
```
which means ~15.87% of the area is above 1.
Finally, we can calculate the area between any two points by subtracting one cumulative probability from the other. Let's calculate the area between -1 and 1:
```{r}
pnorm(1, mean = 0, sd = 1, lower.tail=TRUE)- pnorm(-1, mean = 0, sd = 1, lower.tail=TRUE)
```
Meaning that ~68% of observations should lie between those two numbers (if the numbers come from this distribution). Well, we can actually test this using simulations with `rnorm`, which gives us a list of random samples from a given distribution. For example, we can ask for `n=10` numbers from a normal distribution with a mean of 0 and a standard deviation of 1:
```{r}
rnorm(n=10, mean=0, sd=1)
```
Let's test whether ~68% of the numbers lie between -1 and 1 (i.e., the proportion of numbers that are bigger than -1 and less than 1):
```{r}
numbers <- rnorm(n=10, mean=0, sd=1)
mean(numbers > -1 & numbers < 1)
```
I doubt you will get exactly 0.68! Why is that? Because we have too few numbers to play with. If we increase the sample, we will start to approach the true proportion:
```{r}
numbers <- rnorm(n=10000, mean=0, sd=1)
mean(numbers > -1 & numbers < 1)
```
We'll come back to this later. Let's continue with some practical problems.
IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.
What percentage of people have an IQ less than 125?
```{r}
pnorm(125, mean = 100, sd = 15, lower.tail=TRUE)
```
What percentage of people have an IQ greater than 110?
```{r}
pnorm(110, mean = 100, sd = 15, lower.tail=FALSE)
```
What percentage of people have an IQ between 110 and 125?
```{r}
pnorm(125, mean = 100, sd = 15, lower.tail=TRUE) - pnorm(110, mean = 100, sd = 15, lower.tail=TRUE)
```
We can also use `qnorm` to find percentiles. For example, let's find the value that cuts off the bottom 15.8% of observations:
```{r}
qnorm(.158, mean=0, sd=1, lower.tail = TRUE)
```
Or the value that cuts off the top 15.8% of observations:
```{r}
qnorm(.158, mean=0, sd=1, lower.tail = FALSE)
```
Continuing with our IQ example, we can ask what IQ score separates the bottom 25% from the rest:
```{r}
qnorm(.25, mean = 100, sd = 15, lower.tail = TRUE)
```
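`qnorm` and `pnorm` are inverses of each other, which gives us an easy sanity check: feeding the cutoff back into `pnorm` should recover the proportion we started with:
```{r}
cutoff <- qnorm(.25, mean = 100, sd = 15, lower.tail = TRUE)
pnorm(cutoff, mean = 100, sd = 15, lower.tail = TRUE)  # should be 0.25
```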
# Sampling, one more time
We just tested whether it is true that ~68% of the numbers drawn from a normal distribution with a mean of 0 and a standard deviation of 1 lie between -1 and 1, and we observed that we get slightly different answers each time. The reason is that those simulations use random sampling. Random sampling means that each item has the same probability of being picked: if I randomly sample an item from a set of 6 items, each item has a 1/6 probability of being sampled. The same idea applies when sampling continuous numbers from a normal distribution using `rnorm`: every draw is independent, and values appear in proportion to the density.

In all scientific experiments, we deal with samples from the real world; we almost never deal with whole populations. However, as I explained in the beginning, we are interested in making inferences about populations. Let's say an IQ test is constructed so that the population has a mean of 100 and a standard deviation of 15. That's something high up in the sky. When you, a humble human being, want to measure average IQ scores (and you don't know the true mean and standard deviation), you collect a sample, and you hope that the average IQ score of this sample will approximate the true above-the-sky population average. Let's first make a sample of 5 IQ scores:
```{r}
iq_small_sample <- round(rnorm(n=5, mean=100, sd=15))
hist(iq_small_sample)
```
If you calculate the mean and the standard deviation of the sample you have, you will get:
```{r}
mean(iq_small_sample)
sqrt(sum((iq_small_sample-mean(iq_small_sample))^2)/(length(iq_small_sample))) # population formula: divide by N
```
These will typically be slightly off from the true values: in most runs the mean lands fairly close to the true mean, while the standard deviation can be quite far off. Let's get another sample, but this time we'll make it bigger:
```{r}
iq_moderate_sample <- rnorm(n=30, mean=100, sd=15)
hist(iq_moderate_sample)
```
```{r}
mean(iq_moderate_sample)
sqrt(sum((iq_moderate_sample-mean(iq_moderate_sample))^2)/(length(iq_moderate_sample)))
```
And now we typically get much closer to the true values. Finally, let's make it a very large sample:
```{r}
iq_laaarge_sample <- rnorm(n=10000, mean=100, sd=15)
hist(iq_laaarge_sample)
```
```{r}
mean(iq_laaarge_sample)
sqrt(sum((iq_laaarge_sample-mean(iq_laaarge_sample))^2)/(length(iq_laaarge_sample)))
```
And finally, with this large sample, you should get very close to the true values of the population mean and standard deviation.
What is happening? Why is the mean much closer to the true population value even with small sample sizes, while the standard deviation needs a larger sample size? The sample mean is an unbiased estimate of the population mean, and that's why it is used everywhere. The sample standard deviation computed with the formula above, however, is a biased estimate of the population standard deviation (on average, it is too small). That is also the reason the usual sample formula divides by N-1 and not by N: to correct for this and make the value a bit larger. `sd` actually uses the N-1 formula, so let's use it and compare its result with our old value:
```{r}
sqrt(sum((iq_small_sample-mean(iq_small_sample))^2)/(length(iq_small_sample)))
```
```{r}
sd(iq_small_sample)
```
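To convince yourself that `sd` really divides by N-1, compute the corrected formula by hand and compare:
```{r}
# sample formula: divide by N - 1
sqrt(sum((iq_small_sample - mean(iq_small_sample))^2) / (length(iq_small_sample) - 1))
sd(iq_small_sample)  # should match exactly
```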
How about a slightly more systematic experiment in R? Let's do the same calculations for a range of sample sizes and see how the mean and standard deviation estimates change as we increase the sample size.
```{r}
n_experiments <- 1000 # number of simulated experiments at each sample size
n_samples <- c(1:15)  # sample sizes from 1 to 15
means_of_sample_n <- c()
sds_uncorrected_of_sample_n <- c()
for (this_sample_n in n_samples) {
  this_sample_means <- c()
  this_sample_sds_uncorrected <- c()
  for (i in 1:n_experiments) {
    iq_sample <- rnorm(n = this_sample_n, mean = 100, sd = 15)
    this_sample_means[i] <- mean(iq_sample)
    # uncorrected (divide-by-N) standard deviation
    this_sample_sds_uncorrected[i] <- sqrt(sum((iq_sample - mean(iq_sample))^2) / (length(iq_sample)))
  }
  means_of_sample_n[this_sample_n] <- mean(this_sample_means)
  sds_uncorrected_of_sample_n[this_sample_n] <- mean(this_sample_sds_uncorrected)
}
```
Now we'll use ggplot to make some pretty visualizations:
```{r}
# install.packages('ggpubr')
library(ggpubr)
results <- data.frame(n = n_samples,
                      means = means_of_sample_n,
                      sds = sds_uncorrected_of_sample_n)
g1 <- ggplot(results, aes(x = n, y = means)) +
  geom_line(color = 'red') +
  geom_point() +
  geom_hline(yintercept = 100) +
  ylim(c(95, 105)) +
  theme_classic()
g2 <- ggplot(results, aes(x = n, y = sds)) +
  geom_line(color = 'red') +
  geom_point() +
  geom_hline(yintercept = 15) +
  ylim(c(-1, 17)) +
  theme_classic()
ggarrange(g1, g2, ncol = 2, nrow = 1)
```
Notice that the average of the sample means hovers around the true mean of 100 at every sample size, while the average uncorrected standard deviation starts well below 15 and only creeps toward it as the sample size grows. Here I will stop for today.
# Exercises
A student is taking an exam with 20 multiple-choice questions, where each question has 4 choices and the student answers randomly (so P=0.25). Use R code to answer the following questions:
- Calculate the probability of getting exactly 1 question correct.
- Calculate the probability of getting AT LEAST 5 questions correct.
- Calculate the probability of getting exactly 15 questions correct.
- Calculate the probability of passing the exam (the passing grade is 60%, i.e., at least 12 correct answers).
- Simulate a class of 100 students taking the same exam (where, again, each student is making random guesses). What is the class average (the average number of correct answers per student)?
- If we increase the class size to 1000 students, what happens to the average? (Keep in mind that the true average number of correct answers for any student is simply P * N, or 0.25 * 20 = 5 correct answers.)
Use simulations to answer the following questions:
- Is the maximum (`max`) a biased or an unbiased estimate of the population maximum?
- Is the median (`median`) a biased or an unbiased estimate of the population median?
- Is the variance a biased or an unbiased estimate of the population variance? (DO NOT use `var`; instead, use the original formula that divides by N.)
You might want to use an R Markdown file to answer these questions.
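If you are unsure where to start with the bias questions, here is a minimal simulation template (the sample size and number of experiments are arbitrary choices; it uses the mean, which we already know is unbiased, so swap in the estimator you want to test):
```{r}
# template: average an estimator over many simulated samples and
# compare it to the true population value
n_experiments <- 10000
estimates <- replicate(n_experiments, mean(rnorm(n = 5, mean = 100, sd = 15)))
mean(estimates)  # hovers around 100 -> unbiased; systematically off -> biased
```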