-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path13-rmdIntro.Rmd
1218 lines (845 loc) · 56.9 KB
/
13-rmdIntro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r, echo=FALSE, warning=FALSE}
library(knitr)
#rm(list = ls())
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
```
# (PART) Assignments {-}
# R Markdown
## Introduction to R Markdown
::: {.infobox .download data-latex="{download}"}
[You can download the example markdown file here](./Code/example_markdown.Rmd)
:::
This page will guide you through creating and editing R Markdown documents. This is a useful tool for reporting your analysis (e.g. for homework assignments). Of course, there is also [a cheat sheet for R-Markdown](https://www.rstudio.org/links/r_markdown_cheat_sheet) and [this book](https://bookdown.org/yihui/rmarkdown/) contains a comprehensive discussion of the format.
The following video contains a short introduction to the R Markdown format.
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/o8FdyMAR-g4" frameborder="0" allowfullscreen></iframe>
</div>
<br>
### Creating a new R Markdown document {-}
In addition to the video, the following text contains a short description of the most important formatting options.
Let's start to go through the steps of creating and .Rmd file and outputting the content to an HTML file.
0. If an R-Markdown file was provided to you, open it with R-Studio and skip to [step 4](#step4) after adding your answers.
1. Open R-Studio
2. Create a new R-Markdown document
![](./rmdExplain/start.PNG)
![](./rmdExplain/openDoc.PNG)
![](./rmdExplain/enterName.PNG)
![](./rmdExplain/template.PNG)
3. Save with appropriate name
![](./rmdExplain/saving.PNG)
3.1. Add your answers
3.2. Save again
<a name="step4"></a>
4. "Knit" to HTML
![](./rmdExplain/knit.PNG)
5. Hand in appropriate file (ending in `.html`) on learn\@WU
![](./rmdExplain/handin.PNG)
### Text and Equations {-}
R-Markdown documents are plain text files that include both text and R-code. Using RStudio they can be converted ('knitted') to HTML or PDF files that include both the text and the results of the R-code. In fact this website is written using R-Markdown and RStudio. In order for RStudio to be able to interpret the document you have to use certain characters or combinations of characters when formatting text and including R-code to be evaluated. By default the document starts with the options for the text part. You can change the title, date, author and a few more advanced options.
![First lines of an R-Markdown document](./rmdExplain/rmdHead.PNG)
The default is text mode, meaning that lines in an Rmd document will be interpreted as text, unless specified otherwise.
#### Headings {-}
Usually you want to include some kind of heading to structure your text. A heading is created using `#` signs. A single `#` creates a first level heading, two `##` a second level and so on.
![](./rmdExplain/headings.PNG)
It is important to note here that the ```#``` symbol means something different within the code chunks as opposed to outside of them. If you continue to put a ```#``` in front of all your regular text, it will all be interpreted as a first level heading, making your text very large.
#### Lists {-}
Bullet point lists are created using `*`, `+` or `-`. Sub-items are created by indenting the item using 4 spaces or 2 tabs.
````
* First Item
* Second Item
+ first sub-item
- first sub-sub-item
+ second sub-item
````
* First Item
* Second Item
+ first sub-item
- first sub-sub-item
+ second sub-item
Ordered lists can be created using numbers and letters. If you need sub-sub-items use `A)` instead of `A.` on the third level.
````
1. First item
a. first sub-item
A) first sub-sub-item
b. second sub-item
2. Second item
````
1. First item
a. first sub-item
A) first sub-sub-item
b. second sub-item
2. Second item
#### Text formatting {-}
Text can be formatted in *italics* (`*italics*`) or **bold** (`**bold**`). In addition, you can ad block quotes with `>`
````
> Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
````
> Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
### R-Code {-}
R-code is contained in so called "chunks". These chunks always start with three backticks and ```r``` in curly braces (``` ```{r} ```) and end with three backticks (``` ``` ```). Optionally, parameters can be added after the ```r``` to influence how a chunk behaves. Additionally, you can also give each chunk a name. Note that these have to be **unique**, otherwise R will refuse to knit your document.
#### Global and chunk options {-}
The first chunk always looks as follows
```{r setup, include = FALSE}`r ''`
knitr::opts_chunk$set(echo = TRUE)
```
It is added to the document automatically and sets options for all the following chunks. These options can be overwritten on a per-chunk basis.
Keep `knitr::opts_chunk$set(echo = TRUE)` to print your code to the document you will hand in. Changing it to `knitr::opts_chunk$set(echo = FALSE)` will not print your code by default. This can be changed on a per-chunk basis.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r cars, echo = FALSE}`r ''`
summary(cars)
plot(dist~speed, cars)
```
```{r cars, echo = FALSE}
summary(cars)
plot(dist~speed, cars)
```
```{r cars2, echo = TRUE}`r ''`
summary(cars)
plot(dist~speed, cars)
```
```{r cars2, echo = TRUE}
summary(cars)
plot(dist~speed, cars)
```
A good overview of all available global/chunk options can be found [here](https://yihui.name/knitr/options/#chunk_options).
### LaTeX Math {-}
Writing well formatted mathematical formulas is done the same way as in [LaTeX](https://en.wikipedia.org/wiki/LaTeX). Math mode is started and ended using `$$`.
````
$$
f_1(\omega) = \frac{\sigma^2}{2 \pi},\ \omega \in[-\pi, \pi]
$$
````
$$
f_1(\omega) = \frac{\sigma^2}{2 \pi},\ \omega \in[-\pi, \pi]
$$
(for those interested this is the spectral density of [white noise](https://en.wikipedia.org/wiki/White_noise))
Including inline mathematical notation is done with a single ```$``` symbol.
````
${2\over3}$ of my code is inline.
````
${2\over3}$ of my code is inline.
<br>
Take a look at [this wikibook on Mathematics in LaTeX](https://en.wikibooks.org/wiki/LaTeX/Mathematics#Symbols) and [this list of Greek letters and mathematical symbols](https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols) if you are not familiar with LaTeX.
In order to write multi-line equations in the same math environment, use `\\` after every line. In order to insert a space use a single `\`. To render text inside a math environment use `\text{here is the text}`. In order to align equations start with `\begin{align}` and place an `&` in each line at the point around which it should be aligned. Finally end with `\end{align}`
````
$$
\begin{align}
\text{First equation: }\ Y &= X \beta + \epsilon_y,\ \forall X \\
\text{Second equation: }\ X &= Z \gamma + \epsilon_x
\end{align}
$$
````
$$
\begin{align}
\text{First equation: }\ Y &= X \beta + \epsilon_y,\ \forall X \\
\text{Second equation: }\ X &= Z \gamma + \epsilon_x
\end{align}
$$
#### Important symbols {-}
```{r, echo=FALSE, include=TRUE, results="asis", warning = FALSE}
library(knitr)
library(kableExtra)
lat <- readLines("./lat.txt")
lat1 <- paste0("$", lat, "$")
lat2 <- paste0("```", lat, "```")
mathy.df <- data.frame(Symbol = lat1, Code = lat2)
kable(mathy.df, escape=FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
```
The `{}` after `_` and `^` are not strictly necessary if there is only one character in the sub-/superscript. However, in order to place multiple characters in the sub-/superscript they are necessary.
e.g.
```{r, echo=FALSE, include=TRUE, results="asis", warning = FALSE}
lat <- readLines("./lat2.txt")
lat1 <- paste0("$", lat, "$")
lat2 <- paste0("```", lat, "```")
mathy.df <- data.frame(Symbol = lat1, Code = lat2)
kable(mathy.df, escape=FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
```
#### Greek letters {-}
[Greek letters](https://en.wikipedia.org/wiki/Greek_alphabet#Letters) are preceded by a `\` followed by their name (`$\beta$` = $\beta$). In order to capitalize them simply capitalize the first letter of the name (`$\Gamma$` = $\Gamma$).
<!---
## Solution assignment 2
As a marketing manager at a video streaming service, you are interested in the effect of online advertising on the number of streams that a movie receives. To test the effect of online advertising on streams, you select a representative sample of 200 movies and randomly assign 100 movies to be included in an online advertising campaign. The other half of the sample serves as the control group. You run the experiment for one week and collect data regarding the number of streams for each movie from this period. Overall, the data set includes the following variables:
* movieID: unique movie ID
* streams_sd: number of streams in SD-quality
* streams_hd: number of streams in HD-quality
* online_advertising: indicator whether a movie was included in the online advertising campaign (0 = no, 1 = yes)
Apply appropriate statistical methods to answer the following questions:
1. Compute the 95% confidence interval for the mean number of streams for movies in SD and HD quality and provide an interpretation of the interval
2. Your historical data tells you that the movies in SD and HD quality received 2,600 and 1,700 streams in the previous week, respectively. Please test if the number of streams that the movies received (irrespective of whether they were included in the experiment or not) in the week of the experiment is significantly different from the previous week for SD and HD movies.
3. Is there a significant difference in streams between movies that were included in the online advertising campaign and those that were not included? (Please conduct the test for SD and HD movies and also compute the effect size Cohen's d)
4. Is there a significant difference in streams between movies in HD and SD quality? (Please also compute the effect size Cohen's d)
5. Assume that you plan to run an experiment with two groups to test two different advertising strategies. You randomly assign movies to the control and experimental conditions and your goal is to test if there is a significant difference between the groups regarding the number of streams that the movies receive. How many movies would you need to include in each group of your experiment if you assume the effect size to be 0.3 for a significance level of 0.05 and power of 0.8?
When answering the questions, please remember to address the following points, where appropriate:
* Formulate the corresponding hypotheses and choose an appropriate statistical test
* Provide the reason for your choice and discuss if the assumptions of the test are met
* Convert the variables to the appropriate type (e.g., factor variables)
* Create appropriate graphs to explore the data (e.g., boxplot, bar chart, histogram)
* Provide appropriate descriptive statistics for the variables
* Report and interpret the test results accurately (including confidence intervals)
* Finally, don't forget to report your research conclusion in an appropriate way
When you are done with your analysis, click on "Knit to HTML" button above the code editor. This will create a HTML document of your results in the folder where the "assignment.Rmd" file is stored. Open this file in your Internet browser to see if the output is correct. If the output is correct, submit the HTML file via Learn\@WU. The file name should be "assignment2_studendID_name.html".
**Load and inspect data**
Let´s load the data first and inspect the contained variables:
```{r }
movie_data <- read.table("https://raw.githubusercontent.com/IMSMWU/MRDA2018/master/data/assignment2.dat",
sep = "\t",
header = TRUE) #read in data
head(movie_data)
str(movie_data)
```
**Load packages**
Next, we load the packages that we will be using to answer the questions:
```{r warning=F, message=F, echo=T}
library(pastecs)
library(ggplot2)
library(psych)
library(pwr)
library(lsr)
library(reshape2)
library(ggstatsplot)
library(Rmisc)
```
```{r warning=F, message=F, echo=F, eval=TRUE}
options(digits=7, scipen = 999)
```
**Question 1**
To compute the confidence intervals for SD and HD streams we will need three things: 1) the mean $\bar x$, 2) the standard error ($s \over \sqrt{n}$), and 3) the critical value for a t-distribution ($t_{crit}$; we will use a t-distribution, because we are not sure of the variance in the population).
```{r warning=F, message=F, echo=T, eval=T}
#Calculate components of confidence interval formula
mean_sd <- mean(movie_data$streams_sd)
mean_hd <- mean(movie_data$streams_hd)
sd_sd <- sd(movie_data$streams_sd)
sd_hd <- sd(movie_data$streams_hd)
n <- nrow(movie_data)
se_sd <- sd_sd/sqrt(n)
se_hd <- sd_hd/sqrt(n)
df <- n-1
t_crit <- qt(0.975, df)
```
Now the confidence intervals for streams in SD and HD quality can be computed as:
```{r warning=F, message=F, echo=T, eval=T}
#Interval for SD movies
ci_lower_sd <- mean_sd - t_crit * se_sd
ci_upper_sd <- mean_sd + t_crit * se_sd
#Interval for HD movies
ci_lower_hd <- mean_hd - t_crit * se_hd
ci_upper_hd <- mean_hd + t_crit * se_hd
```
Hence, the CI for SD movies is given by:
```{r warning=F, message=F, echo=T, eval=T}
ci_lower_sd
ci_upper_sd
```
$CI_{SD} = [2527.97,2827.85]$
Similarly, the CI for HD movies is given by
```{r warning=F, message=F, echo=T, eval=T}
ci_lower_hd
ci_upper_hd
```
$CI_{HD} = [1728.51,1940.52]$
The intervals can be interpreted as follows: If we would take 100 samples, calculate the mean and confidence interval for each of them, then the true population mean would be included in 95% of these intervals.
**Question 2**
To find out whether our data for SD and HD streams differs significantly from the previous week (2600 for SD; 1700 for HD), we will conduct a one sample t-test. This is appropriate, because 1) our data is on an interval scale, and 2) the sampling distribution can be considered as normally distributed due to the fairly large sample size (n=200; see central limit theorem).
Our null hypothesis states that there is no difference between the quantity of SD/HD streams watched in the current week, compared to the previous week. Rejecting the null hypotheses/accepting the alternative hypothesis would mean that there indeed was a difference between the two weeks.
So for our SD streams we could formulate our hypothesis as follows:
$$H_0: \mu_0 = 2600 \\ H_1: \mu_0 \neq 2600 $$
The same approach can be used for our HD streams:
$$H_0: \mu_0 = 1700 \\ H_1: \mu_0 \neq 1700 $$
We can first have a quick look at the descriptive statistics:
```{r warning=F, message=F, echo=F, eval=T}
# make sure describe comes from psych package
describe <- get("describe", pos = "package:psych")
```
```{r eval=T, echo=T, message=F, warning=F, paged.print=FALSE}
describe(movie_data$streams_sd)
describe(movie_data$streams_hd)
```
```{r eval=T, echo=F, message=F, warning=F, paged.print=FALSE}
library(Hmisc)
```
As we can see, the differences between SD/HD and the week before don´t seem to be extraordinary high. To visualize the distribution of the data, we can create histograms:
```{r message=FALSE, warning=FALSE}
ggplot(movie_data,aes(streams_sd)) +
geom_histogram(col = "black", fill = "darkblue") +
labs(x = "Number of SD stremas", y = "Frequency") +
theme_bw()
ggplot(movie_data,aes(streams_hd)) +
geom_histogram(col = "black", fill = "darkblue") +
labs(x = "Number of HD streams", y = "Frequency") +
theme_bw()
```
We can now conduct a one sample t-test to test for significance.
```{r warning=F, message=F, echo=T, eval=T}
t.test(movie_data$streams_sd, mu = 2600, alternative = "two.sided")
t.test(movie_data$streams_hd, mu = 1700, alternative = "two.sided")
```
For SD streams, we can conclude that the average number of SD streams watched in this week (2677.92) were not significantly different from the 2600 streams watched in the previous week, t(199) = 1.025, p > .05 (95% CI = [2528; 2828]). This can be seen from the fact that the p-value is larger than 0.05. This is also evidenced by the fact that the null hypothesis (2600) is included in the range of plausible values given by the confidence interval.
However for HD streams we see that the perceived mean in our sample (1834.52) is significantly higher compared to the previous week t(199) = 2.502, p <.05 (95% CI = [1729; 1941]). This can be seen from the fact that the p-value is smaller than 0.05. This is also evidenced by the fact that the null hypothesis (1700) is not included in the range of plausible values given by the confidence interval.
Alternatively, you could also use the `ggstatsplot` package to conduct the tests:
```{r message=FALSE, warning=FALSE}
gghistostats(
data = movie_data,
x = streams_sd,
title = "Distribution of SD streams",
type = "parametric",
conf.level = 0.95,
bar.measure = "mix",
test.value = 2600,
test.value.line = TRUE,
effsize.type = "d",
test.value.color = "#0072B2",
centrality.para = "mean",
centrality.color = "darkred",
binwidth = 300,
messages = FALSE,
bf.message = FALSE
)
```
```{r message=FALSE, warning=FALSE}
gghistostats(
data = movie_data,
x = streams_hd,
title = "Distribution of HD streams",
type = "parametric",
conf.level = 0.95,
bar.measure = "mix",
test.value = 2600,
test.value.line = TRUE,
effsize.type = "d",
test.value.color = "#0072B2",
centrality.para = "mean",
centrality.color = "darkred",
binwidth = 300,
messages = FALSE,
bf.message = FALSE
)
```
**Question 3**
First we will analyze whether the advertising campaign had an effect on SD streams. We need to formulate a hypothesis which we can test. In this case, the null hypothesis is that the campaign had no effect on the mean number of streams, i.e. that there is no difference in the mean number of streams between the two populations. The alternative hypothesis states that the campaign _did_ have an effect, meaning that there is a difference in the mean number of streams between the populations. In more formal notation this is:
$$H_0: \mu_0 = \mu_1 \\ H_1: \mu_0 \neq \mu_1$$
We need to transform the variable online_advertising into a factor variable for some of our analyses:
```{r warning=F, message=F, echo=T, eval=T}
# Transform into factor variable
movie_data$online_advertising <- factor(movie_data$online_advertising, levels = c(0,1), labels = c("no", "yes"))
```
A good way to get a feeling for the data is to compute descriptive statistics and create appropriate plots. Since we are testing differences in means, a plot of means (or a boxplot) would be appropriate.
```{r warning = F, message = F, echo=T, eval=T}
# Descriptive statistics for SD streams, split by online advertising
describeBy(movie_data$streams_sd, movie_data$online_advertising)
mean_data <- summarySE(movie_data, measurevar = "streams_sd",
groupvars = c("online_advertising"))
# Plot of means
ggplot(mean_data, aes(x = online_advertising, y = streams_sd)) +
geom_bar(position = position_dodge(0.9), colour = "black",
fill = "#CCCCCC", stat = "identity", width = 0.65) +
geom_errorbar(position = position_dodge(0.9), width = 0.15,
aes(ymin = streams_sd - ci, ymax = streams_sd + ci)) +
theme_bw() + labs(x = "Advertising", y = "Average number of SD streams",
title = "Average number of SD streams by group") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5,
color = "#666666"))
```
As we can see in both the descriptive statistics and the plot, the mean of the number of streams is higher where online_advertising = "yes", i.e. for the movies that were included in the marketing campaign. To test whether or not this difference is significant, we need to use a __two sample t-test__. We use an independent-means t-test because we have different movies in each group (i.e., the movies in one condition are *independent* of the movies in the other condition). The requirements are clearly met:
* Our dependent variable is on an interval scale
* Since we have more than 30 observations per group we do not really have to concern ourselves with whether the data is normally distributed or not (see central limit theorem)
* If a movie was included in the campaign or not was assigned randomly
* R automatically performs Welch's t-test, which corrects for unequal variance
Thus we can perform the test in R
```{r warning = F, message = F, echo=T, eval=T}
t.test(streams_sd ~ online_advertising, data = movie_data)
```
The test is significant, since the p-value is smaller than 0.05, leading us to reject the null hypothesis that there is no difference in the mean number of streams. The p-value states the probability of finding a difference of the observed magnitude or higher, if the null hypothesis was in fact true (i.e., if there was in fact no difference between the populations). In effect, this means that the advertising campaign had an effect on the average number of times a video was streamed. Another thing we can extract from this test result is the confidence interval around the difference in means. Since 0 is not included in the interval, it is not a plausible value, confirming the conclusion to reject the null hypothesis.
The standardized effect size can be computed using the ```cohensD``` function:
```{r warning = F, message = F, echo=T, eval=T}
cohensD(streams_sd ~ online_advertising, data = movie_data)
```
This magnitude of the effect size (1.12) suggests that the effect of online advertising on the number of SD streams is large.
The same can be done analogously for HD streams:
```{r warning = F, message = F, echo=T, eval=T}
# Descriptive statistics for HD streams, split by online advertising
stats <- describeBy(movie_data$streams_hd, movie_data$online_advertising)
print(stats)
mean_data <- summarySE(movie_data, measurevar = "streams_hd",
groupvars = c("online_advertising"))
# Plot of means
ggplot(mean_data, aes(x = online_advertising, y = streams_hd)) +
geom_bar(position = position_dodge(0.9), colour = "black",
fill = "#CCCCCC", stat = "identity", width = 0.65) +
geom_errorbar(position = position_dodge(0.9), width = 0.15,
aes(ymin = streams_hd - ci, ymax = streams_hd + ci)) +
theme_bw() + labs(x = "Advertising", y = "Average number of HD streams",
title = "Average number of HD streams by group") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5,
color = "#666666"))
```
Again, the summary statistics and the plot seem to indicate that there is a difference in means. Using the same reasoning as before, we can conclude that we need a two sample t-test to determine whether this difference is significant (note that *two sample t-test* means the same as *independent-means t-test*).
```{r warning = F, message = F, echo=T, eval=T}
t.test(streams_hd ~ online_advertising, data = movie_data)
```
Again, the p-value is so low that any sensible significance level would lead us to reject the null hypothesis, suggesting that there is a difference in mean the number of streams between videos included in the campaign and those that aren't.
Calculate the standardized effect size:
```{r warning = F, message = F, echo=T, eval=T}
cohensD(streams_hd ~ online_advertising, data = movie_data)
```
The magnitude of the effect size indicates again that this effect is large, although it is somewhat smaller than for SD streams.
Alternatively, you could also use the `ggstatsplot` package to conduct the tests:
```{r message=FALSE, warning=FALSE}
ggbetweenstats(
data = movie_data,
plot.type = "box",
x = online_advertising, # 2 groups
y = streams_sd ,
type = "p", # default
effsize.type = "d", # display effect size (Cohen's d in output)
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
title = "SD streams"
)
```
```{r message=FALSE, warning=FALSE}
ggbetweenstats(
data = movie_data,
plot.type = "box",
x = online_advertising, # 2 groups
y = streams_hd ,
type = "p", # default
effsize.type = "d", # display effect size (Cohen's d in output)
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
title = "HD streams"
)
```
**Question 4**
Next we want to examine whether HD and SD streams have similar numbers on average. The null hypothesis here is that there is no difference in the mean number of HD streams and the mean number of SD streams for the same movies. Because the observations come from the same population of movies, we refer to the difference in the means for the same population as $\mu_D$ when stating our hypotheses. The alternative hypothesis states that that there is a difference between the streams in HD and SD quality for the same movies. In mathematical notation this can be written as
$$H_0: \mu_D = 0 \\ H_1: \mu_D \neq 0$$
Again, we start with descriptive statistics to get a feel for the data.
```{r warning=F, message=F, echo=F, eval=T}
if("package:Hmisc" %in% search()) detach("package:Hmisc", unload=TRUE)
```
```{r warning = F, message = F, paged.print = FALSE, echo=T, eval=T}
# Descriptive statistics for HD and SD streams
psych::describe(movie_data$streams_sd)
psych::describe(movie_data$streams_hd)
# Plot of means
movie_data_long <- melt(movie_data[, c("streams_sd", "streams_hd")])
names(movie_data_long) <- c("type", "streams")
mean_data <- summarySE(movie_data_long, measurevar = "streams",
groupvars = c("type"))
# Plot of means
ggplot(mean_data, aes(x = type, y = streams)) +
geom_bar(position = position_dodge(0.9), colour = "black",
fill = "#CCCCCC", stat = "identity", width = 0.65) +
geom_errorbar(position = position_dodge(0.9), width = 0.15,
aes(ymin = streams - ci, ymax = streams + ci)) +
theme_bw() + labs(x = "Type", y = "Average number of streams",
title = "Average number of streams by group") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5,
color = "#666666"))
```
```{r warning=F, message=F, echo=F, eval=T}
library(Hmisc)
```
It appears that there is a difference in the means. To test whether it is significant, we again need a t-test. However, this time we need a slightly different version of the t-test because the same movies are observed for HD and SD streams (i.e., the same movies are available in both formats). This means that we need a __dependent means t-test__. This test is also known as the **paired samples t-test**. The other assumptions are virtually identical to the independent-means t-test. The test can be executed in R by adding ```paired = TRUE``` to the code.
```{r warning = F, message = F, echo=T, eval=T}
t.test(y = movie_data$streams_sd, x = movie_data$streams_hd, paired = TRUE)
```
The p-value is again lower than the chosen significance level of 5% (i.e., p < .05), which means that we reject the null hypothesis that there is no difference in the mean number of streams in HD and SD quality. Make sure you interpret the p-value correctly. It refers to the probability of observing a difference of the observed magnitude (or larger) between streams in HD and SD quality, assuming that there was in fact no difference between the formats. The confidence interval confirms the conclusion to reject the null hypothesis since $0$ is not contained in the range of plausible values.
Now let's find out how strong this effect is.
```{r warning = FALSE, echo=T, eval=T}
cohensD(movie_data$streams_sd, movie_data$streams_hd, method = 'paired')
```
A standardized effect size of approx. 0.79 tells us that this effect is large.
Alternatively, you could also use the `ggstatsplot` package to conduct the tests:
```{r message=FALSE, warning=FALSE}
ggwithinstats(
data = movie_data_long,
x = type,
y = streams,
path.point = FALSE,
path.mean = TRUE,
sort = "descending", # ordering groups along the x-axis based on
sort.fun = mean, # values of `y` variable
title = "Number of streams for movies in SD and HD",
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
effsize.type = "d" # display effect size (Cohen's d in output)
)
```
**Question 5**
The question of how many movies we would need to include in each sample of our experiment can be answered quite comfortably with a power calculation function in R.
```{r warning=F, message=F, echo=T, eval=T}
pwr.t.test(d = 0.3, sig.level = 0.05, power = 0.8, type = c("two.sample"), alternative = c("two.sided"))
```
To achieve our desired effect size of 0.3, a significance level of 0.5 and a power of 0.8 we would need to include at least 175 movies per group in our sample.
## Solution assignment 3
The data file contains customer information from an online fashion shop. In an experiment, the customers were exposed to different types of online advertising over the past year (randomly assigned) and now you wish to analyze the results.
The following variables are included in the data set:
* customerID: unique customer ID
* revenue: revenue per customer for the past year (in EUR)
* gender: 0=male, 1=female
* retargeting: type of online advertising that the customer was exposed to (3 levels: 1 = no advertising, 2 = generic retargeting, 3 = dynamic retargeting)
* customerRank: ranking of customers according to their expenditure level (low rank = valuable customer, high rank = less valuable customer)
* conversion: indicator variable, indicating if a customer converted in the previous campaign (0 = no conversion, 1 = conversion)
Use R and appropriate analytical techniques to answer the following questions:
1. Has the types of online advertising an effect on revenue? Are there significant differences between the individual groups?
2. Is the customer ranking significantly influenced by the type of online advertising? Are there significant differences between the individual groups?
3. Does the conversion rate in the previous campaign differ between male and female customers?
When answering the questions, please remember to address the following points, where appropriate:
* Formulate the corresponding hypotheses and choose an appropriate statistical test
* Provide the reason for your choice and discuss if the assumptions of the test are met
* Convert the variables to the appropriate type (e.g., factor variables)
* Create appropriate graphs to explore the data (e.g., boxplot, bar chart, histogram)
* Provide appropriate descriptive statistics for the variables
* Report and interpret the test results accurately (including confidence intervals)
* Finally, don't forget to report your research conclusion in an appropriate way
When you are done with your analysis, click on "Knit to HTML" button above the code editor. This will create a HTML document of your results in the folder where the "assignment3.Rmd" file is stored. Open this file in your Internet browser to see if the output is correct. If the output is correct, submit the HTML file via Learn\@WU. The file name should be "assignment3_studendID_name.html".
**Load data**
```{r }
rm(list = ls())
customer_data <- read.table("https://raw.githubusercontent.com/IMSMWU/MRDA2018/master/data/assignment3.csv",
sep = ";",
header = TRUE) #read in data
head(customer_data)
str(customer_data)
```
**Data Preparation**
As always, the first step is to load required packages (packages that have not been used as often in the course will be loaded as required to show which packages contain certain functions) and to load and inspect the data.
```{r warning=FALSE, message=FALSE}
library(plyr)
library(ggplot2)
library(psych)
library(Hmisc)
library(Rmisc)
```
Next we are going to recode some of the variables into factors and give them more descriptive level names.
```{r, warning=FALSE, message=FALSE}
customer_data$retargeting <- factor(customer_data$retargeting, levels = c(1,2,3), labels = c("no retargeting", "generic retargeting", "dynamic retargeting"))
customer_data$gender <- factor(customer_data$gender, levels = c(1,0),labels = c("female","male"))
customer_data$conversion <- factor(customer_data$conversion, levels = c(1,0), labels = c("conversion","no conversion"))
```
**Question 1**
To answer whether the type of advertising has an effect on revenue we need to formulate a testable null hypothesis. In our case the null hypothesis is stating that the average level of sales is equal for all advertising types. In mathematical notation this implies:
$$H_0: \mu_1 = \mu_2 = \mu_3 $$
The alternate hypothesis is simply that the means are not all equal, i.e.,
$$H_1: \textrm{Means are not all equal}$$
If you wanted to put this in mathematical notation, you could also write:
$$H_1: \exists {i,j}: {\mu_i \ne \mu_j} $$
The appropriate test for such a hypothesis is one-way ANOVA since we have a metric scales dependent variable and a categorical independent variable with more than two levels.
Next we will calculate summary statistics for the data and produce an appropriate plot.
```{r fig.align="center"}
describeBy(customer_data$revenue,customer_data$retargeting)
mean_data <- summarySE(customer_data, measurevar="revenue", groupvars=c("retargeting"))
ggplot(mean_data,aes(x = retargeting, y = revenue)) +
geom_bar(position=position_dodge(1), colour="black", fill = "#CCCCCC", stat="identity", width = 0.65) +
geom_errorbar(position=position_dodge(.9), width=.15, aes(ymin=revenue-ci, ymax=revenue+ci)) +
theme_bw() +
labs(x = "Group", y = "Average revenue", title = "Average revenue by group")+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5,color = "#666666"))
```
Both the summary statistics and the plot hint at the fact that the means may not be equal. Especially the difference between dynamic retargeting and no retargeting/ generic regtargeting seem to be quite high. Before we move to the formal test, we need to see if a series of assumptions are met, namely:
* Distributional assumptions
* Homogeneity of variances
* Independence of observations
The last assumption is satisfied due to the fact that the observations were randomly assigned to the advertisement groups. To see if we need to worry about distributional assumptions we first take a look at the number of observations in each advertising group.
```{r, warning=FALSE, message=FALSE}
#check number of observations by group
table(customer_data$retargeting)
```
Due to the fact that there are always more than 30 observations in each group we can rely on the central limit theorem to satisfy the distributional assumptions.
Homogeneity of variances can be checked with Levene's test (implemented as ```leveneTest()``` from the ```car``` package). The null hypothesis of this test is that the variances are equal, with the alternative hypothesis being that the variances are not all equal. Note that this step could also be skipped and replaced by the use of the robust ANOVA using the `oneway.test()` function.
```{r, warning=FALSE, message=FALSE, paged.print = FALSE}
#Homogeneity of variances test:
library(car)
leveneTest(revenue ~ retargeting, data=customer_data, center=mean)
```
The test result is insignificant (for a significance level of 5 %), meaning that we do not reject the null hypothesis of equal variances and can operate under the assumption that the variances are equal.
Since all assumptions are fulfilled we can move on to conducting the actual ANOVA using the ```aov()``` function. As said above, it would also be possible to conduct the analysis using the robust ANOVA using the `oneway.test()` function:
```{r, warning=FALSE, message=FALSE}
#Anova:
aov <- aov(revenue~retargeting, data = customer_data)
summary(aov)
```
The p-value is smaller than 0.05, which we chose as our significance level, meaning that we reject the null hypothesis of the means being equal in the three advertising groups.
Next we will briefly inspect the residuals of the ANOVA to see if the assumptions of the test really are justified.
```{r, warning=FALSE, message=FALSE}
#Inspect residuals
plot(aov,1)
```
The first plot gives us a feel for the distribution of the residuals of the three groups. The residuals seem to be roughly equally distributed, which speaks for the fact that the homogeneity of variances assumptions is fulfilled.
```{r, warning=FALSE, message=FALSE}
plot(aov,2)
```
The second plot is a QQ-plot of the residuals, meant as a quick visual check to see if the normality assumption is fulfilled. Leading up to the test we only checked if there were more than 30 observations per group to satisfy the normality assumption but despite this being fulfilled it is still important to check the normality of the residuals, as any strange behavior here may indicate problems with the model specification.
To further confirm that the residuals are roughly normally distributed we employ the Shapiro-Wilk test. The null hypothesis is that the distribution of the data is normal, with the alternative hypothesis positing that the data is not normally distributed.
```{r, warning=FALSE, message=FALSE}
shapiro.test(resid(aov))
```
The p value is far above any widely used significance level and thus we can not reject the null hypothesis of normal distribution, which further implies that the normality assumption is fulfilled.
The ANOVA result only tells us that the means of the three groups are not equal, but it does not tell us anything about _which_ pairs of means are unequal. To find this out we need to conduct post hoc tests to test the following null hypotheses for the respective pairwise comparisons.
$$1) H_0: \mu_1 = \mu_2; H_1 = \mu_1 \neq \mu_2 \\
2) H_0: \mu_2 = \mu_3; H_1 = \mu_2 \neq \mu_3 \\
3) H_0: \mu_1 = \mu_3; H_1 = \mu_1 \neq \mu_3 $$
Here we will conduct both the Bonferroni correction as well as Tukey's HSD test, however either would be sufficient for your homework. Bonferroni's correction conducts multiple pairwise t-tests, with the null hypothesis being that of equal means in each case and the alternative hypothesis stating that the means are unequal.
```{r, warning=FALSE, message=FALSE}
#bonferroni
pairwise.t.test(customer_data$revenue, customer_data$retargeting, data=customer_data, p.adjust.method = "bonferroni")
```
The Bonferroni test reinforces what we saw in our plot earlier, namely that not all of the means might be significantly different from each other.
We can only reject the null hypothesis in the cases:
dynamic regargeting vs. no retargeting
dynamic regargeting vs. generig retargeting
But there seems to be no difference in the means of generic retargeting vs. no retargeting.
Alternatively, you could have also chosen to use Tukey's HSD to conduct the post hoc test. Tukey's HSD similarly compares pairwise means, corrected for family-wise errors (both of the post hoc tests would have been considered correct).
```{r, warning=FALSE, message=FALSE}
#tukey correction using the mult-comp package
library(multcomp)
tukeys <- glht(aov, linfct = mcp(retargeting = "Tukey"))
summary(tukeys)
```
Tukey's correction confirms the conclusion from the Bonferroni test from above. While there seems to be no difference in the means of generic retargeting vs. no retargeting, dynamic retargeting seems to differ significantly from both generic retargeting and no retargeting.
Tukey's HSD further let's us estimate the difference in means with corresponding confidence intervals.
```{r, warning=FALSE, message=FALSE}
confint(tukeys)
# The mar parameter changes the margins around created plots. This is done so the labels on the side of the Tukey plot are visible (however, this was not expected).
par(mar = c(5, 20, 4, 2))
plot(tukeys)
```
It is clearly visible that just the CIs of generic retargetring vs. no retargeting cross the 0 bound, which further indicates that the differences in means are statistically not significantly different from 0.
From a reporting standpoint we can say that revenue is higher when using dynamic retargeting vs. no retargeting or generic retargeting, but there is no significant difference between the sales for products in the dynamic retargeting vs. no retargeting conditions. Managerially, this means that only dynamic retargetting helps us to increase sales.
**Question 2**
For this question we want to examine whether customer ranks are significantly different for different types of advertising. Because we are dealing with data on an ordinal scale, we can not use ANOVA for this type of question. The non-parametric counterpart is the Kruskal-Wallis test, which tests for differences in medians between groups. Hence, the null hypothesis is that the medians are equal in each group and the alternative hypothesis is that there is a difference between at least one pair of groups in terms of the median.
$$H_0: \bar{\mu}_1 = \bar{\mu}_2 = \bar{\mu}_3 $$
$$H_1: \textrm{The meadians are not all equal} $$
Or, alternatively
$$H_1: \exists {i,j}: {\bar \mu_i \ne \bar \mu_j} $$
A good way to visualize ordinal data is through a boxplot.
```{r, fig.align="center"}
ggplot(data = customer_data, aes(x = retargeting, y = rank)) +
geom_boxplot() +
theme_bw() +
labs(x = "", y = "Rank")
```
The boxplot seems to indicate that the medians are unequal. At least for dynamic retargeting our customer ranks seem to be lower than the ones of no retargeting or generic retargeting.
The only assumption that we require for this test is that the dependent variable is at least ordinal, which is fulfilled for customer ranks. Hence we can move on to performing the test in R.
```{r, warning=FALSE, message=FALSE}
#ordinal data so we use a non-parametric test
kruskal.test(rank ~ retargeting, data = customer_data)
```
The p-value is below any sensible signifcance level and thus we reject the null hypothesis of equal medians. This means that the median rank of customers is different for different types of retargeting, implying that the type of retargeting has an effect on the customer rank.
To further see which of the medians are unequal we perform the Nemenyi post hoc test, which can be found in the ```PCMCR``` package in R. The null hypothesis is that the pairwise medians are equal, while the alternative hypothesis is that the pairwise medians are unequal.
```{r, warning=FALSE, message=FALSE}
library(PMCMR)
posthoc.kruskal.nemenyi.test(x = customer_data$rank, g = customer_data$retargeting, dist = "Tukey")
```
Similar to question 1 we can see that there seems to be no difference in (median) customer ranks of no retargeting vs. generic retargeting. On the other side ranks of dynamic retargeting seem to be significantly different from both no retargeting and generic retargeting. This implies that just dynamic retargeting leads to different customer ranks.
**Question 3**
To find out whether our conversion rate differs between our female and male customers, we can use a test for proportions instead of a test for mean differences. To test for the equality of proportions (and therefore no difference between them) we can use a $\chi^2$ test.
Our null hypothesis in this case states that the proportions of conversion are equal for females and males. Our alternative hypothesis states that these proportions are unequal.
$$H_0: \pi_1 = \pi_2 \\ H_1: \pi_1 \neq \pi_2$$
First let´s create a summary plot to get a feeling for the data.
```{r question_3_1}
#conditional relative frequencies
rel_freq_table <- as.data.frame(prop.table(table(customer_data$gender, customer_data$conversion), 1))
names(rel_freq_table) <- c("gender", "conversion","freq") # changing names of the columns
rel_freq_table
ggplot(rel_freq_table, aes(x = gender, y = freq, fill = conversion)) + #plot data
geom_col(width = .7) + #position
geom_text(aes(label = paste0(round(freq*100,0),"%")), position = position_stack(vjust = 0.5), size = 4) + #add percentages
ylab("Proportion of conversions") + xlab("gender") + # specify axis labels
theme_bw()
```
We see that our conversion seems to be better for our female customers, but let´s check whether these proportions are significantly different.
```{r}
n1 <- nrow(subset(customer_data, gender == "female")) #number of observations for females
n2 <- nrow(subset(customer_data, gender == "male")) #number of observations for males
n1_conv <- nrow(subset(customer_data, gender == "female" & conversion == "conversion")) #number of conversions for females
n2_conv <- nrow(subset(customer_data, gender == "male" & conversion == "conversion")) #number of conversions for males
prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)
```
The test showed that the conversion rate for females was 26% higher compared to male customers. This difference is highly significant $\chi^2$ (1) = 24.2, p < .05 (95% CI = [0.16,0.36]), which means that we can reject our null hypothesis of equal probability and state that there indeed is a difference between our male and female customers respective their conversion rate.
## Solution assignment 4
As a marketing manager of a consumer electronics company, you are assigned the task to analyze the relative influence of different marketing activities. Specifically, you are supposed to analyse the effects of (1) TV advertising, (2) online advertising, and (3) radio advertising on the sales of fitness trackers (wristbands). Your data set consists of sales of the product in different markets (each line represents one market) from the past year, along with the advertising budgets for the product in each of those markets for three different media: TV, online, and radio.
The following variables are available to you:
* Sales (in thousands of units)
* TV advertising budget (in thousands of Euros)
* Online advertising budget (in thousands of Euros)
* Radio advertising budget (in thousands of Euros)
Please conduct the following analyses:
1. Formally state the regression equation, which you will use to determine the relative influence of the marketing activities on sales.
2. Describe the model variables using appropriate statistics and plots
3. Estimate a multiple linear regression model to determine the relative influence of each of the variables. Before you interpret the results, test if the model assumptions are fulfilled and use appropriate tests and plots to test the assumptions.
4. Interpret the model results:
* Which variables have a significant influence on sales and what is the interpretation of the coefficients?
* What is the relative importance of the predictor variables?
* Interpret the F-test
* How do you judge the fit of the model? Please also visualize the model fit using an appropriate graph.
5. What sales quantity would you predict based on your model for a product when the marketing activities are planned as follows: TV: 150 thsd. €, Online: 26 thsd. €, Radio: 15 thsd. €? Please provide the equation you used to make the prediction.
When you are done with your analysis, click on "Knit to HTML" button above the code editor. This will create a HTML document of your results in the folder where the "assignment4.Rmd" file is stored. Open this file in your Internet browser to see if the output is correct. If the output is correct, submit the HTML file via Learn\@WU. The file name should be "assignment4_studendID_name.html".
**Load data**
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
sales_data <- read.table("https://raw.githubusercontent.com/IMSMWU/MRDA2018/master/data/assignment4.dat",
sep = "\t",
header = TRUE) #read in data
sales_data$market_id <- 1:nrow(sales_data)
head(sales_data)
str(sales_data)
```
**Question 1**
In a first step, we specify the regression equation. In this case, sales is the dependent variable which is regressed on the different types of advertising expenditures that represent the independent variables. Thus, the regression equation is:
$$Sales=\beta_0 + \beta_1 * tv\_adspend + \beta_2 * online\_adspend + \beta_3 * radio\_adspend + \epsilon$$
**Question 2**
The descriptive statistics can be checked using the ```describe()``` function: