-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment2.Rmd
222 lines (155 loc) · 6.19 KB
/
Assignment2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "Assignment 2"
author: "Dmitrii Bychkov / 014059377"
output: html_document
---
# Cross sectional study
Tehty: `r format(Sys.time()) `
### Instruction
Complete this assignment file and return it and resulting html-file to moodle.
Write as "author:" your name and student number (if you have one)
In this assignment we investigate association between HDL (`hdl`) (y-variable) and the following variables:
* Assignment (`age`)
* height (`height.cm`)
* weight (`weight.kg`)
* sex (`gender`)
* loaction (`location`)
* ratio.ter
## Read data
We use [diabetes data](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.html)
```{r}
# Load example diabetes data, this is originally SPSS file
# Use package "foreign"
library(foreign)
load(file=url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))
# First 2 rows
head(diabetes,2)
# Simple summaries
summary(diabetes)
# New variable waist/hip ratio in tertiles
with(diabetes,table(cut(ratio,quantile(ratio,probs=c(0,1/3,2/3,1),na.rm=TRUE),
include.lowest=TRUE),useNA="ifany"))
diabetes$ratio.ter<-with(diabetes,cut(ratio,quantile(ratio,probs=c(0,1/3,2/3,1),na.rm=TRUE),
include.lowest=TRUE))
```
## Variable transformations
Make new variables:
* `height.cm` height in inches transformed to cm
* `weight.kg` pounds into kg
```{r}
diabetes$height.cm <- diabetes$height * 2.54
diabetes$weight.kg <- diabetes$weight / 2.2
```
## Tabulate
Make tables and output them into html.
Take into account if there is any missing values.
```{r message=FALSE}
library(tables)
```
```{r}
tmp.tab0 <- tabular( (gender) ~ (1+Heading(Complete)*complete.cases(diabetes))*(hdl)*((n=1)+mean+sd+median),
data=diabetes )
print(tmp.tab0)
```
__Since diabetes type II is assosiated with obesity, let's take a look at _weight_ and _waist_; But first we need to pre-process columns:__
```{r}
diabetes$weight.Q <- with(diabetes, cut(weight.kg,
breaks=quantile(weight.kg, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
diabetes$waist.Q <- with(diabetes, cut(waist,
breaks=quantile(waist, probs=seq(0,1, by=0.25), na.rm=TRUE),
include.lowest=TRUE))
```
__weight:__
```{r}
tmp.tab1 <- tabular( (weight.Q) ~ (1+Heading(Complete)*complete.cases(diabetes))*(ratio)*((n=1)+mean+sd+median),
data=diabetes)
print(tmp.tab1)
```
__waist:__
```{r}
tmp.tab2 <- tabular( (waist.Q) ~ (1+Heading(Complete)*complete.cases(diabetes))*(ratio)*((n=1)+mean+sd+median),
data=diabetes)
print(tmp.tab2)
```
Output html tables.
```{r, results='asis',echo=FALSE}
html(tmp.tab1)
html(tmp.tab2)
```
__So we can see as the weight (and the waist of course) of patient grows the cholesterol/HDL ration increases accordingly. Seems logical.__
## Make graphs
Make scatterplots, boxplots, and other types of graphs for description.
```{r, message=FALSE}
library(ggplot2)
library(reshape2)
library(broom)
```
__We need to melt the data first:__
```{r, error=FALSE, message=FALSE, warning=FALSE}
melted <- melt(diabetes[(diabetes$ratio < 10 & diabetes$bp.1s < 200), c('id','bp.1s','bp.1d','weight.kg')],
measure.vars = c('bp.1s','bp.1d'), na.rm = TRUE)
```
__Is there some dependency between blood pressure and patient's weight? __
```{r, warning=FALSE, message=FALSE}
ggplot(melted, aes(x=weight.kg, y=value, colour=variable)) +
geom_point() + geom_smooth(method = 'lm')
```
__There is some, but hard to say from the plot how significant it is__
```{r, warning=FALSE, message=FALSE}
ggplot(na.omit(diabetes), aes(x=location, y=ratio, colour=frame)) +
geom_boxplot()
```
```{r, warning=FALSE, message=FALSE}
ggplot(na.omit(diabetes), aes(x=gender, y=hdl)) +
geom_boxplot()
```
## Testing
Test if sex (gender) or location are associated with HDL.
**NOTE!** These are dichotomic variables
## Linear model
Make linear regression models with HDL as response (y) variable. Select the best model and interpret
results.
Document modeling. Output parameter estimates, p values and confidence intervals of the best model.
```{r}
# Explaining glyhb with weight, make model named tmp.m0
# NOTE! becasue of missing values "subset=complete.cases(diabetes)""
tmp.m0 <- glm( hdl ~ gender, data=diabetes,subset=complete.cases(diabetes) )
#print(tmp.m0)
summary(tmp.m0)
```
__1st Model - Gender & Age__
```{r}
tmp.m1 <- lm( hdl ~ gender + age, data=diabetes,subset=complete.cases(diabetes) )
summary(tmp.m1)
```
__Comments: Both gender an age are assosiated with HDL levels. Significance is not very strong with p-values just a bit below the 5% threshold. Women expose higher levels of HDL as indicated by a coefficient of 7.9. Age is meant to increase HDL levels by about 22% each year. The two factors together explain only about 7.5% of variance observed in the data, which is quite low.__
__2nd Model - Add weight to the 1st model __
```{r}
tmp.m2 <- lm( hdl ~ gender + age + weight + stab.glu, data=diabetes,subset=complete.cases(diabetes) )
summary(tmp.m2)
```
__Comments: Weight appear to be highly significant predictor of HDL levels with effect size of -0.12. Negative means that higher weight is assosiated with lower levels of 'good' cholesterol. Stabilized glucose (stab.glu) is also negatively associated with HDL levels. Second model explains 20% of variance.__
__3rd Model: Introduce interaction term__
```{r}
tmp.m3 <- lm( hdl ~ gender + age*weight + stab.glu, data=diabetes,subset=complete.cases(diabetes) )
summary(tmp.m3)
```
__Comments: By assuming interaction between weight and age we improved R-squared up to almost 28 %.__
__Now let's compare the last to models (with and without an interaction term)__
```{r}
anova(tmp.m2, tmp.m3, test='Chisq')
```
__Anova suggests significant difference in the residuals of the two models. So we should prefer tmp.m3 model as it explainc more variance with the same set of parameters. __
__Resulting model coefficients:__
```{r}
tmp.m3$coefficients
```
__P-values are found from summary() output under Pr(>|t|) for each parameter:__
```{r}
summary(tmp.m3)
```
__95% Confiddence intervals for each of the coefficients:__
```{r}
confint(tmp.m3)
```