---
title: "PLS-DA"
format: html
---
## PLS-DA
> https://www.statology.org/partial-least-squares-in-r/
*DA (Discriminant Analysis)* is a supervised statistical learning method that seeks the decision boundary that best separates observations into known classes; it therefore requires class labels for the training data. DA looks for a set of linear combinations of the predictors that separates the classes as well as possible, and the fitted boundary can then be used to classify new observations. It handles multiple predictors acting on the outcome jointly, but it is sensitive to correlation among the inputs: when predictors are highly correlated, DA is prone to overfitting.
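In Fisher's formulation (notation added here for reference), LDA seeks the projection $w$ that maximizes the between-class scatter relative to the within-class scatter:

$$ w^{*} = \arg\max_{w} \frac{w^{T} S_B\, w}{w^{T} S_W\, w} $$

where $S_B$ and $S_W$ are the between-class and within-class scatter matrices. Solving this ratio involves inverting $S_W$, which is why highly correlated predictors (a near-singular $S_W$) destabilize the solution.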
*PLS-DA (Partial Least Squares Discriminant Analysis)* is a statistical learning method for classification. It accounts for correlation among multiple predictors simultaneously, overcoming the overfitting that traditional linear discriminant analysis (LDA) suffers when predictors are highly correlated. PLS-DA is a variant of PLS (Partial Least Squares), a component-based regression method that extracts information from the input variables through linear combinations and relates them to the response. Like principal component regression, PLS-DA projects the inputs into a low-dimensional space, but it is tailored to classification problems.
In PLS-DA, we first compute a set of new variables, called latent variables, from the covariance structure of the inputs. Each latent variable is a linear combination of the inputs that captures as much of their variance as possible while also being maximally related to the response. Extracting latent variables one after another progressively improves the accuracy of the classification model.
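For a single response, the first latent variable solves a covariance-maximization problem (standard PLS1 notation, added here for reference):

$$ w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(Xw,\, y)^2 $$

Because $\operatorname{Cov}(Xw, y)^2 = \operatorname{Var}(Xw)\,\operatorname{Corr}(Xw, y)^2\,\operatorname{Var}(y)$, maximizing it balances exactly the two goals just described: high variance within the inputs and high correlation with the response. Later components repeat the optimization after deflating the data.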
The strength of PLS-DA is that it considers the correlation among the inputs and their relationship to the response at the same time. It can therefore cope with highly correlated predictors while avoiding the overfitting seen with classical LDA. It also handles high-dimensional data well, since it projects the data into a low-dimensional space while retaining as much information as possible.
*PLS (Partial Least Squares)* is a regression method that extracts information from the input variables through linear combinations and models their relationship to the response. Unlike classical multiple linear regression (MLR), PLS does not fit the inputs to the response directly in a single linear equation; instead it describes the input-output relationship through a sequence of latent variables, computed from the covariance structure of the inputs exactly as described above, with each additional latent variable refining the regression model. Because it tolerates correlated, collinear predictors, PLS is widely used for metabolomics and microbiome data, among others. A typical workflow:
- Standardize both the predictor and response variables.
- Calculate M linear combinations (called “PLS components”) of the original p predictor variables that explain a significant amount of variation in both the response variable and the predictor variables.
- Use the method of least squares to fit a linear regression model using the PLS components as predictors.
- Use k-fold cross-validation to find the optimal number of PLS components to keep in the model.
Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables, or when your independent variables are highly correlated. Like principal component regression, PLS reduces the independent variables to a smaller number of uncorrelated components and then performs linear regression on those components rather than on the original data; unlike PCA-derived components, however, PLS components are constructed to covary with the response. PLS emphasizes building predictive models and is not used for screening variables. Unlike OLS, it can include multiple continuous dependent variables, using the correlation structure to identify smaller effects and to model multivariate patterns in the responses.
*Principal component regression (PCR)* first applies principal component analysis to summarize the original predictors into a few new variables, the principal components (PCs), which are linear combinations of the original data. These PCs are then used to build the linear regression model, with the number of components chosen by cross-validation (CV). PCR is suitable when the data set contains highly correlated predictors, but because the PCs are constructed without reference to the outcome, they are not guaranteed to explain it well.
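As a point of comparison, here is a minimal PCR sketch using the same pls package as the examples below (`pcr()` is the PCR analogue of `plsr()`; the Boston data is introduced formally in the PLS section):

```{r}
# PCR: components come from a PCA of the predictors only,
# so they ignore the outcome when they are constructed
pcr.fit <- pls::pcr(medv ~ ., data = MASS::Boston, scale = TRUE, validation = "CV")
# Cross-validated RMSEP, comparable to the PLS fit later on
pls::validationplot(pcr.fit, val.type = "RMSEP")
```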
## Code examples
```{r, include=FALSE}
library(tidyverse)
library(MASS)
library(pls)
library(vegan)
library(FactoMineR)
```
### LDA
Prepare test data; here we use the built-in iris data set and split it into training and test sets.
```{r}
data(iris)
# Use Species as the response and make sure it is a factor
iris$Species <- as.factor(iris$Species)
# Split the data into training and test sets (70/30)
set.seed(123)
trainIndex <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
```
```{r}
# Fit the LDA model on the training set
lda.fit <- MASS::lda(Species ~ ., data = trainData)
# Predict the test set
lda.pred <- predict(lda.fit, testData[, 1:4])
# Compute prediction accuracy
accuracy <- mean(lda.pred$class == testData$Species)
cat("LDA accuracy is", round(accuracy, 2))
```
```{r}
# Combine predicted and actual classes for the test set
results <- data.frame(actual = testData$Species, predicted = lda.pred$class)
# Confusion matrix on the training set
p <- predict(lda.fit, trainData[, 1:4])
tab <- table(Predicted = p$class, Actual = trainData$Species)
tab
```
```{r}
# Training-set accuracy: correct classifications lie on the diagonal
sum(diag(tab)) / sum(tab)
```
```{r}
# Histogram of the first discriminant (LD1) scores by species
ldahist(data = p$x[, 1], g = trainData$Species)
```
*Interpret the result:* there is clearly no overlap between the first and second species or between the first and third, but some overlap between the second and third. In other words, the three groups are separated fairly well along the first discriminant axis.
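How strongly the separation concentrates in LD1 can be checked from the proportion of between-group variance each discriminant captures (the "proportion of trace" printed with the fit); a minimal check:

```{r}
# Proportion of between-group variance captured by each discriminant
lda.fit$svd^2 / sum(lda.fit$svd^2)
```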
```{r}
# Histogram of the second discriminant (LD2) scores by species
ldahist(data = p$x[, 2], g = trainData$Species)
```
*Interpret the result:* the histogram based on LD2 shows nearly complete overlap between the groups, so this axis contributes little to the separation.
```{r}
# Pairwise LDA decision regions for every pair of predictors
klaR::partimat(Species ~ ., data = trainData, method = "lda")
```
*Interpret the result:* `partimat()` draws the LDA decision regions for each pair of predictors and marks misclassified observations, giving a pairwise view of how well the classes separate.
### PLS
There are two main algorithms for PLS: NIPALS and SIMPLS. PLS can handle both regression and classification problems. The pls package lets you choose the algorithm explicitly; see the sketch after the data split below.
```{r}
# ?pls::gasoline
data("Boston", package = "MASS")
# Split the data into training and test sets (80/20)
set.seed(123)
training.samples <- Boston$medv %>%
caret::createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```
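As noted above, `plsr()` selects the algorithm through its `method` argument; a minimal sketch on the same training data (the method names below are those documented in `?pls::plsr`):

```{r}
# NIPALS corresponds to method = "oscorespls" (orthogonal scores);
# SIMPLS is method = "simpls"; the default "kernelpls" yields the
# same fitted model as NIPALS
pls.simpls <- pls::plsr(medv ~ ., data = train.data, scale = TRUE, method = "simpls")
```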
```{r}
# Fit a PLS regression model
# ?pls::plsr
# ncomp fixes the number of components; see below for how to choose it
pls.fit <- pls::plsr(medv ~ ., ncomp = 5, data = train.data,
scale = TRUE, # ensures no predictor dominates simply because of its units
validation = "CV" # k-fold cross-validation to evaluate the model
)
# Fit without fixing ncomp (keeps all components) for component selection
pls_null <- pls::plsr(medv ~ ., data = train.data, scale = TRUE, validation = "CV")
```
We look at the test root mean squared error (test RMSE) estimated by k-fold cross-validation to decide how many PLS components are worth keeping.
```{r}
summary(pls_null)
```
*In the output, look at* VALIDATION: RMSEP (the test RMSE estimated by k-fold cross-validation): the CV value for each number of components, where smaller is better; and TRAINING: % variance explained, the percentage of variance in the response explained by the PLS components.
```{r}
#visualize cross-validation plots
validationplot(pls_null)
validationplot(pls_null, val.type="MSEP")
validationplot(pls_null, val.type="R2")
```
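Rather than reading the plots by eye, `pls::selectNcomp()` applies a formal heuristic to the cross-validation curve and returns a suggested number of components; a minimal sketch:

```{r}
# "onesigma" picks the smallest ncomp whose CV error is within one standard
# error of the global minimum; "randomization" is a permutation-based test
pls::selectNcomp(pls_null, method = "onesigma", plot = TRUE)
```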
```{r}
# Evaluate the regression fit on the test set
pls.pred <- predict(pls.fit, test.data, ncomp = 5)
# Test RMSE
sqrt(mean((pls.pred - test.data$medv)^2))
# A rough cross-tabulation of rounded predictions against rounded observations
table(round(pls.pred), round(test.data$medv))
```
```{r}
# Predicted-vs-measured plot on the training data, then the test predictions
plot(pls.fit, ncomp = 5, line = TRUE, col = "red")
plot(as.numeric(pls.pred), test.data$medv, col = "blue", pch = 19, cex = 0.7,
     main = "PLS Results", xlab = "Predicted medv", ylab = "Observed medv")
```
*Constrained ordination with the vegan package:* vegan does not implement PLS itself, but its `rda()` performs redundancy analysis (RDA), a closely related constrained ordination, so the iris example can be recast as an RDA of the four measurements constrained by Species.
```{r}
# RDA: ordinate the predictors under the constraint of the class variable
rda.fit <- vegan::rda(trainData[, 1:4] ~ Species, data = trainData, scale = TRUE)
# Biplot of the constrained ordination
plot(rda.fit, scaling = 3)
ordiplot(rda.fit, display = "sites", type = "n")
points(rda.fit, display = "sites", pch = 19,
       col = as.numeric(trainData$Species) + 1, cex = 0.7)
# Summary of the RDA
summary(rda.fit)
```
### PLS-DA
> https://bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html
```{r}
# The ropls R package implements the PCA, PLS(-DA) and OPLS(-DA) approaches with the original, NIPALS-based, versions of the algorithms
library(ropls)
library(mixOmics)
```
For PLS (and OPLS), the Y response(s) must be provided to the opls method.
Y can be either a numeric vector (respectively matrix) for single (respectively multiple) (O)PLS regression, or a character factor for (O)PLS-DA classification.
```{r}
data(sacurine)
genderFc <- sacurine$sampleMetadata[, "gender"]
# With a factor y and the default orthoI = 0, opls() performs PLS-DA
# When orthoI = NA, OPLS is performed and the number of orthogonal components
# is computed automatically by cross-validation (up to 9 orthogonal components)
# ?ropls::opls — the same function covers PCA, PLS(-DA), and OPLS(-DA)
plsda.fit <- ropls::opls(x = sacurine$dataMatrix, y = genderFc)
plsda.fit
```
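Before interpreting the scores, it is worth looking at the model diagnostics that ropls provides; a short sketch using plot types documented in `?ropls::plot`:

```{r}
# The permutation plot (R2Y/Q2 under label permutation) guards against
# overfitting; "x-score" reproduces the score plot used below
plot(plsda.fit, typeVc = "permutation")
plot(plsda.fit, typeVc = "x-score")
```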
```{r}
# VIP values help identify influential metabolites
vipVn <- getVipVn(plsda.fit)
# A common rule of thumb keeps variables with VIP > 1
vipVn_select <- vipVn[vipVn > 1]
head(sort(vipVn_select, decreasing = TRUE))
```
```{r}
# Extract the score coordinates
data <- as.data.frame(plsda.fit@scoreMN)
data$group <- genderFc
data$samples <- rownames(data)
# Extract the variance explained by each component
x_lab <- plsda.fit@modelDF[1, "R2X"] * 100
y_lab <- plsda.fit@modelDF[2, "R2X"] * 100
# Plot
col <- c("#1597A5", "#FFC24B", "#FEB3AE")
p2 <- ggplot(data, aes(
  x = p1, y = p2,
  color = group, shape = group
)) + # data, axes and colour mapping
  theme_bw() + # theme
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 10, b = 6, angle = 0), color = "grey", linewidth = 0.5) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 8, b = 4, angle = 0), color = "grey", linewidth = 0.5) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 6, b = 2, angle = 0), color = "grey", linewidth = 0.5) +
  coord_fixed() + # decorative background ellipses; delete if not wanted
  geom_point(size = 1.8) + # draw the points
  theme(panel.grid = element_blank()) +
  geom_vline(xintercept = 0, lty = "dashed", color = "red") +
  geom_hline(yintercept = 0, lty = "dashed", color = "red") + # dashed reference lines
  geom_text(aes(label = samples, y = p2 + 0.4, x = p1 + 0.5, vjust = 0), size = 3.5) + # sample labels
  # guides(color = guide_legend(title = NULL)) + # drop the legend title
  labs(
    x = paste0("P1 (", x_lab, "%)"),
    y = paste0("P2 (", y_lab, "%)")
  ) + # axis titles showing variance explained
  stat_ellipse(
    data = data,
    geom = "polygon", level = 0.95,
    linetype = 2, linewidth = 0.5,
    aes(fill = group),
    alpha = 0.2,
    show.legend = TRUE
  ) +
  scale_color_manual(values = col) + # point colours
  scale_fill_manual(values = c("#1597A5", "#FFC24B", "#FEB3AE")) +
  theme(
    axis.title.x = element_text(size = 12), # x-axis title text
    axis.title.y = element_text(size = 12, angle = 90), # y-axis title text
    axis.text.y = element_text(size = 10), # y-axis tick labels
    axis.text.x = element_text(size = 10), # x-axis tick labels
    panel.grid = element_blank()
  ) # hide grid lines
p2
```
```{r}
# getSubsetVi() requires a model trained on a subset, so refit with
# subset = "odd" (odd samples train, even samples test), as in the ropls vignette
plsda.fit.sub <- ropls::opls(x = sacurine$dataMatrix, y = genderFc, subset = "odd")
# Check predictions on the training subset
trainVi <- getSubsetVi(plsda.fit.sub)
table(sacurine$sampleMetadata[, "gender"][trainVi], fitted(plsda.fit.sub))
# Assess performance on the held-out test subset
table(sacurine$sampleMetadata[, "gender"][-trainVi],
      predict(plsda.fit.sub, sacurine$dataMatrix[-trainVi, ]))
```
```{r}
# Accuracy on the held-out test subset
plsda.pred <- predict(plsda.fit.sub, sacurine$dataMatrix[-trainVi, ])
accuracy <- mean(plsda.pred == genderFc[-trainVi])
cat("PLS-DA accuracy is", round(accuracy, 2))
```
*PCA for sacurine data*
```{r}
# With no y, opls() performs PCA
sacurine.pca <- opls(sacurine$dataMatrix)
```
*Analysis with the mixOmics package*
```{r}
## Example data; see ?breast.tumors for details
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- breast.tumors$sample$treatment
```
```{r}
# Run PLS-DA; see ?plsda. Keep the first 5 components here;
# the optimal ncomp can be chosen by cross-validation with perf(),
# as sketched below
plsda_result <- mixOmics::plsda(X, Y, ncomp = 5, scale = TRUE)
```
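To answer the ncomp question in the comment above, mixOmics provides `perf()`, which cross-validates the fitted model across all components; a minimal sketch (the fold and repeat counts are arbitrary choices here):

```{r}
# Repeated 5-fold cross-validation of the classification error rate
perf_res <- mixOmics::perf(plsda_result, validation = "Mfold", folds = 5,
                           nrepeat = 10, progressBar = FALSE)
plot(perf_res)
perf_res$choice.ncomp # suggested ncomp per distance and error measure
```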
```{r}
predictions <- predict(plsda_result, X, dist = "max.dist")
prediction <- predictions$class$max.dist[, 4] # predicted classes using 4 components
get.BER(get.confusion_matrix(truth = Y, predicted = prediction)) # balanced error rate
```
```{r}
# plotIndiv() offers flexible 2-D and 3-D plotting styles, including
# 'ggplot2', 'lattice', 'graphics' and '3d'; see ?plotIndiv
plotIndiv(plsda_result, ind.names = TRUE, style = "ggplot2")
# plotIndiv(plsda_result, ind.names = TRUE, style = "3d")
```
```{r}
plotIndiv(plsda_result, comp = c(1, 2),
          group = Y, style = "ggplot2", ellipse = TRUE,
          size.xlabel = 20, size.ylabel = 20, size.axis = 20, pch = 16, cex = 5)
```
```{r}
df <- unclass(plsda_result)
# Extract the score coordinates
df1 <- as.data.frame(df$variates$X)
df1$group <- Y
df1$samples <- rownames(df1)
# Extract the variance explained
explain <- df$prop_expl_var$X
x_lable <- round(explain[1], digits = 3)
y_lable <- round(explain[2], digits = 3)
# Plot
col <- c("#1597A5", "#FFC24B", "#FEB3AE")
p1 <- ggplot(df1, aes(
  x = comp1, y = comp2,
  color = group, shape = group
)) + # data, axes and colour mapping
  theme_bw() + # theme
  geom_point(size = 1.8) + # draw the points
  theme(panel.grid = element_blank()) +
  geom_vline(xintercept = 0, lty = "dashed") +
  geom_hline(yintercept = 0, lty = "dashed") + # dashed reference lines
  geom_text(aes(label = samples, y = comp2 + 0.4, x = comp1 + 0.5, vjust = 0), size = 3.5) + # sample labels
  # guides(color = guide_legend(title = NULL)) + # drop the legend title
  labs(
    x = paste0("P1 (", x_lable * 100, "%)"),
    y = paste0("P2 (", y_lable * 100, "%)")
  ) + # axis titles showing variance explained
  stat_ellipse(
    data = df1,
    geom = "polygon", level = 0.95,
    linetype = 2, linewidth = 0.5,
    aes(fill = group),
    alpha = 0.2,
    show.legend = TRUE
  ) +
  scale_color_manual(values = col) + # point colours
  scale_fill_manual(values = c("#1597A5", "#FFC24B", "#FEB3AE")) +
  theme(
    axis.title.x = element_text(size = 12), # x-axis title text
    axis.title.y = element_text(size = 12, angle = 90), # y-axis title text
    axis.text.y = element_text(size = 10), # y-axis tick labels
    axis.text.x = element_text(size = 10), # x-axis tick labels
    panel.grid = element_blank()
  ) # hide grid lines
p1
```
```{r}
sample_site <- data.frame(plsda_result$variates)[1:2]
sample_site$names <- rownames(sample_site)
names(sample_site)[1:2] <- c("plsda1", "plsda2")
# sample_site <- merge(sample_site, mapfile, by.x = "names", by.y = "Description", all.x = TRUE)
sample_site$group <- Y
# Variance explained by the first two components (same slot as used above)
plsda_result_eig <- plsda_result$prop_expl_var$X[1:2]
p_plsda <- ggplot(sample_site, aes(plsda1, plsda2, color = group, label = names)) +
  geom_point(size = 1.5, alpha = 0.6) +
  stat_ellipse(show.legend = FALSE, linetype = 2) +
  scale_color_manual(values = c("#1D7ACC", "#F67433", "#00815F", "#C673FF2E", "blue2")) +
  scale_shape_manual(values = c(16, 17)) +
  theme(
    panel.grid = element_line(color = "grey50"),
    panel.background = element_rect(color = "black", fill = "transparent")
  ) +
  theme(legend.title = element_blank(), legend.key = element_rect(fill = "transparent")) +
  labs(
    x = paste("PLS-DA axis1 ( explained variance ", round(100 * plsda_result_eig[1], 2), "% )", sep = ""),
    y = paste("PLS-DA axis2 ( explained variance ", round(100 * plsda_result_eig[2], 2), "% )", sep = "")
  )
p_plsda
```