---
title: "PLS-DA"
format: html
---
## PLS-DA
> https://www.statology.org/partial-least-squares-in-r/
*DA (Discriminant Analysis)* is a supervised statistical learning method that seeks the decision boundary that best separates observations into known classes; it therefore requires class labels for the training data. DA looks for a set of linear combinations of the predictors that separates the classes as well as possible, and the fitted boundary can then be used to classify new observations. It handles multiple predictors acting on the outcome jointly, but it is sensitive to correlation among the inputs: when predictors are highly correlated, DA is prone to overfitting.
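In Fisher's formulation (notation added here for reference), LDA seeks the projection $w$ that maximizes the between-class scatter relative to the within-class scatter:

$$ w^{*} = \arg\max_{w} \frac{w^{T} S_B\, w}{w^{T} S_W\, w} $$

where $S_B$ and $S_W$ are the between-class and within-class scatter matrices. Solving this ratio involves inverting $S_W$, which is why highly correlated predictors (a near-singular $S_W$) destabilize the solution.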
*PLS-DA (Partial Least Squares Discriminant Analysis)* is a statistical learning method for classification. It accounts for correlation among multiple predictors simultaneously, overcoming the overfitting that traditional linear discriminant analysis (LDA) suffers when predictors are highly correlated. PLS-DA is a variant of PLS (Partial Least Squares), a component-based regression method that extracts information from the input variables through linear combinations and relates them to the response. Like principal component regression, PLS-DA projects the inputs into a low-dimensional space, but it is tailored to classification problems.
In PLS-DA, we first compute a set of new variables, called latent variables, from the covariance structure of the inputs. Each latent variable is a linear combination of the inputs that captures as much of their variance as possible while also being maximally related to the response. Extracting latent variables one after another progressively improves the accuracy of the classification model.
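For a single response, the first latent variable solves a covariance-maximization problem (standard PLS1 notation, added here for reference):

$$ w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(Xw,\, y)^2 $$

Because $\operatorname{Cov}(Xw, y)^2 = \operatorname{Var}(Xw)\,\operatorname{Corr}(Xw, y)^2\,\operatorname{Var}(y)$, maximizing it balances exactly the two goals just described: high variance within the inputs and high correlation with the response. Later components repeat the optimization after deflating the data.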
The strength of PLS-DA is that it considers the correlation among the inputs and their relationship to the response at the same time. It can therefore cope with highly correlated predictors while avoiding the overfitting seen with classical LDA. It also handles high-dimensional data well, since it projects the data into a low-dimensional space while retaining as much information as possible.
*PLS (Partial Least Squares)* is a regression method that extracts information from the input variables through linear combinations and models their relationship to the response. Unlike classical multiple linear regression (MLR), PLS does not fit the inputs to the response directly in a single linear equation; instead it describes the input-output relationship through a sequence of latent variables, computed from the covariance structure of the inputs exactly as described above, with each additional latent variable refining the regression model. Because it tolerates correlated, collinear predictors, PLS is widely used for metabolomics and microbiome data, among others. A typical workflow:
- Standardize both the predictor and response variables.
- Calculate M linear combinations (called “PLS components”) of the original p predictor variables that explain a significant amount of variation in both the response variable and the predictor variables.
- Use the method of least squares to fit a linear regression model using the PLS components as predictors.
- Use k-fold cross-validation to find the optimal number of PLS components to keep in the model.
Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables, or when your independent variables are highly correlated. Like principal component regression, PLS reduces the independent variables to a smaller number of uncorrelated components and then performs linear regression on those components rather than on the original data; unlike PCA-derived components, however, PLS components are constructed to covary with the response. PLS emphasizes building predictive models and is not used for screening variables. Unlike OLS, it can include multiple continuous dependent variables, using the correlation structure to identify smaller effects and to model multivariate patterns in the responses.
*Principal component regression (PCR)* first applies principal component analysis to summarize the original predictors into a few new variables, the principal components (PCs), which are linear combinations of the original data. These PCs are then used to build the linear regression model, with the number of components chosen by cross-validation (CV). PCR is suitable when the data set contains highly correlated predictors, but because the PCs are constructed without reference to the outcome, they are not guaranteed to explain it well.
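As a point of comparison, here is a minimal PCR sketch using the same pls package as the examples below (`pcr()` is the PCR analogue of `plsr()`; the Boston data is introduced formally in the PLS section):

```{r}
# PCR: components come from a PCA of the predictors only,
# so they ignore the outcome when they are constructed
pcr.fit <- pls::pcr(medv ~ ., data = MASS::Boston, scale = TRUE, validation = "CV")
# Cross-validated RMSEP, comparable to the PLS fit later on
pls::validationplot(pcr.fit, val.type = "RMSEP")
```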
## Code examples
```{r, include=FALSE}
library(tidyverse)
library(MASS)
library(pls)
library(vegan)
library(FactoMineR)
```
### LDA
Prepare test data; here we use the built-in iris data set and split it into training and test sets.
```{r}
data(iris)
# Use Species as the response and make sure it is a factor
iris$Species <- as.factor(iris$Species)
# Split the data into training and test sets (70/30)
set.seed(123)
trainIndex <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
```
```{r}
# Fit the LDA model on the training set
lda.fit <- MASS::lda(Species ~ ., data = trainData)
# Predict the test set
lda.pred <- predict(lda.fit, testData[, 1:4])
# Compute prediction accuracy
accuracy <- mean(lda.pred$class == testData$Species)
cat("LDA accuracy is", round(accuracy, 2))
```
```{r}
# Combine predicted and actual classes for the test set
results <- data.frame(actual = testData$Species, predicted = lda.pred$class)
# Confusion matrix on the training set
p <- predict(lda.fit, trainData[, 1:4])
tab <- table(Predicted = p$class, Actual = trainData$Species)
tab
```
```{r}
# Training-set accuracy: correct classifications lie on the diagonal
sum(diag(tab)) / sum(tab)
```
```{r}
# Histogram of the first discriminant (LD1) scores by species
ldahist(data = p$x[, 1], g = trainData$Species)
```
*Interpret the result:* there is clearly no overlap between the first and second species or between the first and third, but some overlap between the second and third. In other words, the three groups are separated fairly well along the first discriminant axis.
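How strongly the separation concentrates in LD1 can be checked from the proportion of between-group variance each discriminant captures (the "proportion of trace" printed with the fit); a minimal check:

```{r}
# Proportion of between-group variance captured by each discriminant
lda.fit$svd^2 / sum(lda.fit$svd^2)
```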
```{r}
# Histogram of the second discriminant (LD2) scores by species
ldahist(data = p$x[, 2], g = trainData$Species)
```
*Interpret the result:* the histogram based on LD2 shows nearly complete overlap between the groups, so this axis contributes little to the separation.
```{r}
# Pairwise LDA decision regions for every pair of predictors
klaR::partimat(Species ~ ., data = trainData, method = "lda")
```
*Interpret the result:* `partimat()` draws the LDA decision regions for each pair of predictors and marks misclassified observations, giving a pairwise view of how well the classes separate.
### PLS
There are two main algorithms for PLS: NIPALS and SIMPLS. PLS can handle both regression and classification problems. The pls package lets you choose the algorithm explicitly; see the sketch after the data split below.
```{r}
# ?pls::gasoline
data("Boston", package = "MASS")
# Split the data into training and test sets (80/20)
set.seed(123)
training.samples <- Boston$medv %>%
caret::createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
```
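As noted above, `plsr()` selects the algorithm through its `method` argument; a minimal sketch on the same training data (the method names below are those documented in `?pls::plsr`):

```{r}
# NIPALS corresponds to method = "oscorespls" (orthogonal scores);
# SIMPLS is method = "simpls"; the default "kernelpls" yields the
# same fitted model as NIPALS
pls.simpls <- pls::plsr(medv ~ ., data = train.data, scale = TRUE, method = "simpls")
```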
```{r}
# Fit a PLS regression model
# ?pls::plsr
# ncomp fixes the number of components; see below for how to choose it
pls.fit <- pls::plsr(medv ~ ., ncomp = 5, data = train.data,
scale = TRUE, # ensures no predictor dominates simply because of its units
validation = "CV" # k-fold cross-validation to evaluate the model
)
# Fit without fixing ncomp (keeps all components) for component selection
pls_null <- pls::plsr(medv ~ ., data = train.data, scale = TRUE, validation = "CV")
```
We look at the test root mean squared error (test RMSE) estimated by k-fold cross-validation to decide how many PLS components are worth keeping.
```{r}
summary(pls_null)
```
*In the output, look at* VALIDATION: RMSEP (the test RMSE estimated by k-fold cross-validation): the CV value for each number of components, where smaller is better; and TRAINING: % variance explained, the percentage of variance in the response explained by the PLS components.
```{r}
#visualize cross-validation plots
validationplot(pls_null)
validationplot(pls_null, val.type="MSEP")
validationplot(pls_null, val.type="R2")
```
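Rather than reading the plots by eye, `pls::selectNcomp()` applies a formal heuristic to the cross-validation curve and returns a suggested number of components; a minimal sketch:

```{r}
# "onesigma" picks the smallest ncomp whose CV error is within one standard
# error of the global minimum; "randomization" is a permutation-based test
pls::selectNcomp(pls_null, method = "onesigma", plot = TRUE)
```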
```{r}
# Evaluate the regression fit on the test set
pls.pred <- predict(pls.fit, test.data, ncomp = 5)
# Test RMSE
sqrt(mean((pls.pred - test.data$medv)^2))
# A rough cross-tabulation of rounded predictions against rounded observations
table(round(pls.pred), round(test.data$medv))
```
```{r}
# Predicted-vs-measured plot on the training data, then the test predictions
plot(pls.fit, ncomp = 5, line = TRUE, col = "red")
plot(as.numeric(pls.pred), test.data$medv, col = "blue", pch = 19, cex = 0.7,
     main = "PLS Results", xlab = "Predicted medv", ylab = "Observed medv")
```
*Constrained ordination with the vegan package:* vegan does not implement PLS itself, but its `rda()` performs redundancy analysis (RDA), a closely related constrained ordination, so the iris example can be recast as an RDA of the four measurements constrained by Species.
```{r}
# RDA: ordinate the predictors under the constraint of the class variable
rda.fit <- vegan::rda(trainData[, 1:4] ~ Species, data = trainData, scale = TRUE)
# Biplot of the constrained ordination
plot(rda.fit, scaling = 3)
ordiplot(rda.fit, display = "sites", type = "n")
points(rda.fit, display = "sites", pch = 19,
       col = as.numeric(trainData$Species) + 1, cex = 0.7)
# Summary of the RDA
summary(rda.fit)
```
### PLS-DA
> https://bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html
```{r}
# The ropls R package implements the PCA, PLS(-DA) and OPLS(-DA) approaches with the original, NIPALS-based, versions of the algorithms
library(ropls)
library(mixOmics)
```
For PLS (and OPLS), the Y response(s) must be provided to the opls method.
Y can be either a numeric vector (respectively matrix) for single (respectively multiple) (O)PLS regression, or a character factor for (O)PLS-DA classification.
```{r}
data(sacurine)
genderFc <- sacurine$sampleMetadata[, "gender"]
# With a factor y and the default orthoI = 0, opls() performs PLS-DA
# When orthoI = NA, OPLS is performed and the number of orthogonal components
# is computed automatically by cross-validation (up to 9 orthogonal components)
# ?ropls::opls — the same function covers PCA, PLS(-DA), and OPLS(-DA)
plsda.fit <- ropls::opls(x = sacurine$dataMatrix, y = genderFc)
plsda.fit
```
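Before interpreting the scores, it is worth looking at the model diagnostics that ropls provides; a short sketch using plot types documented in `?ropls::plot`:

```{r}
# The permutation plot (R2Y/Q2 under label permutation) guards against
# overfitting; "x-score" reproduces the score plot used below
plot(plsda.fit, typeVc = "permutation")
plot(plsda.fit, typeVc = "x-score")
```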
```{r}
# VIP values help identify influential metabolites
vipVn <- getVipVn(plsda.fit)
# A common rule of thumb keeps variables with VIP > 1
vipVn_select <- vipVn[vipVn > 1]
head(sort(vipVn_select, decreasing = TRUE))
```
```{r}
# Extract the score coordinates
data <- as.data.frame(plsda.fit@scoreMN)
data$group <- genderFc
data$samples <- rownames(data)
# Extract the variance explained by each component
x_lab <- plsda.fit@modelDF[1, "R2X"] * 100
y_lab <- plsda.fit@modelDF[2, "R2X"] * 100
# Plot
col <- c("#1597A5", "#FFC24B", "#FEB3AE")
p2 <- ggplot(data, aes(
  x = p1, y = p2,
  color = group, shape = group
)) + # data, axes and colour mapping
  theme_bw() + # theme
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 10, b = 6, angle = 0), color = "grey", linewidth = 0.5) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 8, b = 4, angle = 0), color = "grey", linewidth = 0.5) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 6, b = 2, angle = 0), color = "grey", linewidth = 0.5) +
  coord_fixed() + # decorative background ellipses; delete if not wanted
  geom_point(size = 1.8) + # draw the points
  theme(panel.grid = element_blank()) +
  geom_vline(xintercept = 0, lty = "dashed", color = "red") +
  geom_hline(yintercept = 0, lty = "dashed", color = "red") + # dashed reference lines
  geom_text(aes(label = samples, y = p2 + 0.4, x = p1 + 0.5, vjust = 0), size = 3.5) + # sample labels
  # guides(color = guide_legend(title = NULL)) + # drop the legend title
  labs(
    x = paste0("P1 (", x_lab, "%)"),
    y = paste0("P2 (", y_lab, "%)")
  ) + # axis titles showing variance explained
  stat_ellipse(
    data = data,
    geom = "polygon", level = 0.95,
    linetype = 2, linewidth = 0.5,
    aes(fill = group),
    alpha = 0.2,
    show.legend = TRUE
  ) +
  scale_color_manual(values = col) + # point colours
  scale_fill_manual(values = c("#1597A5", "#FFC24B", "#FEB3AE")) +
  theme(
    axis.title.x = element_text(size = 12), # x-axis title text
    axis.title.y = element_text(size = 12, angle = 90), # y-axis title text
    axis.text.y = element_text(size = 10), # y-axis tick labels
    axis.text.x = element_text(size = 10), # x-axis tick labels
    panel.grid = element_blank()
  ) # hide grid lines
p2
```
```{r}
# getSubsetVi() requires a model trained on a subset, so refit with
# subset = "odd" (odd samples train, even samples test), as in the ropls vignette
plsda.fit.sub <- ropls::opls(x = sacurine$dataMatrix, y = genderFc, subset = "odd")
# Check predictions on the training subset
trainVi <- getSubsetVi(plsda.fit.sub)
table(sacurine$sampleMetadata[, "gender"][trainVi], fitted(plsda.fit.sub))
# Assess performance on the held-out test subset
table(sacurine$sampleMetadata[, "gender"][-trainVi],
      predict(plsda.fit.sub, sacurine$dataMatrix[-trainVi, ]))
```
```{r}
# Accuracy on the held-out test subset
plsda.pred <- predict(plsda.fit.sub, sacurine$dataMatrix[-trainVi, ])
accuracy <- mean(plsda.pred == genderFc[-trainVi])
cat("PLS-DA accuracy is", round(accuracy, 2))
```
*PCA for sacurine data*
```{r}
# With no y, opls() performs PCA
sacurine.pca <- opls(sacurine$dataMatrix)
```
*Analysis with the mixOmics package*
```{r}
## Example data; see ?breast.tumors for details
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- breast.tumors$sample$treatment
```
```{r}
# Run PLS-DA; see ?plsda. Keep the first 5 components here;
# the optimal ncomp can be chosen by cross-validation with perf(),
# as sketched below
plsda_result <- mixOmics::plsda(X, Y, ncomp = 5, scale = TRUE)
```
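To answer the ncomp question in the comment above, mixOmics provides `perf()`, which cross-validates the fitted model across all components; a minimal sketch (the fold and repeat counts are arbitrary choices here):

```{r}
# Repeated 5-fold cross-validation of the classification error rate
perf_res <- mixOmics::perf(plsda_result, validation = "Mfold", folds = 5,
                           nrepeat = 10, progressBar = FALSE)
plot(perf_res)
perf_res$choice.ncomp # suggested ncomp per distance and error measure
```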
```{r}
predictions <- predict(plsda_result, X, dist = "max.dist")
prediction <- predictions$class$max.dist[, 4] # predicted classes using 4 components
get.BER(get.confusion_matrix(truth = Y, predicted = prediction)) # balanced error rate
```
```{r}
# plotIndiv() offers flexible 2-D and 3-D plotting styles, including
# 'ggplot2', 'lattice', 'graphics' and '3d'; see ?plotIndiv
plotIndiv(plsda_result, ind.names = TRUE, style = "ggplot2")
# plotIndiv(plsda_result, ind.names = TRUE, style = "3d")
```
```{r}
plotIndiv(plsda_result, comp = c(1, 2),
          group = Y, style = "ggplot2", ellipse = TRUE,
          size.xlabel = 20, size.ylabel = 20, size.axis = 20, pch = 16, cex = 5)
```
```{r}
df <- unclass(plsda_result)
# Extract the score coordinates
df1 <- as.data.frame(df$variates$X)
df1$group <- Y
df1$samples <- rownames(df1)
# Extract the variance explained
explain <- df$prop_expl_var$X
x_lable <- round(explain[1], digits = 3)
y_lable <- round(explain[2], digits = 3)
# Plot
col <- c("#1597A5", "#FFC24B", "#FEB3AE")
p1 <- ggplot(df1, aes(
  x = comp1, y = comp2,
  color = group, shape = group
)) + # data, axes and colour mapping
  theme_bw() + # theme
  geom_point(size = 1.8) + # draw the points
  theme(panel.grid = element_blank()) +
  geom_vline(xintercept = 0, lty = "dashed") +
  geom_hline(yintercept = 0, lty = "dashed") + # dashed reference lines
  geom_text(aes(label = samples, y = comp2 + 0.4, x = comp1 + 0.5, vjust = 0), size = 3.5) + # sample labels
  # guides(color = guide_legend(title = NULL)) + # drop the legend title
  labs(
    x = paste0("P1 (", x_lable * 100, "%)"),
    y = paste0("P2 (", y_lable * 100, "%)")
  ) + # axis titles showing variance explained
  stat_ellipse(
    data = df1,
    geom = "polygon", level = 0.95,
    linetype = 2, linewidth = 0.5,
    aes(fill = group),
    alpha = 0.2,
    show.legend = TRUE
  ) +
  scale_color_manual(values = col) + # point colours
  scale_fill_manual(values = c("#1597A5", "#FFC24B", "#FEB3AE")) +
  theme(
    axis.title.x = element_text(size = 12), # x-axis title text
    axis.title.y = element_text(size = 12, angle = 90), # y-axis title text
    axis.text.y = element_text(size = 10), # y-axis tick labels
    axis.text.x = element_text(size = 10), # x-axis tick labels
    panel.grid = element_blank()
  ) # hide grid lines
p1
```
```{r}
sample_site <- data.frame(plsda_result$variates)[1:2]
sample_site$names <- rownames(sample_site)
names(sample_site)[1:2] <- c("plsda1", "plsda2")
# sample_site <- merge(sample_site, mapfile, by.x = "names", by.y = "Description", all.x = TRUE)
sample_site$group <- Y
# Variance explained by the first two components (same slot as used above)
plsda_result_eig <- plsda_result$prop_expl_var$X[1:2]
p_plsda <- ggplot(sample_site, aes(plsda1, plsda2, color = group, label = names)) +
  geom_point(size = 1.5, alpha = 0.6) +
  stat_ellipse(show.legend = FALSE, linetype = 2) +
  scale_color_manual(values = c("#1D7ACC", "#F67433", "#00815F", "#C673FF2E", "blue2")) +
  scale_shape_manual(values = c(16, 17)) +
  theme(
    panel.grid = element_line(color = "grey50"),
    panel.background = element_rect(color = "black", fill = "transparent")
  ) +
  theme(legend.title = element_blank(), legend.key = element_rect(fill = "transparent")) +
  labs(
    x = paste("PLS-DA axis1 ( explained variance ", round(100 * plsda_result_eig[1], 2), "% )", sep = ""),
    y = paste("PLS-DA axis2 ( explained variance ", round(100 * plsda_result_eig[2], 2), "% )", sep = "")
  )
p_plsda
```