
# Visualization Gallery {#chap-gallery}


```{r,echo=FALSE}
if (is.na(Sys.getenv("CI", NA))) {
  # Set up the Noto fonts for Chinese and Latin text
  sysfonts::font_paths(new = "~/Library/Fonts/")
  ## Songti (serif)
  sysfonts::font_add(
    family = "Noto Serif CJK SC",
    regular = "NotoSerifCJKsc-Regular.otf",
    bold = "NotoSerifCJKsc-Bold.otf"
  )
  ## Heiti (sans-serif)
  sysfonts::font_add(
    family = "Noto Sans CJK SC",
    regular = "NotoSansCJKsc-Regular.otf",
    bold = "NotoSansCJKsc-Bold.otf"
  )
  sysfonts::font_add(
    family = "Noto Serif",
    regular = "NotoSerif-Regular.ttf",
    bold = "NotoSerif-Bold.ttf",
    italic = "NotoSerif-Italic.ttf",
    bolditalic = "NotoSerif-BoldItalic.ttf"
  )
  sysfonts::font_add(
    family = "Noto Sans",
    regular = "NotoSans-Regular.ttf",
    bold = "NotoSans-Bold.ttf",
    italic = "NotoSans-Italic.ttf",
    bolditalic = "NotoSans-BoldItalic.ttf"
  )
} else {
  sysfonts::font_paths(new = c(
    "/usr/share/fonts/opentype/noto/",
    "/usr/share/fonts/truetype/noto/"
  ))
  ## Songti (serif)
  sysfonts::font_add(
    family = "Noto Serif CJK SC",
    regular = "NotoSerifCJK-Regular.ttc",
    bold = "NotoSerifCJK-Bold.ttc"
  )
  ## Heiti (sans-serif)
  sysfonts::font_add(
    family = "Noto Sans CJK SC",
    regular = "NotoSansCJK-Regular.ttc",
    bold = "NotoSansCJK-Bold.ttc"
  )
  sysfonts::font_add(
    family = "Noto Serif",
    regular = "NotoSerif-Regular.ttf",
    bold = "NotoSerif-Bold.ttf",
    italic = "NotoSerif-Italic.ttf",
    bolditalic = "NotoSerif-BoldItalic.ttf"
  )
  sysfonts::font_add(
    family = "Noto Sans",
    regular = "NotoSans-Regular.ttf",
    bold = "NotoSans-Bold.ttf",
    italic = "NotoSans-Italic.ttf",
    bolditalic = "NotoSans-BoldItalic.ttf"
  )
}
```

```{r}
library(ggplot2)           # ggplot2 graphics
library(patchwork)         # plot layout
library(magrittr)          # pipe operator
library(ggrepel)           # text annotation
library(extrafont)         # load external TTF fonts
library(maps)              # map data
library(mapdata)           # map data
library(data.table)        # data manipulation
library(KernSmooth)        # kernel smoothing
library(ggnormalviolin)    # violin plots
library(ggbeeswarm)        # beeswarm plots
library(ggridges)          # ridgeline plots
library(ggpubr)            # combining plots
library(treemap)           # treemaps
library(treemapify)        # treemaps
library(ggquiver)          # vector field plots
library(ggstream)          # stream graphs
library(timelineS)         # timelines
library(ggdendro)          # dendrograms
library(ggfortify)         # visualizing statistical results, e.g. PCA plots
library(gganimate)         # animation
```


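## Pie chart {#sec-ggplot2-pie}

I have a love-hate relationship with pie charts. What I love is that, when showing percentages, a pie chart evokes deeply familiar images such as slices of a cake or shares of a whole, which makes the numbers easy to grasp and remember. What I hate is exactly the same thing: why not just use a bar chart? The human eye distinguishes angles far less accurately than bar lengths, especially when two categories have similar shares. That is why pie charts are so often annotated with percentages next to the slices, as in Figure \@ref(fig:bod-pie), which from a data visualization standpoint is redundant information.

```{r bod-pie, fig.asp=1, fig.width=5, fig.height=5, fig.cap="Pie chart"}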
BOD %>% transform(., ratio = demand / sum(demand)) %>% 
  ggplot(., aes(x = "", y = demand, fill = reorder(Time, demand))) +
  geom_bar(stat = "identity", show.legend = FALSE, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(x = 1.6, label = paste0(round(ratio, digits = 4) * 100, "%")),
    position = position_stack(vjust = 0.5), color = "black"
  ) +
  geom_text(aes(x = 1.2, label = Time),
    position = position_stack(vjust = 0.5), color = "black"
  ) +
  theme_void(base_size = 14)
```
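
The complaint above can be made concrete. As a minimal sketch, not part of the original example, the same BOD shares can be drawn as a bar chart, where lengths on a common scale replace angles. The names `bod_share` and `ratio` are chosen here for illustration.

```{r}
library(ggplot2)

# Share of total demand at each time point, as in the pie chart above
bod_share <- transform(BOD, ratio = demand / sum(demand))

ggplot(bod_share, aes(x = reorder(factor(Time), ratio), y = ratio)) +
  geom_col(fill = "steelblue") +
  # Percentage labels placed just past the end of each bar
  geom_text(aes(label = scales::percent(ratio, accuracy = 0.01)), hjust = -0.1) +
  scale_y_continuous(
    labels = scales::percent,
    expand = expansion(mult = c(0, 0.15))
  ) +
  coord_flip() +
  labs(x = "Time", y = "Share of demand") +
  theme_minimal(base_size = 14)
```

Sorting the bars by share makes close values easier to rank than neighboring pie slices.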

`plot_ly(type = "pie", ... )` and adding a layer with `add_pie()` produce the same result.

```{r diamond-pie, fig.cap="Pie charts", eval=knitr::is_html_output()}
dat = aggregate(carat ~ cut, data = diamonds, FUN = length)
plotly::plot_ly() %>%
  plotly::add_pie(
    data = dat, labels = ~cut, values = ~carat,
    name = "Simple pie chart 1", domain = list(row = 0, column = 0)
  ) %>%
  plotly::add_pie(
    data = dat, labels = ~cut, values = ~carat, hole = 0.6,
    textposition = "inside", textinfo = "label+percent",
    name = "Simple pie chart 2", domain = list(row = 0, column = 1)
  ) %>%
  plotly::layout(
    title = "Multi-plot layout", showlegend = FALSE,
    grid = list(rows = 1, columns = 2),
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
  ) %>% 
  plotly::config(displayModeBar = FALSE)
```

Setting the `hole` argument draws a donut chart, for example `hole = 0.6`.

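## Map {#sec-ggplot2-map}

The USArrests dataset records, for each of the 50 US states in 1973, the number of arrests per 100,000 residents for assault, murder, and rape, together with the percentage of the population living in urban areas. A map here means a schematic bounded by administrative divisions, such as Figure \@ref(fig:state-crimes).

```{r state-crimes, fig.cap="Crime in US states, 1973", fig.width=8, fig.height=4}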
library(maps)
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
# Equivalent to crimes %>% tidyr::pivot_longer(Murder:Rape)
vars <- lapply(names(crimes)[-1], function(j) {
  data.frame(state = crimes$state, variable = j, value = crimes[[j]])
})
crimes_long <- do.call("rbind", vars)
states_map <- map_data("state")
ggplot(crimes, aes(map_id = state)) +
  geom_map(aes(fill = Murder), map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_binned(type = "viridis") +
  coord_map() +
  theme_minimal()
```

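First look at China and its neighbors in Figure \@ref(fig:incorrect-map). The shortcomings of this map are that the South China Sea and the nine-dash line are not marked and that Taiwan and mainland China are not shown in the same color. The map data come from the R packages **maps** and **mapdata**; a map like this is not suitable for formal publications in China.

```{r incorrect-map, fig.cap="China and its neighbors", fig.width=8, fig.height=4}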
library(maps)
library(mapdata)
east_asia <- map_data("worldHires",
  region = c(
    "Japan", "Taiwan", "China",
    "North Korea", "South Korea"
  )
)
ggplot(east_asia, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(colour = "black") +
  scale_fill_brewer(palette = "Set2") +
  coord_map() +
  theme_minimal()
```

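Drawing a real map requires weighing the projection, viewing orientation, resolution, regulations, and a range of other factors; it is a complex kind of graphic, as Figure \@ref(fig:draw-map) shows.

```{r draw-map,fig.cap="Drawing maps the right way",fig.width=4,fig.height=4,out.width="45%",fig.show='hold',fig.ncol=2,fig.subcap=c("Mercator projection", "Viewed from the North Pole", "Orthographic projection", "Orthographic projection viewed from the North Pole"),collapse=TRUE}
worldmap <- map_data("world")

# Default orientation under the default mercator projection: c(90, 0, mean(range(x)))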
ggplot(worldmap, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = region), show.legend = FALSE) +
  coord_map(
    xlim = c(-120, 40), ylim = c(30, 90)
  )

# Change the viewing orientation
ggplot(worldmap, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = region), show.legend = FALSE) +
  coord_map(
    xlim = c(-120, 40), ylim = c(30, 90),
    orientation = c(90, 0, 0)
  )

# Change the projection
ggplot(worldmap, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = region), show.legend = FALSE) +
  coord_map("ortho",
    xlim = c(-120, 40), ylim = c(30, 90)
  )

# Change both
ggplot(worldmap, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = region), show.legend = FALSE) +
  coord_map("ortho",
    xlim = c(-120, 40), ylim = c(30, 90),
    orientation = c(90, 0, 0)
  )
```


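## Heatmap {#sec-ggplot2-heatmap}

<!-- The [heatmap3](https://cran.r-project.org/package=heatmap3) package provides a function compatible with Base R's heatmap() -->

The [ComplexHeatmap](https://github.com/jokergoo/ComplexHeatmap) package, developed by Zuguang Gu, visualizes complex data to reveal patterns across associated datasets, for example genomic data and survival data. For more applications, see the developer's book [ComplexHeatmap Complete Reference](https://jokergoo.github.io/ComplexHeatmap-reference/book/). The package is published on Bioconductor at <https://www.bioconductor.org/packages/ComplexHeatmap>. Before using it, make sure the **BiocManager** package, which manages all packages on Bioconductor, is installed; then install **ComplexHeatmap** [@Gu_2016_heatmap].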

```{r, eval=!require("ComplexHeatmap")}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
```

## Scatter plot {#ggplot2-scatter}

The following uses the diamonds dataset to walk through how a ggplot2 plot is built. First load the dataset and inspect its contents.

```{r}
data(diamonds)
str(diamonds)
```

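Map the numeric variable carat to the x axis.

```{r diamonds-axis}
#| fig.subcap=c("Specify the x axis","Numeric variable price on the y axis","Ordered factor cut mapped to color","A single fixed color"),
#| fig.cap="Building up a plot",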
#| out.width="35%",
#| fig.ncol=2,
#| fig.width=2
ggplot(diamonds, aes(x = carat))
ggplot(diamonds, aes(x = carat, y = price))
ggplot(diamonds, aes(x = carat, color = cut))
ggplot(diamonds, aes(x = carat), color = "steelblue")
```

Add a data layer on top of Figure \@ref(fig:diamonds-axis).

```{r scatter,fig.cap="Adding a data layer"}
set.seed(2020) # fix the seed so the random subsample is reproducible
sub_diamonds <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point()
```

Color the scatter plot in Figure \@ref(fig:scatter).

```{r scatter-color-1,fig.cap="Coloring the scatter plot"}
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point(color = "steelblue")
```


```{r scatter-scale-1,fig.cap="Formatting axis tick labels"}
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point(color = "steelblue") +
  scale_y_continuous(
    labels = scales::unit_format(unit = "k", scale = 1e-3),
    breaks = seq(0, 20000, 4000)
  )
```

Use another variable, cut, to color the points by category.

```{r scatter-color-2,fig.cap="Scatter plot colored by category"}
ggplot(sub_diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()
```

A similar device is grouping. By default ggplot2 treats all observations as a single group; here the categorical variable cut defines the groups.

```{r scatter-group,fig.cap="Grouping"}
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut)) +
  geom_point()
```

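Figure \@ref(fig:scatter-group) does not reveal the grouping, so the next example fits a separate linear regression for each level of cut.

```{r group-lm,fig.cap="Linear regression by group",fig.ncol=1}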
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "lm")
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut)) +
  geom_point() +
  geom_smooth(method = "lm")
```

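We can of course choose a more suitable smoother, such as local polynomial regression (`loess`). That method is not well suited to large numbers of observations because it uses a lot of memory; in that case a generalized additive model is recommended for smoothing.

```{r,fig.cap="Local polynomial smoothing"}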
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut)) +
  geom_point() +
  geom_smooth(method = "loess")
```

```{r group-gam,fig.cap="Generalized additive smoothing by group"}
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
```

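The [ggfortify](https://github.com/sinhrks/ggfortify) package supports visualizing the results of many more statistical analyses.

To separate the groups more clearly, we facet or color Figure \@ref(fig:group-gam).

```{r group-facet,fig.cap=c("Faceting by group","Coloring by group"),fig.ncol=1}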
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) +
  facet_grid(~cut)
ggplot(sub_diamonds, aes(x = carat, y = price, group = cut, color = cut)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
```

Another way to present a categorical scatter plot is faceting, here with cut as the faceting variable.

```{r scatter-facet,fig.cap="Faceted scatter plot"}
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point() +
  facet_grid(~cut)
```

Color Figure \@ref(fig:scatter-facet).

```{r scatter-facet-color-1,fig.cap="Coloring the faceted scatter plot"}
ggplot(sub_diamonds, aes(x = carat, y = price)) +
  geom_point(color = "steelblue") +
  facet_grid(~cut)
```

Building on Figure \@ref(fig:scatter-facet-color-1), give each category its own color.

```{r scatter-facet-color-2,fig.cap="A different color for each category"}
ggplot(sub_diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  facet_grid(~cut)
```

Remove the legend, which is now redundant.

```{r scatter-facet-color-3,fig.cap="Legend removed"}
ggplot(sub_diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(show.legend = FALSE) +
  facet_grid(~cut)
```

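Four blocks of land received different fertilizers, with fertility ordered 4 < 2 < 3 < 1; the plot shows how wheat yield varies with fertility.

```{r,fig.cap="Multiple legends"}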
data(Wheat2, package = "nlme") # Wheat Yield Trials
library(colorspace)
ggplot(Wheat2, aes(longitude, latitude)) +
  geom_point(aes(size = yield, colour = Block)) +
  scale_color_discrete_sequential(palette = "Viridis") +
  scale_x_continuous(breaks = seq(0, 30, 5)) +
  scale_y_continuous(breaks = seq(0, 50, 10))
```
  
```{r category-ggplot,fig.cap="Scatter plot colored by category"}
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) +
  geom_point()
```

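Layers, grouping, faceting, and scatter plots have now been covered; next come other statistical graphics such as box plots, violin plots, and bar charts.

```{r,fig.cap="Monthly airline passengers, 1949 to 1960"}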
dat <- as.data.frame(cbind(rep(1948 + seq(12), each = 12), rep(seq(12), 12), AirPassengers))
colnames(dat) <- c("year", "month", "passengers")

ggplot(data = dat, aes(x = as.factor(year), y = as.factor(month))) +
  stat_sum(aes(size = passengers), colour = "lightblue") +
  scale_size(range = c(1, 10), breaks = seq(100, 650, 50)) +
  labs(x = "Year", y = "Month", size = "Passengers") +
  theme_minimal()
```

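## Bar chart {#sec-ggplot2-barplot}

Bar charts are especially suited to categorical variables. Here we show the number of diamonds at each level of cut quality. We can plot precomputed counts directly by setting `stat = "identity"` in `geom_bar()`.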

```{r}
# Map two columns of the data frame, i.e. compute the count of each category ourselves first
with(diamonds, table(cut))
cut_df <- as.data.frame(table(diamonds$cut))
ggplot(cut_df, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity")
```
```{r diamonds-barplot-1,fig.cap="Bar chart of counts"}
ggplot(diamonds, aes(x = cut)) + geom_bar()
```

There are three other equivalent ways to write it.

```{r}
ggplot(diamonds, aes(x = cut)) + geom_bar(stat = "count")
ggplot(diamonds, aes(x = cut, y = ..count..)) + geom_bar()
ggplot(diamonds, aes(x = cut, y = stat(count))) + geom_bar()
```
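
In current ggplot2 the computed count can also be referenced with `after_stat()`, which was introduced in ggplot2 3.3.0; the `..count..` and `stat(count)` notations are older, and recent releases report them as deprecated. A minimal sketch, assuming ggplot2 3.3.0 or later, that also checks the computed counts against `table()`; the name `bar_counts` is illustrative.

```{r}
library(ggplot2)

p <- ggplot(ggplot2::diamonds, aes(x = cut, y = after_stat(count))) +
  geom_bar()
p

# layer_data() returns the data frame computed by the stat for a layer
bar_counts <- layer_data(p)$count
all(bar_counts == as.vector(table(ggplot2::diamonds$cut)))
```

Writing the mapping this way makes it explicit that `count` is produced by the stat rather than read from the data frame.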

Building on Figure \@ref(fig:diamonds-barplot-1), we can add another categorical variable, the clarity of the diamonds, to form a stacked bar chart.

```{r diamonds-barplot-2,fig.cap="Stacked bar chart"}
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar()
```

To add yet another categorical variable, diamond color, faceting is the better choice.

```{r diamonds-barplot-3,fig.cap="Faceted stacked bar chart"}
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar() +
  facet_grid(~cut)
```

Drawing Figure \@ref(fig:diamonds-barplot-3) in fact involves grouped counting of the categorical variables, as follows.

```{r}
with(diamonds, table(cut, color))
```

Bars can also be stacked by proportion rather than by count, as in Figure \@ref(fig:diamonds-barplot-4).

```{r diamonds-barplot-4,fig.cap="Proportional stacked bar chart"}
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar(position = "fill") +
  facet_grid(~cut)
```

Next is the grouped (dodged) bar chart.

```{r diamonds-barplot-5,fig.cap="Grouped bar chart"}
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar(position = "dodge")
```

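Adding another categorical variable calls for faceting. Figure \@ref(fig:diamonds-barplot-5) shows two categorical variables, color and clarity; a third, cut, can be added as the faceting row variable.

```{r diamonds-barplot-6,fig.cap="Faceted grouped bar chart"}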
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar(position = "dodge") +
  facet_grid(rows = vars(cut))
```

The data shown in Figure \@ref(fig:diamonds-barplot-6) are as follows.

```{r}
with(diamonds, table(color, clarity, cut))
```


```{r barplot-1,fig.cap="Four common forms of bar chart"}
# A discussion of bar charts https://cosx.org/2017/10/discussion-about-bar-graph
set.seed(2020)
dat <- data.frame(
  age = rep(1:30, 2),
  gender = rep(c("man", "woman"), each = 30),
  num = sample(x = 1:100, size = 60, replace = TRUE)
)
# Overlapping
p1 <- ggplot(data = dat, aes(x = age, y = num, fill = gender)) +
  geom_col(position = "identity", alpha = 0.5)
# Stacked
p2 <- ggplot(data = dat, aes(x = age, y = num, fill = gender)) +
  geom_col(position = "stack")
# Dodged
p3 <- ggplot(data = dat, aes(x = age, y = num, fill = gender)) +
  geom_col(position = "dodge")
# Percent
p4 <- ggplot(data = dat, aes(x = age, y = num, fill = gender)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "%")
(p1 + p2) / (p3 + p4)
```

Taking the diamonds dataset as an example, count the diamonds by clarity and cut, then compute within each cut the share of each clarity, as shown in Table \@ref(tab:diamonds-table).

```{r diamonds-table}
library(data.table)
diamonds <- as.data.table(diamonds)
dat <- diamonds[, .(cnt = .N), by = .(cut, clarity)] %>% 
  .[, pct := cnt / sum(cnt), by = .(cut)] %>% 
  .[, pct_pp := paste0(cnt, " (", scales::percent(pct, accuracy = 0.01), ")") ]
# Grouped counts: with(diamonds, table(clarity, cut))
dcast(dat, formula = clarity ~ cut, value.var = "pct_pp") %>% 
  knitr::kable(align = "crrrrr", caption = "Counts and proportions shown together")
```
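
The within-cut proportions can be cross-checked in base R. This is a sketch rather than the chapter's method: `prop.table()` with `margin = 2` divides each column of the clarity by cut table by its column total, so every cut column sums to 1. The names `cnt_tab` and `pct_tab` are illustrative.

```{r}
# Counts of clarity (rows) by cut (columns)
cnt_tab <- with(ggplot2::diamonds, table(clarity, cut))

# Share of each clarity within a cut: normalize each column by its total
pct_tab <- prop.table(cnt_tab, margin = 2)

round(pct_tab, 4)
colSums(pct_tab)
```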

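The data are shown as a grouped bar chart and as a percent stacked bar chart, with annotations added to the bars; see Figure \@ref(fig:barplot-2).

```{r barplot-2,fig.cap="Adding annotations to bar charts",fig.height=8,fig.width=8}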
p1 = ggplot(data = dat, aes(x = cut, y = cnt, fill = clarity)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = cnt), position = position_dodge(1), vjust = -0.5) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
    position = position_dodge(1), vjust = 1, hjust = 0.5
  ) +
  scale_fill_brewer(palette = "Spectral") +
  labs(fill = "clarity", y = "", x = "cut") +
  theme_minimal() + 
  theme(legend.position = "top")

p2 = ggplot(data = dat, aes(y = cut, x = cnt, fill = clarity)) +
  geom_col(position = "fill") +
  geom_text(aes(label = cnt), position = position_fill(1), vjust = -0.5) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
    position = position_fill(1), vjust = 1, hjust = 0.5
  ) +
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(labels = scales::percent) +
  labs(fill = "clarity", x = "", y = "cut") +
  theme_minimal() + 
  theme(legend.position = "top")

p1 / p2
```

Use plotly to make the corresponding interactive percent stacked bar chart.

```{r barplot-3, eval=knitr::is_html_output(), fig.cap="Percent stacked bar chart", warning=FALSE}
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge2") +
  scale_fill_brewer(palette = "Spectral")

# Percent stacked bar chart
plotly::plot_ly(dat,
  x = ~cut, color = ~clarity, y = ~pct,
  colors = "Spectral", type = "bar",
  text = ~ paste0(
    cnt, " diamonds <br>",
    "Share: ", scales::percent(pct, accuracy = 0.1), "<br>"
  ),
  hoverinfo = "text"
) %>%
  plotly::layout(
    barmode = "stack",
    yaxis = list(tickformat = ".0%")
  ) %>%
  plotly::config(displayModeBar = FALSE)

# `type = "histogram"` counts the rows grouped by cut and clarity
plotly::plot_ly(diamonds,
  x = ~cut, color = ~clarity,
  colors = "Spectral", type = "histogram"
) %>%
  plotly::config(displayModeBar = FALSE)

# Stacked chart
plotly::plot_ly(diamonds,
  x = ~cut, color = ~clarity,
  colors = "Spectral", type = "histogram"
) %>%
  plotly::layout(
    barmode = "stack", 
    yaxis = list(title = "cnt"),
    legend = list(title = list(text = "clarity"))
  ) %>%
  plotly::config(displayModeBar = FALSE)
```

## Histogram {#ggplot2-histogram}

A histogram shows the distribution of a continuous variable.

```{r,fig.cap="Distribution of diamond prices"}
ggplot(diamonds, aes(price)) + geom_histogram(bins = 30)
```

Stacked histogram

```{r,fig.cap="Distribution of diamond prices by cut quality"}
ggplot(diamonds, aes(x = price, fill = cut)) + geom_histogram(bins = 30)
```

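A comparison of histograms drawn with Base R and with ggplot2: Base R plots faster and its code is more stable over time, while ggplot2 code is more concise and the result more attractive.

```{r base-vs-ggplot2-hist, fig.width=4,fig.height=3, out.width="45%", fig.ncol=2, fig.cap="Histograms", fig.subcap=c("Base R histogram","ggplot2 histogram")}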
par(mar = c(2.1, 2.1, 1.5, 0.5))
plot(c(50, 350), c(0, 10),
  type = "n", font.main = 1,
  xlab = "", ylab = "", frame.plot = FALSE, axes = FALSE,
  # xlab = "hp", ylab = "Frequency",
  main = paste("Histogram with Base R", paste(rep(" ", 60), collapse = ""))
)
axis(
  side = 1, at = seq(50, 350, 50), labels = seq(50, 350, 50),
  tick = FALSE, las = 1, padj = 0, mgp = c(3, 0.1, 0)
)
axis(
  side = 2, at = seq(0, 10, 2), labels = seq(0, 10, 2),
  # col = "white", color of the axis line
  # col.ticks: color of the tick marks
  tick = FALSE, # no tick marks
  las = 1, # horizontal labels
  hadj = 1, # right-aligned
  mgp = c(3, 0.1, 0) # set the axis line margin to 0.1
)
abline(h = seq(0, 10, 2), v = seq(50, 350, 50), col = "gray90", lty = "solid")
abline(h = seq(1, 9, 2), v = seq(75, 325, 50), col = "gray95", lty = "solid")
hist(mtcars$hp,
  col = "#56B4E9", border = "white",
  freq = TRUE, add = TRUE
  # labels = TRUE, axes = TRUE, ylim = c(0, 10.5),
  # xlab = "hp",main = "Histogram with Base R"
)
mtext("hp", 1, line = 1.0)
mtext("Frequency", 2, line = 1.0)

ggplot(mtcars) +
  geom_histogram(aes(x = hp), fill = "#56B4E9", color = "white", breaks = seq(50, 350, 50)) +
  scale_x_continuous(breaks = seq(50, 350, 50)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  labs(x = "hp", y = "Frequency", title = "Histogram with Ggplot2") +
  theme_minimal(base_size = 12)
```
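
The two histograms bin the data the same way when given the same breaks, since `hist()` and `geom_histogram()` both use right-closed intervals by default. A small sketch to confirm this for `mtcars$hp`, with `hp_breaks`, `base_counts`, and `gg_counts` as illustrative names:

```{r}
library(ggplot2)

hp_breaks <- seq(50, 350, 50)

# Bin counts from Base R, computed without drawing
base_counts <- hist(mtcars$hp, breaks = hp_breaks, plot = FALSE)$counts

# Bin counts computed by the binning stat inside ggplot2
p <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(breaks = hp_breaks)
gg_counts <- layer_data(p)$count

rbind(base = base_counts, ggplot2 = gg_counts)
```

Passing the breaks explicitly to both functions avoids relying on their different default binning rules.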


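## Box plot {#ggplot2-boxplot}

The PlantGrowth dataset illustrates the box plot: it compares plant growth under a control and two treatment conditions. The vertical axis is the dried weight of the plants and the horizontal axis is the experimental condition. This is a typical case for a box plot: the y axis carries a numeric variable and the x axis a categorical one, which in R has type factor.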

```{r}
data("PlantGrowth")
str(PlantGrowth)
```

```{r PlantGrowth-boxplot}
ggplot(data = PlantGrowth, aes(x = group, y = weight)) + geom_boxplot()
```

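PlantGrowth is small, so a jittered scatter plot is also a good fit. Jittering keeps points from overlapping, and to make the categories easier to tell apart we can use different point shapes or colors.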

```{r PlantGrowth-jitter}
ggplot(data = PlantGrowth, aes(x = group, y = weight, shape = group)) + geom_jitter()
ggplot(data = PlantGrowth, aes(x = group, y = weight, color = group)) + geom_jitter()
```


```{r,fig.asp=0.8}
boxplot(weight ~ group,
  data = PlantGrowth,
  ylab = "Dried weight of plants", col = "lightgray",
  notch = FALSE, varwidth = TRUE
)
```


With cut quality as the faceting variable, diamond color on the x axis, and price on the y axis, draw the box plot in Figure \@ref(fig:boxplot-facet-color).

```{r boxplot-facet-color,fig.cap="Box plot"}
ggplot(diamonds, aes(x = color, y = price, color = cut)) +
  geom_boxplot(show.legend = FALSE) +
  facet_grid(~cut)
```

We can also add clarity as a faceting variable, which gives Figure \@ref(fig:boxplot-facet-color-clarity-1).

```{r boxplot-facet-color-clarity-1,fig.cap="Box plot with two faceting variables"}
ggplot(diamonds, aes(x = color, y = price, color = cut)) +
  geom_boxplot(show.legend = FALSE) +
  facet_grid(clarity ~ cut)
```

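On inspection there are too many category levels. Swapping the roles of cut and color does not help: the categories are still divided too finely, so the figure is not concise and reads worse, as Figure \@ref(fig:boxplot-facet-color-clarity-2) shows.

```{r boxplot-facet-color-clarity-2,fig.cap="Coloring box plots",fig.subcap=c("Colored by cut","Colored by diamond color"),fig.ncol=1}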
ggplot(diamonds, aes(x = cut, y = price, color = cut)) +
  geom_boxplot(show.legend = FALSE) +
  facet_grid(clarity ~ color)
ggplot(diamonds, aes(x = cut, y = price, color = color)) +
  geom_boxplot(show.legend = FALSE) +
  facet_grid(clarity ~ color)
```

## Function plot {#sec-ggplot2-function}

The parametric equations of the butterfly curve are

\begin{align}
x &= \sin t \left(\mathrm e^{\cos t} - 2 \cos 4t + \sin^5\left(\frac{t}{12}\right)\right) \\
y &= \cos t \left(\mathrm e^{\cos t} - 2 \cos 4t + \sin^5\left(\frac{t}{12}\right)\right), \quad t \in [-\pi, \pi]
\end{align}
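
A minimal sketch of plotting the curve, assuming ggplot2 is available: evaluate the two equations on a grid of $t$ values and join the points in order with `geom_path()`. The helper `butterfly()` and the grid `t_grid` are names introduced here. The grid follows the interval $[-\pi, \pi]$ stated above; the classic butterfly curve is usually traced over $0 \le t \le 12\pi$, and widening `t_grid` accordingly shows more of it.

```{r}
library(ggplot2)

# Evaluate the parametric equations at the values in t
butterfly <- function(t) {
  r <- exp(cos(t)) - 2 * cos(4 * t) + sin(t / 12)^5
  data.frame(t = t, x = sin(t) * r, y = cos(t) * r)
}

t_grid <- seq(-pi, pi, length.out = 2000)

ggplot(butterfly(t_grid), aes(x = x, y = y)) +
  geom_path() +   # connect points in the order of t, not sorted by x
  coord_fixed() + # equal scaling on both axes keeps the shape undistorted
  theme_minimal()
```

`geom_path()` is used rather than `geom_line()` because a parametric curve can double back in x.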

## Density plot {#sec-ggplot2-density}


```{r mpg-cyl-density,fig.cap="City mileage grouped by number of cylinders"}
ggplot(mpg, aes(cty)) +
  geom_density(aes(fill = factor(cyl)), alpha = 0.8) +
  labs(
    title = "Density plot",
    subtitle = "City Mileage Grouped by Number of cylinders",
    caption = "Source: mpg",
    x = "City Mileage",
    fill = "# Cylinders"
  )
```

Add transparency to deal with occlusion.

```{r density,fig.cap=c("Density plot","Density plot with transparency"),fig.ncol=1}
ggplot(diamonds, aes(x = price, fill = cut)) + geom_density()
ggplot(diamonds, aes(x = price, fill = cut)) + geom_density(alpha = 0.5)
```

Stacked density plot

```{r stack-density,fig.cap="Stacked density plot"}
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(position = "stack")
```

Conditional density estimate

```{r,fig.cap="Conditional density estimate"}
# You can use position="fill" to produce a conditional density estimate
ggplot(diamonds, aes(carat, stat(count), fill = cut)) +
  geom_density(position = "fill")
```


A ridgeline plot is a variant of the density plot that keeps the density curves from overlapping.

```{r}
ggplot(diamonds) +
  ggridges::geom_density_ridges(aes(x = price, y = color, fill = color))
```

The two-dimensional density plot is a further extension.

```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_density_2d(aes(color = cut)) +
  facet_grid(~cut)
```

Filling between the density contours with the computed variable `nlevel`, accessed through `stat()`, gives a heat map.

```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
  stat_density_2d(aes(fill = stat(nlevel)), geom = "polygon") +
  facet_grid(. ~ cut)
```

`geom_hex()` is another variant of the two-dimensional density plot, well suited to large datasets.

```{r}
ggplot(diamonds, aes(x = carat, y = price)) + geom_hex() +
  scale_fill_viridis_c()
```


See [heatmaps in ggplot2](https://themockup.blog/posts/2020-08-28-heatmaps-in-ggplot2/) for more on two-dimensional density plots.

```{r density-2d,fig.cap="Two-dimensional density plot",fig.width=4,fig.height=3,out.width="45%",fig.show='hold',fig.ncol=2,fig.subcap=c("Default palette","viridis palette")}
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  xlim(1, 6) +
  ylim(40, 100)

ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density2d(aes(fill = stat(level)), geom = "polygon") +
  scale_fill_viridis_c(option = "viridis") +
  xlim(1, 6) +
  ylim(40, 100)
```

::: {.rmdtip data-latex="{提示}"}
`MASS::kde2d()` performs two-dimensional kernel density estimation, and **ggplot2** offers two equivalent ways to plot it

1. `stat_density_2d()` with `..level..`
1. `stat_density2d()` with `stat(level)`
:::
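
To see what the stat computes, the density can be estimated directly with `MASS::kde2d()` and drawn with `geom_contour()`. This is only a sketch: the grid size `n = 50` is an arbitrary choice made here, the bandwidth is left at the `kde2d()` default, and `dens_df` is an illustrative name, so the contours need not coincide exactly with those of `stat_density_2d()`.

```{r}
library(ggplot2)

# kde2d() returns grid coordinates x, y and a matrix z of density values,
# with rows of z indexed by x and columns indexed by y
dens <- MASS::kde2d(faithful$eruptions, faithful$waiting, n = 50)

# expand.grid() varies its first argument fastest, matching as.vector(z)
dens_df <- expand.grid(eruptions = dens$x, waiting = dens$y)
dens_df$density <- as.vector(dens$z)

ggplot(dens_df, aes(x = eruptions, y = waiting, z = density)) +
  geom_contour()
```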

```{r histogram,eval=knitr::is_html_output(),fig.cap="Two-dimensional histogram, density, and contour plot"}
plotly::plot_ly(
  data = faithful, x = ~eruptions,
  y = ~waiting, type = "histogram2dcontour"
) %>%
  plotly::config(displayModeBar = FALSE)

# plot_ly(faithful, x = ~waiting, y = ~eruptions) %>% 
#   add_histogram2d() %>% 
#   add_histogram2dcontour()
```

Extending this further gives the heat map.

```{r, eval=knitr::is_html_output()}
library(KernSmooth)
den <- bkde2D(x = faithful, bandwidth = c(0.7, 7))
# Heat map
p1 <- plotly::plot_ly(x = den$x1, y = den$x2, z = den$fhat) %>%
  plotly::config(displayModeBar = FALSE) %>%
  plotly::add_heatmap()

# Contour plot
p2 <- plotly::plot_ly(x = den$x1, y = den$x2, z = den$fhat) %>%
  plotly::config(displayModeBar = FALSE) %>%
  plotly::add_contour()

htmltools::tagList(p1, p2)
```

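## Violin plot {#sec-ggplot2-violin}

In 2004 Daniel Adler developed the [vioplot](https://github.com/TomKellyGenetics/vioplot) package for drawing violin plots, possibly the earliest R package to do so. It then went more than ten years without an update yet remained on CRAN, which is remarkable; fortunately Thomas Kelly has since taken over maintenance. Another package for violin plots is [beanplot](https://cran.r-project.org/package=beanplot) [@beanplot_2008_jss] by Peter Kampstra, which has also been around for many years. Over time the more modern approach has become `geom_violin()` from **ggplot2**, which drops many dependencies, brings many kinds of graphics together in one place, and can be regarded as best practice. Compared with a box plot, a violin plot shows more of the distribution and is also more attractive, but for now box plots have a far larger audience, since they show summary statistics explicitly, as in Figure \@ref(fig:boxplot-violin).

```{r boxplot-violin,fig.cap="Several kinds of violin plot",fig.width=4,fig.height=4,out.width="45%",fig.show='hold',fig.ncol=2,fig.subcap=c("Simple box plot", "Violin plot drawn with vioplot", "Violin plot drawn with ggplot2", "Bean plot drawn with beanplot"),collapse=TRUE}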
boxplot(count ~ spray, data = InsectSprays)
vioplot::vioplot(count ~ spray, data = InsectSprays, col = "lightgray")
ggplot(InsectSprays, aes(x = spray, y = count)) +
  geom_violin(fill = "lightgray") +
  theme_minimal()
beanplot::beanplot(count ~ spray, data = InsectSprays, col = "lightgray")
```

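Given a mean and a standard deviation, the [ggnormalviolin](https://github.com/wjschne/ggnormalviolin) package draws the probability density curve of a normal distribution, as shown in Figure \@ref(fig:normal-violin).

```{r normal-violin,fig.cap="Probability density curves of normal distributions",fig.width=6,fig.height=4}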
library(ggnormalviolin)
with(
  aggregate(
    data = iris, Sepal.Length ~ Species,
    FUN = function(x) c(dist_mean = mean(x), dist_sd = sd(x))
  ),
  cbind.data.frame(Sepal.Length, Species)
) %>%
  ggplot(aes(x = Species, mu = dist_mean, sigma = dist_sd, fill = Species)) +
  geom_normalviolin() +
  theme_minimal()
```


## Jitter plot {#ggplot2-jitter}


Jitter plots suit small datasets.

```{r}
ggplot(mpg, aes(x = class, y = hwy, color = class)) + geom_jitter()
```

To jitter or not? Compare the two versions below; jittering wins.

```{r}
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(aes(fill = Species), size = 5, shape = 21, colour = "grey20") +
  # geom_boxplot(outlier.colour = NA, fill = NA, colour = "grey20") +
  labs(title = "Not Jittered")

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(aes(fill = Species),
    size = 5, shape = 21, colour = "grey20",
    position = position_jitter(width = 0.2, height = 0.1)
  ) +
  # geom_boxplot(outlier.colour = NA, fill = NA, colour = "grey20") +
  labs(title = "Jittered")
```

When the dataset is large, use a box plot, density plot, or violin plot instead.

```{r,fig.cap="A jitter plot counterexample"}
ggplot(sub_diamonds, aes(x = cut, y = price)) + geom_jitter()
```

Neither coloring nor faceting rescues this jitter plot, because the points become harder to tell apart.

```{r,fig.cap="Colored by diamond color",fig.asp=1}
ggplot(sub_diamonds, aes(x = color, y = price, color = color)) +
  geom_jitter() +
  facet_grid(clarity ~ cut)
```

Box plots should not be split too finely here either.

```{r boxplot-facet-cut-clarity,fig.cap="Box plot",fig.asp=1}
ggplot(diamonds, aes(x = color, y = price, color = color)) +
  geom_boxplot() +
  facet_grid(cut ~ clarity)
```

This is better: facet by clarity first, then compare the price differences across colors.

```{r boxplot-facet-clarity,fig.cap="Diamonds faceted by clarity",fig.asp=1}
ggplot(diamonds, aes(x = color, y = price, color = color)) +
  geom_boxplot() +
  facet_grid(~clarity)
```

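Best of all is to compare along a single dimension: the prices of diamonds of different colors.

```{r boxplot-color,fig.cap="Price comparison across diamond colors"}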
ggplot(diamonds, aes(x = color, y = price, color = color)) +
  geom_boxplot()
```

With a random seed set, the jitter plot is reproducible.

```{r}
ggplot(iris, aes(x = Species, y = Sepal.Width, color = Species)) +
  geom_boxplot(width = 0.65) +
  geom_point(position = position_jitter(seed = 37, width = 0.25))
```
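
One way to check this, sketched with `layer_data()`: building the same layer twice yields identical jittered positions when `seed` is given. The names `first_build` and `second_build` are illustrative. Note that `position_jitter()` also jitters vertically unless `height = 0` is set, which this sketch does so that the plotted values of `Sepal.Width` stay exact.

```{r}
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  # Jitter horizontally only; the seed fixes the random offsets
  geom_point(position = position_jitter(seed = 37, width = 0.25, height = 0))

# Compute the layer twice; with a fixed seed the positions agree exactly
first_build <- layer_data(p)
second_build <- layer_data(p)
identical(first_build$x, second_build$x)
```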

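## Beeswarm plot {#sec-ggplot2-beeswarm}

When the number of sample points is limited, replacing an ordinary jitter plot with a beeswarm plot gives a much better result, as shown in Figure \@ref(fig:beeswarm). The [ggbeeswarm](https://github.com/eclarke/ggbeeswarm) package, developed by Erik Clarke, arranges randomly jittered points in a more regular pattern without losing the accuracy of the data.

```{r beeswarm,fig.cap="A beeswarm plot reads better than a jitter plot",fig.width=8,fig.height=4}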
library(ggbeeswarm)
p1 <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_jitter() +
  theme_minimal()
p2 <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_quasirandom() +
  theme_minimal()
p1 + p2
```


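## Rose chart {#ggplot2-rose}

A Nightingale rose chart[^nightingale-rose] can be viewed as a stacked or grouped bar chart drawn in polar coordinates.

```{r stack-to-rose,fig.cap="Turning a stacked bar chart into a rose chart"}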
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar()
ggplot(diamonds, aes(x = color, fill = clarity)) +
  geom_bar() +
  coord_polar()
```

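```{r wind-rose,fig.cap="Wind rose chart"}
# Wind rose chart http://blog.csdn.net/Bone_ACE/article/details/47624987
set.seed(2018)
# Randomly generate 100 wind directions and bin them into 16 intervals
direction <- cut_interval(runif(100, 0, 360), n = 16)
# Randomly generate 100 wind speeds and split them into 4 intensity levels
mag <- cut_interval(rgamma(100, 15), 4)
dat <- data.frame(direction = direction, mag = mag)
# Map wind direction to the x axis, frequency to the y axis, and wind speed to the fill color, then convert the bar chart to polar coordinates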
p <- ggplot(dat, aes(x = direction, y = ..count.., fill = mag))
p + geom_bar(colour = "white") +
  coord_polar() +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank()) +
  labs(x = "", y = "", fill = "Magnitude")
```

```{r}
p + geom_bar(position = "fill") +
  coord_polar() +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank()) +
  labs(x = "", y = "", fill = "Magnitude")
```


[^nightingale-rose]: https://mbostock.github.io/protovis/ex/crimea-rose-full.html

## Tile plot {#sec-ggplot2-tile}

```{r geom-tile,fig.cap="Monthly trend in international airline passengers, 1949-1960",fig.showtext=TRUE,fig.width=8,fig.height=4}
p1 <- expand.grid(months = month.abb, years = 1949:1960) %>%
  transform(num = as.vector(AirPassengers)) %>%
  ggplot(aes(x = years, y = months, fill = num)) +
  scale_fill_continuous(type = "viridis") +
  geom_tile(color = "white", size = 0.4) +
  scale_x_continuous(
    expand = c(0.01, 0.01),
    breaks = seq(1949, 1960, by = 1), labels = 1949:1960
  ) +
  theme_minimal(base_size = 10.54, base_family = "Noto Serif CJK SC") +
  theme(legend.position = "top") +
  labs(x = "Year", y = "Month", fill = "Passengers")

p2 <- expand.grid(months = month.abb, years = 1949:1960) %>%
  transform(num = as.vector(AirPassengers)) %>%
  ggplot(aes(x = years, y = months, color = num)) +
  geom_point(pch = 15, size = 8) +
  scale_color_distiller(palette = "Spectral") +
  scale_x_continuous(
    expand = c(0.01, 0.01),
    breaks = seq(1949, 1960, by = 1), labels = 1949:1960
  ) +
  theme_minimal(base_size = 10.54, base_family = "Noto Serif CJK SC") +
  theme(legend.position = "top") +
  labs(x = "Year", y = "Month", color = "Passengers")
p1 + p2
```

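## Calendar plot {#sec-ggplot2-calendar}

The airquality dataset records air quality in New York from May to September 1973 through four measurements: temperature (degrees Fahrenheit), wind speed (miles per hour), solar radiation, and ozone concentration. Figure \@ref(fig:calendar-airquality) shows the daily temperature.

```{r calendar-airquality,fig.cap="Temperature in New York, May to September 1973",fig.width=8,fig.height=4}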
airquality %>%
  transform(Date = seq.Date(
    from = as.Date("1973-05-01"),
    to = as.Date("1973-09-30"), by = "day"
  )) %>%
  transform(
    Week = as.integer(format(Date, "%W")),
    Year = as.integer(format(Date, "%Y")),
    Weekdays = factor(weekdays(Date, abbreviate = TRUE),
      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
    )
  ) %>%
  ggplot(aes(x = Week, y = Weekdays, fill = Temp)) +
  scale_fill_distiller(name = "Temp (F)", palette = "Spectral") +
  geom_tile(color = "white", size = 0.4) +
  facet_wrap("Year", ncol = 1) +
  scale_x_continuous(
    expand = c(0, 0),
    breaks = seq(1, 52, length = 12),
    labels = month.abb
  )
```

::: {.rmdnote data-latex="{注意}"}
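In Figure \@ref(fig:calendar-airquality) the x-axis tick labels have been replaced with months, treating a month as roughly four weeks. A year has 52 to 53 weeks, the first day of each week is taken to be Monday, and 1 May 1973 was a Tuesday. The neat trick in the code is using `format()` to extract the week of the year from Date values and `weekdays()` to extract the day of the week; `month.abb` is a built-in constant holding the English abbreviations of the 12 months. Take particular care when calling other R packages to process dates: check which day they treat as the first day of the week, since some use Monday and others Sunday, a convention that often traces back to religious tradition. The `%W` format used above counts weeks starting on Monday, whereas **data.table** by default treats the week as starting on Sunday.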

```{r}
# https://d.cosx.org/d/421230
weekdays(Sys.Date(), abbreviate = TRUE)
data.table::wday(Sys.Date())
```

:::
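
The difference between the two conventions can be seen with base R alone. In this sketch `%u` numbers weekdays from Monday = 1, `%w` from Sunday = 0, `%W` counts weeks of the year starting on Monday, and `%U` counts them starting on Sunday. The two dates, stored in the illustrative variables `tue` and `sun`, are picked only as examples.

```{r}
tue <- as.Date("1973-05-01") # a Tuesday
sun <- as.Date("1973-05-06") # the following Sunday

# Day of the week as a number under each convention
c(monday_first = format(sun, "%u"), sunday_first = format(sun, "%w"))

# Week of the year under each convention
c(monday_first = format(tue, "%W"), sunday_first = format(tue, "%U"))
```

Checking a known date this way before building a calendar plot shows which convention a given function follows.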


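## Ridgeline plot {#sec-ggplot2-ridgeline}

Ridgeline plots are drawn with the **ggridges** package. [Miao Yu (于淼)](https://yufree.cn/) has given a fairly systematic account of where this kind of plot comes from; see the article [叠嶂图的前世今生](https://cosx.org/2018/04/ridgeline-story/) (The Past and Present of Ridgeline Plots) on the main site of Capital of Statistics (统计之都).

```{r lincoln-weather,fig.cap="Weather in Lincoln, Nebraska, in 2016"}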
library(ggridges)
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = Month, fill = stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1.) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_discrete(expand = expansion(mult = c(0.01, 0.25))) +
  scale_fill_viridis_c(name = "Temp. [F]", option = "C") +
  labs(
    title = 'Temperatures in Lincoln NE',
    subtitle = 'Mean temperatures (Fahrenheit) by month for 2016'
  ) +
  theme_ridges(font_size = 13, grid = TRUE) + 
  theme(axis.title.y = element_blank())
```

Visualization helps the eye compare the distributions of groups of data.

```{r sleep-diamonds,fig.cap="Comparing distributions",fig.width=6,fig.height=8}
p1 <- ggplot(sleep, aes(x = extra, y = group, fill = group)) +
  geom_density_ridges() +
  theme_ridges()

p2 <- ggplot(diamonds, aes(x = price, y = color, fill = color)) +
  geom_density_ridges() +
  theme_ridges()

p1 / p2
```

The [ridgeline](https://github.com/R-CoderDotCom/ridgeline) package provides a Base R solution.


```{r ridge, echo=FALSE, fig.cap="Ridgeline plot", fig.width=4, fig.height=4, out.width="65%"}
# http://karolis.koncevicius.lt/posts/r_base_plotting_without_wrappers/
dens <- tapply(iris$Sepal.Length, iris$Species, density)

xs <- Map(getElement, dens, "x")
ys <- Map(getElement, dens, "y")
ys <- Map(function(x) (x - min(x)) / max(x - min(x)) * 1.5, ys)
ys <- Map(`+`, ys, length(ys):1)

plot.new()
plot.window(xlim = range(xs), ylim = c(1, length(ys) + 1.5))
abline(h = length(ys):1, col = "grey")

invisible(Map(polygon, xs, ys, col = hcl.colors(length(ys), "Zissou", alpha = 0.8)))

axis(1, tck = -0.01)
mtext(names(dens), 2, at = length(ys):1, las = 2, padj = 0)
```

## Ellipse plots {#sec-ggplot2-ellipse}

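The `type` argument of `stat_ellipse()` selects the multivariate distribution used to compute the ellipse: `type = "t"` (the default) assumes a multivariate t distribution and `type = "norm"` a multivariate normal distribution. Setting `geom = "polygon"` draws filled ellipses. The examples below split the data into two groups by `eruptions > 3`.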

```{r ellipse,fig.cap="Several kinds of ellipse plots",fig.width=4,fig.height=4,out.width="45%",fig.show='hold',fig.ncol=2,fig.subcap=c("Simple ellipse", "Normal and t distributions", "Filled polygons"),collapse=TRUE}
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  stat_ellipse()

ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3)) +
  geom_point() +
  stat_ellipse(type = "norm", linetype = 2) +
  stat_ellipse(type = "t") +
  theme(legend.position = "none")

ggplot(faithful, aes(waiting, eruptions, fill = eruptions > 3)) +
  stat_ellipse(geom = "polygon") +
  theme(legend.position = "none")
```
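
To make the role of `type = "norm"` concrete, the following sketch builds a normal-theory ellipse by hand from the sample mean and covariance. It is an illustration under stated assumptions, not the code that `stat_ellipse()` runs: the helper `normal_ellipse()` and its arguments are hypothetical, and scaling the radius by an F quantile with 2 and n - 1 degrees of freedom is an assumption.

```{r}
# Hypothetical helper: points on a normal-theory ellipse for two columns
normal_ellipse <- function(data, level = 0.95, segments = 51) {
  center <- colMeans(data)
  shape <- cov(data)
  # assumed scaling: F quantile with 2 and n - 1 degrees of freedom
  radius <- sqrt(2 * qf(level, 2, nrow(data) - 1))
  angles <- seq(0, 2 * pi, length.out = segments + 1)
  unit_circle <- cbind(cos(angles), sin(angles))
  # stretch and rotate the unit circle with the Cholesky factor, then shift
  pts <- sweep(radius * unit_circle %*% chol(shape), 2, center, "+")
  colnames(pts) <- colnames(data)
  as.data.frame(pts)
}

ell <- normal_ellipse(faithful[, c("waiting", "eruptions")])
plot(faithful$waiting, faithful$eruptions, xlab = "waiting", ylab = "eruptions")
lines(ell$waiting, ell$eruptions, col = "red")
```

Every point returned lies at the same Mahalanobis distance from the centre, which is what makes the curve a contour of the fitted normal density.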


## Q-Q plots {#sec-ggplot2-qq}

The [qqplotr](https://github.com/aloy/qqplotr) package provides a ggplot2 implementation of quantile-quantile (Q-Q) plots, including the normal Q-Q plot.
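
For orientation, the coordinates of a normal Q-Q plot can be computed by hand in base R: the sorted observations are paired with standard normal quantiles at the probabilities returned by `ppoints()`. A minimal sketch using the `faithful` data, with illustrative variable names:

```{r}
y <- faithful$waiting
n <- length(y)

theoretical <- qnorm(ppoints(n))   # standard normal quantiles
observed <- sort(y)                # sample quantiles

plot(theoretical, observed,
  xlab = "Theoretical quantiles", ylab = "Sample quantiles"
)
qqline(y, col = "red")             # line through the first and third quartiles
```

Points that bend away from the reference line indicate a departure from normality, which is clearly the case for the bimodal waiting times.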


## Convex hull plots {#sec-ggplot2-chull}

The **ggpubr** package provides the `stat_chull()` layer, which draws the convex hull around each group of points.

```{r stat-chull,fig.cap="Convex hull plot",fig.width=5,fig.height=4}
library(ggpubr)
ggscatter(mpg, x = "displ", y = "hwy", color = "drv") +
  stat_chull(aes(color = drv, fill = drv), alpha = 0.1, geom = "polygon")
```

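The idea behind it is simple: `chull()` returns the indices of the points that lie on the convex hull, and the layer keeps only those rows. Printing the function shows its definition, and the chunk after it defines an equivalent layer from scratch with `ggproto()`.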

```{r}
stat_chull
```

```{r,eval=FALSE}
StatChull <- ggproto("StatChull", Stat,
  compute_group = function(data, scales) {
    data[chull(data$x, data$y), , drop = FALSE]
  },
  required_aes = c("x", "y")
)

stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE, show.legend = NA,
                       inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  stat_chull(fill = NA, colour = "black")

ggplot(mpg, aes(displ, hwy, colour = drv)) + 
  geom_point() + 
  stat_chull(fill = NA)
```

## Fitted curves {#sec-ggplot2-fit}

```{r spline-fun,fig.cap="A user-defined spline function",fig.width=4,fig.height=4,fig.show='hold',out.width='45%'}
xx <- -9:9
yy <- sqrt(abs(xx))
plot(xx, yy,
  col = "red",
  xlab = expression(x),
  ylab = expression(sqrt(abs(x)))
)
lines(spline(xx, yy, n = 101, method = "fmm", ties = mean), col = "pink")

myspline <- function(formula, data, ...) {
  dat <- model.frame(formula, data)
  res <- splinefun(dat[[2]], dat[[1]])
  class(res) <- "myspline"
  res
}

predict.myspline <- function(object, newdata, ...) {
  object(newdata[[1]])
}

data.frame(x = -9:9) %>%
  transform(y = sqrt(abs(x))) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(color = "red", pch = 1, size = 2) +
  stat_smooth(method = myspline, formula = y ~ x, se = FALSE, color = "pink") +
  labs(x = expression(x), y = expression(sqrt(abs(x)))) +
  theme_minimal()
```

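The following uses the real data set `trees` to show two of the fitting methods that `geom_smooth()` supports: linear regression with `"lm"` and nonlinear regression with `"nls"`.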

```{r smooth-methods,fig.cap="Smoothing methods",fig.width=4,fig.height=4,fig.show='hold',out.width='45%'}
ggplot(trees, aes(x = log(Girth), y = log(Volume))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(
    method = "nls", formula = y ~ a * x^2 + b, se = FALSE,
    method.args = list(start = list(a = 5, b = -36))
  )
```
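
The curves drawn by `geom_smooth()` can be reproduced by fitting the same models directly, which also exposes the estimated coefficients. A sketch, assuming the same formulas and starting values as in the plots; the object names are illustrative.

```{r}
# log-log linear regression, as in the first panel
fit_lm <- lm(log(Volume) ~ log(Girth), data = trees)
coef(fit_lm)

# nonlinear least squares, as in the second panel
fit_nls <- nls(Volume ~ a * Girth^2 + b,
  data = trees,
  start = list(a = 5, b = -36)
)
coef(fit_nls)

# the model is linear in a and b, so lm() gives the same estimates
fit_quad <- lm(Volume ~ I(Girth^2), data = trees)
coef(fit_quad)
```

Because the second model is linear in its parameters, `nls()` is used here mainly to demonstrate the interface; the starting values matter far more for models that are truly nonlinear.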

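## Terrain plots {#sec-ggplot2-raster}

Regions are separated by contour lines, and the area between two adjacent contours is filled with a single color. Cleveland called this a level plot, which is where the `levelplot()` function in the **lattice** package comes from.

[Auckland's Maunga Whau Volcano](https://en.wikipedia.org/wiki/Maungawhau) is a scoria cone left behind by volcanic eruptions, located in the suburb of Mount Eden in Auckland, New Zealand. [Ross Ihaka](https://www.stat.auckland.ac.nz/~ihaka/) collected its topographic data, which ships with R as the data set `volcano`; see Figure \@ref(fig:elevation-volcano).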

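```{r elevation-volcano, fig.cap="Filled contour plot of Maunga Whau",fig.height=5,fig.width=5.5}
# volcano is a 10 m by 10 m grid, so scale the indices to meters
filled.contour(
  x = 10 * seq_len(nrow(volcano)), y = 10 * seq_len(ncol(volcano)), z = volcano,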
  color.palette = terrain.colors,
  plot.title = title(
    main = "The Topography of Maunga Whau",
    xlab = "Meters North", ylab = "Meters West"
  ),
  plot.axes = {
    axis(1, seq(100, 800, by = 100))
    axis(2, seq(100, 600, by = 100))
  },
  key.title = title(main = "Height\n(meters)"),
  key.axes = axis(4, seq(90, 190, by = 10))
)
```


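## Treemaps {#sec-ggplot2-treemap}

The data set `GNI2014` comes from the [**treemap**](https://github.com/mtennekes/treemap) package. It is a data.frame recording, for each country in 2014, the total population (`population`) and the gross national income per capita (`GNI`). A sample of the data is shown below: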

```{r}
library(treemap)
data(GNI2014, package = "treemap")
subset(GNI2014, subset = grepl(x = country, pattern = 'China'))
```

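The data have a clear hierarchy, with population and income per capita recorded for countries nested within continents. The treemap uses tile area for population and color intensity for income per capita; see Figure \@ref(fig:treemap-grid).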

```{r treemap-grid,fig.cap="Treemap",fig.width=5,fig.height=5}
treemap(GNI2014,
  index = c("continent", "iso3"),
  vSize = "population", 
  vColor = "GNI",
  type = "value",
  format.legend = list(scientific = FALSE, big.mark = " ")
)
```

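The [**treemapify**](https://github.com/wilkox/treemapify) package draws treemaps with ggplot2. It likewise ships a data set, `G20`, with economic and demographic information on the G20 major economies (<https://en.wikipedia.org/wiki/G20>), including GDP in millions of US dollars (`gdp_mil_usd`) and the Human Development Index (`hdi`). Compared with `GNI2014`, it also carries two label columns: the stage of economic development and the hemisphere. Figure \@ref(fig:treemap-ggplot2) is faceted by `hemisphere`, filled by `region`, and uses `gdp_mil_usd` for tile area.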

```{r treemap-ggplot2,fig.cap="Demographic and economic information on the G20 major economies",fig.width=5,fig.height=5}
library(treemapify)
ggplot(G20, aes(
  area = gdp_mil_usd, fill = region,
  label = country, subgroup = region
)) +
  geom_treemap() +
  geom_treemap_text(grow = TRUE, reflow = TRUE, colour = "black") +
  facet_wrap(~hemisphere) +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom") +
  labs(
    title = "The G-20 major economies by hemisphere",
    caption = "The area of each tile represents the country's GDP as a
      proportion of all countries in that hemisphere",
    fill = "Region"
  )
```

<!-- https://github.com/DaphneGiorgi/IBMPopSim Compare with the diamonds data set: discretizing continuous variables, histograms, comparing distributions -->

## Cohort retention charts {#sec-ggplot2-cohort}

```{r cohort-ggplot2}
cohort <- data.frame(
  cohort = rep(1:5, times = 5:1),
  week = c(1:5, 1:4, 1:3, 1:2, 1),
  value = c(
    75, 73, 54, 23, 3,
    98, 94, 70, 25,
    52, 5, 3,
    44, 15,
    88
  )
)

ggplot(cohort, aes(x = week, y = cohort, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = value), color = "white") +
  scale_y_reverse() +
  scale_fill_binned(type = "viridis")
```

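Retention is one case of [cohort analysis](https://en.wikipedia.org/wiki/Cohort_analysis); conversion is another. The workflow is to define the question, choose the metrics that measure it, identify the cohorts relevant to it (key factors such as time, location, and user attributes), then process and visualize the data to obtain the cohort results, and finally test the conclusions in actual decisions and actions.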
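
When the values are counts of active users, a retention rate is often easier to compare across cohorts than the raw numbers. The sketch below assumes that `value` is such a count and that week 1 is the baseline of each cohort (both are assumptions about the toy data), and divides each value by its cohort's week 1 value. The column name `rate` is illustrative.

```{r}
cohort <- data.frame(
  cohort = rep(1:5, times = 5:1),
  week = c(1:5, 1:4, 1:3, 1:2, 1),
  value = c(75, 73, 54, 23, 3, 98, 94, 70, 25, 52, 5, 3, 44, 15, 88)
)

# week 1 value of each cohort, repeated along the rows of that cohort
baseline <- ave(cohort$value, cohort$cohort, FUN = function(v) v[1])
cohort$rate <- cohort$value / baseline

# cohorts in rows, weeks in columns; NA where a cohort has no observation yet
retention <- tapply(cohort$rate, cohort[c("cohort", "week")], mean)
round(retention, 2)
```

Mapping `rate` instead of `value` to `fill` in the tile plot would put all cohorts on a common scale.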

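## Waterfall charts {#sec-ggplot2-waterfall}

A waterfall chart shows which items rose and which fell, for example relative to the previous month, and can present either shares or absolute amounts. See also [waterfall chart](https://vita.had.co.nz/papers/ggplot2-wires.pdf).

```{r waterfall,fig.cap="Waterfall chart"}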
balance <- data.frame(
  event = c(
    "Starting\nCash", "Sales", "Refunds",
    "Payouts", "Court\nLosses", "Court\nWins", "Contracts", "End\nCash"
  ),
  change = c(2000, 3400, -1100, -100, -6600, 3800, 1400, -2800)
)

balance$balance <- cumsum(c(0, balance$change[-nrow(balance)])) # running balance before each event
balance$time <- seq_len(nrow(balance))
balance$flow <- factor(sign(balance$change)) # whether the change is positive or negative

ggplot(balance) +
  geom_hline(yintercept = 0, colour = "white", size = 2) +
  geom_rect(aes(
    xmin = time - 0.45, xmax = time + 0.45,
    ymin = balance, ymax = balance + change, fill = flow
  )) +
  geom_text(aes(
    x = time,
    y = pmin(balance, balance + change) - 50,
    label = scales::dollar(change)
  ),
  hjust = 0.5, vjust = 1, size = 3
  ) +
  scale_x_continuous(
    name = "",
    breaks = balance$time,
    labels = balance$event
  ) +
  scale_y_continuous(
    name = "Balance",
    labels = scales::dollar
  ) +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal()
```


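## Streamgraphs {#sec-ggplot2-streamgraph}

Streamgraphs are stacked area charts commonly used to display time series; they can be drawn with [ggstream](https://github.com/davidsjoberg/ggstream) or [streamgraph](https://github.com/hrbrmstr/streamgraph).

<!-- How is the vertical axis computed? What do changes in band width mean? <https://hrbrmstr.github.io/streamgraph/> -->

```{r stream-graph, fig.cap="Stacked area chart (streamgraph)", fig.width=6, fig.height=4, out.width="75%"}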
library(ggstream)

ggplot(blockbusters, aes(year, box_office, fill = genre)) +
  geom_stream() +
  theme_minimal()
```
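
A streamgraph is a stacked area chart whose baseline is shifted away from zero, so the width of a band, not its vertical position, encodes the value. As a sketch of one common layout (a symmetric baseline at minus half the total, so that the stack is centred on the horizontal axis), the band boundaries can be computed in base R. The toy matrix and all names are made up for illustration, and this is not necessarily the algorithm that **ggstream** applies.

```{r}
# toy data: rows are time points, columns are series
values <- cbind(a = c(1, 3, 2, 4), b = c(2, 2, 5, 1), c = c(3, 1, 1, 3))

total <- rowSums(values)
baseline <- -total / 2                       # symmetric baseline

# upper edge of each band: baseline plus running sum across series
upper <- baseline + t(apply(values, 1, cumsum))
lower <- cbind(baseline, upper[, -ncol(upper), drop = FALSE])

matplot(upper,
  type = "l", lty = 1, ylim = range(lower, upper),
  xlab = "time", ylab = "stacked value"
)
lines(baseline, lty = 2)
```

The difference `upper - lower` recovers the original values, so only the band widths carry information; the vertical axis has no direct reading for any single series.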

## Timelines {#sec-ggplot2-vistime}

```{r vis-timeline,fig.cap="A timeline of data science",fig.width=10,fig.height=5}
# Interactive version: https://github.com/shosaco/vistime
# 刘思喆 (2018), 数据科学的时间轴: https://bjt.name/2018/11/18/timeline.html
x <- read.table(
  textConnection("
The Future of Data Analysis,1962
Relational Database,1970
Data science(Peter Naur),1974
Two-Way Communication,1975
Exploratory Data Analysis,1977
Business Intelligence,1989
The First Database Report,1992
The World Wide Web Explodes,1995
Data Mining and Knowledge Discovery,1997
S(ACM Software System Award),1998
Statistical Modeling: The Two Cultures,2001
Hadoop,2006
Data scientist,2008
NOSQL,2009
Deep Learning,2015
"),
  sep = ","
)
names(x) <- c("Event", "EventDate")
x$EventDate <- as.Date(paste(x$EventDate, "/01/01", sep = ""))

library(timelineS)
timelineS(x,
  labels = paste(x[[1]], format(x[[2]], "%Y")),
  line.color = "blue", label.angle = 15
)
```

```{r eval=FALSE}
library(timeline)
data(ww2, package = 'timeline')
timeline(ww2, ww2.events, event.spots=2, event.label='', event.above=FALSE)
```

```{r eval=FALSE}
# Well suited to interactive slides
# A Sharingan-like look in the style of Meituan
# Timeline
library(vistime)
# presidents and vice presidents
pres <- data.frame(
  Position = rep(c("President", "Vice"), each = 3),
  Name = c("Washington", rep(c("Adams", "Jefferson"), 2), "Burr"),
  start = c("1789-03-29", "1797-02-03", "1801-02-03"),
  end = c("1797-02-03", "1801-02-03", "1809-02-03"),
  color = c("#cbb69d", "#603913", "#c69c6e")
)

hc_vistime(pres, col.event = "Position", col.group = "Name", 
           title = "Presidents of the USA")
```

## Ternary plots {#sec-ggplot2-ternary}

[Ternary](https://github.com/ms609/Ternary/) draws ternary plots with base graphics, while [ggtern](https://bitbucket.org/nicholasehamilton/ggtern) draws them with ggplot2.

```{r,eval=FALSE}
library(ggtern)
library(ggalt)
data("Fragments")
ggtern(Fragments, aes(
  x = Qm, y = Qp, z = Rf + M,
  fill = GrainSize, shape = GrainSize
)) +
  geom_encircle(alpha = 0.5, size = 1) +
  geom_point() +
  labs(
    title = "Example Plot",
    subtitle = "using geom_encircle"
  ) +
  theme_bw() +
  theme_legend_position("tr")
```

## Vector field plots {#sec-vector-fields}


```{r}
library(ggquiver)
```


## Quadrant charts {#sec-ggplot2-eisenhower}

```{r eisenhower,fig.cap="Quadrant chart"}
dat <- data.frame(
  perc = c(54, 18, 5, 15),
  wall_policy = c("oppose", "favor", "oppose", "favor"),
  dreamer_policy = c("favor", "favor", "oppose", "oppose"),
  stringsAsFactors = FALSE
) %>%
  transform(
    xmin = ifelse(wall_policy == "oppose", -sqrt(perc), 0),
    xmax = ifelse(wall_policy == "favor", sqrt(perc), 0),
    ymin = ifelse(dreamer_policy == "oppose", -sqrt(perc), 0),
    ymax = ifelse(dreamer_policy == "favor", sqrt(perc), 0)
  )

ggplot(data = dat) +
  geom_rect(aes(
    xmin = xmin, xmax = xmax,
    ymin = ymin, ymax = ymax
  ), fill = "grey") +
  geom_text(aes(
    x = xmin + 0.5 * sqrt(perc),
    y = ymin + 0.5 * sqrt(perc),
    label = perc
  ),
  color = "white", size = 10
  ) +
  coord_equal() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  labs(x = "", y = "", title = "")
```


## Tornado diagrams {#sec-ggplot2-tornado}

```{r tornado-ggplot2,fig.cap="A tornado diagram showing variable importance",fig.height=4}
dat <- data.frame(
  variable = c("A", "B", "A", "B"),
  Level = c("Top-2", "Top-2", "Bottom-2", "Bottom-2"),
  value = c(.8, .7, -.2, -.3)
)
ggplot(dat, aes(x = variable, y = value, fill = Level)) +
  geom_bar(position = "identity", stat = "identity") +
  scale_y_continuous(labels = abs) +
  coord_flip() +
  theme_minimal()
```

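A [tornado diagram](https://en.wikipedia.org/wiki/Tornado_diagram) is mainly used in sensitivity analysis to compare the relative importance of different variables. It is a variant of the bar chart drawn with `geom_bar()` and can also serve to visualize model weights, though only for generalized linear models.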

## Dendrograms {#sec-ggplot2-hclust}

The `dendro_data()` function in **ggdendro** tidies the results of `tree`, `hclust`, `dendrogram`, and `rpart` into a form that can then be plotted.

```{r}
library(ggdendro)
hc <- hclust(dist(USArrests), "ave")
hcdata <- dendro_data(hc, type = "rectangle")
ggplot() +
  geom_segment(data = segment(hcdata), 
               aes(x = x, y = y, xend = xend, yend = yend)
  ) +
  geom_text(data = label(hcdata), 
            aes(x = x, y = y, label = label, hjust = 0), 
            size = 3
  ) +
  coord_flip() +
  scale_y_reverse(expand = c(0.2, 0)) +
  theme_minimal()
```

## Principal component plots {#sec-ggplot2-prcomp}

With the [**autoplotly**](https://github.com/terrytangyuan/autoplotly) package [@autoplotly], the result of `stats::prcomp()` can be turned into an interactive graphic.

```{r pac-plot}
pca <- prcomp(iris[c(1, 2, 3, 4)])
plot(pca)
```

```{r pac-plotly,eval=knitr::is_html_output()}
library(autoplotly)
autoplotly(pca,
  data = iris, colour = "Species",
  label = TRUE, label.size = 3, frame = TRUE
)
```

The [**ggfortify**](https://github.com/sinhrks/ggfortify) package [@Tang_2016_ggfortify] renders the principal component analysis as a static graphic.

```{r pca-ggplot2,fig.cap="Principal component analysis",fig.width=6,fig.height=4}
library(ggfortify)
autoplot(pca, data = iris, colour = 'Species')
```
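
The quantities behind these plots are available directly from the `prcomp` object: the scores in `pca$x` and the standard deviations in `pca$sdev`. A minimal base R sketch of the first two components, with the share of variance in the axis labels; the variable names are illustrative, and the scaling of the scores may differ from what `autoplot()` draws.

```{r}
pca <- prcomp(iris[c(1, 2, 3, 4)])

# proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pca$x[, 1], pca$x[, 2],
  col = as.integer(iris$Species), pch = 19,
  xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
  ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2])
)
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

Since the variables are not standardized here, the first component is dominated by the measurements with the largest variance; passing `scale. = TRUE` to `prcomp()` would change that.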

## Composite plots {#sec-ggplot2-composite}

<!-- Perhaps add a more complex example: [Tidy Tuesday Vacination](http://paulnice.datasorcery.tech/posts/tidy-tuesday-vacination/)
[ggprism](https://github.com/csdaw/ggprism) provides themes in the style of [GraphPad](https://www.graphpad.com/)
-->

Composite here means drawing different kinds of graphics in the same panel, for example a density curve combined with a rug plot[^rug-plot].
[**GGally**](https://github.com/ggobi/ggally), [**ggupset**](https://github.com/const-ae/ggupset), [ggcharts](https://github.com/thomas-neitmann/ggcharts), and [**ggpubr**](https://github.com/kassambara/ggpubr) provide highly customized composite statistical graphics; Figure \@ref(fig:ggpubr-composite) shows an example with ggpubr.

```{r ggpubr-composite,fig.cap="Composite plot",fig.width=6,fig.height=4}
library(ggpubr)
ggdensity(sleep,
  x = "extra", add = "mean", rug = TRUE, color = "group",
  fill = "group", palette = c("#00AFBB", "#E7B800")
)
```

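The packages above offer fixed combinations. More generally, there are many ways to combine several plots into one figure. [**cowplot**](https://github.com/wilkelab/cowplot), developed by Claus Wilke, is used extensively in his book [Fundamentals of Data Visualization](http://serialmentor.com/dataviz); the newer [**patchwork**](https://github.com/thomasp85/patchwork) offers a more concise composition syntax and is very popular. For lower-level approaches, see [一页多图](https://msg-book.netlify.app/tricks.html#sec-multipage) and R's built-in grid system.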

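[^rug-plot]: In Chinese the rug plot is more precisely called 轴须图; it is referred to as 地毯图 ("carpet plot") only because it looks like a rug laid on the floor. It corresponds to R's built-in `rug()` function or the ggplot2 layer `geom_rug()`; see <https://en.wikipedia.org/wiki/Rug_plot> for more.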

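## Animations {#sec-ggplot2-animation}

Several packages help with animation. [**av**](https://github.com/ropensci/av) uses [FFmpeg](https://github.com/FFmpeg/FFmpeg) to assemble static images into video, and [**gifski**](https://github.com/r-rust/gifski/) uses [gifski](https://gif.ski/) to assemble them into GIF animations. [**animation**](https://github.com/yihui/animation) [@Xie_2013_animation] turns Base R graphics into animations or videos, and [mapmate](https://github.com/leonawicz/mapmate) produces map-related three-dimensional visualizations. [**gganimate**](https://github.com/thomasp85/gganimate) animates graphics created with ggplot2, and **magick** can combine a series of static images into an animation; the frames can be rendered as a GIF with **gifski** or as a video with **av**. Readers are encouraged to start making animations from the [gganimate example collection](https://github.com/ropenscilabs/learngganimate). [**rgl**](https://r-forge.r-project.org/projects/rgl/) produces true three-dimensional interactive graphics that support zooming, dragging, and rotating, and [**rayshader**](https://github.com/tylermorganwall/rayshader) can additionally convert ggplot2 objects into 3D graphics.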


```{r}
#| label: fig-indometh-concentration
#| fig-cap: "Drug metabolism in the human body"
#| fig-width: 5
#| fig-height: 4
#| dev: "ragg_png"

p <- ggplot(
  data = Indometh,
  aes(x = time, y = conc, color = Subject)
) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(
    x = "time (hr)",
    y = "plasma concentrations of indometacin (mcg/ml)"
  )
p
```

```{r}
#| label: fig-indometh-animate
#| fig-width: 5
#| fig-height: 4
#| fig-show: 'animate'
#| interval: 0.1
#| cache: true
#| dev: "ragg_png"
#| out-width: 60%

library(gganimate)
p + 
  transition_reveal(time)
```

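The principle behind animation is, simply put, to play static frames one after another quickly enough that persistence of vision makes the eye perceive continuous motion. Compared with **animation**, **gganimate** uses the **tweenr** package to add transitions between states, which makes the animation look very natural. The cup function[^breast-cup] serves as an example below.

[^breast-cup]: The function comes from 余光创's blog post [3D 版邪恶的曲线](https://guangchuangyu.github.io/cn/2017/09/3d-breast/); here it is animated with gganimate. Be warned that the subject is tongue-in-cheek and not suitable for children: R can be used for rather unserious things too.

$$f(x;\theta,\phi) = \theta x\log_{10}(x)-\frac{1}{\phi}\mathit{e}^{-\phi^4(x-\frac{1}{\mathit{e}})^4}, \quad \theta \in (2,3), \phi \in (30,50), x \in (0,1)$$

Figure \@ref(fig:cup-curve) shows how the graph of the function changes with $\theta$ and $\phi$.

```{r cup-curve,fig.cap="Adding transition effects",fig.width=4,fig.height=4}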
library(tweenr)
cup_curve <- function(n = 100, theta = 3, phi = 30, cup = "A") {
  data.frame(x = seq(0.00001, 1, length.out = n), cup = cup) %>%
    transform(y = theta * x * log(x, base = 10) 
              - 1 / phi * exp(-(phi * x - phi / exp(1))^4))
}
mapply(
  FUN = cup_curve, theta = c(E = 3, D = 2.8, C = 2.5, B = 2.2, A = 2),
  phi = c(30, 33, 36, 40, 50), cup = c("E", "D", "C", "B", "A"),
  MoreArgs = list(n = 50), SIMPLIFY = FALSE, USE.NAMES = TRUE
) %>%
  tween_states(
    data = .,
    tweenlength = 2, statelength = 1,
    ease = rep("cubic-in-out", 4), nframes = 100
  ) %>%
  ggplot(data = ., aes(x, y, color = cup, frame = .frame)) +
  geom_path() +
  coord_flip() +
  theme_void()
```
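
The in-between frames that **tweenr** generates can be pictured as interpolation between two states along an easing curve. The sketch below is a hand-written illustration, not tweenr's implementation: `cubic_in_out()` uses a commonly quoted form of the cubic in-out easing function, and `tween()` is a hypothetical helper.

```{r}
# cubic in-out easing: slow start, fast middle, slow end (t in [0, 1])
cubic_in_out <- function(t) {
  ifelse(t < 0.5, 4 * t^3, 1 - (-2 * t + 2)^3 / 2)
}

# hypothetical helper: interpolate between two numeric states
tween <- function(from, to, nframes = 10, ease = cubic_in_out) {
  t <- seq(0, 1, length.out = nframes)
  # one row per frame, one column per element of the state
  outer(1 - ease(t), from) + outer(ease(t), to)
}

frames <- tween(from = c(x = 0, y = 0), to = c(x = 1, y = 2), nframes = 5)
frames
```

The first and last rows reproduce the two states exactly, while the rows in between move slowly near the ends and quickly in the middle, which is what makes eased transitions look smoother than linear ones.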