From 5964c5cd258c141bda4b3ceb75ee7524d823bae5 Mon Sep 17 00:00:00 2001 From: Robrecht Cannoodt Date: Tue, 19 Sep 2023 09:47:39 +0200 Subject: [PATCH] Update docs (#111) * update docs * move class diagram to vignette * remove doc folder to #54 * don't include design doc in built package --- .Rbuildignore | 5 +- README.md | 55 +++++---- README.qmd | 29 +++-- doc/challenges.md | 14 --- doc/design.md | 148 ------------------------ doc/design.qmd | 149 ------------------------- vignettes/{features.Rmd => design.Rmd} | 38 ++++++- vignettes/diagrams/class_diagram.mmd | 51 +++++++++ vignettes/diagrams/class_diagram.svg | 1 + vignettes/diagrams/script.sh | 8 ++ 10 files changed, 151 insertions(+), 347 deletions(-) delete mode 100644 doc/challenges.md delete mode 100644 doc/design.md delete mode 100644 doc/design.qmd rename vignettes/{features.Rmd => design.Rmd} (65%) create mode 100644 vignettes/diagrams/class_diagram.mmd create mode 100644 vignettes/diagrams/class_diagram.svg create mode 100755 vignettes/diagrams/script.sh diff --git a/.Rbuildignore b/.Rbuildignore index fa7ae45a..e7831676 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -1,5 +1,4 @@ ^LICENSE\.md$ -^doc$ ^.*\.Rproj$ ^\.Rproj\.user$ ^\.github$ @@ -10,4 +9,6 @@ ^_pkgdown\.yml$ ^docs$ ^pkgdown$ -^vignettes/features.Rmd$ \ No newline at end of file +^vignettes/diagrams/*\.mmd$ +^vignettes/diagrams/*\.svg$ +^vignettes/design\.Rmd$ \ No newline at end of file diff --git a/README.md b/README.md index 29d31163..f86be1d4 100644 --- a/README.md +++ b/README.md @@ -9,22 +9,18 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h status](https://www.r-pkg.org/badges/version/anndataR.png)](https://CRAN.R-project.org/package=anndataR) -`{anndataR}` is an R package that brings the power and flexibility of -AnnData to the R ecosystem, allowing you to effortlessly manipulate and -analyze your single-cell data. 
This package lets you work with backed -h5ad and zarr files, directly access various slots (e.g. X, obs, var, -obsm, obsp), or convert the data into SingleCellExperiment and Seurat -objects. +`{anndataR}` aims to make the AnnData format a first-class citizen in +the R ecosystem, and to make it easy to work with AnnData files in R, +either directly or by converting it to a SingleCellExperiment or Seurat +object. -## Design +Feature list: -This package was initially created at the [scverse 2023-04 -hackathon](https://scverse.org/events/2023_04_hackathon/) in Heidelberg. - -When fully implemented, it will be a complete replacement for -[theislab/zellkonverter](https://github.com/theislab/zellkonverter), -[mtmorgan/h5ad](github.com/mtmorgan/h5ad/) and -[dynverse/anndata](https://github.com/dynverse/anndata). +- Provide an `R6` class to work with AnnData objects in R (either + in-memory or on-disk). +- Read/write `*.h5ad` files natively +- Convert to/from `SingleCellExperiment` objects +- Convert to/from `Seurat` objects ## Installation @@ -34,6 +30,24 @@ You can install the development version of `{anndataR}` like so: devtools::install_github("scverse/anndataR") ``` +You might need to install suggested dependencies manually, depending on +the task you want to perform. + +- To read/write `*.h5ad` files, you need to install `{rhdf5}`: + `BiocManager::install("rhdf5")` +- To convert to/from `SingleCellExperiment` objects, you need to install + `{SingleCellExperiment}`: + `BiocManager::install("SingleCellExperiment")` +- To convert to/from `Seurat` objects, you need to install + `{SeuratObject}`: `install.packages("SeuratObject")` + +You can also install all suggested dependencies at once (though note +that this might take a while to run): + +``` r +devtools::install_github("scverse/anndataR", dependencies = TRUE) +``` + ## Example Here’s a quick example of how to use `{anndataR}`. 
First, we download an @@ -55,15 +69,10 @@ View structure: ``` r adata -#> class: InMemoryAnnData -#> dim: 50 obs x 100 var -#> X: dgRMatrix -#> layers: counts csc_counts dense_X dense_counts -#> obs: Float FloatNA Int IntNA Bool BoolNA n_genes_by_counts -#> log1p_n_genes_by_counts total_counts log1p_total_counts leiden -#> var: String n_cells_by_counts mean_counts log1p_mean_counts -#> pct_dropout_by_counts total_counts log1p_total_counts highly_variable -#> means dispersions dispersions_norm +#> AnnData object with n_obs × n_vars = 50 × 100 +#> obs: 'Float', 'FloatNA', 'Int', 'IntNA', 'Bool', 'BoolNA', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'leiden' +#> var: 'String', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' +#> layers: 'counts', 'csc_counts', 'dense_X', 'dense_counts' ``` Access AnnData slots: diff --git a/README.qmd b/README.qmd index 6c8f3e72..8337507a 100644 --- a/README.qmd +++ b/README.qmd @@ -27,17 +27,15 @@ knitr::opts_chunk$set( -`{anndataR}` is an R package that brings the power and flexibility of AnnData to the -R ecosystem, allowing you to effortlessly manipulate and analyze your single-cell data. -This package lets you work with backed h5ad and zarr files, directly access various slots -(e.g. X, obs, var, obsm, obsp), or convert the data into SingleCellExperiment and Seurat -objects. +`{anndataR}` aims to make the AnnData format a first-class citizen in the R ecosystem, and to make it easy to work with AnnData files in R, either directly +or by converting it to a SingleCellExperiment or Seurat object. -## Design +Feature list: -This package was initially created at the [scverse 2023-04 hackathon](https://scverse.org/events/2023_04_hackathon/) in Heidelberg. 
- -When fully implemented, it will be a complete replacement for [theislab/zellkonverter](https://github.com/theislab/zellkonverter), [mtmorgan/h5ad](github.com/mtmorgan/h5ad/) and [dynverse/anndata](https://github.com/dynverse/anndata). +* Provide an `R6` class to work with AnnData objects in R (either in-memory or on-disk). +* Read/write `*.h5ad` files natively +* Convert to/from `SingleCellExperiment` objects +* Convert to/from `Seurat` objects ## Installation @@ -48,6 +46,18 @@ You can install the development version of `{anndataR}` like so: devtools::install_github("scverse/anndataR") ``` +You might need to install suggested dependencies manually, depending on the task you want to perform. + +* To read/write `*.h5ad` files, you need to install `{rhdf5}`: `BiocManager::install("rhdf5")` +* To convert to/from `SingleCellExperiment` objects, you need to install `{SingleCellExperiment}`: `BiocManager::install("SingleCellExperiment")` +* To convert to/from `Seurat` objects, you need to install `{SeuratObject}`: `install.packages("SeuratObject")` + +You can also install all suggested dependencies at once (though note that this might take a while to run): + +``` r +devtools::install_github("scverse/anndataR", dependencies = TRUE) +``` + ## Example Here's a quick example of how to use `{anndataR}`. First, we download an h5ad file. @@ -94,3 +104,4 @@ Convert the AnnData object to a Seurat object: obj <- adata$to_Seurat() obj ``` + diff --git a/doc/challenges.md b/doc/challenges.md deleted file mode 100644 index a7abfa92..00000000 --- a/doc/challenges.md +++ /dev/null @@ -1,14 +0,0 @@ -# Challenges - -## Previously encountered issues - -Below are previously encountered issues when reading h5ad files using hdf5r. They could be -to create test cases. 
- -* [mojaveazure/seurat-disk#10](https://github.com/mojaveazure/seurat-disk/issues/10): Conversion error with SeuratDisk when copying `uns` -* [PMBio/MuDataSeurat#8](https://github.com/PMBio/MuDataSeurat/issues/8): PCA loadings issue - - -No test data yet: - -* [PMBio/MuDataSeurat#14](https://github.com/PMBio/MuDataSeurat/issues/14): H5Dvlen_reclaim invalid argument \ No newline at end of file diff --git a/doc/design.md b/doc/design.md deleted file mode 100644 index fe4c02e5..00000000 --- a/doc/design.md +++ /dev/null @@ -1,148 +0,0 @@ -# Design document - -## Proposed interface - -``` r -library(anndataR) - -# read from h5ad/h5mu file -adata <- read_h5ad("dataset.h5ad") -adata <- read_h5ad("dataset.h5ad", backed = TRUE) -mdata <- read_h5mu("dataset.h5mu") -mdata <- read_h5mu("dataset.h5mu", backed = TRUE) - -# anndata-like interface (the Python package) -adata$X -adata$obs -adata$var - -# optional feature 1: S3 helper functions for a base R-like interface -adata[1:10, 2:30] -dim(adata) -dimnames(adata) -as.matrix(adata, layer = NULL) -as.matrix(adata, layer = "counts") -t(adata) - -# optional feature 2: S3 helper functions for a bioconductor-like interface -rowData(adata) -colData(adata) -reducedDimNames(adata) - -# converters from/to sce -sce <- adata$to_sce() -from_sce(sce) - -# optional feature 3: converters from/to Seurat -seu <- adata$to_seurat() -from_seurat(seu) - -# optional feature 4: converters from/to SOMA -som <- adata$to_soma() -from_soma(som) -``` - -## Class diagram - -``` mermaid -classDiagram - class AbstractAnnData { - *X: Matrix - *layers: List[Matrix] - *obs: DataFrame - *var: DataFrame - *obsp: List[Matrix] - *varp: List[Matrix] - *obsm: List[Matrix] - *varm: List[Matrix] - *uns: List - *n_obs: int - *n_vars: int - *obs_names: Array[String] - *var_names: Array[String] - *subset(...): AbstractAnnData - *write_h5ad(): Unit - - to_sce(): SingleCellExperiment - to_seurat(): Seurat - - to_h5anndata(): H5AnnData - to_zarranndata(): ZarrAnnData - 
to_inmemory(): InMemoryAnnData - } - - AbstractAnnData <|-- H5AnnData - class H5AnnData { - init(h5file): H5AnnData - } - - AbstractAnnData <|-- ZarrAnnData - class ZarrAnnData { - init(zarrFile): ZarrAnnData - } - - AbstractAnnData <|-- InMemoryAnnData - class InMemoryAnnData { - init(X, obs, var, shape, ...): InMemoryAnnData - } - - AbstractAnnData <|-- ReticulateAnnData - class ReticulateAnnData { - init(pyobj): ReticulateAnnData - } - - class anndataR { - read_h5ad(path, backend): AbstractAnnData - read_h5mu(path, backend): AbstractMuData - } - anndataR --> AbstractAnnData -``` - -Notation: - -- `X: Matrix` - variable `X` is of type `Matrix` -- `*X: Matrix` - variable `X` is abstract -- `to_sce(): SingleCellExperiment` - function `to_sce` returns object of - type `SingleCellExperiment` -- `*to_sce()` - function `to_sce` is abstract - -## OO-framework - -S4, RC, or R6? - -- S4 offers formal class definitions and multiple dispatch, making it - suitable for complex projects, but may be verbose and slower compared - to other systems. -- RC provides reference semantics, familiar syntax, and encapsulation, - yet it is less popular and can have performance issues. -- R6 presents a simple and efficient OOP system with reference semantics - and growing popularity, but lacks multiple dispatch and the formality - of S4. - -Choosing an OOP system depends on the project requirements, developer -familiarity, and desired balance between formality, performance, and -ease of use. 
- ## Approach - - Implement inheritance objects for `AbstractAnnData`, `H5AnnData`, - `InMemoryAnnData` -- Only containing `X`, `obs`, `var` for now -- Implement base R S3 generics -- Implement `read_h5ad()`, `$write_h5ad()` -- Implement `$to_sce()` -- Add simple unit tests - -Optional: - -- Add more fields (obsp, obsm, varp, varm, …) –\> see class diagram -- Start implementing MuData -- Implement `$to_seurat()` -- Implement `ZarrAnnData` -- Implement `ReticulateAnnData` -- Implement Bioconductor S3 generics - -## Conclusion - -- Scope and therefore the name -- What we do after this diff --git a/doc/design.qmd b/doc/design.qmd deleted file mode 100644 index f719c182..00000000 --- a/doc/design.qmd +++ /dev/null @@ -1,149 +0,0 @@ ---- -title: Design document -format: gfm ---- - -:::{.content-hidden} -Rendered using: -``` -quarto render doc/design.qmd; sed -i 's#&lt;#<#g;s#&gt;#>#g' doc/design.md -``` -::: - -## Proposed interface - -```r -library(anndataR) - -# read from h5ad/h5mu file -adata <- read_h5ad("dataset.h5ad") -adata <- read_h5ad("dataset.h5ad", backed = TRUE) -mdata <- read_h5mu("dataset.h5mu") -mdata <- read_h5mu("dataset.h5mu", backed = TRUE) - -# anndata-like interface (the Python package) -adata$X -adata$obs -adata$var - -# optional feature 1: S3 helper functions for a base R-like interface -adata[1:10, 2:30] -dim(adata) -dimnames(adata) -as.matrix(adata, layer = NULL) -as.matrix(adata, layer = "counts") -t(adata) - -# optional feature 2: S3 helper functions for a bioconductor-like interface -rowData(adata) -colData(adata) -reducedDimNames(adata) - -# converters from/to sce -sce <- adata$to_sce() -from_sce(sce) - -# optional feature 3: converters from/to Seurat -seu <- adata$to_seurat() -from_seurat(seu) - -# optional feature 4: converters from/to SOMA -som <- adata$to_soma() -from_soma(som) -``` - -## Class diagram - -```{mermaid} -classDiagram - class AbstractAnnData { - *X: Matrix - *layers: List[Matrix] - *obs: DataFrame - *var: DataFrame - *obsp: 
List[Matrix] - *varp: List[Matrix] - *obsm: List[Matrix] - *varm: List[Matrix] - *uns: List - *n_obs: int - *n_vars: int - *obs_names: Array[String] - *var_names: Array[String] - *subset(...): AbstractAnnData - *write_h5ad(): Unit - - to_sce(): SingleCellExperiment - to_seurat(): Seurat - - to_h5anndata(): H5AnnData - to_zarranndata(): ZarrAnnData - to_inmemory(): InMemoryAnnData - } - - AbstractAnnData <|-- H5AnnData - class H5AnnData { - init(h5file): H5AnnData - } - - AbstractAnnData <|-- ZarrAnnData - class ZarrAnnData { - init(zarrFile): ZarrAnnData - } - - AbstractAnnData <|-- InMemoryAnnData - class InMemoryAnnData { - init(X, obs, var, shape, ...): InMemoryAnnData - } - - AbstractAnnData <|-- ReticulateAnnData - class ReticulateAnnData { - init(pyobj): ReticulateAnnData - } - - class anndataR { - read_h5ad(path, backend): AbstractAnnData - read_h5mu(path, backend): AbstractMuData - } - anndataR --> AbstractAnnData -``` - -Notation: - - - `X: Matrix` - variable `X` is of type `Matrix` - - `*X: Matrix` - variable `X` is abstract - - `to_sce(): SingleCellExperiment` - function `to_sce` returns object of type `SingleCellExperiment` - - `*to_sce()` - function `to_sce` is abstract - -## OO-framework - -S4, RC, or R6? - -- S4 offers formal class definitions and multiple dispatch, making it suitable for complex projects, but may be verbose and slower compared to other systems. -- RC provides reference semantics, familiar syntax, and encapsulation, yet it is less popular and can have performance issues. -- R6 presents a simple and efficient OOP system with reference semantics and growing popularity, but lacks multiple dispatch and the formality of S4. - -Choosing an OOP system depends on the project requirements, developer familiarity, and desired balance between formality, performance, and ease of use. 
- -## Approach - -* Implement inheritance objects for `AbstractAnnData`, `H5AnnData`, `InMemoryAnnData` -* Only containing `X`, `obs`, `var` for now -* Implement base R S3 generics -* Implement `read_h5ad()`, `$write_h5ad()` -* Implement `$to_sce()` -* Add simple unit tests - -Optional: - -* Add more fields (obsp, obsm, varp, varm, ...) --> see class diagram -* Start implementing MuData -* Implement `$to_seurat()` -* Implement `ZarrAnnData` -* Implement `ReticulateAnnData` -* Implement Bioconductor S3 generics - -## Conclusion - -* Scope and therefore the name -* What we do after this diff --git a/vignettes/features.Rmd b/vignettes/design.Rmd similarity index 65% rename from vignettes/features.Rmd rename to vignettes/design.Rmd index e7d55f5c..05484031 100644 --- a/vignettes/features.Rmd +++ b/vignettes/design.Rmd @@ -1,8 +1,8 @@ --- -title: "Features" +title: "Design" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Features} + %\VignetteIndexEntry{Design} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -15,6 +15,40 @@ knitr::opts_chunk$set( ) ``` + +`{anndataR}` is designed to offer the combined functionality of the following packages: + +* [theislab/zellkonverter](https://github.com/theislab/zellkonverter): Convert AnnData files to/from `SingleCellExperiment` objects. +* [mtmorgan/h5ad](https://github.com/mtmorgan/h5ad/): Read/write `*.h5ad` files natively using `rhdf5`. +* [dynverse/anndata](https://github.com/dynverse/anndata): An R implementation of the AnnData data structures, uses `reticulate` to read/write `*.h5ad` files. + +Ideally, this package will be a complete replacement for all of these packages, and will be the go-to package for working with AnnData files in R. + +## Desired feature list + +* Provide an `R6` class to work with AnnData objects in R (either in-memory or on-disk). 
+* Read/write `*.h5ad` files natively +* Convert to/from `SingleCellExperiment` objects +* Convert to/from `Seurat` objects + +## Class diagram + +Here is a diagram of the main R6 classes provided by the package: + +![](diagrams/class_diagram.svg) + +Notation: + + - `X: Matrix` - variable `X` is of type `Matrix` + - `*X: Matrix` - variable `X` is abstract + - `to_SingleCellExperiment(): SingleCellExperiment` - function `to_SingleCellExperiment` returns object of type `SingleCellExperiment` + - `*to_SingleCellExperiment()` - function `to_SingleCellExperiment` is abstract + + +## Feature tracking + +The following tables show the status of the implementation of each feature in the package: + ```{r include=FALSE} library(tibble) library(knitr) diff --git a/vignettes/diagrams/class_diagram.mmd b/vignettes/diagrams/class_diagram.mmd new file mode 100644 index 00000000..daa95b5c --- /dev/null +++ b/vignettes/diagrams/class_diagram.mmd @@ -0,0 +1,51 @@ + +classDiagram + class AbstractAnnData { + *X: Matrix + *layers: List[Matrix] + *obs: DataFrame + *var: DataFrame + *obsp: List[Matrix] + *varp: List[Matrix] + *obsm: List[Matrix] + *varm: List[Matrix] + *uns: List + *n_obs: int + *n_vars: int + *obs_names: Array[String] + *var_names: Array[String] + *subset(...): AbstractAnnData + *write_h5ad(): Unit + + to_SingleCellExperiment(): SingleCellExperiment + to_Seurat(): Seurat + + to_HDF5AnnData(): HDF5AnnData + to_ZarrAnnData(): ZarrAnnData + to_InMemoryAnnData(): InMemoryAnnData + } + + AbstractAnnData <|-- HDF5AnnData + class HDF5AnnData { + init(h5file): HDF5AnnData + } + + AbstractAnnData <|-- ZarrAnnData + class ZarrAnnData { + init(zarrFile): ZarrAnnData + } + + AbstractAnnData <|-- InMemoryAnnData + class InMemoryAnnData { + init(X, obs, var, shape, ...): InMemoryAnnData + } + + AbstractAnnData <|-- ReticulateAnnData + class ReticulateAnnData { + init(pyobj): ReticulateAnnData + } + + class anndataR { + read_h5ad(path, backend): Either[AbstractAnnData, 
SingleCellExperiment, Seurat] + } + anndataR --> AbstractAnnData \ No newline at end of file diff --git a/vignettes/diagrams/class_diagram.svg b/vignettes/diagrams/class_diagram.svg new file mode 100644 index 00000000..93200725 --- /dev/null +++ b/vignettes/diagrams/class_diagram.svg @@ -0,0 +1 @@ +
[class_diagram.svg: single-line SVG markup not shown; rendering of the class diagram defined in vignettes/diagrams/class_diagram.mmd]
\ No newline at end of file diff --git a/vignettes/diagrams/script.sh b/vignettes/diagrams/script.sh new file mode 100755 index 00000000..c28400fb --- /dev/null +++ b/vignettes/diagrams/script.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +# Convert mermaid diagrams to different formats +# because RMarkdown doesn't support mermaid diagrams + +docker run --rm -w /pwd -u `id -u`:`id -g` \ + -v `pwd`:/pwd minlag/mermaid-cli \ + -i vignettes/diagrams/class_diagram.mmd -o vignettes/diagrams/class_diagram.svg
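
Note on `vignettes/diagrams/script.sh`: the script added above converts one hard-coded diagram. If more `.mmd` files are added later, a batch variant could look roughly like the sketch below. It is not part of this patch. The names `svg_path_for`, `render_one`, `render_all` and the `DRY_RUN` variable are hypothetical and introduced only for illustration; the docker invocation itself is copied from the script.

```shell
#!/bin/bash
# Sketch only: a batch variant of vignettes/diagrams/script.sh.
# svg_path_for, render_one, render_all and DRY_RUN are hypothetical names.
set -eu

# Map "path/to/diagram.mmd" to "path/to/diagram.svg".
svg_path_for() {
  printf '%s\n' "${1%.mmd}.svg"
}

# Render one diagram with the same docker invocation as script.sh.
# With DRY_RUN=1 the command is printed instead of executed.
render_one() {
  mmd="$1"
  svg="$(svg_path_for "$mmd")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker run --rm -w /pwd -u $(id -u):$(id -g) -v $(pwd):/pwd minlag/mermaid-cli -i $mmd -o $svg"
  else
    docker run --rm -w /pwd -u "$(id -u):$(id -g)" \
      -v "$(pwd):/pwd" minlag/mermaid-cli \
      -i "$mmd" -o "$svg"
  fi
}

# Render every .mmd file in a directory (default: vignettes/diagrams).
# The directory should be relative to the current directory, because only
# `pwd` is mounted into the container.
render_all() {
  dir="${1:-vignettes/diagrams}"
  for f in "$dir"/*.mmd; do
    [ -e "$f" ] || continue  # the glob matched nothing
    render_one "$f"
  done
}
```

After sourcing the file from the repository root, `DRY_RUN=1 render_all vignettes/diagrams` previews the commands and `render_all` runs them. Keeping the path mapping in its own function means the `.mmd` to `.svg` naming rule lives in one place.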