diff --git a/CHANGELOG.md b/CHANGELOG.md index f15371d..3d1fb33 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -74,7 +74,17 @@ For more information, please visit the [package website](https://laminr.lamin.ai * Set Python requirements to `lamindb[aws]` for now (PR #33). Will be changed to `lamin_cli` once [laminlabs/lamin-cli#90](https://github.com/laminlabs/lamin-cli/issues/90) is solved. -* Improve documentation for installing suggested dependencies and what they are requrire for (PR #56) +* Improve documentation for installing suggested dependencies and what they are required for (PR #56). + +* Update the README to give a better overview of the package (PR #67). + +* Rename the `usage` vignette to `laminr` and add an overview of the core concepts of LaminDB (PR #67). + +* Update the `architecture` vignette to relate the class structure of the package to the core concepts (PR #67). + +* Add a `development` vignette to document the list of current, planned and unplanned functionality (PR #67). + +* Add vignettes to document registries in the core, bionty, and wetlab modules (PR #67). ## BUG FIXES diff --git a/DESCRIPTION b/DESCRIPTION index 41eee79..f25516d 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -10,7 +10,7 @@ Authors@R:c( person("Lamin Labs", , , "open-source@lamin.ai", role = c("aut", "cph")) ) Description: Interact with 'LaminDB' from R. 'LaminDB' is an open-source - data framework for biology. This package allows you to query and + data framework for biology. This package allows you to query and download data from 'LaminDB' instances. 
License: Apache License (>= 2) URL: https://laminr.lamin.ai, https://github.com/laminlabs/laminr diff --git a/R/vignette-utils.R b/R/vignette-utils.R new file mode 100644 index 0000000..54ba7a0 --- /dev/null +++ b/R/vignette-utils.R @@ -0,0 +1,103 @@ +# nolint start nolint_cyclomatic_linter +generate_module_markdown <- function(db, module_name, allowed_related_modules = c("core", module_name)) { + # nolint end nolint_cyclomatic_linter + module <- db$get_module(module_name) + + registry_names <- module$get_registry_names() + + type_map <- c( + "BigAutoField" = "integer64", + "AutoField" = "integer", + "CharField" = "character", + "BooleanField" = "logical", + "DateTimeField" = "POSIXct", + "TextField" = "character", + "ForeignKey" = "integer64", + "BigIntegerField" = "integer64", + "SmallIntegerField" = "integer", + "JSONField" = "list" + ) + + output <- c() + + for (registry_name in registry_names) { # nolint cyclocomp_linter + registry <- module$get_registry(registry_name) + fields <- registry$get_fields() + + if (registry$is_link_table) { + next + } + + output <- output |> c(paste0("## ", registry$class_name, "\n\n")) + + classes <- class(registry) |> discard(~ .x == "R6") + class_urls <- paste0("`?", classes, "`") + + output <- output |> c(paste0("Base classes: ", paste(class_urls, collapse = ", "), "\n\n")) + + ## Document simple fields + simple_fields <- fields |> keep( + ~ is.null(.x$related_field_name) && + !grepl("^_", .x$field_name) + ) + + if (length(simple_fields) > 0) { + output <- output |> c(paste0("### Simple fields\n\n")) + } + + for (field in simple_fields) { + field_type <- + if (field$type %in% names(type_map)) { + type_map[[field$type]] + } else { + field$type + } + output <- output |> c(paste0("* `", field$field_name, "` (`", field_type, "`)\n")) + } + + if (length(simple_fields) > 0) { + output <- output |> c("\n\n") + } + + ## Document relational fields + relational_fields <- fields |> keep( + ~ !is.null(.x$related_field_name) && + 
!grepl("^_", .x$field_name) && + !.x$is_link_table && + .x$related_module_name %in% allowed_related_modules + ) + + if (length(relational_fields) > 0) { + output <- output |> c(paste0("### Relational fields\n\n")) + } + + for (field in relational_fields) { + related_module <- db$get_module(field$related_module_name) + related_registry <- related_module$get_registry(field$related_registry_name) + + related_class_name <- + if (related_module$name == "core") { + related_registry$class_name + } else { + paste0(related_module$name, "$", related_registry$class_name) + } + related_link <- + if (related_module$name == module_name) { + paste0("#", related_registry$name) + } else { + paste0("module_", related_module$name, ".html#", related_registry$name) + } + + output <- output |> c(paste0( + " * `", field$field_name, "` ([`", related_class_name, "`](", + related_link, "))\n" + )) + } + + if (length(relational_fields) > 0) { + output <- output |> c("\n\n") + } + } + + output +} diff --git a/README.md b/README.md index 518ae15..a5b266a 100644 --- a/README.md +++ b/README.md @@ -1,134 +1,61 @@ -# LaminR: Work with LaminDB instances in R +# {laminr}: An R interface to LaminDB - - - [![R-CMD-check](https://github.com/laminlabs/laminr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/laminlabs/laminr/actions/workflows/R-CMD-check.yaml) -This package allows you to query and download data from LaminDB -instances. +**{laminr}** is an R package that provides an interface to [LaminDB](https://lamin.ai), a powerful open-source data framework designed specifically for biological research. With laminr, you can leverage LaminDB's powerful features to manage, query, and track your data and metadata with unparalleled efficiency and scalability, all within the familiar comfort of R. + +## Why use {laminr}? 
+ +LaminDB offers a unique approach to data management in bioinformatics, providing: + +* **Unified Data and Metadata Handling**: Organize your data and its associated metadata in a structured and accessible manner. +* **Powerful Querying and Search**: Effortlessly filter and retrieve specific data and metadata using intuitive query functions. +* **Data Lineage Tracking**: Maintain a comprehensive history of your data transformations, ensuring reproducibility and transparency. +* **Ontology Integration**: Leverage public ontologies (e.g., for genes, proteins, cell types) for standardized metadata annotation. +* **Data Validation and Standardization**: Ensure data quality and consistency with built-in validation and standardization tools. -## Setup +**{laminr}** brings all these benefits to your R workflow, allowing you to seamlessly integrate LaminDB into your existing analysis pipelines. -Install the development version from GitHub: +## Installation + +Get started with **{laminr}** by installing the development version directly from GitHub: ``` r # install.packages("remotes") remotes::install_github("laminlabs/laminr") ``` -To install all suggested dependencies required for some functionality, -use: +To include all suggested dependencies for enhanced functionality, use: ``` r remotes::install_github("laminlabs/laminr", dependencies = TRUE) ``` -You will also need to install `lamindb`: +You will also need to install the `lamindb` Python package: ``` bash pip install lamindb[aws] ``` -## Connect to an instance - -To connect to a LaminDB instance, you will first need to run -`lamin login` OR `lamin load ` in the terminal. This will -create a directory in your home directory called `.lamin` with the -necessary credentials. 
- -``` bash -lamin login -lamin connect laminlabs/cellxgene -``` - -Then, you can connect to the instance using the `laminr::connect()` -function: - -``` r -library(laminr) - -db <- connect("laminlabs/cellxgene") -db -``` - - cellxgene - Core registries - $Run - $User - $Param - $ULabel - $Feature - $Storage - $Artifact - $Transform - $Collection - $FeatureSet - $ParamValue - $FeatureValue - Additional modules - bionty - -## Query the instance - -You can use the `db` object to query the instance: - -``` r -artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu") -``` - -You can print the record: +## Getting started -``` r -artifact -``` +The best way to get started with **{laminr}** is to explore the package vignettes (available at [laminr.lamin.ai](https://laminr.lamin.ai)): - Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', created_by_id=1, run_id=27, suffix='.h5ad', created_at='2024-07-12T12:34:10.345829+00:00', hash='SZ5tB0T4YKfiUuUkAL09ZA', _hash_type='md5-n', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, is_latest=TRUE, _key_is_virtual=FALSE, transform_id=22, n_observations=51552, size=691757462, visibility=1, updated_at='2024-07-12T12:40:48.837026+00:00', type='dataset') - -Or call the `$describe()` method to get a summary: - -``` r -artifact$describe() -``` +* **Getting Started**: Learn the basics and explore practical examples (`vignette("laminr", package = "laminr")`). +* **Package Architecture**: Get a better understanding of how **{laminr}** works (`vignette("architecture", package = "laminr")`). +* **Development Roadmap**: Explore current features and future development plans (`vignette("development", package = "laminr")`). 
- Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', created_by_id=1, run_id=27, suffix='.h5ad', created_at='2024-07-12T12:34:10.345829+00:00', hash='SZ5tB0T4YKfiUuUkAL09ZA', _hash_type='md5-n', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, is_latest=TRUE, _key_is_virtual=FALSE, transform_id=22, n_observations=51552, size=691757462, visibility=1, updated_at='2024-07-12T12:40:48.837026+00:00', type='dataset') - Provenance - $storage = 's3://cellxgene-data-public' - $transform = 'Census release 2024-07-01 (LTS)' - $run = '2024-07-16T12:49:41.81955+00:00' - $created_by = 'sunnyosun' +For information on specific modules and functionalities, check out the following vignettes: -## Access fields +* **Core Module**: Learn about the core registries available in a LaminDB instance (`vignette("module_core", package = "laminr")`). +* **Bionty Module**: Explore the bionty module for biology-related entities (`vignette("module_bionty", package = "laminr")`). 
+ +## Learn more -You can access its fields as follows: - -- `artifact$id`: 3659 -- `artifact$uid`: KBW89Mf7IGcekja2hADu -- `artifact$key`: - cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad - -You can also fetch related fields: - -- `artifact$root`: s3://cellxgene-data-public -- `artifact$created_by`: sunnyosun - -## Load the artifact - -You can directly load the artifact to access its data: - -``` r -artifact$load() -``` +For more information about LaminDB and its features, check out the following resources: - ℹ 's3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad' already exists at '/home/luke/.cache/lamindb/cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad' +* [LaminDB website](https://lamin.ai/) - AnnData object with n_obs × n_vars = 51552 × 36398 - obs: 'donor_id', 'Predicted_labels_CellTypist', 'Majority_voting_CellTypist', 'Manually_curated_celltype', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid' - var: 'gene_symbols', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length' - uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'sex_ontology_term_id_colors', 'title' - obsm: 'X_umap' +* [LaminDB documentation](https://docs.lamin.ai/) diff --git a/README.qmd b/README.qmd deleted file mode 100644 index 9d7fa56..0000000 --- a/README.qmd +++ /dev/null @@ -1,96 +0,0 @@ ---- -title: "LaminR: Work with LaminDB instances in R" -format: gfm ---- - - - - 
-[![R-CMD-check](https://github.com/laminlabs/laminr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/laminlabs/laminr/actions/workflows/R-CMD-check.yaml) - - -This package allows you to query and download data from LaminDB instances. - -## Setup - -Install the development version from GitHub: - -```R -# install.packages("remotes") -remotes::install_github("laminlabs/laminr") -``` - -To install all suggested dependencies required for some functionality, use: - -```R -remotes::install_github("laminlabs/laminr", dependencies = TRUE) -``` - -You will also need to install `lamindb`: - -```bash -pip install lamindb[aws] -``` - -## Connect to an instance - -To connect to a LaminDB instance, you will first need to run `lamin login` OR `lamin load ` in the terminal. This will create a directory in your home directory called `.lamin` with the necessary credentials. - -```bash -lamin login -lamin connect laminlabs/cellxgene -``` - -Then, you can connect to the instance using the `laminr::connect()` function: - -```{r setup} -library(laminr) - -db <- connect("laminlabs/cellxgene") -db -``` - -## Query the instance - -You can use the `db` object to query the instance: - -```{r get_artifact} -artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu") -``` - -You can print the record: - -```{r print_artifact} -artifact -``` - -Or call the `$describe()` method to get a summary: - -```{r describe_artifact} -artifact$describe() -``` - -## Access fields - -You can access its fields as follows: - -* `artifact$id`: `r artifact$id` -* `artifact$uid`: `r artifact$uid` -* `artifact$key`: `r artifact$key` - -You can also fetch related fields: - -* `artifact$root`: `r artifact$storage$root` -* `artifact$created_by`: `r artifact$created_by$handle` - -## Load the artifact - -You can directly load the artifact to access its data: - -```{r load_artifact} -artifact$load() -``` diff --git a/_pkgdown.yml b/_pkgdown.yml index dac5a2d..cb4b619 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml 
@@ -1,3 +1,33 @@ url: https://laminr.lamin.ai/ + template: bootstrap: 5 + +authors: + Robrecht Cannoodt: + href: https://cannoodt.dev + Luke Zappia: + href: https://lazappi.id.au + Data Intuitive: + href: https://data-intuitive.com + Lamin Labs: + href: https://lamin.ai + +navbar: + components: + articles: + text: Articles + menu: + - text: Introduction + - text: Package Architecture + href: articles/architecture.html + - text: Development Roadmap + href: articles/development.html + - text: ------- + - text: API reference + - text: Core module + href: articles/module_core.html + - text: Bionty module + href: articles/module_bionty.html + - text: Wetlab module + href: articles/module_wetlab.html diff --git a/vignettes/architecture.qmd b/vignettes/architecture.qmd index e3e0142..f4dc47f 100644 --- a/vignettes/architecture.qmd +++ b/vignettes/architecture.qmd @@ -10,13 +10,86 @@ knitr: comment: "#>" --- -This package is designed to interact with LaminDB instances. +This vignette provides a high-level overview of the core architectural components in **LaminDB**. Understanding these concepts will help you navigate the system and effectively manage your data and metadata. -## Basic structure +## Core concepts -When connecting to a LaminDB instance, laminr will interact with the LaminDB instance API to retrieve the schema of the data structures in that instance. This schema is used to instantiate Modules containing Registries, which in turn contain Fields. A registry can be used to retrieve Records. +**LaminDB** is built around a few key ideas: -### Class diagram +### Instance + +A LaminDB **instance** is a self-contained environment for storing and managing data and metadata. You can think of it like a database or a project directory. Each instance has its own: + +* **Schema:** Defines the structure of the metadata. +* **Storage:** Where the actual data files are stored (locally, on S3, etc.). +* **Database:** Stores the metadata records in registries. 
+ +For more information about instances, see `?connect()` and `?Instance`. + +### Module + +A **module** in LaminDB is a collection of related registries that provide functionality in a specific domain. For example: + +* **core:** Provides essential registries for general data management (Artifacts, Collections, Transforms, etc.). This module is included in every LaminDB instance. +* **bionty:** Offers registries for managing biological entities (genes, proteins, cell types) and links them to public ontologies. +* **wetlab:** Includes registries for managing experimental metadata (samples, treatments, etc.). +* **And many more...** + +Modules help organize the system and make it easier to find the specific registries you need. + +For more information about modules, see `?Module`. The core module is documented in the `module_core` vignette: `vignette("module_core", package = "laminr")`. + +### Registry + +A **registry** is a centralized collection of related records. It's like a table in a database, where each row represents a specific entity. Examples of registries include: + + * **Artifacts**: Datasets, models, or other data entities. + * **Collections**: Groupings of related artifacts. + * **Transforms**: Data processing operations. + * **Features**: Variables or measurements within datasets. + * **Labels**: Annotations or classifications applied to data. + +Each registry has a defined structure with specific fields that hold relevant information. + +For more information about registries, see `?Registry`. The core registries are documented in the `module_core` vignette: `vignette("module_core", package = "laminr")`. + +### Field + +A **field** is a single piece of information within a registry. It's analogous to a column in a database table. For example, the Artifact registry might have fields like: + +* `key`: Storage key, the relative path within the storage location. +* `storage`: Storage location, e.g. an S3 or GCP bucket or a local directory. 
+* `description`: A description of the artifact. +* `created_by`: The user who created the artifact. + +Fields define the type of data that can be stored in a registry and provide a way to organize and query the metadata. + +For more information about fields, see `?Field`. The fields of core registries are documented in the `module_core` vignette: `vignette("module_core", package = "laminr")`. + + +### Record + +A **record** is a single entry within a registry. It's like a row in a database table. A record combines multiple fields to represent a specific entity. For example, a record in the Artifact registry might represent a single dataset with its key, storage location, description, creator, and other relevant information. + +### Putting it together + +In essence, you have **instances** that contain **modules**. Each module contains **registries**, which in turn hold **records**. Each record is composed of multiple **fields**. This hierarchical structure allows for flexible and organized management of data and metadata within LaminDB. + + +## Class structure + +The `laminr` package provides a set of classes that mirror the core concepts of LaminDB. These classes allow you to interact with instances, modules, registries, fields, and records in a programmatic way. + +The package provides two sets of classes: the base classes and the sugar syntax classes. + +### Base classes + +These classes provide the core functionality for interacting with LaminDB instances, modules, registries, fields, and records. These are the classes that are documented via +`?Instance`, `?Module`, `?Registry`, `?Field`, and `?Record`. + +The class diagram below illustrates the relationships between these classes. + +However, they are not intended to be used directly in most cases. Instead, the sugar syntax classes provide a more user-friendly interface for working with LaminDB data. 
```{mermaid} classDiagram @@ -102,11 +175,12 @@ classDiagram %% # nolint end ``` -## Sugar syntax -The `laminr` package adds some sugar syntax to the `Instance`, `Module`, and `Record` classes. This allows to directly access an instance's and module's registries and a record's fields. +### Sugar syntax classes -For instance, instead of writing: +The sugar syntax classes provide a more user-friendly way to interact with LaminDB data. These classes are designed to make it easier to access and manipulate instances, modules, registries, fields, and records. + +For example, to get an artifact with a specific ID using **only** base classes, you might write: ```r db <- connect("laminlabs/cellxgene") @@ -116,20 +190,22 @@ artifact <- db$get_module("core")$get_registry("artifact")$get("KBW89Mf7IGcekja2 artifact$get_value("id") ``` -Using the sugar syntax, you can write: +With the sugar syntax classes, you can achieve the same result more concisely: ```r db <- connect("laminlabs/cellxgene") -artifact <- db$core$artifact$get("KBW89Mf7IGcekja2hADu") +artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu") artifact$id ``` -This sugar syntax is achieved by creating RichInstance and RichRecord classes that inherit from Instance and Record, respectively. +This sugar syntax is achieved by creating RichInstance and RichRecord classes that inherit from Instance and Record, respectively. These classes provide additional methods and properties to simplify working with LaminDB data. ### Class diagram +The class diagram below illustrates the relationships between the sugar syntax classes in the `laminr` package. These classes provide a more user-friendly interface for interacting with LaminDB data. 
+ ```{mermaid} classDiagram %% # nolint start diff --git a/vignettes/development.qmd b/vignettes/development.qmd new file mode 100644 index 0000000..87b0a87 --- /dev/null +++ b/vignettes/development.qmd @@ -0,0 +1,131 @@ +--- +title: "Feature List and Roadmap" +vignette: > + %\VignetteIndexEntry{Feature List and Roadmap} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{quarto::html} +knitr: + opts_chunk: + collapse: true + comment: "#>" +--- + +This document outlines the features of the **{laminr}** package and the roadmap for future development. + +## Features + +### Connect to an instance + +* [x] Connect to a LaminDB instance (`connect()`). +* [x] Handle authentication and authorization. +* [ ] Connect to a LaminDB instance without needing to install the `lamin_cli` Python package. + +### Query & search + +* [x] **Query exactly one record** (`Registry$get(...)`): Fetch a single record by ID. +* [ ] **Query sets of records** (`Registry$filter()`): Fetch multiple records based on filters. + - [x] `$df()`: Returns a data frame with each record in a row. + - [ ] `$all()`: Returns all records as a `QuerySet`. + - [ ] `$one()`: Return exactly one record. + - [ ] `$one_or_none()`: Return one record or `NULL`. +* [ ] **Leverage relationships when querying** (`Artifact$filter(created_by__handle__startswith = "testuse")$df()`): Query records based on relationships. +* [ ] **Comparators**: Use comparators in filters. 
+ - [ ] `and`: Example: `Artifact$filter(suffix = ".jpg", created_by = user)` + - [ ] `less than` / `greater than`: Example: `Artifact$filter(size__lt = 1e4)` + - [ ] `in`: Example: `Artifact$filter(suffix__in = c(".jpg", ".fastq.gz"))` + - [ ] `order by`: Example: `Artifact$filter()$order_by("created_at")` + - [ ] `contains`: Example: `Artifact$filter(name__contains = "test")` + - [ ] `startswith`: Example: `Artifact$filter(name__startswith = "test")` + - [ ] `or`: Example: `...` + - [ ] `not`: Example: `...` +* [ ] **Search for records** (`Registry$search(...)`): Search for records based on a query string. +* [ ] **Pagination**: Support pagination for large query results. +* [ ] **Field lookups**: Provide convenient functions for looking up field values (e.g., `Artifact$lookup("description")`). + +### Manage data & metadata + +* [ ] **Create artifacts**: Create new artifacts from various data sources (e.g., files, data frames, in-memory objects). +* [ ] **Save artifacts**: Save artifacts to LaminDB with appropriate metadata. +* [ ] **Load artifacts**: Load artifacts from LaminDB into R: + - [ ] `fcs`: Load flow cytometry data. + - [ ] `tsv`: Load tabular data. + - [x] `h5ad`: Load an HDF5-backed AnnData object. + - [ ] `h5mu`: Load an HDF5-backed MuData object. + - [ ] `html`: Load HTML content. + - [ ] `json`: Load JSON data. + - [ ] `image`: Load image data. + - [ ] `svg`: Load SVG data. +* [ ] **Cache artifacts**: Cache artifacts locally for faster access: + - [x] `s3`: Interact with S3 storage. + - [ ] `gcp`: Interact with Google Cloud Storage. +* [ ] **Version artifacts**: Create new versions of artifacts. +* [ ] **Manage artifact metadata**: Add, update, and delete artifact metadata. +* [ ] **Work with collections**: Create, manage, and query collections of artifacts. + +### Track notebooks & scripts + +* [ ] **Track code execution**: Automatically track the execution of R scripts and notebooks. 
+* [ ] **Capture run context**: Record information about the execution environment (e.g., package versions, parameters). +* [ ] **Link code to artifacts**: Associate code execution with generated artifacts. +* [ ] **Visualize data lineage**: Create visualizations of data lineage and dependencies. + +### Curate datasets + +* [ ] **Validate data**: Validate data against predefined schemas or constraints. +* [ ] **Standardize data**: Apply standardization rules to ensure data consistency. +* [ ] **Annotate data**: Add annotations and labels to data. +* [ ] **Use the Curator class**: Implement the `Curator` class for a streamlined curation workflow. + +### Access public ontologies + +* [ ] **Access ontology data**: Fetch data from public ontologies (e.g., gene names, protein IDs). +* [ ] **Search ontologies**: Search for entities within ontologies. +* [ ] **Use ontology terms in queries**: Use ontology terms to filter and query data. +* [ ] **Manage ontology versions**: Access different versions of ontologies. + +### Manage biological registries + +* [ ] **Create and manage records in bionty registries**: Add, update, and delete records for genes, proteins, cell types, etc. +* [ ] **Utilize hierarchical relationships**: Navigate and query based on parent-child relationships in ontologies. +* [ ] **Manage synonyms**: Add and use synonyms for biological entities. + +### Manage schema modules + +* [x] **List available modules**: Retrieve a list of available modules in an instance. +* [x] **Access module registries**: Access registries within specific modules. +* [ ] **(Advanced) Create custom modules**: Define and register custom schema modules. + +### Transfer data + +* [ ] **Upload data**: Upload data files to LaminDB storage. +* [x] **Download data**: Download data files from LaminDB storage. +* [ ] **(Advanced) Support zero-copy data transfer**: Implement efficient data transfer mechanisms. 
+ +## Roadmap + +### Version 0.1.0 + +A first version of the package that allows users to: + +* Connect to a LaminDB instance. +* List all records in a registry. +* Fetch one record by ID or UID. +* Cache S3 artifacts locally. +* Load AnnData artifacts. + +### Version 0.2.0 + +* Expand query functionality with comparators, relationships, and pagination. +* Implement basic data and metadata management features (create, save, load artifacts). +* Expand support for different data formats and storage backends. + +### Version 0.3.0 + +* Implement code tracking and data lineage visualization. +* Introduce data curation features (validation, standardization, annotation). +* Enhance support for bionty registries and ontology interactions. + +### Future versions + +* Implement advanced features like custom module creation and zero-copy data transfer. +* Continuously improve performance, usability, and documentation. diff --git a/vignettes/laminr.Rmd b/vignettes/laminr.Rmd new file mode 100644 index 0000000..1a9d187 --- /dev/null +++ b/vignettes/laminr.Rmd @@ -0,0 +1,149 @@ +--- +title: "Getting Started" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Getting Started} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette provides a practical introduction to using the **{laminr}** package to interact with LaminDB. We'll start with a brief overview of key concepts and then walk through the basic steps to connect to a LaminDB instance and work with its core components. + +## Key Concepts in LaminDB + +Before diving into the practical usage of **{laminr}**, it's helpful to understand some core concepts in LaminDB. For a more detailed explanation, refer to the Architecture vignette (`vignette("architecture", package = "laminr")`). 
+ +* **Instance**: A LaminDB instance is a self-contained environment for storing and managing data and metadata. Think of it like a database or a project directory. Each instance has its own schema, storage location, and metadata database. +* **Module**: A module is a collection of related registries that provide specific functionality. For example, the core module contains essential registries for general data management, while the bionty module provides registries for biological entities like genes and proteins. +* **Registry**: A registry is a centralized collection of related records, similar to a table in a database. Each registry holds a specific type of metadata, such as information about artifacts, transforms, or features. +* **Record**: A record is a single entry within a registry, analogous to a row in a database table. Each record represents a specific entity and combines multiple fields of information. +* **Field**: A field is a single piece of information within a record, like a column in a database table. For example, an artifact record might have fields for its name, description, and creation date. + +## Initial setup + +Now, let's set up your environment to use **{laminr}**. + +### Python setup + +1. Install the `lamindb` Python package. + + ```bash + pip install lamindb[aws] + ``` + +2. Connect to a LaminDB instance: + + ```bash + lamin connect laminlabs/cellxgene + ``` + +### R setup + +1. Install the **{laminr}** package. + + ```r + # Install the remotes package if needed + if (!requireNamespace("remotes", quietly = TRUE)) { + install.packages("remotes") + } + remotes::install_github("laminlabs/laminr") + ``` + +2. (Optional) Install suggested dependencies. + + ```r + remotes::install_github("laminlabs/laminr", dependencies = TRUE) + ``` + + This includes packages like **{anndata}** for working + with AnnData objects and **{s3}** for interacting with S3 + storage. 
+ +## Connecting to LaminDB from R + +Connect to the `laminlabs/cellxgene` instance from your R session: + +```{r connect} +library(laminr) + +db <- connect("laminlabs/cellxgene") +``` + +The `db` object now represents your connection to the LaminDB +instance. You can explore the available registries (like `Artifact`, +`Collection`, `Feature`, etc.) by simply printing the `db` object: + +```{r print_instance} +db +``` + +These registries correspond to [Python classes in LaminDB](https://docs.lamin.ai/lamindb). + +To access registries within specific modules, use the `$` operator. For example, to access the bionty module: + +```{r get_module} +db$bionty +``` + +The registries in `bionty` and other modules also have corresponding [Python classes](https://docs.lamin.ai/bionty). + +## Working with registries + +Let's use the `Artifact` registry as an example. This registry stores datasets, models, and other data entities. + +To see the available functions for the `Artifact` registry, print the registry object: + +```{r get_artifact_registry} +db$Artifact +``` + +You can fetch a specific artifact using its ID or UID. For instance, to get the artifact with UID [KBW89Mf7IGcekja2hADu](https://lamin.ai/laminlabs/cellxgene/artifact/KBW89Mf7IGcekja2hADu): + +```{r get_artifact} +artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu") +``` + +This artifact contains an `AnnData` object with myeloid cell data. 
You can view its metadata: + +```{r print_artifact} +artifact +``` + +Or get more detailed information: + +```{r describe_artifact} +artifact$describe() +``` + +Access specific fields of the artifact using the `$` operator: + +```{r access_fields} +artifact$id +artifact$uid +artifact$key +``` + +You can also access related data: + +```{r access_related_data} +artifact$storage +artifact$created_by +``` + +Finally, you can download the actual data associated with the artifact: + +```{r cache_artifact} +artifact$cache() # Cache the data locally +artifact$load() # Load the data into memory +``` + +:::{.callout-note} +Currently, laminr primarily supports S3 storage and AnnData objects. Support for other storage backends and data formats will be added in the future. For more information related to planned features and the roadmap, please refer to the Development vignette (`vignette("development", package = "laminr")`). +::: diff --git a/vignettes/module_bionty.Rmd b/vignettes/module_bionty.Rmd new file mode 100644 index 0000000..3a062e0 --- /dev/null +++ b/vignettes/module_bionty.Rmd @@ -0,0 +1,40 @@ +--- +title: "Bionty Module" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Bionty Module} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette provides documentation for the bionty module in **LaminDB**, which offers specialized registries for managing biological entities. These registries are linked to public ontologies, providing a standardized way to represent and work with common biological concepts. + +For reference, here is the documentation on the [bionty module](https://docs.lamin.ai/bionty) for the LaminDB Python package. + +## What is `bionty`? + +The bionty module extends LaminDB with registries for entities like genes, proteins, cell types, and more. 
It leverages public ontologies to ensure data consistency and interoperability. Key features of bionty include: + +* **Ontology Integration**: Connect to public ontologies like NCBI Taxonomy, Ensembl, UniProt, Cell Ontology, and others. +* **Hierarchical Relationships**: Represent and navigate relationships between entities (e.g., parent-child relationships in ontologies). +* **Synonym Management**: Handle synonyms and abbreviations for biological entities. +* **Versioning**: Track changes in ontologies and maintain historical versions. + + +```{r generate_docs, echo = FALSE} +library(laminr) +library(purrr) + +db <- connect("laminlabs/lamindata") + +docs <- laminr:::generate_module_markdown(db, "bionty", c("core", "bionty")) + +knitr::asis_output(docs) +``` diff --git a/vignettes/module_core.Rmd b/vignettes/module_core.Rmd new file mode 100644 index 0000000..ffd1ebd --- /dev/null +++ b/vignettes/module_core.Rmd @@ -0,0 +1,42 @@ +--- +title: "Core Module" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Core Module} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette provides documentation for the core module available within any LaminDB instance. Unlike traditional R packages with a fixed set of functions, LaminDB allows customization through modules and extensions. This means the specific registries and their fields in your LaminDB instance are determined by its schema. + +For reference, here is the documentation on the [core module](https://docs.lamin.ai/api#) for the LaminDB Python package. + +## Key Concepts + +In **LaminDB**, data and metadata are organized using a system of registries and modules. + + * **Registries**: Centralized collections of related records, similar to database tables. Each registry stores specific types of metadata (e.g., artifacts, transforms, features). 
+ * **Modules**: Groupings of related registries that provide domain-specific functionality. The core module is fundamental to all LaminDB instances and includes essential registries for general data management. Other modules (like bionty for biological entities) can be added to extend functionality. + * **Records and Fields**: A record is a single entry within a registry, analogous to a row in a database table. Each record comprises multiple fields, which are individual pieces of information within the record. + +For a more comprehensive explanation of the **LaminDB** concepts and **{laminr}**'s architecture, refer to the **Architecture vignette**: `vignette("architecture", package = "laminr")`. + +To learn how to connect to a LaminDB instance and perform basic operations, see the **Getting started** vignette: `vignette("laminr", package = "laminr")`. + +```{r generate_docs, echo = FALSE} +library(laminr) +library(purrr) + +db <- connect("laminlabs/lamindata") + +docs <- laminr:::generate_module_markdown(db, "core", "core") + +knitr::asis_output(docs) +``` diff --git a/vignettes/module_wetlab.Rmd b/vignettes/module_wetlab.Rmd new file mode 100644 index 0000000..15e5529 --- /dev/null +++ b/vignettes/module_wetlab.Rmd @@ -0,0 +1,42 @@ +--- +title: "Wetlab Module" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Wetlab Module} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette introduces the wetlab module in **LaminDB**, designed specifically for managing metadata associated with wetlab experiments. This module provides a structured and standardized way to represent various aspects of your experimental workflows, ensuring data consistency and facilitating reproducibility. + +For reference, here is the documentation on the [wetlab module](https://docs.lamin.ai/wetlab) for the LaminDB Python package. 
+ +## What is `wetlab`? + +The wetlab module extends LaminDB with specialized registries for capturing essential information about your wetlab experiments. These registries include: + +* **Experiments**: Represent overall experiments with details like objectives, design, and timelines. +* **Biosamples**: Capture information about biological specimens used in experiments (e.g., tissue, cells, blood). +* **Techsamples**: Represent processed or derived samples created from raw biological materials. +* **Treatments**: Model various types of treatments applied to samples, including compound treatments, environmental perturbations, and genetic modifications. +* **Treatment Targets**: Specify the targets of treatments, such as genes, proteins, or pathways. +* **Wells**: Represent individual wells in microplates or other experimental setups. + + +```{r generate_docs, echo = FALSE} +library(laminr) +library(purrr) + +db <- connect("laminlabs/lamindata") + +docs <- laminr:::generate_module_markdown(db, "wetlab", c("core", "bionty", "wetlab")) + +knitr::asis_output(docs) +``` diff --git a/vignettes/usage.Rmd b/vignettes/usage.Rmd deleted file mode 100644 index 09d0ae3..0000000 --- a/vignettes/usage.Rmd +++ /dev/null @@ -1,135 +0,0 @@ ---- -title: "Usage" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{Usage} - %\VignetteEncoding{UTF-8} - %\VignetteEngine{knitr::rmarkdown} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -``` - -LaminDB is an open-source data framework for biology. You can find out about some of its features in the [documentation of the lamindb Python package](https://docs.lamin.ai/introduction). - -This vignette will show you how to use the `laminr` package to interact with LaminDB. - -## Initial setup - -### Python package - -As part of a first-time set up, you will need to install the Python `lamin-cli` package, and set up an instance for first use. 
- -```bash -pip install lamin-cli -lamin connect laminlabs/cellxgene -``` - -### R package - -You will also need to install the `laminr` R package. - -```{r install, eval = FALSE} -# Install the remotes package if needed -if (!requireMethods("remotes", quiety = TRUE)) { - install.packages("remotes") -} -remotes::install_github("laminlabs/laminr") -``` - -#### Suggested dependencies - -Some functionality requires additional suggested dependencies. -You will be prompted to install these packages when needed, or you can install them in advance. -Setting `dependencies = TRUE` will install all suggested packages. - -```{r install-suggested, eval = FALSE} -remotes::install_github("laminlabs/laminr", dependencies = TRUE) -``` - -Or individual suggested packages can be installed with `install.packages()`. - -Suggested dependecies: - -- [anndata](https://cran.r-project.org/package=anndata) - Loading and saving `AnnData` objects -- [s3](https://cran.r-project.org/package=s3) - Caching objects to/from S3 storage - -## Connect to a LaminDB instance - -This vignette uses the [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) instance, which is a LaminDB instance that interfaces the CELLxGENE data. - -You can connect to the instance using the `connect` R function: - -```{r connect} -library(laminr) - -db <- connect("laminlabs/cellxgene") -``` - -By printing the instance, you can see which registries are available, including Artifact, Collection, Feature, etc. Each of these registries have a corresponding [Python class](https://docs.lamin.ai/lamindb). - -```{r print_instance} -db -``` - -All of the 'core' registries are directly available from the `db` object, while registries from other modules can be accessed via `db$`, e.g.: - -```{r get_module} -db$bionty -``` - -The `bionty` and other registries also have corresponding [Python classes](https://docs.lamin.ai/bionty). - -## Registry - -A registry is used to query, store and manage data. 
For instance, the `Artifact` registry stores datasets and models as files, folders, or arrays. - -You can see which functions you can use to interact with the registry by printing the registry object: - -```{r get_artifact_registry} -db$Artifact -``` - -For instance, you can fetch an Artifact by ID or UID. For example, Artifact [KBW89Mf7IGcekja2hADu](https://lamin.ai/laminlabs/cellxgene/artifact/KBW89Mf7IGcekja2hADu) is an AnnData object containing myeloid cells. - -```{r get_artifact} -artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu") -``` - -You can view its metadata by printing the object: - -```{r print_artifact} -artifact -``` - -Or get more detailed information by calling the `$describe()` method: - -```{r describe_artifact} -artifact$describe() -``` - -You can access its fields as follows: - -* `artifact$id`: `r artifact$id` -* `artifact$uid`: `r artifact$uid` -* `artifact$key`: `r artifact$key` - -Or fetch data from related registries: - -* `artifact$root`: `r artifact$storage$to_string()` -* `artifact$created_by`: `r artifact$created_by$to_string()` - -Finally, for Artifact objects, you can directly fetch or download the data using `$cache()` and `$load()`, respectively. - -```{r cache_artifact} -artifact$cache() -artifact$load() -``` - -:::{.callout-note} -Only S3 storage and AnnData accessors are supported at the moment. If additional storage and data accessors are desired, please open an issue on the [laminr GitHub repository](https://github.com/laminlabs/laminr/issues). -:::