
adds codespell (#155)
* adds codespell

* remove notebooks

* remove notebook

* update to ignore ipynb pics

* fix typos

* adds release note
Intron7 authored Mar 27, 2024
1 parent d8ed247 commit a187f4a
Showing 26 changed files with 90 additions and 57 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/codespell.yml
@@ -0,0 +1,22 @@
---
name: Codespell

on:
push:
branches: [main]
pull_request:
branches: [main]

permissions:
contents: read

jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4
- name: Codespell
uses: codespell-project/actions-codespell@v2
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -25,3 +25,9 @@ repos:
- id: no-commit-to-branch
args: [--branch=main]
- id: detect-private-key
- repo: https://github.com/codespell-project/codespell
rev: v2.2.6
hooks:
- id: codespell
additional_dependencies:
- tomli
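
With the CI workflow and the hook above in place, the same check can be reproduced locally. A minimal sketch, assuming `codespell` and `pre-commit` are installed in the environment (the commands are standard CLI usage, not part of this diff):

```python
# Minimal sketch: reproduce the spell check locally. Assumes codespell
# and pre-commit are installed (e.g. pip install codespell pre-commit tomli).
import subprocess

# codespell reads [tool.codespell] from pyproject.toml when tomli is
# available (or on Python >= 3.11) -- hence the additional_dependencies.
subprocess.run(["codespell", "."], check=False)

# Equivalently, run the hook defined above through pre-commit.
subprocess.run(["pre-commit", "run", "codespell", "--all-files"], check=False)
```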
2 changes: 1 addition & 1 deletion docs/Installation.md
@@ -24,7 +24,7 @@ It is important to ensure that the CUDA environment is set up correctly so that
To view a full guide how to set up a fully functioned single cell GPU accelerated conda environment visit [GPU_SingleCell_Setup](https://github.com/Intron7/GPU_SingleCell_Setup)


# GPU-Memory and System Requierments
# GPU-Memory and System Requirements

*rapids-singlecell* relays for most computation on the GPU. A GPU with sufficient VRAM is therefore required to handle large datasets.
With a RTX 3090 it's possible to analyze 200000 cells without any issues. With an A100 80GB it is even possible to analyze more than 1000000. For even larger datasets, {mod}`~rmm` is required to oversubscribe GPU memory into host memory, similar to SWAP memory. However, using `managed_memory` can result in a performance penalty, but this is still preferable to CPU runtimes.
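
The prose above only names `managed_memory`; as an illustration, a setup along these lines oversubscribes GPU memory before any data is loaded. This is a hedged sketch based on RMM's documented API, not part of this diff:

```python
# Hedged sketch: enable RMM managed memory so GPU allocations can spill
# into host memory (unified memory), as described above.
import cupy as cp
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.reinitialize(
    managed_memory=True,   # oversubscribe VRAM, similar to SWAP memory
    pool_allocator=False,  # a pool can speed things up at the cost of VRAM
)
cp.cuda.set_allocator(rmm_cupy_allocator)  # route CuPy allocations through RMM
```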
2 changes: 1 addition & 1 deletion docs/api/decoupler_gpu.md
@@ -1,6 +1,6 @@
# decoupler-GPU: `dcg`

{mod}`decoupler` contains different statistical methods to extract biological activities. {mod}`rapids_singlecell.dcg` acclerates some of these methods.
{mod}`decoupler` contains different statistical methods to extract biological activities. {mod}`rapids_singlecell.dcg` accelerates some of these methods.

```{eval-rst}
.. module:: rapids_singlecell.dcg
2 changes: 1 addition & 1 deletion docs/api/scanpy_gpu.md
@@ -1,6 +1,6 @@
# scanpy-GPU

These functions offer accelerated near drop-in replacements for common tools porvided by [`scanpy`](https://scanpy.readthedocs.io/en/stable/api/index.html).
These functions offer accelerated near drop-in replacements for common tools provided by [`scanpy`](https://scanpy.readthedocs.io/en/stable/api/index.html).

## Preprocessing `pp`
Filtering of highly-variable genes, batch-effect correction, per-cell normalization.
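
As an illustration of the near drop-in pattern this page describes, a hedged sketch (the example dataset and exact calls are assumptions, not taken from this diff; `anndata_to_GPU`/`anndata_to_CPU` are referenced in the release notes below):

```python
# Hedged sketch: move the AnnData to GPU memory, then use rsc.pp where
# scanpy's sc.pp would otherwise be called.
import scanpy as sc
import rapids_singlecell as rsc

adata = sc.datasets.pbmc3k()                   # small public example dataset
rsc.get.anndata_to_GPU(adata)                  # move .X into GPU memory
rsc.pp.normalize_total(adata, target_sum=1e4)  # per-cell normalization
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata, n_top_genes=2000)
rsc.get.anndata_to_CPU(adata)                  # move results back when done
```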
2 changes: 1 addition & 1 deletion docs/api/squidpy_gpu.md
@@ -1,6 +1,6 @@
# squidpy-GPU: `gr`

{mod}`squidpy.gr` is a tool for the analysis of spatial molecular data. {mod}`rapids_singlecell.gr` acclerates some of these functions.
{mod}`squidpy.gr` is a tool for the analysis of spatial molecular data. {mod}`rapids_singlecell.gr` accelerates some of these functions.

```{eval-rst}
.. module:: rapids_singlecell.gr
8 changes: 4 additions & 4 deletions docs/notebooks/demo_gpu-PR.ipynb
@@ -14,7 +14,7 @@
"id": "comic-moses",
"metadata": {},
"source": [
"To run this notebook please make sure you have a working rapids enviroment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of 500000 brain cells from [Nvidia](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_cpu_analysis.ipynb)."
"To run this notebook please make sure you have a working rapids environment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of 500000 brain cells from [Nvidia](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_cpu_analysis.ipynb)."
]
},
{
@@ -490,7 +490,7 @@
"id": "arctic-upgrade",
"metadata": {},
"source": [
"Now we safe this verion of the AnnData as adata.raw."
"Now we safe this version of the AnnData as adata.raw."
]
},
{
@@ -777,7 +777,7 @@
"tags": []
},
"source": [
"## Clustering and Visulization"
"## Clustering and Visualization"
]
},
{
@@ -795,7 +795,7 @@
"source": [
"Next we compute the neighborhood graph using rsc.\n",
"\n",
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the excat graph. Both methods are valid, but you might see differences."
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the exact graph. Both methods are valid, but you might see differences."
]
},
{
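
The notebooks repeat this exact-versus-approximate point, so one hedged sketch of the step in question suffices (function names follow the scanpy-mirroring API; the parameter values are assumptions):

```python
# Hedged sketch: the GPU version computes an exact kNN graph, while
# scanpy's CPU default uses an approximate search, so downstream results
# (e.g. clusterings) may differ slightly.
import rapids_singlecell as rsc

rsc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)  # assumes PCA was run first
rsc.tl.leiden(adata, resolution=1.0)               # cluster on the exact graph
```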
12 changes: 6 additions & 6 deletions docs/notebooks/demo_gpu-seuratv3-brain-1M.ipynb
@@ -14,7 +14,7 @@
"id": "fda0ac25-cdbc-451f-84a9-d56a65fec2c0",
"metadata": {},
"source": [
"To run this notebook please make sure you have a working enviroment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of 1000000 brain cells from [Nvidia](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_cpu_analysis.ipynb)."
"To run this notebook please make sure you have a working environment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of 1000000 brain cells from [Nvidia](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_cpu_analysis.ipynb)."
]
},
{
@@ -640,7 +640,7 @@
"id": "96c3d84b-a950-4a75-a303-dbbedafe4b40",
"metadata": {},
"source": [
"Now we safe this verion of the AnnData as adata.raw."
"Now we safe this version of the AnnData as adata.raw."
]
},
{
@@ -717,7 +717,7 @@
"id": "0f8f3372-ac66-4704-bfa7-8b0ec685eec7",
"metadata": {},
"source": [
"Next we regess out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
"Next we regress out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
]
},
{
@@ -897,7 +897,7 @@
"id": "first-reggae",
"metadata": {},
"source": [
"## Clustering and Visulization"
"Visualization## Clustering and Visualization"
]
},
{
@@ -915,7 +915,7 @@
"source": [
"Next we compute the neighborhood graph using rsc.\n",
"\n",
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the excat graph. Both methods are valid, but you might see differences."
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the exact graph. Both methods are valid, but you might see differences."
]
},
{
@@ -1230,7 +1230,7 @@
"id": "informational-dealer",
"metadata": {},
"source": [
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other fuctions. "
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other functions. "
]
},
{
14 changes: 7 additions & 7 deletions docs/notebooks/demo_gpu-seuratv3.ipynb
@@ -20,7 +20,7 @@
"id": "comic-moses",
"metadata": {},
"source": [
"To run this notebook please make sure you have a working rapids enviroment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
"To run this notebook please make sure you have a working rapids environment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
]
},
{
@@ -634,7 +634,7 @@
"id": "arctic-upgrade",
"metadata": {},
"source": [
"Now we safe this verion of the AnnData as adata.raw."
"Now we safe this version of the AnnData as adata.raw."
]
},
{
@@ -713,7 +713,7 @@
"id": "seventh-liquid",
"metadata": {},
"source": [
"Next we regess out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
"Next we regress out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
]
},
{
@@ -944,7 +944,7 @@
"id": "first-reggae",
"metadata": {},
"source": [
"## Clustering and Visulization"
"## Clustering and Visualization"
]
},
{
@@ -962,7 +962,7 @@
"source": [
"Next we compute the neighborhood graph using rsc.\n",
"\n",
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the excat graph. Both methods are valid, but you might see differences."
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the exact graph. Both methods are valid, but you might see differences."
]
},
{
@@ -1127,7 +1127,7 @@
"id": "ed1a5b70-54e0-4a22-83a8-e1903c5c7205",
"metadata": {},
"source": [
"We also caluclate the embedding density in the UMAP using cuML"
"We also calculate the embedding density in the UMAP using cuML"
]
},
{
@@ -1584,7 +1584,7 @@
"id": "informational-dealer",
"metadata": {},
"source": [
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other fuctions. "
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other functions. "
]
},
{
12 changes: 6 additions & 6 deletions docs/notebooks/demo_gpu.ipynb
@@ -14,7 +14,7 @@
"id": "comic-moses",
"metadata": {},
"source": [
"To run this notebook please make sure you have a working rapids enviroment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
"To run this notebook please make sure you have a working rapids environment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
]
},
{
@@ -520,7 +520,7 @@
"id": "arctic-upgrade",
"metadata": {},
"source": [
"Now we safe this verion of the AnnData as adata.raw. "
"Now we safe this version of the AnnData as adata.raw. "
]
},
{
@@ -576,7 +576,7 @@
"id": "seventh-liquid",
"metadata": {},
"source": [
"Next we regess out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
"Next we regress out effects of counts per cell and the mitochondrial content of the cells. As you can with scanpy you can use every numerical column in `.obs` for this."
]
},
{
@@ -695,7 +695,7 @@
"id": "first-reggae",
"metadata": {},
"source": [
"## Clustering and Visulization"
"## Clustering and Visualization"
]
},
{
@@ -778,7 +778,7 @@
"source": [
"Next we compute the neighborhood graph using rsc.\n",
"\n",
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the excat graph. Both methods are valid, but you might see differences."
"Scanpy CPU implementation of nearest neighbor uses an approximation, while the GPU version calculates the exact graph. Both methods are valid, but you might see differences."
]
},
{
@@ -1343,7 +1343,7 @@
"id": "informational-dealer",
"metadata": {},
"source": [
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other fuctions. "
"After this you can use `X_diffmap` for `sc.pp.neighbors` and other functions. "
]
},
{
2 changes: 1 addition & 1 deletion docs/notebooks/ligrec_benchmark.ipynb
@@ -14,7 +14,7 @@
"id": "comic-moses",
"metadata": {},
"source": [
"To run this notebook please make sure you have a working rapids enviroment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
"To run this notebook please make sure you have a working rapids environment with all nessaray dependencies. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow we'll be looking at a dataset of ca. 90000 cells from [Quin et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0)."
]
},
{
6 changes: 3 additions & 3 deletions docs/release-notes/0.10.0.md
@@ -7,15 +7,15 @@
* switch `utils` functions to `get` {pr}`100` {smaller}`S Dicks`
* added `get.aggregated` to create condensed `anndata` objects {pr}`100` {smaller}`S Dicks`
* added `pp.scrublet` and `pp.scrublet_simulate_doublets` {pr}`129` {smaller}`S Dicks`
* adds the option to return a copyed `AnnData` for `get.anndata_to_CPU` & `get.anndata_to_GPU` {pr}`134` {smaller}`S Dicks`
* adds the option to return a copied `AnnData` for `get.anndata_to_CPU` & `get.anndata_to_GPU` {pr}`134` {smaller}`S Dicks`
* adds `mask` argument to `pp.scale` and `pp.pca` {pr}`135` {smaller}`S Dicks`
* adds the option to run `pp.scale` on sparse matrixes `zero_center = False` without densification {pr}`135` {smaller}`S Dicks`
* updated `ruff` and now requiers paramaters by name/keyword in all public APIs {pr}`140` {smaller}`S Dicks`
* updated `ruff` and now requires parameters by name/keyword in all public APIs {pr}`140` {smaller}`S Dicks`
* adds the option to run `pp.harmony` with `np.float32` {pr}`145` {smaller}`S Dicks`

```{rubric} Bug fixes
```
* Fixes an issue where `pp.normalize` and `pp.log1p` now use `copy` and `inplace` corretly {pr}`129` {smaller}`S Dicks`
* Fixes an issue where `pp.normalize` and `pp.log1p` now use `copy` and `inplace` correctly {pr}`129` {smaller}`S Dicks`
* changes the graph constructor for `tl.leiden` and `tl.louvain` {pr}`143` {smaller}`S Dicks`
* Added a test to handle zero features, that caused issues in the sparse `pp.pca` {pr}`144` {smaller}`S Dicks`
* Added a test to check if sparse matrices are in `canonical format`. For now this only affects `pp.highly_variable_genes`, `pp.scale` and `pp.normalize_pearson_residuals`. {pr}`146` {smaller}`S Dicks`
1 change: 1 addition & 0 deletions docs/release-notes/0.10.1.md
@@ -10,3 +10,4 @@
```{rubric} Misc
```
* Updates CI to work with `uv` {pr}`149` {smaller}`S Dicks`
* Adds `Codespell` {pr}`155` {smaller}`S Dicks`
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -108,3 +108,7 @@ source = "vcs"

[tool.hatch.build.targets.wheel]
packages = ['src/rapids_singlecell']

[tool.codespell]
skip = '*.ipynb,*.csv'
ignore-words-list = "nd"
4 changes: 2 additions & 2 deletions src/rapids_singlecell/decoupler_gpu/_method_mlm.py
@@ -17,7 +17,7 @@ def fit_mlm(X, y, inv, df):
coef, sse, _, _ = cp.linalg.lstsq(X, y, rcond=-1)
if len(sse) == 0:
raise ValueError(
"""Couldn\'t fit a multivariate linear model. This can happen because there are more sources
"""Couldn't fit a multivariate linear model. This can happen because there are more sources
(covariates) than unique targets (samples), or because the network\'s matrix rank is smaller than the number of
sources."""
)
@@ -95,7 +95,7 @@ def run_mlm(
weight
Column name in net with weights.
batch_size
Size of the samples to use for each batch. Increasing this will consume more memmory but it will run faster.
Size of the samples to use for each batch. Increasing this will consume more memory but it will run faster.
min_n
Minimum of targets per source. If less, sources are removed.
verbose
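
To make the `batch_size` trade-off concrete, a hedged call sketch (the `adata`/`net` inputs and the decoupler-style column conventions are assumptions, not part of this diff):

```python
# Hedged sketch of the batch_size trade-off: larger batches use more GPU
# memory but run faster, per the docstring above.
import rapids_singlecell as rsc

rsc.dcg.run_mlm(
    mat=adata,          # expression data (assumed AnnData or matrix)
    net=net,            # decoupler-style network with source/target/weight
    batch_size=10_000,  # raise if VRAM allows, lower on smaller GPUs
    min_n=5,            # drop sources with fewer than 5 targets
    verbose=True,
)
```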
4 changes: 2 additions & 2 deletions src/rapids_singlecell/decoupler_gpu/_method_wsum.py
@@ -15,7 +15,7 @@ def run_perm(mat, net, idxs, times, seed):
net = cp.array(net)
estimate = mat.dot(net)
cp.random.seed(seed)
# Init null distirbution
# Init null distribution
null_dst = cp.zeros((mat.shape[0], net.shape[1], times), dtype=np.float32)
pvals = cp.zeros((mat.shape[0], net.shape[1]), dtype=np.float32)

@@ -125,7 +125,7 @@ def run_wsum(
times
How many random permutations to do.
batch_size
Size of the batches to use. Increasing this will consume more memmory but it will run faster.
Size of the batches to use. Increasing this will consume more memory but it will run faster.
min_n
Minimum of targets per source. If less, sources are removed.
seed
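
For the permutation scheme above, a hedged sketch of the knobs the docstring documents (`times`, `batch_size`, `min_n`, `seed`); the `rsc.dcg` export path and the `adata`/`net` inputs are assumptions:

```python
# Hedged sketch: wsum scores plus an empirical p-value from a null
# distribution built out of `times` random permutations (see run_perm).
import rapids_singlecell as rsc

rsc.dcg.run_wsum(
    mat=adata,
    net=net,
    times=1000,         # permutations used to build the null distribution
    batch_size=10_000,  # memory/speed trade-off, as documented above
    min_n=5,
    seed=42,            # fixes the permutation RNG for reproducibility
)
```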
10 changes: 5 additions & 5 deletions src/rapids_singlecell/preprocessing/_hvg.py
@@ -39,9 +39,9 @@
Annotate highly variable genes.
Expects logarithmized data, except when `flavor='seurat_v3','pearson_residuals','poisson_gene_selection'`, in which count data is expected.
Reimplentation of scanpy's function.
Reimplementation of scanpy's function.
Depending on flavor, this reproduces the R-implementations of Seurat, Cell Ranger, Seurat v3 and Pearson Residuals.
Flavor `poisson_gene_selection` is an implementation of scvi, which is based on M3Drop. It requiers gpu accelerated pytorch to be installed.
Flavor `poisson_gene_selection` is an implementation of scvi, which is based on M3Drop. It requires gpu accelerated pytorch to be installed.
For these dispersion-based methods, the normalized dispersion is obtained by scaling
with the mean and standard deviation of the dispersions for genes falling into a given
@@ -98,7 +98,7 @@
Returns
-------
upates `adata.var` with the following fields:
updates `adata.var` with the following fields:
`highly_variable` : bool
boolean indicator of highly-variable genes
@@ -716,7 +716,7 @@ def _poisson_gene_selection(
This is based on M3Drop: https://github.com/tallulandrews/M3Drop
The method accounts for library size internally, a raw count matrix should be provided.
Instead of Z-test, enrichment of zeros is quantified by posterior
probabilites from a binomial model, computed through sampling.
probabilities from a binomial model, computed through sampling.
Parameters
----------
@@ -731,7 +731,7 @@
of enrichment of zeros for each gene.
batch_key
key in adata.obs that contains batch info. If None, do not use batch info.
Defatult: ``None``.
Default: ``None``.
minibatch_size
Size of temporary matrix for incremental calculation. Larger is faster but
requires more RAM or GPU memory. (The default should be fine unless
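
A hedged usage sketch of the flavor dispatch this docstring describes (the dataset and the batch column name are assumptions):

```python
# Hedged sketch: flavor="seurat_v3" expects raw counts, while the
# dispersion-based default expects logarithmized data, per the docstring.
import rapids_singlecell as rsc

rsc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    flavor="seurat_v3",  # count-data flavor
    batch_key="sample",  # assumed .obs column; omit to ignore batches
)
print(adata.var["highly_variable"].sum())  # boolean field added per "Returns"
```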
2 changes: 1 addition & 1 deletion src/rapids_singlecell/preprocessing/_regress_out.py
@@ -43,7 +43,7 @@ def regress_out(
batchsize
Number of genes that should be processed together. \
If `'all'` all genes will be processed together if `.n_obs` <100000. \
If `None` each gene will be analysed seperatly. \
If `None` each gene will be analysed separately. \
Will be ignored if cuML version < 22.12
verbose
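
Finally, a hedged sketch of the batching behavior this docstring describes (the `.obs` column names and the `keys` argument follow scanpy's mirrored API and are assumptions):

```python
# Hedged sketch: regress out continuous covariates stored in .obs.
# batchsize=None analyses each gene separately; batchsize="all" groups
# all genes when n_obs < 100000, per the docstring above.
import rapids_singlecell as rsc

rsc.pp.regress_out(
    adata,
    keys=["total_counts", "pct_counts_mt"],  # assumed numerical .obs columns
    batchsize=100,                           # genes processed together per batch
)
```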