[BUG] umap needs init_pos = 'random' for larger datasets #196

bkmartinjr · 2024-05-16T19:41:29Z

Describe the bug

rapids_singlecell.tl.umap will occasionally generate embeddings with spuriously large values (e.g., a small number of points that are very large).

Details: I start with a slice of the Tabula Muris data, and follow the standard workflow for generating a UMAP embedding: filter, highly_variable_genes, pca, neighbors, umap. When I use the ScanPy CPU implementation of UMAP, things look as expected. However, if I use the RSC tl.umap, I get embeddings which have a small number of points with very large values (i.e., appear as small, very dispersed clusters).

This occurs regardless of which implementation I use for the other steps (e.g., PCA) - only the call to umap appears to vary in its behavior.

Example plot using scanpy.tl.umap:

Example plot using rapids_singlecell.tl.umap:

Running numpy.histogram on the embedding shows the effect. Scanpy generated umap, points are well distributed:

# bin counts
array([  495,  1192,  2037,  6329,  7213,  7952, 11369, 12860,  8011,
        7390,  6820,  7103,  7795,  6455,  5127,  3900,  2394,  1746,
         347,    15])
# bin edges
array([-11.400737  ,  -9.587617  ,  -7.7744966 ,  -5.961376  ,
        -4.148256  ,  -2.3351357 ,  -0.52201545,   1.2911048 ,
         3.104225  ,   4.917345  ,   6.7304654 ,   8.543586  ,
        10.356706  ,  12.1698265 ,  13.982946  ,  15.796066  ,
        17.609186  ,  19.422306  ,  21.235428  ,  23.048548  ,
        24.861668  ], dtype=float32)

RSC generated umap, small number of outliers:

# bin counts
array([    2,     1,     0,     0,     0,     0,     5,     2,     0,
           1, 14762, 91773,     2,     0,     0,     0,     1,     0,
           0,     1])
# bin edges
array([-1622.5352 , -1475.78   , -1329.025  , -1182.2699 , -1035.5148 ,
        -888.75977,  -742.00464,  -595.2496 ,  -448.49448,  -301.7394 ,
        -154.98431,    -8.22923,   138.52585,   285.28094,   432.036  ,
         578.79114,   725.5462 ,   872.3013 ,  1019.05634,  1165.8114 ,
        1312.5665 ], dtype=float32)

I have found that this issue only occurs with specific data. For example, some slices of the above dataset generate good embeddings, others do not. For datasets where it fails, it fails reliably (so it appears to be something in the dataset or parameterization that triggers the bug).

When this occurs, the actual embedding in the central mass of points looks good. It is only the additional outliers which are problematic.

Steps/Code to reproduce bug

Installed on Linux (popos/ubuntu), using mamba and the rsc_rapids_24.04.yml recipe. I have a sample dataset that reliably fails and which I can make available.

Test code:

import rapids_singlecell as rsc
import scanpy as sc
import numpy as np

adata = sc.read_h5ad("fooz.h5ad")

def compute_embedding(adata: sc.AnnData):
    # config
    n_neighbors = 15
    n_top_genes = 5000
    min_genes = 10
    min_cells = 10

    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    if adata.n_obs == 0 or adata.n_vars == 0:
        raise ValueError(f"Empty query result")

    if adata.n_obs <= n_neighbors:
        raise ValueError(f"Insufficient cells to compute KNN")

    rsc.get.anndata_to_GPU(adata)
    rsc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    rsc.tl.pca(adata)
    rsc.pp.neighbors(adata, n_neighbors=n_neighbors)
    rsc.get.anndata_to_CPU(adata)

    adata_CPU = adata.copy()
    sc.tl.umap(adata_CPU)

    adata_GPU = adata.copy()
    rsc.get.anndata_to_GPU(adata_GPU)
    rsc.tl.umap(adata_GPU)
    rsc.get.anndata_to_CPU(adata_GPU)

    return adata_CPU, adata_GPU


def summarize_umap(adata, save):
    print(repr(adata))
    print(f"X_umap, min={adata.obsm['X_umap'].min()}, max={adata.obsm['X_umap'].max()}")
    hist, bins = np.histogram(adata.obsm["X_umap"].flatten(), bins=20)
    print(repr(hist))
    print(repr(bins))
    sc.pl.umap(adata, color="cell_ontology_class", save=save, show=False)
    print()


adata_CPU, adata_GPU = compute_embedding(adata.copy())


print("UMAP via RAPIDS")
summarize_umap(adata_GPU, "gpu.png")

print("UMAP via CPU")
summarize_umap(adata_CPU, "cpu.png")

Environment details (please complete the following information):

Environment location: bare metal Lenovo/intel laptop
Linux Distro/Architecture: PopOS latest (aka Ubuntu 22.04 amd64)
GPU Model/Driver: GTX 3050 (GA 107M)
CUDA: 11.4
Method of Rapids install: mamba, using the rsc_rapids_24.04.yml from GH

pip list

Package                 Version
----------------------- ----------------
aiobotocore             2.12.3
aiohttp                 3.9.5
aioitertools            0.11.0
aiosignal               1.3.1
anndata                 0.10.7
array_api_compat        1.6
asttokens               2.4.1
async-timeout           4.0.3
attrs                   23.2.0
bcc                     0.18.0
blinker                 1.4
botocore                1.34.69
Brlapi                  0.8.3
cellxgene-census        1.13.0
certifi                 2020.6.20
chardet                 4.0.0
chrome-gnome-shell      0.0.0
click                   8.0.3
cloudpickle             2.2.1
colorama                0.4.4
comm                    0.2.2
command-not-found       0.3
contourpy               1.2.1
cryptography            3.4.8
cupshelpers             1.0
cycler                  0.12.1
dbus-python             1.2.18
decorator               5.1.1
defer                   1.0.6
distro                  1.7.0
exceptiongroup          1.2.1
executing               2.0.1
fonttools               4.51.0
frozenlist              1.4.1
fsspec                  2024.3.1
h5py                    3.11.0
hidpidaemon             18.4.6
httplib2                0.20.2
idna                    3.3
importlib-metadata      4.6.4
ipython                 8.24.0
ipywidgets              8.1.2
jedi                    0.19.1
jeepney                 0.7.1
jmespath                1.0.1
joblib                  1.4.2
jupyterlab_widgets      3.0.10
kernelstub              3.1.4
keyring                 23.5.0
kiwisolver              1.4.5
language-selector       0.1
launchpadlib            1.10.16
lazr.restfulclient      0.14.4
lazr.uri                1.0.6
legacy-api-wrap         1.4
llvmlite                0.42.0
louis                   3.20.0
macaroonbakery          1.3.1
matplotlib              3.8.4
matplotlib-inline       0.1.7
more-itertools          8.10.0
multidict               6.0.5
natsort                 8.4.0
netaddr                 0.8.0
netifaces               0.11.0
networkx                3.3
numba                   0.59.1
numpy                   1.26.4
oauthlib                3.2.0
packaging               24.0
pandas                  2.2.2
parso                   0.8.4
patsy                   0.5.6
pexpect                 4.9.0
pillow                  10.3.0
pip                     22.0.2
pop-transition          1.1.2
prompt-toolkit          3.0.43
protobuf                3.12.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyarrow                 16.0.0
pyarrow-hotfix          0.6
pycairo                 1.20.1
pycups                  2.0.1
pydbus                  0.6.0
Pygments                2.18.0
PyGObject               3.42.1
PyJWT                   2.3.0
pymacaroons             0.13.0
PyNaCl                  1.5.0
pynndescent             0.5.12
pyparsing               2.4.7
pyRFC3339               1.1
python-apt              2.4.0+ubuntu3
python-dateutil         2.9.0.post0
python-debian           0.1.43+ubuntu1.1
python-gnupg            0.4.8
python-xlib             0.29
pytz                    2022.1
pyxdg                   0.27
PyYAML                  5.4.1
repolib                 2.2.1
repoman                 1.4.0
requests                2.25.1
s3fs                    2024.3.1
scanpy                  1.10.1
scikit-learn            1.4.2
scipy                   1.13.0
screen-resolution-extra 0.0.0
seaborn                 0.13.2
SecretStorage           3.3.1
session-info            1.0.0
sessioninstaller        0.0.0
setuptools              59.6.0
six                     1.16.0
somacore                1.0.10
stack-data              0.6.3
statsmodels             0.14.2
stdlib-list             0.10.0
systemd-python          234
tblib                   1.7.0
threadpoolctl           3.5.0
tiledb                  0.27.1
tiledb-cloud            0.12.7
tiledbsoma              1.9.5
tqdm                    4.66.4
traitlets               5.14.3
typing_extensions       4.11.0
tzdata                  2024.1
ubuntu-drivers-common   0.0.0
ufw                     0.36.1
umap-learn              0.5.6
urllib3                 1.26.5
wadllib                 1.3.6
wcwidth                 0.2.13
wheel                   0.37.1
widgetsnbextension      4.0.10
wrapt                   1.16.0
xdg                     5
xkit                    0.0.0
yarl                    1.9.4
zipp                    1.0.0

The text was updated successfully, but these errors were encountered:

Intron7 · 2024-05-16T19:45:26Z

Please try setting 'init_pos="random"'. This usually works for me.

bkmartinjr · 2024-05-16T19:51:58Z

Yep, that seems to resolve it. Known issue with spectral init?

Intron7 · 2024-05-16T19:56:49Z

I don't know actually. I have ran into this too for very large datasets (1M or more). I will forward this to the rapids team at NVIDIA.

Intron7 · 2024-05-16T20:02:12Z

I updated the title so people know how to fix it. I'll leave this open for know. I'll also update UMAP so that I has a new default with auto that sets the init_pos to random if you have more than 1M cells.

SNOL2 · 2024-10-09T06:51:24Z

The issue remains when set init_pos = 'random'

Intron7 · 2024-10-09T07:28:44Z

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

SNOL2 · 2024-10-10T06:30:51Z

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

Thanks for your quick reply!

bkmartinjr added the bug Something isn't working label May 16, 2024

Intron7 changed the title ~~[BUG] tl.umap generating embeddings with very large (dispersed) point values~~ [BUG] umap needs init_pos = 'random' for larger datasets May 16, 2024

Intron7 mentioned this issue May 17, 2024

Add auto to umap #197

Merged

Intron7 closed this as completed in #197 May 17, 2024

Intron7 reopened this May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] umap needs init_pos = 'random' for larger datasets #196

[BUG] umap needs init_pos = 'random' for larger datasets #196

bkmartinjr commented May 16, 2024

Intron7 commented May 16, 2024

bkmartinjr commented May 16, 2024

Intron7 commented May 16, 2024

Intron7 commented May 16, 2024

SNOL2 commented Oct 9, 2024

Intron7 commented Oct 9, 2024

SNOL2 commented Oct 10, 2024

[BUG] umap needs init_pos = 'random' for larger datasets #196

[BUG] umap needs init_pos = 'random' for larger datasets #196

Comments

bkmartinjr commented May 16, 2024

Intron7 commented May 16, 2024

bkmartinjr commented May 16, 2024

Intron7 commented May 16, 2024

Intron7 commented May 16, 2024

SNOL2 commented Oct 9, 2024

Intron7 commented Oct 9, 2024

SNOL2 commented Oct 10, 2024