Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] umap needs init_pos = 'random' for larger datasets #196

Open
bkmartinjr opened this issue May 16, 2024 · 7 comments · Fixed by #197
Open

[BUG] umap needs init_pos = 'random' for larger datasets #196

bkmartinjr opened this issue May 16, 2024 · 7 comments · Fixed by #197
Labels
bug Something isn't working

Comments

@bkmartinjr
Copy link

Describe the bug

rapids_singlecell.tl.umap will occasionally generate embeddings with spuriously large values (e.g., a small number of points that are very large).

Details: I start with a slice of the Tabula Muris data, and follow the standard workflow for generating a UMAP embedding: filter, highly_variable_genes, pca, neighbors, umap. When I use the ScanPy CPU implementation of UMAP, things look as expected. However, if I use the RSC tl.umap, I get embeddings which have a small number of points with very large values (i.e., appear as small, very dispersed clusters).

This occurs regardless of which implementation I use for the other steps (e.g., PCA) - only the call to umap appears to vary in its behavior.

Example plot using scanpy.tl.umap:
image

Example plot using rapids_singlecell.tl.umap:
image

Running numpy.histogram on the embedding shows the effect. Scanpy generated umap, points are well distributed:

# bin counts
array([  495,  1192,  2037,  6329,  7213,  7952, 11369, 12860,  8011,
        7390,  6820,  7103,  7795,  6455,  5127,  3900,  2394,  1746,
         347,    15])
# bin edges
array([-11.400737  ,  -9.587617  ,  -7.7744966 ,  -5.961376  ,
        -4.148256  ,  -2.3351357 ,  -0.52201545,   1.2911048 ,
         3.104225  ,   4.917345  ,   6.7304654 ,   8.543586  ,
        10.356706  ,  12.1698265 ,  13.982946  ,  15.796066  ,
        17.609186  ,  19.422306  ,  21.235428  ,  23.048548  ,
        24.861668  ], dtype=float32)

RSC generated umap, small number of outliers:

# bin counts
array([    2,     1,     0,     0,     0,     0,     5,     2,     0,
           1, 14762, 91773,     2,     0,     0,     0,     1,     0,
           0,     1])
# bin edges
array([-1622.5352 , -1475.78   , -1329.025  , -1182.2699 , -1035.5148 ,
        -888.75977,  -742.00464,  -595.2496 ,  -448.49448,  -301.7394 ,
        -154.98431,    -8.22923,   138.52585,   285.28094,   432.036  ,
         578.79114,   725.5462 ,   872.3013 ,  1019.05634,  1165.8114 ,
        1312.5665 ], dtype=float32)

I have found that this issue only occurs with specific data. For example, some slices of the above dataset generate good embeddings, others do not. For datasets where it fails, it fails reliably (so it appears to be something in the dataset or parameterization that triggers the bug).

When this occurs, the actual embedding in the central mass of points looks good. It is only the additional outliers which are problematic.

Steps/Code to reproduce bug

Installed on Linux (popos/ubuntu), using mamba and the rsc_rapids_24.04.yml recipe. I have a sample dataset that reliably fails and which I can make available.

Test code:

import rapids_singlecell as rsc
import scanpy as sc
import numpy as np

adata = sc.read_h5ad("fooz.h5ad")

def compute_embedding(adata: sc.AnnData):
    # config
    n_neighbors = 15
    n_top_genes = 5000
    min_genes = 10
    min_cells = 10

    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    if adata.n_obs == 0 or adata.n_vars == 0:
        raise ValueError(f"Empty query result")

    if adata.n_obs <= n_neighbors:
        raise ValueError(f"Insufficient cells to compute KNN")

    rsc.get.anndata_to_GPU(adata)
    rsc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    rsc.tl.pca(adata)
    rsc.pp.neighbors(adata, n_neighbors=n_neighbors)
    rsc.get.anndata_to_CPU(adata)

    adata_CPU = adata.copy()
    sc.tl.umap(adata_CPU)

    adata_GPU = adata.copy()
    rsc.get.anndata_to_GPU(adata_GPU)
    rsc.tl.umap(adata_GPU)
    rsc.get.anndata_to_CPU(adata_GPU)

    return adata_CPU, adata_GPU


def summarize_umap(adata, save):
    print(repr(adata))
    print(f"X_umap, min={adata.obsm['X_umap'].min()}, max={adata.obsm['X_umap'].max()}")
    hist, bins = np.histogram(adata.obsm["X_umap"].flatten(), bins=20)
    print(repr(hist))
    print(repr(bins))
    sc.pl.umap(adata, color="cell_ontology_class", save=save, show=False)
    print()


adata_CPU, adata_GPU = compute_embedding(adata.copy())


print("UMAP via RAPIDS")
summarize_umap(adata_GPU, "gpu.png")

print("UMAP via CPU")
summarize_umap(adata_CPU, "cpu.png")

Environment details (please complete the following information):

  • Environment location: bare metal Lenovo/intel laptop
  • Linux Distro/Architecture: PopOS latest (aka Ubuntu 22.04 amd64)
  • GPU Model/Driver: GTX 3050 (GA 107M)
  • CUDA: 11.4
  • Method of Rapids install: mamba, using the rsc_rapids_24.04.yml from GH
pip list
Package                 Version
----------------------- ----------------
aiobotocore             2.12.3
aiohttp                 3.9.5
aioitertools            0.11.0
aiosignal               1.3.1
anndata                 0.10.7
array_api_compat        1.6
asttokens               2.4.1
async-timeout           4.0.3
attrs                   23.2.0
bcc                     0.18.0
blinker                 1.4
botocore                1.34.69
Brlapi                  0.8.3
cellxgene-census        1.13.0
certifi                 2020.6.20
chardet                 4.0.0
chrome-gnome-shell      0.0.0
click                   8.0.3
cloudpickle             2.2.1
colorama                0.4.4
comm                    0.2.2
command-not-found       0.3
contourpy               1.2.1
cryptography            3.4.8
cupshelpers             1.0
cycler                  0.12.1
dbus-python             1.2.18
decorator               5.1.1
defer                   1.0.6
distro                  1.7.0
exceptiongroup          1.2.1
executing               2.0.1
fonttools               4.51.0
frozenlist              1.4.1
fsspec                  2024.3.1
h5py                    3.11.0
hidpidaemon             18.4.6
httplib2                0.20.2
idna                    3.3
importlib-metadata      4.6.4
ipython                 8.24.0
ipywidgets              8.1.2
jedi                    0.19.1
jeepney                 0.7.1
jmespath                1.0.1
joblib                  1.4.2
jupyterlab_widgets      3.0.10
kernelstub              3.1.4
keyring                 23.5.0
kiwisolver              1.4.5
language-selector       0.1
launchpadlib            1.10.16
lazr.restfulclient      0.14.4
lazr.uri                1.0.6
legacy-api-wrap         1.4
llvmlite                0.42.0
louis                   3.20.0
macaroonbakery          1.3.1
matplotlib              3.8.4
matplotlib-inline       0.1.7
more-itertools          8.10.0
multidict               6.0.5
natsort                 8.4.0
netaddr                 0.8.0
netifaces               0.11.0
networkx                3.3
numba                   0.59.1
numpy                   1.26.4
oauthlib                3.2.0
packaging               24.0
pandas                  2.2.2
parso                   0.8.4
patsy                   0.5.6
pexpect                 4.9.0
pillow                  10.3.0
pip                     22.0.2
pop-transition          1.1.2
prompt-toolkit          3.0.43
protobuf                3.12.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyarrow                 16.0.0
pyarrow-hotfix          0.6
pycairo                 1.20.1
pycups                  2.0.1
pydbus                  0.6.0
Pygments                2.18.0
PyGObject               3.42.1
PyJWT                   2.3.0
pymacaroons             0.13.0
PyNaCl                  1.5.0
pynndescent             0.5.12
pyparsing               2.4.7
pyRFC3339               1.1
python-apt              2.4.0+ubuntu3
python-dateutil         2.9.0.post0
python-debian           0.1.43+ubuntu1.1
python-gnupg            0.4.8
python-xlib             0.29
pytz                    2022.1
pyxdg                   0.27
PyYAML                  5.4.1
repolib                 2.2.1
repoman                 1.4.0
requests                2.25.1
s3fs                    2024.3.1
scanpy                  1.10.1
scikit-learn            1.4.2
scipy                   1.13.0
screen-resolution-extra 0.0.0
seaborn                 0.13.2
SecretStorage           3.3.1
session-info            1.0.0
sessioninstaller        0.0.0
setuptools              59.6.0
six                     1.16.0
somacore                1.0.10
stack-data              0.6.3
statsmodels             0.14.2
stdlib-list             0.10.0
systemd-python          234
tblib                   1.7.0
threadpoolctl           3.5.0
tiledb                  0.27.1
tiledb-cloud            0.12.7
tiledbsoma              1.9.5
tqdm                    4.66.4
traitlets               5.14.3
typing_extensions       4.11.0
tzdata                  2024.1
ubuntu-drivers-common   0.0.0
ufw                     0.36.1
umap-learn              0.5.6
urllib3                 1.26.5
wadllib                 1.3.6
wcwidth                 0.2.13
wheel                   0.37.1
widgetsnbextension      4.0.10
wrapt                   1.16.0
xdg                     5
xkit                    0.0.0
yarl                    1.9.4
zipp                    1.0.0
@bkmartinjr bkmartinjr added the bug Something isn't working label May 16, 2024
@Intron7
Copy link
Member

Intron7 commented May 16, 2024

Please try setting 'init_pos="random"'. This usually works for me.

@bkmartinjr
Copy link
Author

Yep, that seems to resolve it. Known issue with spectral init?

@Intron7
Copy link
Member

Intron7 commented May 16, 2024

I don't know actually. I have ran into this too for very large datasets (1M or more). I will forward this to the rapids team at NVIDIA.

@Intron7 Intron7 changed the title [BUG] tl.umap generating embeddings with very large (dispersed) point values [BUG] umap needs init_pos = 'random' for larger datasets May 16, 2024
@Intron7
Copy link
Member

Intron7 commented May 16, 2024

I updated the title so people know how to fix it. I'll leave this open for know. I'll also update UMAP so that I has a new default with auto that sets the init_pos to random if you have more than 1M cells.

@SNOL2
Copy link

SNOL2 commented Oct 9, 2024

The issue remains when set init_pos = 'random'

@Intron7
Copy link
Member

Intron7 commented Oct 9, 2024

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

@SNOL2
Copy link

SNOL2 commented Oct 10, 2024

There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.

Thanks for your quick reply!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants