Faster filtering on sparse matrices #2772

gjhuizing · 2023-11-30T13:53:21Z

Closes # no existing issue
Tests included or not required because: No new tests

Release notes not necessary because: I did not write release notes

Hi :)

I am proposing a change that speeds up filter_cells (x1000 speedup) and filter_genes (x2 speedup) for CSR sparse matrices. On my personal machine for 1M cells, sc.pp.filter_cells(adata, min_genes=xx) runs in 1ms instead of 10s currently. The speedup should be even stronger on sparser modalities like ATAC.

In spirit, this simply replaces np.sum(X > 0, axis=axis) with X.getnnz(axis=axis), which is much more efficient. But the axis argument in getnnz in csr_array may be deprecated. I think it should still be fine with csr_matrix, but since I don't know for sure I manually implemented it for the CSR case as in scipy/scipy#19405 .

What do you think?

Regarding getnnz: Of course it would be nicer to be able to write .getnnz(axis=axis), which extends beyond CSR to other sparse matrices. Can we assume that we're getting sparse matrices and not sparse arrays ?

Pinging @dschult from the Scipy issue liked above, who mentioned:

I'm pretty sure that a reasonable and commonly occuring use-case would be enough to make the developers include this feature somehow.

(edited because I confused csr_array and csr_matrix)

codecov · 2023-11-30T14:09:09Z

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.47%. Comparing base (bc349b9) to head (a0670c7).
Report is 143 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2772      +/-   ##
==========================================
+ Coverage   72.46%   72.47%   +0.01%     
==========================================
  Files         111      111              
  Lines       12418    12430      +12     
==========================================
+ Hits         8999     9009      +10     
- Misses       3419     3421       +2

Files	Coverage Δ
scanpy/preprocessing/_simple.py	`82.29% <88.88%> (+0.02%)`	⬆️

gjhuizing · 2023-11-30T14:14:01Z

This made me realize .getnnz is used elsewhere in the code, opened an issue #2773

dschult

I believe to make this work for csr_array, you just need to avoid isspmatrix_csr (which rules out the *_array objects). more below.

dschult · 2023-11-30T15:36:26Z

scanpy/preprocessing/_simple.py

+    elif isspmatrix_csr(X):
+        number_per_cell = np.diff(X.indptr)


I believe that this will work for csr_array (and csr_matrix of course) if you change this elif clause:

Suggested change

elif isspmatrix_csr(X):

number_per_cell = np.diff(X.indptr)

elif issparse(X) and X.format == 'csr':

number_per_cell = np.diff(X.indptr)

dschult · 2023-11-30T15:37:48Z

scanpy/preprocessing/_simple.py

+        number_per_gene = np.sum(X, axis=0)
+        if issparse(X):
+            number_per_gene = number_per_gene.A1
+    elif isspmatrix_csr(X):


similarly here:

Suggested change

elif isspmatrix_csr(X):

elif issparse(X) and X.format == 'csr':

Faster filtering on sparse matrices

7155815

gjhuizing mentioned this pull request Nov 30, 2023

Scipy getnnz deprecation? #2773

Open

3 tasks

fix output shape

a0670c7

dschult reviewed Nov 30, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster filtering on sparse matrices #2772

Faster filtering on sparse matrices #2772

gjhuizing commented Nov 30, 2023 •

edited

Loading

codecov bot commented Nov 30, 2023 •

edited

Loading

gjhuizing commented Nov 30, 2023 •

edited

Loading

dschult left a comment

dschult Nov 30, 2023

dschult Nov 30, 2023

	elif isspmatrix_csr(X):
	elif issparse(X) and X.format == 'csr':

Faster filtering on sparse matrices #2772

Are you sure you want to change the base?

Faster filtering on sparse matrices #2772

Conversation

gjhuizing commented Nov 30, 2023 • edited Loading

codecov bot commented Nov 30, 2023 • edited Loading

Codecov Report

gjhuizing commented Nov 30, 2023 • edited Loading

dschult left a comment

Choose a reason for hiding this comment

dschult Nov 30, 2023

Choose a reason for hiding this comment

dschult Nov 30, 2023

Choose a reason for hiding this comment

gjhuizing commented Nov 30, 2023 •

edited

Loading

codecov bot commented Nov 30, 2023 •

edited

Loading

gjhuizing commented Nov 30, 2023 •

edited

Loading