Faster filtering on sparse matrices #2772
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2772      +/-   ##
==========================================
+ Coverage   72.46%   72.47%   +0.01%
==========================================
  Files         111      111
  Lines       12418    12430      +12
==========================================
+ Hits         8999     9009      +10
- Misses       3419     3421       +2
This made me realize
I believe that to make this work for csr_array, you just need to avoid isspmatrix_csr (which rules out the *_array objects). More below.
elif isspmatrix_csr(X):
    number_per_cell = np.diff(X.indptr)
I believe that this will work for csr_array (and csr_matrix, of course) if you change this elif clause:
Suggested change, old:
elif isspmatrix_csr(X):
    number_per_cell = np.diff(X.indptr)
Suggested change, new:
elif issparse(X) and X.format == 'csr':
    number_per_cell = np.diff(X.indptr)
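To illustrate the suggestion, here is a minimal sketch of a format-based check that accepts both container types. The helper name is_csr and the toy data are assumptions made for this example; they are not part of the PR.

```python
import numpy as np
from scipy.sparse import csr_array, csr_matrix, issparse


def is_csr(X):
    """Hypothetical helper: True for any SciPy sparse container in CSR format."""
    return issparse(X) and X.format == "csr"


# Toy count matrix: 3 cells (rows) x 3 genes (columns).
dense = np.array([[1, 0, 2], [0, 0, 3], [0, 0, 0]])

for container in (csr_matrix, csr_array):
    X = container(dense)
    # The format check does not depend on the matrix/array distinction.
    assert is_csr(X)
    # Stored entries per row, read directly from the row pointer.
    print(container.__name__, np.diff(X.indptr))
```

Because the check relies only on the format attribute, it should not need to change if the input switches between csr_matrix and csr_array.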
number_per_gene = np.sum(X, axis=0)
if issparse(X):
    number_per_gene = number_per_gene.A1
elif isspmatrix_csr(X):
Similarly here:
Suggested change, old:
elif isspmatrix_csr(X):
Suggested change, new:
elif issparse(X) and X.format == 'csr':
Hi :)
I am proposing a change that speeds up filter_cells (x1000 speedup) and filter_genes (x2 speedup) for CSR sparse matrices. On my personal machine for 1M cells, sc.pp.filter_cells(adata, min_genes=xx) runs in 1ms instead of 10s currently. The speedup should be even stronger on sparser modalities like ATAC.
In spirit, this simply replaces np.sum(X > 0, axis=axis) with X.getnnz(axis=axis), which is much more efficient. But the axis argument in getnnz in csr_array may be deprecated. I think it should still be fine with csr_matrix, but since I don't know for sure, I manually implemented it for the CSR case as in scipy/scipy#19405.
What do you think?
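As a rough illustration of the idea, the sketch below counts stored entries per row and per column of a CSR matrix without materialising X > 0. The helper names nnz_per_row and nnz_per_col, the use of np.bincount for the column case, and the random test data are assumptions made for this example; the PR's actual implementation may differ. The counts match np.sum(X > 0, axis=axis) only when every stored value is positive, so explicitly stored zeros or negative values would give different results.

```python
import numpy as np
from scipy.sparse import csr_matrix


def nnz_per_row(X):
    """Stored entries per row: differences of consecutive row pointers."""
    return np.diff(X.indptr)


def nnz_per_col(X):
    """Stored entries per column: occurrences of each column index."""
    return np.bincount(X.indices, minlength=X.shape[1])


# Small synthetic count matrix (cells x genes), mostly zeros.
rng = np.random.default_rng(0)
dense = rng.poisson(0.3, size=(50, 20))
X = csr_matrix(dense)  # building from dense stores only the nonzero entries

# Baseline from the description: sum a boolean mask along each axis.
per_cell_baseline = np.asarray((X > 0).sum(axis=1)).ravel()
per_gene_baseline = np.asarray((X > 0).sum(axis=0)).ravel()

assert np.array_equal(nnz_per_row(X), per_cell_baseline)
assert np.array_equal(nnz_per_col(X), per_gene_baseline)
```

The row count only reads the indptr array, whose length is the number of rows plus one, which is consistent with the much larger speedup reported for filter_cells than for filter_genes.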
Regarding getnnz: Of course it would be nicer to be able to write .getnnz(axis=axis), which extends beyond CSR to other sparse matrices. Can we assume that we're getting sparse matrices and not sparse arrays?
Pinging @dschult from the SciPy issue linked above, who mentioned:
(edited because I confused csr_array and csr_matrix)