

Avoid loading full data in memory? #16

Open
koalive opened this issue Nov 10, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

koalive (Member) commented Nov 10, 2020

As a long-term improvement, it would be great to be able to construct and run the pipeline without loading the whole dataset into memory, instead processing it iteratively (hard to achieve, since operations are performed on both rows and columns).

koalive added the enhancement (New feature or request) label on Nov 10, 2020
jeskowagner commented

Hi Loan,
I just noticed BioProfiling.jl, which looks great; thanks for your efforts! Regarding this issue, I was wondering whether you are aware of HDF5.jl and, more generally, DiskArrays.jl. Given that BioProfiling.jl currently expects a DataFrame to be passed, this would obviously need some rewriting, and I am not sure what the easiest way to do this would be. Just wanted to give you a heads-up in any case.
Cheers,
Jesko

koalive (Member, Author) commented Mar 3, 2022

Hi Jesko,
Thanks, that sounds like a good start! I feel that supporting selection of a subset of the dataset with lazy loading shouldn't be hard, thanks to what DiskArrays.jl offers. That would already be an improvement, although the transformation steps might be trickier to implement without loading everything into memory.
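To make the lazy-subset idea concrete, here is a minimal Python stand-in (the thread's actual target would be Julia's DiskArrays.jl or HDF5.jl): `np.memmap` keeps the array on disk, and only the selected rows and columns are materialized. The file name and shapes are made up for the example.

```python
import os
import tempfile
import numpy as np

# Build a small on-disk dataset (a stand-in for an existing file of
# morphological features, e.g. an HDF5 store).
path = os.path.join(tempfile.mkdtemp(), "features.bin")
np.arange(20, dtype=np.float64).reshape(5, 4).tofile(path)

# Open it lazily: no feature values are read into memory yet.
lazy = np.memmap(path, dtype=np.float64, mode="r", shape=(5, 4))

# Materialize only the requested subset: rows 1-2, features 0 and 3.
subset = np.asarray(lazy[1:3][:, [0, 3]])
print(subset.shape)  # (2, 2)
```

The same pattern (open lazily, index, materialize) is what a DiskArrays.jl-backed input would allow on the Julia side before any filtering step runs.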
Cheers.

jeskowagner commented Mar 4, 2022

I see two ways to address transformations:

  1. If two features with all observations fit into memory: one could simply loop over the features for correlations, MAD, etc. Please correct me if I am missing something there.
  2. If that is not guaranteed:
    2.1 Approximation by reading the data sequentially and computing intermediate statistics (e.g. a running mean and variance). See e.g.: https://stats.stackexchange.com/questions/7959/
    2.2 Random subsampling, i.e. selecting random cells. I am not sure how efficiently this would run with DiskArrays.jl and would need to look into it further.
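Option 2.1 can be sketched with Welford's online algorithm, one standard answer to the linked Stack Exchange question: mean and variance of a feature are updated in a single sequential pass, holding only three scalars in memory, so the full column never needs to be loaded. This is an illustrative Python sketch, not anything from BioProfiling.jl itself.

```python
def online_mean_var(stream):
    """Welford's online algorithm: one pass, O(1) memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

# Chunks as they would arrive from sequential disk reads.
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0]]
mean, var = online_mean_var(x for chunk in chunks for x in chunk)
print(mean, var)  # 3.0 2.5
```

Unlike a naive two-pass computation, this never needs the data resident, which is exactly what makes it suitable for the disk-backed arrays discussed above.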

Not sure whether I will have time to create a PR soon but let me know what you think.
Cheers, Jesko

koalive (Member, Author) commented Mar 8, 2022

You're right, many things could be computed feature-wise or decently approximated. I feel that things like quantifying distances between distributions, which requires computing a covariance matrix or a robust estimator of dispersion, would still be a scientific challenge and not just an implementation problem.
I think a good start would be to add abstract types clarifying whether each method needs the full data in memory or whether a lazy version can be supported. That should be pretty straightforward to implement, and I'll look into it if I find the time.
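The abstract-type idea could look something like the following Python sketch (the real implementation would use Julia abstract types and multiple dispatch; every name here is hypothetical, not BioProfiling.jl's actual API). Each transform declares whether it needs the full dataset in memory, so a pipeline can check up front which steps admit a lazy version.

```python
from abc import ABC, abstractmethod

class Transform(ABC):
    """Hypothetical base type for pipeline steps."""
    @abstractmethod
    def requires_full_data(self) -> bool: ...

class MedianCenter(Transform):
    # Per-feature medians can be computed column by column.
    def requires_full_data(self) -> bool:
        return False

class RobustMahalanobis(Transform):
    # Needs a robust covariance estimate over all features at once.
    def requires_full_data(self) -> bool:
        return True

def supports_lazy(pipeline):
    """Flag which steps could run on disk-backed data."""
    return [not t.requires_full_data() for t in pipeline]

print(supports_lazy([MedianCenter(), RobustMahalanobis()]))  # [True, False]
```

This separates the easy win (lazy selection and feature-wise steps) from the harder covariance-based methods flagged above.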
Cheers!
