

Avoid loading full data in memory? #16

Open
koalive opened this issue Nov 10, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

koalive (Member) commented Nov 10, 2020

As a long-term improvement, it would be great to be able to construct and run the pipeline without loading the whole dataset into memory, instead processing it iteratively (hard to achieve, since operations are performed on both rows and columns).

koalive added the enhancement (New feature or request) label on Nov 10, 2020
jeskowagner commented

Hi Loan,
I just noticed BioProfiling.jl, which looks great; thanks for your efforts! Regarding this issue, I was wondering whether you are aware of HDF5.jl and, more generally, DiskArrays.jl. Given that BioProfiling.jl currently expects a DataFrame to be passed, this would obviously need some rewriting, and I am not sure what the easiest way to do this would be. Just wanted to give you a heads-up in any case.
Cheers,
Jesko

koalive (Member, Author) commented Mar 3, 2022

Hi Jesko,
Thanks, that sounds like a good start! I feel that supporting selection of a subset of the dataset with lazy loading shouldn't be hard, thanks to what DiskArrays.jl offers. That would already be an improvement, although the transformation steps might be trickier to implement without loading everything into memory.
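To make the lazy-subset idea concrete, here is a minimal Python stand-in (the thread's actual target would be Julia's DiskArrays.jl or HDF5.jl): `np.memmap` keeps the array on disk, and only the selected rows and columns are materialized. The file name and shapes are made up for the example.

```python
import os
import tempfile
import numpy as np

# Build a small on-disk dataset (a stand-in for an existing file of
# morphological features, e.g. an HDF5 store).
path = os.path.join(tempfile.mkdtemp(), "features.bin")
np.arange(20, dtype=np.float64).reshape(5, 4).tofile(path)

# Open it lazily: no feature values are read into memory yet.
lazy = np.memmap(path, dtype=np.float64, mode="r", shape=(5, 4))

# Materialize only the requested subset: rows 1-2, features 0 and 3.
subset = np.asarray(lazy[1:3][:, [0, 3]])
print(subset.shape)  # (2, 2)
```

The same pattern (open lazily, index, materialize) is what a DiskArrays.jl-backed input would allow on the Julia side before any filtering step runs.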
Cheers.

jeskowagner commented Mar 4, 2022

I see two ways to address transformations:

  1. If two features with all observations fit into memory: one could simply loop over the features for correlations, MAD, etc. Please correct me if I am missing something there.
  2. If that is not guaranteed:
    2.1 Approximation by reading the data sequentially and computing intermediate statistics (e.g. a running mean and variance). See e.g.: https://stats.stackexchange.com/questions/7959/
    2.2 Random subsampling, i.e. selecting random cells. I am not sure how efficiently this would run with DiskArrays.jl and would need to look into it further.
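Option 2.1 can be sketched with Welford's online algorithm, one standard answer to the linked Stack Exchange question: mean and variance of a feature are updated in a single sequential pass, holding only three scalars in memory, so the full column never needs to be loaded. This is an illustrative Python sketch, not anything from BioProfiling.jl itself.

```python
def online_mean_var(stream):
    """Welford's online algorithm: one pass, O(1) memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

# Chunks as they would arrive from sequential disk reads.
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0]]
mean, var = online_mean_var(x for chunk in chunks for x in chunk)
print(mean, var)  # 3.0 2.5
```

Unlike a naive two-pass computation, this never needs the data resident, which is exactly what makes it suitable for the disk-backed arrays discussed above.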

Not sure whether I will have time to create a PR soon but let me know what you think.
Cheers, Jesko

koalive (Member, Author) commented Mar 8, 2022

You're right, many things could be computed feature-wise or decently approximated. I feel that things like quantifying distances between distributions, which requires computing a covariance matrix or a robust estimator of dispersion, would still be a scientific challenge and not just an implementation problem.
I think a good start would be to add abstract types clarifying whether each method needs the full data in memory or whether a lazy version can be supported. That should be pretty straightforward to implement, and I'll look into it if I find the time.
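The abstract-type idea could look something like the following Python sketch (the real implementation would use Julia abstract types and multiple dispatch; every name here is hypothetical, not BioProfiling.jl's actual API). Each transform declares whether it needs the full dataset in memory, so a pipeline can check up front which steps admit a lazy version.

```python
from abc import ABC, abstractmethod

class Transform(ABC):
    """Hypothetical base type for pipeline steps."""
    @abstractmethod
    def requires_full_data(self) -> bool: ...

class MedianCenter(Transform):
    # Per-feature medians can be computed column by column.
    def requires_full_data(self) -> bool:
        return False

class RobustMahalanobis(Transform):
    # Needs a robust covariance estimate over all features at once.
    def requires_full_data(self) -> bool:
        return True

def supports_lazy(pipeline):
    """Flag which steps could run on disk-backed data."""
    return [not t.requires_full_data() for t in pipeline]

print(supports_lazy([MedianCenter(), RobustMahalanobis()]))  # [True, False]
```

This separates the easy win (lazy selection and feature-wise steps) from the harder covariance-based methods flagged above.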
Cheers!
