This repository has been archived by the owner on Sep 11, 2023. It is now read-only.
Frank Noe edited this page Oct 31, 2015 · 1 revision
Data types in early processing (e.g. TICA)
Some data processing steps are currently inefficient - in memory usage, CPU usage, or both.
Allow sparse matrices as input.
Allow different data types, e.g. boolean arrays or bitarrays for contact maps.
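To give a rough sense of the memory side of this, the following sketch compares the footprint of the same contact map stored as float64, as a boolean array, and as a bit-packed array. The array shape and contact density are made up for illustration, and `np.packbits` only stands in for whatever bitarray representation would actually be chosen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical contact-map trajectory: 1000 frames x 800 contact pairs,
# roughly 10% of the contacts formed (illustrative numbers only).
contacts = rng.random((1000, 800)) < 0.1      # dtype=bool, 1 byte per entry

as_float = contacts.astype(np.float64)        # 8 bytes per entry
as_bits = np.packbits(contacts, axis=1)       # 1 bit per entry, packed into uint8

print("float64:", as_float.nbytes, "bytes")
print("bool:   ", contacts.nbytes, "bytes")
print("packed: ", as_bits.nbytes, "bytes")

# The packed form is lossless: unpacking recovers the boolean array.
restored = np.unpackbits(as_bits, axis=1).astype(bool)
```

For this shape the boolean array is 8 times smaller than the float64 array, and the packed array is another 8 times smaller, at the cost of an unpacking step before any arithmetic.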
This raises the question of how to keep the data processing pipeline general.
Build specialized low-level estimators for specific data types, e.g. covariance estimators
for integer and sparse boolean data. (A simple one-pass algorithm is robust for integral data,
and a C implementation can deal efficiently with 1s and 0s.)
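As a minimal sketch of why a one-pass algorithm is safe for integral data: sums and co-occurrence counts of 0/1 values can be accumulated exactly in integers, so the usual cancellation problem of the one-pass formula only appears in the final division. The function name `boolean_covariance` and the chunked interface are assumptions for illustration, written in NumPy rather than C, and the 1/N normalization is one possible choice.

```python
import numpy as np

def boolean_covariance(chunks):
    """One-pass covariance of 0/1 data, accumulated exactly in integers.

    `chunks` is any iterable of (n_frames, n_features) boolean arrays.
    Returns the covariance matrix with 1/N normalization.
    """
    n = 0
    s = None   # per-feature counts of 1s
    c = None   # pairwise co-occurrence counts
    for chunk in chunks:
        x = np.asarray(chunk).astype(np.int64)
        if s is None:
            d = x.shape[1]
            s = np.zeros(d, dtype=np.int64)
            c = np.zeros((d, d), dtype=np.int64)
        n += x.shape[0]
        s += x.sum(axis=0)
        c += x.T @ x          # integer matrix product, no rounding error
    mean = s / n
    # Floating point enters only here, after all data has been seen.
    return c / n - np.outer(mean, mean)

# Usage: compare against NumPy's float covariance on the same data.
rng = np.random.default_rng(0)
data = rng.random((500, 6)) < 0.3
cov_onepass = boolean_covariance(np.array_split(data, 5))
cov_reference = np.cov(data.astype(np.float64), rowvar=False, bias=True)
```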
A high-level estimator (e.g. TICA) encapsulates multiple types, e.g. float/int.
If no specialized low-level algorithm is implemented for a type, a fallback implementation is used.
For example, a boolean array can be cast to a float array containing 0.0 and 1.0, a sparse
data chunk can be copied into a dense data chunk, etc.
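The dispatch-with-fallback idea could look roughly like the sketch below. All names here (`covariance`, `to_dense_float`, the `_SPECIALIZED` table) are hypothetical and not part of any existing API. Sparse chunks are recognized only by having a `toarray()` method, which is how SciPy sparse matrices would be densified, and that duck-typing is an assumption as well.

```python
import numpy as np

def _cov_float(x):
    # Generic dense float implementation (1/N normalization).
    xc = x - x.mean(axis=0)
    return xc.T @ xc / x.shape[0]

def _cov_bool(x):
    # Specialized implementation: exact integer counts for 0/1 data.
    xi = x.astype(np.int64)
    n = xi.shape[0]
    mean = xi.sum(axis=0) / n
    return (xi.T @ xi) / n - np.outer(mean, mean)

# Table of specialized low-level estimators, keyed by dtype.
_SPECIALIZED = {np.dtype(bool): _cov_bool}

def to_dense_float(chunk):
    # Fallback conversion: densify sparse chunks, then cast to float64.
    if hasattr(chunk, "toarray"):
        chunk = chunk.toarray()
    return np.asarray(chunk, dtype=np.float64)

def covariance(chunk):
    # High-level entry point: use a specialized estimator when one exists,
    # otherwise convert the chunk and use the generic float estimator.
    if isinstance(chunk, np.ndarray) and chunk.dtype in _SPECIALIZED:
        return _SPECIALIZED[chunk.dtype](chunk)
    return _cov_float(to_dense_float(chunk))
```

With this layout, adding support for a new data type means registering one more entry in the table, and every type without an entry still works through the cast.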
Data types in late estimation
Clustering output is integer, MSM/HMSM input is integer
How can they be included in a data processing pipeline?
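To make the type boundary concrete (without settling the question above), here is a small sketch in which a clustering-style assignment step turns float coordinates into integer state labels, and an MSM-style counting step consumes those labels. Both function names are hypothetical, and the nearest-center assignment stands in for whatever clustering method is actually used.

```python
import numpy as np

def assign_to_centers(x, centers):
    # Float input (n_frames, n_dims) -> integer state label per frame.
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1).astype(np.int32)

def count_matrix(dtraj, lag, n_states):
    # Integer input -> integer transition counts at lag time `lag` (lag >= 1).
    counts = np.zeros((n_states, n_states), dtype=np.int64)
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1)
    return counts

# Usage with made-up data: two cluster centers in 2D.
rng = np.random.default_rng(0)
traj = rng.normal(size=(200, 2))
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
dtraj = assign_to_centers(traj, centers)
counts = count_matrix(dtraj, lag=1, n_states=2)
```

The point of the sketch is only that the data changes from a float array of shape (n_frames, n_dims) to a one-dimensional integer array between the two stages, so a shared pipeline interface has to carry both.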