BioRxiv: https://www.biorxiv.org/content/10.1101/2024.09.30.614425v1
In-house dependencies: imzy and UniPy
DOI: https://doi.org/10.4121/a6efd47a-b4ec-493e-a742-70e8a369f788
Data: https://surfdrive.surf.nl/files/index.php/s/TLPccCCAP7Uat5r
There are two data sets: (1) FT-ICR (fticr_data.tar.gz) and (2) qTOF (all other files, i.e. xaa-xaz). The steps below describe how to unpack the data on Linux/macOS.
Unpack data
mkdir -p fticr_data && tar -xzf fticr_data.tar.gz -C fticr_data
Since the compressed qTOF data set is over 100 GB, it is distributed as split files (4 GB each) that must first be concatenated before the whole folder can be unpacked.
cat xa* | tar -xzf -
Use your favorite toolbox to load the data, e.g. https://github.com/vandeplaslab/imzy. The code below provides a simple example of assembling the data matrix B when it fits into memory (in a dense format).
import numpy as np
import scipy.sparse as scis
from imzy import get_reader

path = "path/to/file"
reader = get_reader(path)

# Pre-allocate a dense (m/z bins x frames) array and fill it frame by frame
f = reader[0]
out = np.zeros((reader.n_mz_bins, len(reader.framelist)), dtype=f.dtype)
for k, i in enumerate(reader.framelist):
    f = reader[i]._init_csc()
    out[f.indices, k] = f.data

# Convert the dense array into a sparse CSC matrix
B = scis.csc_matrix(out, dtype='float32', copy=False)
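As a quick optional check (a minimal sketch reusing the names from the snippet above), you can inspect the size of the assembled matrix:

print("B shape:", B.shape)  # (number of m/z bins, number of pixels/frames)
print("non-zero entries:", B.nnz)
print("sparse memory (GB):", (B.data.nbytes + B.indices.nbytes + B.indptr.nbytes) / 1e9)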
When B (in a dense format) is too large to fit in memory, one can instead build the sparse matrix directly:
import time as t

import numpy as np
import scipy.sparse as scis
from imzy import get_reader

path = "path/to/file"
reader = get_reader(path)
iter_var = reader.framelist

# Pre-compute an upper bound on the total number of non-zero values to pre-allocate the CSC buffers
vec_size = 0
for i in iter_var:
    vec_size += reader.bo[i][1][5]

b_data = np.zeros((vec_size,), dtype='float32')
b_indices = np.zeros((vec_size,), dtype='int32')
b_indptr = np.zeros((len(iter_var) + 1,), dtype='int64')

def read_in(q, e, i):
    # Read frame i and copy its non-zero values into the pre-allocated buffers
    a = reader[i]
    lsize = a.data.shape[0]
    b_data[q:q + lsize] = a.data
    b_indices[q:q + lsize] = a.indices
    b_indptr[e + 1] = b_indptr[e] + a.indptr[1]
    return q + lsize

tic = t.time()
q = 0
for e, i in enumerate(iter_var):
    q = read_in(q, e, i)
    if e % 1_000 == 0:  # Print progress
        print(np.round(e / len(iter_var) * 100, 2), '% in', np.round(t.time() - tic), 's', end='\r')

# Trim the buffers to the number of values actually read and assemble the sparse matrix
b_data = b_data[:q]
b_indices = b_indices[:q]
B = scis.csc_matrix((b_data, b_indices, b_indptr), shape=(reader.n_mz_bins, len(iter_var)), copy=False)
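A small optional sanity check (a sketch using the variables defined above): the final index pointer must equal the number of values actually stored, otherwise the CSC matrix would be malformed.

assert b_indptr[-1] == q, "index pointer does not match the number of stored non-zeros"
print("non-zero entries:", B.nnz)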
This package depends heavily on https://github.com/vandeplaslab/unipy; make sure to install UniPy before this package. Then get the latest version of full_profile, e.g. through the GitHub CLI (https://cli.github.com/):
gh repo clone vandeplaslab/full_profile
Install the package and its required dependencies (e.g. through pip), after first cd'ing into the full_profile folder:
pip install .
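To verify the installation, a quick import check can be run (a minimal sketch; the module names follow the examples below):

from full_profile import svt, fpc  # modules used in the examples below
print("full_profile imported successfully")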
Once the package is installed, we can apply the paper's methods to the data.
For our purposes, we first applied a 5-95% Total Ion Current (TIC) normalization (i.e. C) to the raw data (see the Supplementary Materials of the paper). The method can be found here: https://github.com/vandeplaslab/imzy/blob/add-norms/src/imzy/_normalizations/_extract.py, or we can apply it directly to our sparse data as follows:
def calculate_normalizations_optimized(spectrum: np.ndarray) -> float:
    """Calculate the 5-95% TIC normalization factor, optimized version.

    This function expects a float32 spectrum.
    """
    # Filter positive values once and reuse
    positive_spectrum = spectrum[spectrum > 0]
    # Calculate all needed quantiles in one call
    if positive_spectrum.size > 1:
        q95, q90, q10, q5 = np.nanquantile(positive_spectrum, [0.95, 0.9, 0.1, 0.05])
    else:
        q95, q90, q10, q5 = 0, 0, 0, 0
    # Logical indexing with boolean arrays might be faster due to numba optimization
    condition_q95 = spectrum < q95
    condition_q5 = spectrum > q5
    return np.sum(spectrum[condition_q5 & condition_q95])  # 5-95% TIC

# Compute the normalization factor for every pixel/frame (column of B)
C = np.zeros((B.shape[1], 1))
for i in range(B.shape[1]):
    C[i] = calculate_normalizations_optimized(B[:, i].toarray())
    print(i, 'of', B.shape[1], end='\r')

C_sp = 1 / (C / np.median(C))  # Apply rescaler (np.median(C)) for ease of visualization
B = B.multiply(C_sp.T).tocsc()  # Apply normalization (scale each column of B by its factor)
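As an optional sanity check (a sketch reusing B, C, and the function above, and assuming every pixel has a non-zero 5-95% TIC): after normalization, the 5-95% TIC of every pixel should be approximately constant and equal to np.median(C).

tic_after = np.array([calculate_normalizations_optimized(B[:, i].toarray()) for i in range(B.shape[1])])
print(np.allclose(tic_after, np.median(C), rtol=1e-3))  # expected: True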
Example of how to apply the singular value thresholding (SVT) algorithm:
from full_profile import svt
inst = svt.svt(maxiter=100, tau_factor=1e-2, delta=1, method='scipy_gesdd', dense=True, verbose=True)
inst.run(B)
This line creates an instance of the SVT (Singular Value Thresholding) class with the following parameters:
maxiter=100: first positional argument, the maximum number of iterations
tau_factor=1e-2: sets the tau factor to 0.01 (see Supplementary Materials)
delta=1: sets delta, i.e. the step size, to 1 (see Supplementary Materials)
method='scipy_gesdd': specifies the underlying SVD method as 'scipy_gesdd' (multiple methods are available; see the UniPy package)
dense=True: sets the dense flag; use True if enough memory is available to reconstruct B in dense format (True speeds up the calculations)
verbose=True: enables verbose output
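As a rough guide for the dense flag (a minimal sketch, assuming B was built as in the loading step above), estimate the dense memory footprint before enabling dense=True:

dense_gb = B.shape[0] * B.shape[1] * 4 / 1e9  # float32 = 4 bytes per entry
print(f"a dense copy of B would need ~{dense_gb:.1f} GB")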
Example of how to apply the fixed point continuation (FPC) algorithm:
from full_profile import fpc
inst = fpc.fpc(maxiter=100, tau_factor=1e-3, delta=1.4, method='arpack', verbose=True)
inst.run(B)
This line creates an instance of the FPC (Fixed Point Continuation) class with the following parameters:
maxiter=100: first positional argument, the maximum number of iterations
tau_factor=1e-3: sets the tau factor to 0.001 (see Supplementary Materials)
delta=1.4: sets delta, i.e. the step size, to 1.4 (see Supplementary Materials)
method='arpack': specifies the underlying SVD method as 'arpack' (multiple methods are available; see the UniPy package)
verbose=True: enables verbose output
Example of how to apply the DFC algorithm:
from full_profile import dfc
inst = dfc.dfc(maxiter=100, tau_factor=1e-3, delta=1.4, method='arpack', verbose=True)
inst.run(B)
2024/10/04: We are working hard to make UniPy available, along with the DFC implementation.