BioRxiv: https://www.biorxiv.org/content/10.1101/2024.09.30.614425v1
In-house dependencies: imzy and UniPy
DOI: https://doi.org/10.4121/a6efd47a-b4ec-493e-a742-70e8a369f788
Data: https://surfdrive.surf.nl/files/index.php/s/TLPccCCAP7Uat5r
There are two data sets: (1) FT-ICR (fticr_data.tar.gz) and (2) qTOF (all other files, i.e. xaa-xaz). The steps below explain how to unpack the data on Linux/macOS.
Unpack data
mkdir -p fticr_data && tar -xzf fticr_data.tar.gz -C fticr_data
Given the 100+ GB compressed size of the qTOF data set, all of the different split files (4 GB each) need to be concatenated to unpack the whole folder.
cat xa* | tar -xzf -
Use your favorite toolbox for this, e.g. https://github.com/vandeplaslab/imzy. The code below provides a simple example for when B fits into memory (in a dense format).
import numpy as np
import scipy.sparse as scis
from imzy import get_reader
path = "path/to/file"
reader = get_reader(path)
f = reader[0]  # read one frame to get the dtype
out = np.zeros((reader.n_mz_bins, len(reader.framelist)), dtype=f.dtype)
for k, i in enumerate(reader.framelist):
    f = reader[i]._init_csc()  # sparse (CSC) representation of frame i
    out[f.indices, k] = f.data
B = scis.csc_matrix(out, dtype='float32', copy=False)
When B (in a dense format) is too large to fit into memory, one can use the following:
import time

import numpy as np
import scipy.sparse as scis
from imzy import get_reader

path = "path/to/file"
reader = get_reader(path)

# Pre-compute the total number of data points across all frames to pre-allocate the CSC buffers
vec_size = 0
for i in reader.framelist:
    vec_size += reader.bo[i][1][5]

b_data = np.zeros((vec_size,), dtype='float32')
b_indices = np.zeros((vec_size,), dtype='int32')
b_indptr = np.zeros(len(reader.framelist) + 1, dtype='int64')

def read_in(q, e, i):
    # Append the sparse column of frame i to the pre-allocated CSC buffers
    a = reader[i]
    lsize = a.data.shape[0]
    b_data[q:q + lsize] = a.data
    b_indices[q:q + lsize] = a.indices
    b_indptr[e + 1] = b_indptr[e] + a.indptr[1]
    return q + lsize

tic = time.time()
q = 0
for e, i in enumerate(reader.framelist):
    q = read_in(q, e, i)
    if e % 1_000 == 0:  # Print progress
        print(np.round(e / len(reader.framelist) * 100, 2), '% in', np.round(time.time() - tic), 's', end='\r')

b_data = b_data[:q]
b_indices = b_indices[:q]
B = scis.csc_matrix((b_data, b_indices, b_indptr), shape=(reader.n_mz_bins, len(reader.framelist)), copy=False)
This package depends heavily on UniPy (https://github.com/vandeplaslab/unipy); make sure to install UniPy before this package. Then, get the latest version of full_profile, e.g. through the GitHub CLI (https://cli.github.com/):
gh repo clone vandeplaslab/full_profile
Install the package (e.g. through pip) and get the required dependencies (first cd into the full_profile folder):
pip install .
Once the package is installed, we can apply the methods from the paper to the data. For our purposes, we first applied a 5-95% Total Ion Current Count (TIC) normalization (i.e. C) on the raw data (see Supplementary Materials of the paper). The method can be found here: https://github.com/vandeplaslab/imzy/blob/add-norms/src/imzy/_normalizations/_extract.py, or we can apply it directly onto our sparse data:
def calculate_normalizations_optimized(spectrum: np.ndarray) -> float:
    """Calculate various normalizations, optimized version.

    This function expects a float32 spectrum.
    """
    # Filter positive values once and reuse
    positive_spectrum = spectrum[spectrum > 0]
    # Calculate all needed quantiles in one call
    if positive_spectrum.size > 1:
        q95, q90, q10, q5 = np.nanquantile(positive_spectrum, [0.95, 0.9, 0.1, 0.05])
    else:
        q95, q90, q10, q5 = 0, 0, 0, 0
    # Using logical indexing with boolean arrays might be faster due to numba optimization
    condition_q95 = spectrum < q95
    condition_q5 = spectrum > q5
    return np.sum(spectrum[condition_q5 & condition_q95])  # 5-95% TIC
C = np.zeros((B.shape[1], 1))
for i in range(B.shape[1]):
    C[i] = calculate_normalizations_optimized(B[:, i].toarray())
    print(i, 'of', B.shape[1], end='\r')
C_sp = 1 / (C / np.median(C))  # Apply rescaler (np.median(C)) for ease of visualization
B.data *= np.repeat(C_sp.ravel(), np.diff(B.indptr))  # Apply normalization to each pixel (column) of B
Example of how to apply the singular value thresholding algorithm:
from full_profile import svt
inst = svt.SVT(maxiter=100, tau_factor=1e-2, delta=1, method='scipy_gesdd', dense=True, verbose=True)
inst.run(B)
This line creates an instance of the SVT ("Singular Value Thresholding") class with the following parameters:
maxiter=100: First positional argument is the maximal number of iterations
tau_factor=1e-2: Sets the tau factor to 0.01 (see Supplementary Materials)
delta=1: Sets delta, i.e. step size, to 1 (see Supplementary Materials)
method='scipy_gesdd': Specifies the underlying SVD method as 'scipy_gesdd' (multiple available, see UniPy package)
dense=True: Sets the dense flag to True if enough memory is available to reconstruct B in dense form (True speeds up the calculations if enough memory is available)
verbose=True: Enables verbose output
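For intuition, the core operation of SVT is soft-thresholding of the singular values. A minimal NumPy sketch of that operator (illustrative only, not the full_profile implementation; the helper name is hypothetical) could look like:

import numpy as np

def shrink_singular_values(X: np.ndarray, tau: float) -> np.ndarray:
    # Singular value shrinkage operator: subtract tau from each singular value,
    # clip at zero, and reconstruct the matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt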
Example of how to apply the fixed point continuation algorithm:
from full_profile import fpc
inst = fpc.FPC(maxiter=100, tau_factor=1e-3, delta=1.4, method='arpack', verbose=True)
inst.run(B)
This line creates an instance of the FPC ("Fixed Point Continuation") class with the following parameters:
maxiter=100: First positional argument is the maximal number of iterations
tau_factor=1e-3: Sets the tau factor to 0.001 (see Supplementary Materials)
delta=1.4: Sets delta, i.e. step size, to 1.4 (see Supplementary Materials)
method='arpack': Specifies the underlying SVD method as 'arpack' (multiple available, see UniPy package)
verbose=True: Enables verbose output
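For intuition, fixed point continuation alternates a gradient step on the data-fit term with singular value shrinkage. A minimal sketch for a dense matrix (illustrative only, not the full_profile implementation; the fixed tau here omits the continuation schedule used in practice) could be:

import numpy as np

def fpc_iterations(B: np.ndarray, tau: float, delta: float, maxiter: int) -> np.ndarray:
    # Illustrative fixed point iteration for min_X tau*||X||_* + 0.5*||X - B||_F^2
    X = np.zeros_like(B)
    for _ in range(maxiter):
        Y = X - delta * (X - B)                              # gradient step on the data-fit term
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau * delta, 0.0)) @ Vt      # singular value shrinkage
    return X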
import numpy as np
from full_profile import dfc, svt
from pyspa import SPAReader

path_fticr = "path/to/fticr_file"
reader = SPAReader(path_fticr)
selection = reader.framelist
C = np.ones_like(selection)  # No normalization (all ones)
inst_svt = svt.SVT(10, .01, 1.4, method='sparse_arpack', dense=True, verbose=False)
inst = dfc.DFC(reader, selection, inst_svt, C)
inst.divide()
inst.factor()
inst.combine()
This creates an instance of the DFC ("Divide-Factor-Combine") class with an underlying SVT solver; either SVT or FPC can be used, with their corresponding parameters. Divide and Combine have individual parameters that can be set to define the sampling size, the oversampling factor for the projection, and the rank oversampling.
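For intuition, the divide-factor-combine idea can be sketched in a few lines of NumPy (illustrative only, not the full_profile implementation; the contiguous column split, the solver callback, and the projection onto the first block's column space are simplifying assumptions):

import numpy as np

def dfc_sketch(B: np.ndarray, n_blocks: int, factor_fn) -> np.ndarray:
    # Divide: split the columns (pixels) into blocks.
    blocks = np.array_split(np.arange(B.shape[1]), n_blocks)
    # Factor: run a low-rank solver (e.g. SVT or FPC) on each block independently.
    pieces = [factor_fn(B[:, idx]) for idx in blocks]
    # Combine: project every block estimate onto the column space of the first block.
    Q, _ = np.linalg.qr(pieces[0])
    return np.hstack([Q @ (Q.T @ piece) for piece in pieces])

Here factor_fn stands for any low-rank estimator applied to a dense block, e.g. the shrinkage sketch shown above.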