Tools for chemometrics and machine learning on high-dimensional data (e.g. Partial least squares regression/discrimination)


Jchemo.jl

Chemometrics and machine learning on high-dimensional data with Julia

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

About

Jchemo was initially dedicated to partial least squares regression (PLSR) and discrimination (PLSDA) and their extensions, in particular locally weighted PLS models (LWPLS-R & -DA; e.g. https://doi.org/10.1002/cem.3209). In a second phase, the package has expanded to many other models for dimension reduction, regression, and discrimination (see the list of functions here).

Why the name Jchemo? Because it is oriented toward chemometrics, in brief the use of biometrics for chemistry. But most of the provided methods are generic to other application domains.

Jchemo is organized into

  • transform operators (which implement a transf function),
  • predictors (which implement a predict function),
  • utility functions.

Some models, such as PLSR models, are both transform operators and predictors.
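As a minimal sketch of this contract on fictive data (assuming center() is available as a column-centering transformer, and using plskern and the fit!/transf/predict verbs shown later in this README):

```julia
using Jchemo

X = rand(30, 5)
y = rand(30)

## A transform operator: fit, then transform
model = center()          # assumption: centers each column of X
fit!(model, X)
Xc = transf(model, X)

## A predictor that is also a transform operator
model = plskern(nlv = 2)  # PLSR with 2 latent variables
fit!(model, X, y)
T = transf(model, X)            # scores (transformer role)
pred = predict(model, X).pred   # predictions (predictor role)
```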

Ad hoc pipelines of operations can also be built by the user. In Jchemo, a pipeline is a chain of K modeling steps consisting of

  • either K transform steps,
  • or K - 1 transform steps followed by a final prediction step.

The pipelines are built with function pip, see here.

Installation

To install Jchemo

  • From the official Julia repo, run in the Pkg REPL:
pkg> add Jchemo

or for a specific version:

pkg> add Jchemo@0.1.18
  • For the current development version (potentially unstable):
pkg> add https://github.com/mlesnoff/Jchemo.jl.git

Warning

Before updating the package, it is recommended to have a look at What changed to avoid potential problems due to breaking changes.

Tips

Syntax

Most Jchemo functions have keyword arguments (kwargs). The keyword arguments required or allowed by a given function can be found in the Index of functions section of the documentation, or in the REPL by displaying the function's help page, for instance for function plskern:

julia> ?plskern

The kwargs and their default values are defined in containers that depend on the function, available here. For a given function, they can be displayed in the REPL with function default, e.g.:

julia> default(plskern)

Jchemo.ParPlsr
  nlv: Int64 1
  scal: Bool false

Examples of syntax are given at the end of this README, and in the help pages of the functions.

Datasets

The datasets used in the help pages' examples are stored in the package JchemoData.jl, a repository of datasets on chemometrics and other domains. Examples of scripts demonstrating Jchemo are also available in the pedagogical project JchemoDemo.

Tuning predictive models

Two generic grid-search functions, gridscore (test-set validation) and gridcv (cross-validation), are available to tune the predictors.

Highly accelerated versions of these tuning tools have been implemented for models based on latent variables (LVs) and ridge regularization.
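As a hedged sketch of test-set tuning with gridscore on fictive data (the keyword arguments score and nlv are assumptions based on the package help pages):

```julia
using Jchemo

n = 100 ; p = 50
Xtrain = rand(n, p) ; ytrain = rand(n)
Xval = rand(20, p) ; yval = rand(20)

model = plskern()
nlv = 0:20   # grid of nb. latent variables to evaluate
## Accelerated path for LV-based models: all nlv values
## are scored without refitting the model from scratch
res = gridscore(model, Xtrain, ytrain, Xval, yval; score = rmsep, nlv)
```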

Multi-threading

Some Jchemo functions (in particular those using kNN selections) use multi-threading to speed up the computations. Taking advantage of this requires specifying a relevant number of threads (e.g. from the Settings menu of the VS Code Julia extension and the file settings.json).
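For instance, the number of threads actually available to a Julia session can be checked in the REPL ("julia.NumThreads" is the relevant key in the VS Code settings.json):

```julia
## Number of threads Julia was started with
Threads.nthreads()

## In VS Code's settings.json, e.g.:
## "julia.NumThreads": 8
## or start Julia from a shell with: julia --threads 8
```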

Plotting

Jchemo uses Makie for plotting. Displaying the plots requires first installing and loading one of Makie's backends (e.g. CairoMakie).
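For instance, on fictive data (assuming plotsp, which plots each row of a matrix as a curve, typically spectra; any other Makie backend can replace CairoMakie):

```julia
using CairoMakie   # load a backend before any Jchemo plotting
using Jchemo

X = rand(10, 200)  # 10 fictive spectra of 200 points
plotsp(X).f        # display the curves
```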

Benchmark

using Jchemo, BenchmarkTools
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 23 on 16 virtual cores
Environment:
  JULIA_EDITOR = code

Multivariate PLSR with n = 1e6 observations

n = 10^6  # nb. observations (samples)
p = 500   # nb. X-variables (features)
q = 10    # nb. Y-variables to predict
nlv = 25  # nb. PLS latent variables
X = rand(n, p)
Y = rand(n, q)
zX = Float32.(X)
zY = Float32.(Y)
## Float64
## (NB.: multi-threading is not used in plskern) 
model = plskern(; nlv)
@benchmark fit!($model, $X, $Y)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.532 s (1.07% GC) to evaluate,
 with a memory estimate of 4.09 GiB, over 2677 allocations.
## Float32
@benchmark fit!($model, $zX, $zY) 

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.956 s … 4.148 s  ┊ GC (min … max): 0.82% … 3.95%
 Time  (median):     4.052 s               ┊ GC (median):    2.42%
 Time  (mean ± σ):   4.052 s ± 135.259 ms  ┊ GC (mean ± σ):  2.42% ± 2.21%

  █                                                        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.96 s         Histogram: frequency by time         4.15 s <

 Memory estimate: 2.05 GiB, allocs estimate: 2677.

Examples of syntax

Some fictive data

n = 150 ; p = 200 
q = 2 ; m = 50 
Xtrain = rand(n, p)
Ytrain = rand(n, q) 
Xtest = rand(m, p)
Ytest = rand(m, q) 

Fitting a transformer

a) Example of a signal preprocessing

Let us consider a signal preprocessing with the Savitzky-Golay filter, using function savgol. The keyword arguments of savgol are npoint, deriv and degree. See for instance in the REPL:

julia> ?savgol

The syntax to fit the model is as follows:

## Below, the order of the kwargs is not important,
## but the argument names have to be correct.
## Keyword arguments are specified after the character ';'.

## Model definition
## (below, the name 'model' can be replaced by any other name)
npoint = 11 ; deriv = 2 ; degree = 3
model = savgol(; npoint, deriv, degree)

## Fitting
fit!(model, Xtrain)

which is strictly equivalent to:

## Below, ";" is not required since the kwargs values are
## specified within the function

model = savgol(npoint = 11, deriv = 2, degree = 3)
fit!(model, Xtrain)

The contents of object model can be displayed by:

julia> pnames(model)

(:algo, :fitm, :kwargs)

Fields algo, fitm and kwargs contain the algorithm used (a function), the model fitted on the data, and the kwargs arguments, respectively.

Once the model is fitted, the transformed (i.e. here preprocessed) data are given by:

Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)

Several preprocessing steps can be applied sequentially to the data by building a pipeline (see the section Fitting a pipeline below for examples).

b) Example of a PCA

Let us consider a principal component analysis (PCA), using function pcasvd.

The syntax to fit the model is as follows:

nlv = 15  # nb. principal components
model = pcasvd(; nlv)
fit!(model, Xtrain)

For a preliminary scaling of the data before the PCA decomposition, the syntax is:

nlv = 15 ; scal = true
model = pcasvd(; nlv, scal)
fit!(model, Xtrain)

The PCA score matrices (i.e. the projections of the observations on the PCA directions) can be computed by:

Ttrain = transf(model, Xtrain)
Ttest = transf(model, Xtest)

Object Ttrain above can also be obtained directly by:

Ttrain = model.fitm.T

A summary of the model (% of explained variance, etc.) can be displayed by:

summary(model, Xtrain)

Fitting a predictor

a) Example of a KPLSR

Let us consider a Gaussian kernel partial least squares regression (KPLSR), using function kplsr.

The syntax to fit the model is as follows:

nlv = 15  # nb. latent variables
kern = :krbf ; gamma = .001 
model = kplsr(; nlv, kern, gamma)
fit!(model, Xtrain, Ytrain)

As for PCA, the score matrices can be computed by:

Ttrain = transf(model, Xtrain)   # = model.fitm.T
Ttest = transf(model, Xtest)

and model summary by:

summary(model, Xtrain)

Y-predictions are given by:

pred = predict(model, Xtest).pred

Examples of tuning of predictive models (test-set validation and cross-validation) are given in the help pages of functions gridscore and gridcv:

?gridscore
?gridcv

Fitting a pipeline

a) Example of chained preprocessing

Let us consider a data preprocessing by standard-normal-variation transformation (SNV), followed by a Savitzky-Golay filter and a polynomial de-trending transformation.

The pipeline is fitted as follows:

## Models' definition
model1 = snv()
model2 = savgol(npoint = 5, deriv = 1, degree = 2)
model3 = detrend_pol()  
## Pipeline building
model = pip(model1, model2, model3)
## Fitting
fit!(model, Xtrain)

The transformed data are given by:

Xptrain = transf(model, Xtrain)
Xptest = transf(model, Xtest)

b) Example of PCA-SVMR

Let us consider a support vector machine regression model implemented on preliminary computed PCA scores (PCA-SVMR).

The pipeline is fitted as follows:

nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = pcasvd(; nlv)
model2 = svmr(; kern, gamma, cost)
model = pip(model1, model2)
fit!(model, Xtrain)

The Y-predictions are given by:

pred = predict(model, Xtest).pred

Any step(s) of data preprocessing can obviously be implemented before the modeling, either outside of the given predictive pipeline or included directly in the pipeline, for instance:

degree = 2    # de-trending with polynom degree 2
nlv = 15
kern = :krbf ; gamma = .001 ; cost = 1000
model1 = detrend_pol(; degree)
model2 = pcasvd(; nlv)
model3 = svmr(; kern, gamma, cost)
model = pip(model1, model2, model3)

c) Example of LWR Naes et al. 1990

The LWR algorithm of Naes et al. (1990) consists of a preliminary global PCA on the data, followed by a kNN locally weighted multiple linear regression (kNN-LWMLR) on the global PCA scores.

The pipeline is defined by:

nlv = 25
metric = :eucl ; h = 2 ; k = 200
model1 = pcasvd(; nlv)
model2 = lwmlr(; metric, h, k)
model = pip(model1, model2)

Naes et al., 1990. Analytical Chemistry 664–673.

d) Example of Shen et al. 2019

The pipeline of Shen et al. (2019) consists of a preliminary global PLSR on the data, followed by a kNN-PLSR on the global PLSR scores.

The pipeline is defined by:

nlv = 25
metric = :mah ; h = Inf ; k = 200
model1 = plskern(; nlv)
model2 = lwplsr(; metric, h, k)
model = pip(model1, model2)

Shen et al., 2019. Journal of Chemometrics, 33(5) e3117.

Credit

Author

Matthieu Lesnoff
contact: [email protected]

How to cite

Lesnoff, M. 2021. Jchemo: Chemometrics and machine learning on high-dimensional data with Julia. https://github.com/mlesnoff/Jchemo. UMR SELMET, Univ Montpellier, CIRAD, INRA, Institut Agro, Montpellier, France

Acknowledgments
