Regressor cat #130

Merged
merged 14 commits into from Apr 5, 2024
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4
9 changes: 9 additions & 0 deletions HISTORY.rst
@@ -2,6 +2,15 @@
 History
 =======

+0.1.4 (2024-04-**)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* New file preprocessing.py with classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
+* Titanic dataset added
+* Accuracy metric implemented
+
 0.1.3 (2024-03-07)
 ------------------
2 changes: 2 additions & 0 deletions README.rst
@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)

+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========
59 changes: 38 additions & 21 deletions docs/api.rst
@@ -4,8 +4,8 @@ Qolmat API

 .. currentmodule:: qolmat

-Imputers
-=========
+Imputers API
+============

.. autosummary::
:toctree: generated/
@@ -15,10 +15,8 @@ Imputers
    imputations.imputers.ImputerKNN
    imputations.imputers.ImputerInterpolation
    imputations.imputers.ImputerLOCF
-   imputations.imputers.ImputerMedian
-   imputations.imputers.ImputerMean
+   imputations.imputers.ImputerSimple
    imputations.imputers.ImputerMICE
-   imputations.imputers.ImputerMode
    imputations.imputers.ImputerNOCB
    imputations.imputers.ImputerOracle
    imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@
    imputations.imputers.ImputerSoftImpute
    imputations.imputers.ImputerShuffle

-Comparator
-===========
+Comparator API
+==============

.. autosummary::
:toctree: generated/
:template: class.rst

benchmark.comparator.Comparator

-Missing Patterns
-================
+Missing Patterns API
+====================

.. autosummary::
:toctree: generated/
@@ -51,8 +49,8 @@
benchmark.missing_patterns.GroupedHoleGenerator


-Metrics
-=======
+Metrics API
+===========

.. autosummary::
:toctree: generated/
@@ -63,6 +61,7 @@
benchmark.metrics.mean_absolute_error
benchmark.metrics.mean_absolute_percentage_error
benchmark.metrics.weighted_mean_absolute_percentage_error
+   benchmark.metrics.accuracy
benchmark.metrics.dist_wasserstein
benchmark.metrics.kl_divergence
benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@
benchmark.metrics.pattern_based_weighted_mean_metric


-RPCA engine
-================
+RPCA engine API
+===============

.. autosummary::
:toctree: generated/
:template: class.rst

-   imputations.rpca.rpca_pcp.RPCAPCP
-   imputations.rpca.rpca_noisy.RPCANoisy
+   imputations.rpca.rpca_pcp.RpcaPcp
+   imputations.rpca.rpca_noisy.RpcaNoisy


-EM engine
-================
+Expectation-Maximization engine API
+===================================

.. autosummary::
:toctree: generated/
@@ -96,8 +95,8 @@
imputations.em_sampler.MultiNormalEM
imputations.em_sampler.VARpEM

-Diffusion engine
-================
+Diffusion Model engine API
+==========================

.. autosummary::
:toctree: generated/
@@ -107,9 +106,27 @@
imputations.diffusions.ddpms.TabDDPM
imputations.diffusions.ddpms.TsDDPM

+Preprocessing API
+=================
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   imputations.preprocessing.MixteHGBM
+   imputations.preprocessing.BinTransformer
+   imputations.preprocessing.OneHotEncoderProjector
+   imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+
+   imputations.preprocessing.make_pipeline_mixte_preprocessing
+   imputations.preprocessing.make_robust_MixteHGB

-Utils
-================
+Utils API
+=========

.. autosummary::
:toctree: generated/
19 changes: 10 additions & 9 deletions docs/imputers.rst
@@ -3,16 +3,16 @@ Imputers

All imputers can be found in the ``qolmat.imputations`` folder.

-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/shuffle)
+-------------------------------
+Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
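The column-wise logic can be sketched in plain Python (a minimal stdlib illustration, not the qolmat implementation; `impute_column` is a hypothetical helper):

```python
from statistics import mean, median

def impute_column(values, strategy="mean"):
    # Hypothetical helper: fill None entries with the column mean or median.
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

col = [1.0, None, 3.0, None, 5.0]
print(impute_column(col, "mean"))  # mean of [1, 3, 5] is 3.0
```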

2. LOCF
-------
Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
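LOCF can be sketched as follows (a stdlib illustration, not the qolmat implementation; leading missing values stay missing since nothing precedes them):

```python
def locf(values):
    # Carry the last observed value forward over missing (None) entries.
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

print(locf([None, 2.0, None, None, 5.0, None]))
```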

-3. interpolation (on residuals)
--------------------------------
+3. Time interpolation and TSA decomposition
+-------------------------------------------
Imputes missing values using interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolating the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, and the series are then re-seasonalised. This is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
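The linear interpolation step applied to interior gaps can be sketched in plain Python (illustrative only; `pd.Series.interpolate` handles edge cases this sketch ignores):

```python
def interpolate_linear(values):
    # Linearly interpolate interior gaps (None) between observed points.
    vals = list(values)
    known = [i for i, v in enumerate(vals) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (vals[b] - vals[a]) / (b - a)
        for i in range(a + 1, b):
            vals[i] = vals[a] + step * (i - a)
    return vals

print(interpolate_linear([0.0, None, None, 3.0]))  # -> [0.0, 1.0, 2.0, 3.0]
```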


@@ -28,7 +28,7 @@
Two cases are considered.

**RPCA via Principal Component Pursuit (PCP)** [1, 12]

-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem

.. math::
\text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,7 +38,7 @@
See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implementation details.
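In typical PCP solvers, the sparse component :math:`\mathbf{A}` is updated with the elementwise soft-thresholding operator, i.e. the proximal operator of the :math:`\ell_1` norm. A stdlib sketch of that operator alone, not of qolmat's solver:

```python
def soft_threshold(x, lam):
    # Proximal operator of lam * |x|: shrink towards zero, clip small values to 0.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

print([soft_threshold(v, 0.5) for v in [-2.0, -0.3, 0.0, 0.4, 1.5]])
```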

**Noisy RPCA** [2, 3, 4]

-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following

.. math::
\text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
@@ -91,6 +91,7 @@
We estimate the distribution parameter :math:`\theta` by likelihood maximization.
Once the parameter :math:`\theta^*` has been estimated the final data imputation can be done in two different ways, depending on the value of the argument `method`:

* `mle`: Returns the maximum likelihood estimator

.. math::
X^* = \mathrm{argmax}_X L(X, \theta^*)

@@ -115,8 +116,8 @@
In training phase, we use the self-supervised learning method of [9] to train in

In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. The dataset is pre-processed with a sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].
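The sliding-window partitioning described above can be sketched as follows (illustrative only; the window size and stride here are assumptions, not TsDDPM's actual defaults):

```python
def sliding_windows(series, size, stride=1):
    # Partition a sequence into overlapping fixed-length windows.
    return [series[i:i + size] for i in range(0, len(series) - size + 1, stride)]

print(sliding_windows([1, 2, 3, 4, 5], size=3))  # -> [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```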

-References
-----------
+References (Imputers)
+---------------------

[1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.

1 change: 1 addition & 0 deletions docs/index.rst
@@ -16,6 +16,7 @@

imputers
examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
examples/tutorials/plot_tuto_diffusion_models

.. toctree::
2 changes: 1 addition & 1 deletion environment.dev.yml
@@ -16,7 +16,7 @@ dependencies:
- python=3.8
- pip=23.0.1
- scipy=1.10.1
-  - scikit-learn=1.2.2
+  - scikit-learn=1.3.2
- sphinx=4.3.2
- sphinx-gallery=0.10.1
- sphinx_rtd_theme=1.0.0
2 changes: 1 addition & 1 deletion examples/RPCA.md
@@ -199,7 +199,7 @@ plt.show()

```python
%%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)
# imputed = X