Regressor cat #130

Merged
merged 14 commits into from Apr 5, 2024
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4
9 changes: 9 additions & 0 deletions HISTORY.rst
@@ -2,6 +2,15 @@
 History
 =======

+0.1.4 (2024-04-**)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* New file preprocessing.py with classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
+* Titanic dataset added
+* Accuracy metric implemented
+
 0.1.3 (2024-03-07)
 ------------------
2 changes: 2 additions & 0 deletions README.rst
@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)

+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========
59 changes: 38 additions & 21 deletions docs/api.rst
@@ -4,8 +4,8 @@ Qolmat API

 .. currentmodule:: qolmat

-Imputers
-=========
+Imputers API
+============

.. autosummary::
:toctree: generated/
@@ -15,10 +15,8 @@ Imputers
    imputations.imputers.ImputerKNN
    imputations.imputers.ImputerInterpolation
    imputations.imputers.ImputerLOCF
-   imputations.imputers.ImputerMedian
-   imputations.imputers.ImputerMean
+   imputations.imputers.ImputerSimple
    imputations.imputers.ImputerMICE
-   imputations.imputers.ImputerMode
    imputations.imputers.ImputerNOCB
    imputations.imputers.ImputerOracle
    imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@
    imputations.imputers.ImputerSoftImpute
    imputations.imputers.ImputerShuffle

-Comparator
-===========
+Comparator API
+==============

.. autosummary::
:toctree: generated/
:template: class.rst

benchmark.comparator.Comparator

-Missing Patterns
-================
+Missing Patterns API
+====================

.. autosummary::
:toctree: generated/
@@ -51,8 +49,8 @@
benchmark.missing_patterns.GroupedHoleGenerator


-Metrics
-=======
+Metrics API
+===========

.. autosummary::
:toctree: generated/
@@ -63,6 +61,7 @@
benchmark.metrics.mean_absolute_error
benchmark.metrics.mean_absolute_percentage_error
benchmark.metrics.weighted_mean_absolute_percentage_error
+   benchmark.metrics.accuracy
benchmark.metrics.dist_wasserstein
benchmark.metrics.kl_divergence
benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@
benchmark.metrics.pattern_based_weighted_mean_metric


-RPCA engine
-================
+RPCA engine API
+===============

.. autosummary::
:toctree: generated/
:template: class.rst

-   imputations.rpca.rpca_pcp.RPCAPCP
-   imputations.rpca.rpca_noisy.RPCANoisy
+   imputations.rpca.rpca_pcp.RpcaPcp
+   imputations.rpca.rpca_noisy.RpcaNoisy


-EM engine
-================
+Expectation-Maximization engine API
+===================================

.. autosummary::
:toctree: generated/
@@ -96,8 +95,8 @@
imputations.em_sampler.MultiNormalEM
imputations.em_sampler.VARpEM

-Diffusion engine
-================
+Diffusion Model engine API
+==========================

.. autosummary::
:toctree: generated/
@@ -107,9 +106,27 @@
imputations.diffusions.ddpms.TabDDPM
imputations.diffusions.ddpms.TsDDPM

+Preprocessing API
+=================
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   imputations.preprocessing.MixteHGBM
+   imputations.preprocessing.BinTransformer
+   imputations.preprocessing.OneHotEncoderProjector
+   imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+
+   imputations.preprocessing.make_pipeline_mixte_preprocessing
+   imputations.preprocessing.make_robust_MixteHGB

-Utils
-================
+Utils API
+=========

.. autosummary::
:toctree: generated/
19 changes: 10 additions & 9 deletions docs/imputers.rst
@@ -3,16 +3,16 @@ Imputers

All imputers can be found in the ``qolmat.imputations`` folder.

-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/shuffle)
+-------------------------------
+Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
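The column-wise logic can be sketched in plain Python (a minimal stdlib illustration, not the qolmat implementation; `impute_column` is a hypothetical helper):

```python
from statistics import mean, median

def impute_column(values, strategy="mean"):
    # Hypothetical helper: fill None entries with the column mean or median.
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

col = [1.0, None, 3.0, None, 5.0]
print(impute_column(col, "mean"))  # mean of [1, 3, 5] is 3.0
```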

2. LOCF
-------
Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
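LOCF can be sketched as follows (a stdlib illustration, not the qolmat implementation; leading missing values stay missing since nothing precedes them):

```python
def locf(values):
    # Carry the last observed value forward over missing (None) entries.
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

print(locf([None, 2.0, None, None, 5.0, None]))
```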

-3. interpolation (on residuals)
--------------------------------
+3. Time interpolation and TSA decomposition
+-------------------------------------------
Imputes missing values using interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolating the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, and the series are then re-seasonalised. This is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
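The linear interpolation step applied to interior gaps can be sketched in plain Python (illustrative only; `pd.Series.interpolate` handles edge cases this sketch ignores):

```python
def interpolate_linear(values):
    # Linearly interpolate interior gaps (None) between observed points.
    vals = list(values)
    known = [i for i, v in enumerate(vals) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (vals[b] - vals[a]) / (b - a)
        for i in range(a + 1, b):
            vals[i] = vals[a] + step * (i - a)
    return vals

print(interpolate_linear([0.0, None, None, 3.0]))  # -> [0.0, 1.0, 2.0, 3.0]
```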


@@ -28,7 +28,7 @@
Two cases are considered.

**RPCA via Principal Component Pursuit (PCP)** [1, 12]

-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem

.. math::
\text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,7 +38,7 @@
See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implementation details.
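In typical PCP solvers, the sparse component :math:`\mathbf{A}` is updated with the elementwise soft-thresholding operator, i.e. the proximal operator of the :math:`\ell_1` norm. A stdlib sketch of that operator alone, not of qolmat's solver:

```python
def soft_threshold(x, lam):
    # Proximal operator of lam * |x|: shrink towards zero, clip small values to 0.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

print([soft_threshold(v, 0.5) for v in [-2.0, -0.3, 0.0, 0.4, 1.5]])
```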

**Noisy RPCA** [2, 3, 4]

-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following

.. math::
\text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
@@ -91,6 +91,7 @@
We estimate the distribution parameter :math:`\theta` by likelihood maximization.
Once the parameter :math:`\theta^*` has been estimated the final data imputation can be done in two different ways, depending on the value of the argument `method`:

* `mle`: Returns the maximum likelihood estimator

.. math::
X^* = \mathrm{argmax}_X L(X, \theta^*)

@@ -115,8 +116,8 @@
In training phase, we use the self-supervised learning method of [9] to train in

In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. The dataset is pre-processed with a sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].
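The sliding-window partitioning described above can be sketched as follows (illustrative only; the window size and stride here are assumptions, not TsDDPM's actual defaults):

```python
def sliding_windows(series, size, stride=1):
    # Partition a sequence into overlapping fixed-length windows.
    return [series[i:i + size] for i in range(0, len(series) - size + 1, stride)]

print(sliding_windows([1, 2, 3, 4, 5], size=3))  # -> [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```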

-References
-----------
+References (Imputers)
+---------------------

[1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.

1 change: 1 addition & 0 deletions docs/index.rst
@@ -16,6 +16,7 @@

imputers
examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
examples/tutorials/plot_tuto_diffusion_models

.. toctree::
2 changes: 1 addition & 1 deletion environment.dev.yml
@@ -16,7 +16,7 @@ dependencies:
- python=3.8
- pip=23.0.1
- scipy=1.10.1
-  - scikit-learn=1.2.2
+  - scikit-learn=1.3.2
- sphinx=4.3.2
- sphinx-gallery=0.10.1
- sphinx_rtd_theme=1.0.0
2 changes: 1 addition & 1 deletion examples/RPCA.md
@@ -199,7 +199,7 @@ plt.show()

```python
%%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)
# imputed = X