Feature: Preprocessing models (#104)
* add `Binarizer` in the `SKLEARN_PREPROCESSING_TABLE`

* add preprocessing models' tests runner

* add `Binarizer` test

* `CHANGELOG.md` updated

* `SUPPORTED_MODELS.md` updated

* fix : index error in `SUPPORTED_MODELS.md` fixed.

* update : extra lines added to `binarizer.py`.

* add : test for `OneHotEncoder` added.

* fix : `toarray()` added to `one_hot_encoder.py`.

* fix : trailing white spaces removed.

* add : tests added for `LabelBinarizer`.

* add : tests for `LabelEncoder` added.

* add : tests added for `StandardScaler`.

* update : preprocessing tests updated.

* add : `FunctionTransformer` preprocessing model added.

* .gitignore updated

* functionTransporter enhanced to support numpy ufunc

* preprocessings test dedicated utils added

* storage directory added containing preprocessings exported files in json

* add full end to end transporting to preprocessing modules tests

* add FunctionTransporter to Preprocessing chain

* fix : codacy issues fixed.

* add : `KBinsDiscretizer` preprocessing model added.

* fix[?] : nd arrays stack in the general data structure.

* [revert] : `KBinsDiscretizer` reverted temporary.

* add : `KernelCenterer` preprocessing model added.

* update : minor updates in preprocessing tests.

* add : `MultiLabelBinarizer` preprocessing model

* add : `MaxAbsScaler` preprocessing module added.

* add : `Normalizer` preprocessing model added.

* add : `OrdinalEncoder` preprocessing added. [:skull:] check the Nan value in json for parsing.

* fix : codacy issues fixed.

* add : `PolynomialFeatures` preprocessing model added.

* add : `PowerTransformer` preprocessing model added.

* fix : typo fixed for `RobustScaler`.

* add : `TargetEncoder` preprocessing module added.

* [revert] : `TargetEncoder` implementation reverted.

* add : `QuantileTransformer` preprocessing model.

* prevent "object" dtype dictating

* prevent redundant reshaping

* `KBinsDiscretizer` added to `pymilo_param.py`

* add check prefix list function

* add `KBinsDiscretizer` test case

* fix dtype missing

* prevent index out of bound

* remove trailing whitespaces

* `kbins_discretizer` added to `test_preprocessings.py`

* remove trailing whitespaces

* add inner preprocessing module transportation (1 depth)

* `PowerTransformer` added

* add `PowerTransformer` test case

* `power_transformer` added to `test_preprocessings.py`

* `SplineTransformer` added

* `spline_transformer` added to `test_preprocessings.py`

* `SplineTransformer` test case added

* `BSpline` transporting added to `preprocessing_transporter.py`

* remove unused import

* add exception handling to `SplineTransformer` import

* add exception handling to `SplineTransformer` testcase

* add `TargetEncoder` to `pymilo_param.py`

* add `TargetEncoder` testcase

* TargetEncoder testcase added

* apply dtype dictation only if inner items don't have any

* enhance `target_encoder` to be comparable before after pymiloing

* remove unused variable + fixing the exception type

* `CHANGELOG.md` updated

* `CHANGELOG.md` updated

* add `numpy.nan` to NUMPY_TYPE_DICT

* add `NaN` type transportation

* `CHANGELOG.md` updated

* refactor function's core functionality and make it simpler

* make for iterator simpler & faster

* remove try, except and decrease the complexity

* fulfill preprocessing table

* `CHANGELOG.md` enhanced

* `autopep8.sh` applied

* fix bug

* `README.md` updated

* revert : changelog older log reverted.

* update : requirements updated.

* remove : extra line removed in `CHANGELOG.md`.

* update : last update tag in `SUPPORTED_MODELS.md` updated.

* change : `a` -> `list1` and `b` -> `list2`.

* add : `scipy` added to `dev-requirements.txt`.

* remove unused import has_named_parameter

---------

Co-authored-by: Sadra Sabouri
Co-authored-by: AHReccese
3 people authored Jun 4, 2024
1 parent 0166ff5 commit fe4fc8b
Showing 32 changed files with 795 additions and 30 deletions.
9 changes: 1 addition & 8 deletions .gitignore
@@ -99,12 +99,5 @@ ENV/
 out
 gen

-/tests/exported_linear_models
-/tests/exported_neural_networks
-/tests/exported_decision_trees
-/tests/exported_clusterings
-/tests/exported_naive_bayes
-/tests/exported_svms
-/tests/exported_neighbors
-/tests/exported_ensembles
 /.VSCodeCounter
+/tests/exported*
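The single wildcard entry is meant to cover every per-family export directory that was previously listed by hand. A rough way to check that, assuming Python's `fnmatch` is an acceptable stand-in for gitignore matching in this simple case (gitignore has its own rules, for example `*` does not cross `/` there, and the leading `/` anchors the pattern to the repository root):

```python
from fnmatch import fnmatchcase

# Directories the old .gitignore listed one by one.
OLD_ENTRIES = [
    "tests/exported_linear_models",
    "tests/exported_neural_networks",
    "tests/exported_decision_trees",
    "tests/exported_clusterings",
    "tests/exported_naive_bayes",
    "tests/exported_svms",
    "tests/exported_neighbors",
    "tests/exported_ensembles",
]

# The new pattern, written without gitignore's leading "/" anchor.
PATTERN = "tests/exported*"


def covered(paths, pattern=PATTERN):
    """Return the paths that the wildcard pattern matches."""
    return [path for path in paths if fnmatchcase(path, pattern)]


# "tests/exported_preprocessings" is a hypothetical directory name used
# only to illustrate that new export folders are matched as well.
matched = covered(OLD_ENTRIES + ["tests/exported_preprocessings"])
print(len(matched))
```

The practical effect is that a new family of exported test artifacts no longer needs its own ignore entry.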
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -6,7 +6,32 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

## [Unreleased]
### Added
- `prefix_list` function in `utils.util.py`
- `KBinsDiscretizer` preprocessing model
- `PowerTransformer` preprocessing model
- `SplineTransformer` preprocessing model
- `TargetEncoder` preprocessing model
- `QuantileTransformer` preprocessing model
- `RobustScaler` preprocessing model
- `PolynomialFeatures` preprocessing model
- `OrdinalEncoder` preprocessing model
- `Normalizer` preprocessing model
- `MaxAbsScaler` preprocessing model
- `MultiLabelBinarizer` preprocessing model
- `KernelCenterer` preprocessing model
- `FunctionTransformer` preprocessing model
- `Binarizer` preprocessing model
- Preprocessing models test runner
### Changed
- `NaN` type in `pymilo_param`
- `NaN` type transportation in `GeneralDataStructureTransporter` Transporter
- `BSpline` Transportation in `PreprocessingTransporter` Transporter
- one layer deeper transportation in `PreprocessingTransporter` Transporter
- dictating outer ndarray dtype in `GeneralDataStructureTransporter` Transporter
- preprocessing params fulfilled in `pymilo_param`
- `SUPPORTED_MODELS.md` updated
- `README.md` updated
- `serialize_possible_ml_model` in the Ensemble chain
## [0.8] - 2024-05-06
### Added
- `StandardScaler` Transformer in `pymilo_param.py`
2 changes: 1 addition & 1 deletion README.md
@@ -114,7 +114,7 @@ PyMilo is an open source Python package that provides a simple, efficient, and s
 | Nearest Neighbors &#x2705; | - |
 | Ensemble Models &#x2705; | - |
 | Pipeline Model &#x2705; | - |
-| Preprocessing Models | - |
+| Preprocessing Models &#x2705; | - |

 Details are available in [Supported Models](https://github.com/openscilab/pymilo/blob/main/SUPPORTED_MODELS.md).

73 changes: 72 additions & 1 deletion SUPPORTED_MODELS.md
@@ -1,6 +1,6 @@
 # Supported Models

-**Last Update: 2024-04-24**
+**Last Update: 2024-05-30**


 <h2 id="scikit-learn">Scikit-Learn</h2>
@@ -635,4 +635,75 @@
 <td><b>StandardScaler</b></td>
 <td>>=0.8</td>
 </tr>
+<tr align="center">
+<td>5</td>
+<td><b>Binarizer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>6</td>
+<td><b>FunctionTransformer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>7</td>
+<td><b>KernelCenterer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>8</td>
+<td><b>MultiLabelBinarizer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>9</td>
+<td><b>MaxAbsScaler</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>10</td>
+<td><b>Normalizer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>11</td>
+<td><b>OrdinalEncoder</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>12</td>
+<td><b>PolynomialFeatures</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>13</td>
+<td><b>RobustScaler</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>14</td>
+<td><b>QuantileTransformer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>15</td>
+<td><b>KBinsDiscretizer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>16</td>
+<td><b>PowerTransformer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>17</td>
+<td><b>SplineTransformer</b></td>
+<td>>=0.9</td>
+</tr>
+<tr align="center">
+<td>18</td>
+<td><b>TargetEncoder</b></td>
+<td>>=0.9</td>
+</tr>
+
 </table>
3 changes: 2 additions & 1 deletion dev-requirements.txt
@@ -1,8 +1,9 @@
 numpy==1.26.4
 scikit-learn==1.5.0
+scipy>=0.19.1
 setuptools>=40.8.0
 vulture>=1.0
 bandit>=1.5.1
 pydocstyle>=3.0.0
 pytest>=4.3.1
-pytest-cov>=2.6.1
+pytest-cov>=2.6.1
4 changes: 3 additions & 1 deletion otherfiles/meta.yaml
@@ -17,8 +17,10 @@ requirements:
     - setuptools
     - python >=3.6
   run:
-    - art >=1.8
     - python >=3.6
+    - numpy >=1.9.0
+    - scikit-learn >=0.22.2
+    - scipy >=0.19.1
 about:
   home: https://github.com/openscilab/pymilo
   license: MIT
29 changes: 29 additions & 0 deletions pymilo/pymilo_param.py
@@ -67,6 +67,20 @@
 except BaseException:
     pass

+spline_transformer_support = False
+try:
+    from sklearn.preprocessing import SplineTransformer
+    spline_transformer_support = True
+except BaseException:
+    pass
+
+target_encoder_support = False
+try:
+    from sklearn.preprocessing import TargetEncoder
+    target_encoder_support = True
+except BaseException:
+    pass
+

 PYMILO_VERSION = "0.8"
 NOT_SUPPORTED = "NOT_SUPPORTED"
@@ -202,6 +216,20 @@
     "OneHotEncoder": preprocessing.OneHotEncoder,
     "LabelBinarizer": preprocessing.LabelBinarizer,
     "LabelEncoder": preprocessing.LabelEncoder,
+    "Binarizer": preprocessing.Binarizer,
+    "FunctionTransformer": preprocessing.FunctionTransformer,
+    "KernelCenterer": preprocessing.KernelCenterer,
+    "MultiLabelBinarizer": preprocessing.MultiLabelBinarizer,
+    "MaxAbsScaler": preprocessing.MaxAbsScaler,
+    "Normalizer": preprocessing.Normalizer,
+    "OrdinalEncoder": preprocessing.OrdinalEncoder,
+    "PolynomialFeatures": preprocessing.PolynomialFeatures,
+    "RobustScaler": preprocessing.RobustScaler,
+    "QuantileTransformer": preprocessing.QuantileTransformer,
+    "KBinsDiscretizer": preprocessing.KBinsDiscretizer,
+    "PowerTransformer": preprocessing.PowerTransformer,
+    "SplineTransformer": SplineTransformer if spline_transformer_support else NOT_SUPPORTED,
+    "TargetEncoder": TargetEncoder if target_encoder_support else NOT_SUPPORTED,
 }

 KEYS_NEED_PREPROCESSING_BEFORE_DESERIALIZATION = {
@@ -223,6 +251,7 @@
     "numpy.uint8": np.uint8,
     "numpy.uint64": np.uint64,
     "numpy.dtype": np.dtype,
+    "numpy.nan": np.nan,
 }

 EXPORTED_MODELS_PATH = {
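The hunks above guard two imports that only exist in newer scikit-learn releases and fall back to the `NOT_SUPPORTED` marker in the lookup table. A minimal sketch of the same idea follows. The helper names `optional_attr` and `supported` are hypothetical, and the table uses standard-library names so the example runs without scikit-learn. It also catches only `ImportError` and `AttributeError`, whereas the diff catches `BaseException`, which additionally swallows things like `KeyboardInterrupt`.

```python
import importlib

NOT_SUPPORTED = "NOT_SUPPORTED"


def optional_attr(module_name, attr_name):
    """Return module.attr if available, otherwise the NOT_SUPPORTED marker.

    Hypothetical helper: the diff uses try/except around
    `from sklearn.preprocessing import ...` instead.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except (ImportError, AttributeError):
        # ImportError: the module is missing.
        # AttributeError: the module exists but lacks the name.
        return NOT_SUPPORTED


# Stand-in table; a real table would point at sklearn.preprocessing classes.
TABLE = {
    "OrderedDict": optional_attr("collections", "OrderedDict"),
    "Missing": optional_attr("collections", "NoSuchClass"),
}


def supported(table):
    """Keep only the entries whose class could actually be imported."""
    return {name: obj for name, obj in table.items() if obj != NOT_SUPPORTED}


print(sorted(supported(TABLE)))
```

Keeping the marker in the table, rather than omitting the key, lets callers distinguish "known but unavailable in this environment" from "unknown model name".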
18 changes: 12 additions & 6 deletions pymilo/transporters/function_transporter.py
@@ -4,6 +4,7 @@
 from ..utils.util import import_function, check_str_in_iterable
 from .transporter import AbstractTransporter
 from types import FunctionType
+from numpy import ufunc

 array_function_dispatcher_support = False
 try:
@@ -30,12 +31,17 @@ def serialize(self, data, key, model_type):
         :type model_type: str
         :return: pymilo serialized output of data[key]
         """
-        if isinstance(
-                data[key],
-                FunctionType) or (
-                array_function_dispatcher_support and isinstance(
-                    data[key],
-                    _ArrayFunctionDispatcher)):
+        if isinstance(data[key], ufunc):
+            function = data[key]
+            data[key] = {
+                "function_name": function.__name__,
+                "function_module": "numpy",
+            }
+            return data[key]
+
+        elif isinstance(data[key], FunctionType) or (
+            array_function_dispatcher_support and
+                isinstance(data[key], _ArrayFunctionDispatcher)):
             function = data[key]
             data[key] = {
                 "function_name": function.__name__,
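The new branch stores a NumPy ufunc as a small dictionary holding its name and the module `"numpy"`. The round trip can be sketched as follows, assuming the ufunc is reachable from the top-level `numpy` namespace. The function names `serialize_function` and `deserialize_function` are hypothetical and are not the transporter's actual methods.

```python
import importlib
import json

import numpy as np


def serialize_function(func):
    """Describe a NumPy ufunc by name and module so it can be stored as JSON."""
    if isinstance(func, np.ufunc):
        # Mirrors the diff: the module is recorded as the literal "numpy".
        return {"function_name": func.__name__, "function_module": "numpy"}
    raise TypeError("only NumPy ufuncs are handled in this sketch")


def deserialize_function(payload):
    """Look the function up again from its recorded module and name."""
    module = importlib.import_module(payload["function_module"])
    return getattr(module, payload["function_name"])


payload = serialize_function(np.log1p)
text = json.dumps(payload)               # the payload is plain JSON
restored = deserialize_function(json.loads(text))
print(restored is np.log1p)
```

This is the case that matters for `FunctionTransformer(func=np.log1p)`, where the transformer's state is a reference to a function and not numeric data.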
23 changes: 19 additions & 4 deletions pymilo/transporters/general_data_structure_transporter.py
@@ -5,8 +5,8 @@

 from ..pymilo_param import NUMPY_TYPE_DICT

-from ..utils.util import get_homogeneous_type, all_same
-from ..utils.util import is_primitive, is_iterable, check_str_in_iterable
+from ..utils.util import get_homogeneous_type, all_same, prefix_list
+from ..utils.util import is_primitive, check_str_in_iterable

 from .transporter import AbstractTransporter

@@ -93,7 +93,13 @@ def serialize(self, data, key, model_type):
         :type model_type: str
         :return: pymilo serialized output of data[key]
         """
-        if isinstance(data[key], type):
+        if not (isinstance(data[key], object) or isinstance(data[key], str)):
+            if np.isnan(data[key]):  # throws exception on object & str types
+                data[key] = {
+                    "np-type": "numpy.nan",
+                    "value": "NaN"
+                }
+        elif isinstance(data[key], type):
             raw_type = str(data[key])
             raw_type = "numpy" + str(raw_type).split("numpy")[-1][:-2]
             if raw_type in NUMPY_TYPE_DICT.keys():
@@ -281,6 +287,8 @@ def get_deserialized_regular_primary_types(self, content):
         if "np-type" in content:
             if content["np-type"] == "numpy.dtype":
                 return NUMPY_TYPE_DICT[content["np-type"]](NUMPY_TYPE_DICT[content['value']])
+            if content["np-type"] == "numpy.nan":
+                return NUMPY_TYPE_DICT[content["np-type"]]
             return NUMPY_TYPE_DICT[content["np-type"]](content['value'])

     def is_numpy_primary_type(self, content):
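The two hunks above encode `NaN` as a tagged dictionary on export and map the tag back to `np.nan` on import, which sidesteps the fact that standard JSON has no `NaN` literal. A stdlib-only sketch of that round trip is below. The function names are hypothetical, and the NaN test is written as an explicit float check, since in Python every value is an instance of `object` and a guard that excludes `object` would never pass.

```python
import json
import math

# Same tag layout as in the diff.
NAN_MARKER = {"np-type": "numpy.nan", "value": "NaN"}


def is_nan(value):
    """True only for float NaN; strings, None and other objects are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def serialize_value(value):
    """Replace NaN with the tagged dictionary; pass everything else through."""
    return dict(NAN_MARKER) if is_nan(value) else value


def deserialize_value(content):
    """Turn the tagged dictionary back into a float NaN."""
    if isinstance(content, dict) and content.get("np-type") == "numpy.nan":
        return float("nan")
    return content


# allow_nan=False makes json.dumps raise if a raw NaN slipped through.
encoded = json.dumps(
    [serialize_value(v) for v in [1.5, float("nan"), "a"]],
    allow_nan=False,
)
decoded = [deserialize_value(v) for v in json.loads(encoded)]
print(decoded[0], math.isnan(decoded[1]), decoded[2])
```

Note that `isinstance(value, float)` also accepts `numpy.float64`, which subclasses `float`, but not `numpy.float32`; a fuller version would need to decide how to treat other NumPy scalar types.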
@@ -450,4 +458,11 @@ def deep_deserialize_ndarray(self, deserialized_ndarray):
                 new_list.append(item)
             else:
                 new_list.append(item)
-        return np.asarray(new_list, dtype=dtype).reshape(shape)
+
+        pre_result = np.asarray(new_list, dtype=dtype)
+        if dtype == "object" and hasattr(new_list[0], "dtype"):
+            # check if inner items have specific dtype.
+            pre_result = np.asarray(new_list)
+        if not prefix_list(list(pre_result.shape), shape):
+            return pre_result.reshape(shape)
+        return pre_result
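The rewritten return path does two things: it lets NumPy infer the dtype when the stored dtype is `"object"` but the inner items are themselves typed arrays, and it skips the reshape when the stored shape is already a prefix of the rebuilt array's shape. The body of `prefix_list` is not part of this diff, so the helper below is an assumed stand-in (named `is_prefix`, with its own argument order), and `rebuild` is a hypothetical wrapper around the same three steps.

```python
import numpy as np


def is_prefix(candidate, full):
    """Assumed stand-in for prefix_list: True if `candidate` equals the
    leading entries of `full`."""
    candidate, full = list(candidate), list(full)
    return len(candidate) <= len(full) and full[:len(candidate)] == candidate


def rebuild(items, dtype, shape):
    """Rebuild an ndarray from deserialized items, a stored dtype and shape."""
    result = np.asarray(items, dtype=dtype)
    if dtype == "object" and hasattr(items[0], "dtype"):
        # Inner items carry their own dtype, so do not force "object".
        result = np.asarray(items)
    if not is_prefix(shape, result.shape):
        # Stored shape disagrees with the stacked shape: reshape explicitly.
        return result.reshape(shape)
    return result


# Inner arrays stack to (2, 2); the stored outer shape [2] is a prefix,
# so no reshape happens and the float dtype is preserved.
stacked = rebuild([np.array([1.0, 2.0]), np.array([3.0, 4.0])], "object", [2])

# A flat list with a stored 2x2 shape still gets reshaped.
reshaped = rebuild([1, 2, 3, 4], "int64", [2, 2])
print(stacked.shape, stacked.dtype, reshaped.shape)
```

Without the prefix check, the first call would try `reshape([2])` on a four-element array and fail, which is presumably the "redundant reshaping" the commit message refers to.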