Supercharge your AI engineering pipelines with `featurewiz-polars`, a new library built on the classic `featurewiz` library and enhanced for high-performance feature engineering and selection using Polars DataFrames.
This library was born out of the need for efficient feature engineering when working with large datasets and the Polars DataFrame library. Traditional feature selection and categorical encoding methods often become computationally expensive and memory-intensive as datasets grow in size and dimensionality.
Specifically, the motivation stemmed from the following challenges:
- Performance limitations with large datasets: Existing feature selection and encoding implementations (often in scikit-learn or Pandas-based libraries) can be slow and inefficient when applied to datasets with millions of rows and hundreds of columns.
- Lack of Polars integration: Many feature engineering tools are designed for Pandas DataFrames, requiring conversions to and from Polars, which introduces overhead and negates the performance benefits of Polars.
- Need for efficient MRMR feature selection: Max-Relevance and Min-Redundancy (MRMR) is a powerful feature selection technique, but efficient implementations optimized for large datasets were needed, especially within the Polars ecosystem.
- Handling diverse data types: The library needed to seamlessly handle both numerical and categorical features, and intelligently apply appropriate imputation strategies for diverse use cases.
- Extensibility and Pipeline Integration: The components should be designed as scikit-learn compatible transformers, allowing for easy integration into machine learning pipelines and workflows.
The new `featurewiz-polars` leverages the Polars library to deliver the following advantages over the classic `featurewiz` library:

- Conquer Overfitting: `featurewiz-polars` validates features on a train-validation split, crushing overfitting and boosting generalization power.
- Rock-Solid Stability: Multiple runs with different splits mean more stable feature selection. Thanks to Polars, stabilization is now lightning fast!
- Big Data? No Sweat! Polars' raw speed and efficiency tame even the largest datasets, making feature selection a breeze.
- XGBoost / Polars Native Integration: `featurewiz-polars` integrates natively and seamlessly with XGBoost, streamlining your entire ML pipeline from start to finish with Polars.

In short, using Polars with the new train-validation-split `recursive_xgboost` method offers a powerful combination: the robust feature selection of classic `featurewiz` plus a stabilization tailwind, giving you more choices for reliable and fast feature selection, particularly for large datasets.
The development of `featurewiz-polars` was an iterative process, driven by an initial request for improved feature engineering, that evolved through several key stages:

- Initial Request & Problem Definition: My journey began with a user request to bring featurewiz's feature selection and encoding capabilities to Polars, particularly for large datasets. The core problem they faced was the inefficiency of existing methods, which cannot scale with pandas to millions of rows.
- Focus on Polars Efficiency: I then made the decision to build the library natively on Polars DataFrames to leverage Polars' blazing speed and memory efficiency. This meant re-implementing my entire feature engineering pipeline using native Polars operations as much as possible. This was much harder than I imagined.
- `Polars_CategoricalEncoder` Development: The first step was to create a fast categorical encoder for Polars. `Polars_CategoricalEncoder` was conceived and developed by refining my original featurewiz implementation, `My_Label_Encoder()`, adding more encoding types (Target, WOE, Ordinal, OneHot), and finally adding handling of NaN and null values in those categories (whew!).
- `Polars_DateTimeEncoder` Development: The next step was to create a fast date-time encoder for Polars. `Polars_DateTimeEncoder` was developed, again with some error handling added for dates.
- Other Transformers Development: The final step was to create a Y-Transformer that can encode categorical target variables for Polars. The `YTransformer` was developed and turned out to be very useful in the scikit-learn pipelines I later built (see below).
- `Polars_SULOV_MRMR` Development & Iteration: The core of the library, `Polars_SULOV_MRMR`, underwent several iterations to optimize performance and correctness:
  - V1-V3: Initially I focused on creating a basic Polars-compatible MRMR selector, copying my featurewiz implementation and porting it to Polars.
  - V4: I then made a critical correction to the calculations in the `recursive_xgboost` process to address the growing volume of selected features that did not add to performance.
  - V5: I found a major stumbling block in that classic process and corrected it in V5. This involved splitting the dataset into train and validation sets and running `recursive_xgboost` within them, thus ensuring a stable number of features in each round of XGBoost feature selection. I have renamed the old approach "classic" and the new approach "split-driven"; you can test both in the library.
- Pipeline Examples & Testing: Pipeline examples (e.g., `fs_test.py` and `featurewiz_polars_test1.ipynb`) were created to test the integration of the encoder and selector within scikit-learn pipelines and to verify the functionality and performance of the library.
- Addressing Date-Time Variables: The library's scope was expanded to include date-time variables (to help in time series tasks) and the filling of NaNs and nulls in Polars DataFrames. Thus a `Polars_MissingTransformer` was added to handle nulls and NaNs efficiently within Polars.
- Testing and Refinement: Throughout the development, I put an intense focus on verifying the correctness of the new algorithms and code, making sure that the new algorithm outperformed my existing classic featurewiz library, particularly in the `recursive_xgboost` method, which I modified.
The `featurewiz-polars` library is not available on PyPI yet (I am still refining it). There are three ways to download and install it on your machine:
1. You can `git clone` this library from source and run it from the terminal as follows:

   ```bash
   git clone https://github.com/AutoViML/featurewiz_polars.git
   cd featurewiz_polars
   pip install -r requirements.txt
   cd examples
   python fs_test.py
   ```

2. You can download and unzip https://github.com/AutoViML/featurewiz_polars/archive/master.zip and then follow the `pip install` instructions above, starting from the terminal in the directory where you unzipped the file.

3. You can install from source with the following terminal command:

   ```bash
   pip install git+https://github.com/AutoViML/featurewiz_polars.git
   ```
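Whichever route you choose, a quick way to confirm the install worked is to import the package from Python (a minimal check, assuming only that the package imports as `featurewiz_polars`, as in the examples below):

```python
# Minimal sanity check: the import should succeed without errors
import featurewiz_polars
print("featurewiz_polars imported successfully")
```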
To help you quickly get started with the `featurewiz-polars` library, I've provided example scripts like `fs_test.py`. These scripts demonstrate how to use the library in a concise manner. Additionally, the `fs_lazytransform_test.py` script lets you compare the performance of `featurewiz-polars` against the `lazytransform` library. For a more in-depth comparison, use `fs_mr_comparison_test.py` to benchmark `featurewiz-polars` against other competitive mRMR feature selection libraries.
If you prefer working in a Jupyter Notebook or Colab, anybody can open a copy of my GitHub-hosted notebooks within Colab. To make it easier, I have created Open-in-Colab links to those notebooks below:
I have also provided code snippets to illustrate how to load a file into Polars DataFrames for use with `featurewiz-polars`.
### Load data into Polars DataFrames using:
```python
import polars as pl

df = pl.read_csv(datapath + filename, null_values=['NULL', 'NA'], try_parse_dates=True,
                 infer_schema_length=10000, ignore_errors=True)
```
### Split the Polars DataFrame into train and test using scikit-learn's train_test_split:

```python
from sklearn.model_selection import train_test_split

target = 'target'
predictors = [x for x in df.columns if x not in [target]]

X = df[predictors]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
This section demonstrates two distinct ways to use the `featurewiz-polars` library: feature selection only, and feature selection combined with model training.
This approach is useful when you want to pre-process your data and select the most relevant features before feeding them into a separate model training pipeline. It's particularly helpful if you want to experiment with different models using the same selected features.
Here's how to use `Featurewiz_MRMR` for feature selection in a classification scenario, where the categorical target variable "y" needs to be transformed into a numerical representation:
```python
from featurewiz_polars import Featurewiz_MRMR

# Initialize Featurewiz_MRMR for classification, with XGBoost doing the feature selection (estimator=None)
mrmr = Featurewiz_MRMR(model_type="Classification", estimator=None,
                       corr_threshold=0.7, encoding_type='onehot', classic=True, verbose=0)

# Fit and transform the training data (X_train, y_train)
X_transformed, y_transformed = mrmr.fit_transform(X_train, y_train)

# Transform the test data (X_test)
X_test_transformed = mrmr.transform(X_test)

# Transform the test target variable (y_test)
y_test_transformed = mrmr.y_encoder.transform(y_test)
```
Key Points:

- We use the `Featurewiz_MRMR` class for feature selection.
- The `estimator` argument selects the model that performs the feature selection. You can try other estimators; currently only XGBoost, RandomForest, and LightGBM are allowed. CatBoost is available but gives an error with Polars.
- The `fit_transform` method fits the feature selection process on the training data and simultaneously transforms it.
- The `transform` method is used separately to transform the test data, applying the same feature selection learned from the training data.
- The `y_encoder` is used to transform the target variable if it's categorical.
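Once the features are selected, you can feed the transformed data into any downstream model. Here is a minimal sketch (not part of the library) that assumes the transformed outputs convert cleanly to NumPy arrays, which holds for Polars DataFrames and Series:

```python
# Minimal sketch: train any scikit-learn model on the features selected above
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# np.asarray works for Polars DataFrames/Series as well as plain NumPy arrays
X_tr = np.asarray(X_transformed)
X_te = np.asarray(X_test_transformed)
y_tr = np.asarray(y_transformed).ravel()
y_te = np.asarray(y_test_transformed).ravel()

clf = RandomForestClassifier(random_state=42)
clf.fit(X_tr, y_tr)
print("Test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```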
This approach combines feature selection and model training into a single pipeline. It's useful when you want to streamline the entire process and train a model directly on the selected features.
Here's how to use `Featurewiz_MRMR_Model` for both feature selection and model training in a regression scenario. It assumes that the target variable "y" is already numerical.
```python
from featurewiz_polars import Featurewiz_MRMR_Model
from xgboost import XGBRegressor

# Initialize Featurewiz_MRMR_Model for regression with an XGBoost regressor
mrmr_model = Featurewiz_MRMR_Model(model_type="Regression", model=XGBRegressor(),
                                   corr_threshold=0.7, encoding_type='onehot', classic=True, verbose=0)

# Fit and transform the training data (X_train, y_train)
X_transformed, y_transformed = mrmr_model.fit_transform(X_train, y_train)

# Make predictions on the test data (X_test) - handles both transform and predict in one step
y_pred = mrmr_model.predict(X_test)
```
Key Points:

- We use the `Featurewiz_MRMR_Model` class to combine feature selection and model training.
- The `fit_transform` method fits the feature selection process and trains the specified model on the training data.
- The `predict` method handles both transforming the test data using the learned feature selection and making predictions with the trained model, streamlining the entire process.
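To see how well the trained pipeline generalizes, you can score its predictions with any scikit-learn metric. Here is a small sketch (not part of the library), assuming `y_test` and `y_pred` convert cleanly to NumPy arrays:

```python
# Minimal sketch: evaluate the regression predictions from the pipeline above
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.asarray(y_test).ravel()
y_hat = np.asarray(y_pred).ravel()

print("RMSE:", np.sqrt(mean_squared_error(y_true, y_hat)))
print("R^2 :", r2_score(y_true, y_hat))
```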
The `Featurewiz_MRMR_Model` class initializes the pipeline with a built-in Random Forest model (which you can change; see below) for building data pipelines that combine Polars-based feature engineering, feature selection, and model training. You need to load your data into Polars DataFrames and then call these pipelines. The class accepts the following arguments:
- `estimator` (estimator object, optional): The estimator used by featurewiz to perform the feature selection. You can try other estimators, but currently only XGBoost, RandomForest, and LightGBM are allowed. CatBoost is available but gives an error with Polars.
- `model` (estimator object, optional): The estimator used in the pipeline to train a new model after feature selection. If `None`, a default estimator (Random Forest) will be trained after selection. Defaults to `None`. This `model` argument can be different from the `estimator` argument above.
- `model_type` (str, optional): The type of model to be built (`'classification'` or `'regression'`). Determines the appropriate preprocessing and feature selection strategies. Defaults to `'classification'`.
- `encoding_type` (str, optional): The type of encoding to apply to categorical features (`'target'`, `'onehot'`, etc.). `'woe'` encoding is only available for classification model types. Defaults to `'target'`.
- `imputation_strategy` (str, optional): The strategy for handling missing values (`'mean'`, `'median'`, `'zeros'`). Determines how missing data will be filled in before feature selection. Defaults to `'mean'`.
- `corr_threshold` (float, optional): The correlation threshold for removing highly correlated features. Features with a correlation above this threshold will be targeted for removal. Defaults to `0.7`.
- `classic` (bool, optional): If `True`, implements the original classic `featurewiz` algorithm using Polars. If `False`, implements the train-validation-split recursive-xgboost version, which is faster and uses train/validation splits to stabilize features. Defaults to `False`.
- `verbose` (int, optional): Controls the verbosity of the output during feature selection. `0` for minimal output; higher values for more detailed information. Defaults to `0`.
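Putting these arguments together, a typical initialization might look like the sketch below (the parameter values are illustrative choices, not required defaults):

```python
from featurewiz_polars import Featurewiz_MRMR_Model
from xgboost import XGBRegressor

# Illustrative settings only; pick values that suit your dataset
mrmr_model = Featurewiz_MRMR_Model(
    estimator=None,               # let XGBoost drive the feature selection
    model=XGBRegressor(),         # model trained after feature selection
    model_type='regression',
    encoding_type='onehot',
    imputation_strategy='median',
    corr_threshold=0.7,
    classic=False,                # use the new split-driven recursive_xgboost method
    verbose=0,
)
```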
Select either the old featurewiz method or the new method using the `classic` argument in the new library: if you set `classic=True`, you will get features similar to the old feature selection method; if you set it to `False`, you will use the new feature selection method. I would suggest you try both methods to see which set of features works well for your dataset, as shown in the sketch below.
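One quick way to compare the two modes is to run the selector twice and inspect which columns each run keeps. This is a sketch only, assuming `Featurewiz_MRMR` accepts the optional arguments documented above and returns a Polars DataFrame whose columns are the selected features:

```python
from featurewiz_polars import Featurewiz_MRMR

# Run the classic featurewiz-style selection
classic_selector = Featurewiz_MRMR(model_type="Classification", classic=True, verbose=0)
X_classic, _ = classic_selector.fit_transform(X_train, y_train)

# Run the new split-driven selection
split_selector = Featurewiz_MRMR(model_type="Classification", classic=False, verbose=0)
X_split, _ = split_selector.fit_transform(X_train, y_train)

# Compare the selected feature names (the columns of the transformed frames)
print("classic     :", sorted(X_classic.columns))
print("split-driven:", sorted(X_split.columns))
```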
The new `featurewiz-polars` library uses an improved method for `recursive_xgboost` feature selection, known as Split-Driven Recursive XGBoost. In this method, we use Polars under the hood to speed up calculations for large datasets and, in addition, perform the following steps (a simplified sketch of this loop appears after the list):
- Split Data for Validation: Divide the dataset into separate training and validation sets. The training set is used to build the XGBoost model, and the validation set is used to evaluate how well the selected features generalize to unseen data.
- XGBoost Feature Ranking (with Validation): Within each run, use the training set to train an XGBoost model and evaluate feature importance. Assess the performance of selected features on the validation set to ensure they generalize well.
- Select Key Features (with Validation): Determine the most significant features based on their importance scores and validation performance.
- Repeat with New Split: After each run of the recursive_xgboost cycle is complete, repeat the entire process (splitting, ranking, selecting) with a new train/validation split.
- Final, Stabilized Feature Set: After multiple runs with different splits, combine the selected features from all runs, removing duplicates. This results in a more stable and reliable final feature set, as it's less sensitive to the specific training/validation split used.
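The sketch below illustrates the shape of this loop in simplified Python. It is an illustration of the idea only, not the library's actual implementation; the importance threshold and helper structure are assumptions:

```python
# Simplified illustration of split-driven recursive XGBoost selection
# (not the library's actual code).
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def split_driven_selection(X, y, n_runs=5, random_state=0):
    selected = set()
    for run in range(n_runs):
        # 1. Make a fresh train/validation split on every run
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, random_state=random_state + run)

        # 2. Rank features with an XGBoost model trained on the training split,
        #    monitoring the validation split so the ranking reflects generalization
        model = XGBClassifier(n_estimators=100, verbosity=0)
        model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
        importances = dict(zip(X_tr.columns, model.feature_importances_))

        # 3. Keep the features whose importance clears a (chosen) threshold
        keep = [f for f, imp in importances.items() if imp > 0.01]

        # 4. Accumulate selections across runs; the set removes duplicates
        selected.update(keep)

    # 5. Final, stabilized feature set
    return sorted(selected)
```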
- Significant Performance Boost: Leverage Polars' speed and efficiency for feature engineering on large datasets.
- Native Polars Workflows: Work directly with Polars DataFrames throughout your feature engineering and machine learning pipelines, avoiding unnecessary data conversions.
- Robust Feature Selection: Benefit from the power of MRMR feature selection, optimized for Polars and corrected for accurate redundancy calculation across mixed data types.
- Flexible Categorical Encoding: Choose from various encoding schemes (Target, WOE, Ordinal, and OneHot encoding).
If you are processing massive datasets with Polars' speed and efficiency while leveraging the power of `featurewiz_polars` to build high-quality MLOps workflows, I welcome your feedback and comments at rsesha2001 at yahoo dot com so I can make the library more useful to you in the months to come. Please star this repo, open a pull request, or report an issue. Every which way, you make this repo more useful and better for everyone!