This project analyzes the effectiveness of direct marketing campaigns for a banking institution and predicts customer subscription to term deposits based on historical campaign data. The model leverages various machine learning techniques, feature engineering, and statistical analysis to derive insights and improve prediction accuracy.
- Data Preprocessing: Handling missing values, outliers, and categorical encoding.
- Feature Engineering: Creating new features to enhance model performance.
- Exploratory Data Analysis (EDA): Detailed visualizations to uncover trends and patterns.
- Model Training: Comparison of Logistic Regression, Random Forest, and XGBoost models.
- Performance Metrics: Evaluating models on accuracy and F1-score.
- Submission Generation: Creating predictions in the required format for submission.
The data is related to direct marketing campaigns of a banking institution. Campaigns were conducted via phone calls, often involving multiple contacts per client to determine whether they would subscribe to a term deposit.
- `train.csv`: Training dataset
- `test.csv`: Test dataset
- `sample_submission.csv`: Sample submission file in the correct format
- last contact date: Date of last contact
- age: Age of the client (numeric)
- job: Type of job (categorical)
- marital: Marital status (categorical)
- education: Level of education (categorical)
- default: Has credit in default? (binary)
- balance: Average yearly balance in euros (numeric)
- housing: Has housing loan? (binary)
- loan: Has personal loan? (binary)
- contact: Contact communication type (categorical)
- duration: Last contact duration in seconds (numeric)
- campaign: Number of contacts performed during this campaign and for this client (numeric)
- pdays: Days passed since the client was last contacted from a previous campaign (numeric)
- previous: Number of contacts performed before this campaign (numeric)
- poutcome: Outcome of the previous marketing campaign (categorical)
- target: Whether the client subscribed to a term deposit (binary: "yes", "no")
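As a rough illustration of how these columns might be read and checked, here is a minimal sketch using pandas. The function name `load_campaign_data` is hypothetical, and the code assumes the CSV headers are spelled exactly as in the list above (including the spaces in `last contact date`) and that a file without labels simply lacks the `target` column.

```python
import pandas as pd

# Column names copied from the feature list above; that the CSV files spell
# them exactly this way is an assumption.
EXPECTED_COLUMNS = [
    "last contact date", "age", "job", "marital", "education", "default",
    "balance", "housing", "loan", "contact", "duration", "campaign",
    "pdays", "previous", "poutcome",
]


def load_campaign_data(source):
    """Read a campaign CSV, check its columns, and encode the target if present."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Parse the contact date; unparseable values become NaT instead of raising.
    df["last contact date"] = pd.to_datetime(df["last contact date"], errors="coerce")
    # Assumption: unlabeled files have no target column, so encode it only if it exists.
    if "target" in df.columns:
        df["target"] = df["target"].map({"yes": 1, "no": 0})
    return df
```

Calling `load_campaign_data("train.csv")` would then return a frame whose `target` column holds 1 for "yes" and 0 for "no".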
- Exploratory Data Analysis (EDA)
  - Univariate, bivariate, and multivariate analysis using visualizations.
  - Correlation heatmap for feature selection.
- Data Preprocessing
  - Handling missing values using `SimpleImputer`.
  - Encoding categorical variables with `OrdinalEncoder` and `OneHotEncoder`.
  - Scaling numerical variables using `StandardScaler`.
- Feature Engineering
  - Derived new features like `age_group`, `balance_group`, and interaction terms.
  - Calculated derived metrics like `campaign_intensity` and `contact_rate`.
- Model Building
  - Models: Logistic Regression, Random Forest, and XGBoost.
  - Hyperparameter tuning for optimal performance.
  - Metrics: Accuracy, Precision, Recall, F1-score, and AUC-ROC.
- Results Visualization
  - Bar charts comparing model performance on test accuracy and F1-score.
- Submission Generation
  - Predictions exported in the required format (`submission.csv`).
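The preprocessing, feature engineering, and model comparison steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the function names are hypothetical, the `age_group` bin edges and the `campaign_intensity` formula are illustrative guesses (the source does not define them), the 80/20 split is an arbitrary choice, and XGBoost is left out so the example depends only on pandas and scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "balance", "duration", "campaign", "pdays", "previous"]
CATEGORICAL = ["job", "marital", "education", "default",
               "housing", "loan", "contact", "poutcome"]


def add_features(df):
    """Add two illustrative derived features (definitions are assumptions)."""
    out = df.copy()
    # Hypothetical bin edges; the project's real age_group bins are not documented.
    out["age_group"] = pd.cut(
        out["age"], bins=[0, 30, 45, 60, 120], right=False,
        labels=["under_30", "30_44", "45_59", "60_plus"],
    ).astype(str)
    # Hypothetical formula: contacts in this campaign per prior contact (+1 avoids /0).
    out["campaign_intensity"] = out["campaign"] / (out["previous"] + 1)
    return out


def build_preprocessor(numeric, categorical):
    """Impute and scale numeric columns; impute and one-hot encode categorical ones."""
    num = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
    cat = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    # Columns not listed here (e.g. the contact date) are dropped by default.
    return ColumnTransformer([("num", num, numeric), ("cat", cat, categorical)])


def compare_models(X, y, seed=0):
    """Fit each model on a stratified 80/20 split; return held-out accuracy and F1."""
    X = add_features(X)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    numeric = NUMERIC + ["campaign_intensity"]
    categorical = CATEGORICAL + ["age_group"]
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        clf = Pipeline([("prep", build_preprocessor(numeric, categorical)),
                        ("model", model)])
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        scores[name] = {"accuracy": accuracy_score(y_te, pred),
                        "f1": f1_score(y_te, pred)}
    return scores
```

Here `y` is expected to be a 0/1 series. Wrapping the preprocessor and the model in a single `Pipeline` keeps the imputation and scaling statistics fitted on the training split only, which avoids leaking information from the held-out rows.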
- Clone the repository: `git clone https://github.com/username/banking-marketing-prediction.git`
- Install dependencies: `pip install -r requirements.txt`
- Run the analysis script: `python main.py`
- Generate submission files:
  - Logistic Regression: `submission_lr.csv`
  - Random Forest: `submission_rf.csv`
  - XGBoost: `final_submission.csv`
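For illustration, a submission file could be assembled along these lines. The helper `build_submission` is hypothetical, and the column names `id` and `target` as well as the "yes"/"no" label encoding are assumptions; the layout of `sample_submission.csv` should take precedence.

```python
import pandas as pd


def build_submission(ids, predictions, id_col="id", target_col="target"):
    """Return a two-column frame with 0/1 predictions mapped to "no"/"yes"."""
    labels = ["yes" if int(p) == 1 else "no" for p in predictions]
    return pd.DataFrame({id_col: list(ids), target_col: labels})


# Example (not executed here): one file per model, using the filenames above.
# build_submission(test_ids, lr_predictions).to_csv("submission_lr.csv", index=False)
```

Passing `index=False` to `to_csv` keeps the pandas row index out of the file, so only the two expected columns are written.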
Model | Train Accuracy | Test Accuracy | Train F1-Score | Test F1-Score |
---|---|---|---|---|
Logistic Regression | 85.4% | 83.2% | 81.2% | 80.1% |
Random Forest | 89.7% | 87.5% | 85.9% | 84.3% |
XGBoost | 91.2% | 88.7% | 87.6% | 86.2% |
- Test Accuracy Comparison
- Test F1-Score Comparison
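The two comparison charts could be reproduced with a sketch like the one below, which plots the test-set figures from the results table above using matplotlib. The function name `plot_test_metrics` and the figure layout are assumptions, not the project's own plotting code.

```python
import matplotlib.pyplot as plt

# Test-set percentages copied from the results table.
RESULTS = {
    "Logistic Regression": {"Test Accuracy": 83.2, "Test F1-Score": 80.1},
    "Random Forest": {"Test Accuracy": 87.5, "Test F1-Score": 84.3},
    "XGBoost": {"Test Accuracy": 88.7, "Test F1-Score": 86.2},
}


def plot_test_metrics(results):
    """Draw one bar chart per metric, with one bar per model."""
    metrics = ["Test Accuracy", "Test F1-Score"]
    names = list(results)
    fig, axes = plt.subplots(1, len(metrics), figsize=(10, 4), sharey=True)
    for ax, metric in zip(axes, metrics):
        ax.bar(names, [results[name][metric] for name in names])
        ax.set_title(f"{metric} Comparison")
        ax.set_ylabel("Score (%)")
        ax.set_ylim(0, 100)
        ax.tick_params(axis="x", rotation=20)
    fig.tight_layout()
    return fig
```

The returned figure can be saved with `fig.savefig(...)` or shown interactively with `plt.show()`.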
- Feature selection optimization.
- Incorporate additional models for comparison.
- Implement model explainability techniques.
This project is licensed under the MIT License - see the LICENSE file for details.