Release Notes - PipeDreams CSV Data Explorer
Version 1.0
Release Date: 10/31/2024
Features
1. CSV Upload and Default Dataset
File Uploader: Users can upload any CSV file for immediate analysis and exploration.
Default Dataset: If no CSV file is uploaded, the app defaults to a sample dataset (customers-100000.csv) located in the data/ directory, allowing users to test features without needing their own data.
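The upload-or-default behavior described above could be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper name `load_data` is hypothetical, and the default path is assembled from the file and directory names given in these notes.

```python
import pandas as pd

# Path assumed from the notes: customers-100000.csv inside data/
DEFAULT_PATH = "data/customers-100000.csv"

def load_data(uploaded=None, default_path=DEFAULT_PATH):
    """Read the uploaded CSV if one was provided, else the sample dataset.

    `uploaded` may be any file-like object or path that pandas can read.
    """
    source = uploaded if uploaded is not None else default_path
    return pd.read_csv(source)
```

In a Streamlit app, the value returned by `st.file_uploader` (a file-like object, or `None` when nothing has been uploaded) could be passed directly as `uploaded`.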
2. ETL (Extract, Transform, Load) Transformations
Data Cleaning: Automatically removes rows with missing values to streamline analysis.
Date Conversion: Converts the Subscription Date column (if present) to datetime format and adds a new column, Years Since Subscription, giving the number of years since each customer’s subscription date.
Categorical Encoding: Converts categorical columns to numeric format using label encoding (excluding non-informative columns such as Email and Website).
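The three ETL steps above might look something like the sketch below. The function name `transform`, the `today` parameter, and the 365.25-day year are assumptions made for illustration. `pd.factorize(..., sort=True)` is used here as a stand-in for label encoding: it assigns integer codes in sorted order of the unique values, but the app itself may use a different encoder.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def transform(df, skip=("Email", "Website"), today=None):
    """Hypothetical helper: drop missing rows, derive years, encode categories."""
    # 1. Data cleaning: drop every row that has at least one missing value
    out = df.dropna().copy()

    # 2. Date conversion: only if the column is present
    if "Subscription Date" in out.columns:
        out["Subscription Date"] = pd.to_datetime(out["Subscription Date"])
        if today is None:
            today = pd.Timestamp.today().normalize()
        elapsed = today - out["Subscription Date"]
        # Assumption: a year is approximated as 365.25 days
        out["Years Since Subscription"] = elapsed.dt.days / 365.25

    # 3. Categorical encoding: integer codes for remaining non-numeric columns
    for col in out.columns:
        if col in skip or is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue
        out[col] = pd.factorize(out[col], sort=True)[0]
    return out
```

Columns listed in `skip` are left as text, so they stay readable in the table but are not usable as numeric features.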
3. Synthetic Data Generation
Annual Purchase Amount: Introduces a synthetic target column (Annual Purchase Amount) generated with random values to demonstrate predictive analysis capabilities.
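As a rough illustration of how such a random target column could be produced, the sketch below draws values from a uniform distribution. The notes do not say which distribution or value range the app uses, so the range (100 to 5000), the fixed seed, and the helper name `add_synthetic_target` are all assumptions.

```python
import numpy as np
import pandas as pd

def add_synthetic_target(df, low=100.0, high=5000.0, seed=42):
    """Hypothetical helper: append a random 'Annual Purchase Amount' column."""
    rng = np.random.default_rng(seed)  # seeded so repeated runs match
    out = df.copy()
    # One random value per row; unrelated to any other column by construction
    out["Annual Purchase Amount"] = rng.uniform(low, high, size=len(out))
    return out
```

Because the values are drawn independently of every other column, a model trained on this target has nothing real to learn from the features.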
4. Interactive Data Visualization
Chart Options: Allows users to visualize data with five chart types:
Scatter Plot
Bar Chart
Line Chart
Histogram
Box Plot
Column Selection: Users can select any numeric column for the X and Y axes, so the same charts work across different datasets.
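One plausible way to build the list of selectable axis columns is to filter by dtype, as in this small sketch. The helper name `numeric_columns` is hypothetical, and the app may determine the list differently.

```python
import pandas as pd

def numeric_columns(df):
    """Return the names of integer and float columns, in their original order."""
    return df.select_dtypes(include="number").columns.tolist()
```

The resulting list could then feed the X and Y axis selectors for any of the five chart types.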
5. Clustering Analysis
Feature Selection for Clustering: Lets users interactively choose which features to include in KMeans clustering.
Standardization: Scales selected features to ensure equal contribution in clustering analysis.
Cluster Visualization: Displays clusters in a scatter plot with color-coded groups to illustrate natural data groupings, allowing for easy segment identification.
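The standardize-then-cluster flow described above can be sketched with scikit-learn as follows. The helper name `cluster`, the default of three clusters, and the fixed seed are assumptions. The notes do not state how many clusters the app uses or whether that number is configurable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster(df, features, n_clusters=3, seed=42):
    """Hypothetical helper: scale the chosen features and label each row."""
    # Standardize to zero mean and unit variance so no feature dominates
    scaled = StandardScaler().fit_transform(df[features])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out["Cluster"] = model.fit_predict(scaled)
    return out
```

The `Cluster` column can then be used as the color dimension in a scatter plot of any two of the selected features.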
6. Predictive Analysis with Linear Regression
Target and Feature Selection: Users can select the target variable (Annual Purchase Amount by default) and multiple predictor features to build a regression model.
Train/Test Split: Splits data into training and test sets (80/20) to evaluate model performance.
Model Evaluation: Calculates and displays the Mean Squared Error (MSE) on the test set as a measure of prediction error (lower is better).
Prediction Results Display: Shows a sample of actual vs. predicted values and visualizes these in an interactive scatter plot for easy comparison.
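The regression workflow above (80/20 split, fit, MSE, actual vs. predicted table) could be sketched with scikit-learn like this. The helper name `fit_and_score` and the fixed random seed are assumptions, and the app's own implementation may differ in detail.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def fit_and_score(df, features, target="Annual Purchase Amount", seed=42):
    """Hypothetical helper: fit a linear model and report test-set MSE."""
    # Hold out 20% of the rows for evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    # Side-by-side table for inspection or plotting
    comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": predicted})
    return mse, comparison
```

Plotting `Actual` against `Predicted` gives the comparison scatter plot: points close to the diagonal indicate good predictions.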
7. Insight Summary
Summary of Findings: Concludes with key insights from clustering and predictive analysis, helping users interpret results in a business or research context:
Clustering reveals natural groupings within the data.
Predictive analysis can uncover trends and provide actionable insights based on selected features.
Improvements and Fixes
Enhanced User Interface: Streamlit-based interface is optimized for an intuitive experience with organized sections for ETL, visualization, clustering, and predictive analysis.
Improved Compatibility: Categorical encoding and feature standardization ensure smooth functionality across various datasets and data types.
Error Handling: Catches missing or incompatible data types and defaults to sample data if no file is uploaded, ensuring a seamless experience for first-time users.
Known Issues
Non-Numeric Columns for ML Analysis: Currently, only numeric columns are selectable for clustering and regression; non-numeric features must be encoded before they can be used.
Synthetic Data Limitations: The Annual Purchase Amount column is randomly generated and may not reflect real-world correlations. Users should replace this with actual target variables for meaningful predictive analysis.