Releases: markjacksonfishing/pipedreams
v1.0
Release Notes - PipeDreams CSV Data Explorer
Version 1.0
Release Date: 10/31/2024
Features
1. CSV Upload and Default Dataset
- File Uploader: Users can upload any CSV file for immediate analysis and exploration.
- Default Dataset: If no CSV file is uploaded, the app defaults to a sample dataset (customers-100000.csv) located in the data/ directory, allowing users to test features without needing their own data.
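The upload-or-default behavior can be sketched as below. This is a minimal illustration, not the app's actual code: the function name `load_dataset` and its parameters are hypothetical, and it assumes the upload arrives as a file-like object (which is what Streamlit's `st.file_uploader` returns).

```python
import io
from pathlib import Path

import pandas as pd

# Path taken from the release notes; adjust if the repository layout differs.
DEFAULT_CSV = Path("data") / "customers-100000.csv"


def load_dataset(uploaded_file=None, default_path=DEFAULT_CSV):
    """Return the uploaded CSV as a DataFrame, or the sample dataset if nothing was uploaded.

    `load_dataset` is a hypothetical helper; `uploaded_file` is any file-like object.
    """
    source = uploaded_file if uploaded_file is not None else default_path
    return pd.read_csv(source)


# Usage with an in-memory stand-in for an uploaded file.
demo = load_dataset(io.StringIO("Name,Country\nAda,UK\nLin,SG\n"))
print(demo.shape)
```

Calling `load_dataset()` with no argument reads the sample file, so it only works when run from a directory that contains data/customers-100000.csv.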
2. ETL (Extract, Transform, Load) Transformations
- Data Cleaning: Automatically removes rows with missing values to streamline analysis.
- Date Conversion: Converts
Subscription Date
column (if present) todatetime
format and calculates a new column,Years Since Subscription
, indicating the number of years since each customer’s subscription date. - Categorical Encoding: Converts categorical columns into numeric format using label encoding (excluding non-informative columns like
Email
andWebsite
).
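The three ETL steps above could look roughly like the following. It is a sketch under assumptions: the helper name `transform`, the `today` parameter, and the 365.25-day year are illustrative choices, and scikit-learn's `LabelEncoder` stands in for whatever encoder the app actually uses.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

# Columns the notes name as non-informative, so they are left unencoded.
SKIP_ENCODING = {"Email", "Website"}


def transform(df, today=None):
    """Hypothetical helper: clean, convert dates, and label-encode a copy of `df`."""
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    out = df.dropna().copy()  # data cleaning: drop rows with any missing value

    if "Subscription Date" in out.columns:
        out["Subscription Date"] = pd.to_datetime(out["Subscription Date"])
        # Elapsed whole days converted to fractional years (365.25 is an assumption).
        elapsed = today - out["Subscription Date"]
        out["Years Since Subscription"] = elapsed.dt.days / 365.25

    # Label-encode every remaining non-numeric, non-datetime column.
    for col in out.columns:
        if col in SKIP_ENCODING:
            continue
        if is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue
        out[col] = LabelEncoder().fit_transform(out[col])
    return out


raw = pd.DataFrame({
    "Country": ["UK", "SG", None, "UK"],
    "Email": ["a@x.io", "b@x.io", "c@x.io", "d@x.io"],
    "Subscription Date": ["2020-01-01", "2021-06-15", "2022-03-01", "2019-01-01"],
})
clean = transform(raw, today="2024-01-01")
print(clean)
```

Passing a fixed `today` keeps the derived column reproducible; without it the values change each day the app runs.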
3. Synthetic Data Generation
- Annual Purchase Amount: Introduces a synthetic target column (Annual Purchase Amount) generated with random values to demonstrate predictive analysis capabilities.
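One way to generate such a column is shown below. The notes do not state the distribution or range, so the uniform draw, the 100 to 5000 bounds, the seed, and the helper name `add_synthetic_target` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_synthetic_target(df, low=100.0, high=5000.0, seed=42):
    """Hypothetical helper: append a random 'Annual Purchase Amount' column to a copy of `df`."""
    out = df.copy()
    rng = np.random.default_rng(seed)  # seeded so repeated runs give the same values
    # Uniform values in [low, high); the bounds are assumed, not taken from the app.
    out["Annual Purchase Amount"] = rng.uniform(low, high, size=len(out))
    return out


customers = pd.DataFrame({"Customer Id": ["c1", "c2", "c3"]})
with_target = add_synthetic_target(customers)
print(with_target)
```

Because the values are drawn independently of every other column, any model fitted to them has no real signal to find.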
4. Interactive Data Visualization
- Chart Options: Allows users to visualize data with five chart types:
  - Scatter Plot
  - Bar Chart
  - Line Chart
  - Histogram
  - Box Plot
- Column Selection: Users can select any numeric column for X and Y axes, making data exploration flexible and adaptable to different datasets.
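Restricting the axis choices to numeric columns can be done with pandas alone. The snippet below is only a sketch: `numeric_columns` is a hypothetical helper, and the notes do not say which plotting library draws the charts, so no plotting call is shown.

```python
import pandas as pd

# The five chart types listed in the release notes.
CHART_TYPES = ["Scatter Plot", "Bar Chart", "Line Chart", "Histogram", "Box Plot"]


def numeric_columns(df):
    """Hypothetical helper: names of the columns that can be offered for the X and Y axes."""
    return df.select_dtypes(include="number").columns.tolist()


sample = pd.DataFrame({
    "Index": [1, 2, 3],
    "Country": ["UK", "SG", "US"],
    "Years Since Subscription": [4.0, 2.5, 1.8],
})
axis_options = numeric_columns(sample)
print(axis_options)
```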
5. Clustering Analysis
- Feature Selection for Clustering: Provides an interactive feature selection for clustering analysis with KMeans.
- Standardization: Scales selected features to ensure equal contribution in clustering analysis.
- Cluster Visualization: Displays clusters in a scatter plot with color-coded groups to illustrate natural data groupings, allowing for easy segment identification.
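The standardize-then-cluster flow can be sketched with scikit-learn as follows. The helper name `cluster`, the `Cluster` output column, the default of three clusters, and the toy data are assumptions; the notes only confirm that KMeans and feature scaling are used.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster(df, features, n_clusters=3, seed=0):
    """Hypothetical helper: scale `features`, run KMeans, and return a copy with a 'Cluster' column."""
    # Standardize so features on large scales do not dominate the distance metric.
    scaled = StandardScaler().fit_transform(df[features])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out["Cluster"] = model.fit_predict(scaled)
    return out


# Toy data: two well-separated groups on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": np.concatenate([rng.normal(25, 2, 20), rng.normal(60, 2, 20)]),
    "Income": np.concatenate([rng.normal(30_000, 1_000, 20), rng.normal(90_000, 1_000, 20)]),
})
clustered = cluster(toy, ["Age", "Income"], n_clusters=2)
print(clustered["Cluster"].value_counts())
```

The `Cluster` column can then be used as the color dimension of a scatter plot to show the groups.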
6. Predictive Analysis with Linear Regression
- Target and Feature Selection: Users can select the target variable (Annual Purchase Amount by default) and multiple predictor features to build a regression model.
- Train/Test Split: Splits data into training and test sets (80/20) to evaluate model performance.
- Model Evaluation: Calculates and displays the Mean Squared Error (MSE) on the test set as a measure of prediction error (lower is better).
- Prediction Results Display: Shows a sample of actual vs. predicted values and visualizes these in an interactive scatter plot for easy comparison.
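The regression workflow described above maps onto scikit-learn roughly as shown here. The helper name `fit_regression`, the seed, and the `Actual`/`Predicted` column names are assumptions, and the toy target is exactly linear so the example has something to fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_regression(df, features, target="Annual Purchase Amount", seed=0):
    """Hypothetical helper: 80/20 split, fit a linear model, return it with test MSE and results."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    # Side-by-side table of actual vs. predicted values for the held-out rows.
    results = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": predicted})
    return model, mse, results


# Toy data with an exactly linear target (illustrative only).
rng = np.random.default_rng(1)
data = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
data["Annual Purchase Amount"] = 2 * data["a"] + 5 * data["b"] + 10
model, mse, results = fit_regression(data, ["a", "b"])
print(mse)
print(results.head())
```

With the app's randomly generated target, the MSE would instead stay close to the variance of the target, since the features carry no information about it.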
7. Insight Summary
- Summary of Findings: Concludes with key insights from clustering and predictive analysis, helping users interpret results in a business or research context:
  - Clustering reveals natural groupings within the data.
  - Predictive analysis can uncover trends and provide actionable insights based on selected features.
Improvements and Fixes
- Enhanced User Interface: Streamlit-based interface is optimized for an intuitive experience with organized sections for ETL, visualization, clustering, and predictive analysis.
- Improved Compatibility: Categorical encoding and feature standardization ensure smooth functionality across various datasets and data types.
- Error Handling: Catches missing or incompatible data types and defaults to sample data if no file is uploaded, ensuring a seamless experience for first-time users.
Known Issues
- Non-Numeric Columns for ML Analysis: Currently, only numeric columns are selectable for clustering and regression; non-numeric features must be encoded before they can be used.
- Synthetic Data Limitations: The Annual Purchase Amount column is randomly generated and may not reflect real-world correlations. Users should replace this with actual target variables for meaningful predictive analysis.