Release Notes - PipeDreams CSV Data Explorer

Features

File Uploader: Users can upload any CSV file for immediate analysis and exploration.
Default Dataset: If no CSV file is uploaded, the app defaults to a sample dataset (customers-100000.csv) located in the data/ directory, allowing users to test features without needing their own data.
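
The upload-or-default behavior can be sketched as follows. This is a minimal illustration, not the app's actual code: the helper name load_dataset is hypothetical, and the default path is assembled from the file and directory names mentioned above.

```python
import io
from pathlib import Path

import pandas as pd

# Assumed location of the bundled sample file, based on the note above.
DEFAULT_CSV = Path("data") / "customers-100000.csv"


def load_dataset(uploaded_file=None, default_path=DEFAULT_CSV):
    """Return a DataFrame from the upload, or from the sample file if none."""
    source = uploaded_file if uploaded_file is not None else default_path
    return pd.read_csv(source)


# Any file-like object works here, for example the value returned by
# Streamlit's st.file_uploader or an in-memory buffer as below.
upload = io.StringIO("Customer Id,Country\nA1,Chile\nB2,Norway\n")
df = load_dataset(upload)
```

Calling load_dataset() with no argument would read the sample file instead, provided it exists at the assumed path.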

Data Cleaning: Automatically removes rows with missing values to streamline analysis.
Date Conversion: Converts Subscription Date column (if present) to datetime format and calculates a new column, Years Since Subscription, indicating the number of years since each customer’s subscription date.
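
A rough sketch of the cleaning and date steps is shown here. The function name is hypothetical, and counting years by calendar-year difference is an assumption; the app may compute the interval differently.

```python
import pandas as pd


def add_years_since_subscription(df, today=None):
    """Drop incomplete rows, then derive 'Years Since Subscription'."""
    df = df.dropna().copy()
    if "Subscription Date" in df.columns:
        today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
        df["Subscription Date"] = pd.to_datetime(df["Subscription Date"])
        # Calendar-year difference (an assumption, not the app's confirmed rule).
        df["Years Since Subscription"] = today.year - df["Subscription Date"].dt.year
    return df


raw = pd.DataFrame(
    {
        "Subscription Date": ["2020-08-24", "2021-04-23", None],
        "Country": ["Chile", "Norway", "Peru"],
    }
)
# A fixed reference date keeps the example reproducible.
clean = add_years_since_subscription(raw, today="2024-01-01")
```
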
Categorical Encoding: Converts categorical columns into numeric format using label encoding (excluding non-informative columns like Email and Website).
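
As an illustration of label encoding, the sketch below uses scikit-learn's LabelEncoder on every text column except the skipped ones. The helper name and the choice of LabelEncoder are assumptions; the notes only say that label encoding is applied.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Columns left untouched because they carry no predictive signal
# (names taken from the note above).
SKIP_COLUMNS = {"Email", "Website"}


def encode_categoricals(df, skip=SKIP_COLUMNS):
    """Label-encode every text column that is not listed in `skip`."""
    encoded = df.copy()
    for col in encoded.columns:
        if col in skip or not pd.api.types.is_string_dtype(encoded[col]):
            continue
        # LabelEncoder assigns integer codes in sorted order of the labels.
        encoded[col] = LabelEncoder().fit_transform(encoded[col].astype(str))
    return encoded


sample = pd.DataFrame(
    {
        "Country": ["Norway", "Chile", "Norway"],
        "Email": ["a@example.com", "b@example.com", "c@example.com"],
        "Age": [31, 45, 27],
    }
)
encoded = encode_categoricals(sample)
```

Label encoding imposes an arbitrary ordering on categories, which is acceptable for a demo but can mislead distance-based methods such as KMeans.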

Annual Purchase Amount: Introduces a synthetic target column (Annual Purchase Amount) generated with random values to demonstrate predictive analysis capabilities.
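
One way such a synthetic target might be generated is sketched below. The uniform distribution, the value range, and the seed are all assumptions chosen for illustration; the notes state only that the values are random.

```python
import numpy as np
import pandas as pd


def add_synthetic_target(df, low=100.0, high=5000.0, seed=42):
    """Attach a random 'Annual Purchase Amount' column (illustrative only)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Values are independent of every other column, so no real signal exists.
    out["Annual Purchase Amount"] = rng.uniform(low, high, size=len(out))
    return out


customers = pd.DataFrame({"Age": [31, 45, 27, 52]})
with_target = add_synthetic_target(customers)
```
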

Chart Options: Allows users to visualize data with five chart types:
- Scatter Plot
- Bar Chart
- Line Chart
- Histogram
- Box Plot
Column Selection: Users can select any numeric column for the X and Y axes, so the same chart options work with any dataset that contains numeric columns.
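
Restricting the axis choices to numeric columns could look like the sketch below. The helper name numeric_columns is hypothetical; its result would feed whatever selection widget the interface uses.

```python
import pandas as pd


def numeric_columns(df):
    """Return the names of columns that can be plotted on a numeric axis."""
    return df.select_dtypes(include="number").columns.tolist()


frame = pd.DataFrame(
    {
        "Country": ["Chile", "Norway"],
        "Age": [31, 45],
        "Annual Purchase Amount": [1200.5, 830.0],
    }
)
axis_options = numeric_columns(frame)
```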

Feature Selection for Clustering: Lets users interactively select the features used for KMeans clustering.
Standardization: Scales selected features to ensure equal contribution in clustering analysis.
Cluster Visualization: Displays clusters in a scatter plot with color-coded groups to illustrate natural data groupings, allowing for easy segment identification.
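
The standardize-then-cluster flow described above can be sketched with scikit-learn as follows. The function name, the default of three clusters, and the name of the output column are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_rows(df, features, n_clusters=3, seed=0):
    """Standardize the chosen features and label each row with a cluster."""
    # Scaling to zero mean and unit variance keeps large-valued columns
    # from dominating the distance calculation.
    scaled = StandardScaler().fit_transform(df[features])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out["Cluster"] = model.fit_predict(scaled)
    return out


people = pd.DataFrame(
    {
        "Age": [20, 22, 21, 60, 62, 61],
        "Spend": [100, 110, 105, 900, 910, 905],
    }
)
clustered = cluster_rows(people, ["Age", "Spend"], n_clusters=2)
```

The resulting Cluster column can then be used as the color dimension of a scatter plot.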

Target and Feature Selection: Users can select the target variable (Annual Purchase Amount by default) and multiple predictor features to build a regression model.
Train/Test Split: Splits data into training and test sets (80/20) to evaluate model performance.
Model Evaluation: Calculates and displays the Mean Squared Error (MSE) on the test set; lower values indicate predictions closer to the actual values.
Prediction Results Display: Shows a sample of actual vs. predicted values and visualizes these in an interactive scatter plot for easy comparison.
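
The regression steps above can be sketched as follows. The notes do not name the model type, so LinearRegression is an assumption here, as are the function name and the result column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(df, features, target="Annual Purchase Amount", seed=0):
    """Fit on 80% of the rows and report MSE on the remaining 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    # Side-by-side table of actual and predicted values for display.
    results = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": predicted})
    return mse, results


# Toy data with an exact linear relationship, so the error is close to zero.
x = np.arange(50, dtype=float)
toy = pd.DataFrame({"Age": x, "Annual Purchase Amount": 3.0 * x + 10.0})
mse, results = fit_and_score(toy, ["Age"])
```

With the randomly generated target described earlier, the MSE would instead stay high, because the target carries no relationship to the features.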

Summary of Findings: Concludes with key insights from clustering and predictive analysis, helping users interpret results in a business or research context:
- Clustering reveals natural groupings within the data.
- Predictive analysis can uncover trends and provide actionable insights based on selected features.

Enhanced User Interface: Streamlit-based interface is optimized for an intuitive experience with organized sections for ETL, visualization, clustering, and predictive analysis.
Improved Compatibility: Categorical encoding and feature standardization allow datasets with mixed column types and differing value scales to be used for clustering and regression.
Error Handling: Catches missing or incompatible data types and defaults to sample data if no file is uploaded, ensuring a seamless experience for first-time users.

Non-Numeric Columns for ML Analysis: Currently, only numeric columns are selectable for clustering and regression; non-numeric features must be encoded before they can be used.
Synthetic Data Limitations: The Annual Purchase Amount column is randomly generated and may not reflect real-world correlations. Users should replace this with actual target variables for meaningful predictive analysis.