Releases: markjacksonfishing/pipedreams

v1.0

31 Oct 18:21
81da42e

Release Notes - PipeDreams CSV Data Explorer

Version 1.0

Release Date: October 31, 2024


Features

1. CSV Upload and Default Dataset

  • File Uploader: Users can upload any CSV file for immediate analysis and exploration.
  • Default Dataset: If no CSV file is uploaded, the app defaults to a sample dataset (customers-100000.csv) located in the data/ directory, allowing users to test features without needing their own data.
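
The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the helper name load_dataset and its parameters are hypothetical, and only the data/customers-100000.csv path comes from these notes. The uploaded_file argument stands in for whatever the upload widget returns (Streamlit's file uploader returns a file-like object, or None when nothing is uploaded).

```python
from pathlib import Path

import pandas as pd

# Path taken from the release notes; adjust if the repository layout differs.
DEFAULT_CSV = Path("data") / "customers-100000.csv"


def load_dataset(uploaded_file=None, default_path=DEFAULT_CSV):
    """Hypothetical helper: read the uploaded CSV, or the sample dataset if none was given."""
    # pandas accepts both a filesystem path and an open file-like object.
    source = uploaded_file if uploaded_file is not None else default_path
    return pd.read_csv(source)
```

Calling load_dataset() with no argument reads the sample file, so the rest of the pipeline can run without user data.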

2. ETL (Extract, Transform, Load) Transformations

  • Data Cleaning: Automatically removes rows with missing values to streamline analysis.
  • Date Conversion: Converts the Subscription Date column (if present) to datetime format and adds a new column, Years Since Subscription, giving the number of years elapsed since each customer’s subscription date.
  • Categorical Encoding: Converts categorical columns into numeric format using label encoding (excluding non-informative columns like Email and Website).
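
The three transformations above might look roughly like this in pandas. Treat it as a sketch under stated assumptions: the function name transform, the today parameter, and the days // 365 approximation of "years" are all hypothetical, and pd.factorize with sort=True is used here as a stand-in for label encoding (it assigns integer codes in sorted order of the distinct values); the notes do not say which encoder the app uses. Email and Website are simply left unencoded.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def transform(df, today=None, skip_columns=("Email", "Website")):
    """Hypothetical ETL step: clean rows, derive subscription age, encode categoricals."""
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    out = df.dropna().copy()  # drop every row that has a missing value

    if "Subscription Date" in out.columns:
        out["Subscription Date"] = pd.to_datetime(out["Subscription Date"])
        # Whole years elapsed, approximated as days // 365 (an assumption).
        days = (today - out["Subscription Date"]).dt.days
        out["Years Since Subscription"] = days // 365

    # Give each distinct value of the remaining text columns an integer code.
    for col in out.columns:
        is_text = not is_numeric_dtype(out[col]) and not is_datetime64_any_dtype(out[col])
        if is_text and col not in skip_columns:
            codes, _ = pd.factorize(out[col], sort=True)
            out[col] = codes
    return out
```

Passing a fixed today value keeps the derived column reproducible, which is convenient when comparing runs.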

3. Synthetic Data Generation

  • Annual Purchase Amount: Introduces a synthetic target column (Annual Purchase Amount) generated with random values to demonstrate predictive analysis capabilities.
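
A random target column of this kind can be produced with NumPy. The sketch below is illustrative only: the function name, the uniform distribution, and the 100 to 5000 range are assumptions, since the notes state only that the values are random.

```python
import numpy as np
import pandas as pd


def add_synthetic_target(df, low=100.0, high=5000.0, seed=None):
    """Hypothetical helper: append a random 'Annual Purchase Amount' column."""
    rng = np.random.default_rng(seed)  # pass a seed for reproducible values
    out = df.copy()
    # Distribution and range are assumptions made for this sketch.
    out["Annual Purchase Amount"] = rng.uniform(low, high, size=len(out)).round(2)
    return out
```

Because the values are independent of every other column, any model trained on them has nothing real to learn.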

4. Interactive Data Visualization

  • Chart Options: Allows users to visualize data with five chart types:
    • Scatter Plot
    • Bar Chart
    • Line Chart
    • Histogram
    • Box Plot
  • Column Selection: Users can select any numeric column for the X and Y axes, so the same charts work across different datasets.
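
Restricting the axis choices to numeric columns can be done with a dtype filter. This is a small sketch with a hypothetical helper name; how the app actually builds its column list is not described in these notes.

```python
import pandas as pd


def numeric_columns(df):
    """Hypothetical helper: list the columns that could be offered as X or Y axis choices."""
    # "number" matches integer and float dtypes; text and boolean columns are left out.
    return df.select_dtypes(include="number").columns.tolist()
```

The resulting list could then feed the X and Y selection widgets.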

5. Clustering Analysis

  • Feature Selection for Clustering: Lets users interactively choose the features used for KMeans clustering.
  • Standardization: Standardizes the selected features so that each contributes on a comparable scale.
  • Cluster Visualization: Displays clusters in a scatter plot with color-coded groups to illustrate natural data groupings, allowing for easy segment identification.
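
The standardize-then-cluster flow above can be sketched with scikit-learn. The function name, the default of three clusters, and the Cluster column name are assumptions for illustration; the notes do not state how many clusters the app uses.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster(df, features, n_clusters=3, seed=0):
    """Hypothetical helper: scale the chosen features and attach a KMeans cluster label."""
    # Zero mean and unit variance, so no single feature dominates the distances.
    scaled = StandardScaler().fit_transform(df[features])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out["Cluster"] = model.fit_predict(scaled)
    return out
```

The Cluster column can then be used as the color dimension of a scatter plot to show the groups.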

6. Predictive Analysis with Linear Regression

  • Target and Feature Selection: Users can select the target variable (Annual Purchase Amount by default) and multiple predictor features to build a regression model.
  • Train/Test Split: Splits data into training and test sets (80/20) to evaluate model performance.
  • Model Evaluation: Calculates and displays the Mean Squared Error (MSE) on the test set; lower values indicate a better fit.
  • Prediction Results Display: Shows a sample of actual vs. predicted values and visualizes these in an interactive scatter plot for easy comparison.
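
The regression workflow above (80/20 split, fit, MSE, actual vs. predicted) might be sketched as follows with scikit-learn. The function name, the seed, and the Actual/Predicted column names are hypothetical; only the 80/20 split, the use of linear regression, and the MSE metric come from these notes.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(df, features, target="Annual Purchase Amount", seed=42):
    """Hypothetical helper: fit a linear model and report test-set MSE."""
    # Hold out 20% of the rows for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    # Side-by-side table of held-out values and the model's predictions.
    comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": predicted})
    return mse, comparison
```

With the randomly generated default target, the MSE should land close to the variance of that column, since the features carry no information about it.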

7. Insight Summary

  • Summary of Findings: Concludes with key insights from clustering and predictive analysis, helping users interpret results in a business or research context:
    • Clustering reveals natural groupings within the data.
    • Predictive analysis can uncover trends and provide actionable insights based on selected features.

Improvements and Fixes

  • Enhanced User Interface: Streamlit-based interface is optimized for an intuitive experience with organized sections for ETL, visualization, clustering, and predictive analysis.
  • Improved Compatibility: Categorical encoding and feature standardization ensure smooth functionality across various datasets and data types.
  • Error Handling: Catches missing or incompatible data types and defaults to sample data if no file is uploaded, ensuring a seamless experience for first-time users.

Known Issues

  • Non-Numeric Columns for ML Analysis: Only numeric columns are currently selectable for clustering and regression; non-numeric features must be encoded first.
  • Synthetic Data Limitations: The Annual Purchase Amount column is randomly generated and may not reflect real-world correlations. Users should replace this with actual target variables for meaningful predictive analysis.