The Data Dive project is not an exam but a group endeavor that emulates the real-world working environment of data scientists. It is an integral part of the DATA 6300 course, offering students a fun, comprehensive platform to demonstrate their data science skills in an unstructured format. Each participant will be assigned to a group of 3-4 members who collaborate on the project over Zoom. The main tasks include data exploration, cleaning, feature engineering, and modeling, with the ultimate goal of predicting car sale prices using a dataset provided on D2L.
The primary purpose of this project is to enable students to apply their knowledge in a practical setting that simulates a real-world data science task. The dataset consists of approximately 50 features and 200,000 records related to car sale prices. Beyond predicting prices, the project encourages exploration, hypothesis testing about the data's features, and innovation in data handling and model building. It is an opportunity to practice data cleaning and feature engineering and to develop a deeper understanding of how different variables may influence car sale prices.
Exploration and Hypothesis Generation: Initially, groups are expected to familiarize themselves with the dataset, generating hypotheses about how various features might affect car sale prices. This phase involves critical thinking and creativity in predicting potential relationships within the data.
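A minimal sketch of this first pass might look like the following, assuming the D2L dataset has been saved locally as a CSV (the filename `car_sales.csv` and the column names `year` and `price` are hypothetical placeholders, not confirmed by the assignment):

```python
import pandas as pd

# Load the car sales dataset (filename is hypothetical; use the file from D2L).
df = pd.read_csv("car_sales.csv")

# First look at the roughly 50 features and 200,000 records.
print(df.shape)
df.info()
print(df.describe(include="all").T)

# Quick check of one candidate hypothesis: do newer cars sell for more?
# ('year' and 'price' are assumed column names.)
print(df[["year", "price"]].corr())
```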
Data Cleaning and Feature Engineering: This phase focuses on preparing the data for modeling by handling null values, correcting anomalies, and creating new features that could improve model performance. It's about making the dataset more coherent and useful for predictive modeling.
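As an illustration of what this phase can involve, here is a hedged sketch using the same hypothetical file; every column name (`odometer`, `price`, `sale_year`, `year`, `fuel_type`) is an assumption about the dataset, not a documented field:

```python
import pandas as pd

df = pd.read_csv("car_sales.csv")  # hypothetical filename, as above

# Drop columns that are mostly null, then fill remaining gaps sensibly.
null_share = df.isna().mean()
df = df.drop(columns=null_share[null_share > 0.5].index)
df["odometer"] = df["odometer"].fillna(df["odometer"].median())  # assumed column

# Correct obvious anomalies, e.g. non-positive sale prices.
df = df[df["price"] > 0]  # assumed column

# Engineer a new feature: vehicle age at the time of sale.
df["age"] = df["sale_year"] - df["year"]  # both columns are assumptions

# One-hot encode a categorical feature so it can enter a linear model.
df = pd.get_dummies(df, columns=["fuel_type"], drop_first=True)  # assumed column
```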
Modeling: Using the cleaned and engineered dataset, groups will construct models to predict car sale prices. This involves selecting appropriate features, handling multicollinearity, and evaluating model performance.
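One common way to carry out this phase, continuing from the cleaning sketch above (the predictor columns are still assumptions), is ordinary least squares with a variance inflation factor check for multicollinearity:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Continuing from the cleaning sketch: pick predictors and the target.
# (All column names remain assumptions about the dataset.)
X = df[["age", "odometer", "fuel_type_gas"]].astype(float)
y = df["price"]

# Variance inflation factors: values well above ~10 flag redundant predictors.
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)

# Fit an OLS model and inspect R-squared and coefficient p-values.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())
```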
Presentation and Documentation: Finally, each group will present its findings and methodology in a concise presentation. The project culminates in the submission of a well-documented Jupyter notebook that traces the steps taken, from initial data exploration to final model evaluation.
The results, summarized from the final Jupyter notebook, are as follows:
The project began with a thorough data cleaning phase that addressed null values and outliers, followed by careful feature engineering, including combining related variables and creating dummy variables for categorical data. Two models were then evaluated by their R-squared values: the first achieved an R-squared of 0.808, and the final model, after adjusting for multicollinearity and removing predictors with high p-values, achieved 0.788. Despite the slight decrease in R-squared, the final model is considered more reliable because of its attention to statistical significance and its reduced multicollinearity.

The findings highlight the challenges of working with an unclean dataset and underscore the importance of a meticulous approach to data preparation and modeling; data cleaning and feature engineering play a critical role in building effective predictive models. In all, the project is a comprehensive test of the skills taught in the DATA 6300 course, giving students a valuable opportunity to apply their knowledge in a practical, collaborative setting.
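The refinement step described above follows a standard pattern: fit, drop statistically insignificant predictors, refit, and compare fit quality. A minimal sketch of that pattern, continuing from the modeling code earlier (the notebook's actual feature set is not shown here, so `X` and `y` are the assumed names from that sketch):

```python
import statsmodels.api as sm

# Fit an initial model, drop predictors with p > 0.05, and refit.
first = sm.OLS(y, sm.add_constant(X)).fit()

keep = first.pvalues.drop("const")[lambda p: p <= 0.05].index
final = sm.OLS(y, sm.add_constant(X[keep])).fit()

# A small drop in R-squared (0.808 -> 0.788 in the notebook) can be an
# acceptable trade for a model whose remaining coefficients are all significant.
print(f"first R^2: {first.rsquared:.3f}, final R^2: {final.rsquared:.3f}")
```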