As an aspiring ML engineer, I put my machine learning skills to the test by tackling the Housing Prices Challenge on Kaggle. The goal of this challenge is to predict the prices of houses in Ames, Iowa from a given set of 79 features. This project let me practice critical Data Science & Machine Learning techniques.
This repository is organized via various self-explanatory folders.
The model is evaluated using the Root Mean Square Error (RMSE), which is the metric we are trying to minimize. My best model has an RMSE of 0.13757, which currently ranks in the top 43% of the leaderboard. In reality, my solution would rank much higher, for two reasons:
- Some solutions have an implausible RMSE of 0.0. No Machine Learning model can predict with such accuracy, so I suspect cheating occurred here.
- Some solutions have an RMSE of 0.00044. After inspecting these solutions, I found that they are invalid: the competitors simply submitted the answers to a similar challenge (Boston Housing Prices). Once again, I believe this is cheating, since no real Machine Learning methodology is being applied.
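For reference, the metric can be sketched in a few lines of plain Python. This is only a sketch under an assumption: that the leaderboard score is the RMSE between the logarithm of the predicted price and the logarithm of the actual price. The function name `rmse_log` and the sample prices are illustrative and not taken from this repository.

```python
import math


def rmse_log(actual_prices, predicted_prices):
    """RMSE between the natural logs of actual and predicted prices."""
    if len(actual_prices) != len(predicted_prices):
        raise ValueError("inputs must have the same length")
    if not actual_prices:
        raise ValueError("inputs must not be empty")
    # Squared difference of log-prices for each house.
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2
        for a, p in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]
score = rmse_log(actual, predicted)
print(f"RMSE (log scale): {score:.5f}")
```

Under this definition, a score of exactly 0.0 would require every prediction to match the true price, which is why such leaderboard entries look suspicious.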
Final Model: My best model is a tuned CatBoost Model.
Note: you may use my solution as a reference; however, I strongly advise you to tackle this challenge on your own. The only way to get better at machine learning is to practice it yourself. I do not condone, and am not responsible for, any cheating that may occur as a result of this repository.
This is the checklist I use for every ML project. It goes through every major step & ensures that I have done everything correctly.
- Framing the Problem - Complete
- Getting the Data - Complete
- Exploring the Data - Complete
- Data Preprocessing - Complete
- Model Development - Complete
- Model Tuning/Ensemble Learning - Complete
- Deploying Model on Test Set & Presentation of Solution - Complete
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
- Medium Article about Unit Testing in Python
- Medium Article about sys module usage
There are countless adjustments I could make to improve my score; however, here are a few that seem promising:
- Combine the Tuned-CatBoost model with some other models (Linear Regression & Support Vector Machines seem promising)
- Feature Engineering: Reduce the number of categories for certain categorical features.
- Feature Importance: Use my model's feature importances to guide further feature selection.
- Outside Data: Incorporate outside data, as many credible top-ranked solutions do.
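Two of these ideas, reducing categories and combining models, can be sketched in plain Python. This is a minimal sketch under assumptions, not code from this repository: the helper names `collapse_rare_categories` and `blend_predictions`, the count threshold, the weights, and all sample values are hypothetical.

```python
from collections import Counter


def collapse_rare_categories(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def blend_predictions(predictions_by_model, weights):
    """Weighted average of per-model predictions (weights are normalized)."""
    if set(predictions_by_model) != set(weights):
        raise ValueError("every model needs exactly one weight")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(next(iter(predictions_by_model.values())))
    if any(len(p) != n_rows for p in predictions_by_model.values()):
        raise ValueError("all models must predict the same number of rows")
    return [
        sum(weights[name] * preds[i]
            for name, preds in predictions_by_model.items()) / total
        for i in range(n_rows)
    ]


# Illustrative category labels; the single rare one becomes "Other".
neighborhoods = ["NAmes", "NAmes", "Blueste", "OldTown", "OldTown"]
print(collapse_rare_categories(neighborhoods))

# Made-up predictions from three hypothetical models.
preds = {
    "catboost": [200000.0, 150000.0],
    "linear": [190000.0, 160000.0],
    "svm": [210000.0, 155000.0],
}
model_weights = {"catboost": 0.6, "linear": 0.2, "svm": 0.2}
print(blend_predictions(preds, model_weights))
```

In practice the weights and the rarity threshold would need to be chosen on a validation set rather than fixed by hand as they are here.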
This project was very enjoyable, and I definitely learned a lot along the way! I would recommend this challenge to anyone who is looking to dive into Machine Learning & Data Science. It is quite simple, and the dataset is relatively small & not overwhelming. Overall, this challenge was really fun and a great learning experience!
I am an undergraduate student @ Rutgers University New Brunswick, pursuing bachelor's degrees in Computer Science and Cognitive Science. Furthermore, I am pursuing a certificate in Data Science. I have a passion for AI, and I am always intrigued by its power. Feel free to contact me via LinkedIn.
Enjoy!
Jinal Shah