An ML project on the job listings scraped in the Joblisting Webscraper project and explored in the Joblisting Cleaning EDA project.
In short, I'm motivated by a desire to learn more about the data science pipeline and to venture into the unknown! I've done countless scikit-learn projects, but I pursued this one because I wanted something more in-depth.
Figure 1. Data science lifecycle.
This project is only one step in a larger series! To check out the other projects in this series:
About the structure of this repo:

- `csv/` stores the CSVs I generated from the previous project
- `diagrams/` stores the diagrams from the past 2 projects and this project
- `img/` stores images from the past 2 projects and this project
- `input/` stores the dataset I scraped, along with the split and preprocessed data
- `pages/` stores the subpage for my app
- `pipelines/` stores the pipelines I tested in `modeling.ipynb`
- `1_🧠_Predictive_Modeling.py` is the main page of my Streamlit app
- `modeling.ipynb` is the source code
- `resources.md` is a list of all the resources I reference throughout this project
- `utils.py` contains a few helper functions for my app
Note: the package versions listed in `requirements.txt` and imported in my code may not be the exact versions I used. However, exact versioning here is less important; I've listed all the libraries used.
A little about the dataset: the data was scraped from Glassdoor.com's job listings for data science jobs using my own webscraper, which can be found here: https://github.com/alckasoc/Joblisting-Webscraper. The dataset is small and lives in this repo under `input/`. As an alternative, I've also made it publicly available on Kaggle: https://www.kaggle.com/datasets/vincenttu/glassdoor-joblisting.
Note: I talk more about this in the app! I faced a ton of difficulties going into this project. For one, prior to this, I'd only ever made simple projects modeling tidy data in fixed environments without much depth. Venturing into this unknown meant a lot of searching, reading, and learning! Along the way, I ran into countless problems, both code-wise and model-wise.
Note: I talk more about this in the app! I learned more about each step of the machine learning pipeline. I'd never gone this in-depth in any of these subjects, whether feature engineering or hyperparameter tuning. In this project I aimed to flesh out each and every aspect to the best of my ability. I learned tools like `optuna`, `raytune`, and `hyperopt` for hyperparameter tuning. I learned various feature engineering methods and libraries like `lofo`. I learned a bit about AutoML through tools like `autofeat` and `EvalML`. More importantly, I learned about the experimentation process and how crucial a strong cross-validation framework is for testing what works and what doesn't. This is something every Kaggler knows!
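The tuning-inside-cross-validation idea above can be sketched in a few lines. This is a minimal illustration using scikit-learn on synthetic data, not the project's actual models or hyperparameters; Optuna, Ray Tune, and Hyperopt all essentially wrap an objective function like the inner loop here.

```python
# Minimal sketch: compare hyperparameter candidates with a fixed CV split.
# The dataset, model, and alpha grid are illustrative assumptions only.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Use the SAME folds for every candidate so scores are directly comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv, scoring="r2")
    results[alpha] = scores.mean()

best_alpha = max(results, key=results.get)
print(f"best alpha: {best_alpha}, mean CV R^2: {results[best_alpha]:.3f}")
```

Libraries like Optuna replace the hand-written grid loop with a smarter search, but the core of the framework is the same: one consistent cross-validation scheme scoring every experiment.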
For information on references and resources, refer to `resources.md`.
Contact me:
Gmail: [email protected]
LinkedIn: Vincent Tu
Kaggle: vincenttu
Psst, I've written another thank-you note in my app (check it out). I'd just like to reiterate that I'm grateful for the tools, documentation, and articles available to me. They have been a great help, and without them this project would've been much like any other one I've made! And thank you again, Catherine, for your help with the visuals and banners! This project is incomplete without you. ❤️
Lastly, thank you for viewing!