Run the file web_scraper_v2.py in your terminal:
python3 web_scraper_v2.py
- You can use my aws_connect.py file as a guideline for how to connect and upload to your S3 bucket (see the sketch under load_to_s3.py below)
Expected Output
- The script should output two folders:
- a folder containing the CSVs partitioned by year
- a folder containing a single CSV created by concatenating the partitioned CSVs
File Descriptions
web_scraper_v2.py
: parses HTML from a different URL that contains more data and separates the output files based on the year of the data
- also outputs a concatenated file of all years combined (a minimal sketch of the partition-and-concatenate step follows)
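
A minimal sketch of that step, assuming the scraped results are already in a pandas DataFrame with a "Year" column (the column and folder names here are hypothetical, not necessarily the ones the script uses):

```python
from pathlib import Path

import pandas as pd

def write_outputs(df: pd.DataFrame) -> None:
    """Write one CSV per year, plus a single concatenated CSV."""
    Path("partitioned").mkdir(exist_ok=True)  # hypothetical folder name
    for year, group in df.groupby("Year"):
        group.to_csv(Path("partitioned") / f"results_{year}.csv", index=False)
    # Concatenated file of all years combined
    Path("combined").mkdir(exist_ok=True)  # hypothetical folder name
    df.to_csv(Path("combined") / "all_years.csv", index=False)
```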
load_to_s3.py
: connects to my S3 bucket and uploads CSVs to the bucket (a minimal sketch follows)
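
A minimal sketch of the upload, assuming AWS credentials are already configured (e.g. via `aws configure`); the bucket name is a placeholder, not the real one:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

def upload_folder(folder: str, bucket: str = "my-dl-bucket") -> None:
    """Upload every CSV in a local folder to the S3 bucket."""
    for csv_path in Path(folder).glob("*.csv"):
        s3.upload_file(str(csv_path), bucket, f"{folder}/{csv_path.name}")
```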
web_scraper_v1.py
: extracts PDFs from the HTML of the Diamond League site. At this point in the project it is not very useful, but handy if I ever need to extract PDFs again (a sketch follows)
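
A sketch of pulling PDF links out of a page with requests and BeautifulSoup (the URL is a placeholder, and the real script may work differently):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/results"  # placeholder URL

resp = requests.get(BASE_URL)
soup = BeautifulSoup(resp.text, "html.parser")

# Collect every anchor that points at a PDF, resolving relative links
pdf_links = [urljoin(BASE_URL, a["href"])
             for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(".pdf")]

for url in pdf_links:
    with open(url.rsplit("/", 1)[-1], "wb") as f:
        f.write(requests.get(url).content)
```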
pdf_parser.py
: goes into each PDF and extracts its data. Again, not used since I have found a better approach, but I want to keep it here because it is a part of the journey (a sketch of one possible approach follows)
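
A sketch of one way to do this with pdfplumber (the real parser may use a different library; the filename is a placeholder):

```python
import pdfplumber

with pdfplumber.open("results.pdf") as pdf:  # placeholder filename
    for page in pdf.pages:
        table = page.extract_table()  # returns None if no table is detected
        if table:
            for row in table:
                print(row)
```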
Next Steps
- Add location data such as:
- Elevation, temperature, and weather
- Try to find speed-tracking data
- Create visualizations/dashboard
- Add a transform file (a sketch of a few of these steps follows the list):
- add 19 or 20 to the two-digit birth year
- add an age column
- make DOB more date-like
- add a date-of-event column
- add weather data for that date
- make a column for the venue location
- remove the location from the original venue column
- add race distance
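
A sketch of a few of these steps, assuming hypothetical column names ("DOB", "Venue") and a simple two-digit-year cutoff; the real data may need different handling:

```python
import pandas as pd

df = pd.read_csv("all_years.csv")  # placeholder input file

# Add 19 or 20 to the two-digit birth year (cutoff heuristic: anything
# above 25 is assumed to be a 1900s birth year)
two_digit = df["DOB"].str[-2:].astype(int)
df["birth_year"] = two_digit.apply(lambda y: 1900 + y if y > 25 else 2000 + y)

# Make DOB more date-like and derive an age column
df["DOB"] = pd.to_datetime(df["DOB"], errors="coerce")
df["age"] = (pd.Timestamp.now() - df["DOB"]).dt.days // 365

# Split the venue location into its own column and strip it from the original
split = df["Venue"].str.rsplit(",", n=1)
df["venue_location"] = split.str[-1].str.strip()
df["Venue"] = split.str[0]
```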
- Load to a MySQL database (maybe using DBeaver as the client); a sketch is below
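
A minimal sketch of the load using pandas and SQLAlchemy; the connection string, database, and table name are all placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; DBeaver would just be the client for browsing
engine = create_engine(
    "mysql+pymysql://user:password@localhost:3306/diamond_league"
)

df = pd.read_csv("all_years.csv")  # placeholder input file
df.to_sql("results", engine, if_exists="replace", index=False)
```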
- Orchestrate with Airflow on an EC2 instance (a sketch of a DAG is below)
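
A minimal sketch of a DAG that chains the existing scripts, assuming Airflow 2.4+ (the schedule and task IDs are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="diamond_league_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    scrape = BashOperator(task_id="scrape",
                          bash_command="python3 web_scraper_v2.py")
    load = BashOperator(task_id="load_to_s3",
                        bash_command="python3 load_to_s3.py")
    scrape >> load  # run the scrape before the upload
```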
- Create a full data pipeline
- Use regression to predict athletes' next race results (a baseline sketch is below)
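
A baseline sketch with scikit-learn, assuming hypothetical feature and target columns that would come out of the transform step:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("transformed.csv")  # placeholder input file

X = df[["age", "season_best", "races_this_season"]]  # hypothetical features
y = df["finish_time"]                                # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out races:", model.score(X_test, y_test))
```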
- If possible, add the ability to choose which event you want to look at
Metrics to track
- How much athletes are affected by different weather conditions
- Whether any tracks are "faster" than others
Goal
- My goal for this project is to build a model that forecasts each athlete's likelihood of winning the next race based on their previous performances