This code scrapes the links of the top 5 online stories from the Daily Mail, BBC, and The Sun homepages. It scrapes each story for its title, timestamp, byline, and article text. It then runs the article text through EquiQuote, my tool for detecting the gender of news sources, and scrapes the results it generates. Finally, it exports a CSV with the article data as well as the results from EquiQuote.
This script is meant to help test the quality of results from EquiQuote, my 2023 dissertation project for my Master of Science in Computational and Data Journalism at Cardiff University. The three news outlets chosen were identified as the top 3 news brands in the UK by the Press Gazette in July 2023.
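To give a rough idea of the final export step, here is a minimal sketch of writing one CSV row per article. It is an illustration only: the column names in `FIELDNAMES` and the function `export_articles` are hypothetical and may not match what scraper.py actually writes.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real CSV headers may differ.
FIELDNAMES = ["outlet", "url", "title", "timestamp", "byline", "text", "equiquote_result"]


def export_articles(rows, out_path):
    """Write one CSV row per article, creating the parent folder if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        # Keys missing from a row are written as empty cells.
        writer.writerows(rows)
```

Each row is a plain dictionary keyed by column name, so fields that a given outlet does not provide (a missing byline, for instance) can simply be left out.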
-
Clone this repository:
git clone https://gitfront.io/r/user-7653615/hHok8ubE1FTX/equiquote-scraper.git
cd equiquote-scraper
-
Set up a virtual environment:
python3 -m venv venv
source venv/bin/activate
-
Install the required Python packages:
pip install -r requirements.txt
-
Download the appropriate version of ChromeDriver for your system and place it in your PATH or specify its location in your code.
-
Run the scraper:
python scraper.py
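For the ChromeDriver step above, the driver location can be passed to the browser automation library explicitly when it is not on your PATH. The following is a minimal sketch, assuming the scraper drives Chrome through Selenium 4; the helper names `find_chromedriver` and `make_driver` are hypothetical and are not taken from scraper.py.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_chromedriver(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a usable ChromeDriver path, or None if none is found."""
    # An explicit location wins if it points at an existing file.
    if explicit_path and Path(explicit_path).is_file():
        return str(Path(explicit_path))
    # Otherwise fall back to searching the directories on PATH.
    return shutil.which("chromedriver")


def make_driver(explicit_path: Optional[str] = None):
    """Start Chrome via Selenium 4 (an assumption; imported lazily)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver_path = find_chromedriver(explicit_path)
    if driver_path is None:
        raise FileNotFoundError("chromedriver not found on PATH or at the given location")
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Calling `find_chromedriver()` on its own is a quick way to check whether the driver is visible on your PATH before running the scraper.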
After running the scraper, you will find the scraped data in the data folder. The scraper increments the counter in counter.txt every time it runs and will stop after running five times. To reset it, replace the contents of counter.txt with "0".
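The run limit described above can be sketched as follows. This is an illustration only, assuming counter.txt holds a single integer; the function names and the `MAX_RUNS` constant are hypothetical and may not match scraper.py.

```python
from pathlib import Path

MAX_RUNS = 5  # the scraper is described as stopping after five runs


def read_counter(path: Path) -> int:
    """Return the stored run count, treating a missing or empty file as 0."""
    if not path.exists():
        return 0
    text = path.read_text().strip()
    return int(text) if text else 0


def register_run(path: Path) -> bool:
    """Increment the counter and return True if another run is allowed."""
    count = read_counter(path)
    if count >= MAX_RUNS:
        return False  # limit reached: leave the file untouched
    path.write_text(str(count + 1))
    return True


def reset_counter(path: Path) -> None:
    """Equivalent to replacing the file's contents with "0" by hand."""
    path.write_text("0")
```

Under this scheme, a script would call `register_run(Path("counter.txt"))` at startup and exit early when it returns `False`.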
Cron is a time-based job scheduler in Unix-like operating systems. You can use it to schedule the scraper to run at specific intervals, such as daily or weekly.
-
Open your crontab file for editing:
crontab -e
-
Add a new cron job:
In the editor, add a new line with the following format:
0 * * * * cd /path/to/equiquote-scraper && /path/to/venv/bin/python /path/to/equiquote-scraper/scraper.py >> /path/to/equiquote-scraper/cron.log 2>&1
Replace /path/to/equiquote-scraper and /path/to/venv/bin/python with the appropriate paths on your system. The >> appends the standard output to cron.log, and the 2>&1 sends the standard error to the same file.
In this example, the scraper will run every hour at minute 0. Adjust the cron schedule expression as needed.
-
Save and exit the editor:
After adding the cron job, save the changes and exit the text editor. The scraper should now run automatically at the specified intervals and log the output and any errors to cron.log. You can view the scheduled tasks by typing:
crontab -l
Note for vi and vim users: If your default editor is vi or vim, you'll need to enter "insert mode" before you can begin typing in the file. To do this, press the i key. You can then start editing the file. When you're done editing, press the Esc key to exit insert mode. To save your changes and exit the editor, type :wq and press Enter. If you want to exit without saving changes, type :q! and press Enter.
This project is licensed under the terms of the license provided in the LICENSE.txt file.