Job Board Scraper collects, cleans, organizes, and indexes English teaching positions from an existing online job board once a day.
The code scrapes the job board with Scrapy and integrates it into a Django website with an Elasticsearch search index and a PostgreSQL database. The website is hosted on Heroku.
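For orientation, the glue between those pieces lives in the Django settings. A minimal sketch, assuming dj-database-url and django-haystack; the env var names, defaults, and index name are assumptions, not the repo's exact values:

# settings.py (sketch) -- assumes dj-database-url and django-haystack;
# env var names and defaults here are illustrative assumptions
import os

import dj_database_url

# PostgreSQL on Heroku (via DATABASE_URL); SQLite for local development
DATABASES = {
    'default': dj_database_url.config(default='sqlite:///db.sqlite3'),
}

# Haystack wired to Elasticsearch for the search index
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': os.environ.get('SEARCHBOX_URL', 'http://127.0.0.1:9200/'),
        'INDEX_NAME': 'haystack',
    },
}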
Prerequisites: Python 3, SQLite, Redis, pip, virtualenv, virtualenvwrapper, Git.
$ mkvirtualenv jobboardscraper -p python3
$ git clone [email protected]:richardcornish/jobboardscraper.git
$ cd jobboardscraper/
$ pip install -r requirements.txt
$ cd jobboardscraper/
$ python manage.py migrate
$ python manage.py loaddata jobboardscraper/fixtures/*
$ python manage.py createsuperuser
$ python manage.py runserver
Open http://127.0.0.1:8000. Kill the server with Ctrl+C.
Setting a virtualenv default directory is usually a good idea:
$ setvirtualenvproject $WORKON_HOME/jobboardscraper/ ~/Sites/jobboardscraper/jobboardscraper/
$ cdproject
To run the spider that scrapes the job board:
$ cd scraper/
$ scrapy crawl eslcafe
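For reference, a minimal sketch of what the eslcafe spider can look like with modern Scrapy; the start URL, link pattern, and field names are illustrative assumptions, not the repo's exact code:

# scraper/spiders/eslcafe.py (sketch) -- selectors, URL patterns, and
# field names are assumptions for illustration
import scrapy

class EslcafeSpider(scrapy.Spider):
    name = 'eslcafe'
    start_urls = ['http://www.eslcafe.com/jobs/korea/']  # assumed start page

    def parse(self, response):
        # Each posting links to a detail page; follow and extract it
        for href in response.css('a::attr(href)').getall():
            if 'read=' in href:  # assumed posting URL pattern
                yield response.follow(href, callback=self.parse_posting)

    def parse_posting(self, response):
        yield {
            'title': response.css('title::text').get(),
            'body': ' '.join(response.css('body *::text').getall()).strip(),
            'url': response.url,
        }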
Elasticsearch is required to build and update the search index. Assuming Homebrew is installed, install Elasticsearch and run the initial indexing:
$ brew install caskroom/cask/brew-cask
$ brew install caskroom/cask/java
$ brew install elasticsearch
$ elasticsearch --config=/usr/local/opt/elasticsearch/config/elasticsearch.yml
$ python manage.py rebuild_index
Future indexing:
$ python manage.py update_index
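rebuild_index and update_index are django-haystack management commands; they discover SearchIndex classes such as the following sketch (the Job model and its fields are assumptions for illustration):

# jobs/search_indexes.py (sketch) -- Haystack discovers this module;
# the Job model and its fields are assumptions, not the repo's exact code
from haystack import indexes

from .models import Job

class JobIndex(indexes.SearchIndex, indexes.Indexable):
    # Primary document field, rendered from a template of model fields
    text = indexes.CharField(document=True, use_template=True)
    posted = indexes.DateTimeField(model_attr='posted', null=True)

    def get_model(self):
        return Job

    def index_queryset(self, using=None):
        # Index everything; update_index can narrow this by --age
        return self.get_model().objects.all()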
If you're using Heroku, deploying requires the Heroku Toolbelt. The initial deploy below also creates the add-ons I installed: Heroku Postgres, Heroku Redis, and SearchBox Elasticsearch. SECRET_KEY and DEBUG are read from the environment by the Django settings (see the sketch after the deploy commands).
Initial deploy:
$ heroku login
$ heroku create
$ heroku config:set SECRET_KEY='...' # replace with your own
$ heroku config:set DEBUG=''
$ heroku addons:create heroku-postgresql:hobby-dev
$ heroku addons:create heroku-redis:hobby-dev
$ heroku addons:create searchbox:starter
$ git push heroku master
$ heroku run python jobboardscraper/manage.py migrate
$ heroku run python jobboardscraper/manage.py loaddata jobboardscraper/jobboardscraper/fixtures/*
$ heroku run python jobboardscraper/manage.py createsuperuser
$ heroku open
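A minimal sketch of how the SECRET_KEY and DEBUG config vars set above can be read in the settings; the fallback values are assumptions:

# settings.py (sketch) -- fallback values are assumptions
import os

SECRET_KEY = os.environ.get('SECRET_KEY', 'insecure-dev-only-key')
# Any non-empty string turns DEBUG on, so `heroku config:set DEBUG=''`
# switches it off in production while local development defaults to on
DEBUG = bool(os.environ.get('DEBUG', 'true'))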
Future deploys:
$ git push heroku master
After installation you can scrape the website and build the search index on Heroku:
$ heroku run '(cd jobboardscraper/scraper/ && scrapy crawl eslcafe)'
$ heroku run python jobboardscraper/manage.py rebuild_index
Future scraping and indexing are handled by daily Celery tasks with a Redis broker.
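A sketch of what those periodic tasks can look like; the task module, names, and invocation are assumptions (the repo may run the spider differently, e.g. via Scrapy's Python API):

# scraper/tasks.py (sketch) -- task names and paths are assumptions
import subprocess

from celery import shared_task

@shared_task
def crawl_eslcafe():
    # Run the spider from its Scrapy project directory
    subprocess.run(['scrapy', 'crawl', 'eslcafe'], cwd='scraper', check=True)

@shared_task
def update_search_index():
    # Pick up newly scraped jobs in the Elasticsearch index
    from django.core.management import call_command
    call_command('update_index')

A CELERY_BEAT_SCHEDULE entry in the settings (or an equivalent cron trigger) would then fire both tasks once a day, with the REDIS_URL set by the Heroku Redis add-on as the broker.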