
# Job Board Scraper

Job Board Scraper collects, cleans, organizes, and indexes English teaching positions from an existing online job board once a day.

A Scrapy spider scrapes the job board, and the results feed a Django website backed by a PostgreSQL database and an Elasticsearch search index. The website is hosted on Heroku.
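The cleaning step isn't shown in this README; as a hypothetical sketch (the function name and rules are assumptions, not taken from the project), a scraped posting title might be normalized like this:

```python
import re

def clean_title(raw_html):
    """Strip HTML tags and collapse whitespace in a scraped job title.

    Hypothetical helper -- the real pipeline's cleaning rules may differ.
    """
    text = re.sub(r"<[^>]+>", " ", raw_html)  # drop tags left over from the scrape
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()
```

For example, `clean_title("<b>ESL  Teacher</b> - Seoul")` returns `"ESL Teacher - Seoul"`.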

## Install

Prerequisites: Python 3, SQLite, Redis, pip, virtualenv, virtualenvwrapper, Git.

```sh
$ mkvirtualenv jobboardscraper -p python3
$ git clone git@github.com:richardcornish/jobboardscraper.git
$ cd jobboardscraper/
$ pip install -r requirements.txt
$ cd jobboardscraper/
$ python manage.py migrate
$ python manage.py loaddata jobboardscraper/fixtures/*
$ python manage.py createsuperuser
$ python manage.py runserver
```

Open http://127.0.0.1:8000 in a browser. Stop the server with Ctrl+C.

Setting a virtualenv default directory is usually a good idea:

```sh
$ setvirtualenvproject $WORKON_HOME/jobboardscraper/ ~/Sites/jobboardscraper/jobboardscraper/
$ cdproject
```

## Scrape

To run the spider that scrapes the job board:

```sh
$ cd scraper/
$ scrapy crawl eslcafe
```

## Search

Elasticsearch is required to build and update the search index. Assuming Homebrew is installed, install Elasticsearch, start it, and build the initial index:

```sh
$ brew install caskroom/cask/brew-cask
$ brew install Caskroom/cask/java
$ brew install elasticsearch
$ elasticsearch --config=/usr/local/opt/elasticsearch/config/elasticsearch.yml
$ python manage.py rebuild_index
```

Future indexing:

```sh
$ python manage.py update_index
```
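`rebuild_index` and `update_index` are django-haystack management commands, which suggests the search backend is wired up in `settings.py` roughly like this (the URL and index name below are placeholders, not the project's actual values):

```python
# settings.py -- sketch of a django-haystack Elasticsearch backend;
# the URL and index name are placeholders, not the project's values
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'jobboardscraper',
    },
}
```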

## Deploy

If you're using Heroku, deploying requires the Heroku Toolbelt. The commands below also provision the add-ons used: Heroku Postgres, Heroku Redis, and SearchBox Elasticsearch.

Initial deploy:

```sh
$ heroku login
$ heroku create
$ heroku config:set SECRET_KEY='...' # replace with your own
$ heroku config:set DEBUG=''
$ heroku addons:create heroku-postgresql:hobby-dev
$ heroku addons:create heroku-redis:hobby-dev
$ heroku addons:create searchbox:starter
$ git push heroku master
$ heroku run python jobboardscraper/manage.py migrate
$ heroku run python jobboardscraper/manage.py loaddata jobboardscraper/jobboardscraper/fixtures/*
$ heroku run python jobboardscraper/manage.py createsuperuser
$ heroku open
```

Future deploys:

```sh
$ git push heroku master
```

After the initial deploy, you can scrape the job board and build the search index on Heroku:

```sh
$ heroku run '(cd jobboardscraper/scraper/ && scrapy crawl eslcafe)'
$ heroku run python jobboardscraper/manage.py rebuild_index
```

Future scraping and indexing are handled by daily Celery tasks with a Redis broker.
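A daily schedule like that is typically declared with Celery beat; a minimal sketch (the task paths and times below are assumptions, not taken from the project's source):

```python
# Celery beat schedule -- sketch only; task paths and timing
# are hypothetical, not taken from the project's source
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'scrape-eslcafe-daily': {
        'task': 'scraper.tasks.crawl_eslcafe',  # hypothetical task path
        'schedule': crontab(hour=4, minute=0),  # once a day, 04:00 UTC
    },
    'update-search-index-daily': {
        'task': 'jobs.tasks.update_index',      # hypothetical task path
        'schedule': crontab(hour=5, minute=0),  # after the scrape finishes
    },
}
```

With Redis as the broker, the worker and beat processes pick this schedule up and run both tasks daily without manual intervention.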