Job Board Scraper collects, cleans, organizes, and indexes English teaching positions from an existing online job board once a day.
The code scrapes the job board with Scrapy and integrates it into a Django website with an Elasticsearch search index and a PostgreSQL database. The website is hosted on Heroku.
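For orientation, the glue between those pieces lives in the Django settings. A minimal sketch, assuming dj-database-url and django-haystack; the env var names, defaults, and index name are assumptions, not the repo's exact values:

# settings.py (sketch) -- assumes dj-database-url and django-haystack;
# env var names and defaults here are illustrative assumptions
import os

import dj_database_url

# PostgreSQL on Heroku (via DATABASE_URL); SQLite for local development
DATABASES = {
    'default': dj_database_url.config(default='sqlite:///db.sqlite3'),
}

# Haystack wired to Elasticsearch for the search index
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': os.environ.get('SEARCHBOX_URL', 'http://127.0.0.1:9200/'),
        'INDEX_NAME': 'haystack',
    },
}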
Prerequisites: Python 3, SQLite, Redis, pip, virtualenv, virtualenvwrapper, Git.
$ mkvirtualenv jobboardscraper -p python3
$ git clone [email protected]:richardcornish/jobboardscraper.git
$ cd jobboardscraper/
$ pip install -r requirements.txt
$ cd jobboardscraper/
$ python manage.py migrate
$ python manage.py loaddata jobboardscraper/fixtures/*
$ python manage.py createsuperuser
$ python manage.py runserver
Open http://127.0.0.1:8000. Kill the server with Ctrl+C.
Setting a virtualenv default directory is usually a good idea:
$ setvirtualenvproject $WORKON_HOME/jobboardscraper/ ~/Sites/jobboardscraper/jobboardscraper/
$ cdproject
To run the spider that scrapes the job board:
$ cd scraper/
$ scrapy crawl eslcafe
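For reference, a minimal sketch of what the eslcafe spider can look like with modern Scrapy; the start URL, link pattern, and field names are illustrative assumptions, not the repo's exact code:

# scraper/spiders/eslcafe.py (sketch) -- selectors, URL patterns, and
# field names are assumptions for illustration
import scrapy

class EslcafeSpider(scrapy.Spider):
    name = 'eslcafe'
    start_urls = ['http://www.eslcafe.com/jobs/korea/']  # assumed start page

    def parse(self, response):
        # Each posting links to a detail page; follow and extract it
        for href in response.css('a::attr(href)').getall():
            if 'read=' in href:  # assumed posting URL pattern
                yield response.follow(href, callback=self.parse_posting)

    def parse_posting(self, response):
        yield {
            'title': response.css('title::text').get(),
            'body': ' '.join(response.css('body *::text').getall()).strip(),
            'url': response.url,
        }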
Elasticsearch is required to build and update the search index. Assuming Homebrew is installed, install Elasticsearch and run the initial indexing:
$ brew install caskroom/cask/brew-cask
$ brew install caskroom/cask/java
$ brew install elasticsearch
$ elasticsearch --config=/usr/local/opt/elasticsearch/config/elasticsearch.yml
$ python manage.py rebuild_index
Future indexing:
$ python manage.py update_index
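rebuild_index and update_index are django-haystack management commands; they discover SearchIndex classes such as the following sketch (the Job model and its fields are assumptions for illustration):

# jobs/search_indexes.py (sketch) -- Haystack discovers this module;
# the Job model and its fields are assumptions, not the repo's exact code
from haystack import indexes

from .models import Job

class JobIndex(indexes.SearchIndex, indexes.Indexable):
    # Primary document field, rendered from a template of model fields
    text = indexes.CharField(document=True, use_template=True)
    posted = indexes.DateTimeField(model_attr='posted', null=True)

    def get_model(self):
        return Job

    def index_queryset(self, using=None):
        # Index everything; update_index can narrow this by --age
        return self.get_model().objects.all()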
If you're using Heroku, deploying requires the Heroku Toolbelt. The initial deploy below also creates the add-ons I installed: Heroku Postgres, Heroku Redis, and SearchBox Elasticsearch. SECRET_KEY and DEBUG are read from the environment by the Django settings (see the sketch after the deploy commands).
Initial deploy:
$ heroku login
$ heroku create
$ heroku config:set SECRET_KEY='...' # replace with your own
$ heroku config:set DEBUG=''
$ heroku addons:create heroku-postgresql:hobby-dev
$ heroku addons:create heroku-redis:hobby-dev
$ heroku addons:create searchbox:starter
$ git push heroku master
$ heroku run python jobboardscraper/manage.py migrate
$ heroku run python jobboardscraper/manage.py loaddata jobboardscraper/jobboardscraper/fixtures/*
$ heroku run python jobboardscraper/manage.py createsuperuser
$ heroku open
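A minimal sketch of how the SECRET_KEY and DEBUG config vars set above can be read in the settings; the fallback values are assumptions:

# settings.py (sketch) -- fallback values are assumptions
import os

SECRET_KEY = os.environ.get('SECRET_KEY', 'insecure-dev-only-key')
# Any non-empty string turns DEBUG on, so `heroku config:set DEBUG=''`
# switches it off in production while local development defaults to on
DEBUG = bool(os.environ.get('DEBUG', 'true'))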
Future deploys:
$ git push heroku master
After installation you can scrape the website and build the search index on Heroku:
$ heroku run '(cd jobboardscraper/scraper/ && scrapy crawl eslcafe)'
$ heroku run python jobboardscraper/manage.py rebuild_index
Future scraping and indexing are handled by daily Celery tasks with a Redis broker.
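A sketch of what those periodic tasks can look like; the task module, names, and invocation are assumptions (the repo may run the spider differently, e.g. via Scrapy's Python API):

# scraper/tasks.py (sketch) -- task names and paths are assumptions
import subprocess

from celery import shared_task

@shared_task
def crawl_eslcafe():
    # Run the spider from its Scrapy project directory
    subprocess.run(['scrapy', 'crawl', 'eslcafe'], cwd='scraper', check=True)

@shared_task
def update_search_index():
    # Pick up newly scraped jobs in the Elasticsearch index
    from django.core.management import call_command
    call_command('update_index')

A CELERY_BEAT_SCHEDULE entry in the settings (or an equivalent cron trigger) would then fire both tasks once a day, with the REDIS_URL set by the Heroku Redis add-on as the broker.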