(for database credentials: place the database.ini file in the repo directory on your machine)
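A minimal sketch of reading those credentials with configparser and building a SQLAlchemy engine for the Supabase-hosted PostgreSQL database; the [postgresql] section name, the key names, and the psycopg2 driver are assumptions, so match them to the actual database.ini in this repo:

from configparser import ConfigParser
from sqlalchemy import create_engine

parser = ConfigParser()
parser.read("database.ini")
cfg = parser["postgresql"]  # assumed section name

# Assumes host/port/database/user/password keys and the psycopg2 driver
engine = create_engine(
    f"postgresql+psycopg2://{cfg['user']}:{cfg['password']}@{cfg['host']}:{cfg['port']}/{cfg['database']}"
)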
Current development:
- web scraping/crawling
- ORM database management
- PostgreSQL database
- currently hosted by Supabase
- Wiki documentation
Future development:
- NLP
- Website
- Aggregation and other data analysis
Due to the use of scrapy-playwright (which lets scrapy load JS elements), we recommend installing WSL/Ubuntu to run the scrapy spider.
However, you may still want a conda environment on Windows for quick debugging and development in your IDE (e.g., VSCode, PyCharm).
Other utilities to consider include this for commands like view(response) in the scrapy shell.
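For reference, view(response) is run from inside the scrapy shell, e.g.:
scrapy shell [insert url]
view(response)  # opens the fetched page in your default browser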
There may be more dependencies to install, including:
- playwright install-deps
- etc
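In particular, scrapy-playwright needs the Playwright browser binaries, which are installed separately inside the environment:
playwright install  # downloads the browser binaries used by scrapy-playwright
playwright install-deps  # installs the system libraries those browsers need (Linux/WSL)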
- Open Anaconda Prompt
- Create conda environment from environment-[os].yml: this has all the necessary libraries and packages (including ipykernel)
NOTE: environment.yml will need to be updated if we need to use more packages
(base) > conda env create -f environment-[os].yml
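Then activate it; db_env below is the environment name referenced in the Jupyter kernel section, so adjust it if your environment file uses a different name:
(base) > conda activate db_env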
- Main packages:
- SQLAlchemy 2.0
- Camelot
- Selenium
- Beautiful Soup 4
- ipykernel
- lxml
- html5lib
- pandas, numpy
- scrapy
- scrapy-playwright
- Use the following command to update environment.yml:
conda env export > environment.yml
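If the exported file should stay usable across operating systems, one option is to drop the build strings:
conda env export --no-builds > environment.yml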
(currently migrating from selenium/bs4 to scrapy)
- In the web_crawling folder (the one containing settings.py and a nested web_crawling folder), run:
scrapy crawl munispider
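For reference, a playwright-enabled spider looks roughly like the sketch below. This is only an illustration (not the actual munispider) and assumes the project settings already enable the scrapy-playwright download handler and the asyncio Twisted reactor:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # meta={"playwright": True} has scrapy-playwright render the page
        # (including its JS elements) in a headless browser before parse() runs
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}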
- Check out the wiki for updated information
- Open Anaconda Prompt
- Install nb_conda_kernels in the base environment: this allows you to access conda environments in Jupyter Notebook
(as long as ipykernel is installed in them)
(base) > conda install nb_conda_kernels
- When running .ipynb, switch kernel to "Python [conda env: db_env]"
- A quick test to make sure the environment/kernel is working:
import sqlalchemy
sqlalchemy.__version__
>> '2.0.12'
- Download the Chrome driver. Change the path to the Chrome driver in get_html() in scraper_functions.py.
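get_html() is this repo's own function, but pointing Selenium at a local chromedriver typically looks like the sketch below (assumes Selenium 4's Service API; the path and URL are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the downloaded chromedriver binary
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))
driver.get("https://example.com")
html = driver.page_source
driver.quit()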
- Main function is create_csv_url() in data_processing.py
from data_processing import *
create_csv_url("CA", "Los Angeles County", [insert url], [insert table number])  # web-scrape
create_csv_url("CA", "Los Angeles County")  # if csv exists
- Follow prompts
- Main function is read_database() in database_functions.py
from database_functions import *
read_database()
- Follow prompts
- Main function is read_pdf() in scraper_functions.py
- Install Ghostscript for your OS here
- Run the function with your parameters
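read_pdf() is defined in this repo, so check its parameters there; the underlying table extraction presumably goes through Camelot (listed above), which, with Ghostscript installed, is called roughly like this:

import camelot

# Extract tables from page 1; the "lattice" flavor expects ruled tables and relies on Ghostscript
tables = camelot.read_pdf("example.pdf", pages="1", flavor="lattice")
df = tables[0].df  # first detected table as a pandas DataFrame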