This project demonstrates the implementation of both relational (PostgreSQL) and graph-based NoSQL (Neo4j) database paradigms, showcasing advanced data modeling, API integration, and database migration techniques.
- Dual API Integration: Engineered a robust data collection pipeline that fetches and merges book data from the Penguin Random House and Google Books APIs, handling rate limiting and request failures gracefully
- Large-Scale Data Management: Successfully processed and stored data for over 120,000 books (~500 MB) with comprehensive metadata
- Complex Data Modeling:
  - PostgreSQL: Implemented database patterns including IS-A relationships, weak entities, and complex referential integrity constraints
  - Neo4j: Developed an optimized graph structure for representing book relationships and metadata, with performance-tuned indexing
- Advanced Query Optimization: Created and benchmarked various database indexes, demonstrating significant performance improvements (up to 50x faster for certain queries)
- Role-Based Access Control: Implemented view-based security measures to control data access based on user privileges
- Built with Python, PostgreSQL, Neo4j, Docker
- Implemented transaction-based data operations for fault tolerance
- Developed intelligent retry mechanisms for API rate limiting (a minimal sketch of both appears after this list)
- Utilized database-specific optimization techniques:
  - PostgreSQL: Triggers, materialized views, complex joins
  - Neo4j: Full-text search indexes, relationship-based queries
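The retry and transaction handling above are only summarized; the sketch below illustrates the general pattern in Python, assuming `requests` and `psycopg2` are the libraries in use. The endpoint URL, table name, columns, and connection parameters are placeholders rather than the project's actual names.

```python
import time

import psycopg2
import requests

# Placeholder endpoint for illustration; the real pipeline targets the
# Penguin Random House and Google Books APIs.
BOOKS_ENDPOINT = "https://api.example.com/books"


def fetch_with_retry(url, params=None, max_retries=5, backoff=2.0):
    """GET a URL, backing off and retrying when rate limited (HTTP 429)."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code == 429:
            # Honor Retry-After when the API provides it; otherwise back off exponentially.
            wait = float(response.headers.get("Retry-After", backoff * 2**attempt))
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")


def insert_books(conn, books):
    """Insert a batch of books inside a single transaction for fault tolerance."""
    with conn:  # commits on success, rolls back the whole batch on any exception
        with conn.cursor() as cur:
            for book in books:
                cur.execute(
                    "INSERT INTO book (isbn, title) VALUES (%s, %s) ON CONFLICT DO NOTHING",
                    (book["isbn"], book["title"]),
                )
```

Wrapping each batch in one transaction means a failed request or insert never leaves a half-written page of results in the database.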
| Name | Student ID |
|---|---|
| Nathan Grenier | 40250986 |
| Nathanial Hwong | 40243583 |
Below are the data model diagrams for each database.
A report providing deeper insight into the project can be found here.
The Postgres data dump file can be found either:
- In the project at `src/data/data_backup.sql`
- Online at https://mega.nz/file/cglF1T6S#nCY-0kH13lLYo-R3SqO5ZI7Z-yf0qlQJFOas8_FL5I0

Note: The MEGA upload might have messed up the character encoding of the `data_dump.sql` file.
Set these environment variables before running any commands:

Relational Database:
- `POSTGRES_DB`
- `POSTGRES_USER`
- `POSTGRES_PASSWORD`

Look in the `docker-compose.yaml` for the db credentials.

Neo4j (Graph Database):
- `NEO4J_PASSWORD`

API:
- `RANDOMHOUSE_API_KEY`
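For reference, a `.env` file using these variables might look like the following; the values shown are placeholders (use the Postgres credentials from `docker-compose.yaml` and your own API key).

```
POSTGRES_DB=db
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
NEO4J_PASSWORD=changeme
RANDOMHOUSE_API_KEY=your-api-key
```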
You should install Docker for your system before starting.
Both the Postgres instance and the database management tool (pgAdmin) are configured in the `docker-compose.yml` file.
- To run both services, use `docker compose up`. You can run the container in "detached" mode by appending the `-d` flag to the command above.
- Next, check that both services are running with `docker ps`.
- Copy the "postgres" service's docker id (ex: `1fc60e0e538d`).
- Inspect the details of the postgres container using `docker inspect {postgres id}`.
- Search for the `IPAddress` attribute of the postgres database and keep note of it.
- Open http://localhost:5050/ to view the pgAdmin webpage.
- Click on the "Add New Server" Quick Link in pgAdmin to add the postgres instance.
- In the General tab:
  - Give the postgres server a name.
- In the Connection tab:
  - Enter the postgres container's IP address.
  - Enter the same username as in the `.env` file (`POSTGRES_USER`).
  - Enter the same password as in the `.env` file (`POSTGRES_PASSWORD`).
The Neo4j Browser, an interactive window to run queries and visualize database data, can be accessed at http://localhost:7474/browser/ after running `docker compose up`.
Make sure to set the appropriate environment variables in your `.env` file.
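If you'd rather query Neo4j from Python than from the browser, a minimal sketch using the official `neo4j` driver is shown below. The bolt port `7687` and the `neo4j` username are defaults of the official image, and the `Book` label is only an example; the project's actual node labels and properties may differ.

```python
import os

from neo4j import GraphDatabase

# Defaults for the official Neo4j image; NEO4J_PASSWORD comes from your .env file.
URI = "bolt://localhost:7687"
AUTH = ("neo4j", os.environ["NEO4J_PASSWORD"])

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    with driver.session() as session:
        # Example query only; adjust labels/properties to the project's graph model.
        result = session.run("MATCH (b:Book) RETURN b.title AS title LIMIT 5")
        for record in result:
            print(record["title"])
```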
To render ER diagrams (Chen's notation), you must use the server version of PlantUML.
To pull and run the Docker image, run:
docker run -d -p 8181:8080 --name plantuml -e BASE_URL=plantuml plantuml/plantuml-server:jetty
Add the following settings to your `settings.json` file in VS Code:
"plantuml.server": "http://localhost:8181/plantuml",
"plantuml.render": "PlantUMLServer",
Note: You can change the host's port (the port before the ":") to whatever you'd like. The default for HTTP is usually `80` or `8080`.
- Start the container: `docker start {name}`
- Stop the container: `docker stop {name}`
- List all running containers: `docker ps`
To specify where the diagrams should be defined and exported, add the following to VS Code's `settings.json`:
"plantuml.diagramsRoot": "diagrams/src",
"plantuml.exportOutDir": "diagrams/out",
First, install virtualenv using `pip install virtualenv`.

Now, you can create a venv to work in using `virtualenv --python 3.12.1 venv`.

Note: You need the specified version of Python installed on your local computer to run the command above.
In order to activate the venv to start working in it, use this command:
# Linux and Mac
source venv/bin/activate
# Windows
.\venv\Scripts\activate
To stop working in the venv, use the command: `deactivate`.
Use the following command while in the venv to install the project's dependencies:
pip install -r requirements.txt
You can generate a data dump of your postgres database by using the following command: `pg_dump -U postgres -d db -f data_backup.sql --data-only --column-inserts`.
To populate your database with data, you'll need to either run the code (which will take a long time) or download and import the data dump file into the postgres Docker container through the command line.
Follow the steps below to populate your postgres database (in Docker):
- Copy the dump file into the container: `docker cp /path/to/your/local/data_dump.sql postgres:/tmp/data_dump.sql`
- Open a bash terminal inside the container: `docker exec -it postgres bash`
- From the bash terminal instance, run: `psql -U postgres -d db < /tmp/data_dump.sql`
If `psql` isn't installed in the container, try these alternatives:
- Install psql in the container: `apt-get update && apt-get install -y postgresql-client`
- Use `psql` from your environment: `docker exec -i postgres psql -U postgres -d db < /path/to/your/local/data_dump.sql`
I like using Ruff to format and lint my Python code. This package is installed whenever you install the project's dependencies and can be used with the following command:
ruff format .
If you want files to format on save, you can install the VS Code Ruff extension and add these lines to VS Code's `settings.json` file:
{
"notebook.formatOnSave.enabled": true,
"notebook.codeActionsOnSave": {
"notebook.source.organizeImports": "explicit"
},
"[python]": {
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
},
"editor.defaultFormatter": "charliermarsh.ruff"
}
}