Note: Notice the variable settings in the .env file.
Start the PostgreSQL database and pgAdmin (a web interface to the database; see the images further below):
docker-compose up
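The compose file itself is not shown here, but a minimal sketch consistent with the commands in this tutorial might look as follows. The service name matches the `sqlNode1` host used in `profiles.yml`, and the network `db-server-network` is prefixed by the project directory name (`dv_tutorial`) to give the `dv_tutorial_db-server-network` used below; image versions and variable names are assumptions, with the actual values coming from the `.env` file mentioned above.

```yaml
# docker-compose.yml sketch -- service names, images, and variables are assumptions
services:
  sqlNode1:
    image: postgres:14
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
    networks:
      - db-server-network
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: ${PGADMIN_DEFAULT_EMAIL}
      PGADMIN_DEFAULT_PASSWORD: ${PGADMIN_DEFAULT_PASSWORD}
    ports:
      - "8080:80"
    networks:
      - db-server-network
networks:
  db-server-network:
```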
In another shell, start the Python dev environment container (Python 3.9 on Debian 9 with pandas and dbt-postgres installed):
docker run -it --network=dv_tutorial_db-server-network --mount src="$(pwd)",target=/app,type=bind dv_tutorial_dbt bash
A command line inside the dev container now opens. In this container, run:
dbt init
You will be asked to provide some information:
- name: dv_tutorial
- choose 1 -> postgres
- as shown below (or save the content to /root/.dbt/profiles.yml in the container):
dv_tutorial:
  outputs:
    dev:
      dbname: dv_test
      host: sqlNode1
      pass: abcd
      port: 5432
      schema: dbo
      threads: 1
      type: postgres
      user: dev
  target: dev
Let us now check the database connection and setup:
dbt debug
This should show a working setup and database connection.
Now copy the data files:
- Products.csv
- SalesOrderItems.csv
from Kaggle (Link) and save them into the seed directory.
Next, to load the raw data into the database, run:
dbt seed
This takes a moment and will insert the data into the database.
Note: Use raw-stage SQL models to generate a combined table from multiple sources. This lets you pull the data into one table first and then use the staging SQL statements to build a staging table on top of it.
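As an illustration, such a raw-stage model could combine the two seeds into one table. This is only a sketch: the model name, join, and column names are assumptions and will depend on the actual Kaggle data.

```sql
-- models/raw/raw_salesorderitems.sql (hypothetical name)
-- Pull several seed sources into one combined raw table that the
-- staging SQL statements can then build on.
{{ config(materialized='table') }}

select
    soi.*,
    p.PRODUCTCATEGORY  -- assumed column; bring product info alongside the items
from {{ ref('SalesOrderItems') }} as soi
left join {{ ref('Products') }} as p
    on soi.PRODUCTID = p.PRODUCTID
```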
In the image below we can see that the raw tables are loaded into the database.
dbt allows you to run the stage scripts separately, so you can trace step by step what you are doing. If you want to run everything at once: dbt run.
Now declare the dependencies in packages.yml and install them with dbt deps:
packages:
  - package: Datavault-UK/automate_dv
    version: 0.10.1
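With AutomateDV installed, a staging model such as stg_product typically wraps the automate_dv.stage macro. The following is a sketch using AutomateDV's documented metadata pattern; the source model, column names, and record-source literal are assumptions:

```sql
-- models/stage/stg_product.sql (sketch; metadata values are assumptions)
{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: raw_product
derived_columns:
  RECORD_SOURCE: "!KAGGLE_PRODUCTS"
  LOAD_DATETIME: CURRENT_TIMESTAMP
hashed_columns:
  PRODUCT_HK: PRODUCTID
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ automate_dv.stage(include_source_columns=true,
                     source_model=metadata_dict['source_model'],
                     derived_columns=metadata_dict['derived_columns'],
                     hashed_columns=metadata_dict['hashed_columns'],
                     ranked_columns=none) }}
```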
# dbt run -s name_of_component
dbt run -s stg_product --full-refresh # a + in front of stg_product also runs its upstream (raw) models, if present
dbt run -s stg_salesorderitems --full-refresh
dbt run -s hub_product --full-refresh
dbt run -s hub_salesorderitems --full-refresh
dbt run -s lnk_product_salesorderitem --full-refresh
dbt run -s sat_product_details --full-refresh
dbt run -s sat_salesorder_details --full-refresh
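The hub, link, and satellite models run above each wrap the corresponding AutomateDV macro. As one example, a sketch of hub_product.sql (the column names are assumptions that would have to match the staging model):

```sql
-- models/vault/hub_product.sql (sketch; column names are assumptions)
{{ config(materialized='incremental') }}

{{ automate_dv.hub(src_pk='PRODUCT_HK',
                   src_nk='PRODUCTID',
                   src_ldts='LOAD_DATETIME',
                   src_source='RECORD_SOURCE',
                   source_model='stg_product') }}
```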
As a test I tried to aggregate some simple data using a Python model (models/aggregate/mean_weight_product_cat.py). For this to work I needed the third-party library dbt-fal[postgres].
NOTE: Install dbt-fal[postgres] first and then dbt-postgres; otherwise you will run into version issues.
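Inside a dbt Python model the aggregation operates on a DataFrame, so the core of it can be sketched standalone with pandas. The column names below are assumptions, not the actual Kaggle schema:

```python
import pandas as pd

# Hypothetical subset of the product seed data; real column names may differ.
products = pd.DataFrame({
    "PRODUCTCATEGORY": ["Bikes", "Bikes", "Helmets"],
    "WEIGHTMEASURE": [10.0, 14.0, 0.5],
})

# Mean weight per product category -- the same kind of aggregation the
# mean_weight_product_cat.py model would express on the dbt-provided DataFrame.
mean_weight = (
    products.groupby("PRODUCTCATEGORY", as_index=False)["WEIGHTMEASURE"].mean()
)
print(mean_weight)
```

In the dbt-fal model itself, the DataFrame would come from the upstream relation (via the model's dbt context) instead of being constructed inline.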
dbt can autogenerate docs:
dbt docs generate
dbt docs serve