To automate the standard data engineering process at MECCA by auto-populating DDL, DAG and documentation files in order to:
- mitigate copy-paste and syntax errors;
- save time rewriting code or hunting for code to copy;
- follow best practices for naming conventions, code formatting and layout.
Requirements:
- Windows machine
- bash (e.g. git-bash)
- Python 3
- pywin32
- beautifulsoup4
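The Python packages can be installed with pip, for example:

```sh
python -m pip install pywin32 beautifulsoup4
```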
Usage:
1. Create a Confluence/Word document using one of the provided templates:
   - New Table Template
   - Import Template
2. In Confluence, click the three dots in the top right of the page and choose Export > Export to Word.
   - Ensure it is saved as a .doc file; a .docx can be converted to .doc by opening it in Word and using Save As.
   - Save the document in the directory from which you will run DEP.
3. Open a console and navigate to the directory of the input file you have exported.
4. Run `dep.py` and press the enter key. It will use the first file found with extension .doc in the current directory.
   - A directory called dep-output will be created with the output files inside.
   - A file called vars.json will also have been created, containing the values used to create the output files. This can be edited and used as the input JSON file: `dep.py --json vars.json`
5. For help: `dep.py -h`
Output:
- dep-output directory
- vars.json file – a JSON file containing the values used to create the files; this can be altered and also used as an input to DEP
- DDL SQL file – SQL table and view definitions
- DAG Python file – Airflow scheduled ETL pipeline
- Unit test Markdown file – table definition and row unit tests
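For example, after a run with the standalone `input.json` shown later on this page, the output directory would contain something like the following (file names taken from that example):

```
dep-output/
├── GID-3069_sys_new_ddl.sql
├── import_sys_new.py
└── Unit-Test-GID-3069_sys_new.md
```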
DEP can get information from either `*.doc`, `input.json`, or both. By default, DEP looks for these files in the current directory. If both files exist, the `*.doc` is read first, followed by `input.json`. Thus, by design, variables found in the Confluence doc can be overwritten by declarations in the JSON file. Either of the files may be absent, although a table structure must be defined in at least one of them. Placeholder values are given to variables not found in the input files and are denoted by 'xxx' for easy search-and-replace.
The input file paths can be specified by command line arguments: `dep.py --doc /path/to/doc --json /path/to/json`
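A minimal sketch of this precedence (illustrative names, not DEP's internals):

```python
# Hypothetical sketch of the input precedence described above:
# values read from the .doc first, then the JSON file on top,
# with 'xxx' placeholders for anything still missing.
def resolve_vars(doc_vars: dict, json_vars: dict, keys: list[str]) -> dict:
    merged = {**doc_vars, **json_vars}              # JSON overrides the doc
    return {k: merged.get(k, "xxx") for k in keys}  # placeholder if absent

print(resolve_vars(
    {"schedule": "Sunday 10PM"},   # from the Confluence .doc
    {"schedule": "Monday 2PM"},    # from input.json (wins)
    ["schedule", "subject_area"],
))
# {'schedule': 'Monday 2PM', 'subject_area': 'xxx'}
```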
Confluence pages can be exported to .doc format and placed in the working directory to be read by DEP. The first .doc file found in the directory will be used. Alternatively, the file can be specified with command line arguments. DEP looks for specific headers, and the section contents under those headers, to identify specific information. The headings can be in any order and are read in a case-insensitive manner. The following headers are read from the document:
Action: New Table
| Heading Name | Heading type | Information extracted | Section contents | Used in |
|---|---|---|---|---|
| {schema}.{table} | Confluence title | Schema and table name | NA | Everywhere |
| Overview | h1/h2 heading | Table description; subject area | Overview of new table. Section should begin with a concise one-sentence description and a problem statement following the format "Working in (.*) I want to" | DDL table comment |
| Status | h1/h2 heading | Jira ID | Link to Jira issue – most recent at the top | DDL file name; unit test file name; unit test title |
| Granularity | h1/h2 heading | Primary keys | List of primary keys – no text underneath | DDL table definition; DAG ON CONFLICT |
| Indexes | h1/h2 heading | Indexes | List of indexes only – no text underneath | DDL table definition |
| Update Method | h1/h2 heading | Method (to be initialised); schedule | Method: {Incremental/Insert}; Schedule: {e.g. Sunday 10PM} | Insert type in DAG (method); DDL table comment (schedule) |
| DFD | h1/h2 heading | Whether to use a staging table | DFD and its description – include the text "staging" to include a staging table | Staging table in DAG flow |
| Table Structure | h1/h2 heading | Column names; data types; nullable; descriptions; source tables/columns | Table with columns for: Column Name, Data Type, Nullable, Description, Source/Column | Table DDL definition; DAG SQL (source tables/columns) |
| View Structure | h1/h2 heading | View DDL definition | Table with columns for: Column Name, Description, Source/Column | View DDL definition |
| Materialized View Structure | h1/h2 heading | Materialized view DDL definition | Table with columns for: Column Name, Description, Source/Column | Mat. view DDL definition |
| Created By | h1/h2 heading | Engineer name; engineer email | Firstname Lastname (email) | DDL table comment (name); DAG email (email) |
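As an illustration of how the Overview problem statement can yield the subject area, here is a minimal sketch of that extraction (the variable names and sample text are made up; this is not DEP's internal code):

```python
import re

# Hypothetical sketch: pull the subject area out of an Overview
# problem statement of the form "Working in <subject area> I want to ...".
overview = "Working in Loss Prevention I want to track tester orders."

match = re.search(r"Working in (.*) I want to", overview)
subject_area = match.group(1) if match else "xxx"  # 'xxx' placeholder when absent
print(subject_area)  # Loss Prevention
```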
Action: Import
| Heading Name | Heading type | Information extracted | Section contents | Used in |
|---|---|---|---|---|
| sys_{schema} | Confluence title | Schema name; action to perform = import when the title starts with sys_ | NA | Everywhere |
| Overview | h1/h2 heading | Schema description; subject area | Overview of schema. Section should begin with a concise one-sentence description and a problem statement following the format "Working in (.*) I want to" | DDL schema comment (schema description); DDL table comments (subject area) |
| Status | h1/h2 heading | Jira ID | Link to Jira issue – most recent at the top | DDL file name; unit test file name; unit test title |
| Import Method | h1/h2 heading | Schedule; import method for each table: incremental, or full load with soft delete | Schedule: {e.g. Sunday 10PM}; table with columns: Table Name, Import Method (incremental or full) | DDL table comment (schedule); DAG SQL |
| {sys_schema}.{table_name} | h1/h2 heading | Each table to be imported and its definition | One per table to be imported. Subsections: Description – first sentence should be a concise description; Granularity – list of primary keys; Indexes – list of indexes; Foreign Keys – list of foreign keys; Table Structure – table with columns: Column Name, Data Type, Nullable, Description, Source/Column | DDL table definition; DAG SQL |
| Created By | h1/h2 heading | Engineer name; engineer email | Firstname Lastname (email) | DDL table comment (name); DAG email (email) |
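A rough sketch of how the page title drives the action, per the two tables above (the action labels and example titles are illustrative only, not DEP's actual parsing):

```python
# Hypothetical sketch: derive the action and schema name from the
# Confluence page title.
def parse_title(title: str) -> tuple[str, str]:
    if title.startswith("sys_"):
        return "import", title                 # sys_{schema} -> import
    schema, _, _table = title.partition(".")   # {schema}.{table} -> new table
    return "new table", schema

print(parse_title("sys_new"))           # ('import', 'sys_new')
print(parse_title("analytics.orders"))  # ('new table', 'analytics')
```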
Default variables and those from Confluence can be overwritten in a JSON input file. By default, DEP looks for `input.json` in the working directory, although the file to use can be specified with command line arguments.
In the following example, the four variables will be given the values below, whether or not they are found in a Confluence document:
Supplementary `input.json`:
```json
{
"user": "zoeseeger",
"user_email": "[email protected]",
"subject_area": "Loss prevention",
"schedule": "Monday 2PM"
}
```
In this way, JSON input files can be used to supplement or override the information found in the Confluence document.
Alternatively, when DEP is run, a `vars.json` file is created which holds the runtime variables used to create the output files. This file can then be edited and used as an input file with updated values.
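For instance, a minimal sketch of that round trip (the key name is taken from the examples on this page; the editing workflow itself is just one way to do it):

```python
import json

# Hypothetical round trip: load the generated vars.json, tweak a value,
# and save it for re-use as DEP input (dep.py --json vars.json).
with open("vars.json") as fh:
    runtime_vars = json.load(fh)

runtime_vars["schedule"] = "Monday 2PM"  # key shown in the examples here

with open("vars.json", "w") as fh:
    json.dump(runtime_vars, fh, indent=2)
```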
Standalone `input.json`:
```json
{
"jira_id": "GID-3069",
"action": "import",
"output_dir": "dep-output",
"user": "Zoe Seeger",
"user_email": "[email protected]",
"description": "This is the schema for the Tester Order System (TOS) which resides in mecca_oms (Host:db005wp.mecca.com.au) in the dbo schema",
"subject_area": "Supply Chain",
"schedule": "Sunday 10PM",
"staging_table": false,
"ddl_file": "GID-3069_sys_new_ddl.sql",
"dag_file": "import_sys_new.py",
"doc_file": "Unit-Test-GID-3069_sys_new.md",
"table_structures": [
{
"schema": "sys_new",
"table_name": "orderitemadjustmentreasons",
"schema_table": "sys_new.orderitemadjustmentreasons",
"import_method": "full",
"table_description": "Order adjustment reasons for orderitems",
"columns": [
"order_item_adjustment_reason_id",
"reason",
"record_created",
"record_updated",
"record_deleted"
],
"types": [
"int4",
"text",
"timestamptz",
"timestamptz",
"timestamptz"
],
"nullables": [
"NOT NULL",
"NOT NULL",
"NOT NULL",
"NOT NULL",
"NULL"
],
"descriptions": [
"",
"",
"",
"",
""
],
"source_columns": [
"orderitemadjustmentreasonid",
"",
"",
"",
""
],
"source_alias": [
"o",
"o",
"",
"",
""
],
"sources": {
"dbo.orderitemadjustmentreasons": {
"alias": "o",
"source_table": "dbo.orderitemadjustmentreasons"
}
},
"indexes": [],
"primary_keys": [
"order_item_adjustment_reason_id"
],
"foreign_keys": []
},
{
"schema": "sys_new",
"table_name": "orderitems",
"schema_table": "sys_new.orderitems",
"import_method": "incremental",
"table_description": "Tester order items",
"columns": [
"order_item_id",
"order_id",
"item_code",
"description",
"quantity",
"unit_price",
"adjusted_quantity",
"order_item_adjustment_reason_id",
"record_created",
"record_updated"
],
"types": [
"uuid",
"uuid",
"text",
"text",
"int4",
"numeric",
"int4",
"int4",
"timestamptz",
"timestamptz"
],
"nullables": [
"NOT NULL",
"NOT NULL",
"NOT NULL",
"NULL",
"NOT NULL",
"NOT NULL",
"NULL",
"NULL",
"NOT NULL",
"NOT NULL"
],
"descriptions": [
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"source_columns": [
"orderitemid",
"orderid",
"itemcode",
"",
"",
"unitprice",
"adjustedquantity",
"orderitemadjustmentreasonid",
"",
""
],
"source_alias": [
"o",
"o",
"o",
"o",
"o",
"o",
"o",
"o",
"",
""
],
"sources": {
"dbo.orderitems": {
"alias": "o",
"source_table": "dbo.orderitems"
}
},
"indexes": [
"order_id",
"item_code",
"record_updated"
],
"primary_keys": [
"order_item_id"
],
"foreign_keys": [
"order_item_adjustment_reason_id"
]
}
],
"view_structures": [],
"debug": true
}
```
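To make the mapping from `table_structures` to DDL concrete, here is a rough sketch of how one entry could be rendered as a CREATE TABLE statement (illustrative only; DEP's actual generator and its output formatting may differ):

```python
# Hypothetical sketch: render one table_structures entry (as in the JSON
# above) into a CREATE TABLE statement. Not DEP's actual code.
def render_ddl(ts: dict) -> str:
    cols = [
        f"    {name} {dtype} {nullable}"
        for name, dtype, nullable in zip(ts["columns"], ts["types"], ts["nullables"])
    ]
    if ts["primary_keys"]:
        cols.append(f"    PRIMARY KEY ({', '.join(ts['primary_keys'])})")
    return f"CREATE TABLE {ts['schema_table']} (\n" + ",\n".join(cols) + "\n);"

# With the first entry above, this prints roughly:
# CREATE TABLE sys_new.orderitemadjustmentreasons (
#     order_item_adjustment_reason_id int4 NOT NULL,
#     reason text NOT NULL,
#     record_created timestamptz NOT NULL,
#     record_updated timestamptz NOT NULL,
#     record_deleted timestamptz NULL,
#     PRIMARY KEY (order_item_adjustment_reason_id)
# );
```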