
Data Engineering Processor (DEP)

Purpose

To automate the standard data engineering process at MECCA by auto-populating DDL, DAG and documentation files in order to:

  • mitigate copy-paste and syntax errors;
  • save time otherwise spent rewriting code or hunting for code to copy;
  • follow best practices for naming conventions, code formatting and layout.

Requirements

  • Windows machine
  • bash (e.g. git-bash)
  • Python3
  • pywin32
  • beautifulsoup4
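
The two Python dependencies can typically be installed with pip from a console on the Windows machine:

    pip install pywin32 beautifulsoup4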

Quick Start

  1. Create a Confluence/Word document using one of the provided templates:

    • New Table Template
    • Import Template
  2. In Confluence, click the three dots in the top right of the page and choose Export > Export to Word

    • Ensure it is saved as a .doc file – a .docx can be converted to .doc by opening it in Word and using Save As
    • Save the document in the directory from which you will run DEP
  3. Open a console and navigate to the directory of the exported input file

  4. Run dep.py (see the example session below)

    • It will use the first file with the extension .doc found in the current directory
    • A directory called dep-output will be created with the output files inside
    • A file called vars.json will also be created, holding the values used to generate the output files. It can be edited and passed back in as the input JSON file: dep.py --json vars.json
  5. For help: dep.py -h
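
A typical session, assuming a hypothetical export named New-Table-GID-1234.doc sits in the current directory:

    cd /path/to/exports        # directory containing the exported .doc
    dep.py                     # reads New-Table-GID-1234.doc, writes dep-output/ and vars.json
    dep.py --json vars.json    # re-run after editing the generated values
    dep.py -h                  # full list of options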


Output

  • dep-output directory
  • vars.json file – a JSON file containing the values used to create the output files; it can be edited and fed back in as an input to DEP
  • DDL SQL file – SQL table and view definitions
  • DAG Python file – an Airflow-scheduled ETL pipeline
  • Unit test Markdown file – table definition and row unit tests
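
For the import example shown later in this README, the output might look something like the sketch below (file names taken from that example; the exact layout may differ):

    dep-output/
        GID-3069_sys_new_ddl.sql         # DDL
        import_sys_new.py                # Airflow DAG
        Unit-Test-GID-3069_sys_new.md    # unit tests
    vars.json                            # runtime variables from this run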

Input

DEP can read information from a *.doc file, an input.json file, or both. By default, DEP looks for these files in the current directory. If both exist, the *.doc is read first, followed by input.json; by design, variables found in the Confluence doc can therefore be overwritten by declaring them in the JSON file. Either file may be absent, although a table structure must be defined in at least one of them. Variables not found in the input files are given the placeholder value 'xxx' for easy search-and-replace.

The input file paths can be specified by command line arguments: dep.py --doc /path/to/doc --json /path/to/json
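
A minimal sketch of this resolution order (hypothetical names, not DEP's actual code; doc_vars stands in for the values parsed from the .doc export):

    import json

    def resolve_vars(doc_vars, json_path=None):
        variables = {"user": "xxx", "user_email": "xxx", "schedule": "xxx"}  # 'xxx' placeholders
        variables.update(doc_vars)                  # values found in the Confluence export
        if json_path:
            with open(json_path) as f:
                variables.update(json.load(f))      # JSON wins on any key defined in both
        return variables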

Confluence Input

Confluence pages can be exported to .doc format and placed in the working directory to be read by DEP. The first .doc file found in the directory will be used; alternatively, the file can be specified with command line arguments. DEP looks for specific headings and reads the section contents beneath them to extract information. The headings can appear in any order and are matched case-insensitively. The following headings are read from the document:

Action: New Table

| Heading Name | Heading type | Information extracted | Section contents | Used In |
| --- | --- | --- | --- | --- |
| {schema}.{table} | Confluence title | Schema and table name | NA | Everywhere |
| Overview | h1/h2 heading | Table description; subject area | Overview of the new table. Section should begin with a concise one-sentence description, followed by a problem statement in the format "Working in (.*) I want to" | DDL table comment |
| Status | h1/h2 heading | Jira ID | Link to the Jira issue – most recent at the top | DDL file name; unit test file name; unit test title |
| Granularity | h1/h2 heading | Primary keys | List of primary keys – no text underneath | DDL table definition; DAG ON CONFLICT |
| Indexes | h1/h2 heading | Indexes | List of indexes only – no text underneath | DDL table definition |
| Update Method | h1/h2 heading | Method (to be initialised); schedule | Method: {Incremental/Insert}; Schedule: {e.g. Sunday 10PM} | Insert type in DAG (method); DDL table comment (schedule) |
| DFD | h1/h2 heading | Whether to use a staging table | DFD and its description – include the text "staging" to include a staging table | Staging table in DAG flow |
| Table structure | h1/h2 heading | Column names; data types; nullability; descriptions; source tables/columns | Table with columns for: Column Name, Data Type, Nullable, Description, Source/Column | Table DDL definition; DAG SQL (source tables/columns) |
| View Structure | h1/h2 heading | View DDL definition | Table with columns for: Column Name, Description, Source/Column | View DDL definition |
| Materialized View Structure | h1/h2 heading | Materialized view DDL definition | Table with columns for: Column Name, Description, Source/Column | Mat. view DDL definition |
| Created By | h1/h2 heading | Engineer name; engineer email | Firstname Lastname (email) | DDL table comment (name); DAG email (email) |
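
For instance, the problem-statement format in the Overview section could be matched with a regular expression like the one below (an illustration only; the overview text is made up, and "Loss Prevention" echoes the subject area used in the JSON example later in this README):

    import re

    overview = "Working in Loss Prevention I want to track tester order adjustments."
    match = re.search(r"Working in (.*) I want to", overview)
    if match:
        subject_area = match.group(1)   # "Loss Prevention"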

Action: Import

| Heading Name | Heading type | Information extracted | Section contents | Used In |
| --- | --- | --- | --- | --- |
| sys_{schema} | Confluence title | Schema name; action to perform = import when the title starts with sys_ | NA | Everywhere |
| Overview | h1/h2 heading | Schema description; subject area | Overview of the schema. Section should begin with a concise one-sentence description, followed by a problem statement in the format "Working in (.*) I want to" | DDL schema comment (schema description); DDL table comments (subject area) |
| Status | h1/h2 heading | Jira ID | Link to the Jira issue – most recent at the top | DDL file name; unit test file name; unit test title |
| Import Method | h1/h2 heading | Schedule; import method for each table (incremental, or full load with soft delete) | Schedule: {e.g. Sunday 10PM}; table with columns: Table Name, Import Method (incremental or full) | DDL table comment (schedule); DAG SQL |
| {sys_schema}.{table_name} | h1/h2 heading | Each table to be imported and its definition | One section per table to be imported, with subsections: Description (first sentence should be a concise description); Granularity (list of primary keys); Indexes (list of indexes); Foreign Keys (list of foreign keys); Table Structure (table with columns: Column Name, Data Type, Nullable, Description, Source/Column) | DDL table definition; DAG SQL |
| Created By | h1/h2 heading | Engineer name; engineer email | Firstname Lastname (email) | DDL table comment (name); DAG email (email) |
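
The sys_ title convention that switches DEP into import mode could be expressed as simply as the following (a sketch, not DEP's actual code; the title and the non-import action name are made up):

    title = "sys_tos"   # hypothetical Confluence page title
    action = "import" if title.lower().startswith("sys_") else "new table"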

JSON Input

Default variables and those read from Confluence can be overwritten in a JSON input file. By default, DEP looks for input.json in the working directory, although the file to use can be specified with command line arguments.

In the following example, the four variables will be set to the values below, whether or not they were found in a Confluence document:

Supplementary input.json:

{
    "user": "zoeseeger",
    "user_email": "[email protected]",
    "subject_area": "Loss prevention",
    "schedule": "Monday 2PM"
}

In this way, a JSON input file can be used to supplement or override the information found in the Confluence document.

Alternatively, every DEP run creates a vars.json file holding the runtime variables used to create the output files. This file can then be edited and used as an input file with updated values.

Standalone input.json:

{
    "jira_id": "GID-3069",
    "action": "import",
    "output_dir": "dep-output",
    "user": "Zoe Seeger",
    "user_email": "[email protected]",
    "description": "This is the schema for the Tester Order System (TOS) which resides in mecca_oms (Host:db005wp.mecca.com.au) in the dbo schema",
    "subject_area": "Supply Chain",
    "schedule": "Sunday 10PM",
    "staging_table": false,
    "ddl_file": "GID-3069_sys_new_ddl.sql",
    "dag_file": "import_sys_new.py",
    "doc_file": "Unit-Test-GID-3069_sys_new.md",
    "table_structures": [
        {
            "schema": "sys_new",
            "table_name": "orderitemadjustmentreasons",
            "schema_table": "sys_new.orderitemadjustmentreasons",
            "import_method": "full",
            "table_description": "Order adjustment reasons for orderitems",
            "columns": [
                "order_item_adjustment_reason_id",
                "reason",
                "record_created",
                "record_updated",
                "record_deleted"
            ],
            "types": [
                "int4",
                "text",
                "timestamptz",
                "timestamptz",
                "timestamptz"
            ],
            "nullables": [
                "NOT NULL",
                "NOT NULL",
                "NOT NULL",
                "NOT NULL",
                "NULL"
            ],
            "descriptions": [
                "",
                "",
                "",
                "",
                ""
            ],
            "source_columns": [
                "orderitemadjustmentreasonid",
                "",
                "",
                "",
                ""
            ],
            "source_alias": [
                "o",
                "o",
                "",
                "",
                ""
            ],
            "sources": {
                "dbo.orderitemadjustmentreasons": {
                    "alias": "o",
                    "source_table": "dbo.orderitemadjustmentreasons"
                }
            },
            "indexes": [],
            "primary_keys": [
                "order_item_adjustment_reason_id"
            ],
            "foreign_keys": []
        },
        {
            "schema": "sys_new",
            "table_name": "orderitems",
            "schema_table": "sys_new.orderitems",
            "import_method": "incremental",
            "table_description": "Tester order items",
            "columns": [
                "order_item_id",
                "order_id",
                "item_code",
                "description",
                "quantity",
                "unit_price",
                "adjusted_quantity",
                "order_item_adjustment_reason_id",
                "record_created",
                "record_updated"
            ],
            "types": [
                "uuid",
                "uuid",
                "text",
                "text",
                "int4",
                "numeric",
                "int4",
                "int4",
                "timestamptz",
                "timestamptz"
            ],
            "nullables": [
                "NOT NULL",
                "NOT NULL",
                "NOT NULL",
                "NULL",
                "NOT NULL",
                "NOT NULL",
                "NULL",
                "NULL",
                "NOT NULL",
                "NOT NULL"
            ],
            "descriptions": [
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                ""
            ],
            "source_columns": [
                "orderitemid",
                "orderid",
                "itemcode",
                "",
                "",
                "unitprice",
                "adjustedquantity",
                "orderitemadjustmentreasonid",
                "",
                ""
            ],
            "source_alias": [
                "o",
                "o",
                "o",
                "o",
                "o",
                "o",
                "o",
                "o",
                "",
                ""
            ],
            "sources": {
                "dbo.orderitems": {
                    "alias": "o",
                    "source_table": "dbo.orderitems"
                }
            },
            "indexes": [
                "order_id",
                "item_code",
                "record_updated"
            ],
            "primary_keys": [
                "order_item_id"
            ],
            "foreign_keys": [
                "order_item_adjustment_reason_id"
            ]
        }
    ],
    "view_structures": [],
    "debug": true
}
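
A standalone file like this can be passed to DEP directly, with no .doc present:

    dep.py --json input.json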
