Building ETL Pipelines with Python=Brij Kishore;Note=Erxin
# Preface
- The Extract, Transform, and Load (ETL) process: ETL workflows, also known as ETL pipelines, enable data engineers to create customized, strategic solutions while also giving developers flexible deployment environments.
- Languages such as SQL, Python, R, and Spark are among the most popular choices for developing custom data solutions.
- tools
“iTerm2 brings the Terminal into the modern age with features you never knew you always wanted.” Take some time to install and customize your new iTerm2 terminal.
Connect the PyCharm app to GitHub by following these steps: https://www.jetbrains.com/help/pycharm/github.html.
pipenv takes care of updating the Pipfile and Pipfile.lock files with each package installation.
- Module management systems (MMSs) are like special folders that only work in certain environments. They do this by changing sys.prefix and sys.exec_prefix so that they point to the base directory of the virtual environment. This is helpful because it lets developers create “clean” applications and also makes sure that all the different parts of the project work well together.
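A minimal sketch of how to see this in practice: the standard-library venv module creates such an isolated environment, and inside it sys.prefix points at the environment's base directory while sys.base_prefix still points at the system installation (the script name is illustrative, not from the book):
```python
# check_env.py -- run this inside and outside a virtual environment to compare
import sys

# sys.prefix points to the active environment's base directory;
# sys.base_prefix always points to the original Python installation.
in_virtualenv = sys.prefix != sys.base_prefix

print("sys.prefix      :", sys.prefix)
print("sys.base_prefix :", sys.base_prefix)
print("inside a virtual environment?", in_virtualenv)
```
Creating and activating the environment itself is just `python -m venv .venv` followed by `source .venv/bin/activate` (or using pipenv, as above).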
ETL and ELT are data processing approaches that extract data from a source, perform transformations on it, and load it into a final storage location. ETL is commonly used when the source and target systems are different and the data needs to be transformed in a specific way.
ELT is more commonly used when the target system is powerful enough to handle the transformation step itself.
- In ETL design, engineers added a Persistent Staging Area (PSA) layer to the pipeline to preserve segments of data source conversions during the data migration.
- Volatile staging area (VSA) layer: the ETL-VP is the “glow-up” (enhanced) version of the ETL-P pipeline design pattern.
- packages
pandas
numpy
- Packages for scaling to big data
Dask is a Python library that allows processing tasks to be parallelized in a flexible and dynamic way. It can scale work across local cores or out to “clusters” of workers, including extra processing units in the cloud on standby (see the sketch after this list).
Numba is another powerhouse module specifically designed to enhance numerical processes in Python.
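As a rough illustration of Dask's model, the snippet below builds a lazy, partitioned DataFrame and only executes the parallel computation when .compute() is called (the file pattern and column names are made up for illustration):
```python
import dask.dataframe as dd

# Lazily read a set of CSV files as one partitioned DataFrame;
# nothing is pulled into memory yet.
df = dd.read_csv("sales-2023-*.csv")  # hypothetical file pattern

# Build a computation graph: filter, group, and aggregate.
revenue_by_region = (
    df[df["amount"] > 0]
    .groupby("region")["amount"]
    .sum()
)

# .compute() triggers parallel execution across local cores
# (or across a cluster of workers, if one is configured).
print(revenue_by_region.compute())
```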
# Sourcing insightful data and data extraction strategies
- Data source formats commonly used in industry, as follows (see the extraction sketch after this list):
CSV/Excel files
Parquet files
APIs
RDBMS databases
HTML
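A hedged sketch of extracting from several of these source types with pandas, requests, and SQLAlchemy (every path, URL, connection string, and table name below is a placeholder):
```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# CSV / Excel files
csv_df = pd.read_csv("customers.csv")
xlsx_df = pd.read_excel("customers.xlsx")        # needs openpyxl installed

# Parquet files (needs pyarrow or fastparquet installed)
parquet_df = pd.read_parquet("events.parquet")

# APIs: fetch JSON and flatten it into a DataFrame
response = requests.get("https://api.example.com/orders", timeout=30)
api_df = pd.json_normalize(response.json())

# RDBMS databases via SQLAlchemy
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")
db_df = pd.read_sql("SELECT * FROM orders", engine)

# HTML: read_html returns a list of every table found on the page
html_tables = pd.read_html("https://example.com/report.html")
```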
- Creating data pipeline activity in Python
- Open the pgAdmin application on your local device and enter the password you created during the PostgreSQL installation.
- Building an end-to-end ETL pipeline in Python
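A minimal end-to-end sketch in that spirit: extract from a CSV file, transform with pandas, and load into the local PostgreSQL instance managed through pgAdmin (file names, column names, the table name, and credentials are assumptions, not the book's code):
```python
import pandas as pd
from sqlalchemy import create_engine


def extract(path: str) -> pd.DataFrame:
    """Extract: read the raw source file."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: basic cleaning plus a derived column."""
    df = df.dropna(subset=["order_id"])               # drop incomplete rows
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["total"] = df["quantity"] * df["unit_price"]
    return df


def load(df: pd.DataFrame, table: str, conn_str: str) -> None:
    """Load: write the cleaned data into PostgreSQL."""
    engine = create_engine(conn_str)
    df.to_sql(table, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    raw = extract("orders.csv")
    clean = transform(raw)
    load(clean, "orders_clean",
         "postgresql+psycopg2://postgres:your_password@localhost:5432/etl_demo")
```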
- Powerful ETL libraries and tools in Python:
Bonobo (https://www.bonobo-project.org/) is a Python-based Extract, Transform, Load (ETL) framework that uses a simple and rather elegant approach to pipeline construction.
Odo (http://odo.pydata.org/en/latest/index.html) is a Python library built on top of the Blaze and Dask libraries; it provides a uniform API for data migration between different formats, storage systems, and DataFrames.
Mito ETL (mETL) (https://pypi.org/project/mETL/0.1.7.0dev/) is a Python library that provides a simple, flexible, plug-and-play framework for manipulating structured source data to produce the specified structured sink data within the ETL workflow.
Riko (https://github.com/nerevu/riko) is a Python library specifically catered to dealing with streaming data, which matters when a pipeline has to handle real-time streaming source data.
pETL (https://petl.readthedocs.io/en/stable/) is a convenient Python library for building ETLs with various data source types (and data source quality). Using a marginal amount of your local system's memory, pETL focuses on scalability rather than processing speed (see the sketch after this list).
Luigi (https://luigi.readthedocs.io/en/stable/index.html) is an open source Python package for building complex data pipelines of batch jobs.
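For a flavor of the pETL style mentioned above, a small extract-transform-load flow might look like this (file and field names are invented for illustration):
```python
import petl as etl

# Extract: lazily wrap a CSV source (rows are only read when needed)
table = etl.fromcsv("products.csv")

# Transform: convert types, filter rows, keep selected columns
converted = etl.convert(table, "price", float)
in_stock = etl.select(converted, lambda row: row["price"] > 0)
trimmed = etl.cut(in_stock, "sku", "name", "price")

# Load: write the result to a CSV sink
etl.tocsv(trimmed, "products_clean.csv")
```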
- Apache Airflow (https://airflow.apache.org/) is an open source platform that allows users to programmatically author, schedule, and monitor workflows.
$ pip install apache-airflow
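After installing, a bare-bones DAG definition might look like the following sketch (Airflow 2.x style; task logic and names are placeholders, and the scheduling argument is called schedule rather than schedule_interval in newer releases):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting source data...")      # placeholder task logic


def transform():
    print("transforming data...")


def load():
    print("loading data into the target...")


with DAG(
    dag_id="simple_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",              # `schedule` in newer Airflow versions
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```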
- Creating ETL pipelines in AWS
Using Boto3 and Amazon S3 to read data (see the sketch after this list)
CI/CD
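A hedged sketch of reading a CSV object from S3 with Boto3 and pandas (the bucket and key are placeholders; credentials are assumed to come from the usual AWS environment, shared config, or IAM role):
```python
import io

import boto3
import pandas as pd

# boto3 resolves credentials from the environment/config/IAM role;
# nothing is hard-coded here.
s3 = boto3.client("s3")

# Download the object and parse it straight into a DataFrame.
response = s3.get_object(Bucket="my-etl-bucket", Key="raw/orders.csv")
df = pd.read_csv(io.BytesIO(response["Body"].read()))

print(df.head())
```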
# Auto-scaling ETL pipelines
- Vertical scaling can be accomplished using physical hardware or cloud resources.
- Horizontal scaling distributes the workload across multiple machines.
- AWS Elastic Container Service (ECS) is designed to handle large data volumes.
- Define a Directed Acyclic Graph (DAG) for your ETL tasks, schedule them to run during off-peak hours, and set up notifications for any task failures.
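Building on the earlier DAG sketch, off-peak scheduling and failure notifications can be expressed roughly like this (the cron expression, alert address, and retry settings are assumptions, and email alerts require SMTP to be configured in Airflow):
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    # Alert this address when any task in the DAG fails.
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",   # 02:00 daily, an assumed off-peak window
    default_args=default_args,
    catchup=False,
) as dag:
    run_etl = PythonOperator(
        task_id="run_etl",
        python_callable=lambda: print("running the ETL job..."),  # placeholder
    )
```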
- Testing strategies
Unit testing, e.g., with pytest, verifies individual pipeline components in isolation (see the pytest sketch after this list).
Integration testing checks the interconnections between the different components of the pipeline.
End-to-end testing ensures the entire pipeline as a whole works as intended.
Performance testing gauges the scalability and speed of your pipeline.
Resilience testing is all about ensuring that your system can handle and recover from errors.
Validation tests make sure that the output of the pipeline is as expected, given some input.
Integrate your CI/CD tool with deployment automation tools such as Kubernetes to orchestrate the deployment process.
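A minimal pytest-style unit test for a transform step might look like this (the transform function itself is illustrative, not taken from the book):
```python
import pandas as pd
import pytest


def add_total_column(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: derive a total from quantity and unit price."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out


def test_add_total_column():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
    result = add_total_column(df)
    assert list(result["total"]) == [10.0, 4.5]


def test_add_total_column_missing_column():
    # The transform should fail loudly if a required column is absent.
    with pytest.raises(KeyError):
        add_total_column(pd.DataFrame({"quantity": [1]}))
```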
- CI
Jenkins
CircleCI (https://circleci.com/) is a popular CI/CD tool that relies on a YAML file
GitHub Actions
- ETL testing tools
Apache JMeter
Talend: provides a wide range of data integration and transformation solutions
QuerySurge: focused on automating data testing, data validation, and data monitoring
QualiDi: an end-to-end test automation platform enabling enterprises to build a robust data testing framework
iCEDQ: specializes in data testing for ETL
Tricentis Tosca: an enterprise-level tool for functional and performance testing
ETL Validator: designed to automate ETL testing, including data migration testing and data warehouse testing
Informatica Data Validation: offers data validation and profiling features and can be integrated with Informatica Data Quality
RightData: focuses on data quality assurance and data reconciliation
Assertible: designed for automated API tests, including web services and ETL processes
- Best practices for ETL pipelines
A single point of failure (SPOF) in a data pipeline is a part of the system that, if it fails, will stop the entire system from working.
Auditing is all about reinforcing data integrity and specific logging for compliance purposes. It is the intentional recording and reviewing of pipeline processes (see the logging sketch below).
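One lightweight way to approach such auditing is a dedicated audit log that records what was processed, when, and with what outcome. The sketch below uses the standard logging module; the file name, logger name, and fields are assumptions:
```python
import logging

# Write audit records to a dedicated file so they can be reviewed for compliance.
audit_logger = logging.getLogger("etl.audit")
audit_logger.setLevel(logging.INFO)
handler = logging.FileHandler("etl_audit.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
audit_logger.addHandler(handler)


def audited_load(records: list, target_table: str) -> None:
    """Record how many rows were loaded, and into which table, for later review."""
    audit_logger.info("load started: target=%s rows=%d", target_table, len(records))
    try:
        # ... the actual load logic would go here ...
        audit_logger.info("load succeeded: target=%s", target_table)
    except Exception:
        audit_logger.exception("load failed: target=%s", target_table)
        raise
```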