This repository contains the exercises for a course on DataFrames and data processing with two different analytics engines: Polars and PySpark.
The outline below will be removed from the README, but it is useful as a guideline during development of the course.
- Operational vs. analytical data
- Data processing / data transformation
  - Example transformations (input/output) for each common transformation (relational model), as sketched below:
    - Join
    - Agg (GroupBy)
    - Window
    - Filter
    - Project
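A minimal sketch of these transformations in Polars, on hypothetical toy data (all table, column, and value names are illustrative only, not part of the exercises):

```python
import polars as pl

# Tiny illustrative inputs (hypothetical data).
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [5.0, 7.5, 3.0, 9.0],
})
customers = pl.DataFrame({
    "customer_id": [10, 20],
    "country": ["NL", "DE"],
})

# Join: enrich orders with customer attributes.
joined = orders.join(customers, on="customer_id", how="left")

# Agg (GroupBy): one output row per customer.
per_customer = orders.group_by("customer_id").agg(
    pl.col("amount").sum().alias("total")
)

# Window: aggregate per group without collapsing rows.
with_share = orders.with_columns(
    (pl.col("amount") / pl.col("amount").sum().over("customer_id")).alias("share")
)

# Filter: keep a subset of rows.
large = orders.filter(pl.col("amount") > 5)

# Project: keep a subset of columns.
projected = orders.select("order_id", "amount")
```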
- DataFrame abstraction: tabular data vs. unstructured or semi-structured data (see the sketch below)
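As a rough illustration of that distinction, this sketch (hypothetical data) flattens semi-structured, nested records into a tabular DataFrame with Polars:

```python
import polars as pl

# Semi-structured input (hypothetical data): nested structs and lists.
records = [
    {"user": {"id": 1, "name": "Ada"}, "events": ["login", "click"]},
    {"user": {"id": 2, "name": "Bob"}, "events": ["login"]},
]

df = pl.DataFrame(records)  # nested fields become struct / list columns

# Flatten into a tabular shape: one row per (user, event).
flat = df.unnest("user").explode("events")
print(flat)
```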
- Engines
  - Spark vs. Polars vs. DuckDB vs. Pandas
  - Cost / simplicity / scalability trade-off
  - SQL vs. DataFrame API (see the sketch after this list)
  - Roles: analyst, data scientist, data engineer
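A small sketch of the SQL vs. DataFrame API contrast, assuming duckdb, polars, and pyarrow are installed (the table and column names are made up):

```python
import duckdb
import polars as pl

sales = pl.DataFrame({
    "region": ["EU", "EU", "US"],
    "amount": [10, 20, 30],
})

# SQL: DuckDB scans the in-scope Polars DataFrame by its variable name.
by_region_sql = duckdb.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).pl()

# DataFrame API: the same aggregation expressed in Polars.
by_region_df = sales.group_by("region").agg(
    pl.col("amount").sum().alias("total")
)
```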
- Polars:
  - Deep dive + architecture (see the lazy-API sketch below)
  - Hands-on exercises
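A minimal sketch of Polars' lazy API, which the architecture deep dive builds on (toy data, illustrative only): a query plan is built first, optimized, then executed with `collect()`.

```python
import polars as pl

# Hypothetical toy data.
lf = pl.LazyFrame({
    "shop": ["a", "a", "b"],
    "revenue": [100, 150, 90],
})

query = (
    lf.filter(pl.col("revenue") > 95)
      .group_by("shop")
      .agg(pl.col("revenue").sum().alias("total"))
)

print(query.explain())  # inspect the optimized logical plan
print(query.collect())  # execute the plan
```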
- PySpark:
  - Deep dive + architecture (see the sketch below)
  - Hands-on exercises
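A minimal PySpark counterpart, assuming a local PySpark installation (toy data, illustrative only):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session for experimenting; a cluster would use a different master.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 100), ("a", 150), ("b", 90)],
    ["shop", "revenue"],
)

result = (
    df.filter(F.col("revenue") > 95)
      .groupBy("shop")
      .agg(F.sum("revenue").alias("total"))
)
result.show()
spark.stop()
```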
- Ecosystem
- Advanced
  - Arrow interoperability (see the sketch after this list)
  - Python DataFrame API
  - Substrait
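A small sketch of Arrow interoperability, assuming polars, pyarrow, and pandas are installed (toy data, illustrative only): the same columnar data moves between libraries via Arrow tables.

```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

arrow_table = df.to_arrow()          # Polars -> Arrow Table (columnar)
assert isinstance(arrow_table, pa.Table)

back_to_polars = pl.from_arrow(arrow_table)  # Arrow -> Polars
pandas_df = arrow_table.to_pandas()          # Arrow -> pandas
```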
- Outlook: processing in the data engineering landscape