This repo aims to cover the key concepts of Apache Spark and its Python API, PySpark. Apache Spark is an open-source unified analytics engine for large-scale data processing. It consists of five components, namely Core, SQL, ML, Streaming, and GraphX; the first three are covered here.
- First, we will get to know the basic principles of Spark and its Python API, PySpark.
- Next, we'll take a look at Spark's machine learning capabilities.
- With a business case in mind, we turn away from the local setup and move to the cloud, namely AWS.
There are several options for working through the content.
Option 1)
You can build a virtual environment with the corresponding dependencies via poetry or requirements:

```bash
poetry install
jupyter lab
```
Option 2)
Alternatively, you can run everything directly via Docker:

```bash
docker build -t tutorial .
docker run -p 8888:8888 tutorial
```
Option 3)
Run the notebooks directly in Google Colab.
PySpark is the Python API for Apache Spark. While Spark itself is implemented in Scala and runs on the Java Virtual Machine (JVM), PySpark allows Python developers to interface with Spark and take advantage of its distributed computing capabilities. Alongside PySpark, you can use and combine any other Python library in your scripts.
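As a minimal sketch (assuming PySpark, and pandas for the last step, are installed in your environment), the snippet below starts a local SparkSession, builds a small DataFrame, applies a transformation, and hands the result over to pandas to illustrate how PySpark mixes with other Python libraries:

```python
from pyspark.sql import SparkSession

# Start a local Spark session using all cores of the local machine.
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

# Build a small DataFrame from plain Python data.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    schema=["name", "age"],
)

# Apply a Spark transformation ...
adults = df.filter(df.age > 30)

# ... and convert the (small) result to pandas, i.e. any other Python library.
print(adults.toPandas())

spark.stop()
```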
You don't need to install Scala to use PySpark. PySpark comes bundled with a pre-built version of Spark that includes the Scala runtime. You never interact with Scala directly; instead, PySpark communicates with the Spark runtime, which is implemented in Scala and runs on the Java Virtual Machine (JVM). You do, however, need Java installed.
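To make the JVM dependency tangible, here is a small sketch that asks the JVM backing your Spark session for its Java version from Python. Note that it uses the internal `_jvm` handle (the Py4J gateway), which is not part of the public API and is used here only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# PySpark talks to the Scala/Java side of Spark through Py4J.
# Here we query the JVM for the Java version it is running on.
print(sc._jvm.java.lang.System.getProperty("java.version"))

spark.stop()
```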