CarstenFrommhold/Unlocking-the-Power-of-Pyspark

Unlocking the Power of PySpark

tl;dr

This repo covers the key concepts of Apache Spark and its Python API, PySpark. Apache Spark is an open-source unified analytics engine for large-scale data processing. It consists of five components: Core, SQL, ML, Streaming and GraphX. This repo touches on the first three.

What to expect

  • First, we will get to know the basic principles of Spark and its Python API, PySpark.
  • Next, we'll take a look at Spark's machine learning capabilities.
  • With a business case in mind, we turn away from the local setup and move to the cloud, namely AWS.

Setup

To get started with the content, choose one of the following options.

Option 1)
Build a virtual environment with the required dependencies via poetry or a requirements file.

poetry install
jupyter lab

Option 2)
Alternatively, start it directly via Docker.

docker build -t tutorial .
docker run -p 8888:8888 tutorial

Option 3)
Work through the content directly in Google Colab.

Is this really Python?

PySpark is the Python API for Apache Spark. While Spark is implemented in Scala and runs on the Java Virtual Machine (JVM), PySpark lets Python developers interface with Spark and take advantage of its distributed computing capabilities. Alongside PySpark, you can still use and combine any other Python library in your scripts.
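To make the point about mixing PySpark with other Python code concrete, here is a minimal sketch. It is an illustration, not code from this repo: the helper names `describe` and `run_spark_demo`, the app name, and the toy data are all made up. It assumes `pyspark` and Java are installed and that a local master (`local[*]`) is acceptable.

```python
import statistics


def describe(values):
    """Plain-Python summary built with the standard library (no Spark involved)."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}


def run_spark_demo():
    """Transform a small DataFrame with Spark, then hand the result to plain Python."""
    # Imported here so that describe() stays usable without a Spark installation.
    from pyspark.sql import SparkSession

    # Hypothetical app name; "local[*]" runs Spark on all local cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("python-interop-sketch")
        .getOrCreate()
    )
    try:
        # Toy data, invented for this example.
        df = spark.createDataFrame(
            [("a", 1.0), ("b", 3.0), ("c", 5.0)], ["key", "value"]
        )
        # The column arithmetic is executed by Spark on the JVM side.
        doubled = df.withColumn("value", df["value"] * 2)
        # collect() brings the rows back to the Python driver as Row objects.
        values = [row["value"] for row in doubled.collect()]
        return describe(values)
    finally:
        spark.stop()
```

Calling `run_spark_demo()` in a notebook cell starts a local Spark session, lets Spark do the column arithmetic, and passes the collected values to ordinary Python code. Note that `collect()` pulls all rows onto the driver, so this pattern only suits small results.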

So do I need Scala to be installed?

No, you don't need to install Scala to use PySpark. PySpark comes bundled with a pre-built version of Spark that includes the Scala runtime environment. You never interact with Scala directly; PySpark communicates with the Spark runtime, which is implemented in Scala and runs on the Java Virtual Machine (JVM). You do, however, need Java to be installed.
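Since Java is the one external requirement, a quick check before starting Spark can save some confusion. The sketch below uses only the standard library. The helper name `find_java` is hypothetical, and the lookup order (first `JAVA_HOME`, then the `PATH`) is an assumption about a typical setup, not a statement about how this repo configures Spark.

```python
import os
import shutil


def find_java():
    """Return the path to a Java executable, or None if none can be found."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # Look inside $JAVA_HOME/bin first; shutil.which also handles "java.exe" on Windows.
        candidate = shutil.which("java", path=os.path.join(java_home, "bin"))
        if candidate:
            return candidate
    # Fall back to whatever "java" is reachable on the PATH.
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("No Java executable found; install a JDK or JRE before starting Spark.")
else:
    print(f"Java found at: {java_path}")
```

If the check reports no Java executable, the Docker or Google Colab options from the Setup section avoid a local Java installation.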

How can I picture a cluster?

Like this:

Spark in cluster mode
