This repo aims to cover the key concepts of Apache Spark and its Python API, PySpark. Apache Spark is an open-source unified analytics engine for large-scale data processing. It consists of five components, namely Core, SQL, ML, Streaming, and GraphX; the first three are covered here.
- First, we will get to know the basic principles of Spark and its Python API, PySpark.
- Next, we'll take a look at Spark's machine learning capabilities.
- With a business case in mind, we turn away from the local setup and move to the cloud, namely AWS.
There are several options for working through the content.
Option 1)
You can build a virtual environment with the corresponding dependencies via poetry or requirements:

```bash
poetry install
jupyter lab
```
Option 2)
Alternatively, you can run everything directly via Docker:

```bash
docker build -t tutorial .
docker run -p 8888:8888 tutorial
```
Option 3)
Run the notebooks directly in Google Colab.
PySpark is the Python API for Apache Spark. While Spark itself is implemented in Scala and runs on the Java Virtual Machine (JVM), PySpark allows Python developers to interface with Spark and take advantage of its distributed computing capabilities. Alongside PySpark, you can use and combine any other Python library in your scripts.
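As a minimal sketch (assuming PySpark, and pandas for the last step, are installed in your environment), the snippet below starts a local SparkSession, builds a small DataFrame, applies a transformation, and hands the result over to pandas to illustrate how PySpark mixes with other Python libraries:

```python
from pyspark.sql import SparkSession

# Start a local Spark session using all cores of the local machine.
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

# Build a small DataFrame from plain Python data.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    schema=["name", "age"],
)

# Apply a Spark transformation ...
adults = df.filter(df.age > 30)

# ... and convert the (small) result to pandas, i.e. any other Python library.
print(adults.toPandas())

spark.stop()
```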
You don't need to install Scala to use PySpark. PySpark comes bundled with a pre-built version of Spark that includes the Scala runtime. You never interact with Scala directly; instead, PySpark communicates with the Spark runtime, which is implemented in Scala and runs on the Java Virtual Machine (JVM). You do, however, need Java installed.
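To make the JVM dependency tangible, here is a small sketch that asks the JVM backing your Spark session for its Java version from Python. Note that it uses the internal `_jvm` handle (the Py4J gateway), which is not part of the public API and is used here only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# PySpark talks to the Scala/Java side of Spark through Py4J.
# Here we query the JVM for the Java version it is running on.
print(sc._jvm.java.lang.System.getProperty("java.version"))

spark.stop()
```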