Spark

Introduction

Apache Spark is an open-source framework for fast, general-purpose big data processing, supporting both large-scale data manipulation and real-time analytics. In this course, you will learn how to use Spark to process massive datasets and perform complex data analysis tasks.

Educational goals

Discover the main functionalities of Apache Spark and why it is so widely used.
Understand the internals of Spark.
Learn to use Spark for batch and streaming data analytics.
Learn how to package a Spark application for production.
Learn Machine Learning with MLlib.

Prerequisites

Python programming knowledge and basic knowledge of the Linux/Unix shell.

Modules

Module 1 (3h) - Introduction to Spark & RDDs

Presentation
Spark in the Hadoop ecosystem
Use cases
Spark ecosystem
Internals
Data structures
Operations
Resilient Distributed Datasets (RDDs)

Module 2 (3h) - Spark SQL and DataFrames

RDDs: Pros and Cons
DataFrames
RDDs vs DataFrames
Working with DataFrames
Why SQL?

Module 3 (3h) - Spark Structured Streaming

Streaming introduction
Difference between batch and stream processing
Stream processing models
Different processing semantics
Programming model
Event-time vs. processing time
Windows: tumbling and overlapping (sliding)
Handling late data and how long to wait
Vocabulary

Module 4 (3h) - Application packaging and submitting

Spark functionalities recap
Spark application components
Spark application configuration
Spark application monitoring
Spark application performance tuning

Module 5 (3h) - Machine Learning with Spark MLlib

Spark ecosystem
Machine Learning with Spark: packages
Machine Learning
MLlib: Functionalities
When do we choose ML with Spark?
Spark Data Structures and connection to Python
Types of Parallelism
ML Terminology (Spark, sklearn…)
Steps of a Machine Learning project
Linear Regression and Random Forest regression

Resource

The book used for this course can be downloaded for free:

Learning Spark, 2nd Edition
