Spark

Introduction

Apache Spark is an open-source framework for fast, general-purpose big data processing, covering both large-scale batch workloads and near-real-time analytics. In this course, you will learn how to use Spark to process massive datasets and carry out complex data analysis tasks.

Educational goals

  • Discover the main features of Apache Spark and why it is so widely adopted.
  • Understand the internals of Spark.
  • Learn to use Spark for batch and streaming data analytics.
  • Learn how to package a Spark application for production.
  • Learn Machine Learning with MLlib.

Prerequisites

Python programming knowledge and basic familiarity with a Linux/Unix shell.

Modules

Module 1 (3h) - Introduction to Spark & RDDs

  • Presentation
  • Spark in Hadoop ecosystem
  • Use cases
  • Spark ecosystem
  • Internals
  • Data structures
  • Operations
  • Resilient Distributed Datasets (RDDs)
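As a taste of the "Operations" topic: RDD operations fall into transformations, which are lazy, and actions, which trigger the actual computation. The sketch below illustrates that idea in plain Python only. It is not Spark's API, and the class `LazyDataset` and the helper `square` are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Hypothetical stand-in for an RDD: records a plan, runs it only on an action."""

    def __init__(self, source: Iterable, plan: Optional[List[Callable]] = None):
        self._source = source
        self._plan = plan or []

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset with one more planned step.
        step = lambda it: (fn(x) for x in it)
        return LazyDataset(self._source, self._plan + [step])

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also lazy, nothing is evaluated here.
        step = lambda it: (x for x in it if pred(x))
        return LazyDataset(self._source, self._plan + [step])

    def collect(self) -> list:
        # Action: chains the planned steps and materializes the result.
        it = iter(self._source)
        for step in self._plan:
            it = step(it)
        return list(it)


calls = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still 0: no element has been processed yet
result = ds.collect()             # the action runs the whole plan
```

Building the chain of `map` and `filter` processes no data; only `collect()` does. Real Spark additionally partitions the data across a cluster, which this sketch leaves out.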

Module 2 (3h) - Spark SQL and DataFrames

  • RDDs: Pros and Cons
  • DataFrames
  • RDDs vs DataFrames
  • Working with DataFrames
  • Why SQL?

Module 3 (3h) - Spark Structured Streaming

  • Streaming introduction
  • Difference between batch and stream processing
  • Stream processing models
  • Different processing semantics
  • Programming model
  • Event-time vs. processing time
  • Windows: tumbling, overlapping
  • Handling late data and how long to wait
  • Vocabulary
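The windowing and late-data topics above can be sketched in plain Python. This is a conceptual illustration, not Spark's Structured Streaming API: the function names are hypothetical, times are integer seconds, and the lateness rule is a simplified assumption (an event is dropped once it is older than the latest event time seen minus an allowed delay).

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def tumbling_window(event_time: int, size: int) -> Window:
    """Tumbling windows do not overlap: each event belongs to exactly one."""
    start = (event_time // size) * size
    return (start, start + size)


def sliding_windows(event_time: int, size: int, slide: int) -> List[Window]:
    """Overlapping (sliding) windows: an event can belong to several windows."""
    windows = []
    start = (event_time // slide) * slide  # latest window start <= event_time
    while start > event_time - size:       # window must still contain the event
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def is_too_late(event_time: int, max_event_time_seen: int, allowed_delay: int) -> bool:
    """Simplified watermark rule (assumption): drop events older than the watermark."""
    watermark = max_event_time_seen - allowed_delay
    return event_time < watermark


one_window = tumbling_window(7, size=10)             # (0, 10)
two_windows = sliding_windows(7, size=10, slide=5)   # [(0, 10), (5, 15)]
dropped = is_too_late(3, max_event_time_seen=30, allowed_delay=10)
kept = not is_too_late(25, max_event_time_seen=30, allowed_delay=10)
```

The `allowed_delay` parameter is the "how long to wait" trade-off: a larger value accepts more late events but keeps window state around longer.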

Module 4 (3h) - Application packaging and submitting

  • Spark functionalities recap
  • Spark application components
  • Spark application configuration
  • Spark application monitoring
  • Spark application performance tuning

Module 5 (3h) - Machine Learning with Spark MLlib

  • Spark ecosystem
  • Machine Learning with Spark: packages
  • Machine Learning
  • MLlib: Functionalities
  • When do we choose ML with Spark?
  • Spark Data Structures and connection to Python
  • Types of Parallelism
  • ML Terminology (Spark, sklearn…)
  • Steps of a Machine Learning project
  • Linear Regression and Random Forest regression

Resource

The book used for this course can be downloaded for free:

Learning Spark, 2nd Edition