Apache Spark is an open-source framework for fast, general-purpose big data processing, supporting large-scale batch data manipulation as well as real-time analytics. In this course, you will learn how to leverage Spark's capabilities to process massive datasets and perform complex data analysis tasks.
- Discover the main functionalities of Apache Spark and why it is so widely used.
- Understand the internals of Spark.
- Learn to use Spark for batch and streaming data analytics.
- Learn how to package a Spark application for production.
- Learn machine learning with MLlib.
Prerequisites: Python programming knowledge and basic Linux/Unix shell skills.
- Introduction
- Spark in the Hadoop ecosystem
- Use cases
- Spark ecosystem
- Internals
- Data structures
- Operations (see the sketch after this list)
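To make the data structures and operations concrete, here is a minimal PySpark sketch (assuming a local installation, e.g. `pip install pyspark`) of Spark's execution model: transformations are lazy and only record lineage, while actions trigger the distributed computation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("operations-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (map, filter) are lazy: they only extend the lineage graph.
numbers = sc.parallelize(range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Actions (collect, count) force Spark to actually run the computation.
print(even_squares.collect())  # [0, 4, 16, 36, 64]

spark.stop()
```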
- Resilient Distributed Datasets (RDDs)
- RDDs: Pros and Cons
- DataFrames
- RDDs vs DataFrames
- Working with DataFrames (see the sketch after this list)
- Why SQL?
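As a sketch of the RDD/DataFrame contrast and of why SQL is convenient, the snippet below builds the same tiny dataset both ways; the column names and values are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("dataframe-demo").getOrCreate()

# RDD of plain tuples: no schema, so no help from the Catalyst optimizer.
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

# DataFrame: named, typed columns that Spark can analyze and optimize.
df = rdd.toDF(["name", "age"])
df.filter(df.age > 40).show()

# The same query expressed declaratively in SQL via a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```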
- Streaming introduction
- Difference between batch and stream processing
- Stream processing models
- Different processing semantics
- Programming model
- Event time vs. processing time
- Windows: tumbling, overlapping
- Handling late data and how long to wait (see the watermark sketch after this list)
- Vocabulary
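The sketch below illustrates this programming model with Structured Streaming: counts over 10-minute tumbling event-time windows, with a watermark bounding how long Spark waits for late data. The socket source and the durations are illustrative choices, not fixed by the course.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Treat a socket as an unbounded input table; each line gets a timestamp.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", "true")
         .load())

# Count lines per 10-minute tumbling event-time window; the watermark
# tells Spark to wait at most 5 minutes for late data before finalizing.
counts = (lines
          .withWatermark("timestamp", "5 minutes")
          .groupBy(window(col("timestamp"), "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```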
- Spark functionalities recap
- Spark application components
- Spark application configuration (see the sketch after this list)
- Spark application monitoring
- Spark application performance tuning
- Spark Ecosystem
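As an illustrative sketch of application configuration in code: the resource values below are placeholders, and in production these settings would more typically come from `spark-submit` flags or `spark-defaults.conf`.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("configured-app")
         .config("spark.executor.memory", "4g")          # per-executor heap
         .config("spark.executor.cores", "2")            # cores per executor
         .config("spark.sql.shuffle.partitions", "64")   # shuffle parallelism
         .getOrCreate())

# The driver's web UI (port 4040 by default) exposes jobs, stages, and
# storage for monitoring. The same settings could instead be passed as
#   spark-submit --conf spark.executor.memory=4g my_app.py
print(spark.sparkContext.getConf().get("spark.executor.memory"))
spark.stop()
```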
- Machine Learning with Spark: packages
- Machine Learning
- MLlib: Functionalities
- When do we choose ML with Spark?
- Spark Data Structures and connection to Python
- Types of Parallelism
- ML Terminology (Spark, sklearn…)
- Steps of a Machine Learning project
- Linear Regression and Random Forest regression (see the sketch below)
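To tie the terminology and project steps above to code, here is a hedged sketch of a small MLlib workflow: assemble a feature vector, split the data, fit Linear Regression and Random Forest regression models, and compare their RMSE on a held-out split. The synthetic dataset and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Synthetic data: the label is a noise-free linear function of two features.
rows = [(float(i), float(i % 7), 3.0 * i + 2.0 * (i % 7) + 1.0) for i in range(40)]
df = spark.createDataFrame(rows, ["x1", "x2", "label"])

# MLlib estimators expect a single vector column of features.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
data = assembler.transform(df)
train, test = data.randomSplit([0.75, 0.25], seed=42)

# Fit both regressors and compare root-mean-squared error on held-out data.
evaluator = RegressionEvaluator(metricName="rmse")
for estimator in (LinearRegression(), RandomForestRegressor(seed=42)):
    model = estimator.fit(train)
    print(type(model).__name__, "RMSE:", evaluator.evaluate(model.transform(test)))

spark.stop()
```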
You can freely download the book used for this course: