
apache-kafka-intro

Learn Apache Kafka 2.0 Ecosystem

Contents

Why Do We Need Real Time?
Batch to Real Time => Lambda Architecture
Streaming Architecture
Kafka / MapR Stream Concepts
Streaming Example
Kafka CLI (MapR Sandbox) Demo
Kafka Backend
Producers
Consumers
Kafka Streaming Architecture
Kafka Java Demo
Kafka Demo – Part 1
Kafka Eco Components
Kafka Demo – Part 2
Admin Responsibilities
Use Cases 1 - MovieFlix
Use Cases 2 - TrendBook

Why Do We Need Real Time?

Data is a continuous stream of events. An event is a significant change in state. The value of events usually decreases with time. So, to react to events as they happen, you need to process data as it arrives.


Batch to Real Time => Lambda Architecture

Twitter used to write their data both to an HDFS cluster (for batch processing) and to an Apache Storm cluster (for real-time processing). Writing data to both clusters had drawbacks:

  • Events were processed out of order
  • Events were lost because of node failures
  • Keeping both clusters in sync was difficult


It became important to find a reliable way to integrate source and target systems without needing to maintain data in two separate places.

Streaming Architecture

A message bus was the solution. With a message bus, you could decouple your source and target systems. The most popular messaging systems are:

  • Apache Kafka (initially developed at LinkedIn, now an Apache Software Foundation project; Confluent, founded by Kafka's creators, builds a commercial platform around it)
  • MapR Event Store for Apache Kafka (MapR)

These systems are distributed, fault tolerant, resilient, and scalable.


Kafka / MapR Stream Concepts

Topic:

  • Logical collection of messages or events
  • You can have as many topics as you want
  • Identified by its name

Partition:

  • Topics are split into partitions for parallelism

Offsets:

  • Each message within a partition gets an incremental ID called an offset

Streams (specific to MapR Event Store):

  • Stream is a collection of topics

Some important points:

  • To identify a message you need to specify the stream name (MapR only), topic name, partition number, and offset ID
  • Once data is written to a partition it cannot be changed
  • Data stored at the offsets is only kept for a limited amount of time
  • Order is guaranteed only within a partition and not across partitions
  • If you don't provide a key, data is spread round-robin across partitions (for example 0, 1, and 2); if you do provide a key, the key's hash determines the partition
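The addressing scheme above (topic → partition → offset) can be sketched as a toy model. This is illustrative Java, not Kafka client code: each partition is an append-only log, and a message's offset is simply its position in that log.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic: a fixed set of partitions, each an append-only log.
// Offsets are per-partition, start at 0, and only ever grow.
public class TopicModel {
    private final List<List<String>> partitions = new ArrayList<>();

    public TopicModel(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
    }

    // Append a message and return the offset it was assigned.
    public long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);       // once written, we only ever append, never modify
        return log.size() - 1;  // offsets are 0, 1, 2, ... within this partition
    }

    public String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        TopicModel topic = new TopicModel(3);
        long a = topic.append(0, "a"); // offset 0 in partition 0
        long b = topic.append(0, "b"); // offset 1 in partition 0
        long c = topic.append(1, "c"); // offset 0 in partition 1: offsets are per-partition
        System.out.println(a + " " + b + " " + c);
    }
}
```

Note how partition 1 starts again at offset 0: an offset is only meaningful together with its partition, which is why ordering is guaranteed within a partition but not across partitions.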


Streaming Example

Trucks report their GPS coordinates to Kafka through some reporting mechanism, sending coordinates every 20 seconds. Each message contains the truck ID and its coordinates. All trucks send data to one topic; you do not have one topic per truck. Create a Kafka topic named "trucks_gps" with 10 partitions (arbitrarily chosen). Once the data is in Kafka, you can write consumer applications, say a location dashboard or an application that calculates the velocity of each truck.
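A sketch of why you would key the messages by truck ID: the same key always lands in the same partition, so each truck's updates stay in order. Kafka's default partitioner actually hashes the key bytes with murmur2; the `hashCode`-based stand-in below is only for illustration.

```java
// Illustrative key-to-partition mapping for the trucks_gps example.
// Not Kafka's real partitioner (which uses murmur2 over the key bytes);
// the invariant it demonstrates is the same: identical keys always map
// to the same partition, preserving per-truck ordering.
public class TruckPartitioning {
    static final int NUM_PARTITIONS = 10; // as chosen for the trucks_gps topic

    static int partitionFor(String truckId) {
        // Mask the sign bit instead of Math.abs to stay safe for Integer.MIN_VALUE
        return (truckId.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("truck-42");
        int p2 = partitionFor("truck-42");
        System.out.println(p1 == p2); // same key, same partition, every time
    }
}
```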


Kafka CLI (MapR Sandbox) Demo

  • Create a Stream
[root@maprdemo mapr]# maprcli stream create -path /sample-stream
  • Create a Topic
[root@maprdemo mapr]# maprcli stream topic create -path /sample-stream  -topic fast-messages --partitions 3
  • List the Topic
[root@maprdemo mapr]# maprcli stream topic list -path /sample-stream
  • Produce to the Topic
[root@maprdemo bin]# ./kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic /sample-stream:fast-messages
  • Consume from the Topic
[root@maprdemo bin]# ./kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic /sample-stream:fast-messages

Kafka Backend

A topic or stream is split across the cluster. Consider this example cluster:

  • Brokers (servers): 3 (IDs 101, 102, 103)
  • Topics: 1 (Topic-A)
  • Partitions: 2 (0 and 1)
  • Replication factor: 2 (each partition has a leader plus one replica)

For a given partition, there is only one leader, which receives and serves all data for that partition; the brokers holding the other replicas synchronize data from it. If the leader broker goes down, an election chooses a new leader from the in-sync replicas. In Kafka 2.x this election is coordinated by the controller broker, which is itself elected via ZooKeeper.
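The failover step can be sketched as a toy simulation. In real Kafka the controller performs this; here we simply promote the first surviving member of the in-sync replica set.

```java
import java.util.ArrayList;
import java.util.List;

// Toy leader-failover sketch: one leader per partition, followers in the
// in-sync replica (ISR) set. When the leader dies, promote the first
// surviving ISR member. Real Kafka's controller does this bookkeeping.
public class LeaderElection {
    public static int electLeader(int failedLeader, List<Integer> isr) {
        List<Integer> survivors = new ArrayList<>(isr);
        survivors.remove(Integer.valueOf(failedLeader)); // drop the dead broker
        if (survivors.isEmpty()) throw new IllegalStateException("no in-sync replica left");
        return survivors.get(0); // promote the first remaining ISR member
    }

    public static void main(String[] args) {
        List<Integer> isr = List.of(101, 102, 103); // broker IDs from the example
        System.out.println(electLeader(101, isr));  // broker 101 fails, another takes over
    }
}
```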


Producers

Producers write data to topics. Producers automatically know which broker and partition to write to. Delivery semantics indicate the integrity of data as it moves from point A to point B:

  • acks = 0: Producer does not wait for acknowledgment (possible data loss) - at most once
  • acks = 1: Producer waits for the leader's acknowledgment (limited data loss) - at least once
  • acks = all: Producer waits for leader + in-sync replica acknowledgment (no data loss) - at least once; combined with the idempotent producer this gives exactly-once semantics, at a latency cost
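These settings live in the producer configuration. The sketch below only builds the `Properties` object (no broker is contacted); the keys `bootstrap.servers` and `acks` are real Kafka producer config names.

```java
import java.util.Properties;

// Producer configuration sketch for the acks setting. Building the
// Properties object is all that happens here; passing it to a real
// KafkaProducer would require a running broker and the kafka-clients jar.
public class ProducerAcksConfig {
    public static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        // "0" = fire and forget, "1" = leader ack only, "all" = leader + in-sync replicas
        props.put("acks", acks);
        return props;
    }

    public static void main(String[] args) {
        Properties safest = producerProps("all");
        System.out.println(safest.getProperty("acks")); // prints "all"
    }
}
```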


Consumers

Consumers read data from a topic. A consumer group is a set of consumers working together; a consumer group typically represents one application. Within a group, each partition is read by exactly one consumer (and a single consumer may read several partitions), so consumers never contend for the same messages and each gets a subset of them. Consumer offsets: Kafka stores the offsets at which a consumer group has been reading, so the group can keep track of its position as it reads data from a topic.
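The partition-to-consumer split can be sketched as follows, in the spirit of Kafka's round-robin assignor (real Kafka negotiates this through a group coordinator; this is only an illustration of the invariant):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of how a consumer group might split partitions: each partition
// goes to exactly one consumer in the group, and a consumer may own
// several partitions. Loosely modeled on Kafka's round-robin assignor.
public class GroupAssignment {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size()); // round-robin over consumers
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 3 consumers sharing 6 partitions: each ends up with 2 exclusive partitions
        Map<String, List<Integer>> a = assign(List.of("c1", "c2", "c3"), 6);
        System.out.println(a.get("c1")); // prints "[0, 3]"
    }
}
```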


Kafka Streaming Architecture


Kafka Java Demo

  • Start Zookeeper
C:\kafka_2.12-2.2.0> zookeeper-server-start.bat config\zookeeper.properties
  • Start Kafka Server
C:\kafka_2.12-2.2.0> kafka-server-start.bat config\server.properties
  • Create a Topic
C:\kafka_2.12-2.2.0> kafka-topics.bat --zookeeper 127.0.0.1:2181 --topic first_topic --create --partitions 3 --replication-factor 1
  • Start a Producer
C:\kafka_2.12-2.2.0> kafka-console-producer.bat --broker-list 127.0.0.1:9092 --topic first_topic
  • Start a Consumer
C:\kafka_2.12-2.2.0> kafka-console-consumer.bat --bootstrap-server 127.0.0.1:9092 --topic first_topic

Kafka Demo – Part 1


Kafka Eco Components

  • Kafka Connect Source
  • Kafka Connect Sink
  • Kafka Streams (use Spark Streaming or Apache Flink for advanced computation)
  • KSQL

Kafka Demo – Part 2


Admin Responsibilities

  • Kafka Admin

    • Kafka Cluster Setup
    • Isolate Zookeeper and Kafka brokers
    • Setup Kafka Monitoring (ES + Kibana, Confluent Control Center…)
    • Setup Security
    • Setup MirrorMaker (Manage Replication)
  • MapR Admin

    • MapR Cluster
    • MapR Monitoring
    • Secured by default

Use Cases 1 - MovieFlix


Use Cases 2 - TrendBook

