
apache-kafka-intro

Learn Apache Kafka 2.0 Ecosystem

Contents

Why Do We Need Real Time?
Batch to Real Time => Lambda Architecture
Streaming Architecture
Kafka / MapR Stream Concepts
Streaming Example
Kafka CLI (MapR Sandbox) Demo
Kafka Backend
Producers
Consumers
Kafka Streaming Architecture
Kafka Java Demo
Kafka Demo – Part 1
Kafka Eco Components
Kafka Demo – Part 2
Admin Responsibilities
Use Cases 1 - MovieFlix
Use Cases 2 - TrendBook

Why Do We Need Real Time?

Data is a continuous stream of events. An event is a significant change in state. The value of events usually decreases with time. So, to react to events as they happen, you need to process data as it arrives.


Batch to Real Time => Lambda Architecture

Twitter used to write their data both to an HDFS cluster (for batch processing) and to an Apache Storm cluster (for real-time processing). Writing data to both clusters had drawbacks:

  • Events were processed out of order
  • Events were lost because of node failures
  • Keeping both clusters in sync was difficult


It became important to find a reliable way to integrate source and target systems without needing to maintain data in two separate places.

Streaming Architecture

A message bus was the solution. With a message bus, you could decouple your source and target systems. The most popular messaging systems are:

  • Apache Kafka (initially developed at LinkedIn, now an Apache Software Foundation project; Confluent, founded by Kafka's creators, builds a commercial platform around it)
  • MapR Event Store for Apache Kafka (MapR)

These systems are distributed, fault tolerant, resilient, and scalable.


Kafka / MapR Stream Concepts

Topic:

  • Logical collection of messages or events
  • You can have as many topics as you want
  • Identified by its name

Partition:

  • Topics are split into partitions for parallelism

Offsets:

  • Each message within a partition gets an incremental ID called an offset

Streams (specific to MapR Event Store):

  • Stream is a collection of topics

Some important points:

  • To identify a message you need to specify the stream name (MapR only), topic name, partition number, and offset ID
  • Once data is written to a partition it cannot be changed
  • Data stored at the offsets is only kept for a limited amount of time
  • Order is guaranteed only within a partition and not across partitions
  • If you don't provide a key, data is spread round-robin across partitions (for example 0, 1, and 2); if you do provide a key, the key's hash determines the partition
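The addressing scheme above (topic → partition → offset) can be sketched as a toy model. This is illustrative Java, not Kafka client code: each partition is an append-only log, and a message's offset is simply its position in that log.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic: a fixed set of partitions, each an append-only log.
// Offsets are per-partition, start at 0, and only ever grow.
public class TopicModel {
    private final List<List<String>> partitions = new ArrayList<>();

    public TopicModel(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
    }

    // Append a message and return the offset it was assigned.
    public long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);       // once written, we only ever append, never modify
        return log.size() - 1;  // offsets are 0, 1, 2, ... within this partition
    }

    public String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        TopicModel topic = new TopicModel(3);
        long a = topic.append(0, "a"); // offset 0 in partition 0
        long b = topic.append(0, "b"); // offset 1 in partition 0
        long c = topic.append(1, "c"); // offset 0 in partition 1: offsets are per-partition
        System.out.println(a + " " + b + " " + c);
    }
}
```

Note how partition 1 starts again at offset 0: an offset is only meaningful together with its partition, which is why ordering is guaranteed within a partition but not across partitions.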


Streaming Example

Trucks report their GPS coordinates to Kafka through some reporting mechanism, sending coordinates every 20 seconds. Each message contains the truck ID and its coordinates. All trucks send data to one topic; you do not have one topic per truck. Create a Kafka topic named "trucks_gps" with 10 partitions (arbitrarily chosen). Once the data is in Kafka, you can write consumer applications, say a location dashboard or an application that calculates the velocity of each truck.
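A sketch of why you would key the messages by truck ID: the same key always lands in the same partition, so each truck's updates stay in order. Kafka's default partitioner actually hashes the key bytes with murmur2; the `hashCode`-based stand-in below is only for illustration.

```java
// Illustrative key-to-partition mapping for the trucks_gps example.
// Not Kafka's real partitioner (which uses murmur2 over the key bytes);
// the invariant it demonstrates is the same: identical keys always map
// to the same partition, preserving per-truck ordering.
public class TruckPartitioning {
    static final int NUM_PARTITIONS = 10; // as chosen for the trucks_gps topic

    static int partitionFor(String truckId) {
        // Mask the sign bit instead of Math.abs to stay safe for Integer.MIN_VALUE
        return (truckId.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("truck-42");
        int p2 = partitionFor("truck-42");
        System.out.println(p1 == p2); // same key, same partition, every time
    }
}
```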


Kafka CLI (MapR Sandbox) Demo

  • Create a Stream
[root@maprdemo mapr]# maprcli stream create -path /sample-stream
  • Create a Topic
[root@maprdemo mapr]# maprcli stream topic create -path /sample-stream  -topic fast-messages --partitions 3
  • List the Topic
[root@maprdemo mapr]# maprcli stream topic list -path /sample-stream
  • Produce to the Topic
[root@maprdemo bin]# ./kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic /sample-stream:fast-messages
  • Consume from the Topic
[root@maprdemo bin]# ./kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic /sample-stream:fast-messages

Kafka Backend

A topic or stream is split across the cluster. Consider this example cluster:

  • Brokers (servers): 3 (IDs 101, 102, 103)
  • Topics: 1 (Topic-A)
  • Partitions: 2 (0 and 1)
  • Replication factor: 2 (each partition has a leader plus one replica)

For a given partition, there is only one leader, which receives and serves all data for that partition; the brokers holding the other replicas synchronize data from it. If the leader broker goes down, an election chooses a new leader from the in-sync replicas. In Kafka 2.x this election is coordinated by the controller broker, which is itself elected via ZooKeeper.
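The failover step can be sketched as a toy simulation. In real Kafka the controller performs this; here we simply promote the first surviving member of the in-sync replica set.

```java
import java.util.ArrayList;
import java.util.List;

// Toy leader-failover sketch: one leader per partition, followers in the
// in-sync replica (ISR) set. When the leader dies, promote the first
// surviving ISR member. Real Kafka's controller does this bookkeeping.
public class LeaderElection {
    public static int electLeader(int failedLeader, List<Integer> isr) {
        List<Integer> survivors = new ArrayList<>(isr);
        survivors.remove(Integer.valueOf(failedLeader)); // drop the dead broker
        if (survivors.isEmpty()) throw new IllegalStateException("no in-sync replica left");
        return survivors.get(0); // promote the first remaining ISR member
    }

    public static void main(String[] args) {
        List<Integer> isr = List.of(101, 102, 103); // broker IDs from the example
        System.out.println(electLeader(101, isr));  // broker 101 fails, another takes over
    }
}
```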


Producers

Producers write data to topics. Producers automatically know which broker and partition to write to. Delivery semantics indicate the integrity of data as it moves from point A to point B:

  • acks = 0: Producer does not wait for acknowledgment (possible data loss) - at most once
  • acks = 1: Producer waits for the leader's acknowledgment (limited data loss) - at least once
  • acks = all: Producer waits for leader + in-sync replica acknowledgment (no data loss) - at least once; combined with the idempotent producer this gives exactly-once semantics, at a latency cost
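These settings live in the producer configuration. The sketch below only builds the `Properties` object (no broker is contacted); the keys `bootstrap.servers` and `acks` are real Kafka producer config names.

```java
import java.util.Properties;

// Producer configuration sketch for the acks setting. Building the
// Properties object is all that happens here; passing it to a real
// KafkaProducer would require a running broker and the kafka-clients jar.
public class ProducerAcksConfig {
    public static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        // "0" = fire and forget, "1" = leader ack only, "all" = leader + in-sync replicas
        props.put("acks", acks);
        return props;
    }

    public static void main(String[] args) {
        Properties safest = producerProps("all");
        System.out.println(safest.getProperty("acks")); // prints "all"
    }
}
```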


Consumers

Consumers read data from a topic. A consumer group is a set of consumers working together; a consumer group typically represents one application. Within a group, each partition is read by exactly one consumer (and a single consumer may read several partitions), so consumers never contend for the same messages and each gets a subset of them. Consumer offsets: Kafka stores the offsets at which a consumer group has been reading, so the group can keep track of its position as it reads data from a topic.
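The partition-to-consumer split can be sketched as follows, in the spirit of Kafka's round-robin assignor (real Kafka negotiates this through a group coordinator; this is only an illustration of the invariant):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of how a consumer group might split partitions: each partition
// goes to exactly one consumer in the group, and a consumer may own
// several partitions. Loosely modeled on Kafka's round-robin assignor.
public class GroupAssignment {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size()); // round-robin over consumers
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 3 consumers sharing 6 partitions: each ends up with 2 exclusive partitions
        Map<String, List<Integer>> a = assign(List.of("c1", "c2", "c3"), 6);
        System.out.println(a.get("c1")); // prints "[0, 3]"
    }
}
```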


Kafka Streaming Architecture


Kafka Java Demo

  • Start Zookeeper
C:\kafka_2.12-2.2.0> zookeeper-server-start.bat config\zookeeper.properties
  • Start Kafka Server
C:\kafka_2.12-2.2.0> kafka-server-start.bat config\server.properties
  • Create a Topic
C:\kafka_2.12-2.2.0> kafka-topics.bat --zookeeper 127.0.0.1:2181 --topic first_topic --create --partitions 3 --replication-factor 1
  • Start a Producer
C:\kafka_2.12-2.2.0> kafka-console-producer.bat --broker-list 127.0.0.1:9092 --topic first_topic
  • Start a Consumer
C:\kafka_2.12-2.2.0> kafka-console-consumer.bat --bootstrap-server 127.0.0.1:9092 --topic first_topic

Kafka Demo – Part 1


Kafka Eco Components

  • Kafka Connect Source
  • Kafka Connect Sink
  • Kafka Streams (use Spark Streaming or Apache Flink for advanced computation)
  • KSQL

Kafka Demo – Part 2


Admin Responsibilities

  • Kafka Admin

    • Kafka Cluster Setup
    • Isolate Zookeeper and Kafka brokers
    • Setup Kafka Monitoring (ES + Kibana, Confluent Control Center…)
    • Setup Security
    • Setup MirrorMaker (Manage Replication)
  • MapR Admin

    • MapR Cluster
    • MapR Monitoring
    • Secured by default

Use Cases 1 - MovieFlix


Use Cases 2 - TrendBook

