kafka-in-production

Curious to know how big companies operate their Kafka fleets in production? This might be the repo for you:

What issues are encountered when running Kafka in production? 📝
How do other organisations attempt to solve these issues? 🛠️
Why are certain approaches adopted over others? ⚖️
What can we learn for our own use case?

Table of Contents

Adobe
Agoda
Airbnb
Apple
Appsflyer
Bloomberg
Bolt
Booking.com
Brex
Cloudflare
Coinbase
Datadog
Deliveroo
GoTo
Grab
LinkedIn
Lyft
Netflix
Pinterest
Robinhood
Slack
Stripe
Uber
Wise
Wix
Yelp
Zalando

Adobe

How Adobe Experience Platform Pipeline Became the Cornerstone of In-Flight Processing for Adobe - 2019 - 📚
Moving Beyond Newtonian Reductionism in the Management of Large-Scale Distributed Systems, Part 2 - 2019 - 📚
Adobe Experience Platform’s Streaming Sources and Destinations Overview and Architecture - 2019 - 📚
Wins from Effective Kafka Monitoring at Adobe: Stability, Performance, and Cost Savings - 2019 - 📚
Creating Adobe Experience Platform Pipeline with Kafka - 2018 - 📚

Agoda

How Agoda manages 1.5 Trillion Events per day on Kafka - 2021 - 📚
Adding Time Lag to Monitor Kafka Consumer - 2021 - 📚
How our data scientists' petabytes of data is ingested into Hadoop (from Kafka) - 2021 - 📚

Airbnb

Migrating Kafka transparently between Zookeeper clusters - 2021 - 📚

Apple

Experiences Operating Apache Kafka® at Scale - 2019 - 🎙️
Kafka as a Service A Tale of Security and Multi Tenancy - 2018 - 🎙️

Appsflyer

Kafka Lag Monitoring For Human Beings - 2020 - 🎙️

Bloomberg

Fully-Managed, Multi-Tenant Kafka Clusters: Tips, Tricks, and Tools - 2022 - 🎙️

Bolt

Using Apache Kafka and ksqlDB for Data Replication at Bolt - 2021 - 🎙️
How Bolt Has Adopted Change Data Capture with Confluent Platform - 2020 - 📚
Kewei Shang - 2020 - 📚

Booking.com

Data Streaming Ecosystem Management at Booking.com - 2018 - 📚

Brex

Transactional Events Publishing At Brex - 2022 - 📚

Cloudflare

Intelligent, automatic restarts for unhealthy Kafka consumers - 2023 - 📚
Using Apache Kafka to process 1 trillion inter-service messages - 2022 - 📚

Coinbase

Kafka infrastructure renovation at Coinbase - 2022 - 📚
How we scaled data streaming at Coinbase using AWS MSK - 2021 - 📚

Datadog

Running Production Kafka Clusters in Kubernetes - 2019 - 🎙️

Deliveroo

Improving Stream Data Quality With Protobuf Schema Validation - 2019 - 📚

GoTo

Sink Kafka Messages to ClickHouse Using 'ClickHouse Kafka Ingestor' - 2022 - 📚
When Kafka Went Offshore - 2021 - 📚
Enhancing Ziggurat - The Backbone Of Gojek's Kafka Ecosystem - 2021 - 📚
Handling Dead Letters in a Streaming System - 2020 - 📚
How Kafka Solved a Culture Problem at Gojek - 2019 - 📚
Fronting : An Armoured Car for Kafka Ingestion - 2018 - 📚
Sakaar: Taking Kafka data to cloud storage at GO-JEK - 2018 - 📚

Grab

Zero trust with Kafka - 2022 - 📚
How Kafka Connect helps move data seamlessly - 2022 - 📚
Exposing a Kafka Cluster via a VPC Endpoint Service - 2022 - 📚
Detect Fraud Successfully with GrabDefence! - 2021 - 🎙️
Optimally Scaling Kafka Consumer Applications - 2020 - 📚

LinkedIn

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn - 2022 - 📚
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters - 2022 - 📚
How LinkedIn customizes Apache Kafka for 7 trillion messages per day - 2019 - 📚
Test Strategy for Samza/Kafka Services - 2017 - 📚
Kafka Ecosystem at LinkedIn - 2016 - 📚
Kafkaesque Days at LinkedIn – Part 1 - 2016 - 📚
How We’re Improving and Advancing Kafka at LinkedIn - 2015 - 📚

Lyft

Building an Adaptive, Multi-Tenant Stream Bus with Kafka and Golang - 2020 - 📚
Can Kafka Handle a Lyft Ride? - 2020 - 🎙️
Operating Apache Kafka Clusters 24/7 Without A Global Ops Team - 2019 - 📚
Bulletproof Apache Kafka® with Fault Tree Analysis - 2019 - 🎙️
Production Ready Kafka on Kubernetes - 2019 - 🎙️

Netflix

Featuring Apache Kafka in the Netflix Studio and Finance World - 2020 - 📚
Inca — Message Tracing and Loss Detection For Streaming Data @Netflix - 2019 - 📚
Evolution of the Netflix Data Pipeline - 2016 - 📚
Kafka Inside Keystone Pipeline - 2016 - 📚

Pinterest

Lessons Learned from Running Apache Kafka at Scale at Pinterest - 2021 - 📚
How Pinterest runs Kafka at scale - 2018 - 📚
Open sourcing DoctorKafka: Kafka cluster healing and workload balancing - 2017 - 📚

Robinhood

Tackling Kafka, with a Small Team - 2019 - 🎙️

Slack

Building Self-driving Kafka clusters using open source components - 2022 - 📚

Stripe

6 Nines: How Stripe keeps Kafka highly-available across the globe - 2022 - 🎙️

Uber

Securing Kafka® Infrastructure at Uber - 2022 - 📚
Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot - 2021 - 📚
Introducing uGroup: Uber’s Consumer Management Framework - 2021 - 📚
Disaster Recovery for Multi-Region Kafka at Uber - 2020 - 📚
Kafka Cluster Federation at Uber - 2019 - 🎙️
Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka - 2018 - 📚
Introducing Chaperone: How Uber Engineering Audits Apache Kafka End-to-End - 2016 - 📚
uReplicator: Uber Engineering’s Robust Apache Kafka Replicator - 2016 - 📚

Wise

Rack awareness in Kafka Streams - 2022 - 📚
Teamwork: Implementing a Kafka retry strategy at Wise - 2021 - 📚
Running Kafka in Kubernetes, Part 1: Why we migrated our Kafka clusters to Kubernetes. - 2021 - 📚
Running Kafka in Kubernetes, Part 2: How we migrated our Kafka clusters to Kubernetes. - 2021 - 📚
Securing Kafka with SPIFFE at TransferWise - Jonathan Oddy, Levani Kokhreidze - 2020 - 🎙️
Achieving high availability with stateful Kafka Streams applications - 2018 - 📚

Wix

4 Steps for Kafka Rebalance - Notes From the Field - 2021 - 📚
Wix’s Journey Into Data Streams - 2021 - 📚
Building a High-level SDK for Kafka: Greyhound Unleashed - 2020 - 📚

Yelp

Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 1 - Architecture) - 2021 - 📚
Streams and Monk – How Yelp is Approaching Kafka in 2020 - 2020 - 📚
Billions of Messages a Day – Yelp’s Real-time Data Pipeline - 2017 - 🎙️

Zalando

Rock Solid Kafka and ZooKeeper Ops on AWS - 2018 - 📚
Many-to-Many Relationships Using Kafka - 2018 - 📚
Event First Development - Moving Towards Kafka Pipeline Applications - 2017 - 📚
Reattaching Kafka EBS in AWS - 2017 - 📚
Real-time Ranking with Apache Kafka’s Streams API - 2017 - 📚
Running Kafka Streams applications in AWS - 2017 - 📚
A Recipe for Kafka Lag Monitoring - 2017 - 📚
Surviving Data Loss - 2017 - 📚