AutoMQ Helps CaoCao Mobility Address Kafka Scalability During National Day
Author: Wang Zhiyang, Messaging Middleware Director, CaoCao Mobility
CaoCao Mobility was founded on May 21, 2015, as a strategic investment by Geely Holding Group to establish a "new energy vehicle sharing ecosystem." It has since evolved into China's leading shared mobility platform. With the mission of "redefining green shared mobility through technology," CaoCao Mobility leverages cutting-edge internet, connected car, autonomous driving technologies, and new energy innovations in the shared mobility space. The brand's slogan, "Heartfelt service for national travel," underscores its commitment to becoming the most reputable mobility service brand.
As a mobility service provider, CaoCao Mobility's data volume is closely tied to time and various scenarios. Data spikes are observed during holidays, rush hours, weather events, and specific activities like concerts. Additionally, years of operation have led to a substantial amount of travel data that needs to be analyzed daily.
Under these circumstances, the main issue with Kafka was that its integrated compute-storage architecture lacks elasticity. Because of this inherent rigidity in Apache Kafka's architecture, we ran into several problems in production:
- Limited Cloud Disk Storage and Scalability: Kafka relies heavily on local storage. The existing Kafka clusters at CaoCao Mobility use the MBR disk format by default, which supports at most 2.2 TB per disk. Converting the disk format in place could lead to data loss. Given these constraints, we had to either shorten the retention period of Topics or mount multiple cloud disks on each EC2 instance. We initially adopted the latter approach, but it added a significant operational burden. Every scaling operation was nerve-wracking, because it involved not only Kafka's own scaling issues but also manual disk mounting, configuration, and validation.
- Complex and Risky Kafka Cluster Scaling: Scaling a Kafka cluster is a complex, high-risk operation. To avoid wasting compute resources when only storage was insufficient, we mounted multiple cloud disks on a single Broker. As a result, scaling the cluster requires planning Topic reassignment and allocation per Broker and per disk, which makes the whole process complicated and risky. It also requires coordinating with upstream and downstream applications so that the work happens during off-peak hours and business disruption stays minimal.
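To illustrate why planning by Broker and by disk is tedious, here is a minimal Python sketch that builds a plan in the JSON layout accepted by Kafka's `kafka-reassign-partitions.sh` tool, where `log_dirs` selects the disk. The topic name, broker ids, mount paths, and the `build_reassignment_plan` helper are hypothetical and not taken from CaoCao Mobility's setup; each partition has a single replica for brevity.

```python
import json
from itertools import cycle


def build_reassignment_plan(topic, num_partitions, brokers_to_dirs):
    """Spread partitions round-robin over every (broker, log dir) pair.

    brokers_to_dirs maps a broker id to the log directories on that
    broker (one per mounted cloud disk). Hypothetical helper.
    """
    # Interleave brokers so consecutive partitions land on different brokers.
    slots = []
    max_dirs = max(len(dirs) for dirs in brokers_to_dirs.values())
    for i in range(max_dirs):
        for broker, dirs in sorted(brokers_to_dirs.items()):
            if i < len(dirs):
                slots.append((broker, dirs[i]))

    slot_iter = cycle(slots)
    partitions = []
    for p in range(num_partitions):
        broker, log_dir = next(slot_iter)
        partitions.append({
            "topic": topic,
            "partition": p,
            "replicas": [broker],    # single replica, for brevity only
            "log_dirs": [log_dir],   # one entry per replica
        })
    return {"version": 1, "partitions": partitions}


# Assumed layout: brokers 1 and 2 have two disks each, broker 3 has one.
plan = build_reassignment_plan(
    "orders",
    6,
    {
        1: ["/data/disk1", "/data/disk2"],
        2: ["/data/disk1", "/data/disk2"],
        3: ["/data/disk1"],
    },
)
print(json.dumps(plan, indent=2))
```

With stock Kafka, a file like this would be passed to the tool through `--reassignment-json-file`, executed, and then verified once the data copy finishes, and the plan has to be rebuilt whenever a Broker or disk is added.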
After thorough research, we found that AutoMQ's innovative shared storage architecture can fully resolve Kafka's elasticity issues:
- Rapid Scalability, Responsive to Business Changes: AutoMQ is designed to offload data persistence to cloud storage such as object storage and cloud disks. As a result, scaling no longer involves copying partition data between Brokers as it does in Apache Kafka, and the compute and storage layers can be fully separated. Partition migration in AutoMQ is merely a metadata change, so it completes in seconds. When Broker nodes are added or removed, AutoMQ can move partitions onto the new nodes, or off the nodes being removed, within seconds, so the whole scaling operation finishes quickly.
- Automated Elasticity, Reduced Complexity in Scaling: With a built-in self-balancing component running continuously, AutoMQ observes cluster metrics and generates real-time scheduling plans that reassign partitions automatically. When scaling AutoMQ, users no longer need to create Topic and partition reassignment plans by hand, because the entire process is automated. Turning Kafka's high-risk, complex scaling operations into low-risk, routine automated processes is a significant innovation. In the past, we encountered failures during Kafka scaling caused by the interaction between Topic expiration-deletion tasks and Topic reassignment. With AutoMQ, we no longer need to worry about these issues.
- Continuous Self-Balancing, Freeing Up Maintenance: AutoMQ comes with a continuously running self-balancing component, which is a true boon for Kafka cluster maintenance personnel. The component automatically monitors metrics within the AutoMQ cluster and, using these metrics and a built-in rule engine, generates and executes partition reassignment plans in real time, bringing the following benefits:
  - Node Failure Self-Healing: When a Broker node fails, the self-balancing component automatically reassigns the partitions from the failed node to other healthy nodes.
  - Improved Cluster Capacity Utilization: The self-balancing component automatically reassigns partitions so that the throughput capacity of each Broker in the cluster is fully utilized, avoiding wasted resources.
  - Broker Hotspot Self-Healing: Partition hotspots are common in Kafka. The self-balancing component automatically identifies hotspot Brokers and, following the rule engine, migrates some of their partitions to other Brokers. This disperses the hotspot partitions while keeping capacity utilization efficient.
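The behaviors described above can be sketched with a simple greedy heuristic. The following Python example is only an illustration under stated assumptions: it is not AutoMQ's actual rule engine, and the `Cluster` class, the throughput figures, and the threshold are hypothetical. It models a partition move as a change to an ownership map only, mirroring the idea that migration is a metadata update when data lives in shared storage.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # partition name -> owning broker id (the only state a move touches)
    owner: dict = field(default_factory=dict)
    # partition name -> observed throughput in MB/s (hypothetical metric)
    load: dict = field(default_factory=dict)
    brokers: set = field(default_factory=set)

    def broker_load(self, broker):
        return sum(self.load[p] for p, b in self.owner.items() if b == broker)

    def least_loaded(self, exclude=()):
        candidates = sorted(self.brokers - set(exclude))
        return min(candidates, key=self.broker_load)


def relieve_hotspots(cluster, threshold):
    """Move partitions off brokers whose load exceeds `threshold`."""
    moves = []
    if len(cluster.brokers) < 2:
        return moves
    for broker in sorted(cluster.brokers):
        while cluster.broker_load(broker) > threshold:
            owned = [p for p, b in cluster.owner.items() if b == broker]
            if len(owned) <= 1:
                break  # a single hot partition cannot be split by moving it
            target = cluster.least_loaded(exclude=[broker])
            lightest = min(owned, key=lambda p: cluster.load[p])
            after_move = cluster.broker_load(target) + cluster.load[lightest]
            if after_move >= cluster.broker_load(broker):
                break  # moving would only shift the hotspot elsewhere
            cluster.owner[lightest] = target  # metadata-only reassignment
            moves.append((lightest, broker, target))
    return moves


def heal_failed_broker(cluster, failed):
    """Reassign every partition of a failed broker to healthy brokers."""
    cluster.brokers.discard(failed)
    orphans = [p for p, b in cluster.owner.items() if b == failed]
    # Place the heaviest partitions first so the result stays even.
    for p in sorted(orphans, key=lambda p: -cluster.load[p]):
        cluster.owner[p] = cluster.least_loaded()


cluster = Cluster(
    owner={"a": 1, "b": 1, "c": 1, "d": 2, "e": 3},
    load={"a": 40, "b": 30, "c": 20, "d": 10, "e": 10},
    brokers={1, 2, 3},
)
moves = relieve_hotspots(cluster, threshold=50)
loads_after_balance = {b: cluster.broker_load(b) for b in sorted(cluster.brokers)}

heal_failed_broker(cluster, failed=2)
loads_after_heal = {b: cluster.broker_load(b) for b in sorted(cluster.brokers)}
print(moves, loads_after_balance, loads_after_heal)
```

A real balancer would weigh more signals than one throughput number and would rate-limit its moves; the sketch only shows that both hotspot relief and failure healing reduce to recomputing partition ownership once no data has to be copied.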
Another important reason for choosing AutoMQ is that it addresses Kafka's elasticity pain points while maintaining 100% compatibility with Apache Kafka. Since we already run a significant amount of Kafka-related data infrastructure, this compatibility lets us move to AutoMQ smoothly without changing any of the surrounding systems.
The following architecture diagram illustrates the position of AutoMQ within CaoCao Mobility's data stack and explains how it functions.
The data sources for CaoCao Mobility primarily come from RDS, application instrumentation writes, and logs collected by ilogtail. These data mainly include core travel-related data such as orders, drivers, and passengers. The data primarily flows into three clusters:
- Big Data Cluster: The topics in this cluster are mainly used for big data analysis, such as user behavior analysis and funnel analysis, to better guide business decisions and operations in the mobility business.
- Observability Cluster: This cluster collects information such as traces and metrics and stores it in Elasticsearch. It is mainly used for fault diagnosis and real-time alerting, which allows application issues and business risks to be detected early.
- Business Cluster: The data sent from business application instrumentation is processed by Flink to generate various reports.
To date, AutoMQ has helped CaoCao Mobility get through the traffic peaks of the Mid-Autumn Festival and National Day with a seamless scaling experience. The following diagram shows an AutoMQ cluster in one of CaoCao Mobility's production environments during the Mid-Autumn period. The travel business shows strong time-based periodicity, with peaks at the morning rush (7 AM), the evening rush (6 PM), and the return trip on the last day of the Mid-Autumn Festival (9 PM). With AutoMQ, we no longer face the operational anxiety we experienced with Kafka during travel peaks. When we need to scale, AutoMQ quickly expands the cluster to the specified capacity so that it can handle production traffic stably. This resolves the elasticity pain points of our past Kafka deployments and significantly reduces our operational complexity and risk, which has made our Kafka operations team noticeably happier.
Overall, AutoMQ's innovations on top of Kafka give it a leading technical position worldwide. It is a Kafka product that remains fully compatible while maximizing cost efficiency and elasticity. Going forward, we will continue to collaborate with AutoMQ and to broaden and deepen its use within CaoCao Mobility.