From 6431c618b6b9c624958a8401ec9ce01672709532 Mon Sep 17 00:00:00 2001
From: Kaiming Wan
Date: Mon, 3 Jun 2024 22:30:54 +0800
Subject: [PATCH] chore: add readme

Signed-off-by: Kami Wan
---
 README.md    | 140 ++++++++++++++++++++++++++++++++++++++-------------
 README_zh.md | 117 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 221 insertions(+), 36 deletions(-)
 create mode 100644 README_zh.md

diff --git a/README.md b/README.md
index c758ce0..6e3bad3 100644
--- a/README.md
+++ b/README.md
@@ -1,48 +1,116 @@
-# OpenMessaging Benchmark Framework For comparison
+## Kafka Provider Comparison

-[![Build](https://github.com/openmessaging/benchmark/actions/workflows/pr-build-and-test.yml/badge.svg)](https://github.com/openmessaging/benchmark/actions/workflows/pr-build-and-test.yml)
-[![License](https://img.shields.io/badge/license-Apache%202-4EB1BA.svg)](https://www.apache.org/licenses/LICENSE-2.0.html)
+Kafka Provider Comparison (KPC) is a public Kafka comparison platform built on the OpenMessaging Benchmark code. In recent years the Kafka API
+has become the de facto standard in the streaming field, and many new stream systems compatible with the Kafka API (hereinafter referred to as Kafka Providers) have emerged.
+The intention behind this comparison platform is not to crown the best Kafka stream system, but to provide a **fair**, **objective**, and **open** comparison environment that produces an objective and fair comparison report of Kafka
+stream systems. The report covers multiple dimensions such as latency, cost, elasticity, and throughput. Different products make different architectural choices and trade-offs, so they naturally perform differently across these dimensions. Such an objective comparison is highly valuable for users making technology selections.

-**Notice:** We do not consider or plan to release any unilateral test results based on this standard. For reference, you can purchase server tests on the cloud by yourself.
+## Supported Kafka Providers

-This repository houses user-friendly, cloud-ready benchmarking suites for the following messaging platforms:
-
-* [AutoMQ for Kafka](./driver-automq/README.md)
-* [Apache ActiveMQ Artemis](https://activemq.apache.org/components/artemis/)
-* [Apache Bookkeeper](https://bookkeeper.apache.org)
+* [AutoMQ](https://www.automq.com)
 * [Apache Kafka](https://kafka.apache.org)
-* [Apache Pulsar](https://pulsar.apache.org)
-* [Apache RocketMQ](https://rocketmq.apache.org)
-* Generic [JMS](https://javaee.github.io/jms-spec/)
-* [KoP (Kafka-on-Pulsar)](https://github.com/streamnative/kop)
-* [NATS JetStream](https://docs.nats.io/nats-concepts/jetstream)
-* [NATS Streaming (STAN)](https://docs.nats.io/legacy/stan/intro)
-* [NSQ](https://nsq.io)
-* [Pravega](https://pravega.io/)
-* [RabbitMQ](https://www.rabbitmq.com/)
-* [Redis](https://redis.com/)
-> More details could be found at the [official documentation](http://openmessaging.cloud/docs/benchmarks/).
+
+## Comparison Platform Execution Logic
+
+All execution logic of the comparison platform is encapsulated in GitHub Actions, and a GitHub Actions Workflow triggers the comparison tasks. The tasks run as follows (a simplified workflow sketch follows the list):
+
+1. When the scheduled trigger condition is met, GitHub Actions starts the Workflow.
+2. The Workflow orchestrates the Benchmark pipelines of multiple Kafka Providers, which run in parallel.
+3. Each provider's Benchmark pipeline consists of the following sub-stages, executed in order; the different Kafka Providers are evaluated at the same time.
+   1. Install: Initialize cloud resources on AWS from the Terraform configuration files, check out the code, install dependencies, and deploy the Kafka Provider via an Ansible playbook.
+      This stage also calculates costs from the Terraform configuration, which feeds into the final Comparison Report.
+   2. Benchmark: Depends on the Install stage and starts only after it completes. It mainly uses the Terraform output to remotely access the cloud-hosted client machines and run the OpenMessaging Benchmark tests.
+   3. Generate Report: On the GitHub Runner machine, the benchmark result files produced by the cloud-based clients are copied and parsed into the final Report, which is displayed in [issue-1](https://github.com/AutoMQ/kafka-provider-comparison/issues/1).
+   4. Uninstall: Depends on the Benchmark stage and starts only after it completes. It cleans up the cloud resources, deleting both the cloud-based client machines and the cloud-based Kafka Provider clusters.
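+
+The sketch below illustrates how such a Workflow could be laid out. It is a hypothetical, simplified example: the job layout, the cron schedule, and the `run-benchmark.sh` / `generate-report.sh` helper scripts are illustrative only and are not taken from this repository's actual workflow files.
+
+```
+# Simplified sketch of the comparison Workflow; all names are illustrative.
+name: kafka-provider-comparison
+on:
+  schedule:
+    - cron: "0 16 * * 0"             # hypothetical weekly trigger
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        provider: [automq, kafka]    # one parallel pipeline per Kafka Provider
+    steps:
+      - uses: actions/checkout@v4
+      # Install: provision cloud resources with Terraform (Ansible deployment omitted for brevity)
+      - run: terraform -chdir=driver-${{ matrix.provider }}/deploy/aws-cn apply -auto-approve -var-file=var.tfvars
+      # Benchmark: run the OpenMessaging Benchmark from the cloud client machine (hypothetical helper script)
+      - run: ./run-benchmark.sh ${{ matrix.provider }}
+      # Generate Report: parse the result files and post the report to issue-1 (hypothetical helper script)
+      - run: ./generate-report.sh ${{ matrix.provider }}
+      # Uninstall: always tear down the cloud resources, even if earlier steps fail
+      - if: always()
+        run: terraform -chdir=driver-${{ matrix.provider }}/deploy/aws-cn destroy -auto-approve -var-file=var.tfvars
+```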
+
+## Benchmark Report Overview
+A complete Benchmark Report includes the following components:
+- Report Generated: The generation timestamp of the report. Using this timestamp you can find the corresponding Workflow run in the repository's GitHub Actions, including how the cost was calculated and the logs of the Benchmark execution.
+- Workload Configuration: The workload configuration extracted from the OpenMessaging Benchmark run logs. To keep the comparison fair, identical Workload, Producer, and Consumer configurations are used for all Kafka Providers.
+- Producer Configuration: Producer settings, identical for all Kafka Providers.
+- Consumer Configuration: Consumer settings, identical for all Kafka Providers.
+- Topic Configuration: Topic settings, identical for all Kafka Providers.
+- Replication Factor: Replication factor, identical for all Kafka Providers.
+- Average Throughput: The average throughput over the entire benchmark run, in MB/s.
+- E2E Latency Avg (ms): The average end-to-end latency over the entire benchmark run, in milliseconds.
+- E2E P95 Latency (ms): The 95th-percentile end-to-end latency over the entire benchmark run, in milliseconds.
+- E2E P99 Latency (ms): The 99th-percentile end-to-end latency over the entire benchmark run, in milliseconds.
+- Baseline Cost: The baseline IaaS cost of this Kafka Provider, excluding usage-based costs, in USD. It is derived by analyzing the Terraform configuration files with [Infracost](https://www.infracost.io/).
+- Usage Cost: The usage-based cloud resource cost of this Kafka Provider, in USD. It is derived from the Terraform configuration files and a usage configuration with [Infracost](https://www.infracost.io/); the usage figures come from the infracost directory. For instance, real-world production data for AutoMQ shows that each GB of write traffic typically generates 31.25 PUTs and 12.5 GETs, so costs are computed from an assumed average write traffic of 10 MB/s to estimate the number of API calls per second. The cost estimation methodology is explained in detail in a later section.
+- Total Cost: The total cost of this Kafka Provider, in USD; the sum of the baseline and usage costs.
+
+## How to Contribute
+Assuming your Kafka provider is named `foo`, add `foo` to the comparison list by creating a module named `driver-foo` in the root directory. Its `/deploy/aws-cn` directory must contain the following key files:
+- provision-kafka-aws.tf: Defines the Terraform resources
+- var.tfvars: Customizes the machine type, disk size, and number of brokers/servers (a hypothetical sketch follows the tip below)
+
+> Tips: Currently, comparisons are only supported on aws-cn; additional cloud providers will be supported in the future. Tests may use machines of any number and specification, but they must meet the minimum average-throughput requirement, otherwise the results will not be displayed. Higher machine specifications improve performance but also increase cost.
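+
+For orientation, a `var.tfvars` for the hypothetical `driver-foo` module might look like the sketch below. The variable names are illustrative and must match the variables actually declared in your `provision-kafka-aws.tf`.
+
+```
+# Hypothetical var.tfvars for driver-foo; variable names must match provision-kafka-aws.tf.
+region          = "cn-northwest-1"   # aws-cn region
+instance_type   = "r6g.large"        # broker/server machine type
+ebs_volume_size = 500                # data disk size, in GB
+server_count    = 3                  # number of brokers/servers
+client_count    = 2                  # benchmark client machines
+```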
+
+### Standardized Workload
+To keep the comparison fair, we have standardized on a representative workload configuration, tail-read-500m.yaml, which generates a theoretical peak write throughput of 500 MB/s (a producerRate of 128,000 messages/s with 4 KB messages is roughly 500 MB/s).
+
+#### Workload configuration
+```
+name: 1-topic-1000-partitions-4kb-4p4c-500m
+
+topics: 1
+partitionsPerTopic: 1000
+messageSize: 4096
+payloadFile: "payload/payload-4Kb.data"
+subscriptionsPerTopic: 1
+consumerPerSubscription: 4
+producersPerTopic: 4
+producerRate: 128000
+consumerBacklogSizeGB: 0
+warmupDurationMinutes: 0
+testDurationMinutes: 1
+logIntervalMillis: 1000
+```
+
+#### Producer configuration
+```
+value.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
+acks: all
+batch.size: 65536
+bootstrap.servers: 10.0.0.120:9092,10.0.1.103:9092
+key.serializer: org.apache.kafka.common.serialization.StringSerializer
+linger.ms: 1
+```
+
+#### Consumer configuration
+```
+key.deserializer: org.apache.kafka.common.serialization.StringDeserializer
+value.deserializer: org.apache.kafka.common.serialization.ByteArrayDeserializer
+enable.auto.commit: true
+bootstrap.servers: 10.0.0.82:9092
+auto.offset.reset: earliest
+```

-## Build
+### Cost estimation
+Estimating costs is challenging because cloud service usage is hard to predict: implementations differ from product to product, which makes usage costs difficult to forecast precisely. Nonetheless, by grouping the storage implementations into a few categories and adopting a baseline assumption, we can compute usage costs in a reasonably fair way.

-Requirements:
+Based on the cloud services they depend on, the storage implementations of current Kafka Providers fall into the following categories:
+- Rely on both block storage and object storage: AutoMQ, Confluent Tiered Storage, Redpanda
+- Rely solely on object storage: StreamNative USAR, WarpStream
+- Rely solely on block storage (cloud disks): Apache Kafka®
+
+For storage usage on cloud services, we assume that providers sharing the same storage model, if reasonably well implemented, show a similar relationship between write traffic and usage; hence a single Infracost usage template per storage model is sufficient.

-* JDK 8
-* Maven 3.8.6+
+Another key factor in usage is the number of replicas. For the same write traffic, the replica count directly determines the required storage capacity and the actual write traffic. For example, AutoMQ maintains high availability with a single replica; compared with Apache Kafka®'s three replicas, this removes two extra copies of write traffic and cuts the required storage space by two-thirds. Consequently, when estimating costs we adjust the usage figures according to the number of replicas each Kafka Provider needs: for the same workload, AutoMQ needs 6 TB of storage where Apache Kafka® needs 18 TB.
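+
+To make these assumptions concrete, the sketch below shows what an Infracost usage file for an object-storage-backed provider could look like. The resource address and figures are illustrative only: they apply the 10 MB/s average write traffic and the 31.25 PUTs / 12.5 GETs per GB assumptions above (10 MB/s is roughly 25,000 GB per month, i.e. about 790,000 PUTs and 320,000 GETs per month).
+
+```
+# Illustrative Infracost usage sketch (infracost-usage.yml); the resource address is hypothetical.
+version: 0.1
+resource_usage:
+  aws_s3_bucket.data_bucket:
+    standard:
+      storage_gb: 6000                  # e.g. ~6 TB retained by a single-replica provider
+      monthly_tier_1_requests: 790000   # PUT/COPY/POST/LIST requests per month at 10 MB/s
+      monthly_tier_2_requests: 320000   # GET and other requests per month at 10 MB/s
+```
+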
-Common build actions:
-| Action | Command |
-|---------------------------------|------------------------------------------|
-| Full build and test | `mvn clean verify` |
-| Skip tests | `mvn clean verify -DskipTests` |
-| Skip Jacoco test coverage check | `mvn clean verify -Djacoco.skip` |
-| Skip Checkstyle standards check | `mvn clean verify -Dcheckstyle.skip` |
-| Skip Spotless formatting check | `mvn clean verify -Dspotless.check.skip` |
-| Format code | `mvn spotless:apply` |
-| Generate license headers | `mvn license:format` |
+### Dependent Action Secrets
+- AUTOMQ_ACCESS_KEY: AWS Access Key
+- AUTOMQ_SECRET_KEY: AWS Secret Key
+- INFRA_COST_API_KEY: [Infracost](https://www.infracost.io/) API Key, obtainable from Infracost
+- SSH_PRIVATE_KEY: SSH Private Key, embedded directly in the secrets
+- SSH_PUBLIC_KEY: SSH Public Key, embedded directly in the secrets
+- TF_BACKEND_BUCKET: S3 Bucket for storing the Terraform State
+- TF_BACKEND_KEY: S3 Key for storing the Terraform State

-## License
+## Comparison Report Cycle
+We plan to generate a comparison report weekly.

-Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
+## TODO
+- Support unified management of each Kafka Provider's templates, ansible.cfg, and similar files, so they do not have to be duplicated when a new Kafka Provider is added.
+- The data disk device name is currently hard-coded as /dev/nvme1n1, which may not be correct for every instance type.
\ No newline at end of file
diff --git a/README_zh.md b/README_zh.md
new file mode 100644
index 0000000..b339d74
--- /dev/null
+++ b/README_zh.md
@@ -0,0 +1,117 @@
+## Kafka Provider Comparison
+
+Kafka Provider Comparison (KPC) 是基于 OpenMessaging Benchmark 的代码构建的公开的 Kafka 对比平台。Kafka API
+当前已经成为流领域的事实标准。近些年来,也出现了很多新的 Kafka API Compatible 的流系统(下文简称 Kafka Provider)。
+构建该对比平台的初衷并不是希望决出谁是最好的 Kafka 流系统,而是提供一个**公平**、**客观**、**开放**的对比环境,生成客观、公平的 Kafka
+流系统对比报告。对比报告会包含多个对比维度,包括延迟、成本、弹性、吞吐等等。不同产品的设计架构和 trade-off 不同,自然也会在不同的对比维度上有不同的表现。这份客观的对比结果对于用户做技术选型也将会是非常有帮助的。
+
+## Supported Kafka Providers
+
+* [AutoMQ](https://www.automq.com)
+* [Apache Kafka](https://kafka.apache.org)
+
+## 对比平台执行逻辑
+
+对比平台的所有执行逻辑封装在 Github Actions 中,通过 Github Actions 的 Workflow 来触发对比任务的执行。对比任务的执行逻辑如下:
+
+1. Github Actions 满足定时触发条件,触发 Workflow 执行。
+2. Workflow 中包含的多个 Kafka Provider 的 Benchmark 流程会并行执行。
+3. 每个 Benchmark Provider 流程中包含如下几个顺序执行的子阶段,会依次执行。不同的 Kafka Provider 会同时进行评测。
+   1. Install: 在 AWS 上根据 Terraform 配置文件初始化云资源、检出代码、安装依赖,然后利用 ansible playbook 安装 Kafka Provider。该阶段也会根据 Terraform 的配置文件计算成本,作为后续最终 Comparison Report 的一部分。
+   2. Benchmark: 依赖 Install 阶段并且在其完成后才会触发执行。该阶段主要是利用 Terraform Output 的信息远程登入到云上的 Client 机器,执行 OpenMessaging Benchmark 测试。
+   3. Generate Report: 在 Github Runner 机器上会拷贝云端 Client 执行的 Benchmark 结果文件,解析其中内容生成最终 Report 内容,并且展示在 [issue-1](https://github.com/AutoMQ/kafka-provider-comparison/issues/1)
+   4. Uninstall: 依赖 Benchmark 阶段并且在其完成后才会触发执行。该阶段会清理云资源,包括删除云上的 Client 机器、删除云上的 Kafka Provider 集群。
+
+
+## Benchmark Report 说明
+一份完整的 Benchmark Report 会包含如下内容:
+- Report Generated: Report 的生成时间。根据该生成时间,你可以从开源仓库的 Github Actions 当中看到具体 Workflow 的执行细节,包括 cost 是如何计算的,Benchmark 的输出日志等。
+- Workload Configuration: 从 OpenMessaging Benchmark 的运行日志中提取出来的 Workload 配置信息。为了保证对比的公平性,针对所有的 Kafka Provider 我们均会采用完全相同的 Workload、Producer、Consumer 配置。
+- Producer Configuration: 生产者配置,所有 Kafka Provider 的 Producer 配置均相同。
+- Consumer Configuration: 消费者配置,所有 Kafka Provider 的 Consumer 配置均相同。
+- Topic Configuration: Topic 配置,所有 Kafka Provider 的 Topic 配置均相同。
+- Replication Factor: 复制因子,所有 Kafka Provider 的复制因子均相同。
+- Average Throughput: 整个 Benchmark 过程中的平均吞吐量,单位 MB/s
+- E2E LatencyAvg(ms): 整个 Benchmark 过程中的平均端到端延迟,单位 ms
+- E2E P95 Latency(ms): 整个 Benchmark 过程中的 95% 端到端延迟,单位 ms
+- E2E P99 Latency(ms): 整个 Benchmark 过程中的 99% 端到端延迟,单位 ms
+- Baseline Cost: 该 Kafka Provider 的基线成本(不包含用量的 IaaS 云服务成本),单位 USD。基于 [Infracost](https://www.infracost.io/) 分析 Terraform 的配置文件询价得出。
+- Usage Cost: 该 Kafka Provider 的云资源用量成本,单位 USD。基于 [Infracost](https://www.infracost.io/) 分析 Terraform 的配置文件和用量配置询价得出。用量是基于 `infracost` 目录下的 infracost usage 来进行计算的。例如,针对 AutoMQ 我们根据实际的生产经验得出每 GB 的写入流量对应 31.25 PUTs 和 12.5 GETs。因此会根据固定的平均写入流量 10MB/s 计算出每秒的 API 调用数量来计算成本。关于这块成本估算逻辑,我们会在后续的章节中详细说明。
+- Total Cost: 该 Kafka Provider 的总成本,单位 USD。其值等于基线成本加上用量成本。
+
+## 如何贡献
+假设你的 Kafka provider 名称为 `foo`,则你将新建如下内容来将 `foo` 纳入比较列表:
+在根目录下新建 `driver-foo` 模块,该模块 `/deploy/aws-cn` 目录下将包含如下关键文件:
+ - provision-kafka-aws.tf: 定义了 Terraform 资源
+ - var.tfvars: 可以自定义机型、磁盘大小、broker/server 数量等配置
+
+> Tips: 当前仅支持在 aws-cn 进行对比,后续会支持更多云厂商。测试允许用户使用不同数量和规格的机器,但是要求满足最低的平均吞吐要求,否则结果将不予以展示。使用较高的机器规格会提高性能表现,但是也会导致成本的提升。
+
+### 固化的 Workload
+为了保证对比的公平性,我们固化了一套比较有代表性的 Workload、Producer、Consumer 配置 `tail-read-500m.yaml`。该配置支持产生理论峰值 500 MB/s 的写入流量。
+
+#### Workload 配置
+```
+name: 1-topic-1000-partitions-4kb-4p4c-500m
+
+topics: 1
+partitionsPerTopic: 1000
+messageSize: 4096
+payloadFile: "payload/payload-4Kb.data"
+subscriptionsPerTopic: 1
+consumerPerSubscription: 4
+producersPerTopic: 4
+producerRate: 128000
+consumerBacklogSizeGB: 0
+warmupDurationMinutes: 0
+testDurationMinutes: 1
+logIntervalMillis: 1000
+```
+
+#### Producer 配置
+```
+value.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
+acks: all
+batch.size: 65536
+bootstrap.servers: 10.0.0.120:9092,10.0.1.103:9092
+key.serializer: org.apache.kafka.common.serialization.StringSerializer
+linger.ms: 1
+```
+
+#### Consumer 配置
+```
+key.deserializer: org.apache.kafka.common.serialization.StringDeserializer
+value.deserializer: org.apache.kafka.common.serialization.ByteArrayDeserializer
+enable.auto.commit: true
+bootstrap.servers: 10.0.0.82:9092
+auto.offset.reset: earliest
+```
+
+### 成本估算
+成本估算的难点在于云服务用量的估算。不同产品的实现存在差异,导致用量成本很难精确估算。但是,通过将存储的实现分成如下几种模式并且配合一个前提假设,我们即可相对公平地来计算用量成本。
+
+当前 Kafka Provider 的存储实现依据依赖的云服务总体可以分为如下几种模式:
+- 同时依赖云盘和对象存储: AutoMQ、Confluent Tiered Storage、Redpanda
+- 只依赖对象存储: StreamNative USAR、WarpStream
+- 只依赖云盘: Apache Kafka
+
+关于云服务存储上的用量,我们只要假设 `在同一种存储模式上,假设大家都是相对优的实现,则写入流量与用量之间的关系是比较接近的`,这样我们针对一种存储模式,只要使用一种 infracost 用量模板即可。
+
+用量的另外一个关键因子是副本数。同一份写入流量,副本数会直接影响实际需要的存储空间以及写入流量。像 AutoMQ 可以使用单副本保证高可用,相比 Apache Kafka 则会减少 2 份写入流量,存储成本上也会减少三分之二的存储空间。因此,计算成本时,根据 Kafka Provider 所需要的副本数,我们也会对用量成本进行相应的调整。例如 AutoMQ 需要 6T 的存储空间,而 Apache Kafka 就需要 18T 的存储空间。
+
+
+### 依赖的 Action Secrets
+- AUTOMQ_ACCESS_KEY: AWS Access Key
+- AUTOMQ_SECRET_KEY: AWS Secret Key
+- INFRA_COST_API_KEY: Infracost API Key. 可以从 [Infracost](https://www.infracost.io/) 获取
+- SSH_PRIVATE_KEY: SSH Private Key,直接固化在 secrets 当中
+- SSH_PUBLIC_KEY: SSH Public Key,直接固化在 secrets 当中
+- TF_BACKEND_BUCKET: 存放 Terraform State 的 S3 Bucket
+- TF_BACKEND_KEY: 存放 Terraform State 的 S3 Key
+
+## 对比报告生成周期
+我们计划每周生成一份对比报告。
+
+## TODO
+- 支持将 Kafka Provider 的 templates、ansible.cfg 等统一管理,避免新增 Kafka Provider 时需要重复编写这些文件。
+- 当前支持的设备名采用固化的 /dev/nvme1n1,可能不准确。
\ No newline at end of file