How to Build a High Quality Testing Infrastructure

lyx edited this page Jan 17, 2025 · 1 revision

Author Information: Xinyu Zhou, Co-Founder & CTO of AutoMQ

AutoMQ has always followed a cloud-native philosophy, making deep use of cloud-native capabilities to separate storage from compute. Compared with Apache Kafka, we have achieved a tenfold cost advantage and qualitative leaps in operational efficiency and scalability. The stable operation of public clouds is therefore closely tied to our business.

Today, I read a thought-provoking piece from an industry leader about the recent run of public cloud outages, titled "Public Clouds that You Use Have Never Been Tested". The core idea of the article is that "public clouds are untestable." I hold the contrary opinion: cloud services are software too, and software engineering has evolved to offer diverse and rich testing methodologies. Moreover, public clouds have the advantage of massive production traffic, so canary/grey releases can catch bugs that slipped through other testing stages before a new version is fully rolled out.

If cloud services can be tested, why do public cloud outages still occur so frequently? Drawing on my experience at cloud providers, the probable reason behind the two recent IAM-related incidents at two major clouds is a lack of investment. IAM products do not generate direct revenue, and at revenue-driven public cloud providers the predicament of the IAM team is apparent. My previous role involved integrating IAM into our core data path for authentication, which led to numerous interactions with the IAM team. Seeing such a small team support such crucial operations was always a cause for concern. Thus, in keeping with Murphy's Law, the IAM incidents at the two leading domestic cloud providers were at once coincidental and inevitable.

Choose cloud services with the highest investment and largest scale from providers

Is the IAM issue an isolated incident or a widespread problem? Reviewing the product catalogs of cloud providers, which list hundreds of proprietary products, and considering the number of R&D personnel involved, it is easy to conclude that "the investment in testing resources for numerous cloud products is insufficient." Hence, from day one, AutoMQ set two principles [1], one of which was "choose cloud services with the highest investment and largest scale from providers." These services typically have the highest maturity and are concentrated at the IaaS layer: computing, storage, and networking products. Databases, of course, are also a critical battleground for cloud providers.

How to build a high-quality testing infrastructure for AutoMQ

Returning to AutoMQ, our team’s past experiences have profoundly influenced our understanding of the importance of building a solid testing system. If you've ever managed thousands of production nodes under severe testing resource constraints, you'll understand the trepidation we felt then and our resolve now to ensure the quality of AutoMQ.

For an enterprise-level software service, the importance of a comprehensive testing infrastructure is self-evident, and it shows in three dimensions:

  • For the software itself, it's a powerful means of ensuring quality, the foundation for rapid, stable, and efficient iteration.

  • For the team, it lays the groundwork for practicing excellent engineering culture, automating everything, and is key to enhancing development fluidity, confidence, and job satisfaction.

  • For users, it serves as an essential gateway to understanding the software and its capabilities. Some testing features can be productized (such as letting users perform fault injection themselves), and some testing capabilities can even be showcased: gRPC, for example, has made its performance testing dashboard public [2].

I will also take this opportunity to introduce some of the testing capabilities we already have at AutoMQ.

Unit Testing

Unit testing is mandatory for all modules. Tests follow a "mock everything" approach and rely on libraries including JUnit, Mockito, and Awaitility [3]. Taking AutoMQ's core module, S3Stream, as an example, current unit tests cover approximately 80% of classes and over 60% of lines, with ongoing efforts to further improve these metrics.
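
As a rough illustration of the "mock everything" idea, the sketch below replaces an object storage client with a mock so that a single class can be tested in isolation. It is written in Python with the standard library's unittest.mock rather than the JUnit/Mockito stack named above, and StreamWriter, put_object, and the key name are hypothetical names invented for this example, not AutoMQ APIs.

```python
from unittest.mock import Mock


class StreamWriter:
    """Hypothetical component under test: buffers records, then flushes them to an object store."""

    def __init__(self, object_store, flush_threshold):
        self.object_store = object_store
        self.flush_threshold = flush_threshold
        self.buffer = []

    def append(self, record: bytes) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # One upload per flush; the key is illustrative only.
        self.object_store.put_object("stream-0/segment", b"".join(self.buffer))
        self.buffer = []


def test_flushes_once_threshold_is_reached():
    store = Mock()  # stands in for the real object storage client
    writer = StreamWriter(store, flush_threshold=2)

    writer.append(b"a")
    store.put_object.assert_not_called()  # still below the threshold

    writer.append(b"b")
    store.put_object.assert_called_once_with("stream-0/segment", b"ab")
    assert writer.buffer == []


if __name__ == "__main__":
    test_flushes_once_threshold_is_reached()
```

Because the store is a mock, the test runs without network access and can assert on exactly which calls the component made.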

Integration Testing

Integration testing exercises all or part of the software modules together with their external dependencies. Leading external dependencies usually provide their own integration test support, for example, Test::RedisServer. Now that software is widely containerized, Testcontainers makes integration testing very convenient, since it supports most stateful software. For instance, relying on Adobe's S3Mock component [4], we developed a series of integration test cases that run S3Stream against its object storage dependency, covering scenarios such as Compaction, concurrent Append, concurrent Fetch, hot/cold reads, dynamic Payload, Stream operations, Cache eviction, and dynamic configuration. Every Pull Request that modifies S3Stream must pass the corresponding unit and integration tests.
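
To give a feel for what a "concurrent Append" case checks, here is a minimal Python sketch. It assumes a hypothetical Stream class that assigns offsets and writes one object per record. An in-memory store stands in for the S3-compatible endpoint that S3Mock or Testcontainers would provide in a real integration test, purely to keep the example runnable. None of these names come from S3Stream.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryObjectStore:
    """Stand-in for an S3-compatible endpoint (assumption made for this sketch)."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put(self, key, data):
        with self._lock:
            self._objects[key] = data

    def get(self, key):
        with self._lock:
            return self._objects[key]


class Stream:
    """Hypothetical stream: assigns a unique offset per record and stores it as an object."""

    def __init__(self, store):
        self._store = store
        self._next_offset = 0
        self._lock = threading.Lock()

    def append(self, payload: bytes) -> int:
        with self._lock:  # offset assignment must be atomic
            offset = self._next_offset
            self._next_offset += 1
        self._store.put(f"stream-0/{offset}", payload)
        return offset

    def fetch(self, offset: int) -> bytes:
        return self._store.get(f"stream-0/{offset}")


def concurrent_append_case(writers=8, records_per_writer=100):
    stream = Stream(InMemoryObjectStore())

    def write(writer_id):
        batch = []
        for i in range(records_per_writer):
            payload = f"{writer_id}-{i}".encode()
            batch.append((stream.append(payload), payload))
        return batch

    with ThreadPoolExecutor(max_workers=writers) as pool:
        results = [pair for batch in pool.map(write, range(writers)) for pair in batch]

    # Offsets must be unique and contiguous, and every record must read back intact.
    offsets = sorted(offset for offset, _ in results)
    assert offsets == list(range(writers * records_per_writer))
    assert all(stream.fetch(offset) == payload for offset, payload in results)
    return len(results)


if __name__ == "__main__":
    concurrent_append_case()
```

The two assertions at the end capture the property of interest: no offset is lost or duplicated under concurrency, and every appended payload can be fetched back unchanged.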

E2E Testing

End-to-End Testing (E2E Test) is a software testing method designed to simulate real user scenarios to check the integrity and functionality of the entire system. E2E testing is usually automated, using automated testing tools and processes to simulate user behavior and operations, verifying the system's functionality, performance, and reliability.

Thanks to AutoMQ's separation of storage and compute architecture, we've reused all of Apache Kafka's compute layer code, achieving 100% compatibility that allows AutoMQ to fully utilize Apache Kafka's E2E testing infrastructure. Kafka uses a matrix-based testing method that supports running a test case across different Kafka cluster sizes and even different cluster configurations, maximizing the value of each test case. For example, a Producer throughput test case will be run on a cluster with 5 nodes, with the first matrix providing 4 different cluster configuration combinations and the second matrix offering 2 combinations, totaling 6 scenarios for this test case.
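
The arithmetic behind "4 + 2 = 6 scenarios" can be sketched with plain Python. Kafka's system tests express such matrices with the ducktape test framework; the snippet below does not use ducktape, and the parameter names and values are hypothetical, chosen only so that the first matrix expands to 4 combinations and the second to 2.

```python
from itertools import product


def expand_matrix(**params):
    """Return every combination of the given parameter values as a list of dicts."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*(params[n] for n in names))]


# Hypothetical parameters: 2 x 2 = 4 combinations in the first matrix, 2 in the second.
first_matrix = expand_matrix(acks=[1, -1], compression_type=["none", "snappy"])
second_matrix = expand_matrix(security_protocol=["PLAINTEXT", "SSL"])

# Two separate matrices contribute the union of their combinations, not the product.
scenarios = first_matrix + second_matrix


def run_producer_throughput_case(num_nodes, **config):
    """Placeholder for the real test body; it only records what would be run."""
    return {"num_nodes": num_nodes, **config}


runs = [run_producer_throughput_case(num_nodes=5, **cfg) for cfg in scenarios]
print(len(runs))  # 6
```

The point of the design is that one test body is written once and then exercised under every listed configuration.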

Because AutoMQ is built on the KRaft version of Kafka, E2E test cases related to ZooKeeper mode are excluded. We have passed the remaining 500+ test cases and run them periodically to promptly detect regressions.

Performance Testing

For data-intensive software such as AutoMQ, throughput and latency are crucial performance metrics. AutoMQ conducts tests using the OpenMessaging Benchmark framework [5] and also uses it to compare technical indicators with other products. Shown below is a latency comparison chart of AutoMQ under a specific traffic model, with more technical indicators detailed in the performance white paper [6].

[Figure: latency comparison of AutoMQ under a specific traffic model]

Performance testing is not only about running tests during release cycles or when comparing against competitors. More crucially, it is about maintaining a performance baseline and regularly (e.g., daily) running regression tests against the mainline code. This ensures timely detection of any commit that degrades performance. Without it, performance can deteriorate over successive iterations, and it becomes hard to trace back to when the decline started. AutoMQ has not yet fully implemented automated regression against performance baselines, but we will keep you updated on future developments.
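
A baseline regression check can be as simple as comparing the latest run with stored baseline numbers and flagging anything that is worse by more than a tolerance. The following Python sketch shows the idea. The metric names, the 5% tolerance, and the sample numbers are assumptions for illustration, not AutoMQ's actual baseline or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    throughput_mb_s: float  # higher is better
    p99_latency_ms: float   # lower is better


def find_regressions(baseline: Metrics, current: Metrics, tolerance: float = 0.05):
    """Describe each metric that is worse than the baseline by more than `tolerance`."""
    regressions = []
    if current.throughput_mb_s < baseline.throughput_mb_s * (1 - tolerance):
        regressions.append(
            f"throughput {current.throughput_mb_s} MB/s < baseline {baseline.throughput_mb_s} MB/s"
        )
    if current.p99_latency_ms > baseline.p99_latency_ms * (1 + tolerance):
        regressions.append(
            f"p99 latency {current.p99_latency_ms} ms > baseline {baseline.p99_latency_ms} ms"
        )
    return regressions


if __name__ == "__main__":
    baseline = Metrics(throughput_mb_s=500.0, p99_latency_ms=10.0)
    nightly = Metrics(throughput_mb_s=450.0, p99_latency_ms=12.0)
    for line in find_regressions(baseline, nightly):
        print("REGRESSION:", line)
```

Run daily against the mainline, a check like this narrows a performance drop down to the commits merged since the previous run.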

Soak Testing

Common coding defects are usually identified and rectified during the aforementioned testing phases. However, for elusive corner cases, long-term soak or endurance testing is essential. By testing the system over extended periods under various traffic models, we can unearth rare software issues, such as memory leaks.
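
One way a soak test can surface a memory leak is to sample memory usage over many hours and look at the long-term trend rather than any single reading. The Python sketch below fits a least-squares slope to (hours, used MB) samples. The threshold of 1 MB per hour and the synthetic samples are assumptions made for this example.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, used_mb) samples, in MB per hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def looks_like_leak(samples, max_growth_mb_per_hour=1.0):
    """Flag a run whose memory usage keeps climbing faster than the allowed rate."""
    return growth_per_hour(samples) > max_growth_mb_per_hour


# Synthetic 48-hour runs: one oscillates around a fixed level, one grows steadily.
stable = [(h, 512 + (h % 3)) for h in range(48)]
leaking = [(h, 512 + 5 * h) for h in range(48)]

if __name__ == "__main__":
    print(looks_like_leak(stable), looks_like_leak(leaking))
```

A trend-based check tolerates the normal sawtooth pattern of garbage collection, which a fixed upper bound on a single sample would not.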

For software like AutoMQ, which combines distributed systems, high concurrency, storage, and multi-cloud characteristics, testing is highly complex. For instance, distributed timing issues may only surface under extreme conditions. To enhance our endurance testing capabilities, AutoMQ has developed the Marathon framework, which hides the details of cluster deployment, scaling, fault injection, and multi-cloud deployment so that developers can focus on crafting endurance test scenarios. The following table lists our current 24/7 endurance test scenarios.

| Scenario | Type | Test Purpose |
| --- | --- | --- |
| Extreme stress scenarios | High traffic | Validate that no unexpected software defects or failures appear under extreme load, such as traffic, TPS, or metadata scale exceeding expectations. |
| Extreme stress scenarios | High TPS | Validate that no unexpected software defects or failures appear under extreme load, such as traffic, TPS, or metadata scale exceeding expectations. |
| Extreme stress scenarios | Massive metadata | Validate that no unexpected software defects or failures appear under extreme load, such as traffic, TPS, or metadata scale exceeding expectations. |
| Dependency unavailable for an extended period | EBS unavailable | When a dependency is unavailable for an extended period and disaster recovery is not feasible, various unexpected conditions may arise, such as persistent queuing, increased memory pressure, and repeated connection attempts, potentially leaving the software unable to recover even after the dependency is restored. |
| Dependency unavailable for an extended period | S3 unavailable | When a dependency is unavailable for an extended period and disaster recovery is not feasible, various unexpected conditions may arise, such as persistent queuing, increased memory pressure, and repeated connection attempts, potentially leaving the software unable to recover even after the dependency is restored. |
| Dependency unavailable for an extended period | Network unavailable | When a dependency is unavailable for an extended period and disaster recovery is not feasible, various unexpected conditions may arise, such as persistent queuing, increased memory pressure, and repeated connection attempts, potentially leaving the software unable to recover even after the dependency is restored. |
| Dependency unavailable for an extended period | Continuous JVM GC | When a dependency is unavailable for an extended period and disaster recovery is not feasible, various unexpected conditions may arise, such as persistent queuing, increased memory pressure, and repeated connection attempts, potentially leaving the software unable to recover even after the dependency is restored. |
| Real-world business scenario simulation | Auto-Scaling | Continuously trigger cluster scaling to verify that, under frequent scaling, no change in the number of cluster nodes results in unexpected behavior. |
| Real-world business scenario simulation | Auto-Balancing | Partition traffic changes constantly, and the Auto-Balancing component continuously reassigns partitions to keep traffic balanced at the node level. Verify that each reassignment works correctly under frequent partition scheduling. |
| Real-world business scenario simulation | Failover | Cloud failures can occur at any moment. Verify that the cluster fails over quickly when any node fails. |
| Real-world business scenario simulation | Cold read | Repeatedly cold-read a data segment to ensure the completeness and reliability of data on S3 throughout the entire Compaction lifecycle. |
| Real-world business scenario simulation | Data lifecycle | Frequently modify the data lifecycle at the Topic level, verify the accuracy of metadata and data on S3, and ensure there are no data leaks. |
| Real-world business scenario simulation | Special message validation | Simulate scenarios using idempotent producers, transactional messages, and compacted topics to ensure their sustained availability over long periods. |

These use cases typically run against a backdrop of elastic traffic, triggering automatic scaling of the cluster as demonstrated in the figure below, which shows an AutoMQ cluster frequently scaling up and down during an endurance test.

[Figure: an AutoMQ cluster frequently scaling up and down during an endurance test]

Chaos Testing

Fault injection testing is the indispensable final component of validating foundational software. Whether AutoMQ can perform disaster recovery as expected under ECS failure, severe network packet loss, EBS hangs, or S3 unavailability needs prolonged verification. We therefore combine fault injection with endurance testing: using the Chaos Mesh component [7], all test cases run in a cluster environment with random, periodic fault injections to verify that AutoMQ behaves as expected.

[Figure: per-node traffic during fault injection; each pair of adjacent vertical dashed lines marks the start and end of one injection]

As illustrated in the figure above, each pair of adjacent vertical dashed lines marks the start and end of a fault injection. When one node in the cluster encounters an issue, its traffic drops to zero, and its partitions are then moved to another node through failover, so traffic on that node rises continuously. After the fault is resolved, the Auto-Balancing component reassigns partitions to make the traffic distribution as balanced as possible.
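
The kind of check an endurance test might run around each injection can be sketched as follows: given samples of total cluster traffic and the time a fault started, measure how long it took for traffic to return to the expected level. This is a minimal Python sketch that does not call Chaos Mesh. The sample data, the 10% tolerance, and the function name are assumptions for illustration.

```python
def recovery_time(samples, fault_start, expected, tolerance=0.1):
    """Seconds from fault_start until total traffic is back within `tolerance` of `expected`.

    `samples` is a time-ordered list of (timestamp_seconds, total_traffic) pairs.
    Returns 0 if traffic never dipped, or None if it dipped and never recovered.
    """
    dipped = False
    for ts, total in samples:
        if ts < fault_start:
            continue
        healthy = total >= expected * (1 - tolerance)
        if not healthy:
            dipped = True
        elif dipped:
            return ts - fault_start
    return None if dipped else 0


# Hypothetical cluster-wide traffic (MB/s) sampled every 10 seconds; fault injected at t=20.
samples = [(0, 300), (10, 300), (20, 200), (30, 200), (40, 260), (50, 300), (60, 300)]

if __name__ == "__main__":
    print(recovery_time(samples, fault_start=20, expected=300))  # 30
```

A test case could then assert that the returned value is not None and stays below whatever failover time the scenario is expected to meet.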

Conclusion

Failures in public clouds will never cease. Each failure, if it provokes thought and serves as a reminder, motivates us to keep investing in software quality assurance. AutoMQ's current testing system consumes tens of thousands in cloud resources monthly to ensure that software defects are caught during development and that as few as possible escape to production. Of course, online failures are inevitable. AutoMQ already addresses many of Kafka's issues using cloud-native capabilities. Whether it can also handle the monitoring, detection, and recovery of online failures cost-effectively, across multi-cloud and multi-region BYOC deployments and with a cloud-native approach, will be discussed in our next article.

Finally, given that public cloud outages are inevitable, even though AutoMQ relies solely on IaaS-level cloud services, we will also share in subsequent articles how AutoMQ adopts a cloud-native approach to handle failures like ECS, EBS, S3, and AZ-level incidents.

References

[1]. AutoMQ Cloud-Native Solutions Explained: https://mp.weixin.qq.com/s/rmGoamqBnMPlrylDeSwgEA

[2]. gRPC Performance Dashboard: https://grafana-dot-grpc-testing.appspot.com/

[3]. Concurrency Scenario Unit Testing Tool: https://github.com/awaitility/awaitility

[4]. S3 Mock Component: https://github.com/adobe/S3Mock

[5]. AutoMQ Performance Testing Framework: https://github.com/AutoMQ/openmessaging-benchmark

[6]. AutoMQ Performance White Paper: https://docs.automq.com/docs/automq-s3kafka/CYxlwqDBHitThCkxSl2cePxrnBc

[7]. Chaos Mesh components: https://chaos-mesh.org/
