docs: add stateless redis disruptor proposal #331

roobre · 2023-09-06T13:30:14Z

Description

This PR adds a proposal for a Redis disruptor. Details about the use cases and the proposed implementations can be found in the added design doc.

docs/01-development/design-docs/003-stateless-redis-disruptor.md

pablochacin

As I mentioned in my comments, I would prefer that we explore the stateful proxy as the goal and treat the stateless proxy as a PoC stage toward that goal within the implementation plan, rather than as an alternative implementation.

The rationale is that the project should not aim for simple implementations but for a meaningful developer experience, and given the limitations of the stateless proxy, I'm not sure we would want to offer it to users.

pablochacin · 2023-09-13T15:31:58Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+## Background
+
+Caching services like Redis are a common way to improve the performance of distributed systems, but sometimes make difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as increase of latency, or unexpected miss rate increase can affect a distributed system in qualitative ways and lead to catastrophic failure. 


Suggested change

-Caching services like Redis are a common way to improve the performance of distributed systems, but sometimes make difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as increase of latency, or unexpected miss rate increase can affect a distributed system in qualitative ways and lead to catastrophic failure.
+Caching services like Redis are a common way to improve the performance of applications, but sometimes it is difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as an increase in latency, or unexpected miss rate increase can affect a system in significant ways and lead to catastrophic failure.

pablochacin · 2023-09-13T15:40:43Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+## Problem statement
+
+A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of distributes systems when using common patterns such as caching. A common example of a metastable failure is system that is responding well to a certain load thanks to warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.


Suggested change

-A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of distributes systems when using common patterns such as caching. A common example of a metastable failure is system that is responding well to a certain load thanks to warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.
+A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of applications when using caching. In this scenario, the application is responding well to a certain load thanks to a warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.

pablochacin · 2023-09-13T15:41:37Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+## Goals
+
+Add baseline redis faulting functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:


Suggested change

-Add baseline redis faulting functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:
+Add Redis fault injection functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:

pablochacin · 2023-09-13T15:58:52Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+Without the requirement of being able to correlate responses with the requests that originated them, a RESP proxy can be made stateless. This reduces the complexity at the cost of, as expected, not being able to correlate those responses. However, it should still be possible to meet the goals above with a stateless proxy.
+
+A stateless RESP proxy accepts connections from Redis clients. It will read messages sent by clients, parse them, and decide if any action is necessary, such as modifying the request, or delaying it. It simply passes through responses from the server back to the client, without needing to decode them. A stateless proxy always needs to forward requests, modified or not, to the upstream server. As it is not aware of the flow of responses, it should be compatible with server pushes without needing any additional logic. 


From what I understand from the description of the fault injection in the following sections, I think this approach is rather limiting:

- It has to change the keys in the upstream requests instead of intercepting and modifying the responses. This may not have any side effects, but I still find it "hacky".
- It does not allow latency per command, only per message (as I understand it, a message can contain multiple commands).

I would like to evaluate the complexity of an alternative approach that is aware of the responses.
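
To make the first point concrete, here is a minimal sketch of what request-side key rewriting could look like in a stateless proxy. It is not taken from the design doc: the function names and the `__miss__:` prefix are hypothetical, and it assumes each client message is a single RESP array of bulk strings (no inline commands, no pipelining).

```python
from __future__ import annotations

# Sketch only: all names and the key prefix below are hypothetical,
# chosen for illustration and not part of the proposal.


def parse_resp_command(data: bytes) -> list[bytes]:
    """Parse one RESP array of bulk strings (a client command) into its parts."""
    if not data.startswith(b"*"):
        raise ValueError("expected a RESP array")
    header, _, rest = data.partition(b"\r\n")
    count = int(header[1:])
    parts = []
    for _ in range(count):
        if not rest.startswith(b"$"):
            raise ValueError("expected a bulk string")
        length_line, _, rest = rest.partition(b"\r\n")
        length = int(length_line[1:])
        parts.append(rest[:length])
        rest = rest[length + 2:]  # skip the payload and its trailing CRLF
    return parts


def encode_resp_command(parts: list[bytes]) -> bytes:
    """Encode command parts back into a RESP array of bulk strings."""
    out = b"*%d\r\n" % len(parts)
    for part in parts:
        out += b"$%d\r\n%s\r\n" % (len(part), part)
    return out


def force_miss(data: bytes, prefix: bytes = b"__miss__:") -> bytes:
    """Rewrite the key of a GET so the upstream server most likely misses.

    Every other command is re-encoded unchanged. The response is never
    inspected, which is what keeps the proxy stateless.
    """
    parts = parse_resp_command(data)
    if len(parts) == 2 and parts[0].upper() == b"GET":
        parts[1] = prefix + parts[1]
    return encode_resp_command(parts)
```

With this approach the upstream server receives a request for a key the client never asked for, which is the side effect I am uneasy about. A response-aware proxy could instead forward the original key and replace the reply with a nil.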

pablochacin · 2023-09-13T16:00:37Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+### Advantages
+
+- Easier to implement and less error-prone than a stateful proxy


I think that it is valid to implement a PoC using the stateless approach, but I would prefer that we address a full-fledged implementation in this design document.

pablochacin · 2023-09-13T16:01:28Z

docs/01-development/design-docs/003-stateless-redis-disruptor.md

+
+#### Disadvantages
+
+- Code is more complex, requiring more development time and increasing the surface for bugs to appear.


Even though this is a valid concern, I think we should explore this option and leave the stateless proxy as a PoC toward the final goal.

pablochacin · 2023-09-13T16:24:22Z

Regarding the complexity of implementing the Redis protocol, we can explore and learn from existing projects:

docs: add stateless redis disruptor propoosal

e296f16

dgzlopes reviewed Sep 11, 2023

docs/01-development/design-docs/003-stateless-redis-disruptor.md (outdated, resolved)

add @dgzlopes as approver

8079740

roobre force-pushed the proposal-redis branch from 683839e to 8079740 on September 12, 2023 09:44

pablochacin changed the title ~~docs: add stateless redis disruptor propoosal~~ docs: add stateless redis disruptor proposal Sep 12, 2023

pablochacin requested changes Sep 13, 2023
