Skip to content

Commit

Permalink
#33: Documentation updates
Browse files Browse the repository at this point in the history
  • Loading branch information
jaccomoc committed Nov 26, 2023
1 parent 39839dd commit 9607cf1
Showing 1 changed file with 23 additions and 15 deletions.
38 changes: 23 additions & 15 deletions docs/_posts/2023-11-10-checkpoint-poc.md
Original file line number Diff line number Diff line change
@@ -1,25 +1,25 @@
---
layout: post
title: "Checkpoint Replication Proof Of Concept"
title: "Proof of Concept: 20,000 Payments per Second with Checkpoint Replication"
date: 2023-11-10 15:11:44 +1100
categories: blog
author: "James Crawford"
image: /assets/logo2-picture.jpg
---

Jactl 1.3.0 introduces the ability to checkpoint the current execution state of a script and have it restored and
[Jactl](https://jactl.io) 1.3.0 introduces the ability to checkpoint the current execution state of a script and have it restored and
resumed elsewhere after a failure.
In order to test this new feature, I implemented a proof-of-concept project that simulates a payment processing
system that accepts payment requests, performs some operations by interacting with multiple external systems and then
returns a response.
The payment processing checkpoints its state when required and if a failure occurs payments in flight will have their
The payment processing system checkpoints its state when required and if a failure occurs, payments in flight will have their
state restored and resumed to make sure that no payment is lost.

## Jactl Checkpoints

Jactl is a scripting language intended to be embedded into real-time Java applications and doesn't itself offer
any mechanism for persisting or replicating the checkpoint state.
Instead, it provides hooks that need to be implemented by the application if the checkpoints are to be persisted or
Jactl is a open source scripting language intended to be embedded into real-time Java applications.
It doesn't itself offer any mechanism for persisting or replicating the checkpoint state but,
instead, provides hooks that need to be implemented by the application if the checkpoints are to be persisted or
replicated.
These hooks allow the application to know when a new checkpoint has been generated for a running script and when a
running script instance has finished (which allows the application to then delete any checkpoints for that instance
Expand All @@ -28,7 +28,7 @@ There is also a hook for the application to use when recovering after a failure
instance and resume its execution from where it left off.

Since Jactl scripts never block (from a Java thread point of view), whenever a script invokes a function that would
otherwise block such as sending a request to a remote system and needing to wait for a response, or reading some
otherwise block, such as sending a request to a remote system and needing to wait for a response, or reading some
data from a database, Jactl captures the current execution state in a `Continuation` object (actually a chain of
`Continuation` objects; one per stack frame) and when the response is ready to be processed the execution state is
resumed.
Expand All @@ -46,29 +46,29 @@ the script instance state and continue execution from where it had been checkpoi
To implement a proof-of-concept of the checkpoint facility, I used the publish/subscribe mechanism provided by the
[NATS](https://nats.io) messaging infrastructure.
I wrote a generic Jactl Server that takes a directory of Jactl scripts and for each script listens on a NATS topic of
the same name.
the same name and then provided it with a Jactl script for processing payment requests.

NATS allows multiple instances of an application to listen on the same topic and will automatically spread the load
across those instances.
Clients can then send requests to a topic and NATS will forward it to one of the instances subscribed to that topic.

> NATS also supports reliable message streams (called JetStream) but this was not used for this project.
> Note that although NATS also supports reliable message streams (called JetStream), this was not used for this project.
> Since we are checkpointing our state when necessary, we rely on the checkpointing to provide reliability and use
> standard NATS messaging without any reliability guarantees.
All messages between the different components were encoded in JSON using the native JSON support provided by Jactl.

## Kubernetes

To make it easy to scale up multiple application instances, I deployed all the simulators (including the
load generator, the simulators for the external systems, and the payment processing simulator itself) in a Kubernetes
To make it easy to scale up multiple application instances, I deployed the payment processor instances and the simulators
(including the load generator and the simulators for the external systems) in a Kubernetes
cluster and captured metrics in Prometheus that I could then view in Grafana.

## Checkpoint Persistence, Replication, and Recovery

For the proof-of-concept I decided to write the checkpoints locally to volumes attached to each application instance.
The volumes are handled by Kubernetes and are attached to the pods (the application instances) when they are started.
If a pod dies, it will be restarted by Kubernetes, and it will be given the same persistent volume as the before.
If a pod dies, it will be restarted by Kubernetes, and it will be given the same persistent volume as before.

In order to provide additional fault tolerance, the checkpoints, as well as being written locally to the file system
belonging to that pod instance, are replicated to another pod instance where they are also stored in the file system
Expand Down Expand Up @@ -102,8 +102,9 @@ an Alias Lookup via a callout to another external system to find the destination
needs to be sent, sends the authorisation, and then responds back with the result.

The payment application does an early acknowledgement back to the requester once it has a token and has done the alias
lookup and we checkpoint our state at this point so that the caller knows that once it gets that early acknowledgement
that there is a guarantee that the payment will be processed even if a failure later occurs.
lookup and we checkpoint our state at this point.
The caller therefore knows that once it gets that early acknowledgement there is a guarantee that the payment
will be processed even if a failure later occurs.

We also checkpoint our state after getting the response from the bank.

Expand Down Expand Up @@ -136,10 +137,17 @@ As the graph shows, the impact of the pod failures on the throughput was pretty
![](/assets/checkpoint_poc_grafana2.png)


There is also a graph showing if there are any long-running payments (from the payment load generator's point of view).
On the dashboard shown there is also a graph of "Long Running Requests" showing if there are any long-running payments
(from the payment load generator's point of view).
This shows the number of payments where the early acknowledgement has been received but for which the final response
has still not been received after 2 seconds.
If there was a problem with recovering script execution state after a failure then we would expect to see this count
be non-zero.
Given that it remains at zero for the entire run it shows that all such payments were recovered and
completed successfully.

## Further Reading

For information about Jactl, see the Jactl documentation [web-site](https://jactl.io).

The project is hosted on [github](https://github.com/jaccomoc/jactl).

0 comments on commit 9607cf1

Please sign in to comment.