diff --git a/README.md b/README.md index a044d3e..b153336 100644 --- a/README.md +++ b/README.md @@ -14,24 +14,24 @@ Features: + Exactly-once delivery. + Smart and configurable retries. + Many integrations. -+ Compatible. You can use walnats to emit events for non-walnats services or consume event emitted by non-walnats services. The tools is flexible enough to adapt for any format of the messages you use. ++ Compatible. You can use walnats to emit events for non-walnats services or consume events emitted by non-walnats services. The tool is flexible enough to adapt to any format of the messages you use. ## Compared to other tools -Compared to other big Python frameworks for background jobs (like [celery](https://docs.celeryq.dev/en/stable/), [dramatiq](https://dramatiq.io/index.html), [rq](https://python-rq.org/), [huey](https://huey.readthedocs.io/en/latest/), and so on), the main difference from implementation perspective is that walnats is younger and so had opportunity to be designed around modern technologies right from the beginning. Namely, mypy-powered type safety, async/await powered concurrency, and nats-powered persistency and distribution. +Compared to other big Python frameworks for background jobs (like [celery](https://docs.celeryq.dev/en/stable/), [dramatiq](https://dramatiq.io/index.html), [rq](https://python-rq.org/), [huey](https://huey.readthedocs.io/en/latest/), and so on), the main difference from an implementation perspective is that walnats is younger and so had an opportunity to be designed around modern technologies right from the beginning. Namely, mypy-powered type safety, async/await-powered concurrency, and nats-powered persistency and distribution. And when compared to **all** other Python frameworks for background jobs (including new async/await-based ones like [arq](https://arq-docs.helpmanual.io/), [pytask-io](https://github.com/joegasewicz/pytask-io), and [aiotasks](https://github.com/cr0hn/aiotasks)), the main difference is that Walnats is event-driven. While in all these frameworks the job scheduling is conceptually a function call over the network, in walnats publishers instead emit events to which any subscribers can subscribe at any point. This approach is called "[tell, don't ask](https://wiki.c2.com/?TellDontAsk)". For example, when your webshop sends a parcel to a client, instead of directly calling `send_email` and `send_sms` actors like you'd do with Celery, with Walnats publisher will emit a single `parcel-sent` event, and that event will be delivered by Walnats to all interested actors. It gives you a few nice benefits: -1. Publisher makews only one network request. +1. Publisher makes only one network request. 1. When you add a new actor, you don't need to modify the publisher. That's especially cool for microservice architecture when the publisher and the actor can be different services owned by different teams. -1. It's easier to reason about. When you develop a microservice, you only need to know what events there are you can subscribe to and emit your own events without thinking too much how all other sefvices in the system work with these events. -1. It's easier to observe. Walnats directly translates events into Nats subject and actors into Nats JetStream consumers. So, any Nats observability tool will give you great insights on what's going on in your system. +1. It's easier to reason about. 
When you develop a microservice, you only need to know what events there are you can subscribe to and emit your own events without thinking too much about how all other services in the system work with these events. +1. It's easier to observe. Walnats directly translates events into Nats subjects and actors into Nats JetStream consumers. So, any Nats observability tool will give you great insights into what's going on in your system. -If you have a big distributed system, Walnats is for you. If all you want is to send emails in background from your Django monolith or a little hobby project, you might find another framework a better fit. +If you have a big distributed system, Walnats is for you. If all you want is to send emails in the background from your Django monolith or a little hobby project, you might find another framework a better fit. -Lastly, compared to you just taking [nats.py](https://github.com/nats-io/nats.py) and writing your service from scratch, Walnats does a better job at handling failures, load spikes, and corner-cases. Walnats is "[designed for failure](https://www.v-wiki.net/design-for-failure/)". Distributed systems are hard, and you shouldn't embark on this journey alone. +Lastly, compared to you just taking [nats.py](https://github.com/nats-io/nats.py) and writing your service from scratch, Walnats does a better job at handling failures, load spikes, and corner cases. Walnats is "[designed for failure](https://www.v-wiki.net/design-for-failure/)". Distributed systems are hard, and you shouldn't embark on this journey alone. ## Installation @@ -71,7 +71,7 @@ async def run() -> None: asyncio.run(run()) ``` -Create subscriber (a service that listens to events): +Create a subscriber (a service that listens to events): ```python import asyncio diff --git a/docs/actors.md b/docs/actors.md index 8ebd20e..2409929 100644 --- a/docs/actors.md +++ b/docs/actors.md @@ -11,7 +11,7 @@ async def send_email(user: User) -> None: ... ``` -That's it. The handler knows nothing about Nats, Walnats, or the publisher. That means, it's dead simple to test, refactor, or reuse with another framework. This pattern is known as [Functional core, imperative shell](https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell). The handler is the functional core, and walnats is the imperative shell that you don't need to test (at least not in the unit tests). +That's it. The handler knows nothing about Nats, Walnats, or the publisher. That means it's dead simple to test, refactor, or reuse with another framework. This pattern is known as [Functional core, imperative shell](https://www.destroyallsoftware.com/screencasts/catalog/functional-core-imperative-shell). The handler is the functional core, and walnats is the imperative shell that you don't need to test (at least not in the unit tests). ## Declare actors @@ -23,7 +23,7 @@ WELCOME_USER = walnats.Actor( ) ``` -That's all you need to get started. Below are the API docs covering different configuration options you can specify for an actor. They allow you to adjust actor's behavior in case if there are too many messages or a handler failure. We'll also talk about handling high load and failures in the [Subscriber](sub) section. So, you can skip the rest of the page for now if you're only getting started. +That's all you need to get started. Below are the API docs covering different configuration options you can specify for an actor. 
They allow you to adjust the actor's behavior in case there are too many messages or a handler failure. We'll also talk about handling high loads and failures in the [Subscriber](sub) section. So, you can skip the rest of the page for now if you're only getting started. ## API diff --git a/docs/alloy.md b/docs/alloy.md index 318acbd..1bc08d4 100644 --- a/docs/alloy.md +++ b/docs/alloy.md @@ -4,13 +4,13 @@ layout: default --- # Verification -Formal verification usually means that you describe the same algoithm twice (declaratively and imperatively), and then computer checks that both implementations are equivalent. And when it comes to distributed systems, you need not only to describe your system but also how it changes over time. There are currently 2 usable languages that can effectively describe changing systems: [TLA+](https://en.wikipedia.org/wiki/TLA%2B) and [Alloy 6](https://en.wikipedia.org/wiki/Alloy_(specification_language)). On this page, I use Alloy 6 because it's simpler, looks more like a programming language, great for describing relations, and has a nice visualizer. +Formal verification usually means that you describe the same algorithm twice (declaratively and imperatively), and then the computer checks that both implementations are equivalent. And when it comes to distributed systems, you need not only to describe your system but also how it changes over time. There are currently 2 usable languages that can effectively describe changing systems: [TLA+](https://en.wikipedia.org/wiki/TLA%2B) and [Alloy 6](https://en.wikipedia.org/wiki/Alloy_(specification_language)). On this page, I use Alloy 6 because it's simpler, looks more like a programming language, is great for describing relations, and has a nice visualizer. -This page provides a declarative model for a walnats-based system. You can use this model to generate possible failure scenarios or built on top of it verification of your business-specific logic. +This page provides a declarative model for a walnats-based system. You can use this model to generate possible failure scenarios or build verification of your business-specific logic on top of it. ## Learn Alloy -The good news is that there are tons of articles and publications about Alloy. Disproportinally more than this tool is actually used by sane people. The bad news is that Alloy 6 (released in 2021) was a huge release introducing a great support for working with time. Before the release, working with time required a lot of dirty hacks and workarounds, and it all was far from pretty. And most of models and tutorials using Alloy relied on time. Hence if you see a tutorial for Alloy using time and not updated after 2021, you can safely dicard it, it's useless now. +The good news is that there are tons of articles and publications about Alloy. Disproportionally more than this tool is actually used by sane people. The bad news is that Alloy 6 (released in 2021) was a huge release introducing great support for working with time. Before the release, working with time required a lot of dirty hacks and workarounds, and it all was far from pretty. And most of the models and tutorials using Alloy relied on time. Hence, if you see a tutorial for Alloy that uses time and wasn't updated after 2021, you can safely discard it; it's useless now. There are some resources to learn Alloy that are still good: @@ -26,7 +26,7 @@ Alloy [supports running Markdown files](https://alloy.readthedocs.io/en/latest/t 1.
[Install Alloy Analyzer](https://github.com/AlloyTools/org.alloytools.alloy/releases/) 1. Run Alloy Analyzer: `java -jar org.alloytools.alloy.dist.jar`. 1. Press "Open". -1. Navigate to the walnats repository you clonned, and inside go to `docs/alloy.md` and open it. +1. Navigate to the walnats repository you cloned, and inside go to `docs/alloy.md` and open it. 1. Press "Execute" to generate a sample of the model and "Show" to open the graph. Go to these tutorials to learn using the GUI: @@ -37,7 +37,7 @@ Go to these tutorials to learn using the GUI: ## Message -The first "signature" (something like `class` in Python) we'll have is `Message`. It represents specific messages sent for a single event type. I make a model for only one event type to keep it simple. Whenever possible, you should use [inductive](https://en.wikipedia.org/wiki/Inductive_reasoning) approach. Prove your assumptions for one event, prove that every event in the system fits the model, and you have proven the whole system. That's why one event type is sufficient. +The first "signature" (something like `class` in Python) we'll have is `Message`. It represents specific messages sent for a single event type. I make a model for only one event type to keep it simple. Whenever possible, you should use the [inductive](https://en.wikipedia.org/wiki/Inductive_reasoning) approach. Prove your assumptions for one event, prove that every event in the system fits the model, and you have proven the whole system. That's why one event type is sufficient. ```alloy some sig Message {} @@ -45,11 +45,11 @@ some sig Message {} The `some` quantifier means that there is always at least one message because models without messages are boring. If nothing happens, what's the point? -if there is a Message, it doesn't mean it has been emitted yet. All that means is that it will be emitted on the interval we consider. In other words, each message atom (instance) at the start is a planned message. I find it helpful for verification. We know in advance what messages we expect, and so it's easier to check if all expected messages actually happened. Lastly, it will help you if you plan to model producer failures. +if there is a Message, it doesn't mean it has been emitted yet. All that means is that it will be emitted on the interval we consider. In other words, each message atom (instance) at the start is a planned message. I find it helpful for verification. We know in advance what messages we expect, so it's easier to check if all expected messages actually happened. Lastly, it will help you if you plan to model producer failures. ## Producer -Producer is a service that emits messages for the event. We can have multiple producers. +A producer is a service that emits messages for the event. We can have multiple producers. ```alloy sig Producer { @@ -71,7 +71,7 @@ abstract sig Actor { sig Actor1, Actor2 extends Actor {} ``` -`Actor1` and `Actor2` are different actor types, such as "send-email" or "generate-report". Each atom (instance) here is a separate instance of that actor. For example, the atom `Actor1$2` will be 2nd instance of Actor1. +`Actor1` and `Actor2` are different actor types, such as "send-email" or "generate-report". Each atom (instance) here is a separate instance of that actor. For example, the atom `Actor1$2` will be the 2nd instance of Actor1. The `abstract` means that there are no atoms of `Actor` type, it must be either `Actor1` or `Actor2`. 
@@ -127,9 +127,9 @@ Liveness properties are the ones that **eventually** will be true, that "good things" will happen: ```alloy pred liveness { - // Every message is eventaully emitted. + // Every message is eventually emitted. Message = Producer.emitted - // Every emitted message is eventaully processed. + // Every emitted message is eventually processed. Producer.emitted = Actor1.handled Producer.emitted = Actor2.handled } ``` @@ -155,4 +155,7 @@ We start with `init` and let the system evolve such that `safety` always holds t ## Stuttering -Since we require +Since we require `liveness` properties to be eventually true, on each step the system will progress towards that goal. There are a few things to keep in mind, though: + ++ One step of the spec can take any amount of time in the real world, from nanoseconds to days. The system may die and be dead for days until an engineer comes and fixes a critical bug. All we say is that *eventually* messages will be processed. ++ Individual actors or even all of them still can [stutter](https://www.learntla.com/core/temporal-logic.html?highlight=stutter#anything-can-crash) for any number of steps. Again, all we say is that they will *eventually* move on. diff --git a/docs/api.md b/docs/api.md index 9e0e1ee..d5289e0 100644 --- a/docs/api.md +++ b/docs/api.md @@ -1,6 +1,6 @@ # API -Are you looking for an API documentation for a specific class or module? If can find most of the answers in docstrings without leaving the IDE. But in case you need a web reference, here is an index of pages covering specific parts of the API. +Are you looking for API documentation for a specific class or module? You can find most of the answers in docstrings without leaving the IDE. But in case you need a web reference, here is an index of pages covering specific parts of the API. ## Public classes diff --git a/docs/clock.md b/docs/clock.md index f3ef996..8ee316c 100644 --- a/docs/clock.md +++ b/docs/clock.md @@ -1,6 +1,6 @@ # Periodic tasks -You often will have tasks that you need to execute periodically (for instance, every day or every 5 minutes) or by schedule (for instance, 5 days after today). For example, create a database backup every midnight or actualize cache of orders every 5 minutes. Wouldn't it be great to have an event for that? Then you can get for the periodic tasks all the benefits of event-driven actors: scalability, distribution, retries, observibility, and more. +You will often have tasks that you need to execute periodically (for instance, every day or every 5 minutes) or by schedule (for instance, 5 days after today). For example, create a database backup every midnight or actualize a cache of orders every 5 minutes. Wouldn't it be great to have an event for that? Then you can get all the benefits of event-driven actors for your periodic tasks: scalability, distribution, retries, observability, and more. Meet `walnats.Clock`. It's a worker that emits periodic events. To implement a periodic task, run the clock and subscribe to the event in any other place. @@ -11,6 +11,6 @@ Meet `walnats.Clock`. It's a worker that emits periodic events. To implement a p ## Tips -1. It might be a good idea to run a separate clock on each nats cluster, so that in case of network failure between clusters, events will still be coming in each of them. When connection is restored, Nats will make sure there are no duplicates. +1. It might be a good idea to run a separate clock on each nats cluster so that in case of network failure between clusters, events will still be coming in each of them.
When the connection is restored, Nats will make sure there are no duplicates. 1. The interval between events may be affected by CPU-bound tasks running in the same process. To avoid it, make sure to use {py:attr}`walnats.ExecuteIn.PROCESS` for all CPU-heavy actors running in the same instance as the clock. -1. If there is just one worker that needs a very specific schedule (for instance, it must be run daily at 12:03 and any other time won't do), prefer using {py:class}`walnats.decorators.filter_time`. If there are multiple workers that need the same schedule (like be run once a day, doesn't matter when), prefer making for them a separate Clock with a fitting duration. +1. If there is just one worker that needs a very specific schedule (for instance, it must be run daily at 12:03 and any other time won't do), prefer using {py:class}`walnats.decorators.filter_time`. If multiple workers need the same schedule (like being run once a day, doesn't matter when), prefer creating a separate Clock with a fitting duration for them. diff --git a/docs/events.md b/docs/events.md index 33e6d54..b815a25 100644 --- a/docs/events.md +++ b/docs/events.md @@ -1,6 +1,6 @@ # Events -Disigning an event-driven architecture start from describing services and events you're going to have. And so walnats-based project starts with defining events too. +Designing an event-driven architecture starts with describing services and events you're going to have. And so a walnats-based project starts with defining events too. ## Declare events @@ -32,7 +32,7 @@ If you're curious, this is the full reference for the Event class: ## Limit events -A distributed system should be designed with a fault-tolerance in mind. Walnats will take care of redelivering failed messages in case a handler fails, an instance dies, or any other emergency. But for how long should it try redeliveries? And what if there are suddenly too many messages? That differs from event to event, depends on the business model and the system you build, and so only you can answer these questions. That's why you should specify limits for all events. The limits describe for how long messages can be stored, how much space they can take, how many of them can be in total, and so on. When a limit is reached, Nats server will drop old messages to fit into the limit. +A distributed system should be designed with fault tolerance in mind. Walnats will take care of redelivering failed messages in case a handler fails, an instance dies, or any other emergency. But for how long should it try redeliveries? And what if there are suddenly too many messages? That differs from event to event, depending on the business model and the system you build, and so only you can answer these questions. That's why you should specify limits for all events. The limits describe how long messages can be stored, how much space they can take, how many of them there can be in total, and so on. When a limit is reached, the Nats server will drop old messages to fit into the limit. ```python USER_REGISTERED = walnats.Event( @@ -63,16 +63,16 @@ This is the reference of Limits with all limits you can set: Declaring events: + **Descriptive name**. The event name should be a verb that describes what happened in the system. Examples: "user-registered", "parcel-delivered", "order-created". -+ **Persistent name**. Choose the event name carefully. You cannot rename the event after it reaches the production.
Well, you can, but that's very hard to do without loosing (or duplicating) messages because actors and producers are deployed independently. ++ **Persistent name**. Choose the event name carefully. You cannot rename the event after it reaches production. Well, you can, but that's very hard to do without losing (or duplicating) messages because actors and producers are deployed independently. + **Registry**. If you have non-Python microservices, consider also using some kind of event registry, like [eventcatalog](https://github.com/boyney123/eventcatalog) or [BSR](https://docs.buf.build/bsr/introduction), so that event definition (and especially schemas) are available for all services. + Use `SCREAMING_SNAKE_CASE` for the variable where the event is assigned. Events are immutable, and so can be considered constants. + **Defaults**. When you add a new field in an existing event, provide the default value for it. It is possible that the actor expecting the field is deployed before the producer, or an old event emitted by an old producer arrives. Always keep in mind backward compatibility. Changing a service is atomic and can be done in one deployment, changing multiple services isn't. -+ **Versioning**. Make sure that when you change the limits, {py:meth}`walnats.ConnectedEvents.register` is executed only in the latest version of your app. For example, if v1 of your service sets `max_age=60`, v2 sets `max_age=120`, and then so happened that v1 calls `register` afte v2 did that, v1 will negate the changes made by v1. To avoid the issue, call `register` only once at the application start and make sure that if the app is restarted, it will be replaced by the latest version. ++ **Versioning**. Make sure that when you change the limits, {py:meth}`walnats.ConnectedEvents.register` is executed only in the latest version of your app. For example, if v1 of your service sets `max_age=60`, v2 sets `max_age=120`, and then it so happens that v1 calls `register` after v2 did, v1 will negate the changes made by v2. To avoid the issue, call `register` only once at application start and make sure that if the app is restarted, it will be replaced by the latest version. Which fields to include in the event schema: -+ Include in the event fields that will be needed to many or all actors. For example, "parcel-status-changed" should include not only the parce ID in the database but also the old and the new status of the parcel. That way, the actors that do something only to the parcel moving in a specific status will be able to check the status without doing any database requests. -+ Do not include fields that are not needed or needed only for a handful of actors, it doesn't scale well. For example, if there is a "send-email" actor that reacts to "user-registered" event, trying to fit into the event all information needed for the email is a dead end. Each time you decide to add more information into email, you'll need to update not only the actor but also the producer of the event, which defeats the whole point of event-driven architecture. Producers (and hence the events) should not depend on the implementation of a specific actor. -+ When considering to include or not an event, you should think about scenario when the relevant database record has been updated after the event was emitted. For example, don't include user's email in the "user-registered" event. If they update their name, all actors sending emails should use the latest version.
But you should include the user's email in the "user-email-changed" event, so that if the user changes email again before the first event is processed, both changes are properly handled by actors. ++ Include in the event the fields that will be needed by many or all actors. For example, "parcel-status-changed" should include not only the parcel ID in the database but also the old and the new status of the parcel. That way, the actors that do something only to the parcel moving in a specific status will be able to check the status without doing any database requests. ++ Do not include fields that are not needed or needed only for a handful of actors; it doesn't scale well. For example, if there is a "send-email" actor that reacts to the "user-registered" event, trying to fit into the event all information needed for the email is a dead end. Each time you decide to add more information to an email, you'll need to update not only the actor but also the producer of the event, which defeats the whole point of event-driven architecture. Producers (and hence the events) should not depend on the implementation of a specific actor. ++ When considering whether or not to include a field, you should think about a scenario where the relevant database record has been updated after the event was emitted. For example, don't include the user's email in the "user-registered" event. If they update their name, all actors sending emails should use the latest version. But you should include the user's email in the "user-email-changed" event so that if the user changes email again before the first event is processed, both changes are properly handled by actors. Modeling events for your system is like art, except if you do it wrong, everyone dies. The [awesome-event-patterns](https://github.com/boyney123/awesome-event-patterns) list has some good articles about designing events. That's a good place to get started. diff --git a/docs/glossary.md b/docs/glossary.md index 4888c89..20f9dd1 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,23 +1,23 @@ # Glossary -We do our best not to use too many technical terms, or at least not to rely on them too much. But if you encountered in the documentation some term that you don't understand, check this page, it may be explained there. These aren't technical 100% correct explanations but rather informal ones that should be sufficient to use and understand walnats. +We do our best not to use too many technical terms, or at least not to rely on them too much. But if you encounter a term in the documentation that you don't understand, check this page; it may be explained there. These aren't 100% technically correct explanations but rather informal ones that should be sufficient to use and understand walnats. -+ **Ack** (acknowledgment) is a message that the actor sends when the message is succesfully handled. If Nats doesn't receive ack in the configured timeframe, the message will be redelivered to another instance of the actor. It does so to prevent the message not being handled if the first actor died, had networking issues, or frozed. Walnats keeps postponing the ack deadline while the message is in progress. -+ **Actor** is a piece of code in a subscriber service that acts on a specific event. For example, "notifications" service can have "send-email" and "send-sms" actors that act on the same "user-registered" event (or on different unrelated events). -+ **Consumer** is how Nats calls an Actor.
If you run multiple copies of the same Actor, event on different machines, they all will be considered one consumer. Consumers are "the same" if they have the same name. That's why an Actor name must be unique accross the whole system. -+ **Event** is information about something happening in your system. Services in microservice architecture emit events into a message broker, so that other services can act upon them. For example, "users" service may emit event "user-registered", so that "email" can send a email confirmation message to the user and "audit-log" service can write an audit record that a new user has registered. ++ **Ack** (acknowledgment) is a message that the actor sends when the message is successfully handled. If Nats doesn't receive ack in the configured timeframe, the message will be redelivered to another instance of the actor. It does so to prevent the message not being handled if the first actor died, had networking issues, or froze. Walnats keeps postponing the ack deadline while the message is in progress. ++ **Actor** is a piece of code in a subscriber service that acts on a specific event. For example, the "notifications" service can have "send-email" and "send-sms" actors that act on the same "user-registered" event (or on different unrelated events). ++ **Consumer** is how Nats calls an Actor. If you run multiple copies of the same Actor, even on different machines, they all will be considered one consumer. Consumers are "the same" if they have the same name. That's why an Actor's name must be unique across the whole system. ++ **Event** is information about something happening in your system. Services in microservice architecture emit events into a message broker so that other services can act upon them. For example, "users" service may emit the event "user-registered" so that "email" can send an email confirmation message to the user, and the "audit-log" service can write an audit record that a new user has registered. + **Job** is a single run of an actor. In other words, each call of the handler for handling a message is one job. -+ **Handler** is a function that implements the core logic for an actor. For example, "send-sms" actor can call "send_sms" Python function that accepts the event payload and sends an SMS. ++ **Handler** is a function that implements the core logic for an actor. For example, the "send-sms" actor can call the "send_sms" Python function that accepts the event payload and sends an SMS. + **Message broker** is a service that takes care of delivering messages over the network. The most famous message brokers are RabbitMQ, Kafka, and Redis. Walnats uses Nats as the message broker. + **Nak** (or sometimes **nack**, non-acknowledgment) is the message that walnats sends into Nats when a message processing failed. When Nats receives nak, the message will be re-delivered to another (or the same) instance of the actor. -+ **Nats** is a message broker that we use for delivering messages. It's cloud-native, written on Go, and initially was just a Pub.Sub solution, meaning that messages are delivered to only the subscribers that are currently online and are never retried. Now, it also has Nats JetStream for streams, and that's what makes walnats possible. -+ **Payload** is the content of the event serialized into binary, so it can be transfered over the network. In other words, it's the raw body of Nats message. ++ **Nats** is a message broker that we use for delivering messages. 
It's cloud-native, written on Go, and initially was just a Pub/Sub solution, meaning that messages are delivered to only the subscribers that are currently online and are never retried. Now, it also has Nats JetStream for streams, and that's what makes walnats possible. ++ **Payload** is the content of the event serialized into binary, so it can be transferred over the network. In other words, it's the raw body of the Nats message. + **Polling** is periodically requesting and receiving messages from a message broker. It's like "pull" but in a loop. That's how Actors consume Events. + **Publisher** (or **producer**) is a service that emits events. + **Registry** is a collection of related things that can be managed together. For example, `walnats.Actors` is a collection of `walnats.Actor` instances. -+ **Schema** (scheme, model) is a Python type of the data inside of the event. It can be a Pydantic model, a protobuf message, a dataclass, or a built-in type. Walnats transfers raw bytes over the network, and the schema is a Pytohn representation of the data that you can work with. -+ **Serializer** is a class that can take your data in Python types and convert it into bytes to be transfered over the network. For example, if your schema is `dict` and and your data is `{'hello': 1}`, walnats will (by default) convert it on the publisher side into `{"hello":1}` binary JSON payload, transfer it through the network, and turn back into `{'hello': 1}` on the receiving side. -+ **Stream** is a peristency layer on top of a subject in Nats JetStream. This is how Nats JetStream makes sure that messages are delivered to the right Actors and don't get lost. ++ **Schema** (scheme, model) is a Python type of data inside of the event. It can be a Pydantic model, a protobuf message, a dataclass, or a built-in type. Walnats transfers raw bytes over the network, and the schema is a Python representation of the data that you can work with. ++ **Serializer** is a class that can take your data in Python types and converts it into bytes to be transferred over the network. For example, if your schema is `dict` and your data is `{'hello': 1}`, walnats will (by default) convert it on the publisher side into `{"hello":1}` binary JSON payload, transfer it through the network, and turn back into `{'hello': 1}` on the receiving side. ++ **Stream** is a persistency layer on top of a subject in Nats JetStream. This is how Nats JetStream makes sure that messages are delivered to the right Actors and don't get lost. + **Subject** is how Nats calls queues. This is how different subscribers can receive from Nats only specific messages they are interested in instead of everything that gets put in it. Walnats uses the Event name as the subject name when emitting events. + **Subscriber** is a service that consumes events and acts on them. -+ **Tracing** (distributed tracing) is when you collect information about how long each step took when processing a specific request. For example, you know that a specific request from the user to your web app took too long. You look up in logs the ID of this request, open the trace collector service (such as [Zipkin](https://zipkin.io/), [OpenTelemetry](https://opentelemetry.io/docs/concepts/signals/traces/), [Google Cloud Trace](https://cloud.google.com/trace), or [Datadog](https://docs.datadoghq.com/getting_started/tracing/)), find the trace for this request by ID, and you can see how long it took to run a specific SQL query, emit a Nats event, or porcess the event by a specific actor. 
++ **Tracing** (distributed tracing) is when you collect information about how long each step took when processing a specific request. For example, you know that a specific request from the user to your web app took too long. You look up in logs the ID of this request, open the trace collector service (such as [Zipkin](https://zipkin.io/), [OpenTelemetry](https://opentelemetry.io/docs/concepts/signals/traces/), [Google Cloud Trace](https://cloud.google.com/trace), or [Datadog](https://docs.datadoghq.com/getting_started/tracing/)), find the trace for this request by ID, and you can see how long it took to run a specific SQL query, emit a Nats event, or process the event by a specific actor. diff --git a/docs/intro.md b/docs/intro.md index 3da0452..c4ffb99 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -1,27 +1,27 @@ # Getting started -This page gives a high-level overview of how a walnats-project look like. It's ok if you have unanswered questions after reading it. Later pages will dive into more details. This documentation targets to be as clear and approachable as possible with the cost of repetition. +This page gives a high-level overview of what a walnats-based project looks like. It's ok if you have unanswered questions after reading it. Later pages will dive into more details. This documentation aims to be as clear and approachable as possible, at the cost of some repetition. ## Overview -Walnats is a Python library for writing *event-driven* services. Being event-driven means that when a service (*producer* or *publisher*) wants to run some job in background, on schedule, or in another internal service, that service *emits* an *event*. The event tells other services that something happened in the system. Other services (*subscribers*) may have one or multiple *actors* that *listen* to events and *handle* them in some way. +Walnats is a Python library for writing *event-driven* services. Being event-driven means that when a service (*producer* or *publisher*) wants to run some job in the background, on schedule, or in another internal service, that service *emits* an *event*. The event tells other services that something happened in the system. Other services (*subscribers*) may have one or multiple *actors* that *listen* to events and *handle* them in some way. -For example, your project may have a "users" service that emits "user-registered" event. And then "notifications" service may have a "send-email" actor that sends a welcome email to the newly registered user. +For example, your project may have a "users" service that emits a "user-registered" event. And then the "notifications" service may have a "send-email" actor that sends a welcome email to the newly registered user. -This decoupling is a very powerful idea that allows to scale services in a microservices architecture fast and independently. You may run multiple instances of the same actor, and walnats will make sure to evenly distribute the job among them. Or you can add a new actor (let's say, you now want to also send an SMS to the registered user) without changing a line of code in the service emitting the event. +This decoupling is a very powerful idea that allows scaling services in a microservices architecture fast and independently. You may run multiple instances of the same actor, and walnats will make sure to evenly distribute the job among them. Or you can add a new actor (let's say, you now want to also send an SMS to the registered user) without changing a line of code in the service emitting the event.
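For illustration, here is a minimal sketch of that "user-registered" example with two actors listening to the same event. It follows the quickstart pattern from the README; treat the exact argument order of `walnats.Event` and `walnats.Actor` here as an assumption and check the API pages for the authoritative signatures.

```python
import asyncio
from dataclasses import dataclass

import walnats


@dataclass
class User:
    id: int
    email: str


# The event is declared once and shared by the producer and all subscribers.
USER_REGISTERED = walnats.Event('user-registered', User)


async def send_email(user: User) -> None:
    ...  # plain handler, knows nothing about walnats or Nats


async def send_sms(user: User) -> None:
    ...


# Two independent actors subscribed to the same event.
ACTORS = walnats.Actors(
    walnats.Actor('send-email', USER_REGISTERED, send_email),
    walnats.Actor('send-sms', USER_REGISTERED, send_sms),
)


async def run_subscriber() -> None:
    async with ACTORS.connect() as conn:
        await conn.register()  # create Nats JetStream consumers
        await conn.listen()    # poll for events and call the handlers


if __name__ == '__main__':
    asyncio.run(run_subscriber())
```

The "users" service only emits `USER_REGISTERED`; adding the "send-sms" actor later doesn't require touching it.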
## Components A typical walnats-based service consists of a few parts, each having its own name and serving a distinct purpose: 1. A **schema** describes the type of data you want to send in the event. For example, it can be a pydantic model `User` that has fields `id: int` and `full_name: str`. -1. An **event** ({py:class}`walnats.Event`) connects the event (identified by a name that is unique accross the system) to a schema. For example, we can say that `user-registered` event has a `User` schema of data. It allows us to ensure type safety for both the publisher and the subscriber. +1. An **event** ({py:class}`walnats.Event`) connects the event (identified by a name that is unique across the system) to a schema. For example, we can say that `user-registered` event has a `User` schema of data. It allows us to ensure type safety for both the publisher and the subscriber. 1. A collection of **events** ({py:class}`walnats.Events`) is used to emit events. 1. A **handler** is a function that gets called when an event happens. It gets called with a message (a value of the schema type, the event payload) and can do inside anything it needs to do. 1. An **actor** ({py:class}`walnats.Actor`) connects a handler to an event. It makes sure to call the handler when an event occurs. 1. A collection of **actors** ({py:class}`walnats.Actors`) is used to listen to all relevant events. -Here is a little memo of how these things are related ot each other: +Here is a little memo of how these things are related to each other: ![schema](./schemas/intro1.svg) @@ -29,7 +29,7 @@ If you encounter an unfamiliar term in this documentation, check out {doc}`gloss ## Before you start -If you like a hands on approach and want to write your own project as you follow along, you need to prepare the environment first: +If you like a hands-on approach and want to write your own project as you follow along, you need to prepare the environment first: 1. Install walnats: `python3 -m pip install walnats` 1. [Install nats server](https://docs.nats.io/running-a-nats-service/introduction/installation): `go install github.com/nats-io/nats-server/v2@latest` diff --git a/docs/pub.md b/docs/pub.md index 621fc1a..09aeae9 100644 --- a/docs/pub.md +++ b/docs/pub.md @@ -1,6 +1,6 @@ # Publisher -Publisher is the code that emits events. It's up to you how you structure and run the publisher. It might be a web service, an actor, a CLI tool, or anything else. +The publisher is the code that emits events. It's up to you how you structure and run the publisher. It might be a web service, an actor, a CLI tool, or anything else. All you need to do: @@ -19,13 +19,13 @@ async with events.connect() as conn: ## Designing for failure -This section gives a few insights into how to make the event producers more resilient. +This section gives a few insights into how to make event producers more resilient. + **Outbox**. If the nats client we use ([nats.py](https://github.com/nats-io/nats.py)) gets disconnected from the server, it tries to reconnect in the background. If you try sending a message while the connection is lost, that message will be put into the pending queue and sent after the connection is restored. Hence, successful `emit` doesn't mean the message will be delivered to Nats server. The client might be shut down before it restores the connection with the server. You can adjust this behavior in several ways: - + Call {py:meth}`walnats.types.ConnectedEvents.emit` with `sync=True`. 
Then walnats will make sure the message is delivered into Nats JetStream before returning. + + Call {py:meth}`walnats.types.ConnectedEvents.emit` with `sync=True`. Then walnats will make sure the message is delivered to Nats JetStream before returning. + Call {py:func}`nats.connect` with `pending_size=0` (and pass this connection into `walnats.Events.connect`). The argument sets the maximum size in bytes of the pending messages queue. Setting it to zero disables the outbox altogether. When the connection is lost, you try to emit an event, and the limit is reached, the client will raise {py:exc}`nats.errors.OutboundBufferLimitError`. -+ **Transactions**. Use transactions for database changes. For example, for SQLAlchemy, see [Transactions and Connection Management](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html). That way, I `emit` fails, the database changes will be rolled back, and then the whole operation you did can be safely retried. -+ **Deduplication**. If you have retries for the operation that emits an event, it's possible that the operation (try to guess) will be retried, and so `emit` will be called twice. To avoid the same event to be emitted twice, you can provide `uid` argument for {py:meth}`walnats.types.ConnectedEvents.emit` (which has to be a unique identifier of the message), and Nats will make sure to ignore duplicates. The default deduplication window is 2 minutes. ++ **Transactions**. Use transactions for database changes. For example, for SQLAlchemy, see [Transactions and Connection Management](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html). That way, if `emit` fails, the database changes will be rolled back, and then the whole operation you did can be safely retried. ++ **Deduplication**. If you have retries for the operation that emits an event, it's possible that the operation (try to guess) will be retried, and so `emit` will be called twice. To avoid emitting the same event twice, you can provide a `uid` argument for {py:meth}`walnats.types.ConnectedEvents.emit` (which has to be a unique identifier of the message), and Nats will make sure to ignore duplicates. The default deduplication window is 2 minutes. If you want to know everything about what publishers can do with events, check out the API docs below. If you're just getting started, feel free to skip to the next chapter: [Actors](actors). @@ -43,7 +43,7 @@ If you want to know everything about what publishers can do with events, check o ## CloudEvents -[CloudEvents](https://cloudevents.io/) is a specification that describes the format of metadata for events. Walnats doesn't care about most of the data there but it can be useful if you use some third-party tools or consumers that can benefit from such metadata.
If that's the case, you can pass a {py:class}`walnats.CloudEvent` instance as an argument into {py:meth}`walnats.types.ConnectedEvents.emit`. This metadata will be included in the Nats message headers according to the WIP specification: [NATS Protocol Binding for CloudEvents](https://github.com/cloudevents/spec/blob/main/cloudevents/bindings/nats-protocol-binding.md). ```{eval-rst} .. autoclass:: walnats.CloudEvent diff --git a/docs/recipes.md b/docs/recipes.md index 6bfb4d2..c0e2993 100644 --- a/docs/recipes.md +++ b/docs/recipes.md @@ -4,7 +4,7 @@ This page covers common usage patterns of walnats. ## Passing dependencies into handlers -The handler can be any callable object (like a method or a class with `__call__` method), not necessarily a function. You can use this fact to bind handler's dependencies as class attributes: +The handler can be any callable object (like a method or a class with a `__call__` method), not necessarily a function. You can use this fact to bind the handler's dependencies as class attributes: ```python @dataclass @@ -24,7 +24,7 @@ async def run() -> None: ## Routing events based on a value -Often, you'll have actors that need to do something only for event with a specific value in some field. For example, send email only for a parcel update only when it moves in a specific status or write audit logs only for non-admin users. The easiest solution is to simply check the condition at the beginning of the handler, but that means the actor will have to receive, decode, and check every event and actually do something useful only to a small portion of them. +Often, you'll have actors that need to do something only for an event with a specific value in some field. For example, send email only for a parcel update only when it moves in a specific status or write audit logs only for non-admin users. The easiest solution is to simply check the condition at the beginning of the handler, but that means the actor will have to receive, decode, and check every event and do something useful only to a small portion of them. A better solution (in some situations) might be to provide separate events for each possible value of the field. Assuming that there is a small and well-known set of possible values. diff --git a/docs/resp.md b/docs/resp.md index b6702d0..d128260 100644 --- a/docs/resp.md +++ b/docs/resp.md @@ -1,8 +1,8 @@ # Request/response -If possible, you should try to design your system as a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) where there is only one producer for each event type, and the producer doesn't care how the events it emits are handled. Sometimes, you'll have loops when there are two actors and each can also produce an event in which the other actor is interested. And in some rare occasions, you'll have producer that when emit an event will need to wait for a response from an actor to this particular event. This is known as request/response pattern and walnats provides support for it out-of-the box. +If possible, you should try to design your system as a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) where there is only one producer for each event type, and the producer doesn't care how the events it emits are handled. Sometimes, you'll have loops when there are two actors and each can also produce an event in which the other actor is interested. 
And on some rare occasions, you'll have a producer that, when emitting an event, needs to wait for a response from an actor to this particular event. This is known as the request/response pattern, and walnats provides support for it out of the box. -The internal implementation is similar to [Request-Reply](https://docs.nats.io/nats-concepts/core-nats/reqreply) pattern in Nats. There is a difference, though. Nats implements it only for core Nats, without JetStream, and so it cannot be used with Nats JetStream consumers (and so with walnats actors). Walnats fixes it by providing its own implementation of the pattern on top of Nats. However, there is still no persistency for responses. Whatever response the actor sends, it will be emitted through the core Nats. If there is a network failure or the producer gets restarted, the response will get lost. THe idea is that the response matters only right now and only for this specific producer. If you need persistency, explicitly use events and actors for responses as well. +The internal implementation is similar to the [Request-Reply](https://docs.nats.io/nats-concepts/core-nats/reqreply) pattern in Nats. There is a difference, though. Nats implements it only for core Nats, without JetStream, and so it cannot be used with Nats JetStream consumers (and so with walnats actors). Walnats fixes it by providing its own implementation of the pattern on top of Nats. However, there is still no persistency for responses. Whatever response the actor sends will be emitted through core Nats. If there is a network failure or the producer gets restarted, the response will get lost. The idea is that the response matters only right now and only for this specific producer. If you need persistency, explicitly use events and actors for responses as well. There are just a few changes you need to make for sending and receiving responses: diff --git a/docs/sub.md b/docs/sub.md index 515ce75..ba37416 100644 --- a/docs/sub.md +++ b/docs/sub.md @@ -4,7 +4,7 @@ The subscriber runs a set of actors. It's up to you to provide CLI for the subsc What you need to do: -1. Collect all actors you want to run into `walnats.Actors` registry. +1. Collect all actors you want to run into the `walnats.Actors` registry. 2. Call `register` to create Nats JetStream consumers. 3. Call `listen` to start listening for events. @@ -15,7 +15,7 @@ async with actors.connect() as conn: await conn.listen() ``` -Below are the API docs that list all options available for connecting and listening. You can scroll past them and go directly to the sections covering handling of failures and high load. +Below are the API docs that list all options available for connecting and listening. You can scroll past them and go directly to the sections covering the handling of failures and high load. ## API @@ -28,18 +28,18 @@ Below are the API docs that list all options available for connecting and listen ## Design for high load -Walnats does its best to balanace the load between actors, but exact numbers and strategies highly depend on the type of jobs you have, and that's something that only you may know. +Walnats does its best to balance the load between actors, but exact numbers and strategies highly depend on the type of jobs you have, and that's something that only you may know. -* You can run multiple processes, each having its own walnats event listener. Either directly with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) or run the whole application multiple times.
If you use [supervisord](http://supervisord.org/), consider adjusting `numprocs` option based on how many cores the target machine has. +* You can run multiple processes, each having its own walnats event listener. Either directly with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) or running the whole application multiple times. If you use [supervisord](http://supervisord.org/), consider adjusting the `numprocs` option based on how many cores the target machine has. * The whole idea of async/await is to switch to CPU-bound work while waiting for an IO-bound response. For example, while we're waiting for a database response to a query, we can prepare and run another query. So, make sure there is always work to do while in `await`. You can do that by increasing the `max_jobs` value in both individual actors and `listen`, running more actors on the same machine, and actively using async/await in your code, with `asyncio.gather` and all. -* After some point, increasing `max_jobs` doesn't bring any value. This is the point when there is already more than enough work to do while in `await`, and so blocking operations start making the pause at `await` much longer that it is needed. It will make every job slower, and you'd better scale with more processes or machines instead. -* Keep in mind that all system resources are limited, and some limits are smaller that you might think. For example, the number of files or network connections that can be open simultaneously. Again, having a smaller `max_jobs` (and in case of network connections, `max_polls`) might help. +* After some point, increasing `max_jobs` doesn't bring any value. This is the point when there is already more than enough work to do while in `await`, and so blocking operations start making the pause at `await` much longer than it is needed. It will make every job slower, and you'd better scale with more processes or machines instead. +* Keep in mind that all system resources are limited, and some limits are smaller than you might think. For example, the number of files or network connections that can be opened simultaneously. Again, having a smaller `max_jobs` (and in the case of network connections, `max_polls`) might help. * If you have a long-running CPU-bound task, make sure to run it in a separate process poll by specifying `execute_in`. ## Design for failure 1. Keep in mind that your code can explode at any point, especially if you work a lot with external services and third-party APIs. Walnats will take care of retries, but it's on you to make sure nothing is left half-done in case of a failure ("Atomicity" part of [ACID](https://en.wikipedia.org/wiki/ACID)). For database queries, use a transaction. For network requests, keep them closer to the end and retry just the requests in case of failures. For file writes, write a temporary file first, and then copy it in the right place. -1. Make sure to have a reasonable `ack_wait` value. Too high number means the event might arrive when nobody needs the result ([real-time system](https://en.wikipedia.org/wiki/Real-time_computing)). Too low value might mean that walnats didn't get enough time to send "in progress" pulse into Nats, and so the message was re-delivered to another instance of the actor while the first one haven't failed. -1. The job may freeze. Make sure you specify timeouts for all network requests and the actor itself has a low enough `job_timeout` value. -1. 
Some errors are caused by deterministic bugs, and so no amount of retries can fix them. Make sure to specify `max_attempts` for the actor and `limits` for the event. -1. Make sure you have error reporting in place. Walnats provides out-of-the-box a middleware for Sentry ({py:class}`walnats.middlewares.SentryMiddleware`), but you can always write your own middleware for whatever service you use. +1. Make sure to have a reasonable `ack_wait` value. Too high a value means the event might arrive when nobody needs the result ([real-time system](https://en.wikipedia.org/wiki/Real-time_computing)). Too low a value might mean that walnats didn't get enough time to send an "in progress" pulse into Nats, and so the message was re-delivered to another instance of the actor while the first one hasn't actually failed. +1. The job may freeze. Make sure you specify timeouts for all network requests and that the actor itself has a low enough `job_timeout` value. +1. Some errors are caused by deterministic bugs, so no amount of retries can fix them. Make sure to specify `max_attempts` for the actor and `limits` for the event. +1. Make sure you have error reporting in place. Walnats provides out-of-the-box middleware for Sentry ({py:class}`walnats.middlewares.SentryMiddleware`), but you can always write your own middleware for whatever service you use.
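To make the atomicity advice in the first point of the list above more concrete, here is a minimal sketch of a handler that never leaves a half-written file behind, so a retried job can safely start over. `Report`, `save_report`, and the target directory are hypothetical placeholders; the `tempfile` + `os.replace` pattern is the part that matters.

```python
import os
import tempfile
from dataclasses import dataclass

REPORTS_DIR = '/var/lib/reports'  # hypothetical target directory


@dataclass
class Report:  # hypothetical event schema
    id: int
    body: str


async def save_report(report: Report) -> None:
    data = report.body.encode()
    # Write into a temporary file in the same directory as the target first...
    fd, tmp_path = tempfile.mkstemp(dir=REPORTS_DIR)
    try:
        with os.fdopen(fd, 'wb') as stream:
            stream.write(data)
        # ...then atomically move it into place. Readers see either the old
        # file or the complete new one, never a partial write, and a crashed
        # job leaves nothing behind at the final path, so a retry is safe.
        os.replace(tmp_path, os.path.join(REPORTS_DIR, f'report-{report.id}.bin'))
    except BaseException:
        os.unlink(tmp_path)
        raise
```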