Add Restate architecture documentation #100
✅ Deploy Preview for docsrestatedev ready!
Looks good, thanks @tillrohrmann !
Thank you Till! I think there are a few topics that we could still discuss here:
- Journal: I think the journal warrants a dedicated section here to explain how it gives you suspension and replay. And that it logs invocations but also context calls.
- How invocations work: request goes to the ingress, state is eagerly attached together with the journal, runtime knows where the service is running (service registry) and sets up the connection... Mention suspensions
- Service registry: that metas keep service registry based on discovery and that services don't need to do this themselves anymore. Their requests just go via the runtime.
Just some thoughts...
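To make the journal point concrete for the docs, here is a minimal sketch of how a journal could give you replay after a crash or suspension. It is not Restate's actual API or wire protocol: `Journal`, `Context`, `run`, and `handler` are hypothetical names, and the journal is an in-memory list rather than something stored by the runtime.

```python
from typing import Any, Callable, List, Optional


class Journal:
    """Hypothetical journal: an ordered log of completed step results."""

    def __init__(self, entries: Optional[List[Any]] = None) -> None:
        self.entries: List[Any] = list(entries or [])


class Context:
    """Hypothetical context that records each step in the journal."""

    def __init__(self, journal: Journal) -> None:
        self._journal = journal
        self._cursor = 0  # index of the next journal entry to replay

    def run(self, action: Callable[[], Any]) -> Any:
        if self._cursor < len(self._journal.entries):
            # Replay: the step already completed, reuse the recorded result.
            result = self._journal.entries[self._cursor]
        else:
            # First execution: perform the step and record its result.
            result = action()
            self._journal.entries.append(result)
        self._cursor += 1
        return result


calls: List[str] = []  # tracks which side effects really executed


def handler(ctx: Context, crash_after_first_step: bool) -> List[str]:
    first = ctx.run(lambda: calls.append("charge") or "charged")
    if crash_after_first_step:
        raise RuntimeError("simulated crash")
    second = ctx.run(lambda: calls.append("ship") or "shipped")
    return [first, second]


journal = Journal()
try:
    handler(Context(journal), crash_after_first_step=True)
except RuntimeError:
    pass  # the journal survives the failed attempt

# The retry replays the first step from the journal and only executes the second.
result = handler(Context(journal), crash_after_first_step=False)
```

In this sketch the retry does not repeat the "charge" side effect, because its result is read back from the journal. That is the property a journal section could explain.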
docs/restate/architecture.md
Outdated
The *Metas* are responsible for managing the service meta information and coordinating the *Workers*.

The *Workers* are responsible for invoking services, storing their journal and service state as well as maintaining processing order.
We didn't introduce the term journal yet. Probably would be better to describe the responsibility from a higher perspective and then say we accomplish that via having a central journal.
I think the journal approach in general deserves a section here. To explain how we do all our magic: resiliency etc.
Thanks for the feedback @gvdongen. I will add sections for the journal and service invocation process.
I've pushed another commit including the description of durable execution via journaling, the service registry and the service invocation flow @gvdongen.
Thank you for adding the new sections @tillrohrmann
I think this adds a lot of useful information for the user!
I think the reading flow could be improved by shuffling some sections around... I would propose changing the order of the sections to
- Durable execution via journaling (because from the user perspective this is the most important building block that they need to understand)
- Service invocation flow (includes service registry section... not sure if that would improve the reading flow)
- Scalability
- Consistency & fault tolerance (although in the mental model this belongs together with the journal for durable execution for me...)
- State storage (include state queries into the section or maybe skip that for this page... For me this is more like a feature than an architecture component...)
What do you think?
## Service registry

All servie meta information is maintained by the *Metas* via the service registry.
All servie meta information is maintained by the *Metas* via the service registry.
All service meta information is maintained by the *Metas* via the service registry.
It seems that you have some other expectations for the architecture page than what I thought @gvdongen. My understanding for this page was to describe the runtime's architecture (basic principles and design ideas) in order to give credibility to what we are doing (like the runtime is built with scalability, consistency and fault tolerance in mind). Maybe you had more the whole of Restate in mind (what are the basic concepts you as a user need to understand, how do things work end-to-end from a higher level)?
I am wondering whether durable execution is something that belongs on the architecture page of how the runtime works or should be closer to the "Services" section. Technically speaking, one could also implement durable execution by taking a memory snapshot, or it could be a pure SDK concept. What matters from the runtime perspective is that the service endpoint can durably store bytes (not 100% correct because the runtime also needs to understand a few commands like calls or sleeps). Also, in my description I talk more about the service endpoint than the runtime, which might be an indicator.
Moving the service invocation flow up would mean that the definition of partitions and partition processors would only come later. It might also not be clear why one needs to route the invocation to the right Worker running a specific partition processor at this point.
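To illustrate why the partition concept has to come before the routing step, here is a minimal sketch of key-based routing. The hash function, the partition count, and the partition-to-Worker assignment are all assumptions made up for illustration. They are not how the runtime actually assigns partitions.

```python
import hashlib
from typing import Dict

NUM_PARTITIONS = 8  # assumption: a fixed number of partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a service key to a partition with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical assignment of partition processors to three Workers.
partition_to_worker: Dict[int, str] = {
    p: f"worker-{p % 3}" for p in range(NUM_PARTITIONS)
}


def route(key: str) -> str:
    """Return the Worker running the partition processor that owns this key."""
    return partition_to_worker[partition_for(key)]
```

Because the hash is stable, every invocation for the same key lands on the same partition processor, which is what lets that processor maintain the processing order for the key.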
For me these are two different pairs of shoes. What I want to describe here is how the runtime achieves consistency and fault tolerance (by running replicated state machines using Raft). What is built on top of it (durable execution via journaling) is certainly related but is just one way of how to achieve durable execution. If we could take a memory snapshot of the service endpoint, then storing these bytes would work equally well.
I would like to keep the state query part because for me it is a major architectural component (exposing internal state via a SQL interface by running a SQL execution engine) and it is, technically speaking, independent of the actual state storage.
I've pushed a commit that groups scalability and consistency & fault tolerance under principles and state storage, state query and service registry under components. Not sure whether this makes the reading experience easier.
First of all, sorry for being so difficult there... I think all the content here is good so feel free to merge it. I don't want to block this... Besides that, it seems indeed that we have slightly different views of the scope of what should be on there... I mainly saw this page as a page where we describe how Restate makes sure that it can do what it does, so that it doesn't just seem like magic to users. I think what you wrote until now is definitely content that should be covered there... But I think I saw this page as slightly broader, so more as the architecture of Restate-as-a-larger-product instead of focused on the runtime. The way I saw it, was to have:
Anyway, let's not block this on my feedback here. Because this page contains a lot of useful information for the user and we can always iterate to improve the story that we are telling across the docs 👍
I'll try to give it another pass to improve the overall reading experience by highlighting what matters most to users. If I don't manage to improve it, then we'll iterate on what we have in this PR here.
This fixes #92.
This commit restructures the architecture section to start with durable execution and the service invocation flow. The runtime specific sections are now grouped under "Runtime".
I've re-arranged the sections into: