feat(dgw): persistent job queue for crash resistance #1108
Hackathon: Crash and kill signal resistance using a persistent job queue via Turso’s libSQL
Motivation
This year we added some background tasks in the Gateway that should not be canceled or, if they are, should be restarted later. There are essentially two such tasks: mass deletion of recordings (relatively important, but it's always possible to launch indexing in DVLS in case of a problem) and remuxing recordings to webm format (good to have). If the service is killed in the middle of one of these operations, we should resume execution on the next startup.
Demo
rec2024-11-14-12-46-44.webm
Implementation
This persistent job queue is implemented using Turso’s libSQL. Using libSQL (or SQLite) to implement the queue allows us to benefit from all the work put into building a reliable, secure and performant disk-based database, instead of attempting to implement our own ad-hoc storage and debugging it forever.
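To illustrate the general idea, here is a minimal sketch of a disk-backed queue that survives a kill. It uses Python's standard sqlite3 module as a stand-in for libSQL, and the `job_queue` table, its columns, and the function names are all hypothetical; this is not the schema or code from this PR. It also assumes a single worker process.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
);
"""

def open_queue(path):
    """Open the queue and requeue jobs interrupted by a crash or kill."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    with conn:
        # Anything still marked 'running' at startup never finished.
        conn.execute("UPDATE job_queue SET status = 'queued' WHERE status = 'running'")
    return conn

def push(conn, name):
    """Persist a new job before any work starts."""
    with conn:
        conn.execute("INSERT INTO job_queue (name) VALUES (?)", (name,))

def claim(conn):
    """Mark the oldest queued job as running; return (id, name) or None."""
    with conn:
        row = conn.execute(
            "SELECT id, name FROM job_queue WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE job_queue SET status = 'running' WHERE id = ?", (row[0],))
        return row

def complete(conn, job_id):
    """Remove a job once its work is done."""
    with conn:
        conn.execute("DELETE FROM job_queue WHERE id = ?", (job_id,))
```

Because a job is only deleted once it completes, a job that was running when the process died is picked up again the next time `open_queue` runs.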
Inspiration was taken from 37signals' Solid Queue:
And "How to build a job queue with Rust and PostgreSQL" from kerkour.com:
The 'user_version' value, which is a SQLite PRAGMA, is used to keep track of the migration state. It's a very lightweight approach as it is just an integer at a fixed offset in the SQLite file.
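As a rough sketch of this approach (again with Python's standard sqlite3 module rather than libSQL, and with a made-up `MIGRATIONS` list that is not the one from this PR), the stored 'user_version' can be read as the number of migrations already applied:

```python
import sqlite3

# Hypothetical migrations, for illustration only.
# The statement at index i upgrades the database from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE job_queue (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE job_queue ADD COLUMN status TEXT NOT NULL DEFAULT 'queued'",
]

def migrate(conn):
    """Apply pending migrations and return the resulting user_version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for target, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(statement)
        # PRAGMA values cannot be bound as parameters; target is always an int.
        conn.execute(f"PRAGMA user_version = {target}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

A fresh database reports version 0, so every migration runs once; calling `migrate` again on an up-to-date database is a no-op.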
Why Turso’s libSQL?
Introducing Turso’s libSQL, as opposed to SQLite, will serve us for "Recording Farms" in the future. We’ll want instances of the same Recording Farm to coordinate, and at that point we’ll want to use Turso's libSQL network database feature. Indeed, putting the SQLite database file on a virtual filesystem is not recommended, as this can lead to corruption and data loss. Turso will allow us to have a local mode for the simplest setups, and a networked, distributed mode for Recording Farms when we get there.