Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feature: add concepts and definitions
Showing 6 changed files with 135 additions and 13 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Advanced concepts and definitions | ||
|
||
The following concepts are not required to understand what follows; you can come back to them as needed. | ||
|
||
### Worker pools | ||
A ***worker pool*** is a collection of [task workers](./concepts-basic#workers) running the **same [async app](./concepts-basic#app)** on the **same machine**. | ||
|
||
A worker pool can be created and started using the [`icij-worker`](https://github.com/ICIJ/icij-python/tree/main/icij-worker) CLI. | ||
|
||
### Broker | ||
|
||
The ***broker*** is the messaging service and protocol that lets Datashare's [task manager](./concepts-basic#task-manager) and [task workers](./concepts-basic#workers) communicate with each other. | ||
|
||
Behind the scenes, we use a custom task protocol built on top of [RabbitMQ](https://www.rabbitmq.com/)'s [AMQP protocol](https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol). | ||
|
||
### Queues | ||
***Queues*** hold the messages sent over the [broker](#broker). Messages can be routed to specific queues so that they are read by specific agents (the [task manager](./concepts-basic#task-manager) or some specific [workers](./concepts-basic#workers)). | ||
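
As a rough illustration of routing, the sketch below uses Python's in-memory `queue.Queue` as a stand-in for broker queues. The queue names (`cpu`, `gpu`) and the `route` helper are hypothetical; this is not how Datashare or RabbitMQ is actually wired.

```python
import queue

# In-memory stand-ins for broker queues; the names are hypothetical.
queues = {"cpu": queue.Queue(), "gpu": queue.Queue()}


def route(message: dict, routing_key: str) -> None:
    """Put the message on the queue matching the routing key."""
    queues[routing_key].put(message)


route({"task": "preprocess", "args": {"doc_id": "1"}}, routing_key="cpu")
route({"task": "infer", "args": {"doc_id": "1"}}, routing_key="gpu")

# A consumer only reads from the queue it listens to.
gpu_message = queues["gpu"].get_nowait()
```

Here the consumer of the `gpu` queue never sees the message routed to `cpu`, which is the point of routing.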
|
||
While there can be several workers, **there is a single task manager**, and it's already running inside Datashare's backend, so most of the time you don't have to worry about it! | ||
|
||
### Task States | ||
|
||
A task can have different ***task states***: | ||
```python | ||
from enum import Enum | ||
|
||
class TaskState(str, Enum): | ||
    # When created through the TaskManager | ||
    CREATED = "CREATED" | ||
    # When published to some queue, but not grabbed by a worker yet | ||
    QUEUED = "QUEUED" | ||
    # When grabbed and being executed by a worker | ||
    RUNNING = "RUNNING" | ||
    # When the worker couldn't execute the task for some reason | ||
    ERROR = "ERROR" | ||
    # When successful | ||
    DONE = "DONE" | ||
    # When cancelled by the user | ||
    CANCELLED = "CANCELLED" | ||
``` | ||
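
The comments above suggest a typical lifecycle. The sketch below encodes one plausible reading of it. The `ALLOWED_TRANSITIONS` table and the `can_transition` helper are assumptions made for illustration (retries, for instance, are ignored) and are not part of the documented API.

```python
from enum import Enum


class TaskState(str, Enum):
    CREATED = "CREATED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    ERROR = "ERROR"
    DONE = "DONE"
    CANCELLED = "CANCELLED"


# Hypothetical transition table inferred from the comments on each state.
ALLOWED_TRANSITIONS = {
    TaskState.CREATED: {TaskState.QUEUED, TaskState.CANCELLED},
    TaskState.QUEUED: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {TaskState.DONE, TaskState.ERROR, TaskState.CANCELLED},
    # Assumed final states: nothing follows them in this sketch.
    TaskState.ERROR: set(),
    TaskState.DONE: set(),
    TaskState.CANCELLED: set(),
}


def can_transition(current: TaskState, new: TaskState) -> bool:
    """Return True if moving from `current` to `new` is allowed in this sketch."""
    return new in ALLOWED_TRANSITIONS[current]
```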
|
||
### Task Groups | ||
|
||
Because it can be convenient to implement tasks meant to be executed by different types of [workers](./concepts-basic.md#workers) in the same [app](./concepts-basic.md#app), | ||
tasks can be grouped together to be executed by a given type of worker. | ||
|
||
In an ML context, it's common to run preprocessing tasks, which require memory and CPU, and then run ML inference, which requires a GPU. | ||
Because it's convenient to implement all these tasks as part of the same codebase (in particular for testing) and to register them in the same [app](./concepts-basic.md#app), tasks can be assigned to a group. | ||
|
||
When starting a worker, it's possible to provide the group of tasks it should execute. In the above case, we could for instance define a `cpu` group and a `gpu` group and split the tasks between them. |
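
A minimal sketch of the idea, using a plain dictionary as a hypothetical task registry. The `register` decorator, the task names, and `tasks_for_group` are illustrative only and do not reflect the actual `icij-worker` API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: task name -> (function, group)
REGISTRY: Dict[str, Tuple[Callable, Optional[str]]] = {}


def register(name: str, group: Optional[str] = None):
    """Register a function under `name`, optionally assigning it to a group."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = (fn, group)
        return fn
    return decorator


@register("preprocess", group="cpu")
def preprocess(text: str) -> str:
    return text.strip().lower()


@register("infer", group="gpu")
def infer(text: str) -> int:
    # Placeholder for a GPU-bound model call
    return len(text)


def tasks_for_group(group: str) -> List[str]:
    """Names of the tasks a worker started for `group` would execute."""
    return sorted(name for name, (_, g) in REGISTRY.items() if g == group)
```

Both tasks live in the same codebase and registry, yet a worker started for the `gpu` group would only pick up `infer`.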
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
# Basic concepts and definitions | ||
|
||
Before starting, here are a few definitions of concepts that we'll regularly use in this documentation. | ||
|
||
The following concepts are important for the rest of this tutorial, so make sure you understand them properly! | ||
|
||
## Definitions | ||
|
||
### Tasks | ||
|
||
***Tasks*** (a.k.a. *"async tasks"* or *"asynchronous tasks"*) are **units of work** that can be | ||
executed [asynchronously](#asynchronous). | ||
|
||
Datashare has its own **built-in** tasks such as indexing documents, finding named entities, and performing searches or | ||
downloads in batches. Tasks are visible on [Datashare's tasks page](https://datashare-demo.icij.org/#/tasks). | ||
|
||
The goal of this documentation is to let you implement **your own custom tasks**. They could virtually be anything: | ||
|
||
- classifying documents | ||
- extracting named entities from documents | ||
- extracting structured content from documents | ||
- translating documents | ||
- tagging documents | ||
- ... | ||
|
||
### Asynchronous | ||
|
||
In our context, ***asynchronous*** means *"executed in the background"*. Since tasks can take a long time, getting their result is | ||
not as simple as calling an API endpoint. | ||
|
||
Instead, executing a task asynchronously implies: | ||
|
||
1. requesting the task execution by publishing the task name and arguments (the parameters needed to perform the task) on | ||
   the [broker](./concepts-advanced#broker) | ||
2. receiving the task name and arguments from the broker and performing the actual task in the background inside | ||
   a [task worker](#workers) (optionally publishing progress updates on the broker) | ||
3. monitoring the task progress | ||
4. saving the task results or errors | ||
5. accessing the task results or errors | ||
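
The five steps above can be sketched with the standard library alone. This is a toy, single-process illustration: `queue.Queue` stands in for the broker, plain dictionaries stand in for result and progress storage, and the `count_words` task and `task-0` ID are hypothetical.

```python
import queue
import threading

broker = queue.Queue()  # stand-in for the broker
results = {}            # stand-in for result storage
progress = {}           # stand-in for progress updates


def count_words(text: str) -> int:
    return len(text.split())


TASKS = {"count_words": count_words}


def worker() -> None:
    # Step 2: receive the task name and arguments, run the task in the background
    task_id, task_name, task_args = broker.get()
    progress[task_id] = 0.0
    try:
        # Step 4: save the result...
        results[task_id] = ("DONE", TASKS[task_name](**task_args))
    except Exception as e:
        # ... or the error
        results[task_id] = ("ERROR", repr(e))
    progress[task_id] = 1.0


# Step 1: request the execution by publishing the task name and arguments
broker.put(("task-0", "count_words", {"text": "hello async world"}))
thread = threading.Thread(target=worker)
thread.start()
# Step 3: monitor the task (here we simply wait for completion)
thread.join()
# Step 5: access the result
state, result = results["task-0"]
```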
|
||
### Workers | ||
|
||
***Workers*** (a.k.a. *"task workers"*) are **Python programs that run [async tasks](#tasks)** in an infinite loop. | ||
|
||
The pseudocode for the worker loop is: | ||
|
||
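```python | ||
# Pseudocode: get_next_task, get_task_function_by_name, save_error and | ||
# save_result are placeholders, not actual functions. | ||
while True: | ||
    # Wait for the next task published on the broker | ||
    task_name, task_args = get_next_task() | ||
    # Look up the function registered under that name | ||
    task_fn = get_task_function_by_name(task_name) | ||
    try: | ||
        result = task_fn(**task_args) | ||
    except Exception as e: | ||
        # Save the error and move on to the next task | ||
        save_error(e) | ||
        continue | ||
    save_result(result) | ||
``` | ||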
|
||
### Task Manager | ||
|
||
The ***task manager*** is the primary interface to interact with tasks. The task manager lets us: | ||
|
||
- create [tasks](#tasks) and send them to [workers](#workers) | ||
- post [task state](concepts-advanced.md#task-states) and progress updates | ||
- monitor [task state](concepts-advanced.md#task-states) and progress | ||
- get task results and errors | ||
- cancel tasks | ||
- ... | ||
|
||
### App | ||
|
||
***Apps*** (a.k.a. *"async apps"*) are **collections of [tasks](#tasks)**. They act as a registry, binding each task name | ||
to an actual unit of work (i.e. a Python function). | ||
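
To make the registry idea concrete, here is a minimal sketch. `MiniApp` is a hypothetical class written for this example, not the actual app class, and it registers each function under its own name for simplicity.

```python
from typing import Callable, Dict


class MiniApp:
    """Hypothetical, minimal task registry (not the real app class)."""

    def __init__(self, name: str):
        self.name = name
        self.registry: Dict[str, Callable] = {}

    def task(self, fn: Callable) -> Callable:
        # Bind the function's name to the function itself
        self.registry[fn.__name__] = fn
        return fn


app = MiniApp("my-app")


@app.task
def hello(person: str) -> str:
    return f"Hello {person}!"


# A worker would look the function up by name before calling it
greeting = app.registry["hello"](person="world")
```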
|
||
## Next | ||
|
||
- skip directly to learn more about [tasks](tasks.md) | ||
- or continue to learn about [advanced concepts](concepts-advanced.md) |
This file was deleted.