feature: add concepts and definitions
ClemDoum committed Dec 13, 2024
1 parent db91f96 commit 58b852c
Showing 6 changed files with 135 additions and 13 deletions.
3 changes: 2 additions & 1 deletion docs/faq.md
@@ -5,4 +5,5 @@
- https://dagster.io/
- https://docs.celeryq.dev/en/stable/
- communicate with datashare
-

### Can I create tasks for the local version of Datashare?
8 changes: 4 additions & 4 deletions docs/index.md
@@ -12,7 +12,7 @@
</p>
<br/>

# Implement your own Datashare tasks, written in Python
# Implement **your own Datashare tasks**, written in Python

Most AI, Machine Learning, Data Engineering happens in Python.
[Datashare](https://icij.gitbook.io/datashare) now lets you extend its backend with your own tasks implemented in Python.
@@ -64,14 +64,14 @@ you'll then be able to execute task by starting using our [HTTP client]() (and s

[//]: # (TODO: add a link to the HTTP task creation guide)

## Learn
## **Learn**

Learn how to integrate Data Processing and Machine Learning pipelines into Datashare by following our [tutorial](./learn/tasks.md).

## Get started
## **Get started**

Follow our [get started](get-started/index.md) guide and learn how to clone the [template repository](https://github.com/ICIJ/datashare-ml-worker-template) and implement your own Datashare tasks!

## Refine your knowledge
## **Refine your knowledge**

Follow our [guides](guides/index.md) to learn how to implement complex tasks and deploy Datashare workers running your own tasks.
48 changes: 48 additions & 0 deletions docs/learn/concepts-advanced.md
@@ -0,0 +1,48 @@
# Advanced concepts and definitions

The following concepts are not required to understand what follows; you can come back to them as needed.

### Worker pools
A ***worker pool*** is a collection of [task workers](./concepts-basic#workers) running the **same [async app](./concepts-basic#app)** on the **same machine**.

A worker pool can be created and started using the [`icij-worker`](https://github.com/ICIJ/icij-python/tree/main/icij-worker) CLI.

### Broker

We call ***broker*** the messaging service and protocol that allows Datashare's [task manager](./concepts-basic#task-manager) and [task workers](./concepts-basic#workers) to communicate with each other.

Behind the scenes, we use a custom task protocol built on top of [RabbitMQ](https://www.rabbitmq.com/)'s [AMQP protocol](https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol).

### Queues
***Queues*** are ordered lists of messages sent over the [broker](#broker). It's possible to route messages to specific queues so that they are read by specific agents (the [task manager](./concepts-basic#task-manager) or some specific [workers](./concepts-basic#workers)).

While there can be several workers, **there is a single task manager**, and it's already running inside Datashare's backend, so most of the time you don't have to care about it!
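
As a rough illustration of routing, the sketch below uses Python's standard `queue.Queue` as an in-memory stand-in for broker queues. The queue names (`"manager"`, `"worker-cpu"`) and the `publish` helper are hypothetical; they are not part of Datashare or RabbitMQ, where routing happens over AMQP.

```python
from queue import Queue

# Hypothetical queue names; a real broker would declare these over AMQP.
queues = {"manager": Queue(), "worker-cpu": Queue()}


def publish(routing_key: str, message: dict) -> None:
    """Route a message to the queue matching the routing key."""
    queues[routing_key].put(message)


# The task manager publishes two tasks meant for workers...
publish("worker-cpu", {"task": "translate", "args": {"doc_id": "doc-1"}})
publish("worker-cpu", {"task": "translate", "args": {"doc_id": "doc-2"}})
# ...and a worker publishes a progress update meant for the task manager.
publish("manager", {"task": "translate", "progress": 0.5})

# Each agent only reads its own queue, in publication order.
first = queues["worker-cpu"].get()
second = queues["worker-cpu"].get()
manager_message = queues["manager"].get()
```

In this sketch the worker never sees the progress update and the task manager never sees the task requests, which is the point of routing messages to dedicated queues.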

### Task States

A task can have different ***task states***:
```python
from enum import Enum


class TaskState(str, Enum):
    # When created through the TaskManager
    CREATED = "CREATED"
    # When published to some queue, but not grabbed by a worker yet
    QUEUED = "QUEUED"
    # When grabbed by and executed by a worker
    RUNNING = "RUNNING"
    # When the worker couldn't execute the task for some reason
    ERROR = "ERROR"
    # When successful
    DONE = "DONE"
    # When cancelled by the user
    CANCELLED = "CANCELLED"
```
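
To show how client code might use these states, here is a minimal sketch. It re-declares the enum to stay self-contained and adds a hypothetical `is_finished` helper, which is not part of the Datashare API. Treating `ERROR`, `DONE` and `CANCELLED` as final states is an assumption made for this example.

```python
from enum import Enum


class TaskState(str, Enum):
    CREATED = "CREATED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    ERROR = "ERROR"
    DONE = "DONE"
    CANCELLED = "CANCELLED"


# Assumption for this sketch: a task in one of these states won't run further.
FINAL_STATES = frozenset({TaskState.ERROR, TaskState.DONE, TaskState.CANCELLED})


def is_finished(state: TaskState) -> bool:
    """Hypothetical helper telling whether polling a task can stop."""
    return state in FINAL_STATES


# Because TaskState mixes in str, a raw string (for instance read from JSON)
# can be converted to a member, and members compare equal to raw strings.
state = TaskState("RUNNING")
```

A polling loop could then call `is_finished(state)` after each state update to decide whether to keep waiting.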

### Task Groups

Because it can be convenient to implement tasks meant to be executed by different types of [workers](./concepts-basic.md#workers) in the same [app](./concepts-basic.md#app),
tasks can be grouped together to be executed by a given type of worker.

In an ML context, it's common to run preprocessing tasks, which require memory and CPU, and then ML inference, which requires a GPU.
Because it's convenient to implement all these tasks as part of the same codebase (in particular for testing) and to register them in the same [app](./concepts-basic.md#app), tasks can be assigned to a group.

When starting a worker, it's possible to provide the group of tasks it should execute. In the above case, we could for instance define a `cpu` group and a `gpu` group and split tasks between them.
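
The idea can be sketched with a plain Python registry. Everything below (the `task` decorator, `TASKS_BY_GROUP`, `tasks_for_worker`) is hypothetical and only illustrates the grouping logic; it is not the actual registration API.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical registry: group name -> {task name -> task function}.
TASKS_BY_GROUP: dict[str, dict[str, Callable]] = defaultdict(dict)


def task(name: str, group: str) -> Callable[[Callable], Callable]:
    """Hypothetical decorator registering a function as a task in a group."""

    def decorator(fn: Callable) -> Callable:
        TASKS_BY_GROUP[group][name] = fn
        return fn

    return decorator


@task(name="preprocess", group="cpu")
def preprocess(text: str) -> str:
    return text.strip().lower()


@task(name="classify", group="gpu")
def classify(text: str) -> str:
    # Placeholder for a real model inference running on a GPU.
    return "long" if len(text) > 10 else "short"


def tasks_for_worker(group: str) -> dict[str, Callable]:
    """Return only the tasks that a worker started for `group` should execute."""
    return dict(TASKS_BY_GROUP[group])


cpu_tasks = tasks_for_worker("cpu")
gpu_tasks = tasks_for_worker("gpu")
```

Both tasks live in the same codebase and can be tested together, while a worker started for the `cpu` group only ever sees `preprocess`.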
78 changes: 78 additions & 0 deletions docs/learn/concepts-basic.md
@@ -0,0 +1,78 @@
# Basic concepts and definitions

Before starting, here are a few definitions of concepts that we'll regularly use in this documentation.

The following concepts are important for the rest of this tutorial, so make sure you understand them properly!

## Definitions

### Tasks

***Tasks*** (a.k.a. *"async tasks"* or *"asynchronous tasks"*) are **units of work** that can be
executed [asynchronously](#asynchronous).

Datashare has its own **built-in** tasks such as indexing documents, finding named entities, performing searches or
downloads in batches... Tasks are visible on [Datashare's tasks page](https://datashare-demo.icij.org/#/tasks).

The goal of this documentation is to let you implement **your own custom tasks**. They could be virtually anything:

- classifying documents
- extracting named entities from documents
- extracting structured content from documents
- translating documents
- tagging documents
- ...

### Asynchronous

In our context, ***asynchronous*** means *"executed in the background"*. Since tasks can take a long time, getting their result is
not as simple as calling an API endpoint.

Instead, executing a task asynchronously implies:

1. requesting the task execution by publishing the task name and arguments (parameters needed to perform the task) on
the [broker](./concepts-advanced#broker)
2. receiving the task name and arguments from the broker and performing the actual task in the background inside
a [task worker](#workers) (optionally publishing progress updates on the broker)
3. monitoring the task progress
4. saving task results or errors
5. accessing the task results or errors
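
The steps above can be sketched with Python's standard `concurrent.futures` module. This is only an analogy: a thread pool stands in for the broker and the worker, and `count_words` is a hypothetical task, not one of Datashare's.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_words(text: str) -> int:
    """A stand-in for a long-running task."""
    time.sleep(0.1)
    return len(text.split())


with ThreadPoolExecutor(max_workers=1) as executor:
    # 1. request the task execution; submit returns immediately
    future = executor.submit(count_words, text="hello async world")
    # 2. the task is now executed in a background thread
    # 3. monitor the task until it completes
    while not future.done():
        time.sleep(0.01)
    # 4. the result or the error has been saved on the future...
    error = future.exception()
    # 5. ...and can be accessed from there
    result = future.result() if error is None else None
```

The caller is free to do other work between submitting the task and reading its result, which is what distinguishes this flow from a blocking API call.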

### Workers

***Workers*** (a.k.a. *"async apps"*) are **Python programs running [async tasks](#tasks)** in an infinite loop.

The pseudocode for the worker loop is:

```python
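# Pseudocode: get_next_task, get_task_function_by_name, save_error and
# save_result are placeholders, not actual functions.
while True:
    task_name, task_args = get_next_task()
    task_fn = get_task_function_by_name(task_name)
    try:
        result = task_fn(**task_args)
    except Exception as e:
        save_error(e)
        continue
    save_result(result)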
```
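
For readers who want to run the loop, here is a self-contained sketch of the same logic. It assumes an in-memory `queue.Queue` in place of the broker, a hypothetical `TASKS` dictionary in place of the app registry, and plain lists in place of result and error storage. Unlike a real worker, it stops when the queue is empty.

```python
from queue import Empty, Queue

# Hypothetical registry binding task names to Python functions.
TASKS = {
    "add": lambda a, b: a + b,
    "divide": lambda a, b: a / b,
}

# In-memory stand-in for the broker queue.
tasks_queue = Queue()
tasks_queue.put(("add", {"a": 1, "b": 2}))
tasks_queue.put(("divide", {"a": 1, "b": 0}))  # raises ZeroDivisionError

results, errors = [], []

while True:
    try:
        task_name, task_args = tasks_queue.get_nowait()
    except Empty:
        # A real worker would keep waiting; this sketch stops when idle.
        break
    task_fn = TASKS[task_name]
    try:
        result = task_fn(**task_args)
    except Exception as e:
        # Save the error and move on to the next task.
        errors.append((task_name, e))
        continue
    results.append((task_name, result))
```

Note that a failing task does not stop the loop: its error is recorded and the worker moves on to the next task.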

### Task Manager

The ***task manager*** is the primary interface to interact with tasks. The task manager lets us:

- create [tasks](#tasks) and send them to [workers](#workers)
- post [task state](concepts-advanced.md#task-states) and progress updates
- monitor [task state](concepts-advanced.md#task-states) and progress
- get task results and errors
- cancel tasks
- ...

### App

***Apps*** (a.k.a. *"async apps"*) are **collections of [tasks](#tasks)**. They act as a registry and bind a task name
to an actual unit of work (i.e. a Python function).

## Next

- skip directly to learn more about [tasks](tasks.md)
- or continue to learn about [advanced concepts](concepts-advanced.md)
7 changes: 0 additions & 7 deletions docs/learn/concepts.md

This file was deleted.

4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -65,7 +65,9 @@ plugins:
nav:
  - Datashare Python: index.md
  - Learn:
    # - Concepts: learn/concepts.md
    - Concepts and definitions:
      - Basic: learn/concepts-basic.md
      - Advanced: learn/concepts-advanced.md
    - From functions to tasks:
      - Creating async tasks: learn/tasks.md
      - Creating an async app: learn/app.md