feature: get started with basic worker
Showing 12 changed files with 221 additions and 15 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Build your worker image (WIP) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# How to use the worker template? | ||
|
||
The [datashare-python](https://github.com/ICIJ/datashare-python) repository is meant to be used as a template to implement your own Datashare worker. | ||
|
||
## Clone the template repository | ||
|
||
Start by cloning the [template repository](https://github.com/ICIJ/datashare-python): | ||
|
||
<!-- termynal --> | ||
```console | ||
$ git clone git@github.com:ICIJ/datashare-python.git | ||
---> 100% | ||
``` | ||
|
||
## Explore the codebase | ||
|
||
In addition to being used as a template, the repository also showcases some of the advanced schemes detailed in the | ||
[guides](../../guides/index.md) section of this documentation. | ||
|
||
Don't hesitate to have a look at the codebase before starting (or to come back to it later on)! | ||
|
||
In particular the following files should be of interest: | ||
```console | ||
. | ||
├── ml_worker | ||
│ ├── app.py | ||
│ ├── config.py | ||
│ ├── tasks | ||
│ │ ├── __init__.py | ||
│ │ ├── classify_docs.py | ||
│ │ ├── dependencies.py | ||
│ │ └── translate_docs.py | ||
``` | ||
|
||
|
||
## Replace existing tasks with your own | ||
|
||
To implement your Datashare worker, the only thing you have to do is to **replace existing tasks with your own and | ||
register them in the `app` variable of the `app.py` file.** | ||
|
||
We'll detail how to do so in the [Basic Worker](./worker-basic.md) and [Advanced Worker](./worker-advanced.md) examples. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Implement your own Datashare worker | ||
|
||
## Clone the template repository | ||
|
||
Start by cloning the [template repository](https://github.com/ICIJ/datashare-python): | ||
|
||
<!-- termynal --> | ||
```console | ||
$ git clone git@github.com:ICIJ/datashare-python.git | ||
---> 100% | ||
``` | ||
|
||
## Install dependencies | ||
|
||
Install [`uv`](https://docs.astral.sh/uv/getting-started/installation/) and install dependencies: | ||
<!-- termynal --> | ||
```console | ||
$ curl -LsSf https://astral.sh/uv/install.sh | sh | ||
$ uv sync --frozen --group dev | ||
``` | ||
|
||
## Implement your own tasks | ||
|
||
The template repository already contains some examples of tasks performing document translation and classification. | ||
You can keep them for now as a model to implement your own task, but we'll eventually want to **get rid of them (and their tests)**. | ||
|
||
For the sake of example, let's add a dummy [TF-IDF-based](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector store to | ||
Datashare and add different tasks to our [async app](../../learn/concepts-basic.md#app): | ||
|
||
- the `create_vectorization_tasks` task scans the Datashare index, fits a [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and persists it to the filesystem. | ||
It then creates `vectorize_docs` tasks, each one grouping a batch of documents to vectorize | ||
- the `vectorize_docs` task receives a batch of doc IDs, loads the vectorizer, vectorizes the docs and persists the vectors to the filesystem | ||
- the `find_most_similar` task receives a batch of doc IDs and finds their nearest neighbors in the vector database | ||
|
||
!!! tip | ||
The `create_vectorization_tasks` task may seem artificial; however, when the vectorization workload is high, it's useful to distribute it across workers. | ||
Having a first task splitting a large task into smaller ones allows us to distribute the computationally expensive task across several workers. | ||
In our case, it probably adds unnecessary complexity, but it's here for demo purposes. | ||
|
||
Learn more about how to implement complex workflows and workload distribution in the [task workflow guide](../../guides/task-workflows.md)! | ||
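The splitting step described in the tip can be sketched in plain Python. This is a minimal illustration and not the template's code: the `batch_doc_ids` helper, the doc IDs and the batch size are hypothetical names and values chosen for the example. In an actual worker, each batch would presumably become the arguments of one `vectorize_docs` task.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batch_doc_ids(doc_ids: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `batch_size` doc IDs (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(doc_ids)
    # islice consumes the iterator chunk by chunk; an empty chunk ends the loop
    while batch := list(islice(it, batch_size)):
        yield batch


# Made-up doc IDs; in practice they would come from scanning the Datashare index
doc_ids = [f"doc-{i}" for i in range(5)]
batches = list(batch_doc_ids(doc_ids, batch_size=2))
```

Because the helper works on any iterable, the IDs do not need to be loaded in memory all at once before being split.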
|
||
|
||
### Updating dependencies | ||
|
||
Let's add [scikit-learn](https://scikit-learn.org/stable/index.html) and [pandas](https://pandas.pydata.org/docs/index.html) as dependencies to our project: | ||
<!-- termynal --> | ||
```console | ||
$ uv add scikit-learn pandas | ||
``` | ||
|
||
### Update dependency injection | ||
Minimal: config + loggers | ||
|
||
If your tasks require dependencies to be injected (config, think about DB clients) | ||
|
||
## Test | ||
|
||
## Register your tasks in the `app` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# Basic Datashare worker | ||
|
||
## Install dependencies | ||
|
||
Start by installing [`uv`](https://docs.astral.sh/uv/getting-started/installation/) and dependencies: | ||
<!-- termynal --> | ||
```console | ||
$ curl -LsSf https://astral.sh/uv/install.sh | sh | ||
$ uv sync --frozen --group dev | ||
``` | ||
|
||
## Implement the `hello_user` task function | ||
|
||
As seen in the [task tutorial](../../learn/tasks.md#task-arguments), one of the simplest tasks we can implement takes | ||
the `:::python user: dict | None` argument automatically added by Datashare to all tasks and greets that user. | ||
|
||
The function performing this task is the following: | ||
|
||
```python | ||
--8<-- | ||
basic_app.py:hello_user_fn | ||
--8<-- | ||
``` | ||
|
||
## Register the `hello_user` task | ||
|
||
In order to turn our function into a Datashare [task](../../learn/concepts-basic.md#tasks), we have to register it in the | ||
`:::python app` [async app](../../learn/concepts-basic.md#app) variable of the [app.py](../../../ml_worker/app.py) file, using the `:::python @task` decorator. | ||
|
||
Since we won't use existing tasks, we can also perform some cleaning and get rid of them. | ||
The `app.py` file should hence look like this: | ||
|
||
```python title="app.py" hl_lines="9" | ||
--8<-- | ||
basic_app.py:app | ||
--8<-- | ||
``` | ||
|
||
The only thing we had to do is to use the `:::python @app.task` decorator, making sure to provide it with a | ||
`:::python name` to **bind the function to a task name**, and to group the task in the `:::python PYTHON_TASK_GROUP = TaskGroup(name="PYTHON")`. | ||
|
||
As detailed [here](../../learn/datashare-app.md#grouping-our-tasks-in-the-python-task-group), using this task group | ||
ensures that when custom tasks are published for execution, they are correctly routed to your custom Python worker and | ||
not to the Java built-in workers running behind Datashare's backend. | ||
|
||
## Get rid of unused codebase | ||
|
||
## Next | ||
|
||
Now that you have created a basic app, you can either: | ||
|
||
- learn how to [build a docker image](../build.md) from it | ||
- learn how to implement a more realistic worker in the [advanced example](./worker-advanced.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,5 @@ | ||
# Get started (WIP) | ||
# About | ||
|
||
This section is a step-by-step guide to create and deploy your own Datashare tasks. | ||
|
||
You might want to [learn](../learn/index.md) the basics before actually starting to implement your own worker. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# Run using `docker compose` (WIP) | ||
|
||
## Publish tasks |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
# About | ||
|
||
This section is a tutorial and will guide you through the basic steps of building your own tasks for Datashare **from scratch**. | ||
This section is a tutorial and will guide you through the basic steps theoretically required for building your own tasks for Datashare **from scratch**. | ||
|
||
Rest assured, **in practice, you won't have to start anything from scratch**. The [get started](../get-started/index.md) section will show you how to create Datashare tasks by cloning the [datashare-python](https://github.com/ICIJ/datashare-python) template repo. | ||
Don't worry, **in practice, you won't have to start anything from scratch**. The [get started](../get-started/index.md) section will show you how to create Datashare tasks by cloning the [datashare-python](https://github.com/ICIJ/datashare-python) template repo. | ||
|
||
However, this section will teach you the basic steps it took to build this template. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# --8<-- [start:app] | ||
from icij_worker import AsyncApp | ||
from icij_worker.app import TaskGroup | ||
|
||
app = AsyncApp("some-app") | ||
|
||
PYTHON_TASK_GROUP = TaskGroup(name="PYTHON") | ||
|
||
|
||
# --8<-- [end:hello_world] | ||
@app.task(name="hello_user", group=PYTHON_TASK_GROUP) | ||
# --8<-- [start:hello_user_fn] | ||
def hello_user(user: dict | None) -> str: | ||
greeting = "Hello " | ||
if user is None: | ||
user = "unknown" | ||
else: | ||
user = user["id"] | ||
return greeting + user | ||
|
||
|
||
# --8<-- [end:hello_user_fn] | ||
# --8<-- [end:app] |