Update README.md (#4)

kingjr authored Dec 9, 2024
1 parent 5414175 commit 5a64285
Showing 2 changed files with 55 additions and 22 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test-type-lint.yaml
@@ -71,9 +71,9 @@ jobs:
      - name: Test README code blocks
        run: |
          source activate ./ci_env
          # update readmes to avoid running on slurm:
          sed -i 's/cluster: slurm/cluster: null/g' docs/infra/*.md
          sed -i 's/\"auto\"/None/g' README.md
          # on Mac: sed -i '' 's/cluster: slurm/cluster: null/g' infra/*.md
          # check readmes
          pytest --markdown-docs -m markdown-docs .
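
To see what this rewrite does to the README examples (an illustrative sketch, not part of the workflow; the `MyTask` line is the usage example from the README section below), the second `sed` is equivalent to the string substitution sketched here, which lets the snippets run without a slurm cluster:

```python
# Sketch: the effect of `sed -i 's/\"auto\"/None/g' README.md` on the README usage line.
# With cluster set to None, the computation runs locally instead of being submitted to slurm.
line = 'task = MyTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})'
print(line.replace('"auto"', 'None'))
# task = MyTask(param=1, infra={"folder": tmp_path, "cluster": None})
```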
73 changes: 53 additions & 20 deletions README.md
@@ -1,8 +1,10 @@
# Exca - ⚔

Execute and cache seamlessly in python.

![workflow badge](https://github.com/facebookresearch/exca/actions/workflows/test-type-lint.yaml/badge.svg)

## Quick install

```
pip install exca
```

@@ -14,50 +16,81 @@ Documentation is available at [https://facebookresearch.github.io/exca/](https://facebookresearch.github.io/exca/)

## Basic overview

### The problem:
In ML pipelines, using even a simple python function such as `my_task`:

```python
import numpy as np

def my_task(param: int = 12) -> float:
    return param * np.random.rand()
```

often requires cumbersome overhead to (1) configure the parameters, (2) submit the job to a cluster, and (3) cache the results, e.g.:
```python continuation fixture:tmp_path
import pickle
from pathlib import Path
import submitit

# Configure
param = 12

# Check task has already been executed
filepath = tmp_path / f'result-{param}.npy'
if not filepath.exists():

    # Submit job on cluster
    executor = submitit.AutoExecutor(cluster=None, folder=tmp_path)
    job = executor.submit(my_task, param)
    result = job.result()

    # Cache result
    with filepath.open("wb") as f:
        pickle.dump(result, f)
```

These overheads make pipelines harder to debug, complicate hierarchical execution, and make it hard to save results consistently (ending in the classic `'result-parm12-v2_final_FIX.npy'`).

### The solution:
`exca` can be used to decorate a method of a [`pydantic` model](https://docs.pydantic.dev/latest/) so as to seamlessly configure its execution and caching:

```python fixture:tmp_path
import numpy as np
import pydantic
import exca as xk

class MyTask(pydantic.BaseModel):
    param: int = 12
    infra: xk.TaskInfra = xk.TaskInfra()

    @infra.apply
    def process(self) -> float:
        return self.param * np.random.rand()
```

`TaskInfra` provides the configuration for caching and computation: providing a `folder` activates caching through the filesystem, and setting `cluster="auto"` runs the computation on a slurm cluster if one is available, or in a dedicated process otherwise.

```python continuation fixture:tmp_path
task = MyTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})
out = task.process() # runs on slurm if available
# calling process again loads the cached result rather than drawing a new random number
assert out == task.process()
```
See the [API reference](https://facebookresearch.github.io/exca/infra/reference.html#exca.TaskInfra) for all the details.
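
As an additional illustration, here is a minimal sketch building on the example above (it assumes the `MyTask` model and the `tmp_path` folder; the comments describe assumptions about how the cache is keyed and about the `TaskInfra` `version` field, not statements from this README):

```python
# Sketch: assumes cache entries are keyed by the task configuration, so a different
# param (or a different infra "version") leads to a fresh computation.
first = MyTask(param=1, infra={"folder": tmp_path, "cluster": None})
other = MyTask(param=2, infra={"folder": tmp_path, "cluster": None})
assert first.process() == first.process()  # same config: the cached result is reused
other_out = other.process()  # different config: its own cache entry, new computation
# TaskInfra also exposes a "version" field; bumping it presumably starts a new cache:
renewed = MyTask(param=1, infra={"folder": tmp_path, "cluster": None, "version": "2"})
renewed_out = renewed.process()
```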


## Quick comparison

| **feature \ tool** | lru_cache | hydra | submitit | exca |
| ----------------------------- | :-------: | :---: | :------: | :--: |
| RAM cache                     |     ✔     |       |          |  ✔   |
| file cache                    |           |       |          |  ✔   |
| remote compute                |           |   ✔   |    ✔     |  ✔   |
| pure python (vs commandline)  |     ✔     |       |    ✔     |  ✔   |
| hierarchical config           |           |   ✔   |          |  ✔   |
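
To make the first two rows concrete (a sketch; `cached_task` is a hypothetical stand-in, not part of `exca`): `functools.lru_cache` keeps results in RAM only, whereas `exca`'s `folder`-based cache writes them to the filesystem so they survive across processes.

```python
import functools

@functools.lru_cache(maxsize=None)
def cached_task(param: int = 12) -> float:
    # RAM cache only: the stored results vanish when the Python process exits
    return param * 0.5

assert cached_task(3) == cached_task(3)
# By contrast, MyTask with infra={"folder": ...} (see above) caches on the filesystem,
# so a later process can reload the results instead of recomputing them.
```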

## Contributing

See the [CONTRIBUTING](.github/CONTRIBUTING.md) file for how to help out.