Update README.md (#4)

kingjr authored Dec 9, 2024
1 parent 5414175 commit 5a64285
Showing 2 changed files with 55 additions and 22 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test-type-lint.yaml
@@ -71,9 +71,9 @@ jobs:
      - name: Test README code blocks
        run: |
          source activate ./ci_env
          # update readmes to avoid running on slurm:
          sed -i 's/cluster: slurm/cluster: null/g' docs/infra/*.md
          sed -i 's/\"auto\"/None/g' README.md
          # on Mac: sed -i '' 's/cluster: slurm/cluster: null/g' infra/*.md
          # check readmes
          pytest --markdown-docs -m markdown-docs .
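
To see what this rewrite does to the README examples (an illustrative sketch, not part of the workflow; the `MyTask` line is the usage example from the README section below), the second `sed` is equivalent to the string substitution sketched here, which lets the snippets run without a slurm cluster:

```python
# Sketch: the effect of `sed -i 's/\"auto\"/None/g' README.md` on the README usage line.
# With cluster set to None, the computation runs locally instead of being submitted to slurm.
line = 'task = MyTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})'
print(line.replace('"auto"', 'None'))
# task = MyTask(param=1, infra={"folder": tmp_path, "cluster": None})
```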
73 changes: 53 additions & 20 deletions README.md
@@ -1,8 +1,10 @@
# Exca - ⚔

Execute and cache seamlessly in python.

![workflow badge](https://github.com/facebookresearch/exca/actions/workflows/test-type-lint.yaml/badge.svg)

## Quick install

```
pip install exca
```

@@ -14,50 +16,81 @@ Documentation is available at [https://facebookresearch.github.io/exca/](https://facebookresearch.github.io/exca/)

## Basic overview

### The problem:
In ML pipelines, using even a simple python function such as `my_task`:

```python
import numpy as np

def my_task(param: int = 12) -> float:
    return param * np.random.rand()
```

often requires cumbersome overhead to (1) configure the parameters, (2) submit the job to a cluster, and (3) cache the results, e.g.:
```python continuation fixture:tmp_path
import pickle
from pathlib import Path
import submitit

# Configure
param = 12

# Check task has already been executed
filepath = tmp_path / f'result-{param}.npy'
if not filepath.exists():

    # Submit job on cluster
    executor = submitit.AutoExecutor(cluster=None, folder=tmp_path)
    job = executor.submit(my_task, param)
    result = job.result()

    # Cache result
    with filepath.open("wb") as f:
        pickle.dump(result, f)
```

These overheads make pipelines harder to debug, complicate hierarchical execution, and make it hard to save results consistently (ending in the classic `'result-parm12-v2_final_FIX.npy'`).

### The solution:
`exca` can be used to decorate a method of a [`pydantic` model](https://docs.pydantic.dev/latest/) so as to seamlessly configure its execution and caching:

```python fixture:tmp_path
import numpy as np
import pydantic
import exca as xk

class MyTask(pydantic.BaseModel):
    param: int = 12
    infra: xk.TaskInfra = xk.TaskInfra()

    @infra.apply
    def process(self) -> float:
        return self.param * np.random.rand()
```

`TaskInfra` provides the configuration for caching and computation: providing a `folder` activates caching through the filesystem, and setting `cluster="auto"` runs the computation on a slurm cluster if one is available, or in a dedicated process otherwise.

```python continuation fixture:tmp_path
task = MyTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})
out = task.process() # runs on slurm if available
# calling process again loads the cached result rather than drawing a new random number
assert out == task.process()
```
See the [API reference](https://facebookresearch.github.io/exca/infra/reference.html#exca.TaskInfra) for all the details.
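
As an additional illustration, here is a minimal sketch building on the example above (it assumes the `MyTask` model and the `tmp_path` folder; the comments describe assumptions about how the cache is keyed and about the `TaskInfra` `version` field, not statements from this README):

```python
# Sketch: assumes cache entries are keyed by the task configuration, so a different
# param (or a different infra "version") leads to a fresh computation.
first = MyTask(param=1, infra={"folder": tmp_path, "cluster": None})
other = MyTask(param=2, infra={"folder": tmp_path, "cluster": None})
assert first.process() == first.process()  # same config: the cached result is reused
other_out = other.process()  # different config: its own cache entry, new computation
# TaskInfra also exposes a "version" field; bumping it presumably starts a new cache:
renewed = MyTask(param=1, infra={"folder": tmp_path, "cluster": None, "version": "2"})
renewed_out = renewed.process()
```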


## Quick comparison

| **feature \ tool** | lru_cache | hydra | submitit | exca |
| ----------------------------- | :-------: | :---: | :------: | :--: |
| RAM cache                     |     ✔     |       |          |  ✔   |
| file cache                    |           |       |          |  ✔   |
| remote compute                |           |   ✔   |    ✔     |  ✔   |
| pure python (vs commandline)  |     ✔     |       |    ✔     |  ✔   |
| hierarchical config           |           |   ✔   |          |  ✔   |
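
To make the first two rows concrete (a sketch; `cached_task` is a hypothetical stand-in, not part of `exca`): `functools.lru_cache` keeps results in RAM only, whereas `exca`'s `folder`-based cache writes them to the filesystem so they survive across processes.

```python
import functools

@functools.lru_cache(maxsize=None)
def cached_task(param: int = 12) -> float:
    # RAM cache only: the stored results vanish when the Python process exits
    return param * 0.5

assert cached_task(3) == cached_task(3)
# By contrast, MyTask with infra={"folder": ...} (see above) caches on the filesystem,
# so a later process can reload the results instead of recomputing them.
```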

## Contributing

See the [CONTRIBUTING](.github/CONTRIBUTING.md) file for how to help out.