datatrove integration docs added
malteos committed Jul 29, 2024
1 parent 2656044 commit a5f919f
Showing 2 changed files with 30 additions and 0 deletions.
29 changes: 29 additions & 0 deletions docs/integration-with-other-frameworks.md
@@ -0,0 +1,29 @@
# Integration with other frameworks

LLM-Datasets can be used together with your own processing pipelines or integrated into other frameworks, for example [Hugging Face's DataTrove](https://github.com/huggingface/datatrove).

## DataTrove integration

Hugging Face's DataTrove is a library for processing, filtering, and deduplicating text data at very large scale.
All datasets implemented in LLM-Datasets can be processed with DataTrove.
To do so, use the `LLMDatasetsDatatroveReader` class as the input step of any DataTrove pipeline.
The `LLMDatasetsDatatroveReader` class takes one or more dataset IDs and/or [config files](config-files.md) as arguments, as shown in the example below:

```python
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.writers import JsonlWriter

from llm_datasets.datatrove_reader import LLMDatasetsDatatroveReader
from llm_datasets.utils.config import Config, get_config_from_paths

llmds_config: Config = get_config_from_paths(["path/to/my/config.yaml"])

pipeline = [
    LLMDatasetsDatatroveReader("legal_mc4_en", llmds_config),
    SamplerFilter(rate=0.5),
    JsonlWriter(
        output_folder="/my/output/path"
    ),
]
```
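
To illustrate the reader, filter, writer flow of the pipeline above without requiring DataTrove or LLM-Datasets to be installed, here is a minimal standard-library sketch. The functions `read_documents`, `sample_documents`, and `write_jsonl` are hypothetical stand-ins and not part of either library's API. The sketch assumes that a sampling filter with `rate=0.5` keeps each document independently with probability 0.5.

```python
import io
import json
import random
from typing import Iterable, Iterator, TextIO


def read_documents(texts: Iterable[str]) -> Iterator[dict]:
    """Hypothetical reader step: yield one document per input text."""
    for i, text in enumerate(texts):
        yield {"id": str(i), "text": text}


def sample_documents(docs: Iterable[dict], rate: float, seed: int = 0) -> Iterator[dict]:
    """Hypothetical sampling step: keep each document with probability `rate`."""
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < rate:
            yield doc


def write_jsonl(docs: Iterable[dict], out: TextIO) -> int:
    """Hypothetical writer step: write one JSON object per line, return the count."""
    count = 0
    for doc in docs:
        out.write(json.dumps(doc) + "\n")
        count += 1
    return count


# Chain the three steps; documents stream through one at a time.
texts = [f"document {i}" for i in range(1000)]
buffer = io.StringIO()
written = write_jsonl(sample_documents(read_documents(texts), rate=0.5), buffer)
print(f"kept {written} of {len(texts)} documents")
```

In DataTrove itself, a pipeline list like the one above is usually passed to an executor (for example `LocalPipelineExecutor`) rather than chained by hand; see the DataTrove documentation for the available executors and their options.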

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -16,6 +16,7 @@ nav:
  - "Extract text data": extract-text-data.md
  - "Adding your own data": add-your-own-data.md
  - "Compose training and validation dataset": compose-train-validation-data.md
  - "Integration with other frameworks": integration-with-other-frameworks.md
  - "Related work": related-work.md
  - "API reference":
    - "BaseDataset": api/base_dataset.md