Merge dev changes (#14)
* make torch optional

* WIP datatrove integration

* fix limit break

* hard coded doc schema

* hard coded doc id

* hard coded doc id

* WIP datatrove

* WIP removed compression

* added favicons

* Migrated to Ruff linter

* CI for all PRs

* disable docs build on dev branch

* fixed install variation

* changed linter

* clean up tlsh install

* clean up tlsh install

* added datatrove dependency

* recover missing files

* added trust remote code
malteos authored Jul 18, 2024
1 parent fda3024 commit e27c526
Showing 145 changed files with 1,696 additions and 1,501 deletions.
27 changes: 14 additions & 13 deletions .github/workflows/ci.yml
@@ -7,7 +7,6 @@ on:
push:
branches: [ main ]
pull_request:
branches: [ main ]

jobs:
build:
@@ -18,20 +17,22 @@ jobs:
uses: actions/setup-python@v2
with:
python-version: "3.10.13"
- name: Install dependencies

- name: Install TLSH
run: |
echo Installing dependencies ....
pip install -r ./requirements.txt
- name: Lint with flake8
echo Installing TLSH dependency ....
make install-tlsh
- name: Install package and dependencies
run: |
echo "Checking syntax errors in files ..."
flake8 --count --show-source --statistics
# - name: Lint with pylint
# run: |
# pylint src --rcfile pyproject.toml
- name: Install package
echo Installing dependencies ....
make install
- name: Lint with ruff
run: |
pip install -e .
echo "Checking syntax and format errors in files ..."
make lint
- name: Test with pytest
run: |
pytest -v
make test
2 changes: 0 additions & 2 deletions .github/workflows/docs.yml
@@ -2,9 +2,7 @@ name: docs
on:
push:
branches:
- master
- main
- dev
permissions:
contents: write
jobs:
1 change: 1 addition & 0 deletions .gitignore
@@ -161,5 +161,6 @@ cython_debug/

.DS_Store

/logs/
data/
./data/*
11 changes: 0 additions & 11 deletions .pre-commit-config.yaml

This file was deleted.

6 changes: 3 additions & 3 deletions .vscode/launch.json
@@ -6,7 +6,7 @@
"configurations": [
{
"name": "Python: Current File",
"type": "python",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
@@ -17,7 +17,7 @@
},
{
"name": "Python: Current File - custom env",
"type": "python",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
@@ -42,7 +42,7 @@
// },
{
"name": "Debug Tests: Current File",
"type": "python",
"type": "debugpy",
"request": "launch",
// "purpose": [
// "debug-test"
37 changes: 37 additions & 0 deletions Makefile
@@ -0,0 +1,37 @@
install:
	@echo "--- 🚀 Installing project dependencies ---"
	pip install -e ".[all]"

install-for-tests:
	@echo "--- 🚀 Installing project dependencies for test ---"
	@echo "This ensures that the project is not installed in editable mode"
	pip install ".[dev]"

install-tlsh:
	@echo "--- 🚀 Installing TLSH dependency (same version as OSCAR 23.01) ---"
	pip download python-tlsh==4.5.0 && \
	tar -xvf python-tlsh-4.5.0.tar.gz && \
	cd python-tlsh-4.5.0 && \
	sed -i 's/set(TLSH_BUCKETS_128 1)/set(TLSH_BUCKETS_256 1)/g; s/set(TLSH_CHECKSUM_1B 1)/set(TLSH_CHECKSUM_3B 1)/g' CMakeLists.txt && \
	python setup.py install && \
	rm -rf ../python-tlsh-4.5.0*

lint:
	@echo "--- 🧹 Running linters ---"
	ruff format . # running ruff formatting
	ruff check . --fix # running ruff linting

lint-check:
	@echo "--- 🧹 Check if project is linted ---"
	# Required for CI to work, otherwise it will just pass
	ruff format . --check # running ruff formatting
	ruff check **/*.py # running ruff linting

test:
	@echo "--- 🧪 Running tests ---"
	pytest --durations=5 ./tests

pr:
	@echo "--- 🚀 Running requirements for a PR ---"
	make lint
	make test
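
Note on the `install-tlsh` target: the `sed` step rewrites two build flags in `CMakeLists.txt` before compiling, switching `TLSH_BUCKETS_128` to `TLSH_BUCKETS_256` and `TLSH_CHECKSUM_1B` to `TLSH_CHECKSUM_3B`. The flag names suggest this selects 256 buckets and a 3-byte checksum. As a rough illustration of what that substitution does, here is a minimal Python sketch. It assumes the flags appear verbatim as literal `set(...)` lines; the helper name `patch_tlsh_cmake` and the sample text are hypothetical and not part of this commit.

```python
# Illustrative only: mirrors the two sed substitutions from `install-tlsh`.
# `patch_tlsh_cmake` is a hypothetical helper, not part of the repository.

# Literal old -> new flag lines, taken from the sed expression in the Makefile.
REPLACEMENTS = {
    "set(TLSH_BUCKETS_128 1)": "set(TLSH_BUCKETS_256 1)",
    "set(TLSH_CHECKSUM_1B 1)": "set(TLSH_CHECKSUM_3B 1)",
}


def patch_tlsh_cmake(cmake_text: str) -> str:
    """Return the CMakeLists.txt text with both build flags switched."""
    for old, new in REPLACEMENTS.items():
        # str.replace swaps every occurrence, like the `g` flag in sed.
        cmake_text = cmake_text.replace(old, new)
    return cmake_text


if __name__ == "__main__":
    sample = "set(TLSH_BUCKETS_128 1)\nset(TLSH_CHECKSUM_1B 1)\n"
    print(patch_tlsh_cmake(sample))
```

Text that does not contain either flag passes through unchanged, so the same is true of the `sed` command if a future `python-tlsh` release renames these options: the build would silently keep its defaults.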
19 changes: 8 additions & 11 deletions README.md
@@ -208,7 +208,10 @@ cd llm-datasets
conda create -n llm-datasets python=3.10
conda activate llm-datasets

pip install -r requirements.txt
make install

# if you want to use content hash (for deduplication) you need to install TLSH
make install-tlsh
```

Alternatively, you can install the Python package directly from the dev branch:
@@ -217,25 +220,19 @@ Alternatively, you can install the Python package directly from the dev branch:
pip install git+https://github.com/malteos/llm-datasets.git@dev
```

### Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.
### Formatting and linting

```bash
pre-commit install
git config --bool flake8.strict true # Makes the commit fail if flake8 reports an error
```
This repository uses Ruff to validate code quality and formatting.

To run the hooks:
```bash
pre-commit run --all-files
make lint
```

### Testing

The tests can be executed with:
```bash
pytest --doctest-modules --cov-report term --cov=llm_datasets
make test
```

## Acknowledgements
Binary file added docs/images/favicon-16x16.png
Binary file added docs/images/favicon-32x32.png
Binary file added docs/images/favicon.ico
Binary file not shown.
9 changes: 4 additions & 5 deletions examples/custom_datasets/my_datasets/csv_example.py
@@ -1,7 +1,8 @@
import logging
import pandas as pd
from pathlib import Path
from llm_datasets.datasets.base import BaseDataset, Availability, License

import pandas as pd
from llm_datasets.datasets.base import Availability, BaseDataset, License

logger = logging.getLogger(__name__)

@@ -14,9 +15,7 @@ class CSVExampleDataset(BaseDataset):
    LICENSE = License("mixed")

    def get_texts(self):
        """
        Extract texts from CSV files (format: "documen_id,text,score,url")
        """
        """Extract texts from CSV files (format: "documen_id,text,score,url")"""
        # Iterate over CSV files in raw dataset directory
        for file_path in self.get_dataset_file_paths(needed_suffix=".csv"):
            file_name = Path(file_path).name
5 changes: 3 additions & 2 deletions examples/custom_datasets/my_datasets/pg19.py
@@ -1,13 +1,14 @@
from llm_datasets.datasets.base import Availability, License
from llm_datasets.datasets.hf_dataset import HFDataset
from llm_datasets.datasets.base import License, Availability


class PG19Dataset(HFDataset):
    DATASET_ID = "pg19"
    TITLE = "Project Gutenberg books published before 1919"
    HOMEPAGE = "https://huggingface.co/datasets/pg19"
    LICENSE = License(
        "Apache License Version 2.0 (or public domain?)", url="https://www.apache.org/licenses/LICENSE-2.0.html"
        "Apache License Version 2.0 (or public domain?)",
        url="https://www.apache.org/licenses/LICENSE-2.0.html",
    )
    CITATION = r"""@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and