LLM-datagen

Scripts for generating synthetic LLM training data:

  • sharegpt-shorten:
    • Takes an input dataset in ShareGPT format (supports json, jsonl, or parquet; a sample record is sketched after this list) and shortens each conversation to a specified token length using a specified tokenizer/chat template
    • -d: specifies the path to the dataset
    • -t: specifies the path to a Hugging Face transformers tokenizer used to process the data; the chat template should be embedded in the tokenizer_config.json
    • -l: specifies the maximum number of tokens to keep in each conversation
  • sharegpt-squash-system-to-user:
    • Takes an input dataset in ShareGPT format (supports json, jsonl, or parquet), converts all system messages to user messages, and then merges consecutive messages of the same role, joining them with \n\n (see the squash sketch after this list)
    • -d: specifies the path to the dataset
  • sharegpt-to-dpo:
    • Takes an input dataset in ShareGPT format (supports json, jsonl, or parquet) and uses the first turn of each conversation to build a DPO dataset of 'chosen' examples (the original data) and 'rejected' examples (completions newly generated by your LLM API endpoint); the output layout is sketched after this list
    • Designed for TabbyAPI, but should work with any OpenAI-compatible completion endpoint
    • -t: specifies a Jinja2 template used to format the prompt (ignored when using chat completions); defaults to the Mistral instruction template
    • -c: swaps the 'chosen' and 'rejected' samples
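
To make the data layouts concrete, here is a minimal sketch of a ShareGPT record and the DPO record derived from it. The field names follow the common ShareGPT convention ("conversations" with "from"/"value" keys); the exact keys these scripts expect are an assumption, so check your dataset against the scripts before running them.

```python
# A minimal sketch of the expected data layouts, assuming the common
# ShareGPT convention of a "conversations" list with "from"/"value" keys.
sharegpt_record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Write a haiku about autumn."},
        {"from": "gpt", "value": "Crisp leaves drift downward..."},
    ]
}

# sharegpt-to-dpo uses the first turn of the conversation: the original
# reply becomes 'chosen' and the freshly generated completion becomes
# 'rejected' (the -c flag swaps the two). Field names here are illustrative.
dpo_record = {
    "prompt": sharegpt_record["conversations"][1]["value"],
    "chosen": sharegpt_record["conversations"][2]["value"],
    "rejected": "<completion returned by your OpenAI-compatible endpoint>",
}
```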
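And a sketch of the transformation performed by sharegpt-squash-system-to-user, again assuming ShareGPT-style "from"/"value" keys (the real script may differ in its details):

```python
def squash_system_to_user(conversations):
    """Recast system turns as user turns, then merge consecutive same-role turns."""
    # Replace every system message with a user ("human") message.
    turns = [
        {"from": "human" if t["from"] == "system" else t["from"], "value": t["value"]}
        for t in conversations
    ]
    # Collapse runs of the same role, joining the texts with a blank line.
    squashed = []
    for turn in turns:
        if squashed and squashed[-1]["from"] == turn["from"]:
            squashed[-1]["value"] += "\n\n" + turn["value"]
        else:
            squashed.append(dict(turn))
    return squashed
```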

Installation:

  1. Clone this repository
  2. Create or enter your Python venv or conda environment
  3. Install the requirements with pip install -r requirements.txt
  4. Make a copy of the provided config.example.json named config.json and edit its settings as appropriate
  5. Run the desired script, e.g. python sharegpt-to-dpo.py <path_to_sharegpt_dataset> -t <path_to_jinja2_prompt_template>, as shown below
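
Put together, a typical setup might look like the following (the venv name and the data/template paths are illustrative):

```sh
git clone https://github.com/DocShotgun/LLM-datagen
cd LLM-datagen
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp config.example.json config.json   # then edit the settings as needed
python sharegpt-to-dpo.py data/my_dataset.jsonl -t templates/mistral.jinja2
```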

Jinja2 prompt templates are available at https://github.com/theroyallab/llm-prompt-templates
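
If you want to supply your own template with -t, a minimal Mistral-style instruct template might look like the sketch below. The variable names (messages, role, content) are an assumption about what the script passes to the template; treat the templates in the repository above as the authoritative reference.

```jinja2
{%- for message in messages -%}
{%- if message['role'] == 'user' -%}
[INST] {{ message['content'] }} [/INST]
{%- else -%}
{{ message['content'] }}
{%- endif -%}
{%- endfor -%}
```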
