Scripts for generating synthetic LLM training data:
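All of the scripts consume ShareGPT-format conversations. For reference, a minimal record in the common ShareGPT convention looks like the following (an illustrative sample, not taken from this repository):

```json
[
  {
    "conversations": [
      {"from": "system", "value": "You are a helpful assistant."},
      {"from": "human", "value": "What is 2 + 2?"},
      {"from": "gpt", "value": "4."}
    ]
  }
]
```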
- `sharegpt-shorten`:
  - Takes an input dataset in ShareGPT format (supports `json`, `jsonl`, or `parquet`) and shortens each conversation to a specified token length using a specified tokenizer/chat template (see the first sketch after this list)
  - `-d`: specifies the path to the dataset
  - `-t`: specifies the path to a Hugging Face transformers tokenizer used to process the data; the chat template should be embedded in the `tokenizer_config.json`
  - `-l`: specifies the number of tokens to limit the conversation length to
- `sharegpt-squash-system-to-user`:
  - Takes an input dataset in ShareGPT format (supports `json`, `jsonl`, or `parquet`), replaces all system messages with user messages, and then collapses consecutive messages of the same role into a single message joined by `\n\n` (see the second sketch after this list)
  - `-d`: specifies the path to the dataset
- `sharegpt-to-dpo`:
  - Takes an input dataset in ShareGPT format (supports `json`, `jsonl`, or `parquet`) and uses the first turn of the conversation to generate a DPO dataset consisting of 'chosen' examples (the original data) and 'rejected' examples (the data newly generated by your LLM API endpoint); see the third sketch after this list
  - Designed for TabbyAPI, but should work with any OAI-compatible completion endpoint
  - `-t`: specifies a Jinja2 template to use for formatting the prompt (ignored if using chat completions); otherwise the Mistral instruction template is used by default
  - `-c`: reverses the 'chosen' and 'rejected' samples
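For concreteness, here is a rough sketch of the kind of trimming `sharegpt-shorten` performs. This is an illustration of the idea, not the script's actual code, and the tokenizer path is a placeholder:

```python
# Illustrative sketch only: drop trailing turns until the chat-templated
# conversation fits within the token limit.
from transformers import AutoTokenizer

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def shorten(conversation, tokenizer, limit):
    """Trim a ShareGPT 'conversations' list to at most `limit` tokens,
    measured after applying the tokenizer's embedded chat template."""
    msgs = [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conversation]
    while msgs:
        token_ids = tokenizer.apply_chat_template(msgs, tokenize=True)
        if len(token_ids) <= limit:
            break
        msgs = msgs[:-1]  # drop the last turn and re-measure
    return msgs

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")  # -t (placeholder path)
conversation = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hi! How can I help?"},
]
print(shorten(conversation, tokenizer, limit=4096))  # -l
```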
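Similarly, a minimal sketch of the squash operation performed by `sharegpt-squash-system-to-user` (again illustrative, not the repository's code):

```python
# Illustrative sketch only: rewrite system messages as user ("human")
# messages, then merge runs of same-role messages, joined by "\n\n".
def squash_system_to_user(conversation):
    squashed = []
    for turn in conversation:
        role = "human" if turn["from"] == "system" else turn["from"]
        if squashed and squashed[-1]["from"] == role:
            squashed[-1]["value"] += "\n\n" + turn["value"]  # collapse into previous turn
        else:
            squashed.append({"from": role, "value": turn["value"]})
    return squashed

conversation = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4."},
]
print(squash_system_to_user(conversation))
# -> the system turn is merged into the following human turn
```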
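And a hedged sketch of the `sharegpt-to-dpo` flow, assuming the `openai` Python client pointed at a local TabbyAPI (or other OAI-compatible) server. The base URL, model name, template file, and template variable are all placeholders, not the script's actual configuration:

```python
# Hedged illustration only: build one DPO row from the first exchange.
from jinja2 import Template
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="example-key")
prompt_template = Template(open("mistral.jinja").read())  # -t (placeholder file)

def make_dpo_row(conversation):
    # Assumes the conversation opens with a human turn followed by a gpt turn.
    user_msg = conversation[0]["value"]
    chosen = conversation[1]["value"]  # original data
    prompt = prompt_template.render(prompt=user_msg)  # variable name is illustrative
    rejected = client.completions.create(  # newly generated data
        model="local-model", prompt=prompt, max_tokens=512
    ).choices[0].text
    # With -c, 'chosen' and 'rejected' would be swapped.
    return {"prompt": user_msg, "chosen": chosen, "rejected": rejected}
```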
Usage:

- Clone this repository
- Create or enter your Python venv or conda environment
- Install the scripts' requirements with `pip install -r requirements.txt`
- Make a copy of the script's `config.example.json` named `config.json` and edit settings as appropriate
- Run the desired script, e.g. `python sharegpt-to-dpo.py <path_to_sharegpt_dataset> -t <path_to_jinja2_prompt_template>`
Jinja2 prompt templates are available at https://github.com/theroyallab/llm-prompt-templates