This repository contains code for continued pretraining of language models. The project is structured around dataset preparation, model preprocessing, and training, and it includes utilities for handling different types of tokenizers.
- Dataset Combination: Merge multiple datasets into a unified format.
- Sampling: Extract samples from large datasets for testing or validation purposes.
- Tokenization: Efficient tokenization of datasets with support for various tokenizers.
- Training New Tokenizers: Train SentencePiece or Hugging Face tokenizers from scratch.
- Combining Tokenizers: Merge multiple tokenizers to handle diverse input formats.
- Vocabulary Expansion: Extend the vocabulary of a pretrained model to incorporate new tokens (sketched after this list).
- Continued Pretraining: Continue pretraining language models with DeepSpeed to reduce memory use and computation (see the second sketch after this list).
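As a rough illustration of the vocabulary-expansion step, the sketch below uses the Hugging Face `transformers` API to add tokens and resize the embedding matrix. The model name and token strings are placeholders, and the repository's own scripts may implement this differently.

```python
# Hypothetical sketch of vocabulary expansion; model name and tokens are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "your-base-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Tokens to add, e.g. pieces learned by a newly trained tokenizer.
new_tokens = ["<new_token_1>", "<new_token_2>"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so it covers the enlarged vocabulary.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("expanded_tokenizer")
model.save_pretrained("expanded_model")
```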
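Similarly, the continued-pretraining step can be pictured as a causal language modeling run driven by the Hugging Face `Trainer` with a DeepSpeed configuration. This is only a minimal sketch under that assumption; the base model, dataset path, and `ds_config.json` are placeholders rather than files shipped with this repository.

```python
# Minimal sketch of continued pretraining with Trainer + DeepSpeed.
# Model name, dataset path, and ds_config.json are placeholders.
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
train_dataset = load_from_disk("path/to/tokenized_dataset")  # pre-tokenized text

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed="ds_config.json",  # ZeRO settings live in this separate config
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A script like this would normally be started with the `deepspeed` or `torchrun` launcher so that training is distributed across the available GPUs.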
- Clone the Repository
  ```bash
  git clone https://github.com/OpenThaiGPT/continue-pretraining.git
  cd continue-pretraining
  ```
- Create and Activate an Environment
  ```bash
  conda create -n continue_pretraining python=3.11 -y
  conda activate continue_pretraining
  ```
- Install Dependencies
  ```bash
  pip install 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cu118
  pip install 'ninja' 'packaging>=20.0'
  pip install -e .
  ```