This repository is based on the Open LM repository, which we modified to allow for text classification.
We require Python >= 3.9 as well as several other packages. Start by cloning our project and then installing the necessary requirements as follows:
```bash
git clone https://github.com/MLI-lab/LLM_data_bias
cd LLM_data_bias
pip install -r requirements.txt
pip install --editable .
```
Check the data preparation section for instructions on how to download and process the datasets.
The classification model is first pretrained to predict the next token. Run the following command to start pretraining:
```bash
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 3200000000 \
--epochs 1 \
--workers 8 \
--precision amp_bfloat16 \
--global-batch-size 16 \
--grad-checkpointing \
--log-every-n-steps 100 \
--grad-clip-norm 1 \
--data-key txt \
--lr 3e-4 \
--fsdp --fsdp-amp \
--warmup 2000 \
--wd 0.1 \
--beta2 0.95 \
--resume latest \
--report-to wandb \
--wandb-project-name name_of_the_run \
--logs path_to_logging_directory \
--name name_of_the_run
```
Some of the important arguments are:
- `nproc-per-node`: Number of GPUs
- `model`: Model size. Our default model size is 160M; the available model sizes can be found in `model_configs`
- `dataset-manifest`: Path to the manifest file
- `train-num-samples`: Number of tokens per epoch. For the 160M model, 3.2B tokens are used (Chinchilla optimal)
- `epochs`: Model weights and optimizer are saved every epoch. To save intermediate checkpoints, set it to a higher value. For example, setting `epochs` to 10 and `train-num-samples` to 320M will use 3.2B tokens overall (see the sketch below)
- `report-to wandb` and `wandb-project-name`: Omit if logging to wandb is not desired
- `logs`: Path where logging files and checkpoints are saved
- `name`: Project name. This creates a directory in `logs` with the project name
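For example, a minimal sketch of the intermediate-checkpointing variant mentioned above (same 3.2B-token budget, but a checkpoint every 320M tokens; the omitted flags are unchanged from the full command):

```bash
# 10 epochs x 320M tokens = 3.2B tokens total; weights are saved after each epoch
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 320000000 \
--epochs 10 \
--logs path_to_logging_directory \
--name name_of_the_run
# (remaining flags as in the full command above)
```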
The command for classification is similar to pretraining, but the following three arguments are added:
```bash
--classification True \
--num-classes 3 \
--classif-model-path path_to_pretrained_model
```
- `classification`: Indicates that we are doing classification, not pretraining. Default value is False
- `num-classes`: Number of classification classes
- `classif-model-path`: Path to the pretrained model. Can be omitted if you want to run classification from scratch instead of finetuning from a pretrained model
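Putting this together, a finetuning run might look like the following sketch (paths are placeholders; most pretraining flags are omitted for brevity and carry over unchanged):

```bash
# Pretraining command plus the three classification arguments
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 3200000000 \
--epochs 1 \
--classification True \
--num-classes 3 \
--classif-model-path path_to_pretrained_model \
--logs path_to_logging_directory \
--name name_of_the_run
# (remaining flags as in the pretraining command above)
```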
To evaluate the classification model, run the following command:
```bash
python open_lm/eval.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 3 \
--test-sets C4 FW RW \
--base-path path_to_test_sets
```
This example evaluates a 3-way classifier. The test sets (`C4.pt`, `FW.pt`, `RW.pt`) are specified in the same order as during training: C4 (class 0), FW (class 1), RW (class 2), and should be placed in `base-path`. Ensure that the number of strings in `test-sets` matches `num-classes`. The script automatically appends the `.pt` extension to the strings in `test-sets`, and it runs on one GPU by default.
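As a further illustration, a hypothetical 2-way classifier trained on C4 (class 0) and FW (class 1) would be evaluated with:

```bash
# test-sets lists exactly num-classes names; the script looks for C4.pt and FW.pt in base-path
python open_lm/eval.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 2 \
--test-sets C4 FW \
--base-path path_to_test_sets
```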
We rewrite text with OpenAI's batch API. After obtaining an API key, set it as an environment variable with `export OPENAI_API_KEY=YOUR_API_KEY`, then run the following command to rephrase text:
```bash
python scripts/rewrite_texts_batch_auto.py \
--input-file path_to_input_file \
--output-file path_to_output_file \
--batch-size 2000 \
--prompt prompt1
```
- `input-file` and `output-file`: jsonl files containing the original and rephrased texts, respectively. The text is assumed to have the key "text" (see the example below)
- `batch-size`: Number of sequences being rephrased. Set to 2000 for a tier 1 OpenAI account
- `prompt`: Rephrasing prompt. Set to `prompt1`, `prompt2`, or `prompt3`
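Each line of the input file is a JSON object holding the raw text under the "text" key; for illustration, a (hypothetical) record could be appended like this:

```bash
# One jsonl record per sequence, with the text stored under the "text" key
echo '{"text": "An example paragraph to be rephrased."}' >> path_to_input_file
```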
To remove formatting, run the following command:
```bash
python scripts/remove_formatting.py input_file.jsonl output_file.jsonl
```
To train a bag-of-words model for a 2-way classification task, run the following command:
```bash
python scripts/train_BoW.py class0_train.jsonl class1_train.jsonl
```
For evaluation:
```bash
python scripts/test_BoW.py class0_test.jsonl class1_test.jsonl
```
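For a concrete end-to-end run with hypothetical file names, formatting can be stripped before training and evaluating:

```bash
# Hypothetical pipeline: strip formatting, then train and test the BoW classifier
python scripts/remove_formatting.py class0_train_raw.jsonl class0_train.jsonl
python scripts/remove_formatting.py class1_train_raw.jsonl class1_train.jsonl
python scripts/train_BoW.py class0_train.jsonl class1_train.jsonl
python scripts/test_BoW.py class0_test.jsonl class1_test.jsonl
```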
Similar to rewriting, classifying the text sequences of a dataset into one of the 13 thematic categories requires an OpenAI API key. After setting the API key as an environment variable, run:
```bash
python scripts/categorise_text.py --submit --number-examples 2000 --input-file input_file.jsonl
```
where the input file is a jsonl file with the keys "text" and "url". This will print a batch number that should be copied and used to retrieve the results with:
```bash
python scripts/categorise_text.py --retrieve BATCH_NUMBER --output-file output_results.jsonl
```
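The two steps together, with a hypothetical batch number standing in for the printed one:

```bash
# Step 1: submit the batch; the script prints a batch number
python scripts/categorise_text.py --submit --number-examples 2000 --input-file input_file.jsonl
# Step 2: once the batch has finished, retrieve the results with that number (placeholder below)
python scripts/categorise_text.py --retrieve batch_abc123 --output-file output_results.jsonl
```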
We generate data from an LLM by prompting it with a single token, using the following command:
```bash
python scripts/generate_random_sequences.py \
--hf-model apple/DCLM-Baseline-7B \
--batch-size 16 \
--num-seqs 8192 \
--max-new-tokens 800 \
--output-file path_to_output_file.jsonl \
--input-file path_to_input_file.jsonl
```
- `hf-model`: HuggingFace model. A list of all models used in the paper is found at the top of the script
- `batch-size`: Batch size. Scale it to fill the GPU
- `num-seqs`: Number of sequences to generate
- `max-new-tokens`: Maximal number of tokens to generate per sequence
- `output-file`: jsonl file where the generated sequences are saved
- `input-file`: jsonl file from which the first token of each sequence is used to prompt the LLM. Must contain at least as many sequences as `num-seqs`. If `input-file` is not specified, a token will be drawn uniformly at random (see the sketch below)
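A minimal sketch of this random-prompt mode:

```bash
# No --input-file: the first token of each sequence is drawn uniformly at random
python scripts/generate_random_sequences.py \
--hf-model apple/DCLM-Baseline-7B \
--batch-size 16 \
--num-seqs 8192 \
--max-new-tokens 800 \
--output-file path_to_output_file.jsonl
```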
To estimate the mixture proportions of the domains an LLM was trained on, first train a classifier to distinguish between the potential domains. Second, generate random sequences from the LLM (do not specify `--input-file`) and tokenize them into tensors as described here under test data. Finally, run the following command to classify the generated sequences:
```bash
python open_lm/classify.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 7 \
--generated-data-path path_to_data_generated_from_LLM.pt
```