This repository is based on the Open LM repository, which we modified to allow for text classification.
We require Python >= 3.9 as well as several other packages. Start by cloning our project and then installing the necessary requirements as follows:
```bash
git clone https://github.com/MLI-lab/LLM_data_bias
cd LLM_data_bias
pip install -r requirements.txt
pip install --editable .
```
Check the data preparation section for instructions on how to download and process the datasets.
The classification model is first pretrained to predict the next token. Run the following command to start pretraining:
```bash
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 3200000000 \
--epochs 1 \
--workers 8 \
--precision amp_bfloat16 \
--global-batch-size 16 \
--grad-checkpointing \
--log-every-n-steps 100 \
--grad-clip-norm 1 \
--data-key txt \
--lr 3e-4 \
--fsdp --fsdp-amp \
--warmup 2000 \
--wd 0.1 \
--beta2 0.95 \
--resume latest \
--report-to wandb \
--wandb-project-name name_of_the_run \
--logs path_to_logging_directory \
--name name_of_the_run
```
Some of the important arguments are:
- `nproc-per-node`: Number of GPUs
- `model`: Model size. Our default model size is 160M; the available model sizes can be found in `model_configs`
- `dataset-manifest`: Path to the manifest file
- `train-num-samples`: Number of tokens per epoch. For the 160M model, 3.2B tokens are used (Chinchilla optimal)
- `epochs`: Model weights and optimizer are saved every epoch. To save intermediate checkpoints, set it to a higher value. For example, setting `epochs` to 10 and `train-num-samples` to 320M will use 3.2B tokens overall (see the sketch below)
- `report-to wandb` and `wandb-project-name`: Omit if logging to wandb is not desired
- `logs`: Path where logging files and checkpoints are saved
- `name`: Project name. This creates a directory in `logs` with the project name
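For example, a minimal sketch of the intermediate-checkpointing variant mentioned above (same 3.2B-token budget, but a checkpoint every 320M tokens; the omitted flags are unchanged from the full command):

```bash
# 10 epochs x 320M tokens = 3.2B tokens total; weights are saved after each epoch
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 320000000 \
--epochs 10 \
--logs path_to_logging_directory \
--name name_of_the_run
# (remaining flags as in the full command above)
```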
The command for classification is similar to pretraining, but the following three arguments are added:
```bash
--classification True \
--num-classes 3 \
--classif-model-path path_to_pretrained_model
```
- `classification`: Indicates that we are doing classification, not pretraining. Default value is False
- `num-classes`: Number of classification classes
- `classif-model-path`: Path to the pretrained model. Can be omitted if you want to run classification from scratch instead of finetuning from a pretrained model
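Putting this together, a finetuning run might look like the following sketch (paths are placeholders; most pretraining flags are omitted for brevity and carry over unchanged):

```bash
# Pretraining command plus the three classification arguments
torchrun --nproc-per-node 8 -m open_lm.main \
--model open_lm_160m \
--dataset-manifest /preproc_data/manifest.jsonl \
--train-num-samples 3200000000 \
--epochs 1 \
--classification True \
--num-classes 3 \
--classif-model-path path_to_pretrained_model \
--logs path_to_logging_directory \
--name name_of_the_run
# (remaining flags as in the pretraining command above)
```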
To evaluate the classification model, run the following command:
```bash
python open_lm/eval.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 3 \
--test-sets C4 FW RW \
--base-path path_to_test_sets
```
This example evaluates a 3-way classifier. The test sets (`C4.pt`, `FW.pt`, `RW.pt`) are specified in the same order as during training: C4 (class 0), FW (class 1), RW (class 2), and should be placed in `base-path`. Ensure that the number of strings in `test-sets` matches `num-classes`. The script automatically appends the `.pt` extension to the strings in `test-sets`, and it runs on one GPU by default.
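As a further illustration, a hypothetical 2-way classifier trained on C4 (class 0) and FW (class 1) would be evaluated with:

```bash
# test-sets lists exactly num-classes names; the script looks for C4.pt and FW.pt in base-path
python open_lm/eval.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 2 \
--test-sets C4 FW \
--base-path path_to_test_sets
```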
We rewrite text with OpenAI's batch API. After obtaining an API key, set it as an environment variable with `export OPENAI_API_KEY=YOUR_API_KEY`, then run the following command to rephrase text:
```bash
python scripts/rewrite_texts_batch_auto.py \
--input-file path_to_input_file \
--output-file path_to_output_file \
--batch-size 2000 \
--prompt prompt1
```
- `input-file` and `output-file`: jsonl files containing the original and rephrased texts, respectively. The text is assumed to have the key "text" (see the example below)
- `batch-size`: Number of sequences being rephrased. Set to 2000 for a tier 1 OpenAI account
- `prompt`: Rephrasing prompt. Set to `prompt1`, `prompt2`, or `prompt3`
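Each line of the input file is a JSON object holding the raw text under the "text" key; for illustration, a (hypothetical) record could be appended like this:

```bash
# One jsonl record per sequence, with the text stored under the "text" key
echo '{"text": "An example paragraph to be rephrased."}' >> path_to_input_file
```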
To remove formatting, run the following command:
```bash
python scripts/remove_formatting.py input_file.jsonl output_file.jsonl
```
To train a bag-of-words model for a 2-way classification task, run the following command:
```bash
python scripts/train_BoW.py class0_train.jsonl class1_train.jsonl
```
For evaluation:
```bash
python scripts/test_BoW.py class0_test.jsonl class1_test.jsonl
```
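For a concrete end-to-end run with hypothetical file names, formatting can be stripped before training and evaluating:

```bash
# Hypothetical pipeline: strip formatting, then train and test the BoW classifier
python scripts/remove_formatting.py class0_train_raw.jsonl class0_train.jsonl
python scripts/remove_formatting.py class1_train_raw.jsonl class1_train.jsonl
python scripts/train_BoW.py class0_train.jsonl class1_train.jsonl
python scripts/test_BoW.py class0_test.jsonl class1_test.jsonl
```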
Similar to rewriting, classifying the text sequences of a dataset into one of the 13 thematic categories requires an OpenAI API key. After setting the API key as an environment variable, run:
```bash
python scripts/categorise_text.py --submit --number-examples 2000 --input-file input_file.jsonl
```
where the input file is a jsonl file with the keys "text" and "url". This will print a batch number that should be copied and used to retrieve the results with:
```bash
python scripts/categorise_text.py --retrieve BATCH_NUMBER --output-file output_results.jsonl
```
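The two steps together, with a hypothetical batch number standing in for the printed one:

```bash
# Step 1: submit the batch; the script prints a batch number
python scripts/categorise_text.py --submit --number-examples 2000 --input-file input_file.jsonl
# Step 2: once the batch has finished, retrieve the results with that number (placeholder below)
python scripts/categorise_text.py --retrieve batch_abc123 --output-file output_results.jsonl
```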
We generate data from an LLM by prompting it with a single token, using the following command:
```bash
python scripts/generate_random_sequences.py \
--hf-model apple/DCLM-Baseline-7B \
--batch-size 16 \
--num-seqs 8192 \
--max-new-tokens 800 \
--output-file path_to_output_file.jsonl \
--input-file path_to_input_file.jsonl
```
- `hf-model`: HuggingFace model. A list of all models used in the paper is found at the top of the script
- `batch-size`: Batch size. Scale it to fill the GPU
- `num-seqs`: Number of sequences to generate
- `max-new-tokens`: Maximal number of tokens to generate per sequence
- `output-file`: jsonl file where the generated sequences are saved
- `input-file`: jsonl file from which the first token of each sequence is used to prompt the LLM. Must contain at least as many sequences as `num-seqs`. If `input-file` is not specified, a token will be drawn uniformly at random (see the sketch below)
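A minimal sketch of this random-prompt mode:

```bash
# No --input-file: the first token of each sequence is drawn uniformly at random
python scripts/generate_random_sequences.py \
--hf-model apple/DCLM-Baseline-7B \
--batch-size 16 \
--num-seqs 8192 \
--max-new-tokens 800 \
--output-file path_to_output_file.jsonl
```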
To estimate the mixture proportions of the domains an LLM was trained on, first train a classifier to distinguish between the potential domains. Second, generate random sequences from the LLM (do not specify `--input-file`) and tokenize them into tensors as described here under test data. Finally, run the following command to classify the generated sequences:
```bash
python open_lm/classify.py \
--model open_lm_160m \
--classif-model-path path_to_classification_model \
--num-classes 7 \
--generated-data-path path_to_data_generated_from_LLM.pt
```