The Decontamination Pipeline is designed to combat leakage of the pretraining dataset into the evaluation process. Its goal is to identify and remove, from the pretraining dataset, documents that potentially duplicate documents used in evaluation. The contamination check is based on N-Gram MinHash and LSH (Locality-Sensitive Hashing) techniques, following previously published deduplication methods.
- Evaluation Dataset Processing:
  - Load all evaluation datasets.
  - Tokenize the content of the text column using the newmm tokenizer from the nlpo3 library.
  - Compute an N-Gram MinHash for each tokenized text (see the MinHash sketch after this list).
  - Save each dataset (content and MinHash) separately in pickle format.
- Pretraining Dataset Processing:
  - Load the Huggingface pretraining dataset from the specified path.
  - Compute the MinHash for the pretraining dataset.
  - Store the computed MinHash results on disk in Huggingface dataset format.
- LSH Indexing:
  - For each evaluation dataset, load its MinHash values into an LSH index.
- Query & Neighbor Identification:
  - Compare the entire pretraining dataset against the LSH index of each evaluation dataset (see the LSH and Jaccard sketch after this list).
  - Save the indices of documents whose similarity score exceeds the defined threshold, marking them as potential contamination.
- Contamination Check:
  - For each flagged document in the pretraining dataset, compute the approximate Jaccard similarity with its potentially contaminating evaluation documents.
  - If the approximate Jaccard similarity surpasses the threshold, mark the document as contaminated.
- Document Removal:
  - Remove all identified contaminated documents from the pretraining dataset.
- Saving the Dataset:
  - Store the decontaminated documents in a new Huggingface dataset on disk.
  - Store the indices and data of each contaminated document pair in a `contaminated_results_{num_perm}.csv` file.
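For illustration, here is a minimal sketch of the N-Gram MinHash step, assuming the `datasketch` and `nlpo3` libraries; the dictionary file name, constants, and function name are illustrative, not the script's actual code:

```python
from datasketch import MinHash
from nlpo3 import load_dict, segment

NUM_PERM = 128  # number of MinHash permutations (num_perm in the config)
N_GRAM = 5      # n-gram size over newmm tokens

# Load a Thai word list for newmm once; the path corresponds to
# minhash.newmm_dict in the config (the file name here is an assumption).
load_dict("words_th.txt", "newmm")

def compute_minhash(text: str) -> MinHash:
    """Tokenize Thai text with newmm and MinHash its word n-grams."""
    tokens = segment(text, "newmm")
    ngrams = [" ".join(tokens[i:i + N_GRAM])
              for i in range(max(len(tokens) - N_GRAM + 1, 1))]
    m = MinHash(num_perm=NUM_PERM)
    for gram in ngrams:
        m.update(gram.encode("utf-8"))
    return m
```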
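And a sketch of the LSH indexing, query, and Jaccard verification steps, again assuming `datasketch`; the threshold value and function names are illustrative:

```python
from datasketch import MinHash, MinHashLSH

THRESHOLD = 0.3  # decontamination threshold from the config

def find_contaminated(eval_minhashes: dict, train_minhashes: dict) -> set:
    """Return indices of pretraining documents flagged as contaminated.

    eval_minhashes:  {eval_doc_key: MinHash}
    train_minhashes: {pretraining_doc_index: MinHash}
    """
    # Build an LSH index over the evaluation-set MinHashes.
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=128)
    for key, mh in eval_minhashes.items():
        lsh.insert(key, mh)

    contaminated = set()
    for idx, mh in train_minhashes.items():
        # Candidate neighbours from the LSH index are potential contamination.
        for eval_key in lsh.query(mh):
            # Verify with the approximate Jaccard similarity between MinHashes.
            if mh.jaccard(eval_minhashes[eval_key]) >= THRESHOLD:
                contaminated.add(idx)
                break
    return contaminated

# The pipeline then drops the flagged rows (e.g. with Dataset.filter) and
# writes the (index, data) pairs of contaminated matches to the CSV above.
```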
Conda
python ./src/data/scripts/decontamination/decontaminate.py
Apptainer
apptainer run -B /lustrefs/flash/scratch --home /project/lt200056-opgpth/openthaigpt-refactor image_sandbox python ./src/data/scripts/decontamination/decontaminate.py
Note:
- We run it on ThaiSC Lanta's scratch disk to improve I/O performance.
- We tested it with Apptainer, but running the script with Conda's Python should also work.
- We ran it on a Memory node of Lanta, but a Compute node should also work without running out of memory (OOM).
- The command `export HF_DATASETS_CACHE="/project/lt200056-opgpth/openthaigpt-refactor/.cache"` is needed to prevent Huggingface from storing its cache in the home directory and exhausting the home directory's disk quota.
Huggingface input dataset format
{
  "train": ["text", ...],    // column names
  "validate": ["text", ...]  // column names
  ...
}
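One quick way to confirm that a dataset on disk matches this shape (the path below is only a placeholder):

```python
from datasets import load_from_disk

ds = load_from_disk("/path/to/pretraining_dataset")  # placeholder path
for split_name, split in ds.items():
    print(split_name, split.column_names)  # e.g. train ['text', ...]
```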
config/decontamination.yaml
defaults:
  - datasets:
      - Name of the files inside the `config/datasets` directory
  ...
train_dataset:
  key: openthaigpt
  name: openthaigpt_pretraining_dataset
  available_on_hub: False
  path_name: Path of the pretraining dataset
  split: Split to use from the Huggingface dataset, e.g. 'train'
  col_name: Column name to process # text
minhash:
  newmm_dict: Path to store the newmm dictionary
  save_path: Path to store the MinHash results
decontaminate:
  thresold: Jaccard similarity and LSH threshold, e.g. 0.3
  minhash_path: Path where the MinHash results are stored
  save_path: Path to save the decontaminated dataset (for further training usage)
global_config:
  num_process: Number of processes to use (128 on Lanta)
  num_perm: Number of permutations to use with MinHash (default 128)
config/deduplication.yaml
train_dataset:
  split: Split to use from the Huggingface dataset, e.g. 'train'
minhash:
  save_path: Path to store the MinHash results
deduplication:
  thresold: Jaccard similarity and LSH threshold, e.g. 0.9
  minhash_path: Path where the MinHash results are stored
  save_path: Path to save the deduplicated dataset (for further training usage)
  save_path_duplicated: Path to save the duplicated documents (for EDA)
  batch_size: Batch size for the Huggingface map function
global_config:
  num_process: Number of processes to use (128 on Lanta)
  num_perm: Number of permutations to use with MinHash (default 128)
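As a rough sketch of how these settings might be consumed, assuming the script uses Hydra (suggested by the `defaults:` block); the config name, path, and accessed fields below are assumptions based on the keys documented above, not the script's actual code:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="decontamination", version_base=None)
def main(cfg: DictConfig) -> None:
    # Field names mirror the YAML documented above; note that the threshold
    # key is spelled "thresold" in the config.
    print(cfg.train_dataset.path_name)
    print(cfg.minhash.save_path)
    print(cfg.decontaminate.thresold)
    print(cfg.global_config.num_perm)

if __name__ == "__main__":
    main()
```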
config/datasets/dataset_name.yaml
Load from Huggingface hub case
xquad:
  name: xquad # For reference
  available_on_hub: True
  path_name: xquad # Must be a dataset name available on the Huggingface hub
  split: validation
  subset: xquad.th # Some datasets have a subset; if not, leave it blank
Load from disk case.
lst20:
  name: LST20 # For reference
  available_on_hub: True
  path_name: lst20
  path: /scratch/lt200056-opgpth/LST20_Corpus
  split: validation
  subset:
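For reference, these fields map naturally onto the `datasets` loading API. A minimal sketch of the two cases; how the script actually combines `path_name` and `path` (here passed as `data_dir` for lst20) is an assumption:

```python
from datasets import load_dataset

# Hub case (the xquad entry above): path_name, subset and split map directly
# onto load_dataset arguments.
xquad_th = load_dataset("xquad", "xquad.th", split="validation")

# Disk case (the lst20 entry above): the loader is on the hub but reads the
# locally downloaded corpus, so the configured path is passed as data_dir.
lst20 = load_dataset("lst20", data_dir="/scratch/lt200056-opgpth/LST20_Corpus", split="validation")
```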
`N_GRAM = 5` (the n-gram size used when computing MinHash)