OpenThaiGPT/data-processing

Data Processing Pipeline Overview

This pipeline addresses data contamination, which can degrade model performance and introduce bias. It handles the problem with the following steps:

  • Pattern Filtering: Uses rule-based techniques to filter out inappropriate content (e.g., gambling, advertisements) by detecting prohibited words and dropping text that contains a high number of them.
  • Perplexity Filtering: Filters out spam or out-of-context sentences that may hurt model performance by computing the perplexity of text chunks with a language model.
  • Deduplication: Removes exact duplicates using hash functions so that no repeated sentences remain.
  • Decontamination: Aims to prevent evaluation data from leaking into the training set by applying N-Gram MinHash and LSH techniques across both training and evaluation datasets.
  • Anonymization: Uses Named Entity Recognition (NER) models to filter out sensitive information, such as names and ID numbers, from the datasets.
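
As a rough illustration of three of the steps above that need no model (pattern filtering, hash-based deduplication, and N-gram MinHash decontamination), here is a minimal standard-library sketch. It is not the code in this repository: every name, the word list, and the thresholds are hypothetical. The regex tokenizer is only suitable for space-delimited text, so Thai input would need a proper word tokenizer. For brevity, MinHash signatures are compared pairwise instead of being indexed with LSH.

```python
import hashlib
import re

# Hypothetical word list and threshold, chosen only for illustration.
BANNED_WORDS = {"casino", "jackpot", "promo"}
MAX_BANNED_RATIO = 0.2


def tokenize(text):
    """Naive tokenizer for space-delimited text (an assumption; not Thai-aware)."""
    return re.findall(r"\w+", text.lower())


def passes_pattern_filter(text, banned=BANNED_WORDS, max_ratio=MAX_BANNED_RATIO):
    """Keep text whose share of banned tokens does not exceed max_ratio."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(1 for tok in tokens if tok in banned)
    return hits / len(tokens) <= max_ratio


def deduplicate(texts):
    """Drop exact duplicates (after whitespace normalization) via SHA-256."""
    seen, unique = set(), []
    for text in texts:
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def ngrams(text, n=3):
    """Set of word n-grams in the text."""
    tokens = tokenize(text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(text, num_perm=64, n=3):
    """For each of num_perm salted hashes, keep the minimum value over all n-grams."""
    grams = ngrams(text, n)
    if not grams:
        return None  # too short to form a single n-gram
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates n-gram Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def decontaminate(train_texts, eval_texts, threshold=0.5):
    """Drop training texts that look like near-duplicates of any evaluation text."""
    eval_sigs = [s for s in map(minhash_signature, eval_texts) if s is not None]
    kept = []
    for text in train_texts:
        sig = minhash_signature(text)
        leaked = sig is not None and any(
            estimated_jaccard(sig, e) >= threshold for e in eval_sigs
        )
        if not leaked:
            kept.append(text)
    return kept
```

A pipeline built from this sketch would run `passes_pattern_filter` on each document, pass the survivors through `deduplicate`, and then call `decontaminate` with the evaluation set. The pairwise comparison grows with the product of the two dataset sizes, which is the cost an LSH index is meant to avoid.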

The scripts folder contains the main programs, which call the functions defined in the data_processing folder.
