This project focuses on developing a machine learning model to distinguish between essays written by students and those generated by large language models (LLMs). Inspired by a Kaggle competition, this initiative addresses the challenges posed by LLMs in academic and professional contexts.
- Development of a Detection Model: Create a model to differentiate between student-written and LLM-generated essays.
- Accuracy and Efficiency: Optimize the area under the ROC curve (AUC) as the primary measure of accuracy, while keeping the model efficient to run.
- Benchmarking and Improvement: Replicate top Kaggle entries and state-of-the-art techniques, followed by iterative enhancements.
LLMs are increasingly proficient at mimicking human writing. The urgency of this issue is highlighted by the Kaggle competition, which is backed by Vanderbilt University and The Learning Agency Lab.
- Data Analysis: Use the dataset from the Kaggle competition, including both student-written essays and LLM-generated texts.
- Model Development: Start by replicating top-performing models from the competition and from academic research.
- Iterative Improvement: Enhance the model using various machine learning techniques.
- Evaluation: Consistently use the AUC metric for performance measurement.
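
Since AUC is the single metric used throughout, it helps to be explicit about what it measures: the probability that a randomly chosen LLM-generated essay receives a higher score than a randomly chosen student essay. The following is a minimal pure-Python sketch of that definition, not the project's evaluation code. The function name `roc_auc` and the label convention (1 = LLM-generated, 0 = student-written) are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Assumed convention: label 1 = LLM-generated, label 0 = student-written.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise loop is quadratic and only meant to show the definition. For real evaluation, `sklearn.metrics.roc_auc_score(labels, scores)` computes the same quantity efficiently.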
- Weeks 1-2: Data preparation and initial model development.
- Week 3: Iterative improvements and optimization.
- Week 4: Final evaluation and adjustments.
- A high-accuracy model for detecting AI-generated text, as measured by AUC.
- Insights into differentiating AI-generated and human-written text.
- Contributions to the technology for detecting AI-generated content.
- Byte-Pair Encoding (BPE) Tokenization
  - Compresses text input to allow for pre-training of classifier models
- Vectorizer
- Ensemble
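
To make the tokenization step more concrete, here is a toy sketch of how BPE learns merge rules: it repeatedly merges the most frequent adjacent pair of symbols in the corpus. This is a simplified pure-Python illustration and not the tokenizer used in this project. The helper names (`merge_pair`, `learn_bpe`, `encode`), the `</w>` end-of-word marker, the whitespace pre-tokenization, and the sample corpus are all assumptions chosen for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of strings."""
    # Start from characters; "</w>" marks the end of each whitespace-separated word.
    words = Counter(
        tuple(word) + ("</w>",) for text in corpus for word in text.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(word, merges):
    """Tokenize a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; real training would use the essay texts.
corpus = ["low low low low low lower lower "
          "newest newest newest newest newest newest widest widest widest"]
merges = learn_bpe(corpus, num_merges=10)
tokens = encode("newest", merges)
print(merges[:3])
print(tokens)
```

The resulting subword tokens are what a downstream vectorizer would count. In practice a library tokenizer, such as the one described in the Hugging Face course, would replace this sketch.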
- Ghostbuster: Detecting Text Ghostwritten by Large Language Models
  - A deep-learning-based approach that compares the outputs of three different LLMs.
  - The framework performs well but is not efficient enough for settings such as Kaggle competitions.
  - We would like to incorporate it into our approach in the future.
- We learned a lot from the open notebook by Prem ChotePaint.
- [Hugging Face BPE tokenizer](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt)