Determine if an essay was written by a student or AI-generated!

aakarshv1/llm-detect

ML@Berkeley NMEP Project - AI Generated Text Detection Model

ML@Berkeley NMEP Project

Introduction

This project focuses on developing a machine learning model to distinguish between essays written by students and those generated by large language models (LLMs). Inspired by a Kaggle competition, this initiative addresses the challenges posed by LLMs in academic and professional contexts.

Demo

Objectives

  • Development of a Detection Model: Create a model to differentiate between student-written and LLM-generated essays.
  • Accuracy and Efficiency: Focus on improving the Area Under the ROC Curve (AUC), the metric used to measure detection quality.
  • Benchmarking and Improvement: Replicate top Kaggle entries and state-of-the-art techniques, followed by iterative enhancements.

Background

The Kaggle competition, backed by Vanderbilt University and The Learning Agency Lab, underscores the urgency of this issue: LLMs are becoming increasingly proficient at mimicking human writing.

Methodology

  • Data Analysis: Use the dataset from the Kaggle competition, including both student-written essays and LLM-generated texts.
  • Model Development: Start by replicating top-performing models from the competition and academic research.
  • Iterative Improvement: Enhance the model using various machine learning techniques.
  • Evaluation: Consistently use the AUC metric for performance measurement.
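Since AUC is the single metric used throughout, it can help to see what it measures. The following is a minimal sketch, not the project's evaluation code: it computes ROC AUC as the probability that a randomly chosen LLM-generated essay receives a higher score than a randomly chosen student essay, with ties counted as half. The function name `roc_auc` and the label convention (1 = LLM-generated, 0 = student-written) are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 (assumed here: 1 = LLM-generated, 0 = student-written)
    scores: iterable of model scores, higher meaning "more likely LLM-generated"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

This quadratic version is meant for readability; for real datasets a rank-based implementation from an established library would be the more practical choice.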

Timeline

  • Weeks 1-2: Data preparation and initial model development.
  • Week 3: Iterative improvements and optimization.
  • Week 4: Final evaluation and adjustments.

Expected Outcomes

  • A high-accuracy AI detection model, as measured by AUC.
  • Insights into differentiating AI-generated and human-written text.
  • Contributions to the technology for detecting AI-generated content.

Approaches

  • Byte-Pair Encoding (BPE) Tokenization
      • Compresses text input to allow for pre-training of classifier models
  • Vectorizer
  • Ensemble
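To illustrate the idea behind BPE tokenization listed above, here is a toy sketch of how merges are learned and applied. It is not the tokenizer used in this project; the helper names (`merge_pair`, `learn_bpe`, `encode`), the tiny corpus, and the tie-breaking rule are all assumptions chosen to keep the example short and deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "low", "lower", "newest", "newest"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)                    # [('o', 'w'), ('l', 'ow')]
print(encode("lowest", merges))  # ['low', 'e', 's', 't']
```

Frequent character sequences collapse into single tokens, which shortens the input. Production tokenizers typically operate on bytes and handle word boundaries and special tokens, which this sketch omits.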

Other Approaches Considered/Potential Improvements

  • Ghostbuster: Detecting Text Ghostwritten by Large Language Models
      • Deep learning-based approach that, in essence, compares results generated by three different LLMs
      • The framework works well but is not very efficient, especially for Kaggle competitions
      • We would like to incorporate it into our approach in the future

Credits

Results (On-Going)

Kaggle LeaderBoard
