
🤔 SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Paper

🔍 Overview | 🤖 Models | 📚 Dataset | 🛠️ Get Started | 🧑‍💻 Experiments | 📝 Citation | 🙏 Acknowledgements

News

  • 🚀 [Oct. 30] We have publicly released checkpoints, datasets, and code for SemCoder🔥🔥!!
  • 🎉 [Sep. 25] SemCoder has been accepted to NeurIPS'24!!

Overview

🤖 SemCoder: Learning to Reason Comprehensive Code Semantics

  • SemCoder not only generates code, but also comprehensively understands code semantics.
  • We propose to learn varied semantics: from high-level functionalities to low-level details, from static properties to dynamic program states.
  • SemCoder-S-6.7B outperforms GPT-3.5-turbo on both code generation (HumanEval: 79.3 vs. 76.8; LiveCodeBench-Lite: 27.5 vs. 23.9) and execution reasoning (CRUXEval-I: 63.6 vs. 50.3; CRUXEval-O: 63.9 vs. 59.0; LiveCodeBench-CodeExecution: 61.2 vs. 43.6).

[Figure: examples of varied semantics]

🤔 Monologue Reasoning: Bridging the Gap Between Static Texts and Dynamic Execution

  • Motivated by rubber-duck debugging, we propose monologue reasoning: learning to explain dynamic execution by reasoning about important values, properties, and constraints.
  • Monologues are bi-directional: forward (reasoning from a given input to the execution output) and backward (reasoning from an observed output back to a feasible input); a hypothetical sketch follows the figure below.
  • Monologue reasoning is notably more effective than both scratchpad and chain-of-thought prompting at reasoning about dynamic execution.

[Figure: monologue examples]
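To make the two directions concrete, here is a hypothetical illustration of what forward and backward monologues reason about (this is not SemCoder's exact prompt format; the Quick Start below shows a real forward-monologue prompt):

```python
# Hypothetical illustration of the two monologue directions.

def f(xs):
    return [x * 2 for x in xs if x > 0]

# Forward monologue: given an input, reason step by step to the output.
#   f([1, -2, 3]): 1 and 3 pass the x > 0 filter, -2 is dropped,
#   the survivors are doubled -> [2, 6].
assert f([1, -2, 3]) == [2, 6]

# Backward monologue: given an output, reason back to a consistent input.
#   Output [4, 10]: halving gives 2 and 5; any negatives were filtered out,
#   so an input such as [2, -7, 5] is consistent.
assert f([2, -7, 5]) == [4, 10]
```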

Models

| Model | HF Checkpoints | Size | License |
|-------|----------------|------|---------|
| SemCoder | 🤗 HF Link | 6.7B | DeepSeek |
| SemCoder-S | 🤗 HF Link | 6.7B | DeepSeek |

Dataset

  • 📚 PyX: A fully executable Python dataset with comprehensive code semantics.
  • 👨🏼‍🔧 PyX-R: A Python dataset that teaches LLMs to perform rubber-duck debugging and self-repair (a loading sketch for both follows below).
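Both datasets are released on the Hugging Face Hub. A minimal loading sketch, assuming the dataset IDs live under the semcoder organization (verify the exact names on the dataset cards):

```python
# Sketch: load the PyX datasets from the Hugging Face Hub.
# The dataset IDs below are assumptions; check the dataset cards for the real ones.
from datasets import load_dataset

pyx = load_dataset("semcoder/PyX", split="train")      # assumed ID
pyx_r = load_dataset("semcoder/PyX-R", split="train")  # assumed ID
print(pyx[0])
```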

Get Started

🛠️ Install Environment

```bash
git clone https://github.com/ARiSE-Lab/SemCoder.git;
cd SemCoder;
conda env create --name semcoder --file=environment.yml;
conda activate semcoder;
export PYTHONPATH=$(pwd);
```
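A quick, optional sanity check that the environment is ready (this just imports the core dependencies the Quick Start relies on):

```python
# Sanity check: confirm the dependencies used in the Quick Start import cleanly.
import torch
import transformers

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```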

🚀 Quick Start

```python
from transformers import pipeline
import torch

generator = pipeline(
  model="semcoder/semcoder_s_1030",
  task="text-generation",
  torch_dtype=torch.float16,
  device_map="auto",
)

# Generate Code

CODEGEN_REQUEST = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable <Code> according to <NL_Description>

<NL_Description>
{desc}

<Code>
"""
desc = """You are tasked with implementing a Python class that simulates a simple version of a "To-Do List" application. The class should have the following functionalities:
1. Add a new task to the to-do list.
2. Mark a task as completed.
3. Display all tasks in the to-do list.
4. Display only the incomplete tasks in the to-do list.
"""

prompt = CODEGEN_REQUEST.format(desc=desc)
result = generator(prompt, max_length=2048, num_return_sequences=1, do_sample=False)  # greedy decoding
# Extract the first fenced Python block from the completion
code = result[0]["generated_text"].split("```python")[1].split("```")[0]
print(code)

# Understand Code with Monologues

FWD_MNL_REQUEST = """Simulate the Execution: You are given a Python function and an assertion containing a function input. Complete the assertion containing the execution output corresponding to the given input in [ANSWER] and [/ANSWER] tags.
{code}
"""

tests = """
todo_list = ToDoList()
todo_list.add_task("Buy groceries")
todo_list.add_task("Complete assignment")
todo_list.mark_completed("Buy groceries")
assert todo_list.tasks == ???
"""
code += tests
prompt = FWD_MNL_REQUEST.format(code=code)
result = generator(prompt, max_length=2048, num_return_sequences=1, do_sample=False)  # greedy decoding
print(result[0]["generated_text"])
```
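The forward-monologue prompt asks the model to wrap its final prediction in [ANSWER] and [/ANSWER] tags, so the prediction can be sliced out of the completion. A small helper sketch (the extract_answer name is ours, not part of the repo):

```python
# Helper sketch: pull the text between the [ANSWER] and [/ANSWER] tags
# that the forward-monologue prompt above asks the model to emit.
def extract_answer(generated_text: str) -> str:
    try:
        return generated_text.split("[ANSWER]")[1].split("[/ANSWER]")[0].strip()
    except IndexError:
        return ""  # the model did not produce the expected tags

answer = extract_answer(result[0]["generated_text"])
print(answer)  # e.g. the predicted value of todo_list.tasks
```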

🕹️ Demo

We follow the Magicoder script to launch a Gradio server for the local demo. You can launch your local demo as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python semcoder_demo.py \
   --base_model "semcoder/semcoder_s_1030" \
   --device "cuda:0" \
   --port 8080
```

Experiments

🧑‍💻 To reproduce evaluation results mentioned in the paper, please see experiments.

📝 Citation

```bibtex
@article{ding2024semcoder,
  title={SemCoder: Training Code Language Models with Comprehensive Semantics},
  author={Yangruibo Ding and Jinjun Peng and Marcus J. Min and Gail Kaiser and Junfeng Yang and Baishakhi Ray},
  journal={arXiv preprint arXiv:2406.01006},
  year={2024}
}
```

🙏 Acknowledgements

My favorite quote of 2024 from the GREAT Andrej Karpathy (No Priors Ep. 80):

> The Internet data is not the data you want for your Transformers – a nearest neighbor actually gets you really far, surprisingly. What you want is the inner-thought monologue of your brain. If we had billions of that, AGI is here, roughly speaking.

We thank the following amazing projects that inspired our design choices:

The template of this README is also borrowed from Magicoder.