DPAB-α is a comprehensive benchmark designed to evaluate LLMs' function-calling capabilities through both Pythonic and JSON-based approaches. The benchmark contains 100 synthetically generated and validated problems across different difficulty levels. Each task defines its functions both as Python signatures and as JSON schemas, and the benchmark evaluates the model's ability to generate correct function calls from the given problem description.
Pythonic function calling often outperforms JSON-based function calling in scenarios that require creative or multi-step solutions, reinforcing the premise that Pythonic function calling can be more natural and powerful.
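To make the distinction concrete, here is a minimal sketch of the two output styles (the tool names `search_flights` and `book_flight` are ours for illustration, not from the benchmark):

```python
def search_flights(origin: str, destination: str) -> list:
    """Mock tool: return candidate flights."""
    return [{"id": "FL1", "price": 120}, {"id": "FL2", "price": 95}]

def book_flight(flight_id: str) -> str:
    """Mock tool: book a flight and return a confirmation."""
    return f"booked:{flight_id}"

# In Pythonic mode, a model response is executable code that can chain
# calls and compute intermediate values in a single step:
flights = search_flights(origin="BER", destination="FCO")
cheapest = min(flights, key=lambda f: f["price"])
print(book_flight(flight_id=cheapest["id"]))  # booked:FL2

# In JSON mode, the same task is expressed as one structured call per step, e.g.
# {"name": "search_flights", "arguments": {"origin": "BER", "destination": "FCO"}},
# and the min() selection above must happen across additional model turns.
```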
Installation:

```bash
git clone https://github.com/firstbatchxyz/function-calling-eval.git
cd function-calling-eval
pip install -r requirements.txt
```
Basic usage:

```bash
python run.py --model anthropic/claude-3.5-sonnet --provider openrouter
```
Arguments:

- `--model`: Model identifier (default: "anthropic/claude-3.5-sonnet")
- `--provider`: API provider (default: "openrouter")
- `--strict`: Enable strict evaluation mode (optional)
- `--show_completion`: Show model completions (default: False)
- `--mode`: Evaluation mode, either "json" or "pythonic" (default: "pythonic")
DPAB-α supports the following providers:
- `openrouter`: OpenRouter API
- `lm_studio`: LM Studio
- `vllm`: Local models via vLLM
- `ollama`: Local models via Ollama
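For example, to evaluate a local model served through Ollama (the model tag below is illustrative; substitute whichever model you have pulled):

```bash
python run.py --model qwen2.5-coder:7b --provider ollama --mode pythonic
```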
```bash
# Evaluate Claude 3.5 Sonnet in Pythonic mode
python run.py --model anthropic/claude-3.5-sonnet --provider openrouter --mode pythonic --strict

# Evaluate with JSON mode and show completions
python run.py --model qwen/qwen-2.5-7b-instruct --provider openrouter --mode json --strict --show_completion
```
Each test case in the benchmark contains:
- `difficulty`: Easy or hard
- `function_schema_python`: Python function definitions
- `function_schema_json`: JSON function schemas
- `mock_functions`: Implementations with return values
- `user_query`: Natural language question
- `checklist`: Validation criteria
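Put together, a single record might look roughly like the following (field names are from the list above; the values are illustrative and not taken from the dataset):

```python
# Illustrative test-case record; values are made up, field names match the docs.
example_case = {
    "difficulty": "easy",
    "function_schema_python": 'def get_weather(city: str) -> dict:\n    """Return current weather for a city."""',
    "function_schema_json": [{
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    "mock_functions": 'def get_weather(city: str) -> dict:\n    return {"temp_c": 21}',
    "user_query": "What's the weather like in Istanbul right now?",
    "checklist": ["calls get_weather", "passes city='Istanbul'"],
}
```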
Current benchmark results for various models (strict):
| Model Name | Pythonic | JSON |
|---|---|---|
| **Closed Models** | | |
| Claude 3.5 Sonnet | 87 | 45 |
| o1-preview-2024-09-12 | 55 | 39 |
| o1-mini-2024-09-12 | 59 | 35 |
| gpt-4o-2024-11-20 | 60 | 30 |
| **Open Models** | | |
| **> 100B Parameters** | | |
| DeepSeek V3 (685B) | 63 | 33 |
| MiniMax-01 | 62 | 40 |
| Llama-3.1-405B-Instruct | 60 | 38 |
| **> 30B Parameters** | | |
| Qwen-2.5-Coder-32b-Instruct | 68 | 32 |
| Qwen-2.5-72b-instruct | 65 | 39 |
| Llama-3.3-70b-Instruct | 59 | 40 |
| QwQ-32b-Preview | 47 | 21 |
| **< 20B Parameters** | | |
| Dria-Agent-a-7B | 70 | 38 |
| Qwen2.5-Coder-7B-Instruct | 44 | 39 |
| Dria-Agent-a-3B | 72 | 31 |
| Qwen2.5-Coder-3B-Instruct | 26 | 37 |
| Qwen-2.5-7B-Instruct | 47 | 34 |
| Phi-4 (14B) | 55 | 35 |
For details, please refer to our [blog post](https://huggingface.co/blog/andthattoo/dria-agent-a).
If you use this benchmark in your research, please cite:
```bibtex
@misc{Dria-Agent-a,
  url    = {https://huggingface.co/blog/andthattoo/dria-agent-a},
  title  = {Dria-Agent-a},
  author = {andthattoo and Atakan Tekparmak}
}
```