A simple repo that helps you get started with AI evals using promptfoo.
Ever wondered how to run evals on your own prompts? Then this repo gets you from zero to hero in no time.
If you find this repo useful, please give it a star and follow me on X (@s_streichsbier) for more content like this.
npm install -g promptfoo@latest
Note: You can remove providers you don't care about in the promptfooconfig.yaml file.
- OpenAI:
export OPENAI_API_KEY=<your-key>
- Anthropic:
export ANTHROPIC_API_KEY=<your-key>
- Google:
export GOOGLE_API_KEY=<your-key>
- DeepSeek:
export DEEPSEEK_API_KEY=<your-key>
- Groq:
export GROQ_API_KEY=<your-key>
- xAI:
export XAI_API_KEY=<your-key>
- OpenRouter:
export OPEN_ROUTER_API_KEY=<your-key>
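Before running an eval, it can help to confirm which of the keys above are actually exported in the current shell. The following is a minimal sketch, not part of this repo: check_keys is a hypothetical helper name, and the variable names are simply the ones listed above.

```shell
# Hypothetical helper (not part of promptfoo): report which variables are exported.
# Returns the number of missing variables, so 0 means everything is set.
check_keys() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what promptfoo will see too.
    if [ -n "$(printenv "$name")" ]; then
      echo "set: $name"
    else
      echo "missing: $name"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

check_keys OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY DEEPSEEK_API_KEY \
  GROQ_API_KEY XAI_API_KEY OPEN_ROUTER_API_KEY \
  || echo "Some keys are missing: export them or remove those providers from promptfooconfig.yaml."
```

Only the names are printed, never the key values, so the output is safe to paste into an issue.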
# Navigate to the directory of the eval you want to run
cd counting_characters_in_words
# Run the eval
promptfoo eval
# Run the eval without cache
promptfoo eval --no-cache
# Default port is 15500,
# -y opens the browser automatically
promptfoo view -y
# Specify a different port
promptfoo view -p 1337 -y
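If more eval directories are added later, the cd-then-eval steps above can be wrapped in a loop. This is a rough sketch under the assumption that every eval lives in its own subdirectory with a promptfooconfig.yaml; run_all_evals is a hypothetical helper name, not a promptfoo command.

```shell
# Hypothetical helper: run a command in every subdirectory of ROOT that
# contains a promptfooconfig.yaml. The command defaults to "promptfoo eval".
#
# Usage:
#   run_all_evals .                            # promptfoo eval in each eval dir
#   run_all_evals . promptfoo eval --no-cache  # same, without cache
run_all_evals() {
  root=${1:-.}
  [ "$#" -gt 0 ] && shift
  # Fall back to the default command when none was given.
  [ "$#" -gt 0 ] || set -- promptfoo eval
  for config in "$root"/*/promptfooconfig.yaml; do
    # Skip the unexpanded pattern when no directory matches.
    [ -f "$config" ] || continue
    dir=$(dirname "$config")
    echo "== $dir"
    # Subshell so the cd does not leak into the caller's shell.
    (cd "$dir" && "$@") || echo "failed: $dir"
  done
}
```

A failing eval is reported and the loop continues, so one broken config does not block the rest.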
.
├── counting_characters_in_words
│   ├── promptfooconfig.yaml
│   ├── system_instructions.md
│   ├── test-r-in-raspberry.yaml
│   └── test-r-in-strawberry.yaml
- counting_characters_in_words: The directory containing the eval
- promptfooconfig.yaml: The configuration file for promptfoo containing the prompts and models to use
- system_instructions.md: The system instructions for the eval
- prompt.json: Contains the prompt format for the eval
- test-*.yaml: The test files for the eval
- Use XML for the prompt format; it works best for LLMs. See the video "BEST Prompt Format: Markdown, XML, or Raw?" for details and an example.
- Use JSON responses for the eval; this makes it easier to parse the results and run assertions.
- Add a defaultTest section to the promptfooconfig.yaml file with the expected JSON schema.
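As an illustration of the last tip, the snippet below writes a small defaultTest fragment to a scratch file so it can be reviewed and merged into promptfooconfig.yaml by hand. It is a sketch, not this repo's actual config: the file name defaultTest.sketch.yaml and the count field are assumptions made for the example, while is-json with a schema under value is promptfoo's JSON assertion.

```shell
# Write an example fragment to a scratch file (hypothetical name) rather than
# overwriting the real promptfooconfig.yaml.
cat > defaultTest.sketch.yaml <<'EOF'
defaultTest:
  assert:
    # Applied to every test: the output must be valid JSON matching this schema.
    - type: is-json
      value:
        type: object
        required: [count]   # "count" is an assumed field name for illustration
        properties:
          count:
            type: integer
EOF

echo "Wrote defaultTest.sketch.yaml; merge it into promptfooconfig.yaml manually."
```

Because the assertion sits in defaultTest, it applies to every test file, and the individual tests only need to check the value itself.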
- OpenAI:
gpt-3.5-turbo
gpt-4o-mini
gpt-4o
Note: The o1 models don't support system messages at the moment.
- Anthropic:
claude-3-5-haiku-20241022
claude-3-5-sonnet-20241022
- Google:
gemini-1.5-flash-002
gemini-1.5-flash-8b
gemini-1.5-pro
gemini-2.0-flash-exp
gemini-2.0-flash-thinking-exp-1219
gemini-exp-1206
- DeepSeek:
deepseek-v3
- Groq:
llama-3.3-70b-versatile
- xAI:
grok-2
Also check out OpenRouter for more models.