A tiny single-file implementation of Group Relative Policy Optimization (GRPO) as introduced by the DeepSeekMath paper.
For further reading on GRPO, see Yuge (Jimmy) Shi's blog post and Nathan Lambert's RLHF book.
- 🐭 Only ~300 lines of code
- 📦 In pure NumPy, with autograd to compute the gradient
- ✅ Type annotated and linted
- ✂️ Easily swap out the default game and train on any other game or environment
> [!NOTE]
> You'll need to install [uv](https://docs.astral.sh/uv/) (e.g. with `pip install uv`) to run the commands below.
To start teaching a policy to play a simplified version of Battleship, run:

```sh
uv run microgrpo.py
```
You should see the policy's average score improve from around 17% to about 48% over 2000 iterations.
The file is structured into five sections:
- 🕹️ Game (~50 lines): An implementation of the Battleship board game
- 🌍 Environment (~60 lines): The API with which an agent can interact with the game
- 🧠 Policy (~40 lines): A model that produces action probabilities given the observed environment state
- 🎯 GRPO (~90 lines): The GRPO objective function (reproduced below) and training data generator
- ⚡ Train (~40 lines): The loop that collects training data and optimizes the GRPO objective with AdamW
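For reference, this is the GRPO objective as formulated in the DeepSeekMath paper (notation follows the paper; the single-file implementation here may simplify some terms). For a group of $G$ sampled trajectories with rewards $r_1, \dots, r_G$, each reward is normalized into a group-relative advantage, and the clipped policy-gradient surrogate is regularized by a KL penalty toward the reference policy:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}\!\left(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right] \right) \right]
$$

where $\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the importance ratio and $\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$ is the group-normalized advantage.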
Starting a training run requires defining a `GRPOConfig` with your choice of environment (here, `BattleshipEnv`), a function that evaluates the policy model given its parameters (here, `neural_battleship_policy`), and another function that evaluates a reference policy model that you don't want the policy to deviate too much from (here, `reference_battleship_policy`):
```python
# Define the environment, the policy model to optimize, and a reference policy model.
grpo_config = GRPOConfig(
    environment=BattleshipEnv,
    policy=neural_battleship_policy,
    reference_policy=reference_battleship_policy,
)

# Initialize the policy model parameters.
θ_init = neural_battleship_policy_init()

# Train the policy model by maximizing the GRPO objective with AdamW.
θ_star, rewards_val = train_grpo(AdamWOptimizer(θ_init, learning_rate=3e-4), grpo_config)
```
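After training, you can inspect the learning curve. A minimal sketch, assuming `rewards_val` holds the per-iteration average validation reward (check `train_grpo`'s return value in microgrpo.py for the exact shape):

```python
# Hypothetical usage: plot the validation reward curve returned by train_grpo.
# Assumes rewards_val is an array-like of per-iteration average rewards.
import matplotlib.pyplot as plt

plt.plot(rewards_val)
plt.xlabel("Iteration")
plt.ylabel("Average validation reward")
plt.title("GRPO on simplified Battleship")
plt.show()
```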