Instructions to reproduce neural experiments from Anytime PSRO for Two-Player Zero-Sum Games
First, see the installation documentation for setting up the dev environment.
Experiment details and hyperparameters are organized in uniquely named Scenarios. When launching a learning script, you will generally specify a scenario name as a command-line argument. Experiment scenarios are defined in grl/rl_apps/scenarios/catalog.
Our PSRO/APSRO implementation consists of multiple scripts that are launched on separate terminals:
- The manager script (to track the population and, for PSRO, track the payoff table and launch empirical payoff evaluations)
- Scripts to run RL best response learners for each of the 2 players
The manager acts as a server that the best response learners connect to via gRPC.
(tmux with a nice configuration is useful for managing and organizing many terminal sessions)
# from the repository root
cd grl/rl_apps/psro
python general_psro_manager.py --scenario <my_scenario_name>
# in a 2nd terminal
cd grl/rl_apps/psro
python general_psro_br.py --player 0 --scenario <same_scenario_as_manager> --instant_first_iter
# in a 3rd terminal
cd grl/rl_apps/psro
python general_psro_br.py --player 1 --scenario <same_scenario_as_manager> --instant_first_iter
# from the repository root
cd grl/rl_apps/psro
python general_psro_manager.py --scenario <my_scenario_name>
# in a 2nd terminal
cd grl/rl_apps/psro
python anytime_psro_br_both_players.py --scenario <my_scenario_name> --instant_first_iter
If you launch all of these scripts on the same computer, the best response scripts will automatically connect to a manager running the same scenario/seed, using a randomized port that the manager records in /tmp/grl_ports.json. Otherwise, pass the --help argument to these scripts to see options for specifying hosts and ports.
Multiple experiments with the same scenario can be launched on a single host by setting the GRL_SEED environment variable to a different integer value for each set of corresponding experiments. If unset, GRL_SEED defaults to 0. Best response processes will automatically connect to a manager server with the same scenario and GRL_SEED.
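As a rough illustration of how the pieces above fit together, the following Python sketch builds the PSRO manager and best response commands for one scenario and starts them with a shared GRL_SEED. It is not part of the repository: PSRO_DIR, build_env, build_psro_commands, launch_psro, and the manager_head_start_s delay are hypothetical names and assumptions made for this example. Only the script names, the --scenario, --player, and --instant_first_iter flags, and the GRL_SEED variable come from the instructions above.

```python
import os
import subprocess
import sys
import time

# Assumed location of the launch scripts, relative to the repository root.
PSRO_DIR = os.path.join("grl", "rl_apps", "psro")


def build_env(seed):
    """Copy the current environment and set GRL_SEED for one experiment."""
    env = dict(os.environ)
    env["GRL_SEED"] = str(seed)
    return env


def build_psro_commands(scenario):
    """Return the manager command followed by one best response command per player."""
    manager = [sys.executable, "general_psro_manager.py", "--scenario", scenario]
    learners = [
        [sys.executable, "general_psro_br.py", "--player", str(player),
         "--scenario", scenario, "--instant_first_iter"]
        for player in (0, 1)
    ]
    return [manager] + learners


def launch_psro(scenario, seed, manager_head_start_s=5.0):
    """Start the manager, wait briefly, then start both learners (hypothetical helper)."""
    env = build_env(seed)
    commands = build_psro_commands(scenario)
    procs = [subprocess.Popen(commands[0], cwd=PSRO_DIR, env=env)]
    # Assumed delay so the manager is up before the learners try to connect.
    time.sleep(manager_head_start_s)
    procs += [subprocess.Popen(cmd, cwd=PSRO_DIR, env=env) for cmd in commands[1:]]
    return procs
```

Run from the repository root, calling launch_psro("leduc_psro_dqn_regret", seed=0) and then launch_psro("leduc_psro_dqn_regret", seed=1) would start two independent copies of the same scenario on one host, each with its own GRL_SEED. Separate terminals (or tmux panes) remain the simpler choice when you want to watch each process's output.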
APSRO (run both scripts together, launch manager first)
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python general_psro_manager.py --scenario leduc_psro_dqn_regret
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python anytime_psro_br_both_players.py --scenario leduc_psro_dqn_regret --instant_first_iter
PSRO
conda activate sp_psro; cd examples; export CUDA_VISIBLE_DEVICES=
python launch_psro_as_single_script.py --scenario leduc_psro_dqn_regret --instant_first_iter
APSRO (run both scripts together, launch manager first)
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python general_psro_manager.py --scenario goofspiel_psro_dqn
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python anytime_psro_br_both_players.py --scenario goofspiel_psro_dqn --instant_first_iter
PSRO
conda activate sp_psro; cd examples; export CUDA_VISIBLE_DEVICES=
python launch_psro_as_single_script.py --scenario goofspiel_psro_dqn --instant_first_iter
APSRO (run both scripts together, launch manager first)
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python general_psro_manager.py --scenario loss_game_psro_10_moves_alpha_2.7
conda activate sp_psro; cd grl/rl_apps/psro; export CUDA_VISIBLE_DEVICES=
python anytime_psro_br_both_players.py --scenario loss_game_psro_10_moves_alpha_2.7 --instant_first_iter
PSRO
conda activate sp_psro; cd examples; export CUDA_VISIBLE_DEVICES=
python launch_psro_as_single_script.py --scenario loss_game_psro_10_moves_alpha_2.7 --instant_first_iter
TODO add examples for this
See notebooks for example scripts to graph exploitability vs experience collected.
For smaller games, exact exploitability is logged during training. For larger games like Goofspiel and the 2D Continuous Hill-Climbing Game, approximate exploitability must be estimated separately by training best responses against checkpoints in a standalone script.