/ CORL Public archive

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC

tinkoff-ai/CORL

Latest commit: 6afec90 · Aug 1, 2023 (44 commits)

CORL (Clean Offline Reinforcement Learning)

🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. CORL is heavily inspired by cleanrl for online RL, so check it out too!

  • 📜 Single-file implementation
  • 📈 Benchmarked implementations for N algorithms
  • 🖼 Weights and Biases integration

  • ⭐ If you're interested in discrete control, make sure to check out our new library, Katakomba. It provides both discrete control algorithms augmented with recurrence and an offline RL benchmark for the NetHack Learning Environment.

Getting started

git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run --gpus=all -it --rm --name <container_name> <image_name>

Algorithms Implemented

| Algorithm | Variants Implemented | Wandb Report |
|---|---|---|
| Offline and Offline-to-Online | | |
| Conservative Q-Learning for Offline Reinforcement Learning (CQL) | offline/cql.py, finetune/cql.py | Offline, Offline-to-online |
| Accelerating Online Reinforcement Learning with Offline Datasets (AWAC) | offline/awac.py, finetune/awac.py | Offline, Offline-to-online |
| Offline Reinforcement Learning with Implicit Q-Learning (IQL) | offline/iql.py, finetune/iql.py | Offline, Offline-to-online |
| Offline-to-Online only | | |
| Supported Policy Optimization for Offline Reinforcement Learning (SPOT) | finetune/spot.py | Offline-to-online |
| Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (Cal-QL) | finetune/cal_ql.py | Offline-to-online |
| Offline only | | |
| ✅ Behavioral Cloning (BC) | offline/any_percent_bc.py | Offline |
| ✅ Behavioral Cloning-10% (BC-10%) | offline/any_percent_bc.py | Offline |
| A Minimalist Approach to Offline Reinforcement Learning (TD3+BC) | offline/td3_bc.py | Offline |
| Decision Transformer: Reinforcement Learning via Sequence Modeling (DT) | offline/dt.py | Offline |
| Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (SAC-N) | offline/sac_n.py | Offline |
| Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (EDAC) | offline/edac.py | Offline |
| Revisiting the Minimalist Approach to Offline Reinforcement Learning (ReBRAC) | offline/rebrac.py | Offline |
| Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (LB-SAC) | offline/lb_sac.py | Offline Gym-MuJoCo |

D4RL Benchmarks

See the Wandb reports linked above for learning curves and details. Here, we report reproduced final and best scores. Note that the two differ by a significant margin, and papers do not always state explicitly which reporting methodology they use. If you want to re-collect our results in a more structured or nuanced manner, see results.
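To make the gap between the two reporting styles concrete, here is a minimal sketch of how a "last" and a "best" score could be derived from per-seed evaluation curves. It assumes one plausible convention, which the text above does not confirm: the last score is the mean of the final evaluation across seeds, and the best score is the highest across-seed mean over all checkpoints. The function name `last_and_best` and the toy curves are hypothetical.

```python
from statistics import mean, pstdev


def last_and_best(curves):
    """Return ((last_mean, last_std), (best_mean, best_std)).

    `curves` holds one list of normalized evaluation scores per seed,
    aligned so that index i is the same checkpoint for every seed.
    Assumed convention, not taken from the CORL code.
    """
    # Last score: the final checkpoint of every seed.
    finals = [curve[-1] for curve in curves]
    last = (mean(finals), pstdev(finals))

    # Best score: the checkpoint whose across-seed mean is highest.
    per_checkpoint = list(zip(*curves))
    best_idx = max(range(len(per_checkpoint)),
                   key=lambda i: mean(per_checkpoint[i]))
    best = (mean(per_checkpoint[best_idx]), pstdev(per_checkpoint[best_idx]))
    return last, best


# Hypothetical curves for two seeds and three evaluation checkpoints.
curves = [
    [10.0, 60.0, 40.0],
    [20.0, 50.0, 30.0],
]
last, best = last_and_best(curves)
print("last:", last)
print("best:", best)
```

Because the best score picks a checkpoint after the fact, it can never be lower than the last score under this convention, which is why the two tables below should be compared only like with like.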

Offline

Last Scores

Gym-MuJoCo
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40 ± 0.19 42.46 ± 0.70 48.10 ± 0.18 49.46 ± 0.62 47.04 ± 0.22 48.31 ± 0.22 64.04 ± 0.68 68.20 ± 1.28 67.70 ± 1.04 42.20 ± 0.26
halfcheetah-medium-replay-v2 35.66 ± 2.33 23.59 ± 6.95 44.84 ± 0.59 44.70 ± 0.69 45.04 ± 0.27 44.46 ± 0.22 51.18 ± 0.31 60.70 ± 1.01 62.06 ± 1.10 38.91 ± 0.50
halfcheetah-medium-expert-v2 55.95 ± 7.35 90.10 ± 2.45 90.78 ± 6.04 93.62 ± 0.41 95.63 ± 0.42 94.74 ± 0.52 103.80 ± 2.95 98.96 ± 9.31 104.76 ± 0.64 91.55 ± 0.95
hopper-medium-v2 53.51 ± 1.76 55.48 ± 7.30 60.37 ± 3.49 74.45 ± 9.14 59.08 ± 3.77 67.53 ± 3.78 102.29 ± 0.17 40.82 ± 9.91 101.70 ± 0.28 65.10 ± 1.61
hopper-medium-replay-v2 29.81 ± 2.07 70.42 ± 8.66 64.42 ± 21.52 96.39 ± 5.28 95.11 ± 5.27 97.43 ± 6.39 94.98 ± 6.53 100.33 ± 0.78 99.66 ± 0.81 81.77 ± 6.87
hopper-medium-expert-v2 52.30 ± 4.01 111.16 ± 1.03 101.17 ± 9.07 52.73 ± 37.47 99.26 ± 10.91 107.42 ± 7.80 109.45 ± 2.34 101.31 ± 11.63 105.19 ± 10.08 110.44 ± 0.33
walker2d-medium-v2 63.23 ± 16.24 67.34 ± 5.17 82.71 ± 4.78 66.53 ± 26.04 80.75 ± 3.28 80.91 ± 3.17 85.82 ± 0.77 87.47 ± 0.66 93.36 ± 1.38 67.63 ± 2.54
walker2d-medium-replay-v2 21.80 ± 10.15 54.35 ± 6.34 85.62 ± 4.01 82.20 ± 1.05 73.09 ± 13.22 82.15 ± 3.03 84.25 ± 2.25 78.99 ± 0.50 87.10 ± 2.78 59.86 ± 2.73
walker2d-medium-expert-v2 98.96 ± 15.98 108.70 ± 0.25 110.03 ± 0.36 49.41 ± 38.16 109.56 ± 0.39 111.72 ± 0.86 111.86 ± 0.43 114.93 ± 0.41 114.75 ± 0.74 107.11 ± 0.96
locomotion average 50.40 69.29 76.45 67.72 78.28 81.63 89.74 83.52 92.92 73.84
Maze2d
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 0.36 ± 8.69 12.18 ± 4.29 29.41 ± 12.31 82.67 ± 28.30 -8.90 ± 6.11 42.11 ± 0.58 106.87 ± 22.16 130.59 ± 16.52 95.26 ± 6.39 18.08 ± 25.42
maze2d-medium-v1 0.79 ± 3.25 14.25 ± 2.33 59.45 ± 36.25 52.88 ± 55.12 86.11 ± 9.68 34.85 ± 2.72 105.11 ± 31.67 88.61 ± 18.72 57.04 ± 3.45 31.71 ± 26.33
maze2d-large-v1 2.26 ± 4.39 11.32 ± 5.10 97.10 ± 25.41 209.13 ± 8.19 23.75 ± 36.70 61.72 ± 3.50 78.33 ± 61.77 204.76 ± 1.19 95.60 ± 22.92 35.66 ± 28.20
maze2d average 1.13 12.58 61.99 114.89 33.65 46.23 96.77 141.32 82.64 28.48
Antmaze
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 55.25 ± 4.15 65.75 ± 5.26 70.75 ± 39.18 57.75 ± 10.28 92.75 ± 1.92 77.00 ± 5.52 97.75 ± 1.48 0.00 ± 0.00 0.00 ± 0.00 57.00 ± 9.82
antmaze-umaze-diverse-v2 47.25 ± 4.09 44.00 ± 1.00 44.75 ± 11.61 58.00 ± 7.68 37.25 ± 3.70 54.25 ± 5.54 83.50 ± 7.02 0.00 ± 0.00 0.00 ± 0.00 51.75 ± 0.43
antmaze-medium-play-v2 0.00 ± 0.00 2.00 ± 0.71 0.25 ± 0.43 0.00 ± 0.00 65.75 ± 11.61 65.75 ± 11.71 89.50 ± 3.35 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-medium-diverse-v2 0.75 ± 0.83 5.75 ± 9.39 0.25 ± 0.43 0.00 ± 0.00 67.25 ± 3.56 73.75 ± 5.45 83.50 ± 8.20 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-play-v2 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 20.75 ± 7.26 42.00 ± 4.53 52.25 ± 29.01 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.00 ± 0.00 0.75 ± 0.83 0.00 ± 0.00 0.00 ± 0.00 20.50 ± 13.24 30.25 ± 3.63 64.00 ± 5.43 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 17.21 19.71 19.33 19.29 50.71 57.17 78.42 0.00 0.00 18.12
Adroit
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 71.03 ± 6.26 26.99 ± 9.60 -3.88 ± 0.21 81.12 ± 13.47 13.71 ± 16.98 78.49 ± 8.21 103.16 ± 8.49 6.86 ± 5.93 5.07 ± 6.16 67.68 ± 5.48
pen-cloned-v1 51.92 ± 15.15 46.67 ± 14.25 5.13 ± 5.28 89.56 ± 15.57 1.04 ± 6.62 83.42 ± 8.19 102.79 ± 7.84 31.35 ± 2.14 12.02 ± 1.75 64.43 ± 1.43
pen-expert-v1 109.65 ± 7.28 114.96 ± 2.96 122.53 ± 21.27 160.37 ± 1.21 -1.41 ± 2.34 128.05 ± 9.21 152.16 ± 6.33 87.11 ± 48.95 -1.55 ± 0.81 116.38 ± 1.27
door-human-v1 2.34 ± 4.00 -0.13 ± 0.07 -0.33 ± 0.01 4.60 ± 1.90 5.53 ± 1.31 3.26 ± 1.83 -0.10 ± 0.01 -0.38 ± 0.00 -0.12 ± 0.13 4.44 ± 0.87
door-cloned-v1 -0.09 ± 0.03 0.29 ± 0.59 -0.34 ± 0.01 0.93 ± 1.66 -0.33 ± 0.01 3.07 ± 1.75 0.06 ± 0.05 -0.33 ± 0.00 2.66 ± 2.31 7.64 ± 3.26
door-expert-v1 105.35 ± 0.09 104.04 ± 1.46 -0.33 ± 0.01 104.85 ± 0.24 -0.32 ± 0.02 106.65 ± 0.25 106.37 ± 0.29 -0.33 ± 0.00 106.29 ± 1.73 104.87 ± 0.39
hammer-human-v1 3.03 ± 3.39 -0.19 ± 0.02 1.02 ± 0.24 3.37 ± 1.93 0.14 ± 0.11 1.79 ± 0.80 0.24 ± 0.24 0.24 ± 0.00 0.28 ± 0.18 1.28 ± 0.15
hammer-cloned-v1 0.55 ± 0.16 0.12 ± 0.08 0.25 ± 0.01 0.21 ± 0.24 0.30 ± 0.01 1.50 ± 0.69 5.00 ± 3.75 0.14 ± 0.09 0.19 ± 0.07 1.82 ± 0.55
hammer-expert-v1 126.78 ± 0.64 121.75 ± 7.67 3.11 ± 0.03 127.06 ± 0.29 0.26 ± 0.01 128.68 ± 0.33 133.62 ± 0.27 25.13 ± 43.25 28.52 ± 49.00 117.45 ± 6.65
relocate-human-v1 0.04 ± 0.03 -0.14 ± 0.08 -0.29 ± 0.01 0.05 ± 0.03 0.06 ± 0.03 0.12 ± 0.04 0.16 ± 0.30 -0.31 ± 0.01 -0.17 ± 0.17 0.05 ± 0.01
relocate-cloned-v1 -0.06 ± 0.01 -0.00 ± 0.02 -0.30 ± 0.01 -0.04 ± 0.04 -0.29 ± 0.01 0.04 ± 0.01 1.66 ± 2.59 -0.01 ± 0.10 0.17 ± 0.35 0.16 ± 0.09
relocate-expert-v1 107.58 ± 1.20 97.90 ± 5.21 -1.73 ± 0.96 108.87 ± 0.85 -0.30 ± 0.02 106.11 ± 4.02 107.52 ± 2.28 -0.36 ± 0.00 71.94 ± 18.37 104.28 ± 0.42
adroit average 48.18 42.69 10.40 56.75 1.53 53.43 59.39 12.43 18.78 49.21
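The "average" rows appear to be unweighted means over the tasks of each domain. As a quick check under that assumption, the sketch below recomputes the Gym-MuJoCo (locomotion) average for the BC column from the per-task means listed above. The variable names are illustrative only, and other columns may differ in the last digit if the published averages were computed from unrounded values.

```python
# Per-task mean last scores for the BC column, copied from the Gym-MuJoCo table.
bc_last_scores = {
    "halfcheetah-medium-v2": 42.40,
    "halfcheetah-medium-replay-v2": 35.66,
    "halfcheetah-medium-expert-v2": 55.95,
    "hopper-medium-v2": 53.51,
    "hopper-medium-replay-v2": 29.81,
    "hopper-medium-expert-v2": 52.30,
    "walker2d-medium-v2": 63.23,
    "walker2d-medium-replay-v2": 21.80,
    "walker2d-medium-expert-v2": 98.96,
}

# Unweighted mean over the nine tasks (assumed averaging scheme).
locomotion_average = sum(bc_last_scores.values()) / len(bc_last_scores)
formatted = f"{locomotion_average:.2f}"
print(formatted)  # matches the 50.40 reported in the "locomotion average" row
```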

Best Scores

Gym-MuJoCo
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 43.60 ± 0.14 43.90 ± 0.13 48.93 ± 0.11 50.06 ± 0.50 47.62 ± 0.03 48.84 ± 0.07 65.62 ± 0.46 72.21 ± 0.31 69.72 ± 0.92 42.73 ± 0.10
halfcheetah-medium-replay-v2 40.52 ± 0.19 42.27 ± 0.46 45.84 ± 0.26 46.35 ± 0.29 46.43 ± 0.19 45.35 ± 0.08 52.22 ± 0.31 67.29 ± 0.34 66.55 ± 1.05 40.31 ± 0.28
halfcheetah-medium-expert-v2 79.69 ± 3.10 94.11 ± 0.22 96.59 ± 0.87 96.11 ± 0.37 97.04 ± 0.17 95.38 ± 0.17 108.89 ± 1.20 111.73 ± 0.47 110.62 ± 1.04 93.40 ± 0.21
hopper-medium-v2 69.04 ± 2.90 73.84 ± 0.37 70.44 ± 1.18 97.90 ± 0.56 70.80 ± 1.98 80.46 ± 3.09 103.19 ± 0.16 101.79 ± 0.20 103.26 ± 0.14 69.42 ± 3.64
hopper-medium-replay-v2 68.88 ± 10.33 90.57 ± 2.07 98.12 ± 1.16 100.91 ± 1.50 101.63 ± 0.55 102.69 ± 0.96 102.57 ± 0.45 103.83 ± 0.53 103.28 ± 0.49 88.74 ± 3.02
hopper-medium-expert-v2 90.63 ± 10.98 113.13 ± 0.16 113.22 ± 0.43 103.82 ± 12.81 112.84 ± 0.66 113.18 ± 0.38 113.16 ± 0.43 111.24 ± 0.15 111.80 ± 0.11 111.18 ± 0.21
walker2d-medium-v2 80.64 ± 0.91 82.05 ± 0.93 86.91 ± 0.28 83.37 ± 2.82 84.77 ± 0.20 87.58 ± 0.48 87.79 ± 0.19 90.17 ± 0.54 95.78 ± 1.07 74.70 ± 0.56
walker2d-medium-replay-v2 48.41 ± 7.61 76.09 ± 0.40 91.17 ± 0.72 86.51 ± 1.15 89.39 ± 0.88 89.94 ± 0.93 91.11 ± 0.63 85.18 ± 1.63 89.69 ± 1.39 68.22 ± 1.20
walker2d-medium-expert-v2 109.95 ± 0.62 109.90 ± 0.09 112.21 ± 0.06 108.28 ± 9.45 111.63 ± 0.38 113.06 ± 0.53 112.49 ± 0.18 116.93 ± 0.42 116.52 ± 0.75 108.71 ± 0.34
locomotion average 70.15 80.65 84.83 85.92 84.68 86.28 93.00 95.60 96.36 77.49
Maze2d
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 16.09 ± 0.87 22.49 ± 1.52 99.33 ± 16.16 136.61 ± 11.65 92.05 ± 13.66 50.92 ± 4.23 162.28 ± 1.79 153.12 ± 6.49 149.88 ± 1.97 63.83 ± 17.35
maze2d-medium-v1 19.16 ± 1.24 27.64 ± 1.87 150.93 ± 3.89 131.50 ± 25.38 128.66 ± 5.44 122.69 ± 30.00 150.12 ± 4.48 93.80 ± 14.66 154.41 ± 1.58 68.14 ± 12.25
maze2d-large-v1 20.75 ± 6.66 41.83 ± 3.64 197.64 ± 5.26 227.93 ± 1.90 157.51 ± 7.32 162.25 ± 44.18 197.55 ± 5.82 207.51 ± 0.96 182.52 ± 2.68 50.25 ± 19.34
maze2d average 18.67 30.65 149.30 165.35 126.07 111.95 169.98 151.48 162.27 60.74
Antmaze
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 68.50 ± 2.29 77.50 ± 1.50 98.50 ± 0.87 78.75 ± 6.76 94.75 ± 0.83 84.00 ± 4.06 100.00 ± 0.00 0.00 ± 0.00 42.50 ± 28.61 64.50 ± 2.06
antmaze-umaze-diverse-v2 64.75 ± 4.32 63.50 ± 2.18 71.25 ± 5.76 88.25 ± 2.17 53.75 ± 2.05 79.50 ± 3.35 96.75 ± 2.28 0.00 ± 0.00 0.00 ± 0.00 60.50 ± 2.29
antmaze-medium-play-v2 4.50 ± 1.12 6.25 ± 2.38 3.75 ± 1.30 27.50 ± 9.39 80.50 ± 3.35 78.50 ± 3.84 93.50 ± 2.60 0.00 ± 0.00 0.00 ± 0.00 0.75 ± 0.43
antmaze-medium-diverse-v2 4.75 ± 1.09 16.50 ± 5.59 5.50 ± 1.50 33.25 ± 16.81 71.00 ± 4.53 83.50 ± 1.80 91.75 ± 2.05 0.00 ± 0.00 0.00 ± 0.00 0.50 ± 0.50
antmaze-large-play-v2 0.50 ± 0.50 13.50 ± 9.76 1.25 ± 0.43 1.00 ± 0.71 34.75 ± 5.85 53.50 ± 2.50 68.75 ± 13.90 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.75 ± 0.43 6.25 ± 1.79 0.25 ± 0.43 0.50 ± 0.50 36.25 ± 3.34 53.00 ± 3.00 69.50 ± 7.26 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 23.96 30.58 30.08 38.21 61.83 72.00 86.71 0.00 7.08 21.04
Adroit
Task-Name BC 10% BC TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 99.69 ± 7.45 59.89 ± 8.03 9.95 ± 8.19 121.05 ± 5.47 58.91 ± 1.81 106.15 ± 10.28 127.28 ± 3.22 56.48 ± 7.17 35.84 ± 10.57 77.83 ± 2.30
pen-cloned-v1 99.14 ± 12.27 83.62 ± 11.75 52.66 ± 6.33 129.66 ± 1.27 14.74 ± 2.31 114.05 ± 4.78 128.64 ± 7.15 52.69 ± 5.30 26.90 ± 7.85 71.17 ± 2.70
pen-expert-v1 128.77 ± 5.88 134.36 ± 3.16 142.83 ± 7.72 162.69 ± 0.23 14.86 ± 4.07 140.01 ± 6.36 157.62 ± 0.26 116.43 ± 40.26 36.04 ± 4.60 119.49 ± 2.31
door-human-v1 9.41 ± 4.55 7.00 ± 6.77 -0.11 ± 0.06 19.28 ± 1.46 13.28 ± 2.77 13.52 ± 1.22 0.27 ± 0.43 -0.10 ± 0.06 2.51 ± 2.26 7.36 ± 1.24
door-cloned-v1 3.40 ± 0.95 10.37 ± 4.09 -0.20 ± 0.11 12.61 ± 0.60 -0.08 ± 0.13 9.02 ± 1.47 7.73 ± 6.80 -0.21 ± 0.10 20.36 ± 1.11 11.18 ± 0.96
door-expert-v1 105.84 ± 0.23 105.92 ± 0.24 4.49 ± 7.39 106.77 ± 0.24 59.47 ± 25.04 107.29 ± 0.37 106.78 ± 0.04 0.05 ± 0.02 109.22 ± 0.24 105.49 ± 0.09
hammer-human-v1 12.61 ± 4.87 6.23 ± 4.79 2.38 ± 0.14 22.03 ± 8.13 0.30 ± 0.05 6.86 ± 2.38 1.18 ± 0.15 0.25 ± 0.00 3.49 ± 2.17 1.68 ± 0.11
hammer-cloned-v1 8.90 ± 4.04 8.72 ± 3.28 0.96 ± 0.30 14.67 ± 1.94 0.32 ± 0.03 11.63 ± 1.70 48.16 ± 6.20 12.67 ± 15.02 0.27 ± 0.01 2.74 ± 0.22
hammer-expert-v1 127.89 ± 0.57 128.15 ± 0.66 33.31 ± 47.65 129.66 ± 0.33 0.93 ± 1.12 129.76 ± 0.37 134.74 ± 0.30 91.74 ± 47.77 69.44 ± 47.00 127.39 ± 0.10
relocate-human-v1 0.59 ± 0.27 0.16 ± 0.14 -0.29 ± 0.01 2.09 ± 0.76 1.03 ± 0.20 1.22 ± 0.28 3.70 ± 2.34 -0.18 ± 0.14 0.05 ± 0.02 0.08 ± 0.02
relocate-cloned-v1 0.45 ± 0.31 0.74 ± 0.45 -0.02 ± 0.04 0.94 ± 0.68 -0.07 ± 0.02 1.78 ± 0.70 9.25 ± 2.56 0.10 ± 0.04 4.11 ± 1.39 0.34 ± 0.09
relocate-expert-v1 110.31 ± 0.36 109.77 ± 0.60 0.23 ± 0.27 111.56 ± 0.17 0.03 ± 0.10 110.12 ± 0.82 111.14 ± 0.23 -0.07 ± 0.08 98.32 ± 3.75 106.49 ± 0.30
adroit average 58.92 54.58 20.51 69.42 13.65 62.62 69.71 27.49 33.88 52.60

Offline-to-Online

Scores

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 52.75 ± 8.67 → 98.75 ± 1.09 94.00 ± 1.58 → 99.50 ± 0.87 77.00 ± 0.71 → 96.50 ± 1.12 91.00 ± 2.55 → 99.50 ± 0.50 76.75 ± 7.53 → 99.75 ± 0.43
antmaze-umaze-diverse-v2 56.00 ± 2.74 → 0.00 ± 0.00 9.50 ± 9.91 → 99.00 ± 1.22 59.50 ± 9.55 → 63.75 ± 25.02 36.25 ± 2.17 → 95.00 ± 3.67 32.00 ± 27.79 → 98.50 ± 1.12
antmaze-medium-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 59.00 ± 11.18 → 97.75 ± 1.30 71.75 ± 2.95 → 89.75 ± 1.09 67.25 ± 10.47 → 97.25 ± 1.30 71.75 ± 3.27 → 98.75 ± 1.64
antmaze-medium-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 63.50 ± 6.84 → 97.25 ± 1.92 64.25 ± 1.92 → 92.25 ± 2.86 73.75 ± 7.29 → 94.50 ± 1.66 62.00 ± 4.30 → 98.25 ± 1.48
antmaze-large-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 28.75 ± 7.76 → 88.25 ± 2.28 38.50 ± 8.73 → 64.50 ± 17.04 31.50 ± 12.58 → 87.00 ± 3.24 31.75 ± 8.87 → 97.25 ± 1.79
antmaze-large-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 35.50 ± 3.64 → 91.75 ± 3.96 26.75 ± 3.77 → 64.25 ± 4.15 17.50 ± 7.26 → 81.00 ± 14.14 44.00 ± 8.69 → 91.50 ± 3.91
antmaze average 18.12 → 16.46 48.38 → 95.58 56.29 → 78.50 52.88 → 92.38 53.04 → 97.33
pen-cloned-v1 88.66 ± 15.10 → 86.82 ± 11.12 -2.76 ± 0.08 → -1.28 ± 2.16 84.19 ± 3.96 → 102.02 ± 20.75 6.19 ± 5.21 → 43.63 ± 20.09 -2.66 ± 0.04 → -2.68 ± 0.12
door-cloned-v1 0.93 ± 1.66 → 0.01 ± 0.00 -0.33 ± 0.01 → -0.33 ± 0.01 1.19 ± 0.93 → 20.34 ± 9.32 -0.21 ± 0.14 → 0.02 ± 0.31 -0.33 ± 0.01 → -0.33 ± 0.01
hammer-cloned-v1 1.80 ± 3.01 → 0.24 ± 0.04 0.56 ± 0.55 → 2.85 ± 4.81 1.35 ± 0.32 → 57.27 ± 28.49 3.97 ± 6.39 → 3.73 ± 4.99 0.25 ± 0.04 → 0.17 ± 0.17
relocate-cloned-v1 -0.04 ± 0.04 → -0.04 ± 0.01 -0.33 ± 0.01 → -0.33 ± 0.01 0.04 ± 0.04 → 0.32 ± 0.38 -0.24 ± 0.01 → -0.15 ± 0.05 -0.31 ± 0.05 → -0.31 ± 0.04
adroit average 22.84 → 21.76 -0.72 → 0.22 21.69 → 44.99 2.43 → 11.81 -0.76 → -0.79

Regrets

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 0.04 ± 0.01 0.02 ± 0.00 0.07 ± 0.00 0.02 ± 0.00 0.01 ± 0.00
antmaze-umaze-diverse-v2 0.88 ± 0.01 0.09 ± 0.01 0.43 ± 0.11 0.22 ± 0.07 0.05 ± 0.01
antmaze-medium-play-v2 1.00 ± 0.00 0.08 ± 0.01 0.09 ± 0.01 0.06 ± 0.00 0.04 ± 0.01
antmaze-medium-diverse-v2 1.00 ± 0.00 0.08 ± 0.00 0.10 ± 0.01 0.05 ± 0.01 0.04 ± 0.01
antmaze-large-play-v2 1.00 ± 0.00 0.21 ± 0.02 0.34 ± 0.05 0.29 ± 0.07 0.13 ± 0.02
antmaze-large-diverse-v2 1.00 ± 0.00 0.21 ± 0.03 0.41 ± 0.03 0.23 ± 0.08 0.13 ± 0.02
antmaze average 0.82 0.11 0.24 0.15 0.07
pen-cloned-v1 0.46 ± 0.02 0.97 ± 0.00 0.37 ± 0.01 0.58 ± 0.02 0.98 ± 0.01
door-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.83 ± 0.03 0.99 ± 0.01 1.00 ± 0.00
hammer-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.65 ± 0.10 0.98 ± 0.01 1.00 ± 0.00
relocate-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00
adroit average 0.86 0.99 0.71 0.89 0.99
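The table does not define regret. A common convention in offline-to-online work, assumed here rather than taken from this README, is one minus the mean normalized score (success rate in [0, 1]) over the evaluations made during online fine-tuning. That reading is at least consistent with the rows above where the score stays at 0.00 and the regret is 1.00. The function name `regret` and the sample values are hypothetical.

```python
def regret(success_rates):
    """Regret under the assumed definition: 1 - mean success rate.

    `success_rates` are evaluation results in [0, 1] logged during
    online fine-tuning. This definition is an assumption, not confirmed
    by the README.
    """
    if not success_rates:
        raise ValueError("need at least one evaluation")
    return 1.0 - sum(success_rates) / len(success_rates)


# A policy that never succeeds accumulates the maximum regret.
never_succeeds = regret([0.0, 0.0, 0.0])

# A policy that is good from the start accumulates little regret.
mostly_succeeds = regret([0.9, 1.0, 1.0, 1.0])

print(never_succeeds, mostly_succeeds)
```

Under this definition lower is better, and a method that reaches a high final score late in fine-tuning can still show a large regret.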

Citing CORL

If you use CORL in your work, please cite it using the following BibTeX entry:

@inproceedings{
tarasov2022corl,
  title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
  author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
  booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
  year={2022},
  url={https://openreview.net/forum?id=SyAS49bBcv}
}
