Commit f492208

Add results for RL algorithm (#12)

HokageM authored Dec 4, 2023
1 parent a346382 commit f492208
Showing 13 changed files with 507 additions and 99 deletions.
145 changes: 49 additions & 96 deletions README.md
@@ -6,119 +6,72 @@ Inverse Reinforcement Learning Algorithm implementation with Python.

# Implemented Algorithms

## Maximum Entropy IRL: [1]
## Maximum Entropy IRL:

## Maximum Entropy Deep IRL
An implementation of the Maximum Entropy inverse reinforcement learning algorithm from [1], based on the
implementation of [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
It is an IRL algorithm that uses Q-Learning with a Maximum Entropy update function.
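As a rough sketch (the names `theta`, `expert_fe`, `learner_fe` and the learning rate are illustrative, not this repository's exact API), the Maximum Entropy update adjusts the linear reward weights so that the learner's state-visitation feature expectations move towards the expert's:

```python
import numpy as np

def maxent_irl_update(theta, expert_fe, learner_fe, learning_rate=0.05):
    # Gradient of the maximum entropy objective for a linear reward:
    # difference between expert and learner feature expectations.
    gradient = expert_fe - learner_fe
    theta += learning_rate * gradient
    return theta

# The estimated IRL reward of a discretized state is then linear in its features:
# irl_rewards = feature_matrix.dot(theta)
```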

# Experiments
## Maximum Entropy Deep IRL:

## Mountaincar-v0
[gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)

The expert demonstrations for the Mountaincar-v0 are the same as used in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).

*Heatmap of Expert demonstrations with 400 states*:

<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">

### Maximum Entropy Inverse Reinforcement Learning

IRL using Q-Learning with a Maximum Entropy update function.

#### Training

*Learner training for 1000 episodes*:

<img src="demo/learning_curves/maxent_999_flat.png">

*Learner training for 4000 episodes*:

<img src="demo/learning_curves/maxent_4999_flat.png">

#### Heatmaps

*Learner state frequencies after 1000 episodes*:

<img src="demo/heatmaps/learner_999_flat.png">

*Learner state frequencies after 2000 episodes*:

<img src="demo/heatmaps/learner_1999_flat.png">

*Learner state frequencies after 5000 episodes*:

<img src="demo/heatmaps/learner_4999_flat.png">

<img src="demo/heatmaps/theta_999_flat.png">

*State rewards heatmap after 5000 episodes*:

<img src="demo/heatmaps/theta_4999_flat.png">
An implementation of the Maximum Entropy inverse reinforcement learning algorithm that uses a neural
network for the actor.
The estimated IRL reward is learned similarly to Maximum Entropy IRL.
It is an IRL algorithm that uses Deep Q-Learning with a Maximum Entropy update function.
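As a hedged sketch of the idea (the actual deep IRL implementation in this repository may differ), the learned linear IRL reward simply replaces the environment reward inside the Deep Q-Learning update:

```python
import numpy as np

def estimated_irl_reward(feature_matrix, theta, state_idx):
    # Linear reward of a discretized state, exactly as in (non-deep) Maximum Entropy IRL.
    return float(feature_matrix[int(state_idx)].dot(theta))

# Inside the training loop, this value would be used instead of the environment
# reward when updating the Q-network, e.g. (illustrative names):
#   agent.update_q_network(state, action, estimated_irl_reward(F, theta, idx), next_state, done)
```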

*State rewards heatmap after 14000 episodes*:
## Maximum Entropy Deep RL:

<img src="demo/heatmaps/theta_13999_flat.png">
An implementation of the Maximum Entropy reinforcement learning algorithm.
It serves as an RL baseline for comparison with the IRL algorithms.

#### Testing
# Experiment

*Testing results of the model after 29000 episodes*:

<img src="demo/test_results/test_maxentropy_flat.png">


### Deep Maximum Entropy Inverse Reinforcement Learning

IRL using Deep Q-Learning with a Maximum Entropy update function.

#### Training

*Learner training for 1000 episodes*:

<img src="demo/learning_curves/maxentdeep_999_w_reset_10.png">

*Learner training for 5000 episodes*:

<img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png">

#### Heatmaps

*Learner state frequencies after 1000 episodes*:

<img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png">

*Learner state frequencies after 2000 episodes*:

<img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png">

*Learner state frequencies after 5000 episodes*:

<img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png">

*State rewards heatmap after 1000 episodes*:

<img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png">
## Mountaincar-v0

*State rewards heatmap after 2000 episodes*:
Mountaincar-v0 is used to evaluate the different algorithms.
For this purpose, the MDP implementation of the Mountaincar
from [gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/) is used.
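For illustration, here is a minimal sketch of how the continuous MountainCar observation (position, velocity) can be discretized into the 20 x 20 = 400 states used for the feature matrix and the heatmaps below; the exact binning in this repository may differ:

```python
import gym
import numpy as np

# Assumed discretization into 20 x 20 = 400 states.
env = gym.make("MountainCar-v0")
n_bins = 20
lows, highs = env.observation_space.low, env.observation_space.high

def state_to_idx(obs):
    position_idx = int((obs[0] - lows[0]) / (highs[0] - lows[0]) * (n_bins - 1))
    velocity_idx = int((obs[1] - lows[1]) / (highs[1] - lows[1]) * (n_bins - 1))
    return position_idx * n_bins + velocity_idx

feature_matrix = np.eye(n_bins * n_bins)  # one-hot feature per discretized state
```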

<img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png">
The expert demonstrations for the Mountaincar-v0 are the same as used
in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).

*State rewards heatmap after 5000 episodes*:
*Heatmap of Expert demonstrations with 400 states*:

<img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png">
<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">

### Comparing the algorithms

#### Testing
The following tables compare the results of training and testing the two IRL algorithms, Maximum Entropy and
Maximum Entropy Deep. In addition, results for the Maximum Entropy Deep RL algorithm are shown to
highlight the differences between IRL and RL.

*Testing results of the best model after 5000 episodes*:
| Algorithm | Training Curve after 1000 Episodes | Training Curve after 5000 Episodes |
|--------------------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| Maximum Entropy IRL | <img src="demo/learning_curves/maxent_999_flat.png" width="400"> | <img src="demo/learning_curves/maxent_4999_flat.png" width="400"> |
| Maximum Entropy Deep IRL | <img src="demo/learning_curves/maxentdeep_999_w_reset_10.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png" width="400"> |
| Maximum Entropy Deep RL | <img src="demo/learning_curves/maxentdeep_999_RL.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_RL.png" width="400"> |

<img src="demo/test_results/test_maxentropydeep_best_model_results.png">
| Algorithm | State Frequencies Learner: 1000 Episodes | State Frequencies Learner: 2000 Episodes | State Frequencies Learner: 5000 Episodes |
|--------------------------|-----------------------------------------------------------------------------|------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| Maximum Entropy IRL | <img src="demo/heatmaps/learner_999_flat.png" width="400"> | <img src="demo/heatmaps/learner_1999_flat.png" width="400"> | <img src="demo/heatmaps/learner_4999_flat.png" width="400"> |
| Maximum Entropy Deep IRL | <img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png" width="400"> |
| Maximum Entropy Deep RL | <img src="demo/heatmaps/learner_999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_1999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_4999_deep_RL.png" width="400"> |

### Deep Maximum Entropy Inverse Reinforcement Learning with Critic
| Algorithm | IRL Rewards: 1000 Episodes | IRL Rewards: 2000 Episodes | IRL Rewards: 5000 Episodes | IRL Rewards: 14000 Episodes |
|--------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------|
| Maximum Entropy IRL | <img src="demo/heatmaps/theta_999_flat.png" width="400"> | None | <img src="demo/heatmaps/theta_4999_flat.png" width="400"> | <img src="demo/heatmaps/theta_13999_flat.png" width="400"> |
| Maximum Entropy Deep IRL | <img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png" width="400"> | None |
| Maximum Entropy Deep RL | None | None | None | None |

Coming soon...
| Algorithm | Testing Results: 100 Runs |
|--------------------------|-----------------------------------------------------------------------------------------|
| Maximum Entropy IRL | <img src="demo/test_results/test_maxentropy_flat.png" width="400"> |
| Maximum Entropy Deep IRL | <img src="demo/test_results/test_maxentropydeep_best_model_results.png" width="400"> |
| Maximum Entropy Deep RL | <img src="demo/test_results/test_maxentropydeep_best_model_RL_results.png" width="400"> |

# References
The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
[lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent)

[1] [B. D. Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008](https://cdn.aaai.org/AAAI/2008/AAAI08-227.pdf).
@@ -133,12 +86,12 @@ pip install .
# Usage

```commandline
usage: irl [-h] [--version] [--training] [--testing] [--render] ALGORITHM
usage: irl-runner [-h] [--version] [--training] [--testing] [--render] ALGORITHM
Implementation of IRL algorithms
positional arguments:
ALGORITHM Currently supported training algorithm: [max-entropy, max-entropy-deep]
ALGORITHM Currently supported training algorithm: [max-entropy, max-entropy-deep, max-entropy-deep-rl]
options:
-h, --help show this help message and exit
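
# Example invocation (assumed, based on the usage string above):
# train the Maximum Entropy Deep RL agent and render the environment
irl-runner --training --render max-entropy-deep-rl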
Binary file added demo/heatmaps/learner_1999_deep_RL.png
Binary file added demo/heatmaps/learner_4999_deep_RL.png
Binary file added demo/heatmaps/learner_999_deep_RL.png
Binary file added demo/learning_curves/maxentdeep_4999_RL.png
Binary file added demo/learning_curves/maxentdeep_999_RL.png
Binary file not shown.
2 changes: 1 addition & 1 deletion setup.cfg
@@ -78,7 +78,7 @@ testing =
# script_name = irlwpython.module:function
# For example:
console_scripts =
irl = irlwpython.main:run
irl-runner = irlwpython.main:run
# And any other entry points, for example:
# pyscaffold.cli =
# awesome = pyscaffoldext.awesome.extension:AwesomeExtension
197 changes: 197 additions & 0 deletions src/irlwpython/MaxEntropyDeepRL.py
@@ -0,0 +1,197 @@
import numpy as np
import math

import torch
import torch.optim as optim
import torch.nn as nn

from irlwpython.FigurePrinter import FigurePrinter


class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(64, 32)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(32, output_size)

        self.printer = FigurePrinter()

    def forward(self, state):
        x = self.fc1(state)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        q_values = self.output_layer(x)
        return q_values


class MaxEntropyDeepRL:
    def __init__(self, target, state_dim, action_size, feature_matrix, one_feature, learning_rate=0.001, gamma=0.99):
        self.feature_matrix = feature_matrix
        self.one_feature = one_feature

        self.target = target

        self.q_network = QNetwork(state_dim, action_size)
        self.target_q_network = QNetwork(state_dim, action_size)
        self.target_q_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)

        self.gamma = gamma

        self.printer = FigurePrinter()

    def select_action(self, state, epsilon):
        """
        Selects an action epsilon-greedily based on the Q-values from the network.
        :param state: Current environment state.
        :param epsilon: Exploration rate for the epsilon-greedy policy.
        :return: Index of the selected action.
        """
        if np.random.rand() < epsilon:
            return np.random.choice(3)
        else:
            with torch.no_grad():
                q_values = self.q_network(torch.FloatTensor(state))
                return torch.argmax(q_values).item()

    def update_q_network(self, state, action, reward, next_state, done):
        """
        Updates the Q-network based on the observed reward.
        :param state: Current state.
        :param action: Action taken in the current state.
        :param reward: Reward received from the environment.
        :param next_state: Resulting state after taking the action.
        :param done: Whether the episode terminated.
        :return:
        """
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)
        q_values = self.q_network(state)
        next_q_values = self.target_q_network(next_state)

        target = q_values.clone()
        if not done:
            target[action] = reward + self.gamma * torch.max(next_q_values).item()
        else:
            target[action] = reward

        loss = nn.MSELoss()(q_values, target.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target_network(self):
        """
        Updates the target network with the weights of the online Q-network.
        :return:
        """
        self.target_q_network.load_state_dict(self.q_network.state_dict())

    def train(self, n_states, episodes=30000, max_steps=200,
              epsilon_start=1.0,
              epsilon_decay=0.995, epsilon_min=0.01):
        """
        Trains the network using the maximum entropy deep reinforcement learning algorithm.
        :param n_states: Number of discretized states (rows of the feature matrix).
        :param episodes: Count of training episodes
        :param max_steps: Max steps per episode
        :param epsilon_start: Initial exploration rate.
        :param epsilon_decay: Multiplicative decay applied to the exploration rate.
        :param epsilon_min: Lower bound of the exploration rate.
        :return:
        """
        learner_feature_expectations = np.zeros(n_states)

        epsilon = epsilon_start
        episode_arr, scores = [], []

        best_reward = -math.inf
        for episode in range(episodes):
            state, info = self.target.env_reset()
            total_reward = 0

            for step in range(max_steps):
                action = self.select_action(state, epsilon)

                next_state, reward, done, _, _ = self.target.env_step(action)
                total_reward += reward

                self.update_q_network(state, action, reward, next_state, done)
                self.update_target_network()

                # State counting for density
                state_idx = self.target.state_to_idx(state)
                learner_feature_expectations += self.feature_matrix[int(state_idx)]

                state = next_state
                if done:
                    break

            # Keep track of best performing network
            if total_reward > best_reward:
                best_reward = total_reward
                torch.save(self.q_network.state_dict(),
                           f"../results/maxentropydeep_{episode}_best_network_w_{total_reward}_RL.pth")

            if (episode + 1) % 10 == 0:
                # calculate density
                learner = learner_feature_expectations / episode
                learner_feature_expectations = np.zeros(n_states)

            scores.append(total_reward)
            episode_arr.append(episode)
            epsilon = max(epsilon * epsilon_decay, epsilon_min)
            print(f"Episode: {episode + 1}, Total Reward: {total_reward}, Epsilon: {epsilon}")

            if (episode + 1) % 1000 == 0:
                score_avg = np.mean(scores)
                print('{} episode average score is {:.2f}'.format(episode, score_avg))
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxent_{episodes}_{episode}_qnetwork_RL.png")
                self.printer.save_heatmap_as_png(learner.reshape((20, 20)),
                                                 f"../heatmap/learner_{episode}_deep_RL.png")
                # Unlike the IRL variants, the pure RL agent learns no reward weights (theta),
                # so there is no theta heatmap to save here.

                torch.save(self.q_network.state_dict(), f"../results/maxent_{episodes}_{episode}_network_main.pth")

            if episode == episodes - 1:
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxentdeep_{episodes}_qdeep_RL.png")

        torch.save(self.q_network.state_dict(), f"src/irlwpython/results/maxentdeep_{episodes}_q_network_RL.pth")

    def test(self, model_path, epsilon=0.01, repeats=100):
        """
        Tests the previously trained model.
        :param model_path: Path to the saved Q-network weights.
        :param epsilon: Exploration rate used during testing.
        :param repeats: Number of test episodes.
        :return:
        """
        self.q_network.load_state_dict(torch.load(model_path))
        episodes, scores = [], []

        for episode in range(repeats):
            state, info = self.target.env_reset()
            score = 0

            while True:
                self.target.env_render()
                action = self.select_action(state, epsilon)
                next_state, reward, done, _, _ = self.target.env_step(action)

                score += reward
                state = next_state

                if done:
                    scores.append(score)
                    episodes.append(episode)
                    break

            if episode % 1 == 0:
                print('{} episode score is {:.2f}'.format(episode, score))

        self.printer.save_plot_as_png(episodes, scores,
                                      "src/irlwpython/learning_curves"
                                      "/test_maxentropydeep_best_model_RL_results.png")
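
Below is a hypothetical usage sketch showing how the class above might be wired up against MountainCar-v0. The wrapper class, its interface, and the parameter values are assumptions for illustration; the actual wiring lives in `irlwpython.main`.

```python
import gym
import numpy as np

from irlwpython.MaxEntropyDeepRL import MaxEntropyDeepRL

class MountainCarTarget:
    """Minimal stand-in for the repository's MountainCar wrapper (assumed interface)."""
    def __init__(self, n_bins=20):
        self.env = gym.make("MountainCar-v0")
        self.n_bins = n_bins

    def env_reset(self):
        return self.env.reset()

    def env_step(self, action):
        return self.env.step(action)

    def env_render(self):
        self.env.render()

    def state_to_idx(self, state):
        # Discretize (position, velocity) into one of n_bins * n_bins states.
        low, high = self.env.observation_space.low, self.env.observation_space.high
        pos = int((state[0] - low[0]) / (high[0] - low[0]) * (self.n_bins - 1))
        vel = int((state[1] - low[1]) / (high[1] - low[1]) * (self.n_bins - 1))
        return pos * self.n_bins + vel

n_states = 400                     # 20 x 20 discretized states (assumed)
feature_matrix = np.eye(n_states)  # one-hot feature per state (assumed)

agent = MaxEntropyDeepRL(target=MountainCarTarget(), state_dim=2, action_size=3,
                         feature_matrix=feature_matrix, one_feature=20)
agent.train(n_states, episodes=5000)
agent.test("path/to/best_network_RL.pth")
```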