This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

PPO converges slowly #1

Open
ceteri opened this issue Apr 27, 2020 · 5 comments

ceteri commented Apr 27, 2020

Got a problem with RLlib while training with a custom environment.
It uses a simple env where the action space is defined as a single parameter in the range [0.0, 60.0]:

```python
self.action_space = spaces.Box(np.float32(0.0), high, shape=(1,))
```
When using OpenAI Gym to [run this environment](https://github.com/DerwenAI/gym_projectile/blob/master/example.py) through many steps, it seems to work correctly.
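
For reference, a minimal sketch of that standalone Gym loop (assuming `gym_projectile` registers `projectile-v0` on import, as in the linked example):

```python
import gym
import gym_projectile  # assumption: importing the package registers "projectile-v0"

env = gym.make("projectile-v0")
obs = env.reset()

for _ in range(10):
    action = env.action_space.sample()          # shape (1,), theta in [0.0, 60.0]
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```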

However, when using PPO to [train a policy](https://github.com/DerwenAI/gym_projectile/blob/c56d3ab15248c4767721fabb8e3731b0522b62cc/train.py#L21):

```python
import ray.rllib.agents.ppo as ppo
SELECT_ENV = "projectile-v0"

config = ppo.DEFAULT_CONFIG.copy()
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

for _ in range(n_iter):
    result = agent.train()
```
... then the actions sampled by RLlib appear to stay very close to the `low` value of the Box, as long as `low` is zero.
In contrast, if `low` is non-zero, then that value is used for every action and never varies.

For example, the env is a simple physics simulation of projectile trajectories: the action is a launch angle `theta`, and the observation includes the resulting projectile `range`. The problem I'm seeing with RLlib is that `theta` never goes much above zero, so the `range` also stays in the neighborhood of zero. Based on the standalone Gym simulation, the median `range` should be up in the thousands, but RLlib biases it far too low:

```
(pid=57245) location: [2124, 1]
(pid=57245) location: [2124, 0]
(pid=57245) location: [2124, 27]
(pid=57245) location: [2124, 0]
(pid=57245) location: [2124, 74]
(pid=57245) location: [2124, 30]
(pid=57245) location: [2124, 74]
(pid=57245) location: [2124, 34]
(pid=57245) location: [2124, 67]
(pid=57245) location: [2124, 0]
(pid=57245) location: [2124, 0]
(pid=57245) location: [2124, 86]
(pid=57245) location: [2124, 0]
```
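
A direct way to see the same bias is to query the trained policy outside the env's own logging (a sketch; `agent` is the `PPOTrainer` from the snippet above, and `compute_action` is the single-observation inference call on the Trainer):

```python
import gym
import gym_projectile  # assumption: registers "projectile-v0"

env = gym.make(SELECT_ENV)
obs = env.reset()

for _ in range(20):
    action = agent.compute_action(obs)          # stays pegged near the Box's low bound
    obs, reward, done, info = env.step(action)
    print("theta:", action, "obs:", obs)
    if done:
        obs = env.reset()
```
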
Am I configuring the `agent.train()` part incorrectly?

I have also noticed that RLlib's use of Gym environments is *very* sensitive to odd and relatively undocumented edge cases: RLlib's preprocessing will throw exceptions for what are otherwise valid configurations of the action space and observation space.

ceteri commented Apr 27, 2020

There's a related issue reported: ray-project/ray#8088


ceteri commented Apr 27, 2020

Using SAC instead (with its squashed Gaussian action distribution) resolves some of this, although the firing solutions still seem to converge slowly.

I also found what appear to be some undocumented dependencies on the Box bounds: if the absolute value of an `action_space` Box bound is > 1.0 and the lower bound is > 0.0, then SAC also has the problem of pegging the action to its lower bound.

Also, I've found that, so far, during rollout the sampled actions stay in the range [0.0, 1.0] regardless of how I set the `action_space` Box. Maybe I've omitted some required configuration for the rollout part? In any case, when I make the action space itself range over [0.0, 1.0], both training and rollout behave properly.
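
For concreteness, the workaround amounts to something like this hypothetical wrapper (a sketch, not code from the repo): keep the Box that the agent sees in [0.0, 1.0] and map the sampled value back to the physical range, e.g. [0.0, 60.0], before the env uses it.

```python
import gym
import numpy as np
from gym import spaces

class RescaleAction(gym.ActionWrapper):
    """Expose a [0.0, 1.0] Box to the agent; map actions back to the env's own bounds."""

    def __init__(self, env):
        super().__init__(env)
        self.orig_low = env.action_space.low
        self.orig_high = env.action_space.high
        self.action_space = spaces.Box(0.0, 1.0, shape=env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # linear map from [0.0, 1.0] back to [low, high], e.g. [0.0, 60.0]
        return self.orig_low + action * (self.orig_high - self.orig_low)
```

The wrapped env would then be registered with RLlib in place of the raw one, so the policy only ever sees the unit-range Box.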

ceteri self-assigned this Apr 27, 2020

matej-macak-qb commented Apr 27, 2020

@ceteri in the case of a continuous `action_space` bounded to [0.0, 1.0] there is still slow convergence, per the issue I raised in ray-project/ray#8088. I have confirmed this to be an issue with the IMPALA algorithm as well.


ceteri commented Apr 27, 2020

Thank you @matej-macak-qb
@sven1977 pointed me toward what you've researched, and that's much appreciated.

I took a similar approach of starting with a simple env.

Glad we're getting more observations and analysis pointing toward the underlying issues. I understand there's work scheduled on RLlib to try to resolve this.


ceteri commented Apr 30, 2020

Also related: ray-project/ray#8218
