
PPO on continuous actions #77

Open
zaksemenov opened this issue Mar 27, 2024 · 11 comments

@zaksemenov

I noticed that the PPO agent initialization forces is_action_continuous=False, whereas the PPO algorithm itself and other libraries implementing PPO allow continuous actions. Can this be added to Pearl as well?

https://github.com/facebookresearch/Pearl/blob/main/pearl/policy_learners/sequential_decision_making/ppo.py#L99
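
For context, here is a minimal sketch, in plain PyTorch rather than Pearl's API (all names below are illustrative), of the main piece continuous-action PPO needs: a Gaussian actor whose log-probabilities feed the same clipped-ratio objective used in the discrete case.

```python
import torch
import torch.nn as nn


class GaussianActorHead(nn.Module):
    """Illustrative continuous-action policy: a diagonal Gaussian over actions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64) -> None:
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        # A state-independent log-std is a common PPO parameterization.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())


# PPO's clipped surrogate only needs log-probabilities of the sampled actions,
# so the usual ratio r_theta = exp(log_prob_new - log_prob_old) still applies;
# per-dimension log-probs are summed over the action dimension.
```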

@yiwan-rl (Contributor)

Thanks for your suggestion. We actually plan to add continuous control PPO soon.

@rodrigodesalvobraz added the enhancement (New feature or request) label on Mar 28, 2024
@kuds commented Aug 9, 2024

Following up on this issue. Do you have an ETA for when this feature might be implemented? I would be interested in contributing to this if possible.

@rodrigodesalvobraz (Contributor)

> Following up on this issue. Do you have an ETA for when this feature might be implemented? I would be interested in contributing to this if possible.

Continuous-action PPO is scheduled for sometime towards the end of the year. If you implement it yourself, we would be delighted to accept the contribution!

@kuds commented Aug 14, 2024

@rodrigodesalvobraz

Awesome! Thanks for the opportunity! I will start working on this and add any development updates/questions to this issue.

@rodrigodesalvobraz (Contributor)

> @rodrigodesalvobraz
>
> Awesome! Thanks for the opportunity! I will start working on this and add any development updates/questions to this issue.

BTW, please don't forget to check CONTRIBUTING.md for important information on contributing to the project.

@kuds commented Aug 15, 2024

@rodrigodesalvobraz

Quick question, did you mean to close this issue, or does Pearl support PPO with continuous action spaces so it can be closed?

@rodrigodesalvobraz (Contributor)

Oops, sorry, I didn't mean to close the issue. Thanks for pointing it out. No, Pearl still does not support PPO with continuous action spaces. Thanks.

@kuds commented Aug 21, 2024

@rodrigodesalvobraz

Development Update

I have spent the last few days getting a better understanding of Pearl and its modules (replay buffer, policy learner, etc.). I also got PPO for discrete action spaces working in two Gymnasium environments (CartPole-v1 & LunarLander-v2). The implementation of PPO for continuous action spaces is coded, and I am currently troubleshooting some bugs. I plan to wrap up this development in early September.

Next Steps

  • Finish implementation of PPO for continuous action spaces
  • Add new unit tests (if needed)
  • Create a tutorial to demonstrate using PPO in discrete and continuous action spaces

Questions

  • Should PPO for continuous action spaces be broken into its own file like SAC?
  • Why do the baseline models use tanh layers between the fully connected layers instead of ReLU? Just preference?

@rodrigodesalvobraz (Contributor) commented Aug 21, 2024

Good to hear of your progress, @kuds.

> Questions
>
>   • Should PPO for continuous action spaces be broken into its own file like SAC?

Yes, please.

>   • Why do the baseline models use tanh layers between the fully connected layers instead of ReLU? Just preference?

I could only find tanh being used in VanillaContinuousActorNetwork. Since that is the last layer, it is probably there so that the output is normalized (ReLU wouldn't guarantee that).
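
To illustrate that point, here is a sketch (not Pearl's VanillaContinuousActorNetwork; the dimensions are made up): a final tanh keeps the output in [-1, 1] so it can be rescaled to the environment's action bounds, whereas ReLU would leave it unbounded above and clipped at zero below.

```python
import torch
import torch.nn as nn

# Illustrative only: hidden layers can use any activation, but a final tanh
# bounds the output to [-1, 1] so it can be rescaled to the action-space bounds.
actor = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 2), nn.Tanh(),
)

low = torch.tensor([-1.0, -2.0])
high = torch.tensor([1.0, 2.0])
raw = actor(torch.randn(1, 8))                   # values in (-1, 1)
action = low + (raw + 1.0) * 0.5 * (high - low)  # rescaled to [low, high]
```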

@kuds commented Sep 5, 2024

@rodrigodesalvobraz

Development Update

I finished working through the bugs for PPO in continuous action spaces. I am cleaning up my changes and adding new unit tests for the ContinuousProximalPolicyOptimization class. I should have the pull request submitted sometime early next week.

Next Steps

  • Finish implementation of PPO for continuous action spaces
  • Add new unit tests
  • Create a tutorial to demonstrate using PPO in discrete and continuous action spaces

Questions

  • Why does the PPO for discrete action spaces sum the losses for the actor network instead of taking the mean?

    pearl/policy_learners/sequential_decision_making/ppo.py on line 131
    loss = torch.sum(-torch.min(r_theta * batch.gae, clip * batch.gae))

  • As part of this implementation, should I normalize the generalized advantage estimation (gae) at the batch level before applying it to the clipped loss?

@yiwan-rl (Contributor) commented Sep 6, 2024

Hi Kuds, I think sum and mean both work if one uses an optimizer that normalizes the gradient, such as Adam or RMSprop, but mean seems to be better if one uses SGD. Ideally, GAE normalization should be provided as an option and applied in the actor loss computation. Thanks for your work!
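
For what it's worth, here is a sketch of what an optional batch-level GAE normalization combined with a mean-reduced clipped loss could look like (illustrative only, not Pearl's implementation; r_theta, gae, and epsilon follow the snippet quoted above):

```python
import torch


def clipped_actor_loss(
    r_theta: torch.Tensor,       # pi_new(a|s) / pi_old(a|s), shape (batch,)
    gae: torch.Tensor,           # generalized advantage estimates, shape (batch,)
    epsilon: float = 0.2,
    normalize_gae: bool = True,  # optional, as suggested above
) -> torch.Tensor:
    if normalize_gae:
        gae = (gae - gae.mean()) / (gae.std() + 1e-8)
    clip = torch.clamp(r_theta, 1.0 - epsilon, 1.0 + epsilon)
    # Mean instead of sum: same update direction under Adam/RMSprop,
    # but less sensitive to batch size under plain SGD.
    return -torch.min(r_theta * gae, clip * gae).mean()
```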
