MultiDiscrete action spaces #146
Comments
After some digging into the a2c code, I realized that the log probabilities of the policy need to have the shape of (update_steps, num_processes) so that they can properly be multiplied with the advantages. As a quick workaround, we can sum the log probabilities across the dimensions of the action space by changing this line accordingly. This should fix A2C, but a general approach for supporting MultiDiscrete action spaces should be considered.
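For illustration only (the referenced line and the repository's variable names aren't shown above, so all shapes and names below are assumptions), the summing workaround in PyTorch could look like this:

```python
import torch
from torch.distributions import Categorical

# Illustrative shapes: logits for a MultiDiscrete space with `num_dims` sub-actions,
# each with `num_choices` options, collected over `update_steps` and `num_processes`.
update_steps, num_processes, num_dims, num_choices = 5, 8, 4, 11
logits = torch.randn(update_steps, num_processes, num_dims, num_choices)
actions = torch.randint(num_choices, (update_steps, num_processes, num_dims))

dist = Categorical(logits=logits)
# log_prob has shape (update_steps, num_processes, num_dims);
# summing over the last (action-dimension) axis gives (update_steps, num_processes),
# which can then be multiplied elementwise with the advantages.
log_probs = dist.log_prob(actions).sum(dim=-1)
advantages = torch.randn(update_steps, num_processes)
policy_loss = -(log_probs * advantages).mean()
```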
I guess you can wrap the output of …
I'm also running into a similar issue with my environment, and even before getting to the update rule, I'm facing the problem of having a multi-discrete action space with a different number of actions along each dimension. For example, dimension 1 has 5 actions, dimension 2 has 3 actions, and dimension 3 has 10 actions. How would I code up the final layer of the policy in that case? In the issue above, the author could nicely unflatten the tensor into uniform shapes along each dimension, but I'm not aware of any way to do that for multi-discrete action spaces with differently sized dimensions. Also, please let me know if you would rather have me open a new issue for this topic. Thanks!
I think this requires a new subclass of …
Yes, I think you're right. I've managed to get something simple working by modeling individual torch Categorical distributions before combining them. Thanks a lot, although please do consider including agents that support MultiDiscrete action spaces in the future; I think it would be really helpful.
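The working code from this comment isn't reproduced in the thread; a minimal sketch of the "individual Categorical distributions, then combine" idea for differently sized dimensions might look like the following (the class name MultiDiscretePolicyHead and all sizes are illustrative, not from the thread or the library):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical MultiDiscrete space with different sizes per dimension,
# e.g. nvec = [5, 3, 10] as in the comment above.
nvec = [5, 3, 10]

class MultiDiscretePolicyHead(nn.Module):
    """Final layer producing one logit vector per action dimension."""

    def __init__(self, feature_dim, nvec):
        super().__init__()
        self.nvec = list(nvec)
        # A single linear layer whose output is split into per-dimension logits.
        self.linear = nn.Linear(feature_dim, sum(self.nvec))

    def forward(self, features):
        flat_logits = self.linear(features)                          # (N, sum(nvec))
        split_logits = torch.split(flat_logits, self.nvec, dim=-1)   # one (N, n_i) per dimension
        return [Categorical(logits=l) for l in split_logits]

# Usage: sample one sub-action per dimension and sum the log probabilities.
head = MultiDiscretePolicyHead(feature_dim=64, nvec=nvec)
dists = head(torch.randn(2, 64))
actions = torch.stack([d.sample() for d in dists], dim=-1)           # (N, len(nvec))
log_prob = torch.stack(
    [d.log_prob(a) for d, a in zip(dists, actions.unbind(dim=-1))], dim=-1
).sum(dim=-1)                                                        # (N,)
```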
A perhaps easier but less clean workaround is to model it as a joint distribution of same-sized categorical distributions using …
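The utility named at the end of that comment is cut off; one way to express a joint distribution over same-sized categoricals in plain PyTorch is torch.distributions.Independent around a batched Categorical. This is only my reading of the suggestion, not necessarily what the commenter had in mind:

```python
import torch
from torch.distributions import Categorical, Independent

# Four action dimensions, each with 11 choices (same size along every dimension).
batch_size, num_dims, num_choices = 2, 4, 11
logits = torch.randn(batch_size, num_dims, num_choices)

# Categorical over the last axis gives batch_shape (batch_size, num_dims);
# Independent reinterprets the num_dims axis as part of the event,
# so log_prob/entropy return one value per batch element.
joint = Independent(Categorical(logits=logits), reinterpreted_batch_ndims=1)

actions = joint.sample()            # shape: (batch_size, num_dims)
log_prob = joint.log_prob(actions)  # shape: (batch_size,)
entropy = joint.entropy()           # shape: (batch_size,)
```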
@muupan Thanks for the suggestion about wrapping the output. @xylee95, can you share how you managed to get it working with different sizes of action spaces? I'm currently writing a class based on the MultiCategoricalDistribution from stable_baselines3 and hope to open a PR soon.
@tkelestemur Yes, that is exactly what I did. I wrote a class based on the MultiCategoricalDistribution from stable_baselines3 and changed some of the function names to fit the log_prob calls in the agent. It works fine, but I've only tested it with PPO so far and not with other agents. If you need more details, I'll be happy to share.
@xylee95 can you share your implementation? I've tried to write a subclass of …
@tkelestemur This is my implementation. It is almost a copy and paste of the stable_baselines3 code, and I did not write it as a sub-class of …
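The linked implementation itself isn't included above; as an illustration, a standalone class in the spirit of stable_baselines3's MultiCategoricalDistribution might look roughly like this (the name MultiCategorical and the exact method set are assumptions, not the thread's actual code):

```python
import torch
from torch.distributions import Categorical

class MultiCategorical:
    """Joint distribution over several independent Categorical sub-distributions."""

    def __init__(self, flat_logits, nvec):
        # Split the flat logit vector into one chunk per action dimension.
        self.distributions = [
            Categorical(logits=l) for l in torch.split(flat_logits, list(nvec), dim=-1)
        ]

    def log_prob(self, actions):
        # actions: (N, len(nvec)); sum the per-dimension log probabilities.
        return torch.stack(
            [d.log_prob(a) for d, a in zip(self.distributions, actions.unbind(dim=-1))],
            dim=-1,
        ).sum(dim=-1)

    def entropy(self):
        return torch.stack([d.entropy() for d in self.distributions], dim=-1).sum(dim=-1)

    def sample(self):
        return torch.stack([d.sample() for d in self.distributions], dim=-1)

    def mode(self):
        return torch.stack([d.probs.argmax(dim=-1) for d in self.distributions], dim=-1)

# Usage with the nvec from the earlier comment (5, 3, and 10 actions per dimension).
dist = MultiCategorical(torch.randn(2, 18), nvec=[5, 3, 10])
a = dist.sample()        # (2, 3)
lp = dist.log_prob(a)    # (2,)
```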
I have a custom environment with a MultiDiscrete action space. The MultiDiscrete action space allows controlling an agent with n-dimensional discrete action spaces.
In my environment, I have 4 dimensions, where each dimension has 11 actions. I'm trying to use A2C with a Softmax policy. Below is the implementation of the policy and value networks. The output of the policy gives me an [N, 4, 11] tensor, where N is the batch size. The softmax is applied to the last dimension of this tensor, so I basically have 4 action distributions. I thought this would work, but I'm getting the following error:
Do I need to make changes to the A2C or am I doing something wrong?
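The code snippet and error message from the original report aren't included above. As a rough, hedged reconstruction of the described setup (layer sizes and names are guesses, not the reporter's code), a policy/value pair producing a [N, 4, 11] softmax output might look like:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Illustrative policy/value networks: 4 action dimensions, 11 actions each."""

    def __init__(self, obs_dim, num_dims=4, num_choices=11):
        super().__init__()
        self.num_dims, self.num_choices = num_dims, num_choices
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, num_dims * num_choices)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.body(obs)
        # Reshape to [N, 4, 11] and apply softmax over the last axis,
        # giving one distribution per action dimension.
        logits = self.policy_head(h).view(-1, self.num_dims, self.num_choices)
        probs = torch.softmax(logits, dim=-1)
        value = self.value_head(h)
        return probs, value

net = PolicyValueNet(obs_dim=8)
probs, value = net(torch.randn(5, 8))  # probs: [5, 4, 11], value: [5, 1]
```

With such a head, Categorical(probs=probs).log_prob(actions) returns one log probability per action dimension, so, as discussed in the first comment, those values typically need to be summed over the action dimensions before being multiplied with the advantages.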