
Why single process on Push not work #19

Open
Ericonaldo opened this issue Nov 13, 2021 · 13 comments

Comments

@Ericonaldo

Hi, Tianhong, thanks for sharing the code. I've tried to run your code following the guidance in the README:

mpirun -np 8 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

But surprisingly, I find that running

mpirun -np 1 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

does not work at all.

Do you happen to know the reason why it does not work?

@Ericonaldo
Author

I find that with a larger batch size, HER still does not work. Do you know why?

@TianhongDai
Owner

TianhongDai commented Nov 16, 2021

@Ericonaldo Hi, actually, using MPI is effectively the same as using a large batch size. Could I ask what batch size (the larger one) you used when training the push task, please?

@Ericonaldo
Author

Hi, I've tried 4 processes and 2 processes; both work, but a single process with a batch size of 2048 does not.

@TianhongDai
Owner

TianhongDai commented Nov 16, 2021

@Ericonaldo Hi - My guess is that it is due to the diversity of samples. With a single process, the agent only collects 2 * 50 = 100 episodes per epoch before updating the network. It then samples a batch of episodes from the replay buffer and takes one transition from each sampled episode for training. In this case, even after 50 epochs the agent has only collected 5000 unique episodes (50 * 100). So although you use batch_size=2048, the diversity of samples is still limited when num_process=1, and many repeated episodes are sampled during training. When you use num_process=2, however, the agent can sample transitions from twice as many collected episodes. But I'm not sure whether this is the real reason - further discussion is welcome.
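
As a rough sanity check on this argument, here is a small back-of-the-envelope sketch in Python. The constants follow the numbers above (2 rollouts per cycle, 50 cycles per epoch, 50 epochs, batch_size=2048); updates_per_cycle and the function names are only illustrative assumptions, not values taken from the repo.

# Back-of-the-envelope sketch of the sample-diversity argument above.
# Constants follow the discussion; updates_per_cycle is an assumption.
rollouts_per_cycle = 2
cycles_per_epoch = 50
epochs = 50
batch_size = 2048
updates_per_cycle = 40  # assumed number of gradient steps per cycle

def unique_episodes(num_workers):
    # Total distinct episodes collected over the whole run.
    return num_workers * rollouts_per_cycle * cycles_per_epoch * epochs

def total_samples():
    # Transitions drawn from the replay buffer over the whole run.
    return epochs * cycles_per_epoch * updates_per_cycle * batch_size

for workers in (1, 2, 8):
    print(workers, "worker(s):", unique_episodes(workers), "unique episodes,",
          total_samples() // unique_episodes(workers), "samples drawn per episode on average")

With one worker the buffer holds far fewer distinct episodes per sample drawn, which is the diversity limitation described above.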

@Ericonaldo
Author

If this is true, we should be able to succeed by scaling the number of collected episodes by K times. However, that does not seem to work either.

@TianhongDai
Owner

@Ericonaldo Hmm - that's a good point. One interesting detail is here: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 . I follow OpenAI's setting: they use a sum instead of an average to gather the gradients from the MPI workers. I will try the average operation to see whether it affects the performance and will update here later.

@Ericonaldo
Author

Great, many thanks. I asked because my own implementation of HER only reaches a success rate of 70-80%, and I am trying to figure out what really matters in the training.

@TianhongDai
Owner

@Ericonaldo Yes - the HER implementation is quite tricky...

@TianhongDai
Owner

@Ericonaldo I found that the SUM operator does influence the performance: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22
Here, instead of using SUM alone, I average the gradient over the number of MPI workers:

comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
# average the gradient.
global_grads /= comm.Get_size()

Then, I plotted the training curve using 2 MPI workers: when the gradient is averaged, the performance drops. In other words, if we don't average the gradient, the network update effectively becomes x' = x - (lr * num_mpi) * avg_grad (assuming a simple SGD optimizer), i.e. the learning rate is effectively scaled up by the number of workers. I'm not sure if this is the main reason, but we can keep doing more experiments to verify it.

[plot: training curves with 2 MPI workers, summed vs. averaged gradients]
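
For context, a minimal sketch of this kind of MPI gradient synchronization is shown below, assuming PyTorch and mpi4py. It mirrors the sum-vs-average idea discussed above rather than the repo's exact mpi_utils code, and sync_grads here is only an illustrative name.

# Minimal sketch of MPI gradient synchronization (assumes PyTorch + mpi4py);
# it illustrates the sum-vs-average choice, not the repo's exact code.
import numpy as np
import torch
from mpi4py import MPI

def sync_grads(network, average=True):
    # All-reduce the gradients of `network` across MPI workers.
    # With average=False the gradients are summed (OpenAI-style), which for
    # plain SGD behaves like multiplying the learning rate by the worker count.
    comm = MPI.COMM_WORLD
    flat_grads = np.concatenate(
        [p.grad.detach().cpu().numpy().ravel() for p in network.parameters()])
    global_grads = np.zeros_like(flat_grads)
    comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
    if average:
        global_grads /= comm.Get_size()
    # Write the reduced gradients back into the network parameters.
    offset = 0
    for p in network.parameters():
        numel = p.grad.numel()
        p.grad.copy_(torch.from_numpy(
            global_grads[offset:offset + numel]).view_as(p.grad))
        offset += numel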

@Ericonaldo
Author

This seems like an important reason, but when I run with a single process, there is simply no evidence of learning at all... (at least the averaged gradient with 2 processes learns, albeit slowly).

@TianhongDai
Owner

@Ericonaldo Yes - I agree; we need to carry out more experiments to verify. We can use this channel to continue the discussion.

@Ericonaldo
Author

I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks. After fine-tuning some values, I found that even a single process can achieve good results.
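
For illustration, assuming the training script exposes --lr-actor and --lr-critic options, a single-process run with tuned learning rates would look something like the line below; the values shown are placeholders, not the specific ones found above.

mpirun -np 1 python -u train.py --env-name='FetchPush-v1' --lr-actor=0.0005 --lr-critic=0.0005 2>&1 | tee push.log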

@TianhongDai
Owner

> I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks. After fine-tuning some values, I found that even a single process can achieve good results.

@Ericonaldo Thanks! This is a great finding.
