Why single process on Push not work #19

Ericonaldo · 2021-11-13T15:08:59Z

Hi, Tianhong, thanks for sharing the code. I've tried to run your code based on the guidance in readme

mpirun -np 8 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

BUt surprisingly I find that running

mpirun -np 1 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

does not work at all.

Do you happen to know the reason why it does not work?

The text was updated successfully, but these errors were encountered:

Ericonaldo · 2021-11-15T07:09:56Z

I find that with a larger batch size, HER still not work, do you know why?

TianhongDai · 2021-11-16T10:54:45Z

@Ericonaldo Hi, in actually, MPI = a large batch size. Could I know what is the batch size (a larger batch size) when you train the push task, please?

Ericonaldo · 2021-11-16T11:04:04Z

Hi, I've tried 4 processes and 2 processes, they both work but a single process with 2048 batch size cannot work.

TianhongDai · 2021-11-16T11:16:40Z

@Ericonaldo Hi - What I guess is because of the diversity of samples - before the agent updates the network, if you use single process, in each epoch, it will only collect 2*50 = 100 episodes. Then, the agent will sample batch size of episodes from replay buffer and sample one transition from each of sampled episode for the training. In this case, even for 50 epochs, the agent only collects 5000 unique episodes (50 * 100). Although you use batch_size=2048, the diversity of samples is still limited when num_process=1 and you will sample numbers of repeated episodes during training. However, when you use num_process=2, the agent can sample transitions from double-sized collected episodes. But I'm not sure if this is the real reason and welcome for the further discussion.

Ericonaldo · 2021-11-17T15:03:50Z

If this is true, we should be able to succeed by scaling the number of episodes by K times? However, it seems not work either.

TianhongDai · 2021-11-17T15:10:12Z

@Ericonaldo Hmm - that's a good point. An interesting finding is here: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 . I follow the setting of OpenAI, they use sum instead of avg to gather the gradient from MPI workers. I will try to use avg operation to see if it will affect the performance. Will update here later.

Ericonaldo · 2021-11-17T15:12:12Z

Great and many thanks. I did this because I find my own implementation of HER can only reach a success rate of 70-80% and I am figuring out what really matters in the training.

TianhongDai · 2021-11-17T15:15:02Z

@Ericonaldo Yes - it's quiet tricky of HER implementation...

TianhongDai · 2021-11-17T22:49:21Z

@Ericonaldo I found that the SUM operator will influence the performance: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22
Here, instead of using SUM, I average the gradient according to the number of MPI workers as:

comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
# average the gradient.
global_grads /= comm.Get_size()

Then, I plot the training curve using 2 MPI workers, and when the gradient is averaged, the performance will drop. In this case - if we don't average the gradient, the update of the network will become something like: x' = x - (lr * num_mpi) * avg_grad (assume it's a simple SGD optimizer), and the learning rate is increased. I'm not sure if this is the main reason, but we can keep doing more experiments to verify it.

Ericonaldo · 2021-11-18T06:25:07Z

This seems an important reason, but when I run with a single process, it just can not get any evidence of learning... (at least the avg gradient of 2 processes works slowly)

TianhongDai · 2021-11-18T10:35:41Z

@Ericonaldo Yes - I agree, need to carry out more experiment to verify. We can use this channel to continue the discussion.

Ericonaldo · 2022-02-21T07:19:55Z

I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks, after fine-tune some values I found that with only a single process can achieve some good results.

TianhongDai · 2022-02-21T13:27:58Z

I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks, after fine-tune some values I found that with only a single process can achieve some good results.

@Ericonaldo Thanks! This is a great finding.

Ericonaldo closed this as completed Nov 13, 2021

Ericonaldo reopened this Nov 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why single process on Push not work #19

Why single process on Push not work #19

Ericonaldo commented Nov 13, 2021

Ericonaldo commented Nov 15, 2021

TianhongDai commented Nov 16, 2021 •

edited

Loading

Ericonaldo commented Nov 16, 2021

TianhongDai commented Nov 16, 2021 •

edited

Loading

Ericonaldo commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

Ericonaldo commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

Ericonaldo commented Nov 18, 2021

TianhongDai commented Nov 18, 2021

Ericonaldo commented Feb 21, 2022

TianhongDai commented Feb 21, 2022

Why single process on Push not work #19

Why single process on Push not work #19

Comments

Ericonaldo commented Nov 13, 2021

Ericonaldo commented Nov 15, 2021

TianhongDai commented Nov 16, 2021 • edited Loading

Ericonaldo commented Nov 16, 2021

TianhongDai commented Nov 16, 2021 • edited Loading

Ericonaldo commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

Ericonaldo commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

TianhongDai commented Nov 17, 2021

Ericonaldo commented Nov 18, 2021

TianhongDai commented Nov 18, 2021

Ericonaldo commented Feb 21, 2022

TianhongDai commented Feb 21, 2022

TianhongDai commented Nov 16, 2021 •

edited

Loading

TianhongDai commented Nov 16, 2021 •

edited

Loading