Resume Training with Previous Experience (state-action-state')? #1134

Open
wenjunli-0 opened this issue Aug 26, 2021 · 6 comments
Labels: question (Further information is requested)

Comments

@wenjunli-0

I am using Stable Baselines and I want to train an agent in varying environments, i.e. the environment hyper-parameter is adjusted every 1000 timesteps.

```python
for i in range(100):
    a = i * 2
    env = CustomizedEnv(parameter=a)  # new environment with the updated hyper-parameter
    model.set_env(env)  # attach the new environment to the existing model; learn() resets it

    model.learn(total_timesteps=1000, reset_num_timesteps=False)
    model.save(save_dir + 'timestep_{}'.format(i))
```

Describe the bug
I want to know whether, if I resume training this way, the previous interaction experience will automatically be used in the current training. As i increases, will the model have access to a growing experience space in the buffer?

If not, could you please let me know how I can do this with stable-baselines? Thanks.

Miffyli added the question label Aug 26, 2021
@Miffyli
Collaborator

Miffyli commented Aug 26, 2021

The exact answer depends on the algorithm you use, but at least with DQN the code re-creates the replay buffer on every call to learn.

However, in stable-baselines3 the buffer is not re-created, so calling learn again would use the samples from the previous learn call as well.
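
For illustration, a minimal sketch of this behaviour with SB3's DQN (the environment id, buffer size and timestep counts below are placeholders, not taken from the thread):

```python
from stable_baselines3 import DQN

# In SB3 the replay buffer lives on the model object, so repeated learn()
# calls keep adding to (and sampling from) the same buffer.
model = DQN("MlpPolicy", "CartPole-v1", buffer_size=50_000, verbose=0)

model.learn(total_timesteps=1_000, reset_num_timesteps=False)
print(model.replay_buffer.size())  # transitions stored so far

model.learn(total_timesteps=1_000, reset_num_timesteps=False)
print(model.replay_buffer.size())  # larger: the earlier transitions were kept
```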

@wenjunli-0
Author

> The exact answer depends on the algorithm you use, but at least with DQN the code re-creates the replay buffer on every call to learn.
>
> However, in stable-baselines3 the buffer is not re-created, so calling learn again would use the samples from the previous learn call as well.

Thanks for your swift response. I am using TRPO and PPO. So, you mean stable-baselines3 would be more suitable for this problem (because stable-baselines3 keeps both the previous and the current samples in the buffer), right?

@Miffyli
Collaborator

Miffyli commented Aug 26, 2021

> Thanks for your swift response. I am using TRPO and PPO. So, you mean stable-baselines3 would be more suitable for this problem (because stable-baselines3 keeps both the previous and the current samples in the buffer), right?

I would recommend using SB3 in any case (unless you really need TRPO), as it is more up-to-date and is actively supported/maintained :)

But: if you are using TRPO/PPO, then the answer to your original question is "no". These algorithms use a rollout buffer to collect samples, which are then discarded after they have been used to update the policy, so no samples are retained for a longer time (this is a "feature" of these algorithms).
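
For illustration, a minimal sketch of the on-policy case with SB3's PPO (the environment id and n_steps value are placeholders): each learn() call fills a fresh rollout buffer, uses it once for the policy update, and then discards it, so nothing is carried over between calls.

```python
from stable_baselines3 import PPO

# PPO/TRPO are on-policy: the rollout buffer is refilled with fresh
# transitions before every update and emptied afterwards.
model = PPO("MlpPolicy", "CartPole-v1", n_steps=256, verbose=0)

for i in range(3):
    # Every call gathers brand-new rollouts with the current environment;
    # samples from previous iterations are never revisited.
    model.learn(total_timesteps=1_000, reset_num_timesteps=False)
```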

@wenjunli-0
Author

> > Thanks for your swift response. I am using TRPO and PPO. So, you mean stable-baselines3 would be more suitable for this problem (because stable-baselines3 keeps both the previous and the current samples in the buffer), right?
>
> I would recommend using SB3 in any case (unless you really need TRPO), as it is more up-to-date and is actively supported/maintained :)
>
> But: if you are using TRPO/PPO, then the answer to your original question is "no". These algorithms use a rollout buffer to collect samples, which are then discarded after they have been used to update the policy, so no samples are retained for a longer time (this is a "feature" of these algorithms).

Okay, I will stick to SB3 in my later experiments. SB3 has A2C, DDPG, DQN, HER, PPO, SAC and TD3; could you please point out which algorithms support this continued-training feature? I am not that familiar with some of the algorithms, so an explicit answer would be a great help.

@Miffyli
Collaborator

Miffyli commented Aug 26, 2021

I think any algorithm with a replay buffer should work like this, so: DDPG, DQN, SAC and TD3.
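
For illustration, a minimal sketch of the off-policy case with SB3's SAC (the environment id, buffer size and file names are placeholders, not from the thread): the replay buffer lives on the model between learn() calls, and SB3's off-policy algorithms also provide save_replay_buffer()/load_replay_buffer() so the collected experience can survive a save/load cycle.

```python
from stable_baselines3 import SAC

# Off-policy algorithms (DDPG, DQN, SAC, TD3) keep their replay buffer on
# the model, so experience accumulates across learn() calls.
model = SAC("MlpPolicy", "Pendulum-v1", buffer_size=100_000, verbose=0)

for i in range(3):
    model.learn(total_timesteps=1_000, reset_num_timesteps=False)
    print(model.replay_buffer.size())  # grows with every call

# The replay buffer is not included in model.save(), but it can be stored
# and restored explicitly alongside the model checkpoint.
model.save("sac_checkpoint")
model.save_replay_buffer("sac_replay_buffer")

model = SAC.load("sac_checkpoint")
model.load_replay_buffer("sac_replay_buffer")
```

If the environment changes between calls, as in the loop from the original question, model.set_env(new_env) swaps in the new environment without touching the replay buffer.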

@rambo1111

#1192
