Resume Training with Previous Experience (state-action-state')? #1134
The exact answer depends on the algorithm you use, but at least with DQN the code re-creates the replay buffer on every call to `learn()`. However, in stable-baselines3 the buffer is not re-created, so calling `learn()` again continues filling the same buffer.
Thanks for your swift response. I am using TRPO and PPO. So you mean stable-baselines3 would be more suitable for this problem (because stable-baselines3 keeps both previous and current samples in the buffer), right?
I would recommend using SB3 in any case (unless you really need TRPO), as it is more up to date and actively supported/maintained :) But if you are using TRPO/PPO, the answer to your original question is "no". These algorithms use a rollout buffer to collect samples, which are discarded after they have been used to update the policy, so no samples are retained for longer (this is a "feature" of these algorithms).
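To make the on-policy point concrete, here is a minimal toy sketch (not SB3 code; the class and names are hypothetical) of how a rollout buffer is filled, consumed by one policy update, and then emptied:

```python
# Toy illustration of on-policy training: the rollout buffer is
# filled, used for a single policy update, then discarded.
class RolloutBuffer:
    def __init__(self):
        self.samples = []

    def add(self, transition):
        self.samples.append(transition)

    def reset(self):
        # Everything collected for this update is thrown away.
        self.samples = []

buf = RolloutBuffer()
for step in range(4):
    buf.add(("s", "a", "r", "s_next"))
# ... the policy update consumes buf.samples here ...
buf.reset()
print(len(buf.samples))  # 0 -- nothing carries over to the next rollout
```

This is why resuming TRPO/PPO training gives the next update no access to earlier transitions: each rollout starts from an empty buffer.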
Okay, I will stick to SB3 in my later experiments. There are A2C, DDPG, DQN, HER, PPO, SAC, and TD3 in SB3; could you please point out which algorithms support this continued-training feature? I am not that familiar with some of the algorithms, so an explicit answer would be a great help.
I think any algorithm with a replay buffer should work like this, so: DDPG, DQN, SAC and TD3.
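The off-policy behavior can be sketched with a toy agent (again not SB3 code; `Agent` and `train` here are hypothetical): the replay buffer outlives individual training calls, so resumed training keeps sampling from old experience alongside new experience.

```python
from collections import deque

# Toy illustration of off-policy training: the replay buffer belongs
# to the agent and persists across repeated calls to train().
class Agent:
    def __init__(self, buffer_size=10_000):
        self.replay_buffer = deque(maxlen=buffer_size)

    def train(self, n_steps):
        for t in range(n_steps):
            self.replay_buffer.append(("s", "a", "r", "s_next"))
            # ... gradient updates sample from the WHOLE buffer here ...

agent = Agent()
agent.train(1000)  # first training phase
agent.train(1000)  # resumed training: the buffer is NOT re-created
print(len(agent.replay_buffer))  # 2000 -- old transitions still available
```

Until the buffer hits its maximum size, every transition from earlier phases remains available to later updates; after that, the oldest transitions are evicted first.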
I am using stable-baselines and I want to train an agent with varying environments, i.e. an environment hyper-parameter is adjusted every 1000 timesteps.
Describe the bug
I want to know whether, if I resume training this way, the previous interaction experience will be automatically used in the current training. As i increases, will the model have access to a larger experience space in the buffer?
If not, could you please let me know how I can do this with stable-baselines? Thanks.
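The loop described above can be sketched as follows (a self-contained toy, not SB3 code; the `env_param` schedule and buffer size are made-up placeholders): the environment parameter changes every 1000 steps, but one replay buffer keeps accumulating, so later phases can sample transitions gathered under earlier settings.

```python
from collections import deque

# Toy sketch of the resume-training loop from the question: the
# environment hyper-parameter changes each phase, while a single
# replay buffer keeps growing across all phases.
replay_buffer = deque(maxlen=100_000)

for i in range(5):
    env_param = 0.1 * i  # hypothetical hyper-parameter schedule
    for t in range(1000):
        # Tag each transition with the phase's parameter for clarity.
        replay_buffer.append((env_param, "s", "a", "r", "s_next"))
    # ... policy updates here draw samples from ALL phases so far ...

print(len(replay_buffer))  # 5000 -- experience from every phase
```

With an off-policy SB3 algorithm (DQN, DDPG, SAC, TD3), the analogous behavior is obtained by calling the model's training method repeatedly on the same model object, since its replay buffer is not re-created between calls.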