Problem Description

Checklist
poetry install (see CleanRL's installation guideline).

Current Behavior
In ppo_rnd_envpool.py (also ppo_atari_envpool.py), the implementation of RecordEpisodeStatistics keeps accumulating rewards after a time-limit truncation, since self.episode_returns is only masked by info["terminated"]. This means that in Atari, the returns of two independent rounds (a round ends when the agent loses all of its lives) will be added together if the previous round gets reset due to time-limit truncation.
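Roughly, the wrapper's per-step bookkeeping looks like the sketch below (paraphrased rather than copied verbatim; names such as infos["reward"] and the old four-tuple step API are assumptions on my part, the relevant point is that the counters are only masked by infos["terminated"]):

import gym
import numpy as np

class RecordEpisodeStatistics(gym.Wrapper):
    # Paraphrased sketch of the wrapper in question, not a verbatim copy.
    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)

    def reset(self, **kwargs):
        observations = super().reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super().step(action)
        self.episode_returns += infos["reward"]
        self.episode_lengths += 1
        self.returned_episode_returns[:] = self.episode_returns
        self.returned_episode_lengths[:] = self.episode_lengths
        # The running counters are only zeroed on true termination, so a
        # time-limit truncation leaves them intact even though envpool resets
        # that sub-env and starts a new game.
        self.episode_returns *= 1 - infos["terminated"]
        self.episode_lengths *= 1 - infos["terminated"]
        infos["r"] = self.returned_episode_returns
        infos["l"] = self.returned_episode_lengths
        return observations, rewards, dones, infos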
The following is what I observe when training using envpool with max_episode_steps=27000 (default value in envpool).
Here is how I log (adapted from this line):
for idx, d in enumerate(done):
    log_rewards[idx].append(reward[idx])
    if info["terminated"][idx]:
        avg_returns.append(info["r"][idx])
        print(f"Env {idx} finishes a round with length {info['l'][idx]} and score {info['r'][idx]}")
        log_rewards[idx] = []
Here are the logs I got:
Env 0 finishes a round with length 54012 and score 1900
...
Env 0 finishes a round with length 81016 and score 4900
It's problematic since info["l"][idx] should never exceed 27000. I checked that when the timestep hits 27000, the environment is indeed reset. This means the scores across rounds are being summed up (note that 54012 ≈ 2 × 27000 and 81016 ≈ 3 × 27000, so those reported lengths span two and three truncated rounds, respectively).
Expected Behavior
The game score is expected to be the sum of rewards over all the lives within a single round.
Possible Solution
Should we change this line so that the episode counters are also reset upon time-limit truncation?
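One possible shape of such a change, sketched as an illustration only (it assumes a per-env boolean truncated array can be derived, e.g. from envpool's elapsed-step counter reaching max_episode_steps; this is not necessarily the exact edit that should land):

# Sketch only: also zero the running counters on truncation, so a timed-out
# game does not leak its return into the next one. `truncated` is a
# hypothetical np.ndarray of booleans, one entry per sub-env.
reset_mask = 1 - np.logical_or(infos["terminated"], truncated)
self.episode_returns *= reset_mask
self.episode_lengths *= reset_mask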
Steps to Reproduce
Run the following script:
You should see the output:
See the above example's output:
The return in the new episode (Ep=1) is not reset to zero but is carried over from the return of the old episode. The expected behavior is to reset the return counter to zero upon timeout.
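A self-contained toy illustration of the same bookkeeping issue (plain NumPy with hypothetical rewards and flags; this is not the reproduction script itself):

import numpy as np

# One env, reward of 1 per step, episodes cut by a 5-step time limit;
# "terminated" never fires, so the running return is never reset and keeps
# growing across episodes.
episode_return = np.zeros(1, dtype=np.float32)
for step in range(1, 11):
    reward = np.ones(1, dtype=np.float32)
    terminated = np.zeros(1, dtype=bool)   # the game never truly ends here
    truncated = np.array([step % 5 == 0])  # time limit every 5 steps
    episode_return += reward
    if truncated[0]:
        print(f"Ep ends at step {step}, reported return {episode_return[0]}")
    episode_return *= 1 - terminated       # the problematic mask
# Prints reported returns 5.0 and then 10.0 instead of 5.0 and 5.0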
@vwxyzjn