I am struggling to understand your reasoning here:
Issue: The paper states that the number of action sequences should be 2^N. However, I could only find one sequence of all-right actions plus N other sequences that terminate with a wrong action, which puts the number of transitions in the replay memory at (N(N+1)/2 + N).
Can you show how this holds for a simple case such as N = 3?
Here is mine:
This will form our replay memory. In total, there will be (N*(N+1)/2 + N) transitions in the list.
This also doesn't match what the paper reports. According to the paper:
The replay memory contains all the relevant experience (the total number of transitions is 2^(n+1) - 2)
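For reference, here is a small sketch of one reading that reproduces both numbers. It assumes that each of the 2^N action strings is executed until the first wrong action, and that trajectories which end up identical are still stored separately. Under that assumption the count matches the paper's 2^(n+1) - 2, and storing each distinct trajectory only once gives N(N+1)/2 + N. The function name `count_transitions` and the 0/1 encoding of wrong/right actions are my own, and this is not necessarily how the paper or this repository builds the memory.

```python
from itertools import product


def count_transitions(n):
    """Count replay transitions for an n-step chain under two conventions.

    Assumption (not confirmed by the paper): in every state exactly one
    action is "right" (encoded 1, moves forward) and the other is "wrong"
    (encoded 0, terminates the episode immediately).
    """
    with_duplicates = 0            # every one of the 2^n action strings is replayed
    distinct_trajectories = set()  # each executed trajectory stored only once

    for actions in product((0, 1), repeat=n):
        executed = []
        for action in actions:
            executed.append(action)
            if action == 0:        # wrong action: episode ends here
                break
        with_duplicates += len(executed)
        distinct_trajectories.add(tuple(executed))

    deduplicated = sum(len(t) for t in distinct_trajectories)
    return with_duplicates, deduplicated


if __name__ == "__main__":
    n = 3
    with_duplicates, deduplicated = count_transitions(n)
    print(with_duplicates, 2 ** (n + 1) - 2)       # both 14 for n = 3
    print(deduplicated, n * (n + 1) // 2 + n)      # both 9 for n = 3
```

For N = 3 the 8 action strings break down as 4 that fail on the first step (1 transition each), 2 that fail on the second (2 each), 1 that fails on the third (3), and 1 that succeeds (3), giving 4 + 4 + 3 + 3 = 14. Keeping only the 4 distinct trajectories gives 1 + 2 + 3 + 3 = 9.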
In the paper they show that returning from state N to state 1 can give a reward of either 1 (green arrow) or 0 (dashed red arrow). How did you decide to implement this?