Prioritizing past experiences based on temporal-difference (TD) error, as in prioritized experience replay (sketch below)
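For my own reference, a minimal sketch of proportional prioritization by |TD error|, in the spirit of prioritized experience replay (the `alpha` value and buffer layout are my assumptions, not from the paper):

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritization: sample transitions with
    probability proportional to |TD error| ** alpha."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha
        self.transitions = []   # (s, a, r, s_next, done) tuples
        self.priorities = []    # |TD error| per stored transition

    def add(self, transition, td_error):
        self.transitions.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)  # avoid zero priority

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=p)
        return [self.transitions[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # Refresh priorities after the learner recomputes TD errors.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```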
Optimality tightening (He et al., 2017) is similar to this paper: it also exploits return-based lower bounds on Q-values
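Roughly, as I recall He et al. (2017), the lower bound being tightened comes from unrolling the Bellman optimality equation along the observed trajectory (exact in deterministic environments, in expectation otherwise):

```latex
% k-step return plus a bootstrapped tail bounds the optimal Q-value from below
Q^*(s_t, a_t) \ge \sum_{i=0}^{k} \gamma^{i} r_{t+i}
    + \gamma^{k+1} \max_{a'} Q^*(s_{t+k+1}, a')
```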
Experience replay for actor-critic
the actor-critic framework can also utilize experience replay
difference between off-policy and on-policy learning (Stack Overflow discussion)
Off-policy evaluation involves importance sampling (e.g., ACER and Reactor use Retrace for evaluation), which may not benefit much from past experience if the old behavior policy is very different from the current policy
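A small sketch of why the importance correction hurts with stale data: Retrace-style truncated weights c_t = lambda * min(1, pi/mu) are multiplied along the trajectory, so they collapse toward zero once the behavior policy mu is far from the current policy pi (the probabilities and lambda below are made-up numbers):

```python
import numpy as np

def truncated_is_weights(pi_probs, mu_probs, lam=1.0):
    """Retrace-style truncated importance weights c_t = lam * min(1, pi/mu).
    Their running product scales future TD errors in the off-policy update."""
    rho = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    c = lam * np.minimum(1.0, rho)
    return c, np.cumprod(c)

# Behavior policy close to the current policy: cumulative weights stay near 1.
print(truncated_is_weights([0.5, 0.4, 0.6], [0.5, 0.5, 0.5])[1])
# Very different (old) behavior policy: the cumulative weight vanishes,
# so those old transitions barely contribute to the update.
print(truncated_is_weights([0.05, 0.10, 0.02], [0.90, 0.80, 0.90])[1])
```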
This paper does not involve importance sampling and is applicable to both discrete and continuous control
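My reading of how SIL avoids the ratio: it only imitates the agent's own past transitions whose return exceeds the current value estimate, weighting them by (R - V)_+ instead of pi/mu. A rough PyTorch-style sketch (the loss coefficient and tensor shapes are my assumptions):

```python
import torch

def sil_losses(log_prob_a, value, returns, value_loss_coef=0.01):
    """Self-imitation losses as I understand them: only transitions with a
    positive (R - V) gap are imitated; no importance-sampling ratio appears,
    so the same form works for discrete and continuous actions
    (log_prob_a is just log pi(a|s) under the current policy)."""
    clipped_adv = torch.clamp((returns - value).detach(), min=0.0)  # (R - V)_+
    policy_loss = -(log_prob_a * clipped_adv).mean()                # imitate good actions
    value_loss = 0.5 * (torch.clamp(returns - value, min=0.0) ** 2).mean()
    return policy_loss + value_loss_coef * value_loss
```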
https://arxiv.org/abs/1806.05635
Abstract
SIL (Self-Imitation Learning) is proposed to verify that exploiting past good experiences can indirectly drive deep exploration.
1. Introduction
2. Related work
Combining policy gradient and Q-learning (e.g., PGQL); this paper proposes lower-bound Q-learning to exploit good experiences
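My paraphrase of the lower-bound idea: the return obtained by any past behavior is (in expectation) a lower bound on the optimal action-value, so the estimate is only pushed up when a stored return exceeds it. A hedged sketch (function and argument names are mine):

```python
import torch

def lower_bound_q_loss(q_values, returns):
    """Push Q(s, a) up toward an observed return R only when R > Q(s, a),
    i.e. minimize 0.5 * (R - Q)_+^2; returns below the estimate are ignored."""
    return 0.5 * (torch.clamp(returns - q_values, min=0.0) ** 2).mean()
```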
3. Self Imitation Learning
4. Theoretical Justification
5. Experiment
6. Conclusion