QLearning comparison
WARNING: THESE RESULTS ARE NOT YET PEER-REVIEWED!
However, the faster learning speeds are consistent with the findings of the following peer-reviewed paper, which includes ONA in a comparison with Q-Learners: "Eberding, L. M., Thórisson, K. R., Sheikhlar, A., & Andrason, S. P. (2020). SAGE: Task-Environment Platform for Evaluating a Broad Range of AI Learners. In Artificial General Intelligence: 13th International Conference, AGI 2020, St. Petersburg, Russia, September 16–19, 2020, Proceedings (Vol. 12177, p. 72). Springer Nature."
Scripts for detailed procedure learning comparisons between branches are available in OpenNARS-for-Applications/misc/evaluation/. On this page we compare the QLearnerComparison branch, which implements a Q-Learner-based agent, with master.
The comparison requires the following directory structure:
BASE/master <- v0.8.5 (built via its build.sh)
BASE/QLearnerComparison <- QLearnerComparison branch (built via its build.sh)
BASE/comparison.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)
BASE/plot.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)
First run BASE/comparison.py to generate the outputs from the procedure learning examples. Then run BASE/plot.py to generate the output plots from the generated data.
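A minimal sketch of that workflow is shown below; it is not part of the repository and assumes the BASE/ layout above with both scripts invoked without arguments, as described.

```python
# Minimal sketch of the workflow described above (not part of the repository),
# assuming the BASE/ directory layout and that the scripts take no arguments.
# Run from within BASE/.
import subprocess

# Build both branches first, each with its own build.sh:
subprocess.run(["sh", "build.sh"], cwd="master", check=True)
subprocess.run(["sh", "build.sh"], cwd="QLearnerComparison", check=True)

# Generate the outputs of the procedure learning examples:
subprocess.run(["python3", "comparison.py"], check=True)

# Generate the plots from the produced data:
subprocess.run(["python3", "plot.py"], check=True)
```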
Besides its built-in capability to deal with multiple and changing objectives, ONA requires less implicitly example-dependent parameter tuning than Q-Learning:
- ONA does not rely on learning rate decay. How much new evidence changes an existing belief depends only on the amount of evidence that already supports it, which automatically makes high-confidence beliefs more stable (see the sketch after this list).
- ONA reduces motorbabbling by itself once the hypotheses it bases its decisions on are stable and predict successfully, and hence does not depend on a time-dependent reduction of the exploration rate either.
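To make the first point concrete, here is a small illustrative sketch (not ONA's actual implementation) of evidence counting with the standard NAL measures, assuming an evidential horizon of 1: a single new observation can shift a belief's frequency by at most 1/(w+1), so well-supported beliefs become stable without any decaying learning rate.

```python
# Illustrative sketch (not ONA's actual code) of evidence counting:
# frequency f = w_plus / w, confidence c = w / (w + 1), evidential horizon 1.

def revise(w_plus, w, positive):
    """Add one piece of evidence to a belief summarized by (w_plus, w)."""
    return w_plus + (1 if positive else 0), w + 1

w_plus, w = 0, 0
for _ in range(50):
    w_plus, w = revise(w_plus, w, positive=True)

f = w_plus / w      # frequency: proportion of supporting evidence (here 1.0)
c = w / (w + 1)     # confidence: approaches 1 as evidence accumulates
# One contradicting observation now changes f by only 1/(w+1) = 1/51,
# so the well-supported belief stays stable without a tuned learning rate decay.
```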
All time dependencies of hyperparameters are implicitly example-specific, and hence have to be avoided when generality is evaluated. Reducing the learning rate with the passing of time makes the Q-Learner take longer to change its policy when new circumstances demand it. Additionally, reducing motorbabbling over time makes it increasingly unlikely to attempt alternative solutions. Both are problematic if a good policy has not yet been found.
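For contrast, a generic sketch of the kind of time-dependent schedules commonly paired with Q-Learning (not the configuration used in this comparison): the decay constants directly encode an assumed task timescale and therefore do not transfer between examples.

```python
# Generic sketch of time-dependent Q-Learning schedules (not the settings
# used here). The decay constants encode an assumed task timescale:
# chosen for one example, they will be too fast or too slow for another.

def learning_rate(t, alpha0=0.5, decay=1e-3):
    return alpha0 / (1.0 + decay * t)                 # shrinks as time passes

def exploration_rate(t, eps0=1.0, eps_min=0.05, decay=1e-3):
    return max(eps_min, eps0 * (1.0 - decay) ** t)    # exploration fades out
```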
To ensure generality of the hyperparameter choice across tasks, the Q-Learning parameters were chosen via grid search (granularity 0.1) to maximize the product of competence scores across the 4 examples, which severely penalizes strong failure on any single example. The ONA parameters were likewise not varied across the examples. The grid search found the best Q-Learning hyperparameters to be alpha=0.1, gamma=0.1, lambda=0.8, epsilon=0.1; the ONA parameters are the default configuration of ONA v0.8.5.
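The selection criterion can be sketched as follows; the evaluate() function and its return value are hypothetical placeholders, the actual search is part of the evaluation scripts.

```python
# Sketch of the hyperparameter selection criterion described above.
# evaluate() is a hypothetical placeholder for running the Q-Learner on
# one example and returning a competence score in [0, 1].
from itertools import product
import math

def evaluate(example, alpha, gamma, lam, eps):
    return 0.5   # placeholder score; the real evaluation runs the simulation

grid = [round(0.1 * i, 1) for i in range(1, 10)]   # granularity 0.1
examples = ["pong", "pong2", "cartpole", "alien"]

best, best_score = None, -1.0
for alpha, gamma, lam, eps in product(grid, repeat=4):
    scores = [evaluate(ex, alpha, gamma, lam, eps) for ex in examples]
    p = math.prod(scores)   # near-zero on any single example ruins the product
    if p > best_score:
        best, best_score = (alpha, gamma, lam, eps), p
```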
The Pong experiment was run over 10000 iterations, which, at the simulation speed used, corresponds to roughly 500 opportunities to catch the ball on average. Both methods received the horizontal ball position relative to the bat, discretized into 3 values: left, center, right. The actions are ^left and ^right.
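A schematic sketch of this interface (illustrative only, not the actual code in misc/evaluation/): the agents see only the discretized relative ball position and choose between the two operations.

```python
# Schematic sketch of the Pong setup described above (illustrative only,
# not the actual misc/evaluation code).

ACTIONS = ["^left", "^right"]   # Pong2 additionally offers "^stop"

def observe(ball_x, bat_x, tolerance=1):
    """Discretize the ball position relative to the bat into 3 values."""
    if ball_x < bat_x - tolerance:
        return "left"
    if ball_x > bat_x + tolerance:
        return "right"
    return "center"
```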
The result highlights ONA's quicker learning ability, which allows it to reach high success rates much earlier on average.
The second example, Pong2, adds one more action, ^stop, which ideally would be invoked by the agent whenever the ball is centered relative to the bat:
The performance after 10K steps turned out to be comparable, though again ONA learned faster. Please also note that in Pong2 only ONA managed to learn the use of the ^stop operator, at least in 2 of 10 runs, which gave it a success ratio lead in these particular cases.
Unfortunately, with the same parameters, the Q-Learner struggled to find a good state-action mapping in Cartpole within 10K steps, except for two runs where it started to look promising, while ONA performed excellently in all runs. This simulation also runs for 10K simulation steps, giving the AI systems about 5000 opportunities to balance the pole:
Last, in the Alien (a form of Space Invaders) example, ONA learns faster on average, while the end performance after 10K steps is comparable (as tends to be the case when both techniques succeed, due to the similar behavior found):
- ONA does not need a specific reduction of learning rate and exploration rate to work well for a particular example, and hence needs less parameter tuning.
- ONA demands more computational resources than a table-based Q-Learning implementation.
- The recurring correlations ONA finds are not only about reward as consequent. This allows it to learn temporal patterns even in the absence of reward / goal fulfillment. Should an event become a goal in the future, or should its occurrence turn out to be necessary as an intermediate step to reach an outcome, ONA can immediately exploit the previously learned knowledge to make it happen.
- ONA allows pursuing multiple goals / objectives simultaneously rather than a single max. reward outcome.
- Goals in ONA can change at any time, and when this happens the system's knowledge can be used to achieve the new goals without having to re-learn a state-action mapping.
- The NARS and RL decision theories share the property that actions tend to be chosen which will most likely lead to the desired outcome or reward, though in NARS there is no guarantee, since multiple such outcomes can compete for attention simultaneously.