Reinforcement Learning,
a learning paradigm inspired by behaviourist psychology and classical conditioning - learning by trial and error, interacting with an environment to map situations to actions in such a way that some notion of cumulative reward is maximized. In computer games, reinforcement learning deals with adjusting feature weights based on results or their subsequent predictions during self play.



Reinforcement learning is indebted to the idea of Markov decision processes (MDPs) in the field of optimal control utilizing dynamic programming techniques. The crucial tradeoff between exploitation and exploration in multi-armed bandit problems - between "exploitation" of the machine with the highest expected payoff and "exploration" to gather more information about the expected payoffs of the other machines - is also faced in reinforcement learning, and is likewise considered in UCT for Monte-Carlo Tree Search ^{[1]}.
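The exploitation/exploration tradeoff above can be made concrete with the UCB1 rule that UCT builds on: always play the machine maximizing its observed mean payoff plus a bonus that grows for rarely tried machines. A minimal sketch in Python, assuming three invented machines with fixed payout probabilities:

```python
import math
import random

def ucb1(counts, values, t, c=math.sqrt(2)):
    """Pick the arm maximizing mean payoff plus an exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm            # sample every arm once before comparing
    return max(range(len(counts)),
               key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))

# Three hypothetical machines with unknown payout probabilities.
random.seed(1)
payout = [0.2, 0.5, 0.8]
counts = [0, 0, 0]                # pulls per arm
values = [0.0, 0.0, 0.0]          # running mean reward per arm

for t in range(1, 1001):
    arm = ucb1(counts, values, t)
    reward = 1.0 if random.random() < payout[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

# The bonus shrinks as an arm is sampled, so play concentrates on the best
# arm while the others are still tried occasionally.
```

UCT applies the same formula at every node of a search tree, treating each move as an arm of a bandit.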

## Q-Learning

Q-Learning, introduced by Chris Watkins in 1989, is a simple way for agents to learn how to act optimally in controlled Markovian domains ^{[2]}. It amounts to an incremental method for dynamic programming that imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely ^{[3]}. Q-learning was successfully combined with deep learning by a Google DeepMind team to play Atari 2600 games, as published in Nature, 2015, dubbed deep reinforcement learning or deep Q-networks ^{[4]}, soon followed by the spectacular AlphaGo and AlphaZero breakthroughs.

Q-learning at its simplest uses tables to store data. This approach quickly loses viability as the state/action space of the system it monitors or controls grows. One solution to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Gerald Tesauro in his temporal difference learning research on Backgammon ^{[5]} ^{[6]}.

## Temporal Difference Learning

see main page Temporal Difference Learning

Temporal Difference Learning is a prediction method primarily used for reinforcement learning. In the domain of computer games and computer chess, TD learning is applied through self play, subsequently predicting the probability of winning a game during the sequence of moves from the initial position until the end, to adjust weights for a more reliable prediction.
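To make the tabular case concrete, the following minimal Python sketch (the corridor environment and all constants are invented for illustration, not taken from any engine) applies the one-step Q-learning update Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)), which is itself a temporal-difference update driven by trial-and-error interaction with the environment:

```python
import random

# Toy environment (invented for illustration): a corridor of 5 cells.
# States 0..4, actions 0 = left, 1 = right; reaching cell 4 pays reward 1
# and ends the episode. The optimal policy is obviously "always go right".
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1    # learning rate, discount, exploration

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q-table: one row per state

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def greedy(state):
    # break ties randomly so untried actions are not systematically ignored
    best = max(Q[state])
    return random.choice([a for a in ACTIONS if Q[state][a] == best])

random.seed(0)
for _ in range(500):                       # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        s2, r = step(s, a)
        # one-step Q-learning update: move Q(s,a) toward r + gamma * max Q(s',.)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "right" has the higher value in every non-terminal state,
# with Q decaying by gamma per step away from the reward: 1.0, 0.9, 0.81, 0.729.
```

Replacing the table Q with a neural network evaluated on features of the state gives the function-approximation setting mentioned above.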

## See also

UCT

## Selected Publications

## 1954 ...

Richard E. Bellman (1954). On a new Iterative Algorithm for Finding the Solutions of Games and Linear Programming Problems. Technical Report P-473, RAND Corporation, U. S. Air Force Project RAND

Arthur L. Samuel (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal, July 1959

## 1960 ...

(1960). Sequential Machines, Ambiguity, and Dynamic Programming. Journal of the ACM, Vol. 7, No. 1

Ronald A. Howard (1960). Dynamic Programming and Markov Processes. MIT Press, amazon

Donald Michie (1961). Trial and Error. Penguin Science Survey

(1968). Boxes: An experiment on adaptive control. Machine Intelligence 2, Edinburgh: Oliver & Boyd, pdf

## 1970 ...

Harry Klopf (1972). Brain Function and Adaptive Systems - A Heterostatic Theory. Air Force Cambridge Research Laboratories, Special Reports, No. 133, pdf

John H. Holland (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. amazon.com

## 1980 ...

Richard Sutton (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. dissertation, University of Massachusetts

Leslie Valiant (1984). A Theory of the Learnable. Communications of the ACM, Vol. 27, No. 11, pdf

Chris Watkins (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, pdf

## 1990 ...

Richard Sutton, Andrew Barto (1990). Time Derivative Models of Pavlovian Reinforcement. Learning and Computational Neuroscience: Foundations of Adaptive Networks: 497-537

Chris Watkins, Peter Dayan (1992). Q-learning. Machine Learning, Vol. 8, No. 2

Gerald Tesauro (1992). Temporal Difference Learning of Backgammon Strategy. ML 1992

(1993). Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach. NIPS 1993, pdf

Michael Littman (1994). Markov Games as a Framework for Multi-Agent Reinforcement Learning. International Conference on Machine Learning, pdf

## 1995 ...

(1995). TD Learning of Game Evaluation Functions with Hierarchical Neural Architectures. Master's thesis, University of Amsterdam, pdf

Gerald Tesauro (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3

(1996). Reinforcement Learning: An Alternative Approach to Machine Intelligence. pdf

Leslie Kaelbling, Michael Littman, Andrew Moore (1996). Reinforcement Learning: A Survey. JAIR, Vol. 4, pdf

(1996). General Game-Playing and Reinforcement Learning. Computational Intelligence, Vol. 12, No. 1

Ronald Parr, Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. In Advances in Neural Information Processing Systems 10, MIT Press, zipped ps

(1997). Adversarial Reinforcement Learning. Carnegie Mellon University, ps

(1997). Generalizing Adversarial Reinforcement Learning. Carnegie Mellon University, ps

Marco Wiering, Jürgen Schmidhuber (1997). HQ-learning. Adaptive Behavior, Vol. 6, No. 2

Csaba Szepesvári (1998). Reinforcement Learning: Theory and Practice. Proceedings of the 2nd Slovak Conference on Artificial Neural Networks, zipped ps

Richard Sutton, Andrew Barto (1998). Reinforcement Learning: An Introduction. MIT Press

Vassilis Papavassiliou, Stuart Russell (1999). Convergence of reinforcement learning with general function approximators. In Proc. IJCAI-99, Stockholm, ps

Marco Wiering (1999). Explorations in Efficient Reinforcement Learning. Ph.D. thesis, University of Amsterdam, advisors Frans Groen and Jürgen Schmidhuber

## 2000 ...

(2000). A Review of Reinforcement Learning. AI Magazine, Vol. 21, No. 1

Robert Levinson, Ryan Weber (2000). Chess Neighborhoods, Function Combination, and Reinforcement Learning. CG 2000

Andrew Ng, Stuart Russell (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, California: Morgan Kaufmann, pdf

(2000). An Integrated Connectionist Approach to Reinforcement Learning for Robotic Control. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

(2000). Reinforcement Learning on POMDPs via Direct Gradient Ascent. ICML 2000, pdf

Doina Precup (2000). Temporal Abstraction in Reinforcement Learning. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts, Amherst

Robert Levinson, Ryan Weber (2001). Chess Neighborhoods, Function Combinations and Reinforcement Learning. In Computers and Games (eds. Tony Marsland and I. Frank), Lecture Notes in Computer Science, Springer, pdf

Marco Block (2003). Reinforcement Learning in der Schachprogrammierung. Studienarbeit, Freie Universität Berlin, Dozent: Prof. Dr. Raúl Rojas, pdf (German)

(2003). Learning to play chess using reinforcement learning with database games. Master's thesis, Cognitive Artificial Intelligence, Utrecht University

Joelle Pineau, Geoff Gordon, Sebastian Thrun (2003). Point-based value iteration: An anytime algorithm for POMDPs. IJCAI, pdf

Yngvi Björnsson, Vignir Hafsteinsson, Ársæll Jóhannsson, Einar Jónsson (2004). Efficient Use of Reinforcement Learning in a Computer Game. In Computer Games: Artificial Intelligence, Design and Education (CGAIDE'04), pp. 379–383, pdf

(2004). Reinforcement learning in board games. CSTR-04-004, Department of Computer Science, University of Bristol, pdf ^{[7]}

(2004). Efficient Exploration for Reinforcement Learning. MSc thesis, pdf

(2004). Multiagent Reinforcement Learning in Stochastic Games with Continuous Action Spaces. pdf

## 2005 ...

(2006). Learning for stochastic dynamic programming. pdf, pdf

Sylvain Gelly (2007). A Contribution to Reinforcement Learning; Application to Computer Go. Ph.D. thesis, pdf

(2007). State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network. ISNN 2007

(2007). Reinforcement Learning of Evaluation Functions Using Temporal Difference-Monte Carlo learning method. 12th Game Programming Workshop

Marco Block, Maro Bader, Ernesto Tapia, Marte Ramírez, Ketill Gunnarsson, Erik Cuevas, Daniel Zaldivar, Raúl Rojas (2008). Using Reinforcement Learning in Chess Engines. CONCIBE SCIENCE 2008, Research in Computing Science: Special Issue in Electronics and Biomedical Engineering, Computer Science and Informatics, ISSN: 1870-4069, Vol. 35, pp. 31-40, Guadalajara, Mexico, pdf

(2008). Grid Differentiated Services: a Reinforcement Learning Approach. In 8th IEEE Symposium on Cluster Computing and the Grid, Lyon, pdf

David Silver (2009). Reinforcement Learning and Simulation-Based Search. Ph.D. thesis, University of Alberta, pdf

## 2010 ...

(2010). Reinforcement Learning via AIXI Approximation. Association for the Advancement of Artificial Intelligence (AAAI), pdf

(2010). Multi-objective Reinforcement Learning for Responsive Grids. In The Journal of Grid Computing, pdf

(2011). Exploration and Exploitation in Online Learning. ICAIS 2011

(2011). Reinforcement Learning with a Bilinear Q Function. EWRL 2011

Marco Wiering, Martijn van Otterlo (eds.) (2012). Reinforcement learning: State-of-the-art. Adaptation, Learning, and Optimization, Vol. 12, Springer

István Szita (2012). Reinforcement Learning in Games. Chapter 17

(2012). Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS 2012, pdf

(2013). Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search. Journal of Artificial Intelligence Research, Vol. 48, pdf

(2013). Reinforcement Learning in the Game of Othello: Learning Against a Fixed Opponent and Learning from Self-Play. ADPRL 2013

(2013). Reinforcement Learning to Train Ms. Pac-Man Using Higher-order Action-relative Inputs. ADPRL 2013 ^{[8]}

(2013). Reinforcement Learning. Dagstuhl Reports, Vol. 3, No. 8, DOI: 10.4230/DagRep.3.8.1, URN: urn:nbn:de:0030-drops-43409

Volodymyr Mnih et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 ^{[9]} ^{[10]}

(2014). Coevolutionary Shaping for Reinforcement Learning. Ph.D. thesis, Poznań University of Technology, supervisor Krzysztof Krawiec, co-supervisor Wojciech Jaśkowski, pdf

## 2015 ...

Volodymyr Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, Vol. 518

(2015). Adaptive Playouts in Monte Carlo Tree Search with Policy Gradient Reinforcement Learning. Advances in Computer Games 14

(2015). Massively Parallel Methods for Deep Reinforcement Learning. arXiv:1507.04296

Matthew Lai (2015). Giraffe: Using Deep Reinforcement Learning to Play Chess. M.Sc. thesis, Imperial College London, arXiv:1509.01549v1 » Giraffe

Hado van Hasselt, Arthur Guez, David Silver (2015). Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461

(2016). Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581

David Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529 » AlphaGo

(2016). An Empirical Study on Applying Deep Reinforcement Learning to the Game 2048. CG 2016

Omid E. David, Nathan S. Netanyahu, Lior Wolf (2016). DeepChess: End-to-End Deep Neural Network for Automatic Learning in Chess. ICANN 2016, Lecture Notes in Computer Science, Vol. 9887, Springer, pdf preprint » DeepChess ^{[11]} ^{[12]}

(2016). Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783v2

(2016). Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. arXiv:1610.00633

(2016). Reinforcement Learning with Unsupervised Auxiliary Tasks. arXiv:1611.05397v1

(2016). Learning to reinforcement learn. arXiv:1611.05763

(2017). Deep Reinforcement Learning with Hidden Layers on Future States. Computer Games Workshop at IJCAI 2017, pdf

(2017). A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. arXiv:1711.00832

David Silver et al. (2017). Mastering the game of Go without human knowledge. Nature, Vol. 550, pdf ^{[13]}

(2017). Deep Reinforcement Learning that Matters. arXiv:1709.06560

David Silver et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815 » AlphaZero


## References

Chris Watkins, Peter Dayan (1992). Q-learning. Machine Learning, Vol. 8, No. 2

Volodymyr Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, Vol. 518

Gerald Tesauro (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3
