Temporal Difference Learning

**Temporal Difference Learning** (TD learning) is a machine learning method applied to multi-step prediction problems. As a prediction method primarily used for reinforcement learning, TD learning takes into account the fact that subsequent predictions are often correlated in some sense, while in supervised learning one learns only from actually observed values. TD learning combines ideas from [|Monte Carlo methods] with dynamic programming techniques. In the domain of computer games and computer chess, TD learning is applied through self-play, subsequently predicting the [|probability] of winning a game during the sequence of moves from the initial position until the end, to adjust weights for a more reliable prediction.

=Prediction=

Each prediction is a single number, derived from a formula using adjustable weights of features, for instance a neural network, most simply a single-neuron perceptron, i.e. a linear evaluation function ...

> math \displaystyle v = \sum_{f} {w_{f}*f(x)} math

... with the pawn advantage converted to a winning probability by the standard [|sigmoid squashing function], also a topic in logistic regression in the domain of supervised learning and automated tuning, ...

> math \displaystyle P = \frac{1}{1 + e^{-v}} math

... which has the advantage of its simple [|derivative]:

> math \displaystyle \frac {dP}{dv} = P(1-P) math

=TD(λ)=

Each pair of [|temporally] successive predictions P at time step t and t+1 gives rise to a recommendation for weight changes, to converge P_t towards P_{t+1}. This was first applied in the late 50s by Arthur Samuel in his Checkers player for automated evaluation tuning. The TD method was improved, generalized and formalized by Richard Sutton et al. in the 80s; the term //Temporal Difference Learning// was coined in 1988, also introducing the decay or recency parameter **λ**, which controls what proportion of the score comes from the outcome of [|Monte Carlo] simulated games, tapering between [|bootstrapping] (λ = 0) and Monte Carlo predictions (λ = 1), the latter equivalent to [|gradient descent] on the [|mean squared error function]. Weight adjustments in TD(λ) are made according to ...

> math \displaystyle \Delta w_{t} = \alpha \big( P_{t+1} - P_{t} \big) \sum_{k=1}^{t} {\lambda^{t-k} \nabla_w P_k} math

... where P is the series of temporally successive predictions and w the set of adjustable weights. α is a parameter controlling the learning rate, also called step-size, and ∇_w P_k is the [|gradient], the vector of [|partial derivatives] of P_k with respect to w. The process may be applied to any initial set of weights. Learning performance depends on λ and α, which have to be chosen appropriately for the domain. In principle, TD(λ) weight adjustments may be made after each move, or at any arbitrary interval. For game playing tasks the end of every game is a convenient point to actually alter the evaluation weights.
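As an illustration, the update rule can be sketched in a few lines of Python. The feature vectors, α and λ below are hypothetical placeholders; following the end-of-game convention mentioned above, weight changes are accumulated over the game and applied once at the end.

```python
import math

def sigmoid(v):
    """Squash an evaluation v (e.g. pawn advantage) to a winning probability P."""
    return 1.0 / (1.0 + math.exp(-v))

def td_lambda_update(weights, feature_history, alpha=0.1, lam=0.7):
    """One TD(lambda) pass over a finished game.

    feature_history holds one feature vector f(x_t) per position; the
    prediction is P_t = sigmoid(sum_f w_f * f_t), so by the chain rule
    grad_w P = P * (1 - P) * f.
    """
    trace = [0.0] * len(weights)        # sum_k lambda^(t-k) * grad_w P_k
    delta_w = [0.0] * len(weights)      # accumulated weight changes
    P_prev = grad_prev = None
    for f in feature_history:
        P = sigmoid(sum(w * x for w, x in zip(weights, f)))
        grad = [P * (1.0 - P) * x for x in f]
        if P_prev is not None:
            # decay the eligibility trace and fold in the previous gradient
            trace = [lam * e + g for e, g in zip(trace, grad_prev)]
            # Delta w_t = alpha * (P_{t+1} - P_t) * trace
            delta_w = [dw + alpha * (P - P_prev) * e
                       for dw, e in zip(delta_w, trace)]
        P_prev, grad_prev = P, grad
    return [w + dw for w, dw in zip(weights, delta_w)]
```

Note that with λ = 0 the trace reduces to the previous gradient alone (pure bootstrapping), while λ = 1 keeps every past gradient at full strength, the Monte Carlo end of the spectrum.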

TD(λ) was famously applied by Gerald Tesauro in his Backgammon program [|TD-Gammon]. In this stochastic game it picked the action whose successor state minimizes the opponent's expected reward, i.e. looking one ply ahead.

=TDLeaf(λ)=

In games like chess or Othello, deep searches are necessary for expert performance due to their tactical nature. This problem had already been recognized and solved by Arthur Samuel, but seemed to have been forgotten later on - it was rediscovered independently by Don Beal and Martin C. Smith in 1997, and by Jonathan Baxter, Andrew Tridgell, and Lex Weaver in 1998, who coined the term TD-Leaf. TD-Leaf is the adaptation of TD(λ) to minimax search, where instead of the root positions the leaf nodes of the principal variation are considered in the weight adjustments. TD-Leaf was successfully used in evaluation tuning of chess programs, with KnightCap and CilkChess as the most prominent examples; the latter used the improved **Temporal Coherence Learning**, which automatically adjusts α and λ.
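The core change in TD-Leaf can be sketched as a toy negamax that returns the leaf of the principal variation alongside the score, so that the gradient for each searched position can be taken at that leaf instead of at the root. The dict-based tree, position names and feature vectors below are hypothetical stand-ins for a real engine's search.

```python
import math

def pv_leaf(children, features, node, weights):
    """Toy negamax returning (score for the side to move, PV-leaf node).

    children: internal node -> list of child nodes
    features: leaf node -> feature vector of that position
    In TD-Leaf, the weight-adjustment gradient for the searched position is
    then taken at the returned leaf's features rather than at the root.
    """
    if node not in children:               # leaf: linear static evaluation
        score = sum(w * x for w, x in zip(weights, features[node]))
        return score, node
    best_score, best_leaf = -math.inf, None
    for child in children[node]:
        s, leaf = pv_leaf(children, features, child, weights)
        if -s > best_score:                # negamax sign flip
            best_score, best_leaf = -s, leaf
    return best_score, best_leaf

# tiny two-ply example: the root player moves, the opponent replies
children = {'root': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1']}
features = {'a1': [1.0], 'a2': [3.0], 'b1': [2.0]}
```

With weights `[1.0]` the opponent holds the score at `a` down to 1.0, so the principal variation runs through `b` and the PV leaf is `b1`; its feature vector is what enters the TD(λ) gradient.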

=Quotes=

Don Beal in a 1998 CCC discussion with Jonathan Baxter:

Bas Hamstra in a 2002 CCC discussion on TD learning:

Don Dailey in a reply to Ben-Hur Carlos Vieira Langoni Junior, CCC, December 2010:

=Chess Programs=

 * CilkChess
 * EXchess
 * FUSc#
 * Giraffe
 * Green Light Chess
 * KnightCap
 * Meep
 * Morph
 * NeuroChess
 * SAL
 * Tao
 * TDChess

=See also=

 * Automated Tuning
 * Backgammon
 * Deep Learning
 * Evaluation
 * Neural Networks
 * RootStrap
 * TreeStrap

=Publications=

1959

 * Arthur Samuel (**1959**). //[|Some Studies in Machine Learning Using the Game of Checkers]//. IBM Journal of Research and Development, Vol. 3, No. 3

1970 ...

 * A. Harry Klopf (**1972**). //Brain Function and Adaptive Systems - A Heterostatic Theory//. [|Air Force Cambridge Research Laboratories], Special Reports, No. 133, [|pdf]
 * John H. Holland (**1975**). //Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence//. [|amazon.com]

1980 ...

 * Richard Sutton (**1984**). //[|Temporal Credit Assignment in Reinforcement Learning]//. Ph.D. dissertation, [|University of Massachusetts]
 * Jens Christensen (**1986**). //[|Learning Static Evaluation Functions by Linear Regression]//. in Tom Mitchell, Jaime Carbonell, Ryszard Michalski (**1986**). //[|Machine Learning: A Guide to Current Research]//. The Kluwer International Series in Engineering and Computer Science, Vol. 12
 * Richard Sutton (**1988**). //Learning to Predict by the Methods of Temporal Differences//. [|Machine Learning], Vol. 3, No. 1, [|pdf]

1990 ...

 * Richard Sutton, Andrew Barto (**1990**). //Time-Derivative Models of Pavlovian Reinforcement//. in [|Michael Gabriel], [|John Moore] (eds.) (**1990**). //Learning and Computational Neuroscience: Foundations of Adaptive Networks//. [|MIT Press], [|pdf]
 * [|Richard C. Yee], [|Sharad Saxena], Paul E. Utgoff, Andrew Barto (**1990**). //Explaining Temporal Differences to Create Useful Concepts for Evaluating States//. [|AAAI 1990], [|pdf]
 * Peter Dayan (**1990**). //Navigating Through Temporal Difference//. [|NIPS 1990], [|pdf]
 * Gerald Tesauro (**1992**). //Temporal Difference Learning of Backgammon Strategy//. [|ML 1992]
 * Peter Dayan (**1992**). //[|The convergence of TD (λ) for general λ]//. [|Machine Learning], Vol. 8, No. 3
 * Gerald Tesauro (**1992**). //[|Practical Issues in Temporal Difference Learning]//. [|Machine Learning], Vol. 8, Nos. 3-4
 * Michael Gherrity (**1993**). //A Game Learning Machine//. Ph.D. thesis, [|University of California, San Diego], advisor Paul Kube, [|pdf], [|pdf]
 * Peter Dayan (**1993**). //Improving generalisation for temporal difference learning: The successor representation//. [|Neural Computation], Vol. 5, [|pdf]
 * Nicol N. Schraudolph, Peter Dayan, Terrence J. Sejnowski (**1994**). //[|Temporal Difference Learning of Position Evaluation in the Game of Go]//. [|Advances in Neural Information Processing Systems 6]
 * Peter Dayan, Terrence J. Sejnowski (**1994**). //TD(λ) converges with Probability 1//. [|Machine Learning], Vol. 14, No. 1, [|pdf]

1995 ...

 * Anton Leouski (**1995**). //Learning of Position Evaluation in the Game of Othello//. Master's Project, [|University of Massachusetts], [|Amherst, Massachusetts], [|pdf]
 * Gerald Tesauro (**1995**). //Temporal Difference Learning and TD-Gammon//. Communications of the ACM, Vol. 38, No. 3
 * Sebastian Thrun (**1995**). //[|Learning to Play the Game of Chess]//. in Gerald Tesauro, [|David S. Touretzky], [|Todd K. Leen] (eds.) Advances in Neural Information Processing Systems 7, [|MIT Press]

1996

 * Robert Schapire, Manfred K. Warmuth (**1996**). //On the Worst-Case Analysis of Temporal-Difference Learning Algorithms//. [|Machine Learning], Vol. 22, Nos. 1-3, [|pdf]
 * Johannes Fürnkranz (**1996**). //Machine Learning in Computer Chess: The Next Generation//. ICCA Journal, Vol. 19, No. 3, [|zipped ps]
 * Steven Bradtke, Andrew Barto (**1996**). //Linear Least-Squares Algorithms for Temporal Difference Learning//. [|Machine Learning], Vol. 22, Nos. 1-3, [|pdf]

1997

 * John N. Tsitsiklis, Benjamin Van Roy (**1997**). //[|An Analysis of Temporal Difference Learning with Function Approximation]//. IEEE Transactions on Automatic Control, Vol. 42, No. 5
 * Don Beal, Martin C. Smith (**1997**). //Learning Piece Values Using Temporal Differences//. ICCA Journal, Vol. 20, No. 3

1998

 * Don Beal, Martin C. Smith (**1998**). //[|First Results from Using Temporal Difference Learning in Shogi]//. CG 1998
 * Jonathan Baxter, Andrew Tridgell, Lex Weaver (**1998**). //TDLeaf(lambda): Combining Temporal Difference Learning with Game-Tree Search//. [|Australian Journal of Intelligent Information Processing Systems], Vol. 5, No. 1, [|arXiv:cs/9901001]
 * Jonathan Baxter, Andrew Tridgell, Lex Weaver (**1998**). //Experiments in Parameter Learning Using Temporal Differences//. ICCA Journal, Vol. 21, No. 2, [|pdf]
 * Jonathan Baxter, Andrew Tridgell, Lex Weaver (**1998**). //KnightCap: A Chess Program that Learns by Combining TD(λ) with Game-Tree Search//. Proceedings of the 15th International Conference on Machine Learning, [|pdf] via [|citeseerX]
 * Richard Sutton, Andrew Barto (**1998**). //[|Reinforcement Learning: An Introduction]//. [|MIT Press], [|6. Temporal-Difference Learning]
 * Justin A. Boyan (**1998**). //Least-Squares Temporal Difference Learning//. Carnegie Mellon University, CMU-CS-98-152, [|pdf]

1999

 * Don Beal, Martin C. Smith (**1999**). //[|Temporal Coherence and Prediction Decay in TD Learning]//. IJCAI 1999, [|pdf]
 * Don Beal, Martin C. Smith (**1999**). //Learning Piece-Square Values using Temporal Differences//. ICCA Journal, Vol. 22, No. 4

2000 ...

 * Sebastian Thrun, Michael L. Littman (**2000**). //A Review of Reinforcement Learning//. [|AI Magazine, Vol. 21], No. 1, [|pdf]
 * Robert Levinson, Ryan Weber (**2000**). //[|Chess Neighborhoods, Function Combination, and Reinforcement Learning]//. CG 2000, [|pdf] » Morph
 * Jonathan Baxter, Andrew Tridgell, Lex Weaver (**2000**). //Learning to Play Chess Using Temporal Differences//. [|Machine Learning, Vol 40, No. 3], [|pdf]
 * Johannes Fürnkranz (**2000**). //Machine Learning in Games: A Survey//. [|Austrian Research Institute for Artificial Intelligence], OEFAI-TR-2000-3, [|pdf]

2001

 * Jonathan Schaeffer, Markian Hlynka, Vili Jussila (**2001**). //Temporal Difference Learning Applied to a High-Performance Game-Playing Program//. [|IJCAI 2001]
 * Don Beal, Martin C. Smith (**2001**). //[|Temporal difference learning applied to game playing and the results of application to Shogi]//. [|Theoretical Computer Science], Vol. 252, Nos. 1-2
 * Nicol N. Schraudolph, Peter Dayan, Terrence J. Sejnowski (**2001**). //[|Learning to Evaluate Go Positions via Temporal Difference Methods]//. in  Norio Baba, Lakhmi C. Jain (eds.) (**2001**). //[|Computational Intelligence in Games, Studies in Fuzziness and Soft Computing]//. [|Physica-Verlag]

2002

 * Ari Shapiro, Gil Fuchs, Robert Levinson (**2002**). //[|Learning a Game Strategy Using Pattern-Weights and Self-play]//. CG 2002, [|pdf]
 * Mark Winands, Levente Kocsis, Jos Uiterwijk, Jaap van den Herik (**2002**). //Temporal difference learning and the Neural MoveMap heuristic in the game of Lines of Action//. GAME-ON 2002 » Neural MoveMap Heuristic
 * James Swafford (**2002**). //Optimizing Parameter Learning using Temporal Differences//. [|AAAI-02], Student Abstracts, [|pdf]

2003

 * Henk Mannen (**2003**). //Learning to play chess using reinforcement learning with database games//. Master's thesis, [|Cognitive Artificial Intelligence], [|Utrecht University]

2004

 * Henk Mannen, Marco Wiering (**2004**). //[|Learning to play chess using TD(λ)-learning with database games]//. [|Cognitive Artificial Intelligence], [|Utrecht University], Benelearn'04
 * Marco Block (**2004**). //Verwendung von Temporale-Differenz-Methoden im Schachmotor FUSc#//. Diplomarbeit, Betreuer: Raúl Rojas, Free University of Berlin, [|pdf] (German)
 * Jacek Mańdziuk, Daniel Osman (**2004**). //Temporal Difference Approach to Playing Give-Away Checkers//. [|ICAISC 2004], [|pdf]

2005 ...

 * Marco Wiering, [|Jan Peter Patist], Henk Mannen (**2005**). //Learning to Play Board Games using Temporal Difference Methods//. Technical Report, [|Utrecht University], UU-CS-2005-048, [|pdf]

2006

 * Simon Lucas, Thomas Philip Runarsson (**2006**). //[|Temporal Difference Learning versus Co-Evolution for Acquiring Othello Position Evaluation]//. IEEE Symposium on Computational Intelligence and Games

2007

 * Edward P. Manning (**2007**). //[|Temporal Difference Learning of an Othello Evaluation Function for a Small Neural Network with Shared Weights]//. IEEE Symposium on Computational Intelligence and AI in Games
 * Daniel Osman (**2007**). //Temporal Difference Methods for Two-player Board Games//. Ph.D. thesis, Faculty of Mathematics and Information Science, [|Warsaw University of Technology]
 * Yasuhiro Osaki, Kazutomo Shibahara, Yasuhiro Tajima, Yoshiyuki Kotani (**2007**). //Reinforcement Learning of Evaluation Functions Using Temporal Difference-Monte Carlo learning method//. 12th Game Programming Workshop

2008

 * Yasuhiro Osaki, Kazutomo Shibahara, Yasuhiro Tajima, Yoshiyuki Kotani (**2008**). //An Othello Evaluation Function Based on Temporal Difference Learning using Probability of Winning//. [|CIG'08], [|pdf]
 * Richard Sutton, Csaba Szepesvári, Hamid Reza Maei (**2008**). //A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation//. [|pdf] (draft)
 * Sacha Droste, Johannes Fürnkranz (**2008**). //Learning of Piece Values for Chess Variants//. Technical Report TUD–KE–2008-07, Knowledge Engineering Group, TU Darmstadt, [|pdf]
 * Sacha Droste, Johannes Fürnkranz (**2008**). //Learning the Piece Values for three Chess Variants//. ICGA Journal, Vol. 31, No. 4
 * Albrecht Fiebiger (**2008**). //Einsatz von allgemeinen Evaluierungsheuristiken in Verbindung mit der Reinforcement-Learning-Strategie in der Schachprogrammierung//. [|Besondere Lernleistung] im [|Fachbereich] [|Informatik], [|Sächsisches Landesgymnasium Sankt Afra], Internal advisor: Ralf Böttcher, External advisors: Stefan Meyer-Kahlen, Marco Block, [|pdf] (German)

2009

 * Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, Doina Precup, David Silver, Richard Sutton (**2009**). //Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation//. Advances in Neural Information Processing Systems 22, [|pdf]
 * Richard Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, Eric Wiewiora (**2009**). //Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation//. [|ICML 2009], [|pdf]
 * Joel Veness, David Silver, William Uther, Alan Blair (**2009**). //[|Bootstrapping from Game Tree Search]//. [|pdf]
 * Marcin Szubert, Wojciech Jaśkowski, Krzysztof Krawiec (**2009**). //Coevolutionary Temporal Difference Learning for Othello//. IEEE Symposium on Computational Intelligence and Games, [|pdf]
 * [|J. Zico Kolter], Andrew Ng (**2009**). //Regularization and Feature Selection in Least-Squares Temporal Difference Learning//. [|ICML 2009], [|pdf]

2010 ...

 * Marco Wiering (**2010**). //[|Self-play and using an expert to learn to play backgammon with temporal difference learning]//. [|Journal of Intelligent Learning Systems and Applications], Vol. 2, No. 2
 * Hamid Reza Maei, Richard Sutton (**2010**). //[|GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces]//. In Proceedings of the Third Conference on Artificial General Intelligence

2011

 * Hamid Reza Maei (**2011**). //Gradient Temporal-Difference Learning Algorithms//. Ph.D. thesis, University of Alberta, advisor Richard Sutton, [|pdf]
 * Joel Veness (**2011**). //Approximate Universal Artificial Intelligence and Self-Play Learning for Games//. Ph.D. thesis, [|University of New South Wales], supervisors: Kee Siong Ng, Marcus Hutter, Alan Blair, William Uther, John Lloyd; [|pdf]
 * I-Chen Wu, Hsin-Ti Tsai, Hung-Hsuan Lin, Yi-Shan Lin, Chieh-Min Chang, Ping-Hung Lin (**2011**). //[|Temporal Difference Learning for Connect6]//. Advances in Computer Games 13
 * Nikolaos Papahristou, Ioannis Refanidis (**2011**). //[|Improving Temporal Difference Performance in Backgammon Variants]//. Advances in Computer Games 13, [|pdf]
 * Krzysztof Krawiec, Wojciech Jaśkowski, Marcin Szubert (**2011**). //[|Evolving small-board Go players using Coevolutionary Temporal Difference Learning with Archives]//. [|Applied Mathematics and Computer Science], Vol. 21, No. 4
 * Marcin Szubert, Wojciech Jaśkowski, Krzysztof Krawiec (**2011**). //Learning Board Evaluation Function for Othello by Hybridizing Coevolution with Temporal Difference Learning//. [|Control and Cybernetics], Vol. 40, No. 3, [|pdf]

2012

 * István Szita (**2012**). //[|Reinforcement Learning in Games]//. in Marco Wiering, [|Martijn Van Otterlo] (eds.). //[|Reinforcement learning: State-of-the-art]//. [|Adaptation, Learning, and Optimization, Vol. 12], [|Springer]

2013

 * David Silver, Richard Sutton, Martin Mueller (**2013**). //Temporal-Difference Search in Computer Go//. Proceedings of the [|ICAPS-13 Workshop on Planning and Learning], [|pdf]
 * Florian Kunz (**2013**). //An Introduction to Temporal Difference Learning//. Seminar on Autonomous Learning Systems, TU Darmstadt, [|pdf]

2014

 * I-Chen Wu, Kun-Hao Yeh, Chao-Chin Liang, Chia-Chuan Chang, Han Chiang (**2014**). //Multi-Stage Temporal Difference Learning for 2048//. TAAI 2014
 * Wojciech Jaśkowski, Marcin Szubert, Paweł Liskowski (**2014**). //Multi-Criteria Comparison of Coevolution and Temporal Difference Learning on Othello//. [|EvoApplications 2014], [|Springer, volume 8602]

2015 ...

 * James L. McClelland (**2015**). //[|Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises]//. Second Edition, [|Contents], [|Temporal-Difference Learning]
 * Matthew Lai (**2015**). //Giraffe: Using Deep Reinforcement Learning to Play Chess//. M.Sc. thesis, [|Imperial College London], [|arXiv:1509.01549v1] » Giraffe
 * Kazuto Oka, Kiminori Matsuzaki (**2016**). //Systematic Selection of N-tuple Networks for 2048//. CG 2016
 * Huizhen Yu, A. Rupam Mahmood, Richard Sutton (**2017**). //On Generalized Bellman Equations and Temporal-Difference Learning//. Canadian Conference on AI 2017, [|arXiv:1704.04463]

=Forum Posts=

1995 ...
 * [|Parameter Tuning] by Jonathan Baxter, CCC, October 01, 1998 » KnightCap
   * [|Re: Parameter Tuning] by Don Beal, CCC, October 02, 1998

2000 ...
 * [|any good experiences with genetic algos or temporal difference learning?] by Rafael B. Andrist, CCC, January 01, 2001
 * [|Temporal Difference] by Bas Hamstra, CCC, January 05, 2001
 * [|Tao update] by Bas Hamstra, CCC, January 12, 2001 » Tao
 * [|Re: Parameter Learning Using Temporal Differences !] by Aaron Tay, CCC, March 19, 2002
 * [|Hello from Edmonton (and on Temporal Differences)] by James Swafford, CCC, July 30, 2002
 * [|Temporal Differences] by Stuart Cracraft, CCC, November 03, 2004
   * [|Re: Temporal Differences] by Guy Haworth, CCC, November 04, 2004
 * [|Temporal Differences] by Peter Fendrich, CCC, December 21, 2004
 * [|Chess program improvement project (copy at TalkChess/ICD)] by Stuart Cracraft, Winboard Forum, March 07, 2006 » Win at Chess

2010 ...
 * [|Positional learning] by Ben-Hur Carlos Vieira Langoni Junior, CCC, December 13, 2010
   * [|Re: Positional learning] by Don Dailey, CCC, December 13, 2010
 * [|Pawn Advantage, Win Percentage, and Elo] by Adam Hair, CCC, April 15, 2012
   * [|Re: Pawn Advantage, Win Percentage, and Elo] by Don Dailey, CCC, April 15, 2012

2015 ...

 * [|*First release* Giraffe, a new engine based on deep learning] by Matthew Lai, CCC, July 08, 2015 » Deep Learning, Giraffe
 * [|td-leaf] by Alexandru Mosoi, CCC, October 06, 2015 » Automated Tuning
 * [|TD-leaf(lambda)] by Robert Pope, CCC, November 09, 2016

=External Links=

 * [|temporal - Wiktionary]
 * [|Temporal difference learning from Wikipedia]
 * [|Reinforcement learning - Temporal difference methods from Wikipedia]
 * [|Temporal difference learning - Scholarpedia]
 * [|6. Temporal-Difference Learning] in Richard Sutton, Andrew Barto (**1998**). //[|Reinforcement Learning: An Introduction]//. [|MIT Press] eBook
 * [|Temporal-Difference Learning] (Chapter 9) in James L. McClelland (**2015**). //[|Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises]//. Second Edition, [|Contents]
 * [|Temporal-Difference learning], Slides as pdf from Chemnitz University of Technology
 * [|University of Alberta Dictionary of Cognitive Science: Credit Assignment Problem]
 * Shawn Lane, Jonas Hellborg, Jeff Sipe - [|Temporal Analogues of Paradise - 2nd Movement], [|Atlanta] Drums & Percussion, August 20, 1996, [|YouTube] Video

=References=