Temporal Difference Learning (TD learning)
is a machine learning method for multi-step prediction problems. As a prediction method primarily used in reinforcement learning, TD learning exploits the fact that successive predictions are often correlated, whereas in supervised learning one learns only from actually observed values. TD learning combines ideas from Monte Carlo methods and dynamic programming ^{[1]}. In computer games and computer chess, TD learning is applied through self play: the probability of winning is predicted successively during the sequence of moves from the initial position until the end of the game, and the weights are then adjusted to make those predictions more reliable.

## Prediction

Each prediction is a single number, derived from a formula using adjustable weights of features, for instance by a neural network, most simply a single-neuron perceptron, that is a linear evaluation function ...

... with the pawn advantage converted to a winning probability by the standard sigmoid squashing function, a topic also found in logistic regression in the domain of supervised learning and automated tuning, ...

... which has the advantage of its simple derivative:
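As an illustrative sketch (the feature encoding and names are hypothetical), such a single-neuron predictor and its simple derivative can be written as:

```python
import math

def sigmoid(x):
    # squash a pawn-unit score into a winning probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, features):
    # linear evaluation: dot product of weights and feature values
    score = sum(w * f for w, f in zip(weights, features))
    return sigmoid(score)

def gradient(weights, features):
    # d/dw_i sigmoid(w . f) = P * (1 - P) * f_i -- the "simple derivative"
    p = predict(weights, features)
    return [p * (1.0 - p) * f for f in features]
```

A zero score maps to a 50% winning probability, and the gradient of each weight is just the prediction's slope times that weight's feature value.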

## TD(λ)

Each pair of temporally successive predictions P at time steps t and t+1 gives rise to a recommendation for weight changes that move Pt toward Pt+1. This was first applied in the late 1950s by Arthur Samuel in his checkers player for automated evaluation tuning ^{[2]}. The TD method was improved, generalized and formalized by Richard Sutton et al. in the 1980s; the term Temporal Difference Learning was coined in 1988 ^{[3]}, which also introduced the decay or recency parameter λ, controlling what proportion of the score comes from the outcome of Monte Carlo simulated games, tapering between pure bootstrapping (λ = 0) and Monte Carlo prediction (λ = 1), the latter equivalent to gradient descent on the mean squared error function. Weight adjustments in TD(λ) are made according to ...

... where P is the series of temporally successive predictions and w the set of adjustable weights. α is a parameter controlling the learning rate, also called the step size, and ∇wPk ^{[4]} is the gradient, the vector of partial derivatives of Pk with respect to w. The process may be applied to any initial set of weights. Learning performance depends on λ and α, which have to be chosen appropriately for the domain. In principle, TD(λ) weight adjustments may be made after each move, or at any arbitrary interval; for game-playing tasks the end of every game is a convenient point to actually alter the evaluation weights ^{[5]}.
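The elided update rule can be reconstructed from Sutton's standard 1988 formulation ^{[3]}, using exactly the symbols defined here (P, w, α, λ):

```latex
\Delta w_t = \alpha \, (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{\,t-k} \, \nabla_w P_k
```

Each temporal difference P_{t+1} − P_t is credited backwards to earlier predictions, with credit decaying geometrically in λ.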

TD(λ) was famously applied by Gerald Tesauro in his Backgammon program TD-Gammon ^{[6]} ^{[7]}. Backgammon being a stochastic game, TD-Gammon picked the action whose successor state minimizes the opponent's expected reward, i.e. it looked one ply ahead.
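As a concrete, hypothetical sketch, the update rule can be applied once per finished game to a linear evaluator (feature vectors and parameter values here are invented; with λ = 0 only the immediately preceding prediction is credited, with λ = 1 the update amounts to gradient descent toward the outcome):

```python
def td_lambda_update(weights, feature_seq, outcome, alpha=0.01, lam=0.7):
    """One TD(lambda) pass over a finished game.

    feature_seq: feature vectors of the positions at t = 1..N
    outcome: final game result, used as the terminal prediction
    Linear predictions P_t = w . x_t, so grad_w P_t = x_t.
    """
    n = len(feature_seq)
    # predictions are computed with the old weights (end-of-game update)
    preds = [sum(w * f for w, f in zip(weights, x)) for x in feature_seq]
    preds.append(outcome)  # the terminal "prediction" is the actual result
    new_w = list(weights)
    for t in range(n):
        delta = preds[t + 1] - preds[t]  # temporal difference P_{t+1} - P_t
        for i in range(len(weights)):
            # eligibility-weighted sum of past gradients: sum_k lambda^(t-k) x_k[i]
            trace = sum((lam ** (t - k)) * feature_seq[k][i] for k in range(t + 1))
            new_w[i] += alpha * delta * trace
    return new_w
```

Updating only at game end matches the convention mentioned above that the end of every game is a convenient point to alter the evaluation weights.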

## TDLeaf(λ)

In games like chess or Othello, deep searches are necessary for expert performance due to their tactical nature. This problem had already been recognized and solved by Arthur Samuel, but seemed to have been forgotten later on ^{[8]}; it was rediscovered independently by Don Beal and Martin C. Smith in 1997 ^{[9]}, and by Jonathan Baxter, Andrew Tridgell, and Lex Weaver in 1998 ^{[10]}, who coined the term TD-Leaf. TD-Leaf is the adaptation of TD(λ) to minimax search: instead of the positions at the root, the leaf nodes of the principal variation are considered in the weight adjustments. TD-Leaf was successfully used in evaluation tuning of chess programs ^{[11]}, with KnightCap ^{[12]} and CilkChess as the most prominent examples; the latter used the improved Temporal Coherence Learning ^{[13]}, which automatically adjusts α and λ ^{[14]}.
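A hedged sketch of that variant: it is identical to a plain TD(λ) pass except that each root position is replaced by the leaf of its principal variation before predictions and gradients are computed. The `search` interface below is hypothetical, standing in for a real minimax search that returns the PV-leaf's feature vector:

```python
def td_leaf_update(weights, root_positions, outcome, search,
                   alpha=0.01, lam=0.7):
    """TD-Leaf(lambda): TD(lambda) where gradients are taken at the
    principal-variation leaves instead of at the root positions.

    search(position) -> feature vector of the PV leaf (hypothetical).
    """
    leaf_features = [search(pos) for pos in root_positions]
    preds = [sum(w * f for w, f in zip(weights, x)) for x in leaf_features]
    preds.append(outcome)  # terminal prediction is the game result
    new_w = list(weights)
    for t in range(len(leaf_features)):
        delta = preds[t + 1] - preds[t]
        for i in range(len(weights)):
            trace = sum((lam ** (t - k)) * leaf_features[k][i]
                        for k in range(t + 1))
            new_w[i] += alpha * delta * trace
    return new_w
```

The only moving part relative to TD(λ) is the `leaf_features` substitution, which is exactly the change TD-Leaf introduces.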

## Quotes

## Don Beal

Don Beal in a 1998 CCC discussion with Jonathan Baxter ^{[15]}:

With fixed learning rates (aka step size) we found piece values settle to consistent relative ordering in around 500 self-play games. The ordering remains in place despite considerable erratic movements. But piece-square values can take a lot longer - more like 5000.

The learning rate is critical - it has to be as large as one dares for fast learning, but low for stable values. We've been experimenting with methods for automatically adjusting the learning rate. (Higher rates if the adjustments go in the same direction, lower if they keep changing direction.)

The other problem is learning weights for terms which only occur rarely. Then the learning process doesn't see enough examples to settle on good weights in a reasonable time. I suspect this is the main limitation of the method, but it may be possible to devise ways to generate extra games which exercise the rare conditions.
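The automatic rate adjustment Beal describes (raise the rate while successive adjustments agree in direction, lower it when they alternate) might be sketched per weight as follows; the growth and shrink factors and bounds are invented for illustration:

```python
def adapt_rate(rate, prev_delta, delta, grow=1.1, shrink=0.5,
               lo=1e-4, hi=0.1):
    # Raise the per-weight learning rate while successive weight
    # adjustments keep the same sign; cut it when the sign flips.
    if prev_delta * delta > 0:      # same direction: speed up
        rate = min(rate * grow, hi)
    elif prev_delta * delta < 0:    # direction flipped: slow down
        rate = max(rate * shrink, lo)
    return rate                     # zero delta: leave the rate unchanged
```

Called after each weight adjustment, this keeps the rate "as large as one dares" while adjustments are consistent, and damps it once a weight starts oscillating.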

## Bas Hamstra

Bas Hamstra in a 2002 CCC discussion on TD learning ^{[16]}:

I have played with it. I am convinced it has possibilities, but one problem I encountered was the cause-effect problem. For say I am a piece down. After I lost the game TD will conclude that the winner had better mobility and will tune it up. However worse mobility was not the cause of the loss, it was the effect of simply being a piece down. In my case it kept tuning mobility up and up until ridiculous values.

## Don Dailey

Don Dailey in a reply ^{[17]} to Ben-Hur Carlos Vieira Langoni Junior, CCC, December 2010 ^{[18]}:

Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, it indicates that the evaluation of move 35 is in error, the position is better than the program thought it was. Little tiny incremental adjustments are made to the evaluation function so that it is ever so slightly biased in favor of being slightly more positive in this case, or slightly more negative in the case where you find your score is dropping. This is done recursively back through the moves of the game so that winning the game gives some credit to all the positions of the game. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done at the end of the game one time and scores then updated. If you are not using floating point evaluation you may have to figure out how to modify this to be workable.
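Dailey's recursive end-of-game pass, nudging every position's score toward its successor's updated score so the result propagates back through the whole game, can be sketched like this (scores in pawn units, step size hypothetical):

```python
def backward_credit(scores, result, step=0.05):
    # Walk the game backwards, nudging each position's evaluation a
    # little toward the following position's (already updated) value,
    # so the final result assigns credit to every earlier position.
    targets = list(scores)
    nxt = result
    for t in range(len(scores) - 1, -1, -1):
        targets[t] = scores[t] + step * (nxt - scores[t])
        nxt = targets[t]
    return targets
```

Earlier positions receive geometrically less of the final result, which matches the "little tiny incremental adjustments" in the quote.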


## Chess Programs

RootStrap

TreeStrap ^{[19]}

## Publications

## 1959

Arthur Samuel (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal, July 1959

## 1970 ...

(1972). Brain Function and Adaptive Systems - A Heterostatic Theory. Air Force Cambridge Research Laboratories, Special Reports, No. 133, pdf

John H. Holland (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. amazon.com

## 1980 ...

Richard Sutton (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. dissertation, University of Massachusetts

(1986). Learning Static Evaluation Functions by Linear Regression. in Tom Mitchell, Jaime Carbonell, Ryszard Michalski (eds.) (1986). Machine Learning: A Guide to Current Research. The Kluwer International Series in Engineering and Computer Science, Vol. 12

Richard Sutton (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, Vol. 3, No. 1, pdf

## 1990 ...

(1990). Time-Derivative Models of Pavlovian Reinforcement. in Michael Gabriel, John Moore (eds.) (1990). Learning and Computational Neuroscience: Foundations of Adaptive Networks. MIT Press, pdf

(1990). Explaining Temporal Differences to Create Useful Concepts for Evaluating States. AAAI 1990, pdf

(1990). Navigating Through Temporal Difference. NIPS 1990, pdf

Gerald Tesauro (1992). Temporal Difference Learning of Backgammon Strategy. ML 1992

(1992). The convergence of TD(λ) for general λ. Machine Learning, Vol. 8, No. 3

Gerald Tesauro (1992). Practical Issues in Temporal Difference Learning. Machine Learning, Vol. 8, Nos. 3-4

(1993). A Game Learning Machine. Ph.D. thesis, University of California, San Diego, advisor Paul Kube, pdf

(1993). Improving generalisation for temporal difference learning: The successor representation. Neural Computation, Vol. 5, pdf

(1994). Temporal Difference Learning of Position Evaluation in the Game of Go. Advances in Neural Information Processing Systems 6

(1994). TD(λ) converges with Probability 1. Machine Learning, Vol. 14, No. 1, pdf

## 1995 ...

(1995). Learning of Position Evaluation in the Game of Othello. Master's Project, University of Massachusetts, Amherst, Massachusetts, pdf

Gerald Tesauro (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3

(1995). Learning to Play the Game of Chess. in Gerald Tesauro, David S. Touretzky, Todd K. Leen (eds.) Advances in Neural Information Processing Systems 7, MIT Press

(1996). On the Worst-Case Analysis of Temporal-Difference Learning Algorithms. Machine Learning, Vol. 22, Nos. 1-3, pdf

(1996). Machine Learning in Computer Chess: The Next Generation. ICCA Journal, Vol. 19, No. 3, zipped ps

(1996). Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, Vol. 22, Nos. 1/2/3, pdf

(1997). An Analysis of Temporal Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, Vol. 42, No. 5

Don Beal, Martin C. Smith (1997). Learning Piece Values Using Temporal Differences. ICCA Journal, Vol. 20, No. 3

Don Beal, Martin C. Smith (1998). First Results from Using Temporal Difference Learning in Shogi. CG 1998

Jonathan Baxter, Andrew Tridgell, Lex Weaver (1998). TDLeaf(lambda): Combining Temporal Difference Learning with Game-Tree Search. Australian Journal of Intelligent Information Processing Systems, Vol. 5, No. 1, arXiv:cs/9901001

Don Beal, Martin C. Smith (1998). Experiments in Parameter Learning Using Temporal Differences. ICCA Journal, Vol. 21, No. 2, pdf

Jonathan Baxter, Andrew Tridgell, Lex Weaver (1998). Knightcap: A chess program that learns by combining td(λ) with game-tree search. Proceedings of the 15th International Conference on Machine Learning, pdf via citeseerX

Richard Sutton, Andrew Barto (1998). Reinforcement Learning: An Introduction. MIT Press, 6. Temporal-Difference Learning

(1998). Least-Squares Temporal Difference Learning. Carnegie Mellon University, CMU-CS-98-152, pdf

Don Beal, Martin C. Smith (1999). Temporal Coherence and Prediction Decay in TD Learning. IJCAI 1999, pdf

Don Beal, Martin C. Smith (1999). Learning Piece-Square Values using Temporal Differences. ICCA Journal, Vol. 22, No. 4

## 2000 ...

(2000). A Review of Reinforcement Learning. AI Magazine, Vol. 21, No. 1, pdf

(2000). Chess Neighborhoods, Function Combination, and Reinforcement Learning. CG 2000, pdf » Morph

(2000). Learning to Play Chess Using Temporal Differences. Machine Learning, Vol. 40, No. 3, pdf

(2000). Machine Learning in Games: A Survey. Austrian Research Institute for Artificial Intelligence, OEFAI-TR-2000-3, pdf

(2001). Temporal Difference Learning Applied to a High-Performance Game-Playing Program. IJCAI 2001

(2001). Temporal difference learning applied to game playing and the results of application to Shogi. Theoretical Computer Science, Vol. 252, Nos. 1-2

(2001). Learning to Evaluate Go Positions via Temporal Difference Methods. in Norio Baba, Lakhmi C. Jain (eds.) (2001). Computational Intelligence in Games, Studies in Fuzziness and Soft Computing. Physica-Verlag

(2002). Learning a Game Strategy Using Pattern-Weights and Self-play. CG 2002, pdf

(2002). Temporal difference learning and the Neural MoveMap heuristic in the game of Lines of Action. GAME-ON 2002 » Neural MoveMap Heuristic

(2002). Optimizing Parameter Learning using Temporal Differences. AAAI-02, Student Abstracts, pdf

(2003). Learning to play chess using reinforcement learning with database games. Master's thesis, Cognitive Artificial Intelligence, Utrecht University

(2004). Learning to play chess using TD(λ)-learning with database games. Cognitive Artificial Intelligence, Utrecht University, Benelearn'04

(2004). Verwendung von Temporale-Differenz-Methoden im Schachmotor FUSc#. Diploma thesis, advisor: Raúl Rojas, Free University of Berlin, pdf (German)

(2004). Temporal Difference Approach to Playing Give-Away Checkers. ICAISC 2004, pdf

## 2005 ...

(2005). Learning to Play Board Games using Temporal Difference Methods. Technical Report, Utrecht University, UU-CS-2005-048, pdf

(2006). Temporal Difference Learning versus Co-Evolution for Acquiring Othello Position Evaluation. IEEE Symposium on Computational Intelligence and Games

(2007). Temporal Difference Learning of an Othello Evaluation Function for a Small Neural Network with Shared Weights. IEEE Symposium on Computational Intelligence and AI in Games

(2007). Temporal Difference Methods for Two-player Board Games. Ph.D. thesis, Faculty of Mathematics and Information Science, Warsaw University of Technology

(2007). Reinforcement Learning of Evaluation Functions Using Temporal Difference-Monte Carlo learning method. 12th Game Programming Workshop

(2008). An Othello Evaluation Function Based on Temporal Difference Learning using Probability of Winning. CIG'08, pdf

(2008). A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation. pdf (draft)

(2008). Learning of Piece Values for Chess Variants. Technical Report TUD–KE–2008-07, Knowledge Engineering Group, TU Darmstadt, pdf

(2008). Learning the Piece Values for three Chess Variants. ICGA Journal, Vol. 31, No. 4

(2008). Einsatz von allgemeinen Evaluierungsheuristiken in Verbindung mit der Reinforcement-Learning-Strategie in der Schachprogrammierung. Student research project in computer science, Sächsisches Landesgymnasium Sankt Afra, internal advisor: Ralf Böttcher, external advisors: Stefan Meyer-Kahlen, Marco Block, pdf (German)

(2009). Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. Advances in Neural Information Processing Systems 22, MIT Press, pdf

(2009). Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. ICML 2009, pdf

(2009). Bootstrapping from Game Tree Search. pdf

(2009). Coevolutionary Temporal Difference Learning for Othello. IEEE Symposium on Computational Intelligence and Games, pdf

(2009). Regularization and Feature Selection in Least-Squares Temporal Difference Learning. ICML 2009, pdf

## 2010 ...

(2010). Self-play and using an expert to learn to play backgammon with temporal difference learning. Journal of Intelligent Learning Systems and Applications, Vol. 2, No. 2

(2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. Third Conference on Artificial General Intelligence

(2011). Gradient Temporal-Difference Learning Algorithms. Ph.D. thesis, University of Alberta, advisor Richard Sutton, pdf

(2011). Approximate Universal Artificial Intelligence and Self-Play Learning for Games. Ph.D. thesis, University of New South Wales, supervisors: Kee Siong Ng, Marcus Hutter, Alan Blair, William Uther, John Lloyd; pdf

(2011). Temporal Difference Learning for Connect6. Advances in Computer Games 13

(2011). Improving Temporal Difference Performance in Backgammon Variants. Advances in Computer Games 13, pdf

(2011). Evolving small-board Go players using Coevolutionary Temporal Difference Learning with Archives. Applied Mathematics and Computer Science, Vol. 21, No. 4

(2011). Learning Board Evaluation Function for Othello by Hybridizing Coevolution with Temporal Difference Learning. Control and Cybernetics, Vol. 40, No. 3, pdf

(2012). Reinforcement Learning in Games. in Marco Wiering, Martijn Van Otterlo (eds.). Reinforcement Learning: State-of-the-Art. Adaptation, Learning, and Optimization, Vol. 12, Springer

(2013). Temporal-Difference Search in Computer Go. Proceedings of the ICAPS-13 Workshop on Planning and Learning, pdf

(2013). An Introduction to Temporal Difference Learning. Seminar on Autonomous Learning Systems, TU Darmstadt, pdf

(2014). Multi-Stage Temporal Difference Learning for 2048. TAAI 2014

(2014). Multi-Criteria Comparison of Coevolution and Temporal Difference Learning on Othello. EvoApplications 2014, Springer, Volume 8602

## 2015 ...

(2015). Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. Second Edition, Contents, Temporal-Difference Learning

(2015). Giraffe: Using Deep Reinforcement Learning to Play Chess. M.Sc. thesis, Imperial College London, arXiv:1509.01549v1 » Giraffe

(2016). Systematic Selection of N-tuple Networks for 2048. CG 2016

(2017). On Generalized Bellman Equations and Temporal-Difference Learning. Canadian Conference on AI 2017, arXiv:1704.04463

## Forum Posts

## 1995 ...

Re: Parameter Tuning by Don Beal, CCC, October 02, 1998

## 2000 ...

Re: Temporal Differences by Guy Haworth, CCC, November 04, 2004 ^{[20]}

## 2010 ...

Re: Positional learning by Don Dailey, CCC, December 13, 2010

Re: Pawn Advantage, Win Percentage, and Elo by Don Dailey, CCC, April 15, 2012


## External Links

temporal - Wiktionary

(1998). Reinforcement Learning: An Introduction. MIT Press eBook

(2015). Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. Second Edition, Contents

## References

Arthur Samuel (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal, July 1959

Richard Sutton (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, Vol. 3, No. 1, pdf

Don Beal, Martin C. Smith (1998). First Results from Using Temporal Difference Learning in Shogi. CG 1998

Gerald Tesauro (1992). Temporal Difference Learning of Backgammon Strategy. ML 1992

Gerald Tesauro (1994). TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, Vol. 6, No. 2

(2008). Learning of Piece Values for Chess Variants. Technical Report TUD–KE–2008-07, Knowledge Engineering Group, TU Darmstadt, pdf

Don Beal, Martin C. Smith (1997). Learning Piece Values Using Temporal Differences. ICCA Journal, Vol. 20, No. 3

Don Beal, Martin C. Smith (1998). Experiments in Parameter Learning Using Temporal Differences. ICCA Journal, Vol. 21, No. 2, pdf

Don Beal, Martin C. Smith (1999). Learning Piece-Square Values using Temporal Differences. ICCA Journal, Vol. 22, No. 4

Jonathan Baxter, Andrew Tridgell, Lex Weaver (1998). Knightcap: A chess program that learns by combining td(λ) with game-tree search. Proceedings of the 15th International Conference on Machine Learning, pdf via citeseerX

Don Beal, Martin C. Smith (1999). Temporal Coherence and Prediction Decay in TD Learning. IJCAI 1999, pdf

(1998). Chess Endgames and Neural Networks. ICCA Journal, Vol. 21, No. 4
