Testing is the process of either eliminating bugs or measuring the performance of a chess engine. New implementations of move generation are tested with Perft, while new features and tuning of search and evaluation are verified with test positions and by playing matches against other engines.
Bug Hunting
Analyzing
Tuning
Test-Positions
Running sets of test positions and counting solutions within a fixed time frame is useful to check whether something broke after program changes, or to get hints about missing knowledge. But one should be careful about tuning engines based on test-position results, since solving (possibly tactical) test positions does not necessarily correlate with practical playing strength in matches against other opponents.
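Such a test-position run can be sketched as a small harness that parses EPD records with a bm (best move) opcode and counts how many the engine solves. The search callback and the record layout handling below are illustrative placeholders, not any particular engine's API:

```python
# Minimal sketch of a test-suite runner over EPD records carrying a
# "bm" (best move) opcode. The engine is abstracted as a callback;
# a real harness would invoke it with a fixed time per position.

def parse_epd(line):
    """Split an EPD record into the position part and its opcodes."""
    fields = line.split()
    position = " ".join(fields[:4])   # placement, side to move, castling, ep
    ops = {}
    for op in " ".join(fields[4:]).split(";"):
        op = op.strip()
        if op:
            name, _, value = op.partition(" ")
            ops[name] = value.strip('"')
    return position, ops

def run_suite(epd_lines, search):
    """Return (solved, total); `search(position)` yields the engine's move."""
    solved = total = 0
    for line in epd_lines:
        position, ops = parse_epd(line)
        if "bm" not in ops:
            continue
        total += 1
        if search(position) in ops["bm"].split():
            solved += 1
    return solved, total
```

A real run would additionally fix the time per position and log the solved count per suite, keeping in mind the caveat above that tactical suites are no substitute for match results.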
Matches
Most testing involves running different versions of a program in matches and comparing the results.
Time Controls
Generally speaking, changes that do not alter the search tree itself but only affect performance (e.g. move generation) can be tested at fixed nodes, fixed time, or fixed depth. In all other cases the time management should be left to the engine to simulate real tournament conditions. On the other hand, debugging is much easier under fixed conditions, as the games become deterministic.
Aside from the type of time control, one also has to decide how much time should be spent per game, i.e. what the average quality of the games should be. While one can test more changes in a given amount of time at short time controls, it is also relevant how a certain change scales to different strengths. For example, should one increase R in null-move pruning to 3 at depths > 7, this change can only be tested effectively at time controls where the new condition is triggered frequently enough, i.e. where the average search depth is well above seven. It is hard to generalize, but on average changes to the search functions (LMR, null move, futility or similar pruning, reductions and extensions) tend to be more sensitive to the time control than the tuning of evaluation parameters.
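As a rough back-of-the-envelope check of whether such a depth-conditioned change will fire at a given time control, one can estimate the reachable depth from the node budget and an assumed effective branching factor. The numbers below (nodes per second, branching factor) are illustrative assumptions, not measurements of any engine:

```python
import math

def estimated_depth(nps, seconds_per_move, ebf):
    """Rough nominal search depth: nodes grow roughly as ebf**depth,
    so depth ~ log(nodes) / log(ebf)."""
    nodes = nps * seconds_per_move
    return int(math.log(nodes) / math.log(ebf))

# A hypothetical engine searching 1M nodes/s with an effective branching
# factor of 2 reaches roughly depth 19 at one second per move, but only
# about depth 14 at an ultra-fast 20 ms per move.
```

At the slower control the depth > 7 condition above fires on essentially every move, so the change is actually exercised; at extreme bullet controls it may barely trigger at all.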
Opening
During testing, the engines should ideally play the same style of openings they would play in a normal tournament, so as not to optimize them for different types of positions. One option is to use the engine's own opening book; another is to use an opening suite, a set of quiet test positions. In the latter case, the same opening suite is used for every tournament conducted, and furthermore each position is played a second time with colors reversed. With these measures one can try to minimize the disparity between tests caused by different openings.
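The reversed-colors scheme can be sketched as a pairing generator: every opening from the suite produces two games with the engines swapping sides, so neither version's result depends on which openings it happened to get with White. The function and names are illustrative:

```python
def paired_schedule(openings, engine_a, engine_b):
    """Each opening is played twice with colors reversed.
    Returns a list of (white, black, opening) tuples."""
    games = []
    for opening in openings:
        games.append((engine_a, engine_b, opening))  # A plays White
        games.append((engine_b, engine_a, opening))  # colors reversed
    return games
```

With this schedule, any bias an individual opening carries for one color cancels out across the pair of games.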
Interfaces
Free graphical user interfaces and command-line tools for running engine-engine matches between UCI and Chess Engine Communication Protocol compatible engines include:
- Cutechess-cli
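A typical cutechess-cli invocation for such a match can be sketched by building the command line programmatically; the option names below (-engine, -each, -games, -repeat, -openings, -pgnout) follow the cutechess-cli documentation as I recall it, so verify them against your installed version:

```python
def cutechess_match(engine_new, engine_old, games, tc, book_epd, pgn_out):
    """Build a cutechess-cli argument list for a two-engine match:
    `games` games, openings drawn from an EPD suite, each opening
    replayed with colors reversed (-repeat)."""
    return [
        "cutechess-cli",
        "-engine", f"cmd={engine_new}",
        "-engine", f"cmd={engine_old}",
        "-each", "proto=uci", f"tc={tc}",
        "-games", str(games),
        "-repeat",                                  # colors reversed per opening
        "-openings", f"file={book_epd}", "format=epd", "order=random",
        "-pgnout", pgn_out,
    ]

# The list can be passed to subprocess.run(...) once cutechess-cli and
# the engine binaries (hypothetical names here) are on the PATH.
```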
Frameworks
Chess Server
One can also test an engine's performance by comparing it with other programs on the various internet platforms [4]. In this case, differences in hardware and in features such as Endgame Tablebases or Opening Books have to be considered.
Statistics
The question of whether certain results actually indicate a strength increase can be answered with statistical analysis.
Ratings
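Deriving a rating difference and a confidence figure from a match result can be sketched with the standard logistic Elo model and a likelihood-of-superiority estimate; the formulas are the usual ones, while the sample numbers are made up for illustration:

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def likelihood_of_superiority(wins, losses):
    """Probability that the new version is genuinely stronger,
    using a normal approximation and ignoring draws."""
    return 0.5 * (1.0 + math.erf(
        (wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Hypothetical result: 120 wins, 100 losses, 180 draws in 400 games.
score = (120 + 0.5 * 180) / 400      # 0.525
# elo_diff(score) is about +17 Elo; likelihood of superiority about 91%.
```

Note how even a 400-game match leaves real uncertainty: a 91% likelihood of superiority still means roughly a one-in-eleven chance the change is not an improvement.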
Test Results
Notable Bugs
Publications
Forum Posts
1995 ...
2000 ...
2005 ...
2010 ...
- XBoard and epd tournament by Vlad Stamate, CCC, January 31, 2010 » Chess Engine Communication Protocol
- Long game vs short game testing by Vlad Stamate, CCC, April 08, 2010
- hiatus good for bug-finding by Stuart Cracraft, CCC, June 27, 2010
- Pairings generation based on a big PGN file by Harun Taner, CCC, July 22, 2010
2011 ...
- testing question by Larry Kaufman, CCC, June 01, 2011
- Debugging regression tests by Onno Garms, CCC, June 16, 2011 [6]
2012 ...
- fast game testing by Jon Dart, CCC, January 08, 2012
- Your best bug ? by Ed Schröder, CCC, August 06, 2012
- Yet Another Testing Question by Brian Richardson, CCC, September 15, 2012
- Another testing question by Larry Kaufman, CCC, September 23, 2012
- A word for casual testers by Don Dailey, CCC, December 25, 2012
2013 ...
- A poor man's testing environment by Ed Schröder, CCC, January 04, 2013 [7] » Match Statistics
- engine-engine testing isues by Jens Bæk Nielsen, CCC, January 20, 2013
- Beta for Stockfish distributed testing by Gary, CCC, March 05, 2013 » Fishtest
- Fishtest Distributed Testing Framework by Marco Costalba, CCC, May 01, 2013 » Fishtest
- cutechess-cli 0.6.0 released by Ilari Pihlajisto, CCC, July 12, 2013
- fast testing NIT algorithm by Don Dailey, CCC, August 22, 2013
- OICS: Computers Only ICS based Chess server for anyone by Joshua Shriver, CCC, August 26, 2013 » OICS
2014 ...
2015 ...
- Bullet vs regular time control, say 40/4m CCRL/CEGT by Ed Schröder, CCC, August 29, 2015
- Static evaluation test posistions by Shawn Chidester, CCC, November 25, 2015
- Re: Static evaluation test posistions by Ferdinand Mosca, CCC, November 26, 2015 » Python
2016 ...
- Ordo 1.0.9 (new features for testers) by Miguel A. Ballicora, CCC, January 25, 2016
- cluster versus single server by Folkert van Heusden, CCC, April 28, 2016
- Testing using many computers and architectures by Andrew Grant, CCC, September 14, 2016
- command line engine match? by Erin Dame, CCC, November 06, 2016 » CLI
- Testing with different EPD suits for search vs eval changes by Michael Sherwin, CCC, December 23, 2016
2017 ...
External Links
References