## Ratios / Operating Figures

Common tools, ratios and figures that illustrate a tournament outcome and provide a basis for its interpretation.

## Number of games

The total number of games played by an engine in a tournament.

## Score

The score is a representation of the tournament outcome from the viewpoint of a certain engine.

## Win & Draw Ratio

These two ratios depend on the strength difference between the competitors, the average strength level, the color, and the drawishness of the opening book line. Because of the second factor, the ratios are strongly influenced by the time control. This is confirmed by the published statistics of the testing organisations CCRL and CEGT, which show an increase of the draw rate at longer time controls. The same correlation was shown by Kirill Kryukov, who analyzed statistics of his test games ^{[2]}. The program playing White seems to benefit more from the additional strength. So, although one would expect the win ratio to approach 50% as draw rates increase, it in fact remains about the same.
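The figures above can be computed from raw win/draw/loss counts. The following is a minimal sketch, not a definitive implementation; the function name `match_ratios` and the dictionary keys are illustrative assumptions, and `score_fraction` follows the "winning fraction" used by the sample program further down, (wins + 0.5 * draws) / games.

```python
def match_ratios(wins, draws, losses):
    """Basic operating figures from the viewpoint of one engine.

    All names here are illustrative, not taken from any tool.
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = wins + 0.5 * draws          # a draw counts as half a point
    return {
        "games": games,
        "score": score,
        "win_fraction": wins / games,   # share of games won outright
        "draw_ratio": draws / games,    # share of games drawn
        "score_fraction": score / games,
    }


if __name__ == "__main__":
    print(match_ratios(40, 50, 10))
```

Keeping the outright win share and the score fraction separate avoids confusing the two when the draw rate is high.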

## Likelihood of Superiority

The likelihood of superiority (LOS) is derived from how likely it would be for two players of the same strength to reach a certain result, a quantity known in other fields as a p-value, a measure of statistical significance of a departure from the null hypothesis ^{[6]}. When doing this analysis after the tournament, one has to distinguish between the case where one knows that a certain engine is either stronger or equally strong (directional or one-tailed test) and the case where one has no information on whether the other engine is stronger or weaker (non-directional or two-tailed test). The latter, because it uses less information, results in larger confidence intervals.

Given the tournament outcome, one can calculate the probability of that outcome under the null hypothesis, in other words how likely it would be for two players of the same strength to reach a certain result. The LOS is then the complement, 1 minus the resulting probability.

For this type of analysis the trinomial distribution, a generalization of the binomial distribution, is needed. Whereas the binomial distribution can only give the probability of an outcome with two possible events, the trinomial distribution accounts for all three possible events (win, draw, loss).

The following function gives the probability of a certain game outcome, assuming both players are of equal strength:
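The trinomial probability described above can be sketched in a few lines. This is a hedged illustration under the stated null hypothesis (equal strength, so win and loss each have probability (1 - draw_ratio) / 2); the function name `trinomial_probability` and its parameters are assumptions made for this example.

```python
from math import factorial


def trinomial_probability(wins, draws, losses, draw_ratio):
    """Probability of exactly this W/D/L split for two equally strong players.

    Assumes independent games with P(win) = P(loss) = (1 - draw_ratio) / 2.
    """
    if not 0.0 <= draw_ratio <= 1.0:
        raise ValueError("draw_ratio must lie in [0, 1]")
    games = wins + draws + losses
    p_decisive = (1.0 - draw_ratio) / 2.0
    # Multinomial coefficient N! / (W! D! L!), computed with exact integers.
    coefficient = factorial(games) // (
        factorial(wins) * factorial(draws) * factorial(losses)
    )
    return coefficient * p_decisive ** (wins + losses) * draw_ratio ** draws


if __name__ == "__main__":
    print(trinomial_probability(3, 5, 2, draw_ratio=0.5))
```

The factorials grow quickly, which is the inefficiency for long matches that the text mentions.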

This calculation becomes very inefficient for a large number of games. In this case the normal distribution gives a good approximation:

where N(1 - draw_ratio) is the sum of wins and losses:

To calculate the LOS one needs the cumulative distribution function of the given normal distribution. However, as pointed out by Rémi Coulom, the calculation can be done cleverly, and the normal approximation is not really required ^{[7]}. As further emphasized by Kai Laskos ^{[8]} and Rémi Coulom ^{[9]}^{[10]}, draws do not count in the LOS calculation, and it makes no difference whether the game results were obtained playing Black or White. The following is a good approximation when the two players played the same number of games with each color ^{[11]}^{[12]}^{[13]}:

$$LOS = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{wins - losses}{\sqrt{2\,(wins + losses)}}\right)\right]$$
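A short sketch of both routes to the LOS. `los_normal` uses the same erf expression as the sample C++ program below. `los_exact` is only one possible reading of the remark that the normal approximation is not required: it assumes a uniform prior on the win probability among decisive games and computes the posterior probability that it exceeds 1/2. That interpretation, and both function names, are assumptions of this example rather than something the cited postings are known to prescribe.

```python
import math


def los_normal(wins, losses):
    """LOS via the normal approximation; draws are ignored."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 + 0.5 * math.erf((wins - losses) / math.sqrt(2.0 * decisive))


def los_exact(wins, losses):
    """P(p > 1/2) for p ~ Beta(wins + 1, losses + 1) (uniform prior, assumed).

    Uses the identity with a binomial tail:
    P(p > 1/2) = P(X >= losses + 1) for X ~ Binomial(wins + losses + 1, 1/2).
    """
    n = wins + losses + 1
    tail = sum(math.comb(n, k) for k in range(losses + 1, n + 1))
    return tail / 2 ** n


if __name__ == "__main__":
    print(los_normal(60, 40), los_exact(60, 40))
```

For long matches the two values agree closely; the exact sum matters mostly when there are few decisive games.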

Sample Program
A tiny C++11 program to compute Elo difference and LOS from W/L/D counts was given by Álvaro Begué^{[14]} :

#include <cstdio>
#include <cstdlib>
#include <cmath>

int main(int argc, char **argv) {
  if (argc != 4) {
    std::printf("Wrong number of arguments.\n\nUsage:%s <wins> <losses> <draws>\n", argv[0]);
    return 1;
  }
  int wins = std::atoi(argv[1]);
  int losses = std::atoi(argv[2]);
  int draws = std::atoi(argv[3]);

  double games = wins + losses + draws;
  std::printf("Number of games: %g\n", games);

  double winning_fraction = (wins + 0.5 * draws) / games;
  std::printf("Winning fraction: %g\n", winning_fraction);

  double elo_difference = -std::log(1.0 / winning_fraction - 1.0) * 400.0 / std::log(10.0);
  std::printf("Elo difference: %+g\n", elo_difference);

  double los = .5 + .5 * std::erf((wins - losses) / std::sqrt(2.0 * (wins + losses)));
  std::printf("LOS: %g\n", los);
}

## Statistical Analysis

The trinomial versus the 5-nomial model

As indicated above, a match between two engines is usually modeled as a sequence of independent trials taken from a trinomial distribution with probabilities (win_ratio, draw_ratio, loss_ratio). This model is appropriate for a match with randomly selected opening positions and randomly assigned colors (to maintain fairness). However, one may show that under reasonable Elo models the trinomial model is not correct when games are played in pairs with reversed colors (as is commonly the case) and unbalanced opening positions are used.

This was also empirically observed by Kai Laskos ^{[15]}. He noted that the statistical predictions of the trinomial model do not match reality very well in the case of paired games. In particular, he observed that for some data sets the variance of the match score as predicted by the trinomial model greatly exceeds the variance as calculated by the jackknife estimator. The jackknife estimator is non-parametric, so it does not depend on any particular statistical model. It appears the mismatch may even occur for balanced opening positions, an effect which can only be explained by the existence of correlations between paired games, something not considered by any Elo model.
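The jackknife variance of a mean score can be sketched as follows. This is a generic leave-one-out jackknife, not necessarily the exact procedure used in the cited posting; the function name `jackknife_variance_of_mean` and the idea of feeding it per-pair scores are assumptions of this example.

```python
def jackknife_variance_of_mean(values):
    """Leave-one-out jackknife estimate of the variance of the sample mean.

    `values` could be the scores of game pairs (0, 0.5, 1, 1.5 or 2 each).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(values)
    # Mean of the sample with observation i left out, for every i.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    center = sum(leave_one_out) / n
    return (n - 1) / n * sum((m - center) ** 2 for m in leave_one_out)


if __name__ == "__main__":
    pair_scores = [1.0, 1.0, 1.5, 0.5, 1.0, 2.0, 1.0, 0.5]
    print(jackknife_variance_of_mean(pair_scores))
```

Because it works directly on the observed pair scores, the estimate picks up any correlation inside a pair without assuming a trinomial or Elo model.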

Overestimating the variance of the match score implies that derived quantities, such as the number of games required to establish the superiority of one engine over another with a given level of significance, are also overestimated. To obtain agreement between statistical predictions and actual measurements one may adopt the more general 5-nomial model. In the 5-nomial model the outcome of paired games is assumed to follow a 5-nomial distribution with probabilities (p_0, p_{1/2}, p_1, p_{3/2}, p_2), one for each possible score of a game pair (0, 1/2, 1, 3/2 or 2 points).

These unknown probabilities may be estimated from the outcome frequencies of the paired games and then subsequently be used to compute an estimate for the variance of the match score. Summarizing: in the case of paired games the 5-nomial model handles the following effects correctly which the trinomial model does not:

Unbalanced openings

Correlations between paired games
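The difference between the two models can be made concrete by estimating the variance of the per-game match score both ways. This is a sketch under assumptions: the function names are made up for this example, `pair_counts` is assumed to hold the number of game pairs scoring 0, 1/2, 1, 3/2 and 2 points in that order, and the trinomial variance uses the same expression as the SPRT code in the next section (w + d/4 - s^2, divided by the number of games).

```python
def trinomial_score_variance(wins, draws, losses):
    """Variance of the mean per-game score under the trinomial model."""
    games = wins + draws + losses
    w, d = wins / games, draws / games
    score = w + d / 2
    return (w + d / 4 - score ** 2) / games


def pentanomial_score_variance(pair_counts):
    """Variance of the mean per-game score estimated from pair outcomes.

    pair_counts[i] is the number of pairs that scored i/2 points (i = 0..4).
    """
    pairs = sum(pair_counts)
    probs = [count / pairs for count in pair_counts]
    pair_scores = [0.0, 0.5, 1.0, 1.5, 2.0]
    mean = sum(p * x for p, x in zip(probs, pair_scores))
    var_pair = sum(p * (x - mean) ** 2 for p, x in zip(probs, pair_scores))
    # Per-game score is half the pair score, hence the factor 1/4.
    return var_pair / 4 / pairs


if __name__ == "__main__":
    # Extreme case: every opening is so unbalanced that White always wins,
    # so each pair scores exactly 1 point (one win, one loss).
    print(trinomial_score_variance(100, 0, 100))
    print(pentanomial_score_variance([0, 0, 100, 0, 0]))
```

In this extreme case the trinomial estimate is 0.00125 while the pair-based estimate is 0, which illustrates how the trinomial model can overestimate the variance when openings are unbalanced.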

For further discussion on the potential use of unbalanced opening positions in engine testing see the posting by Kai Laskos^{[16]} .

## SPRT

The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, a statistical analysis where the sample size is not fixed in advance, developed by Abraham Wald ^{[17]}. While originally developed for quality control studies in manufacturing, the SPRT has been formulated for use in the computerized testing of human examinees as a termination criterion ^{[18]}. As mentioned by Arthur Guez in his 2015 Ph.D. thesis Sample-based Search Methods for Bayes-Adaptive Planning ^{[19]}, Alan Turing, assisted by Jack Good, used a similar sequential testing technique to help decipher Enigma codes at Bletchley Park ^{[20]}. SPRT is applied in Stockfish testing to terminate self-testing series early if the result is likely outside a given Elo window ^{[21]}. In August 2016, Michel Van den Bergh posted the following Python code in CCC to implement the SPRT à la Cutechess-cli or Fishtest: ^{[22]}^{[23]}

from __future__ import division
import math

def LL(x):
    return 1/(1+10**(-x/400))

def LLR(W,D,L,elo0,elo1):
    """
    This function computes the log likelihood ratio of H0:elo_diff=elo0 versus
    H1:elo_diff=elo1 under the logistic elo model
    expected_score=1/(1+10**(-elo_diff/400)).
    W/D/L are respectively the Win/Draw/Loss count. It is assumed that the outcomes of
    the games follow a trinomial distribution with probabilities (w,d,l). Technically
    this is not quite an SPRT but a so-called GSPRT as the full set of parameters (w,d,l)
    cannot be derived from elo_diff, only w+(1/2)d. For a description and properties of
    the GSPRT (which are very similar to those of the SPRT) see
    http://stat.columbia.edu/~jcliu/paper/GSPRT_SQA3.pdf
    This function uses the convenient approximation for log likelihood
    ratios derived here:
    http://hardy.uhasselt.be/Toga/GSPRT_approximation.pdf
    The previous link also discusses how to adapt the code to the 5-nomial model
    discussed above.
    """
    # avoid division by zero
    if W==0 or D==0 or L==0:
        return 0.0
    N=W+D+L
    w,d,l=W/N,D/N,L/N
    s=w+d/2
    m2=w+d/4
    var=m2-s**2
    var_s=var/N
    s0=LL(elo0)
    s1=LL(elo1)
    return (s1-s0)*(2*s-s0-s1)/var_s/2.0

def SPRT(W,D,L,elo0,elo1,alpha,beta):
    """
    This function sequentially tests the hypothesis H0:elo_diff=elo0 versus
    the hypothesis H1:elo_diff=elo1 for elo0<elo1. It should be called after
    each game until it returns either 'H0' or 'H1' in which case the test stops
    and the returned hypothesis is accepted.
    alpha is the probability that H1 is accepted while H0 is true
    (a false positive) and beta is the probability that H0 is accepted
    while H1 is true (a false negative). W/D/L are the current win/draw/loss
    counts, as before.
    """
    LLR_=LLR(W,D,L,elo0,elo1)
    LA=math.log(beta/(1-alpha))
    LB=math.log((1-beta)/alpha)
    if LLR_>LB:
        return 'H1'
    elif LLR_<LA:
        return 'H0'
    else:
        return ''
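As a usage illustration, the sketch below restates the same LLR approximation in a compact, self-contained Python 3 form and evaluates it for one set of counts. The function names, the Elo window (0, 5), the error rates alpha = beta = 0.05 and the W/D/L counts are all assumptions chosen for this example; they are not settings taken from Cutechess-cli or Fishtest.

```python
import math


def expected_score(elo_diff):
    """Logistic Elo model: expected score for a given Elo difference."""
    return 1 / (1 + 10 ** (-elo_diff / 400))


def llr(wins, draws, losses, elo0, elo1):
    """Approximate log likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    if wins == 0 or draws == 0 or losses == 0:
        return 0.0  # same guard against division by zero as the code above
    games = wins + draws + losses
    w, d = wins / games, draws / games
    score = w + d / 2
    variance_of_score = (w + d / 4 - score ** 2) / games
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2 * score - s0 - s1) / (2 * variance_of_score)


def sprt_bounds(alpha, beta):
    """Lower and upper stopping bounds for the log likelihood ratio."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)


lower, upper = sprt_bounds(0.05, 0.05)
value = llr(1100, 2000, 1000, 0, 5)
verdict = "H1" if value > upper else "H0" if value < lower else "continue"
print(round(lower, 3), round(upper, 3), round(value, 3), verdict)
```

With equal error rates the two bounds are symmetric around zero, so the test stops as soon as the accumulated evidence for either hypothesis reaches the same magnitude.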

Arthur Guez (2015). Sample-based Search Methods for Bayes-Adaptive Planning. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, pdf

Match Statistics: the statistics of chess tournaments and matches, that is, the collection of chess games and the presentation, analysis, and interpretation of game-related data, most commonly game results used to determine the relative playing strength of chess-playing entities, here with a focus on chess engines. To apply match statistics, besides considering the statistical population, it is conventional to hypothesize a statistical model describing a set of probability distributions ^{[1]}.

Doubling Time Control

As posted in October 2016 ^{[3]}, Andreas Strangmüller conducted an experiment with Komodo 9.3, playing time control doubling matches under Cutechess-cli of 3000 games with 1500 opening positions each, without pondering, learning, or tablebases, on an Intel i5-750 @ 3.5 GHz, 1 core, 128 MB hash ^{[4]}. See also Kai Laskos' 2013 results with Houdini 3 ^{[5]} and Diminishing Returns.

Time controls in the doubling series: 10+0.1, 20+0.2, 40+0.4, 80+0.8, 160+1.6, 320+3.2, 640+6.4, 1280+12.8.

## Elo-Rating & Win-Probability

See Pawn Advantage, Win Percentage, and Elo.

Generalization of the Elo-Formula: win_probability of player i in a tournament with n players
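The formula for the generalized Elo win probability is not reproduced here. One common generalization, assumed for this sketch and not confirmed by the text, weights each player by 10^(rating/400) and normalizes over all n players (the Bradley-Terry form). For n = 2 it reduces to the usual logistic Elo formula. The function name `win_probabilities` is illustrative.

```python
def win_probabilities(ratings):
    """Assumed generalization: P(i) = 10^(R_i/400) / sum_j 10^(R_j/400)."""
    strengths = [10 ** (rating / 400) for rating in ratings]
    total = sum(strengths)
    return [strength / total for strength in strengths]


if __name__ == "__main__":
    print(win_probabilities([2800, 2700, 2650]))
```

The result depends only on rating differences, so adding the same constant to every rating leaves the probabilities unchanged.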


## Publications

## 1920 ...

(1929). Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. pdf (German)

(1945). Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics, Vol. 16, No. 2, doi: 10.1214/aoms/1177731118

(1947). Sequential Analysis. John Wiley and Sons, AbeBooks

(1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, Vol. 39, Nos. 3/4, doi: 10.2307/2334029, JSTOR 2334029

## 1960 ...

(1962). Games, Gods & Gambling: A History of Probability and Statistical Ideas. Dover Publications, ISBN-13: 978-0486400235

(1973). Mechanisms for Comparing Chess Programs. ACM Annual Conference, pdf

(1978). Performance Analysis of the Technology Chess Program. Ph.D. Thesis. Tech. Report CMU-CS-78-189, Carnegie Mellon University, CMU-CS-77 pdf » Tech

(1978). The Rating of Chessplayers, Past and Present. Arco Publications ^{[24]}

(1979). Strength of a Chess Playing Computer. ICCA Newsletter, Vol. 2, No. 1

(1979). On the Grading of Chess Players. Personal Computing, Vol. 3, No. 3, pp. 47

## 1980 ...

(1981). Survey-Chess Games. Your Computer, August/September 1981 ^{[25]}

(1982). Computer Chess Strength. Advances in Computer Chess 3

(1985). Sequential Analysis. Tests and confidence intervals. Springer

(1989). Measuring the Performance Potential of Chess Programs. Advances in Computer Chess 5

(1989). Playing Levels. Computer Chess News Sheet 23, pp 2, pdf hosted by Mike Watters

## 1990 ...

(1990). Measuring the Performance Potential of Chess Programs. Artificial Intelligence, Vol. 43, No. 1

(1990). Are all Linear Paired Comparison Models Equivalent. pdf

(1990). Speed, Processors and Ratings. Computer Chess News Sheet 25, pp 6, pdf hosted by Mike Watters

(1991). A taxonomy of concepts for evaluating chess strength: examples from two difficult categories. Advances in Computer Chess 6, pdf

(1992). Are You Sure It's Better? Selective Search 40, pp. 21, pdf hosted by Mike Watters

(1993). Rating Systems for Gameplayers, and Learning. ps

(1997). CRAFTY Goes Deep. ICCA Journal, Vol. 20, No. 2 » Crafty

## 2000 ...

(2000). New Self-Play Results in Computer Chess. CG 2000

(2001). Self-play Experiments in Computer Chess Revisited. Advances in Computer Games 9

(2001). Modeling the “Go Deep” Behaviour of CRAFTY and DARK THOUGHT. Advances in Computer Games 9 » Crafty, Dark Thought

(2001). Self-Play, Deep Search and Diminishing Returns. ICGA Journal, Vol. 24, No. 2

(2002). Self-play: Statistical Significance. 7th Computer Olympiad Workshop

(2003). Self-Play: Statistical Significance. ICGA Journal, Vol. 26, No. 2

(2003). Follow-Up on Self-Play, Deep Search, and Diminishing Returns. ICGA Journal, Vol. 26, No. 2

(2004). MM Algorithms for Generalized Bradley-Terry Models. The Annals of Statistics, Vol. 32, No. 1, 384–406, pdf ^{[26]}^{[27]}^{[28]}^{[29]}

## 2005 ...

(2005). New Results in Deep-Search Behaviour. ICGA Journal, Vol. 28, No. 4, pdf

(2007). Factors affecting diminishing returns for searching deeper. CGW 2007 » Crafty, Rybka, Shredder, Diminishing Returns

(2007). Statistical Minefields with Version Testing. AI Factory, Winter 2007 » Engine Testing

(2007). Visualization and Adjustment of Evaluation Functions Based on Evaluation Values and Win Probability. AAAI 2007, pdf

(2008). Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength. CG 2008, draft as pdf

(2009). Skill Rating by Bayesian Inference. CIDM 2009, pdf ^{[30]}

(2009). Performance and Prediction: Bayesian Modelling of Fallible Choice in Chess. Advances in Computer Games 12, pdf

## 2010 ...

(2010). Predicting the Outcome of Chess Games based on Historical Data. IST - Technical University of Lisbon ^{[31]}

(2011). Intrinsic Chess Ratings. AAAI 2011, pdf, slides as pdf ^{[32]}

(2011). Understanding Distributions of Chess Performances. Advances in Computer Games 13, pdf

(2011). A Discrete Evolutionary Model for Chess Players' Ratings. Physics and Society, arXiv:1103.1530v2

(2012). Paired Comparisons with Ties: Modeling Game Outcomes in Chess. pdf preprint ^{[33]}

(2012). Determining the Strength of Chess Players based on actual Play. ICGA Journal, Vol. 35, No. 1

(2013). The Impact of the Search Depth on Chess Playing Strength. ICGA Journal, Vol. 36, No. 2

(2014). ORDO v0.9.6 Ratings for chess and other games. September 2014, pdf » Ordo ^{[34]}

(2014). Move Similarity Analysis in Chess Programs. Entertainment Computing, Vol. 5, No. 3, preprint as pdf ^{[35]}

(2014). Human and Computer Preferences at Chess. pdf

(2014). Quality of play in chess and methods for measuring. pdf ^{[36]}^{[37]}

## 2015 ...

(2015). Quantifying Depth and Complexity of Thinking and Knowledge. ICAART 2015, pdf

(2015). Measuring Level-K Reasoning, Satisficing, and Human Error in Game-Play Data. IEEE ICMLA 2015, pdf preprint

(2015). A Comparative Review of Skill Assessment: Performance, Prediction and Profiling. Advances in Computer Games 14

(2015). Estimating Ratings of Computer Players by the Evaluation Scores and Principal Variations in Shogi. ACIT-CSI

(2017). Who is the Master? ICGA Journal, Vol. 39, No. 1, draft as pdf » Stockfish, Who is the Master?

## Forum & Blog Postings

## 2010 ...

Re: Engine Testing - Statistics by John Major, CCC, January 14, 2010

## 2015 ...

Re: The SPRT without draw model, elo model or whatever.. by Michel Van den Bergh, CCC, August 18, 2016

About expected scores and draw ratios by Jesús Muñoz, CCC, September 17, 2016

ELO measurements by Peter Österlund, CCC, August 06, 2017 » Playing Strength

Re: "Intrinsic Chess Ratings" by Regan, Haworth -- by Kenneth Regan, CCC, November 20, 2017 » Who is the Master?


## References

(1945). Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics, Vol. 16, No. 2, doi: 10.1214/aoms/1177731118

Arthur Guez (2015). Sample-based Search Methods for Bayes-Adaptive Planning. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, pdf

(1979). Studies in the history of probability and statistics. XXXVII AM Turing’s statistical work in World War II. Biometrika, Vol. 66, No. 2

(2004). MM Algorithms for Generalized Bradley-Terry Models. The Annals of Statistics, Vol. 32, No. 1, 384–406, pdf

(2014). Move Similarity Analysis in Chess Programs. Entertainment Computing, Vol. 5, No. 3, preprint as pdf