Parallel Search, also known as Multithreaded Search or SMP Search, is a way to increase search speed by using additional processors. This topic that has been gaining popularity recently with multiprocessor computers becoming widely available. Utilizing these additional processors is an interesting domain of research, as traversing a search tree is inherently serial. Several approaches have been devised, with the most popular today being Young Brothers Wait Concept and Shared Hash Table aka Lazy SMP.
This page gives a brief summary of the different types. SMP algorithms are classified by their scalability (trend in search speed as the number of processors becomes large) and their speedup (change in time to complete a search). Typically, programmers use scaling to mean change in nodes per second (NPS) rates, and speedup to mean change in time to depth. Scaling and scalability are thus two different concepts.
This technique is a very simple approach to SMP. The implementation requires little more than starting additional processors. Processors are simply fed the root position at the beginning of the search, and each searches the same tree with the only communication being the transposition table. The gains come from the effect of nondeterminism. Each processor will finish the various subtrees in varying amounts of time, and as the search continues, these effects grow making the search trees diverge. The speedup is then based on how many nodes the main processor is able to skip from transposition table entries. Many programs used this if a "quick and dirty" approach to SMP is needed. It had the reputation of little speedup on a mere 2 processors, and to scale quite badly after this.
Recent improvements by Dan Homan[2] , Martin Sedlak[3] and others on Lazy SMP indicate that the algorithm scales quite well up to 8 cores and beyond [4] .
ABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right) is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill[5] . It is based on the Shared Hash Table, and adds the number of processors searching this node inside the hash-table entry for better utilization - considering the Young Brothers Wait Concept.
Parallel Alpha-Beta
These algorithms divide the Alpha-Beta tree, giving different subtrees to different processors. Because alpha-beta is a serial algorithm, this approach is much more complex. However, these techniques also provide for the greatest gains from additional processors.
Principal Variation Splitting (PVS)
In Principal Variation Splitting (PVS), each node is expressed by a thread. A thread will spawn one child thread for each legal move. But data dependency specified by the algorithm exists among these threads: After getting a tighter bound from the thread corresponding to the PV node, the remaining threads are ready to run.
The idea in Feldmann'sYoung Brothers Wait Concept (YBWC) [7][8] as well in Kuszmaul'sJamboree Search[9][10][11] , is to search the first sibling node first before spawning the remaining siblings in parallel. This is based on the observations that the first move is either going to produce a cutoff (in which case processing sibling nodes is wasted effort) or return much better bounds. If the first move does not produce a cut-off, then the remaining moves are searched in parallel. This process is recursive.
Since the number of processors is not infinite the process of "spawning" work normally consists in putting it on some kind of "work to be done stack" where processors are free to grab work in FIFO fashion when there is no work to do. In YBW you would not "spawn" or place work on the stack until the first sibling is searched.
In their 1983 paper Improved Speedup Bounds for Parallel Alpha-Beta Search[12] , Raphael Finkel and John Philip Fishburn already gave the theoretical confirmation to the common sense wisdom that parallel resources should first be thrown into searching the first child. Assuming the tree is already in an approximation to best-first order, this establishes a good alpha value that can then be used to parallel search the later children. The algorithm in the 1982 Artificial Intelligence paper [13] , which Fishburn called the "dumb algorithm" in his 1981 thesis presentation [14] gives p^0.5 speedup with p processors, while the 1983 PAMI algorithm (the "smart algorithm") gives p^0.8 speedup for lookahead trees with the branching factor of chess.
This algorithm, invented by the Cray Blitz team (including Robert Hyatt[15] ), is the most complex. Though this gives the best known scalability for any SMP algorithm, there are very few programs using it because of its difficulty of implementation.
Other Approaches
Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success due to the fact that they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.
During an parallel search, certain areas of memory must be protected to make sure processors do not write simultaneously and corrupt the data. Some type of semaphore system must be used. Semaphores access a piece of shared memory, typically an integer. When a processor wants to access protected memory, it reads the integer. If it is zero, meaning no other process is accessing the memory, then the processor attempts to make the integer nonzero. This whole process must be done atomically, meaning that the read, compare, and write are all done at the same time. Without this atomicity another processor could read the integer at the same time and both would see that they can freely access the memory.
In chess programs that use parallel alpha-beta, usually spinlocks are used. Because the semaphores are held for such short periods of time, processors want to waste as little time as possible after the semaphore is released before acquiring access. To do this, if the semaphore is held when a processor reaches it, the processor continuously reads the semaphore. This technique can waste a lot of processing time in applications with high contention, meaning that many processes try to access the same semaphore simultaneously. In chess programs, however, we want as much processing power as possible.
Spinlocks are sometimes implemented in assembly language because the operating system does not have an API for them.
Threads vs. Processes
There are two ways of utilizing the extra processing power of multiple CPUs, threads and processes. The difference between them is that threads share all memory in the program, but there are multiple threads of execution. In processes, all memory is local to each processor except memory that is explicitly shared. This means that in a threaded program, functions must pass around an extra argument specifying which thread is active, thus which board structure to use. When using processes, a single global board can be used that will be duplicated for each process.
Threads are more common, because they are easier to debug as well as implement, provided the program does not already have lots of global variables. Processes are favored by some because the need to explicitly share memory makes subtle bugs easier to avoid. Also, in processes, the extra argument to most functions is not needed.
Selim Akl, David T. Barnard, R.J. Doran, (1980). Design, analysis and implementation of a parallel alpha-beta algorithm, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
Selim Akl, David T. Barnard, R.J. Doran (1980). Simulation and Analysis in Deriving Time and Storage Requirements for a Parallel Alpha-Beta Pruning Algorithm. IEEE International Conference on Parallel Processing, pp. 231-234.
Selim Akl, David T. Barnard, R.J. Doran (1980). Searching game trees in parallel, Proceedings of the Third Biennial Conference of the Canadian Society for Computational Studies of Intelligence, Victoria, B.C.
1981
John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
Selim Akl, R.J. Doran (1981). A comparison of parallel implementations of the alpha-beta and Scout tree search algorithms using the game of checkers, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
Jonathan Schaeffer (1988). Distributed Game-Tree Searching. Journal of Parallel and Distributed Computing, Vol. 6, No. 2, pp. 90-114.
Chris Ferguson, Richard Korf (1988). Distributed Tree Search and its Application to Alpha-Beta Pruning. Proceedings of AAAI-88, Vol. I, pp. 128-132. Saint Paul, MN, pdf
Monroe Newborn (1988). Unsynchronized Iterative Deepening Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, Vol. 10, No. 5, pp. 687-694. ISSN 0162-8828.
Robert Hyatt, Bruce W. Suter, Harry Nelson (1989). A Parallel Alpha-Beta Tree Searching Algorithm. Parallel Computing, Vol. 10, No. 3, pp. 299-308. ISSN 0167-8191.
Feng-hsiung Hsu (1989). Large Scale Parallelization of Alpha-beta Search: An Algorithmic and Architectural Study with Computer Chess. Ph.D. thesis, Technical report CMU-CS-90-108, Carnegie Mellon University, advisor Hsiang-Tsung Kung
Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1992). Distributed Game Tree Search on a Massively Parallel System. Data Structures and Efficient Algorithms, B. Monien, Th. Ottmann (eds.), Springer, Lecture Notes in Computer Science, 594, 1992, 270-288
1993
Rainer Feldmann (1993). Game Tree Search on Massively Parallel Systems. Ph.D. Thesis, pdf
Valavan Manohararajah (2001). Parallel Alpha-Beta Search on Shared Memory Multiprocessors. Masters Thesis, pdf
Florian Schintke, Jens Simon, Alexander Reinefeld (2001). A Cache Simulator for Shared Memory Systems. International Conference on Computational Science ICCS 2001, San Francisco, CA, Springer LNCS 2074, vol. 2, pp. 569-578. zipped ps
2002
Akihiro Kishimoto, Jonathan Schaeffer. (2002). Distributed Game-Tree Search Using Transposition Table Driven Work Scheduling, In Proc. of 31st International Conference on Parallel Processing (ICPP'02), pages 323-330, IEEE Computer Society Press. pdf via CiteSeerX
Akihiro Kishimoto, Jonathan Schaeffer. (2002). Transposition Table Driven Work Scheduling in Distributed Game-Tree Search (Best Paper Prize), In Proc. of Fifteenth Canadian Conference on Artificial Intelligence (AI'2002), volume 2338 of Lecture Notes in Artificial Intelligence (LNAI), pages 56-68, Springer
^John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf
This page gives a brief summary of the different types. SMP algorithms are classified by their scalability (trend in search speed as the number of processors becomes large) and their speedup (change in time to complete a search). Typically, programmers use scaling to mean change in nodes per second (NPS) rates, and speedup to mean change in time to depth. Scaling and scalability are thus two different concepts.
A subtype of parallel algorithms, distributed algorithms are algorithms designed to work in cluster computing and distributed computing environments, where additional concerns beyond the scope of "classical" parallel algorithms need to be addressed.
Table of Contents
Shared Hash Table
see Main page: Shared Hash TableThis technique is a very simple approach to SMP. The implementation requires little more than starting additional processors. Processors are simply fed the root position at the beginning of the search, and each searches the same tree with the only communication being the transposition table. The gains come from the effect of nondeterminism. Each processor will finish the various subtrees in varying amounts of time, and as the search continues, these effects grow making the search trees diverge. The speedup is then based on how many nodes the main processor is able to skip from transposition table entries. Many programs used this if a "quick and dirty" approach to SMP is needed. It had the reputation of little speedup on a mere 2 processors, and to scale quite badly after this.
Lazy SMP
see Main page: Lazy SMPRecent improvements by Dan Homan [2] , Martin Sedlak [3] and others on Lazy SMP indicate that the algorithm scales quite well up to 8 cores and beyond [4] .
ABDADA
see Main page: ABDADAABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right) is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill [5] . It is based on the Shared Hash Table, and adds the number of processors searching this node inside the hash-table entry for better utilization - considering the Young Brothers Wait Concept.
Parallel Alpha-Beta
These algorithms divide the Alpha-Beta tree, giving different subtrees to different processors. Because alpha-beta is a serial algorithm, this approach is much more complex. However, these techniques also provide for the greatest gains from additional processors.Principal Variation Splitting (PVS)
In Principal Variation Splitting (PVS), each node is expressed by a thread. A thread will spawn one child thread for each legal move. But data dependency specified by the algorithm exists among these threads: After getting a tighter bound from the thread corresponding to the PV node, the remaining threads are ready to run.YBWC and Jamboree
The idea in Feldmann's Young Brothers Wait Concept (YBWC) [7] [8] as well in Kuszmaul's Jamboree Search [9] [10] [11] , is to search the first sibling node first before spawning the remaining siblings in parallel. This is based on the observations that the first move is either going to produce a cutoff (in which case processing sibling nodes is wasted effort) or return much better bounds. If the first move does not produce a cut-off, then the remaining moves are searched in parallel. This process is recursive.Since the number of processors is not infinite the process of "spawning" work normally consists in putting it on some kind of "work to be done stack" where processors are free to grab work in FIFO fashion when there is no work to do. In YBW you would not "spawn" or place work on the stack until the first sibling is searched.
In their 1983 paper Improved Speedup Bounds for Parallel Alpha-Beta Search [12] , Raphael Finkel and John Philip Fishburn already gave the theoretical confirmation to the common sense wisdom that parallel resources should first be thrown into searching the first child. Assuming the tree is already in an approximation to best-first order, this establishes a good alpha value that can then be used to parallel search the later children. The algorithm in the 1982 Artificial Intelligence paper [13] , which Fishburn called the "dumb algorithm" in his 1981 thesis presentation [14] gives p^0.5 speedup with p processors, while the 1983 PAMI algorithm (the "smart algorithm") gives p^0.8 speedup for lookahead trees with the branching factor of chess.
Dynamic Tree Splitting (DTS)
Main page: Dynamic Tree SplittingThis algorithm, invented by the Cray Blitz team (including Robert Hyatt [15] ), is the most complex. Though this gives the best known scalability for any SMP algorithm, there are very few programs using it because of its difficulty of implementation.
Other Approaches
Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success due to the fact that they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.Taxonomy
Overview and taxonomy of parallel algorithms based on alpha-beta, given by Mark Brockington, ICCA Journal, Vol. 19: No. 3 in 1996 [16]Described
Control Distribution
At These Nodes
At These Nodes
Leftmost child of 3
John Philip Fishburn
Murray Campbell
Leftmost child of 3
Bruce W. Suter,
Harry Nelson
Steve Otto
with no TT-Entry
& no cutoff
Richard Korf
Tree Expansion
Yaoqing Gao
Jonathan Schaeffer
Other Considerations
Memory Design
Semaphores
During an parallel search, certain areas of memory must be protected to make sure processors do not write simultaneously and corrupt the data. Some type of semaphore system must be used. Semaphores access a piece of shared memory, typically an integer. When a processor wants to access protected memory, it reads the integer. If it is zero, meaning no other process is accessing the memory, then the processor attempts to make the integer nonzero. This whole process must be done atomically, meaning that the read, compare, and write are all done at the same time. Without this atomicity another processor could read the integer at the same time and both would see that they can freely access the memory.In chess programs that use parallel alpha-beta, usually spinlocks are used. Because the semaphores are held for such short periods of time, processors want to waste as little time as possible after the semaphore is released before acquiring access. To do this, if the semaphore is held when a processor reaches it, the processor continuously reads the semaphore. This technique can waste a lot of processing time in applications with high contention, meaning that many processes try to access the same semaphore simultaneously. In chess programs, however, we want as much processing power as possible.
Spinlocks are sometimes implemented in assembly language because the operating system does not have an API for them.
Threads vs. Processes
There are two ways of utilizing the extra processing power of multiple CPUs, threads and processes. The difference between them is that threads share all memory in the program, but there are multiple threads of execution. In processes, all memory is local to each processor except memory that is explicitly shared. This means that in a threaded program, functions must pass around an extra argument specifying which thread is active, thus which board structure to use. When using processes, a single global board can be used that will be duplicated for each process.Threads are more common, because they are easier to debug as well as implement, provided the program does not already have lots of global variables. Processes are favored by some because the need to explicitly share memory makes subtle bugs easier to avoid. Also, in processes, the extra argument to most functions is not needed.
Some programs that use threads:
Some programs that use processes:
Didactic Programs
See also
Publications
1950 ...
1970 ...
1980 ...
- Tony Marsland, Murray Campbell, A. L. Rivera (1980). Parallel Search of Game Trees. Technical Report TR 80-7, Computing Science Department, University of Alberta, pdf
- Raphael Finkel, Marvin Solomon (1980). The Arachne Kernel. Version 1.2 Technical Report 380, University of Wisconsin-Madison, pdf
- Raphael Finkel, John Philip Fishburn (1980). Parallel Alpha-Beta Search on Arachne. IEEE International Conference on Parallel Processing, pp. 235-243.
- Gérard M. Baudet, Richard P. Brent, Hsiang-Tsung Kung (1980). Parallel Execution of a Sequence of tasks on a Asynchronous Multiprocessor. Australian Computer Journal 12(3): 105-112, pdf
- Selim Akl, David T. Barnard, R.J. Doran, (1980). Design, analysis and implementation of a parallel alpha-beta algorithm, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
- Selim Akl, David T. Barnard, R.J. Doran (1980). Simulation and Analysis in Deriving Time and Storage Requirements for a Parallel Alpha-Beta Pruning Algorithm. IEEE International Conference on Parallel Processing, pp. 231-234.
- Selim Akl, David T. Barnard, R.J. Doran (1980). Searching game trees in parallel, Proceedings of the Third Biennial Conference of the Canadian Society for Computational Studies of Intelligence, Victoria, B.C.
1981- John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
- Selim Akl, R.J. Doran (1981). A comparison of parallel implementations of the alpha-beta and Scout tree search algorithms using the game of checkers, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
- Tony Marsland, Murray Campbell (1981). Parallel Search of Strongly Ordered Game Trees. Technical Report TR 81-9, Department of Computing Science , University of Alberta, pdf
1982- Tony Marsland, Murray Campbell (1982). Parallel Search of Strongly Ordered Game Trees. ACM Computing Surveys, Vol. 14, No. 4, pp. 533-551. pdf
- Tony Marsland, Murray Campbell (1982). A Study of Parallel Tree Search Algorithms. Technical Report TR 82-4, Computing Science Department, University of Alberta, pdf
- Raphael Finkel, John Philip Fishburn (1982). Parallelism in Alpha-Beta Search.Artificial Intelligence, Vol. 19, No. 1
- Monroe Newborn, (1982). OSTRICH/P - a parallel search chess program, SOCS-82.3, McGill University, School of Computer Science, Montreal.
- Selim Akl, David T. Barnard, R.J. Doran (1982). Design, Analysis, and Implementation of a Parallel Tree Search Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 4, No. 2, pp. 192-203. ISSN 0162-8828
1983- Raphael Finkel, John Philip Fishburn (1983). Improved Speedup Bounds for Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, No. 1, pp. 89 - 92
- Clyde Kruskal (1983). Searching, Merging, and Sorting in Parallel Computation. IEEE Transactions on Computers
- Gary Lindstrom (1983). The Key Node Method: A Highly-Parallel Alpha-Beta Algorithm. Technical Report UUCCS 83-101, University of Utah, pdf
19841985 ...
- Tony Marsland, Fred Popowich (1985). Parallel Game-Tree Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 7, No. 4, pp. 442-452. 1984 pdf (Draft)
- Baruch Awerbuch (1985). A New Distributed Depth-First Search Algorithm. Information Processing Letters, Vol. 20, No. 3
- Monroe Newborn (1985). A Parallel Search Chess Program. Proceedings of the ACM Annual Conference, pp. 272-277. Denver, Co.
- Clyde Kruskal, Alan Weiss (1985). Allocating Independent Subtasks on Parallel Processors. IEEE Transactions on Software Engineering, Vol. 11
- Daniel B. Leifker, Laveen N. Kanal (1985). A Hybrid SSS*/Alpha-Beta Algorithm for Parallel Search of Game Trees. IJCAI'85 » SSS* and Dual*
1986- Tony Marsland, Marius Olafsson, Jonathan Schaeffer (1986). Multiprocessor Tree-Search Experiments. Advances in Computer Chess 4
- Henri Bal, Robbert van Renesse (1986). Parallel Alpha-Beta Search. 4th NGI-SION Symposium Stimulerende Informatica. Jaarbeurs Utrecht, The Netherlands
- Henri Bal, Robbert van Renesse (1986). A Summary of Parallel Alpha-Beta Search Results. ICCA Journal, Vol 9, No. 3
- Jonathan Schaeffer (1986). Improved Parallel Alpha-Beta Searching. Proceedings ACM/IEEE Fall Joint Computer Conference, pp. 519-527.
- Rainer Feldmann, Peter Mysliwietz, Oliver Vornberger (1986). A Local Area Network Used as a Parallel Architecture. Technical Report 31, University of Paderborn
1987- Hiromoto Usui, Masafumi Yamashita, Masaharu Imai, Toshihide Ibaraki (1987). Parallel Searches of Game Tree. Systems and Computer in Japan, Vol. 18, No. 8, pp. 97-109.
1988- Jonathan Schaeffer (1988). Distributed Game-Tree Searching. Journal of Parallel and Distributed Computing, Vol. 6, No. 2, pp. 90-114.
- Chris Ferguson, Richard Korf (1988). Distributed Tree Search and its Application to Alpha-Beta Pruning. Proceedings of AAAI-88, Vol. I, pp. 128-132. Saint Paul, MN, pdf
- Monroe Newborn (1988). Unsynchronized Iterative Deepening Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, Vol. 10, No. 5, pp. 687-694. ISSN 0162-8828.
- Ed Felten, Steve Otto (1988). Chess on a Hypercube. The Third Conference on Hypercube Concurrent Computers and Applications, Vol. II-Applications
- Robert Hyatt (1988). A High-Performance Parallel Algorithm to Search Depth-First Game Trees. Ph.D. Thesis, Department of Computer Science, University of Alabama at Birmingham
- Maarten van der Meulen (1988). Parallel Conspiracy-Number Search. M.Sc. thesis, Faculty of Mathematics and Computer Science, Vrije Universteit, Amsterdam
- Matthew Huntbach, F. Warren Burton (1988). Alpha-beta search on virtual tree machines. Information Sciences, Vol. 44, No. 1
19891990 ...
- Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1991) A Fully Distributed Chess Program. Advances in Computer Chess 6, pdf
1992- Jaleh Rezaie, Raphael Finkel, (1992). A comparison of some parallel game-tree search algorithms, zipped postscript
- Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1992). Experiments with a Fully Distributed Chess Program. Heuristic Programming in AI 3
- Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1992). Distributed Game Tree Search on a Massively Parallel System. Data Structures and Efficient Algorithms, B. Monien, Th. Ottmann (eds.), Springer, Lecture Notes in Computer Science, 594, 1992, 270-288
1993- Rainer Feldmann (1993). Game Tree Search on Massively Parallel Systems. Ph.D. Thesis, pdf
- Vincent David (1993). Algorithmique parallèle sur les arbres de décision et raisonnement en temps contraint. Etude et application au Minimax = Parallel algorithm for heuristic tree searching and real-time reasoning. Study and application to the Minimax, Ph.D. Thesis, École nationale supérieure de l'aéronautique et de l'espace, Toulouse, France
- Paul Lu (1993). Parallel Search of Narrow Game Trees. M.Sc. Thesis, University of Alberta
- Van-Dat Cung (1993). Parallelizations of Game-Tree Search. PARCO 1993, pdf hosted by CiteSeerX
19941995 ...
- Bradley C. Kuszmaul (1995). The StarTech Massively Parallel Chess Program. pdf
- Henri Bal and Victor Allis (1995). Parallel Retrograde Analysis on a Distributed System. Supercomputing ’95, San Diego, CA.
- Jean-Christophe Weill (1995). Programmes d'Échecs de Championnat: Architecture Logicielle Synthèse de Fonctions d'Évaluations, Parallélisme de Recherche. Ph.D. Thesis. Université Paris 8, Saint-Denis, zipped ps (French)
- Holger Hopp, Peter Sanders (1995). Parallel Game Tree Search on SIMD Machines. IRREGULAR 1995, from CiteSeerX
- Tony Marsland and Yaoqing Gao (1995). Speculative Parallelism Improves Search? Technical Report 95-05, Department of Computing Science, University of Alberta, CiteSeerX
1996- Jean-Christophe Weill (1996). The ABDADA Distributed Minimax Search Agorithm. Proceedings of the 1996 ACM Computer Science Conference, pp. 131-138. ACM, New York, N.Y, reprinted ICCA Journal, Vol. 19, No. 1, zipped postscript
- Ulf Lorenz, Valentin Rottmann (1996) Parallel Controlled Conspiracy-Number Search - Advances in Computer Chess 8
- Yaoqing Gao, Tony Marsland (1996). Multithreaded Pruned Tree Search in Distributed Systems. Journal of Computing and Information, 2(1), 482-492, pdf
- Mark Brockington, Jonathan Schaeffer (1996). The APHID Parallel αβ Search Algorithm. Technical Report 96-07, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. as pdf from CiteSeerX
- Mark Brockington (1996). A Taxonomy of Parallel Game-Tree Search Algorithms. ICCA Journal, Vol. 19, No. 3
- Bernhard Balkenhol (1996). Problems in Sequential and Parallel Game Tree Search. Bielefeld University, zipped ps
- Ming Li, John Tromp, Paul M. B. Vitányi (1996). How to Share Concurrent Wait-Free Variables. Journal of the ACM, Vol. 43, No. 4
1997- Robert Hyatt (1997). The Dynamic Tree-Splitting Parallel Search Algorithm, ICCA Journal, Vol. 20, No. 1
- Andrew Tridgell (1997). KnightCap — a parallel chess program on the AP1000+. zipped ps » KnightCap
- Mark Brockington, Jonathan Schaeffer (1997). APHID Game-Tree Search. Advances in Computer Chess 8
- David Sturgill and Alberto Maria Segre (1997). Nagging: A Distributed, Adversarial Search-Pruning Technique Applied to First-Order Inference. Journal of Automated Reasoning, Vol. 19, No. 3 [23]
1998- Mark Brockington (1998). Asynchronous Parallel Game-Tree Search. Ph.D. Thesis, University of Alberta, zipped postscript
- Rainer Feldmann, Burkhard Monien (1998). Selective Game Tree Search on a Cray T3E. ps
- Craig S. Bruce (1998). Performance Optimization for Distributed-Shared-Data Systems. Ph.D. thesis, University of Waterloo
19992000 ...
- John Romein, Henri Bal, Dick Grune (2000). The Multigame Reference Manual. Vrije Universiteit, pdf
2001- John Romein (2001). Multigame - An Environment for Distributed Game-Tree Search. Ph.D. thesis, Vrije Universiteit, supervisor Henri Bal, pdf
- Yaron Shoham, Sivan Toledo (2001). Parallel randomized best-first minimax search. School of Computer Science, Tel-Aviv University, pdf
- Valavan Manohararajah (2001). Parallel Alpha-Beta Search on Shared Memory Multiprocessors. Masters Thesis, pdf
- Florian Schintke, Jens Simon, Alexander Reinefeld (2001). A Cache Simulator for Shared Memory Systems. International Conference on Computational Science ICCS 2001, San Francisco, CA, Springer LNCS 2074, vol. 2, pp. 569-578. zipped ps
2002- Akihiro Kishimoto, Jonathan Schaeffer. (2002). Distributed Game-Tree Search Using Transposition Table Driven Work Scheduling, In Proc. of 31st International Conference on Parallel Processing (ICPP'02), pages 323-330, IEEE Computer Society Press. pdf via CiteSeerX
- Akihiro Kishimoto, Jonathan Schaeffer. (2002). Transposition Table Driven Work Scheduling in Distributed Game-Tree Search (Best Paper Prize), In Proc. of Fifteenth Canadian Conference on Artificial Intelligence (AI'2002), volume 2338 of Lecture Notes in Artificial Intelligence (LNAI), pages 56-68, Springer
- John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf » Transposition Table [26]
- Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence 140, pdf, pdf
2003- Brian Greskamp (2003). Parallelizing a Simple Chess Program. pdf
20042005 ...
- Jan Renze Steenhuisen (2005). Transposition-Driven Scheduling in Parallel Two-Player State-Space Search. Masters Thesis, pdf
2006- Edward A. Lee (2006). The Problem with Threads. Technical Report No. UCB/EECS-2006-1, University of California, Berkeley, pdf
2007- Vladan Vučković (2007). The Method of the Chess Search Algorithms - Parallelization using Two-Processor distributed System, pdf
- Tristan Cazenave, Nicolas Jouandeau (2007). On the Parallelization of UCT. CGW 2007, pdf » UCT
- Keijirou Yanagi, Kazutomo Shibahara, Yasuhiro Tajima, Yoshiyuki Kotani (2007). Multiple Parallel Search in Shogi. 12th Game Programming Workshop
2008- Guillaume Chaslot, Mark Winands, Jaap van den Herik (2008). Parallel Monte-Carlo Tree Search. CG 2008, pdf
- Tristan Cazenave, Nicolas Jouandeau (2008). A Parallel Monte-Carlo Tree Search Algorithm. CG 2008, pdf
- Sylvain Gelly, Jean-Baptiste Hoock, Arpad Rimmel, Olivier Teytaud, Yann Kalemkarian (2008). The Parallelization of Monte-Carlo Planning - Parallelization of MC-Planning. ICINCO-ICSO 2008: 244-249, pdf, slides as pdf
- Kai Himstedt, Ulf Lorenz, Dietmar P. F. Möller (2008). A Twofold Distributed Game-Tree Search Approach Using Interconnected Clusters. Euro-Par 2008: 587-598, abstract from springerlink
- Tristan Cazenave, Nicolas Jouandeau (2008). A Parallel Monte-Carlo Tree Search Algorithm. pdf
- James Swafford (2008). A Survey of Parallel Search Algorithms over Alpha-Beta Search Trees using Symmetric Multiprocessor Machines. Masters Project, East Carolina University, advisor Ronnie Smith
20092010 ...
- Amine Bourki, Guillaume Chaslot, Matthieu Coulm, Vincent Danjean, Hassen Doghmen, Thomas Hérault, Jean-Baptiste Hoock, Arpad Rimmel, Fabien Teytaud, Olivier Teytaud, Paul Vayssière, Ziqin Yu (2010). Scalability and Parallelization of Monte-Carlo Tree Search. CG 2010, pdf
- Tsan-sheng Hsu (2010). Parallel Alpha-Beta Based Game Tree Search, slides as pdf
- Dan Wu, Pan Chen, Kui Dai, Jinli Rao, Xuecheng Zou (2010). Implementation of Parallel Game Tree Search on a SIMD System. Huazhong University of Science & Technology, Wuhan, China, ICIE 2010, Vol. 1
- Tomoyuki Kaneko (2010). Parallel Depth First Proof Number Search. AAAI 2010 » Proof-Number Search
2011- Jean Méhat, Tristan Cazenave (2011). A Parallel General Game Player. KI Journal, Vol. 25, No. 1, pdf
- Kazuki Yoshizoe, Akihiro Kishimoto, Tomoyuki Kaneko, Haruhiro Yoshimoto, Yutaka Ishikawa (2011). Scalable Distributed Monte Carlo Tree Search. SoCS2011, pdf
- Ľubomír Lackovič (2011). Parallel Game Tree Search Using GPU. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, pdf » GPU
- Khondker Shajadul Hasan (2011). A Distributed Chess Playing Software System Model Using Dynamic CPU Availability Prediction. SERP 2011, pdf
- John Mellor-Crummey (2011). Shared-memory Parallel Programming with Cilk. Rice University, slides as pdf » Cilk
- Lars Schaefers, Marco Platzner, Ulf Lorenz (2011). UCT-Treesplit - Parallel MCTS on Distributed Memory. MCTS Workshop, Freiburg, Germany, pdf » UCT
- Tobias Graf, Ulf Lorenz, Marco Platzner, Lars Schaefers (2011). Parallel Monte-Carlo Tree Search for HPC Systems. Euro-Par 2011, pdf
- Damjan Strnad, Nikola Guid (2011). Parallel Alpha-Beta Algorithm on the GPU. CIT. Journal of Computing and Information Technology, Vol. 19, No. 4 » GPU, Reversi
2012- Chih-Hung Chen, Shun-Shii Lin, Min-Huei Huang (2012). Volunteer Computing System Applied to Computer Games. TCGA 2012 Workshop, pdf
- Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012). A Node-based Parallel Game Tree Algorithm Using GPUs. CLUSTER 2012, pdf » GPU
2013- Kunihito Hoki, Tomoyuki Kaneko, Akihiro Kishimoto, Takeshi Ito (2013). Parallel Dovetailing and its Application to Depth-First Proof-Number Search. ICGA Journal, Vol. 36, No. 1 » Proof-Number Search [27]
- Jakub Pawlewicz, Ryan Hayward (2013). Scalable Parallel DFPN Search. CG 2013
- Georg Hager [28], Jan Treibig, Gerhard Wellein (2013). The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems. RRZE, SC13, slides as pdf
20142015 ...
Forum Posts
1995 ...
2000 ...
2005 ...
- Iterative DTS by Fritz Reul, CCC, July 02, 2007
- multithreading questions by Martin Fierz, CCC, August 08, 2007
- re-inventing the SMP wheel by Harm Geert Muller, CCC, August 15, 2007
- SMP thread goes here by Robert Hyatt, CCC, August 29, 2007
- A Few General Questions on Parallel Search by Pradu Kannan, Winboard Forum, August 31, 2007
- interested in making single processor program multi by Mike Adams, CCC, December 28, 2007
2008- If making an SMP engine, do NOT use processes by Zach Wegner, CCC, February 07, 2008
- Questions about getting ready for multicore programming by Carey, CCC, April 01, 2008
- Minimizing Sharing of Data between Physical Processors by Pradu Kannan, CCC, May 19, 2008
- threads vs processes by Robert Hyatt, CCC, July 16, 2008
- threads vs processes again by Robert Hyatt, CCC, August 05, 2008
- Authors of WinBoard SMP engines, take note! by Harm Geert Muller, CCC, October 11, 2008 » Chess Engine Communication Protocol
- Cluster Rybka by Robert Hyatt, CCC, November 10, 2008
- UCI protocol and SMP by Aart Bik, CCC, November 13, 2008 » UCI
2009Results from UCT parallelization by Gian-Carlo Pascutto, CCC, March 11, 2009
2010 ...
- asynchronous search by Daniel Shawul, CCC, April 6, 2010
- SMP basics by Richard Allbert, CCC, April 09, 2010
- DTS Structure by Edmund Moshammer, CCC, May 28, 2010 » Dynamic Tree Splitting, Iterative Search
- DTS-ification of YBWC by Marco Costalba, CCC, June 01, 2010
- SMP speed up by Miguel A. Ballicora, CCC, September 14, 2010
- SMP questions by Harm Geert Muller, CCC, September 19, 2010
2011- On parallelization by Onno Garms, CCC, March 13, 2011 » Onno
- AMD Phenom Hex core (SMP performance problem) by Miguel A. Ballicora, CCC, April 04, 2011
- SMP for Android UCI engines by Aart Bik, CCC, April 14, 2011 » Android
- Questions on SMP search by Ben-Hur Carlos Vieira Langoni Junior, CCC, April 21, 2011
- Some spinlock code, just for you by Steven Edwards, CCC, June 01, 2011
- Spinlocks galore by Steven Edwards, CCC, June 02, 2011
- LIFO stack based parallel processing? by Srdja Matovic, CCC, September 22, 2011
- Some questions on split points by Edward Yu, CCC, November 09, 2011
- Question on parallel search by Michel Van den Bergh, CCC, December 27, 2011
2012- Parallelization questions, ABDADA or DTS? by Benjamin Rosseaux, CCC, March 23, 2012 » ABDADA, Dynamic Tree Splitting
- algorithms - distributed alpha beta pruning by wirate, Computer Science Stack Exchange, April 2, 2012
- YBWC: Active Reparenting by Marco Costalba, CCC, April 10, 2012 » Young Brothers Wait Concept
- Virtualization and multi-processor engines by Russell Reagan, CCC, August 16, 2012
- How to measure Parallel Search Speedup? by Marcel Fournier, CCC, August 23, 2012
- SMP and questions by Edsel Apostol, CCC, November 23, 2012
- Lazy SMP by Julien Marcel, CCC, December 27, 2012 » Lazy SMP
2013- Lazy SMP, part 2 by Daniel Homan, CCC, January 12, 2013
- Parallel search: System-level programming details by Álvaro Begué, CCC, February 09, 2013
- SMP search in Viper and idea about search in cluster system by Chao Ma, CCC, February 22, 2013 » Viper
- Message passing parallel search on SMP system by Daniel Shawul, CCC, March 07, 2013 [30]
- Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013 [31]
- Parallel search slowdown? by Evert Glebbeek, CCC, April 12, 2013
- Implementation of multithreaded search in Jazz by Evert Glebbeek, CCC, April 20, 2013 » Jazz
- parallel search speedup/overhead by Robert Hyatt, CCC, May 06, 2013
- Peculiarity of Komodo 5.1MP by Kai Laskos, CCC, June 19, 2013 » Komodo
- back to the Komodo SMP issue by Robert Hyatt, CCC, July 01, 2013
- Measure of SMP scalability by Edsel Apostol, CCC, July 03, 2013 » Hannibal
- Lazy SMP and Work Sharing by Dan Homan, CCC, July 03, 2013 » Lazy SMP in EXChess
- use sleeping threads by Don Dailey, CCC, July 10, 2013 » Stockfish
- Recursive DTS-like search algorithm by Peter Österlund, CCC, July 24, 2013 » Texel, Recursion
- DTS-like SMP by ThinkingALot, OpenChess Forum, July 25, 2013 » Gull
- Time to depth measuring tool by Peter Österlund, CCC, July 28, 2013 » Depth
- interesting SMP bug by Robert Hyatt, CCC, September 24, 2013 » Crafty
- SMP and Thread Pool Design pattern by Edsel Apostol, CCC, October 02, 2013
2014Measure of SMP scalability (sub-thread) by Ernest Bonnem, CCC, July 08, 2013
2015 ...
- SMP: on same branch instead splitting? by Frank Ludwig, CCC, January 23, 2015
- Lazy SMP in Cheng by Martin Sedlak, CCC, February 02, 2015 » Cheng
- Some SMP measurements with Rookie v3 by Marcel van Kervinck, CCC, February 05, 2015 » Rookie
- Stockfish with 16 threads - big news? by Louis Zulli, CCC, February 15, 2015 » Thread
- Parallel iterative search function by Fabio Gobbato, CCC, March 05, 2015 » Iterative Search
- An MPI perft program by Chao Ma, CCC, April 05, 2015 » Perft [32]
- Crude parallel search by Carl Bicknell, CCC, April 07, 2015
- Trying to improve lazy smp by Daniel José Queraltó, CCC, April 11, 2015
- Empirical results with Lazy SMP, YBWC, DTS by Kai Laskos, CCC, April 16, 2015
- DTS-Algorithm Split Block by JacobusOpperman, OpenChess Forum, May 19, 2015
- The effective speedup from 1 to 8 cpus for SF and Komodo by Adam Hair, CCC, May 31, 2015 » Stockfish, Komodo
- parallel speedup and assorted trivia by Robert Hyatt, CCC, June 05, 2015 » Crafty
- There are compilers and there are compilers by Robert Hyatt, CCC, June 24, 2015
- Parallel search once more by Robert Hyatt, CCC, June 25, 2015
- SMP speedup discussion by Robert Hyatt, CCC, July 05, 2015
- Possible improvement to ABDADA -- checking for cutoffs by Tom Kerrigan, CCC, July 10, 2015 » ABDADA
- Actual speedups from YBWC and ABDADA on 8+ core machines? by Tom Kerrigan, CCC, July 10, 2015 » ABDADA, Young Brothers Wait Concept
- New SMP stuff (particularly Kai) by Robert Hyatt, CCC, July 20, 2015
- SMP (new style) by Ed Schröder, CCC, July 20, 2015
- scorpio can run on 8192 cores by Daniel Shawul, CCC, August 22, 2015 » Scorpio
- lazy smp questions by Lucas Braesch, CCC, September 09, 2015 » Lazy SMP
- Lazy SMP Better than YBWC? by Steve Maughan, CCC, October 23, 2015 » Young Brothers Wait Concept
- New Stockfish with Lazy_SMP, but what about the TC bug ? by Ernest Bonnem, CCC, October 26, 2015 » Stockfish, TCEC Season 8
- An interesting parallel search non-bug by Robert Hyatt, CCC, November 05, 2015 » Crafty
- lazy smp questions by Marco Belli, CCC, December 21, 2015 » Lazy SMP
2016Explanation for non-expert? by Louis Zulli, CCC, February 16, 2015 » Stockfish
Best Stockfish NPS scaling yet by Louis Zulli, CCC, March 02, 2015
- NUMA 101 by Robert Hyatt, CCC, January 07, 2016 » NUMA
- Using more than 1 thread in C beginner question by Uri Blass, CCC, January 11, 2016 » C
- Threads test incl. Stockfish 7 by Andreas Strangmüller, CCC, January 11, 2016 » Thread, Stockfish
- Parallel Search to Fixed Depth by Brian Richardson, CCC, January 17, 2016
- Threads test incl. Komodo 9.3 by Andreas Strangmüller, CCC, January 17, 2016 » Komodo
- Lazy SMP - how it works by Kalyankumar Ramaseshan, CCC, February 29, 2016 » Lazy SMP
- Crafty SMP measurement by Robert Hyatt, CCC, April 04, 2016 » Crafty
- parallel search speed measurement by Robert Hyatt, CCC, May 24, 2016
- Crazy SMP by Harm Geert Muller, CCC, June 19, 2016
- NUMA in a YBWC implementation by Edsel Apostol, CCC, July 20, 2016 » Young Brothers Wait Concept
- lazy smp using ms vs2015 c++11 std::async by Edward Yu, CCC, July 29, 2016 » Lazy SMP, Thread [33]
- Time to depth concerns by Carl Bicknell, CCC, August 15, 2016
- lets get the ball moving down the field on numa awareness by Mohammed Li, FishCooking, August 30, 2016 » NUMA, Stockfish, asmFish
- Baffling multithreading scaling behavior by Tom Kerrigan, CCC, September 06, 2016
- What do you do with NUMA? by Matthew Lai, CCC, September 19, 2016 » NUMA
- Scaling of Asmfish with large thread count by Dann Corbit, CCC, October 07, 2016 » asmFish
- Stockfish 8 - Double time control vs. 2 threads by Andreas Strangmüller, CCC, November 15, 2016 » Doubling TC, Diminishing Returns, Playing Strength, Stockfish
2017Re: Baffling multithreading scaling behavior by Robert Hyatt, CCC, September 07, 2016
Approximate ABDADA by Peter Österlund, CCC, August 23, 2017 » ABDADA
External Links
Parallel Search
Parallel Computing
Scalability
Shared Memory
False sharing from Wikipedia
Cache
MSI protocol from Wikipedia
MESI protocol from Wikipedia
MOESI protocol from Wikipedia
assembly - The prefetch instruction - Stack Overflow
Data Prefetch Support - GNU Project - Free Software Foundation (FSF)
Software prefetching considered harmful by Linus Torvalds, LWN.net, May 19, 2011
Concurrency and Synchronization
Cooperating sequential processes (EWD 123)
A challenge to memory designers? (EWD 497)
Misc
References
What links here?
Up one level