
Parallel Search, also known as Multithreaded Search or SMP Search, is a way to increase search speed by using additional processors. The topic has been gaining popularity recently as multiprocessor computers have become widely available. Utilizing these additional processors is an interesting domain of research, since traversing a search tree is inherently serial. Several approaches have been devised, the most popular today being the Young Brothers Wait Concept and the Shared Hash Table.

This page gives a brief summary of the different types. SMP algorithms are classified by their scalability (trend in search speed as the number of processors becomes large) and their speedup (change in time to complete a search). Typically, programmers use scaling to mean change in nodes per second (NPS) rates, and speedup to mean change in time to depth. Scaling and scalability are thus two different concepts.

A subtype of parallel algorithms, distributed algorithms are algorithms designed to work in cluster computing and distributed computing environments, where additional concerns beyond the scope of "classical" parallel algorithms need to be addressed.
Parallel scalability [1]

Shared Hash Table

see Main page: Shared Hash Table

This technique is a very simple approach to SMP. The implementation requires little more than starting additional processors. Processors are simply fed the root position at the beginning of the search, and each searches the same tree, with the transposition table as the only communication. The gains come from the effect of nondeterminism: each processor finishes the various subtrees in varying amounts of time, and as the search continues these effects grow, making the search trees diverge. The speedup is then based on how many nodes the main processor is able to skip thanks to transposition table entries. Many programs use this when a "quick and dirty" approach to SMP is needed. It has the reputation of yielding little speedup on a mere 2 processors, and of scaling quite badly beyond that.

Lazy SMP

Recent improvements by Dan Homan [2], Martin Sedlak [3] and others on Lazy SMP indicate that the algorithm scales quite well up to 8 cores and beyond [4].
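A minimal sketch of the scheme (illustrative only: in a real engine `search` would be the normal iterative-deepening alpha-beta reading and writing the shared transposition table, whereas here it only counts work so the sketch is self-contained): all threads search the same root position, with helpers given slightly staggered depths so their trees diverge and they seed the shared table for the main thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Stand-in for the engine's alpha-beta search; a real implementation
// would probe and store the shared transposition table.
std::atomic<long> nodes{0};
void search(int rootPos, int depth) { nodes += depth; (void)rootPos; }

// Lazy SMP: every thread runs the same search on the same root position.
// Odd-numbered helpers search one ply deeper, so the threads' trees
// diverge and the shared table fills in ahead of the main thread.
void lazy_smp(int rootPos, int maxDepth, int numThreads) {
    for (int depth = 1; depth <= maxDepth; ++depth) {
        std::vector<std::thread> helpers;
        for (int i = 1; i < numThreads; ++i)
            helpers.emplace_back(search, rootPos, depth + (i & 1)); // stagger
        search(rootPos, depth);            // main thread searches too
        for (auto& t : helpers) t.join();  // sync before the next iteration
    }
}
```

The appeal is that no splitting logic, work stack, or explicit synchronisation beyond the shared table is needed.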

ABDADA

see Main page: ABDADA

ABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right) is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill [5]. It is based on the Shared Hash Table, and stores the number of processors currently searching a node inside its hash-table entry for better utilization, while observing the Young Brothers Wait Concept.

Parallel Alpha-Beta

These algorithms divide the Alpha-Beta tree, giving different subtrees to different processors. Because alpha-beta is a serial algorithm, this approach is much more complex. However, these techniques also provide for the greatest gains from additional processors.

Principal Variation Splitting (PVS)

In Principal Variation Splitting (PVS), each node is represented by a thread. A thread spawns one child thread for each legal move, but a data dependency exists among these threads: only after the thread corresponding to the PV node returns a tighter bound are the remaining threads ready to run.
PV Splitting [6]

YBWC and Jamboree

The idea in Feldmann's Young Brothers Wait Concept (YBWC) [7] [8], as well as in Kuszmaul's Jamboree Search [9] [10] [11], is to search the first sibling node before spawning the remaining siblings in parallel. This is based on the observation that the first move is either going to produce a cutoff (in which case processing the sibling nodes is wasted effort) or return much better bounds. If the first move does not produce a cutoff, the remaining moves are searched in parallel. This process is recursive.

Since the number of processors is not infinite, "spawning" work normally consists of putting it on some kind of "work to be done" stack, from which processors with nothing to do grab work in FIFO fashion. In YBW, work is not spawned or placed on the stack until the first sibling has been searched.
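The recursion can be sketched as follows, with `std::async` standing in for the work stack described above and a toy game tree instead of a real move generator (all names are illustrative):

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Toy game tree: an interior node holds children; a leaf holds a score
// from the perspective of the side to move at that leaf (negamax).
struct Node { int score = 0; std::vector<Node> children; };

// YBWC sketch: the eldest brother is searched serially to establish a
// bound; only if it fails to produce a cutoff are the younger brothers
// handed out in parallel with that bound.
int search(const Node& n, int alpha, int beta) {
    if (n.children.empty()) return n.score;
    // Young Brothers Wait: first child first, on this thread.
    alpha = std::max(alpha, -search(n.children[0], -beta, -alpha));
    if (alpha >= beta) return alpha;   // cutoff: siblings never spawned
    std::vector<std::future<int>> jobs;
    for (size_t i = 1; i < n.children.size(); ++i)
        jobs.push_back(std::async(std::launch::async, search,
                                  std::cref(n.children[i]), -beta, -alpha));
    for (auto& j : jobs)
        alpha = std::max(alpha, -j.get());
    return alpha;                      // fail-soft result
}
```

Note that the parallel siblings all run with the bound established by the eldest brother; they cannot tighten each other's windows, which is the efficiency price paid for the parallelism.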

In their 1983 paper Improved Speedup Bounds for Parallel Alpha-Beta Search [12], Raphael Finkel and John Philip Fishburn gave theoretical confirmation of the common-sense wisdom that parallel resources should first be thrown into searching the first child. Assuming the tree is already in an approximation of best-first order, this establishes a good alpha value that can then be used to search the later children in parallel. The algorithm in the 1982 Artificial Intelligence paper [13], which Fishburn called the "dumb algorithm" in his 1981 thesis presentation [14], gives a p^0.5 speedup with p processors, while the 1983 PAMI algorithm (the "smart algorithm") gives a p^0.8 speedup for lookahead trees with the branching factor of chess.

Dynamic Tree Splitting (DTS)

Main page: Dynamic Tree Splitting

This algorithm, invented by the Cray Blitz team (including Robert Hyatt [15]), is the most complex. Though it gives the best known scalability of any SMP algorithm, very few programs use it because of its implementation difficulty.

Other Approaches

Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success because they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.


Taxonomy

Overview and taxonomy of parallel algorithms based on alpha-beta, given by Mark Brockington, ICCA Journal, Vol. 19: No. 3 in 1996 [16]
| First Described | Algorithm | Author(s) | Processor Hierarchy / Control Distribution | Parallelism Possible At These Nodes | Synchronisation Done At These Nodes |
| --- | --- | --- | --- | --- | --- |
| 1978 | Parallel Aspiration Search | Gérard M. Baudet | Static/Centralized | Root (αβ-Window) | Root |
| 1979 | Mandatory Work First | Selim Akl et al. | Static/Centralized | Type-1+3 + leftmost child of 3 | Bad Type-2 |
| 1980 | Tree Splitting | Raphael Finkel, John Philip Fishburn | Static/Centralized | Top k-ply | Root |
| 1981 | PVS | Tony Marsland, Murray Campbell | Static/Centralized | Type-1 | Type-1 |
| 1983 | Key Node | Gary Lindstrom | Static/Centralized | Type-1+3 + leftmost child of 3 | Bad Type-2 |
| 1986 | UIDPABS [17] | Monroe Newborn | Static/Centralized | Root | None |
| 1987 | DPVS | Jonathan Schaeffer | Dynamic/Centralized | Type-1+3+2 | Type-1+3+Bad 2 |
| 1987 | EPVS | Robert Hyatt, Bruce W. Suter, Harry Nelson | Dynamic/Centralized | Type-1+3 | Type-1+3 |
| 1987 | Waycool | Ed Felten, Steve Otto | Dynamic/Distributed | All, except Type-2 with no TT-entry | Nodes with TT & no cutoff |
| 1987 | YBWC | Rainer Feldmann | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2 |
| 1988 | Dynamic Tree Splitting | Robert Hyatt | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2 |
| 1988 | Bound-and-Branch | Chris Ferguson, Richard Korf | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2 |
| 1990 | Delayed Branch Tree Expansion | Feng-hsiung Hsu | Static/Centralized | Type-1+3 | Bad Type-2 |
| 1993 | Frontier Splitting | Paul Lu | Dynamic/Distributed | All | Root |
| 1993 | αβ* | Vincent David | Dynamic/Distributed | Type-1+3 | Type-1+3+Bad 2 |
| 1994 | CABP [18] | Van-Dat Cung | Static/Centralized | Type-1+3 | Bad Type-2 |
| 1994 | Jamboree | Bradley Kuszmaul | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2 |
| 1995 | ABDADA | Jean-Christophe Weill | Dynamic/Distributed | Type-1+3+Bad 2 | Type-1+Bad 2 |
| 1995 | Dynamic Multiple PV-Split | Tony Marsland, Yaoqing Gao | Dynamic/Distributed | Nodes within PV set | Nodes within PV set |
| 1996 | APHID | Mark Brockington, Jonathan Schaeffer | Static/Centralized | Top k-ply | None |

Comparison

In 1994, Robert Hyatt [19] reported some scaling comparisons of different parallel search algorithms. The values indicate the speedup achieved on certain test positions with n processors.

| Algorithm | 1 | 2 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- | --- |
| PVS | 1.0 | 1.8 | 3.0 | 4.1 | 4.6 |
| EPVS | 1.0 | 1.9 | 3.4 | 5.4 | 6.0 |
| DTS | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |

Other Considerations

Semaphores

During a parallel search, certain areas of memory must be protected to ensure that processors do not write to them simultaneously and corrupt the data. Some type of semaphore system must be used. A semaphore guards a piece of shared memory, typically an integer. When a processor wants to access the protected memory, it reads the integer. If it is zero, meaning no other process is accessing the memory, the processor attempts to make the integer nonzero. This whole sequence must be done atomically, meaning that the read, compare, and write happen as one indivisible operation. Without this atomicity, another processor could read the integer at the same time and both would conclude that they may freely access the memory.

Chess programs that use parallel alpha-beta usually rely on spinlocks. Because the semaphores are held for such short periods of time, a processor should waste as little time as possible between the release of a semaphore and its next acquisition. To achieve this, a processor that finds the semaphore held simply keeps re-reading it in a tight loop. This technique can waste a lot of processing time in applications with high contention, where many processes try to access the same semaphore simultaneously; in chess programs, however, locks are held so briefly that the low acquisition latency is worth the spinning.

Spinlocks are sometimes implemented in assembly language because the operating system does not expose an API for them.

Threads vs. Processes

There are two ways of utilizing the extra processing power of multiple CPUs: threads and processes. The difference between them is that threads share all of the program's memory, while each process's memory is private except for memory that is explicitly shared. This means that in a threaded program, functions must pass around an extra argument specifying which thread is active, and thus which board structure to use. When using processes, a single global board can be used, which is duplicated for each process.

Threads are more common, because they are easier to debug as well as to implement, provided the program does not already have lots of global variables. Processes are favored by some because the need to explicitly share memory makes subtle bugs easier to avoid; also, with processes the extra argument to most functions is not needed.

Some programs that use threads:

Some programs that use processes:

Didactic Programs


See also


Publications

1950 ...

1970 ...

1980 ...

1981
  • John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
  • Selim Akl, R.J. Doran (1981). A comparison of parallel implementations of the alpha-beta and Scout tree search algorithms using the game of checkers, Department of Computing and Information Science, Queen's University, Kingston, Ontario.
  • Tony Marsland, Murray Campbell (1981). Parallel Search of Strongly Ordered Game Trees. Technical Report TR 81-9, Department of Computing Science , University of Alberta, pdf
1982
1983
1984

1985 ...

1986
1987
1988
1989

1990 ...

1992
1993
1994

1995 ...

1996
1997
1998
1999

2000 ...

2001
2002
2003
2004

2005 ...

  • Jan Renze Steenhuisen (2005). Transposition-Driven Scheduling in Parallel Two-Player State-Space Search. Masters Thesis, pdf
2006
2007
2008
2009

2010 ...

2011
2012
2013
2014

2015 ...


Forum Posts

1995 ...

2000 ...

2005 ...

2010 ...

2011
2012
2013
2014

2015 ...

2016

External Links

Parallel Search


Parallel Computing


Scalability


Shared Memory


Cache


Concurrency and Synchronization


Misc


References

  1. ^ Super Linearity and the Bigger Machine from Dr. Dobb's
  2. ^ Lazy SMP, part 2 by Daniel Homan, CCC, January 12, 2013
  3. ^ Lazy SMP in Cheng by Martin Sedlak, CCC, February 02, 2015
  4. ^ Re: A new chess engine : m8 (comming not so soon) by Peter Österlund, CCC, February 01, 2015
  5. ^ Jean-Christophe Weill (1996). The ABDADA Distributed Minimax Search Algorithm. Proceedings of the 1996 ACM Computer Science Conference, pp. 131-138. ACM, New York, N.Y, reprinted ICCA Journal, Vol. 19, No. 1, zipped postscript
  6. ^ Yaoqing Gao, Tony Marsland (1996). Multithreaded Pruned Tree Search in Distributed Systems. Journal of Computing and Information, 2(1), 482-492, pdf
  7. ^ Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1991). A Fully Distributed Chess Program. Advances in Computer Chess 6, pdf
  8. ^ Rainer Feldmann (1993). Game Tree Search on Massively Parallel Systems. Ph.D. Thesis, pdf
  9. ^ Christopher F. Joerg, Bradley C. Kuszmaul (1994). Massively Parallel Chess, pdf
  10. ^ Bradley C. Kuszmaul (1994). Synchronized MIMD Computing. Ph. D. Thesis, Massachusetts Institute of Technology, pdf
  11. ^ Bradley C. Kuszmaul (1995). The StarTech Massively Parallel Chess Program. pdf
  12. ^ Raphael Finkel, John Philip Fishburn (1983). Improved Speedup Bounds for Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, No. 1, pp. 89 - 92
  13. ^ Raphael Finkel, John Philip Fishburn (1982). Parallelism in Alpha-Beta Search. Artificial Intelligence, Vol. 19, No. 1
  14. ^ John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
  15. ^ Robert Hyatt (1994). The DTS high-performance parallel tree search algorithm
  16. ^ Mark Brockington (1996). A Taxonomy of Parallel Game-Tree Search Algorithms. ICCA Journal, Vol. 19: No. 3
  17. ^ UIDPABS = Unsynchronized Iteratively Deepening Parallel Alpha-Beta Search
  18. ^ CABP = Concurrent Alpha-Beta Pruning
  19. ^ Robert Hyatt (1994). The DTS high-performance parallel tree search algorithm
  20. ^ It should also be noted that Crafty uses threads on Windows, and used processes on Unix
  21. ^ threads vs processes again by Robert Hyatt, CCC, February 27, 2006
  22. ^ Jonathan Schaeffer (1985). Lionel Moser: An Experiment in Distributed Game Tree Searching. ICCA Journal, Vol. 8, No. 2 (Review)
  23. ^ An Introduction to Computer Chess by Alejandro López-Ortiz, 1993
  24. ^ Nagging from Wikipedia
  25. ^ Re: scorpio can run on 8192 cores by Daniel Shawul, CCC, August 29, 2015
  26. ^ Transposition-driven scheduling - Wikipedia
  27. ^ Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013
  28. ^ Dovetailing (computer science) from Wikipedia
  29. ^ Message Passing Interface from Wikipedia
  30. ^ John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf
  31. ^ Message Passing Interface from Wikipedia
