[Figure: Parallel scalability [1]]

Parallel Search, also known as Multithreaded Search or SMP Search, is a way to increase search speed by using additional processors. The topic has gained popularity as multiprocessor computers have become widely available, and utilizing these additional processors is an interesting domain of research, since traversing a search tree is inherently serial. Several approaches have been devised, the most popular today being the Young Brothers Wait Concept and the Shared Hash Table.

This page gives a brief summary of the different types. SMP algorithms are classified by their scalability (the trend in search speed as the number of processors becomes large) and their speedup (the change in time to complete a search). Typically, programmers use scaling to mean the change in nodes-per-second (NPS) rate, and speedup to mean the change in time to depth; scaling and scalability are thus two different concepts.

A subtype of parallel algorithms, distributed algorithms are algorithms designed to work in cluster computing and distributed computing environments, where additional concerns beyond the scope of "classical" parallel algorithms need to be addressed.

Shared Hash Table

Main page: Shared Hash Table
This technique is a very simple approach to SMP. The implementation requires little more than starting additional processors: each processor is fed the root position at the beginning of the search, and each searches the same tree, with the transposition table as the only means of communication. The gains come from the effect of nondeterminism: the processors finish the various subtrees in varying amounts of time, and as the search continues these differences grow, making the searched trees diverge. The speedup is then determined by how many nodes the main processor is able to skip thanks to transposition table entries.

Many programs use this technique when a "quick and dirty" approach to SMP is needed. Unfortunately, it gives little speedup even on 2 processors, and scales quite badly beyond that. The NPS scaling, however, is nearly perfect.
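A minimal sketch of the idea, assuming C++ threads and an alpha-beta routine that is left out; all names here are illustrative, not taken from any particular engine. Each entry is packed into a single 64-bit word so a racy read or write cannot tear (real engines often XOR the key with the data instead):

    #include <atomic>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // Illustrative shared transposition table: every thread probes and
    // stores into the same array with no locking.
    struct TT {
        std::vector<std::atomic<uint64_t>> entries;
        explicit TT(size_t size) : entries(size) {}
        uint64_t probe(uint64_t key) const {
            return entries[key % entries.size()].load(std::memory_order_relaxed);
        }
        void store(uint64_t key, uint64_t packed) {
            entries[key % entries.size()].store(packed, std::memory_order_relaxed);
        }
    };

    struct Position { /* board state, hash key, ... */ };

    // Hypothetical entry point: every thread runs an ordinary serial
    // alpha-beta over the SAME root position; the searches diverge only
    // through timing and shared transposition-table hits.
    void search_root(const Position& root, TT& tt) { /* ... */ }

    int main() {
        TT tt(1 << 20);                 // the table is the only shared state
        Position root{};                // the root position fed to everyone
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back(search_root, std::cref(root), std::ref(tt));
        for (auto& t : pool) t.join();  // wait for all searchers
    }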

ABDADA

ABDADA, Alpha-Bêta Distribué avec Droit d'Aînesse (Distributed Alpha-Beta Search with Eldest Son Right), is a loosely synchronized, distributed search algorithm by Jean-Christophe Weill [2]. It is based on the Shared Hash Table, and stores the number of processors currently searching a node inside the corresponding hash-table entry for better utilization, in the spirit of the Young Brothers Wait Concept.
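The resulting move loop can be sketched roughly as follows; the names, stubs, and two-pass structure are illustrative assumptions, not Weill's original code:

    #include <algorithm>
    #include <vector>

    struct Move { /* from, to, ... */ };

    // Stubbed engine internals, for illustration only.
    int nproc_of(const Move&) { return 0; }   // processors already on m's child
    int search_move(const Move&, int alpha, int, int) { return alpha; }

    // ABDADA-style move loop: the eldest son is always searched at once;
    // on the first pass, children that some other processor is already
    // searching (nproc > 0) are deferred; a second pass picks up the
    // deferred moves if no cutoff occurred in the meantime.
    int abdada_search(const std::vector<Move>& moves,
                      int alpha, int beta, int depth) {
        alpha = std::max(alpha, search_move(moves[0], alpha, beta, depth));
        if (alpha >= beta) return alpha;

        std::vector<Move> deferred;
        for (size_t i = 1; i < moves.size(); ++i) {
            if (nproc_of(moves[i]) > 0) { deferred.push_back(moves[i]); continue; }
            alpha = std::max(alpha, search_move(moves[i], alpha, beta, depth));
            if (alpha >= beta) return alpha;
        }
        for (const Move& m : deferred) {
            alpha = std::max(alpha, search_move(m, alpha, beta, depth));
            if (alpha >= beta) return alpha;
        }
        return alpha;
    }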

Parallel Alpha-Beta

These algorithms divide the Alpha-Beta tree, giving different subtrees to different processors. Because alpha-beta is a serial algorithm, this approach is much more complex; however, these techniques also provide the greatest gains from additional processors.

Principal Variation Splitting (PVS)

In Principal Variation Splitting (PVS), each node is represented by a thread. A thread spawns one child thread for each legal move, but the algorithm imposes a data dependency among them: only after the thread corresponding to the PV node has returned a tighter bound are the remaining threads allowed to run.
[Figure: PV Splitting [3]]
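The control flow can be sketched as follows, using std::async in place of the explicit child threads of the description above; Node and the toy tree are illustrative stand-ins for engine internals, and a real implementation would also abort running siblings once a cutoff is found:

    #include <algorithm>
    #include <future>
    #include <vector>

    struct Node { std::vector<Node> children; int value = 0; };

    int pv_split(const Node& n, int alpha, int beta) {
        if (n.children.empty()) return n.value;   // leaf: static value

        // 1. Search the PV (first) child serially to obtain a tight bound.
        alpha = std::max(alpha, -pv_split(n.children[0], -beta, -alpha));
        if (alpha >= beta) return alpha;          // cutoff: nothing to split

        // 2. Only now are the remaining children free to run in parallel,
        //    each starting from the improved alpha (captured by value).
        std::vector<std::future<int>> jobs;
        for (size_t i = 1; i < n.children.size(); ++i)
            jobs.push_back(std::async(std::launch::async,
                [&n, i, alpha, beta] { return -pv_split(n.children[i], -beta, -alpha); }));
        for (auto& j : jobs)
            alpha = std::max(alpha, j.get());     // collect the results
        return alpha;
    }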

YBWC and Jamboree

The idea in Feldmann's Young Brothers Wait Concept (YBWC) [4] [5], as well as in Kuszmaul's Jamboree Search [6] [7] [8], is to search the first sibling node before spawning the remaining siblings in parallel. This is based on the observation that the first move either produces a cutoff (in which case searching the sibling nodes would be wasted effort) or returns a much tighter bound. Only if the first move fails to produce a cutoff are the remaining moves searched in parallel, and this process is applied recursively.

Since the number of processors is not infinite, "spawning" work normally consists of placing it on some kind of "work to be done" stack, from which processors are free to grab work in FIFO fashion whenever they have nothing to do. In YBW, work is not spawned (placed on the stack) until the first sibling has been searched.
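A hedged sketch of such a work queue follows; WorkItem and the locking discipline are assumptions, not taken from a specific engine, and a real implementation also has to route results and bound updates back to the owner of each split point:

    #include <deque>
    #include <mutex>
    #include <optional>

    // Illustrative "work to be done" queue shared by all processors.
    struct WorkItem { /* position, move list, window, owner, ... */ };

    std::mutex           queue_lock;
    std::deque<WorkItem> work_queue;

    // Younger brothers are published only AFTER the eldest brother has
    // been searched at their parent; until then they never reach the queue.
    void publish_younger_brothers(const std::deque<WorkItem>& brothers) {
        std::lock_guard<std::mutex> g(queue_lock);
        for (const WorkItem& w : brothers)
            work_queue.push_back(w);
    }

    // Idle processors call this in a loop, grabbing work in FIFO order.
    std::optional<WorkItem> grab_work() {
        std::lock_guard<std::mutex> g(queue_lock);
        if (work_queue.empty())
            return std::nullopt;
        WorkItem w = work_queue.front();
        work_queue.pop_front();
        return w;
    }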

In their 1983 paper Improved Speedup Bounds for Parallel Alpha-Beta Search [9], Raphael Finkel and John Philip Fishburn gave theoretical confirmation to the common-sense wisdom that parallel resources should first be thrown into searching the first child. Assuming the tree is already in an approximation of best-first order, this establishes a good alpha value that can then be used to search the later children in parallel. The algorithm in the 1982 Artificial Intelligence paper [10], which Fishburn called the "dumb algorithm" in his 1981 thesis presentation [11], gives p^0.5 speedup with p processors, while the 1983 PAMI algorithm (the "smart algorithm") gives p^0.8 speedup for lookahead trees with the branching factor of chess.
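For concreteness, at p = 16 processors the two bounds work out to:

    \[
      S_{\text{dumb}}(p) = p^{0.5}, \qquad S_{\text{smart}}(p) = p^{0.8};
      \qquad S_{\text{dumb}}(16) = 16^{0.5} = 4, \qquad S_{\text{smart}}(16) = 16^{0.8} \approx 9.2
    \]

so the smart algorithm's advantage keeps growing with the processor count.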

Dynamic Tree Splitting (DTS)

Main page: Dynamic Tree Splitting
This algorithm, invented by the Cray Blitz team (including Robert Hyatt [12]), is the most complex. Though it gives the best known scalability of any SMP algorithm, very few programs use it because of its difficulty of implementation.

Other Approaches

Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success because they do not scale. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.

Comparison

In 1994, Robert Hyatt [13] reported some scaling comparisons of different parallel search algorithms. The values indicate the speedup achieved in solving certain test positions with n processors.

Algorithm    1     2     4     8     16
PVS          1.0   1.8   3.0   4.1   4.6
EPVS         1.0   1.9   3.4   5.4   6.0
DTS          1.0   2.0   3.7   6.6   11.1

Other Considerations

Semaphores

During a parallel search, certain areas of memory must be protected to make sure processors do not write to them simultaneously and corrupt the data, so some type of semaphore system must be used. Semaphores guard a piece of shared memory, typically an integer. When a processor wants to access the protected memory, it reads the integer; if the integer is zero, meaning no other process is accessing the memory, the processor attempts to set it to a nonzero value. This whole read-compare-write sequence must be done atomically, because without atomicity another processor could read the integer at the same time and both would conclude that they may freely access the memory.

In chess programs that use parallel alpha-beta, spinlocks are usually used. Because the semaphores are held for such short periods of time, a processor wants to waste as little time as possible between the release of a semaphore and its acquisition, so if the semaphore is held when a processor reaches it, the processor simply keeps re-reading it. This technique can waste a lot of processing time in applications with high contention, where many processes try to access the same semaphore simultaneously; in chess programs, however, the locks are held so briefly that spinning costs less than putting the thread to sleep and waking it again.

Spinlocks are sometimes implemented in assembly language when the operating system does not provide an API for them.
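A minimal test-and-set spinlock sketch using C++11 atomics, which compile down to the same atomic instructions that the hand-written assembly versions mentioned above rely on:

    #include <atomic>

    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Atomically set the flag and test its previous value: we own
            // the lock only if it was clear. Otherwise busy-wait, re-trying
            // until the holder releases it.
            while (flag.test_and_set(std::memory_order_acquire))
                ;   // spin
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };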

Threads vs. Processes

There are two ways of utilizing the extra processing power of multiple CPUs: threads and processes. With threads, all memory in the program is shared between the threads of execution; with processes, all memory is local to each process except memory that is explicitly shared. This means that in a threaded program, functions must pass around an extra argument specifying which thread is active, and thus which board structure to use, whereas with processes a single global board can be used, which is duplicated for each process.

Threads are more common because they are easier to implement and debug, provided the program does not already have lots of global variables. Processes are favored by some because the need to explicitly share memory makes subtle bugs easier to avoid; also, with processes, the extra argument to most functions is not needed.
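For illustration, a minimal sketch of the threaded convention described above, with an explicit thread index selecting a private board; Board, MAX_THREADS, and the search signature are assumptions, not taken from any engine:

    #include <thread>
    #include <vector>

    struct Board { /* position, move stacks, killers, ... */ };

    constexpr int MAX_THREADS = 8;
    Board boards[MAX_THREADS];            // one private board per thread

    int search(int thread_id, int depth) {
        Board& board = boards[thread_id]; // the extra argument picks the board
        (void)board; (void)depth;         // ... recursive alpha-beta on 'board' ...
        return 0;
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back(search, i, 12);   // thread id threaded through calls
        for (auto& t : pool) t.join();
    }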

Some programs that use threads:

Some programs that use processes:

References

  1. ^ Super Linearity and the Bigger Machine from Dr. Dobb's
  2. ^ Jean-Christophe Weill (1996). The ABDADA Distributed Minimax Search Algorithm. Proceedings of the 1996 ACM Computer Science Conference, pp. 131-138, ACM, New York, N.Y., reprinted ICCA Journal, Vol. 19, No. 1, zipped postscript
  3. ^ Yaoqing Gao, Tony Marsland (1996). Multithreaded Pruned Tree Search in Distributed Systems. Journal of Computing and Information, 2(1), 482-492, pdf
  4. ^ Rainer Feldmann, Peter Mysliwietz, Burkhard Monien (1991). A Fully Distributed Chess Program. Advances in Computer Chess 6, pdf
  5. ^ Rainer Feldmann (1993). Game Tree Search on Massively Parallel Systems. Ph.D. Thesis, pdf
  6. ^ Chris Joerg and Bradley C. Kuszmaul (1994). Massively Parallel Chess, pdf
  7. ^ Bradley C. Kuszmaul (1994). Synchronized MIMD Computing. Ph. D. Thesis, Massachusetts Institute of Technology, pdf
  8. ^ Bradley C. Kuszmaul (1995). The StarTech Massively Parallel Chess Program. pdf
  9. ^ Raphael Finkel, John Philip Fishburn (1983). Improved Speedup Bounds for Parallel Alpha-Beta Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, No. 1, pp. 89-92
  10. ^ Raphael Finkel, John Philip Fishburn (1982). Parallelism in Alpha-Beta Search. Artif. Intell. 19(1): 89-106
  11. ^ John Philip Fishburn (1981). Analysis of Speedup in Distributed Algorithms. Ph.D. Thesis, pdf
  12. ^ Robert Hyatt (1994). The DTS high-performance parallel tree search algorithm
  13. ^ Robert Hyatt (1994). The DTS high-performance parallel tree search algorithm
  14. ^ It should also be noted that Crafty uses threads on Windows, and used processes on Unix
  15. ^ threads vs processes again by Robert Hyatt, CCC, February 27, 2006
  16. ^ Jonathan Schaeffer (1985). Lionel Moser: An Experiment in Distributed Game Tree Searching. ICCA Journal, Vol. 8, No. 2 (Review)
  17. ^ An Introduction to Computer Chess by Alejandro López-Ortiz, 1993
  18. ^ Nagging from Wikipedia
  19. ^ Transposition-driven scheduling - Wikipedia
  20. ^ Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013
  21. ^ Dovetailing (computer science) from Wikipedia
  22. ^ Message Passing Interface from Wikipedia
  23. ^ John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf
