GPU (Graphics Processing Unit)

a specialized processor primarily intended to rapidly manipulate and alter memory for fast [|image processing], usually but not necessarily mapped to a [|framebuffer] of a display. GPUs have more raw computing power than general purpose [|CPUs], but require a restricted, specialized and massively parallel programming model that does not conform with the serial nature of alpha-beta when it comes to a massively parallel search in chess. Best-first Monte-Carlo approaches in conjunction with SIMD and SWAR techniques for move generation and evaluation during the parallel play-outs might be a way to go.

=GPGPU=
There are various frameworks for [|GPGPU], General Purpose computing on Graphics Processing Units. Apart from language wrappers and mobile devices with special APIs, there are in the main three ways to make use of GPGPU:
[Image: GeForce 6600GT (NV43) GPU]

1. Mapping to a native graphics API

 * [|BrookGPU] (translates to [|OpenGL] and [|DirectX])
 * [|C++ AMP] (Open standard by Microsoft that extends C++)
 * [|DirectCompute] (GPGPU API by Microsoft)

2. Native compilers

 * [|CUDA] (GPGPU framework by [|Nvidia])
 * OpenCL (Open Computing Language specified by [|Khronos Group])

3. Intermediate languages

 * HSAIL
 * [|PTX]
 * [|SPIR]

[|CUDA] is supported on [|Nvidia] devices; OpenCL is implemented for [|APUs], CPUs, FPGAs and GPUs by various vendors like AMD, Intel, Nvidia and IBM.
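The SWAR techniques mentioned in the introduction can be illustrated with a Kogge-Stone fill, a branch-free bitboard routine for sliding-piece move generation whose data-parallel, branchless structure maps well to GPU threads. A minimal C sketch (function names are illustrative, not from any particular engine):

```c
#include <stdint.h>

/* Kogge-Stone occluded fill: northward rook fill on a 64-bit
   bitboard, branch-free in O(log n) shift steps (SWAR style). */
uint64_t north_occluded_fill(uint64_t gen, uint64_t pro) {
    /* gen: sliding pieces; pro: propagator (empty squares) */
    gen |= pro & (gen <<  8);
    pro &=       (pro <<  8);
    gen |= pro & (gen << 16);
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);
    return gen;
}

/* North attacks are the fill shifted one more rank up. */
uint64_t north_attacks(uint64_t rooks, uint64_t empty) {
    return north_occluded_fill(rooks, empty) << 8;
}
```

Because the routine contains no branches, all threads of a Warp resp. Wavefront can execute it in lockstep without divergence.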

=Inside=
Modern GPUs consist of up to hundreds of SIMD resp. [|Vector] units, grouped into compute units. Each compute unit processes multiple Warps (Nvidia term) resp. Wavefronts (AMD term) in [|SIMT] fashion. Each Warp resp. Wavefront runs n threads simultaneously.
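The SIMT model can be mimicked on the CPU for intuition: the scheduler issues one instruction, and every lane of the Warp resp. Wavefront executes it on its own data. A minimal C sketch (lane count and function name are illustrative):

```c
#define WARP_SIZE 32

/* SIMT in a nutshell: ONE instruction stream, executed in lockstep
   by all lanes of a warp. The loop below is what the hardware
   performs in a single step, in parallel. */
void warp_add(const int *a, const int *b, int *c) {
    for (int lane = 0; lane < WARP_SIZE; lane++)
        c[lane] = a[lane] + b[lane];
}
```

If lanes take different branches, the hardware serializes the paths (divergence), which is why branchless code is preferred on GPUs.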

The Nvidia GeForce [|GTX 580], for example, is able to run 32 threads in one Warp, in total 24576 threads, spread on 16 compute units with a total of 512 cores. The AMD Radeon [|HD 7970] is able to run 64 threads in one Wavefront, in total 81920 threads, spread on 32 compute units with a total of 2048 cores.

In practice, register and shared memory sizes limit this total.

=Memory=
The memory hierarchy of a GPU consists mainly of private memory (registers, accessed by a single thread resp. work-item), local memory (shared by the threads of a block resp. the work-items of a work-group), constant memory, different types of cache, and global memory. Size, latency and bandwidth vary between vendors and architectures.

Here the data for the Nvidia GeForce GTX 580 ([|Fermi]) as an example:
 * 128 KiB private memory per compute unit
 * 48 KiB (16 KiB) local memory per compute unit (configurable)
 * 64 KiB constant memory
 * 8 KiB constant cache per compute unit
 * 16 KiB (48 KiB) L1 cache per compute unit (configurable)
 * 768 KiB L2 cache
 * 1.5 GiB to 3 GiB global memory

Here the data for the AMD Radeon HD 7970 ([|GCN]) as an example:
 * 256 KiB private memory per compute unit
 * 64 KiB local memory per compute unit
 * 64 KiB constant memory
 * 16 KiB constant cache per four compute units
 * 16 KiB L1 cache per compute unit
 * 768 KiB L2 cache
 * 3 GiB to 6 GiB global memory

=Integer Throughput=
GPUs are used in [|HPC] environments because of their good [|FLOP]/Watt ratio. The 32 bit integer performance can be less than the 32 bit FLOP or 24 bit integer performance. The instruction throughput depends on the architecture (like Nvidia's [|Tesla], [|Fermi], [|Kepler], [|Maxwell] or AMD's [|Terascale], [|GCN]), the brand (like Nvidia [|GeForce], [|Quadro], [|Tesla] or AMD [|Radeon], [|FirePro], [|FireStream]) and the specific model.

As an example, here the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and AMD Radeon HD 7970 (GCN 1.0):

Nvidia GeForce GTX 580 - 32 bit integer operations per clock cycle per compute unit; max theoretical ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec
 * MAD 16
 * MUL 16
 * ADD 32
 * Bit-shift 16
 * Bitwise XOR 32

AMD Radeon HD 7970 - 32 bit integer operations per clock cycle per processing element; max theoretical ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
 * MAD 1/4
 * MUL 1/4
 * ADD 1
 * Bit-shift 1
 * Bitwise XOR 1

=See also=
 * Monte-Carlo Tree Search
 * Parallel Search
 * Perft(15)
 * SIMD and SWAR Techniques
 * UCT
 * Monte Carlo alpha beta, MCαβ

=Publications=

2009

 * Ren Wu, [|Bin Zhang], [|Meichun Hsu] (**2009**). //[|Clustering billions of data points using GPUs]//. [|ACM International Conference on Computing Frontiers]
 * [|Mark Govett], [|Craig Tierney], Jacques Middlecoff, [|Tom Henderson] (**2009**). //Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models//. [|CAS2K9 Workshop], [|pdf]

2010...

 * [|Avi Bleiweiss] (**2010**). //Playing Zero-Sum Games on the GPU//. [|NVIDIA Corporation], [|GPU Technology Conference 2010], [|slides as pdf]
 * [|Mark Govett], Jacques Middlecoff, [|Tom Henderson] (**2010**). //[|Running the NIM Next-Generation Weather Model on GPUs]//. [|CCGRID 2010]
 * [|Mark Govett], Jacques Middlecoff, [|Tom Henderson], [|Jim Rosinski], [|Craig Tierney] (**2011**). //Parallelization of the NIM Dynamical Core for GPUs//. [|slides as pdf]
 * Ľubomír Lackovič (**2011**). //[|Parallel Game Tree Search Using GPU]//. Institute of Informatics and Software Engineering, [|Faculty of Informatics and Information Technologies], [|Slovak University of Technology in Bratislava], [|pdf]
 * Dan Anthony Feliciano Alcantara (**2011**). //Efficient Hash Tables on the GPU//. Ph.D. thesis, [|University of California, Davis], [|pdf] » Hash Table
 * Damian Sulewski (**2011**). //Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks//. Ph.D. thesis, University of Dortmund, [|pdf]
 * Damjan Strnad, Nikola Guid (**2011**). //[|Parallel Alpha-Beta Algorithm on the GPU]//. [|CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » Parallel Search, Reversi
 * Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (**2012**). //[|A Node-based Parallel Game Tree Algorithm Using GPUs]//. CLUSTER 2012, [|pdf] » Parallel Search
 * S. Ali Mirsoleimani, [|Ali Karami], [|Farshad Khunjush] (**2013**). //[|A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]//. [|GECCO '13], [|pdf]

2015 ...

 * Peter H. Jin, Kurt Keutzer (**2015**). //Convolutional Monte Carlo Rollouts in Go//. [|arXiv:1512.03375] » Deep Learning, Go, MCTS
 * Liang Li, Hong Liu, Hao Wang, Taoying Liu, Wei Li (**2015**). //[|A Parallel Algorithm for Game Tree Search Using GPGPU]//. IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 8 » Parallel Search
 * [|Sean Sheen] (**2016**). //[|Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]//. Master's thesis, [|California Polytechnic State University], [|pdf]

=Forum Posts=

2005 ...

 * [|Hardware assist] by Nicolai Czempin, Winboard Forum, August 27, 2006
 * [|Monte carlo on a NVIDIA GPU ?] by Marco Costalba, CCC, August 01, 2008

2010 ...

 * [|Using the GPU] by Louis Zulli, CCC, February 19, 2010
 * [|GPGPU and computer chess] by Wim Sjoho, CCC, February 09, 2011
 * [|Possible Board Presentation and Move Generation for GPUs?] by Srdja Matovic, CCC, March 19, 2011
 * [|Zeta plays chess on a gpu] by Srdja Matovic, CCC, June 23, 2011 » Zeta
 * [|GPU Search Methods] by Joshua Haglund, CCC, July 04, 2011
 * [|Possible Search Algorithms for GPUs?] by Srdja Matovic, CCC, January 07, 2012
 * [|uct on gpu] by Daniel Shawul, CCC, February 24, 2012 » UCT
 * [|Is there such a thing as branchless move generation?] by John Hamlen, CCC, June 07, 2012 » Move Generation
 * [|Choosing a GPU platform: AMD and Nvidia] by John Hamlen, CCC, June 10, 2012
 * [|Nvidias K20 with Recursion] by Srdja Matovic, CCC, December 04, 2012
 * [|Kogge Stone, Vector Based] by Srdja Matovic, CCC, January 22, 2013 » Kogge-Stone Algorithm
 * [|GPU chess engine] by Samuel Siltanen, CCC, February 27, 2013
 * [|Fast perft on GPU (upto 20 Billion nps w/o hashing)] by Ankan Banerjee, CCC, June 22, 2013 » Perft, Kogge-Stone Algorithm

2015 ...

 * [|GPU chess update, local memory...] by Srdja Matovic, CCC, June 06, 2016
 * [|Jetson GPU architecture] by Dann Corbit, CCC, October 18, 2016 » Astro
 * [|Pigeon is now running on the GPU] by Stuart Riffle, CCC, November 02, 2016 » Pigeon
 * [|Back to the basics, generating moves on gpu in parallel...] by Srdja Matovic, CCC, March 05, 2017 » Move Generation
 * [|Re: Perft(15): comparison of estimates with Ankan's result] by Ankan Banerjee, CCC, August 26, 2017 » Perft(15)
 * [|To TPU or not to TPU...] by Srdja Matovic, CCC, December 16, 2017 » Deep Learning

=External Links=
 * [|Graphics processing unit from Wikipedia]
 * [|GPU-Programming] from [|AMD Developer Central]
 * [|NVIDIA GPU Programming Guide] from [|NVIDIA Developer Zone]
 * [|Deep Learning | NVIDIA Developer] » Deep Learning
 * [|GPU Chess Blog]
 * [|Zeta OpenCL Chess]
 * [|Part 1: OpenCL™ – Portable Parallelism - CodeProject]
 * [|Part 2: OpenCL™ – Memory Spaces - CodeProject]
 * [|Advanced game programing | Session 5 - GPGPU programming] from [|Game programming lecture notes] by Andy Thomason
 * [|ankan-ban/perft_gpu · GitHub]
 * [|Faster deep learning with GPUs and Theano] by [|Manojit Nandi], August 05, 2015 » Deep Learning, Python
 * [|Tensor processing unit from Wikipedia]
