GPU (Graphics Processing Unit),
a specialized processor primarily intended to rapidly manipulate and alter memory for fast image processing, usually but not necessarily mapped to a framebuffer of a display. GPUs have more raw computing power than general purpose CPUs but need a limited, specialized and massive parallelized way of programming, not that conform with the serial nature of alpha-beta if it is about a massive parallel search in chess. Best-firstMonte-Carlo approaches in conjunction with SIMD and SWAR techniques for move generation and evaluation purposes for the parallel play-outs might be a way to go.
There are varios frameworks for GPGPU, General Purpose computing on Graphics Processing Unit.
Despite language wrappers and mobile devices with special APIs, there are in main three ways to make use of GPGPU.
CUDA is supported on Nvidia devices, OpenCL is implemented for APUs., CPUs, FPGA, GPUs by various vendors like AMD, Intel, Nvidia and IBM.
Inside
Modern GPUs consist of up to hundreds of SIMD or Vector units, coupled to compute units. Each compute unit processes multible Warps (Nvidia term) resp. Wavefronts (AMD term) in
(SIMT) fashion. Each Warp resp. Wavefront runs n threads simultaneously.
The Nvidia GeForce GTX 580 ,for example, is able to run 32 threads in one Warp, in total of 24576 threads, spread on 16 compute units with a total of 512 cores. [2]
The AMD Radeon HD 7970 is able to run 64 threads in one Wavefront, in total of 81920 threads, spread on 32 compute units with a total of 2048 cores. [3]
In real life the register and shared memory size limits this total.
Memory
The memory hierarchy of an GPU consists in main of private memory (registers accessed by an single thread resp. work-item), local memory (shared by threads of an block resp. work-items of an work-group ), constant memory, different types of cache and global memory. Size, latency and bandwidth vary between vendors and architectures.
Here the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [4]
128 KiB private memory per compute unit
48 KiB (16 KiB) local memory per compute unit (configurable)
64 KiB constant memory
8 KiB constant cache per compute unit
16 KiB (48 KiB) L1 cache per compute unit (configurable)
768 KiB L2 cache
1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 (GCN) as an example: [5]
256 KiB private memory per compute unit
64 KiB local memory per compute unit
64 KiB constant memory
16 KiB constant cache per four compute units
16 KiB L1 cache per compute unit
768 KiB L2 cache
3 GiB to 6 GiB global memory
Integer Throughput
GPUs are used in HPC environments because of their good FLOP/Watt ratio. The 32 bit integer performance can be less than 32 bit FLOP or 24 bit integer performance.
The instruction throuhput depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, FirePro, FireStream) and the specific model.
As an example, here the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and AMD Radeon HD 7970 (GCN 1.0):
Nvidia GeForce GTX 580 - 32 bit integer operations/clock cycle per compute unit [6]
Damian Sulewski (2011). Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks. Ph.D. thesis, University of Dortmund, pdf
a specialized processor primarily intended to rapidly manipulate and alter memory for fast image processing, usually but not necessarily mapped to a framebuffer of a display. GPUs have more raw computing power than general purpose CPUs but need a limited, specialized and massive parallelized way of programming, not that conform with the serial nature of alpha-beta if it is about a massive parallel search in chess. Best-first Monte-Carlo approaches in conjunction with SIMD and SWAR techniques for move generation and evaluation purposes for the parallel play-outs might be a way to go.
Table of Contents
GPGPU
There are varios frameworks for GPGPU, General Purpose computing on Graphics Processing Unit.Despite language wrappers and mobile devices with special APIs, there are in main three ways to make use of GPGPU.
1. Mapping to a native graphics API
2. Native compilers
3. Intermediate languages
CUDA is supported on Nvidia devices, OpenCL is implemented for APUs., CPUs, FPGA, GPUs by various vendors like AMD, Intel, Nvidia and IBM.
Inside
Modern GPUs consist of up to hundreds of SIMD or Vector units, coupled to compute units. Each compute unit processes multible Warps (Nvidia term) resp. Wavefronts (AMD term) in(SIMT) fashion. Each Warp resp. Wavefront runs n threads simultaneously.
The Nvidia GeForce GTX 580 ,for example, is able to run 32 threads in one Warp, in total of 24576 threads, spread on 16 compute units with a total of 512 cores. [2]
The AMD Radeon HD 7970 is able to run 64 threads in one Wavefront, in total of 81920 threads, spread on 32 compute units with a total of 2048 cores. [3]
In real life the register and shared memory size limits this total.
Memory
The memory hierarchy of an GPU consists in main of private memory (registers accessed by an single thread resp. work-item), local memory (shared by threads of an block resp. work-items of an work-group ), constant memory, different types of cache and global memory. Size, latency and bandwidth vary between vendors and architectures.Here the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [4]
- 128 KiB private memory per compute unit
- 48 KiB (16 KiB) local memory per compute unit (configurable)
- 64 KiB constant memory
- 8 KiB constant cache per compute unit
- 16 KiB (48 KiB) L1 cache per compute unit (configurable)
- 768 KiB L2 cache
- 1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 (GCN) as an example: [5]Integer Throughput
GPUs are used in HPC environments because of their good FLOP/Watt ratio. The 32 bit integer performance can be less than 32 bit FLOP or 24 bit integer performance.The instruction throuhput depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, FirePro, FireStream) and the specific model.
As an example, here the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and AMD Radeon HD 7970 (GCN 1.0):
Nvidia GeForce GTX 580 - 32 bit integer operations/clock cycle per compute unit [6]
- MAD 16
- MUL 16
- ADD 32
- Bit-shift 16
- Bitwise XOR 32
Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/secAMD Radeon HD 7970 - 32 bit integer operations/clock cycle per processing element [7]
- MAD 1/4
- MUL 1/4
- ADD 1
- Bit-shift 1
- Bitwise XOR 1
Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/secSee also
Publications
2009
2010...
2015 ...
Forum Posts
2005 ...
2010 ...
2015 ...
External Links
References
What links here?
Up one Level