SIMD and SWAR Techniques


[[image:250px-SIMD.svg.png width="240" height="240" link="https://en.wikipedia.org/wiki/File:SIMD.svg"]]

x86, x86-64, as well as PowerPC and [|Power ISA v.2.03] processors provide **Single Instruction, Multiple Data** (SIMD) instructions, which operate on vectors of floats, doubles, or integers of various widths (bytes, words, double words or quad words). They are available through assembly and compiler intrinsics. SIMD applications related to computer chess cover bitboard computations and fill algorithms like Dumb7Fill and the Kogge-Stone Algorithm, as well as evaluation-related tasks, such as the SSE2 dot-product of 64 bits by a vector of 64 bytes.

**SWAR**, an acronym for SIMD Within A Register, was coined by Hank Dietz and Randy Fisher. It is a processing model that applies SIMD parallel processing across sections of a CPU register. Often, vectors of entities smaller than a byte are processed in a parallel prefix manner.

=SIMD Instruction Sets=
 * MMX on x86 and x86-64
 * SSE2, SSE3, SSSE3 and SSE4 on x86 and x86-64
 * SSE5 by AMD (proposed but not implemented, replaced by XOP)
 * AltiVec on PowerPC G4, PowerPC G5
 * [|ARM NEON Technology]
 * AVX by Intel
 * AVX2 by Intel
 * AVX-512 by Intel
 * XOP by AMD

=SWAR Arithmetic=
To apply addition and subtraction on vectors of bit-aggregates or [|bit-field structures] within a general purpose register, one has to take care that carries and borrows don't propagate from one field into the next. Thus the need to mask off the most significant bit of every field (H) and to add in two steps: one 'add' with the MSBs clear, and one add modulo 2, aka 'xor', for the MSBs themselves. For bytewise (rankwise) math inside a 64-bit register, H is 0x8080808080808080 and L, the least significant bit of every byte, is 0x0101010101010101.
code format="Cpp"
// SWAR add z = x + y
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
code
code format="Cpp"
// SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
code
code format="Cpp"
// SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
z = (x & y) + (((x ^ y) & ~L) >> 1)
code
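The bytewise SWAR add, subtract and average formulas from the SWAR Arithmetic section can be wrapped into small functions and compared against a plain byte-by-byte loop. The following is only a minimal sketch under stated assumptions: the function names (swarAddBytes, swarSubBytes, swarAvgBytes), the scalar reference and the self-check are hypothetical helpers introduced for illustration, and only the three expressions and the H and L masks come from the text.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t U64;

// Masks for eight byte-wide fields packed into one 64-bit word.
const U64 H = 0x8080808080808080ULL; // most significant bit of every byte
const U64 L = 0x0101010101010101ULL; // least significant bit of every byte

// Bytewise x + y (mod 256 per byte): add the low seven bits of each byte,
// then fold in the top bits with xor so no carry leaves its byte.
U64 swarAddBytes(U64 x, U64 y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

// Bytewise x - y (mod 256 per byte): setting the top bit of each byte of x
// absorbs any borrow; the xor term then corrects the top bits.
U64 swarSubBytes(U64 x, U64 y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}

// Bytewise floor((x + y) / 2), from x + y = (x ^ y) + 2 * (x & y).
// Clearing L before the shift keeps bits from sliding into the byte below.
U64 swarAvgBytes(U64 x, U64 y) {
    return (x & y) + (((x ^ y) & ~L) >> 1);
}

// Hypothetical scalar reference: the same operation, one byte at a time.
// op: 0 = add, 1 = subtract, 2 = average.
U64 bytewiseReference(U64 x, U64 y, int op) {
    U64 z = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned a = (unsigned)((x >> (8 * i)) & 0xFF);
        unsigned b = (unsigned)((y >> (8 * i)) & 0xFF);
        unsigned r = (op == 0) ? a + b : (op == 1) ? a - b : (a + b) / 2;
        z |= (U64)(r & 0xFF) << (8 * i);
    }
    return z;
}

// Compare the SWAR routines with the reference on pseudo-random inputs
// (xorshift generator, arbitrary non-zero seed).
void checkSwarAgainstReference() {
    U64 s = 0x9E3779B97F4A7C15ULL;
    for (int i = 0; i < 1000; ++i) {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        U64 x = s;
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        U64 y = s;
        assert(swarAddBytes(x, y) == bytewiseReference(x, y, 0));
        assert(swarSubBytes(x, y) == bytewiseReference(x, y, 1));
        assert(swarAvgBytes(x, y) == bytewiseReference(x, y, 2));
    }
}
```

As a usage example, swarSubBytes(0x0100, 0x0001) yields 0x01FF: the low byte wraps around to 0xFF without borrowing from the byte above, whereas an ordinary 64-bit subtraction would yield 0x00FF.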

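=Samples=
It is striking how similar these two SWAR-wise, parallel prefix routines are. Horizontal mirroring and population count both act on vectors of duos (bit pairs), nibbles and bytes. The first swaps adjacent bits, duos and nibbles, while the second adds up their populations.
code format="cpp"
typedef unsigned long long U64;            // 64-bit bitboard
#define C64(constantU64) constantU64##ULL  // 64-bit constant

// reverse the bit order within each byte (rank)
U64 mirrorHorizontal (U64 x) {
   const U64 k1 = C64(0x5555555555555555);
   const U64 k2 = C64(0x3333333333333333);
   const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
   x = ((x & k1) << 1) | ((x >> 1) & k1); // swap adjacent bits
   x = ((x & k2) << 2) | ((x >> 2) & k2); // swap adjacent duos
   x = ((x & k4) << 4) | ((x >> 4) & k4); // swap adjacent nibbles
   return x;
}
code
code format="cpp"
// count the one-bits; U64 and C64 as defined above
int popCount (U64 x) {
   const U64 k1 = C64(0x5555555555555555);
   const U64 k2 = C64(0x3333333333333333);
   const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
   x =  x             - ((x >> 1)  & k1); // count of each duo into that duo
   x = (x & k2)       + ((x >> 2)  & k2); // count of each nibble into that nibble
   x = ( x            +  (x >> 4)) & k4 ; // count of each byte into that byte
   x = (x * C64(0x0101010101010101))>> 56; // sum of all byte counts ends up in the top byte
   return (int) x;
}
code

=Publications=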
 * [|Tom Thompson] (**1999**). //[|AltiVec Revealed]//. [|MacTech], Vol. 15, No. 7
 * [|Nicolas Fritz] (**2009**). //SIMD Code Generation in Data-Parallel Programming//. Ph.D. thesis, [|Saarland University], [|pdf]
 * [|Georg Hager], [|Jan Treibig], [|Gerhard Wellein] (**2013**). //The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems//. [|RRZE], [|SC13], [|slides as pdf]
 * [|Kaixi Hou], Hao Wang, [|Wu-chun Feng] (**2015**). //ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors//. [|ICS2015]

=Manuals=

AMD

 * [|AMD64 Architecture Volume 4: 128-Bit and 256-Bit Media Instructions] (pdf)
 * [|AMD64 Architecture Volume 5: 64-Bit Media and x87 Floating-Point Instructions] (pdf)
 * [|AMD64 Architecture Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions] (pdf)

NXP Semiconductors

 * [|AltiVec Technology - Programming Interface Manual] (pdf)

Intel

 * [|Intel 64 and IA32 Architectures Optimization Reference Manual] (pdf)

=Forum Posts=
 * [|G4 & AltiVec] by Will Singleton, CCC, October 04, 1999
 * [|Superlinear interpolator: a nice novelity ?] by Marco Costalba, CCC, September 20, 2008 » Tapered Eval
 * [|Re: talk about IPP's evaluation] by Richard Vida, CCC, November 07, 2009 » Ippolit, Tapered Eval
 * [|My experience with Linux/GCC] by Richard Vida, CCC, March 23, 2011 » C, Linux, Tapered Eval
 * [|Re: Utilizing Architecture Specific Functions from a HL Language] by Wylie Garvin, CCC, July 31, 2011
 * [|two values in one integer] by Pierre Bokma, CCC, January 18, 2012
 * [|couple of questions about stockfish code ?] by Mahmoud Uthman, CCC, October 26, 2016 » Stockfish, Tapered Eval

=External Links=
 * [|SIMD from Wikipedia]
 * [|SWAR from Wikipedia]
 * [|The Aggregate: SWAR, SIMD Within A Register] by Hank Dietz
 * [|Advanced game programing | Session 4 - Math libraries and SIMD] from [|Game programming lecture notes] by Andy Thomason

x86

 * [|MMX from Wikipedia]
 * [|3DNow! from Wikipedia]
 * [|Streaming SIMD Extensions from Wikipedia]
 * [|SSE2 from Wikipedia]
 * [|SSE3 from Wikipedia]
 * [|SSSE3 from Wikipedia]
 * [|SSE4 from Wikipedia]
 * [|SSE4a from Wikipedia]
 * [|SSE5 from Wikipedia]
 * [|XOP instruction set from Wikipedia]
 * [|Advanced Vector Extensions from Wikipedia]
 * [|AVX-512 from Wikipedia]
 * [|SSEPlus Project] from [|AMD Developer Central]
 * [|SSEPlus Project Documentation]

Other SIMD

 * [|ARM NEON Technology]
 * [|ARM NEON Technology from Wikipedia]
 * [|AltiVec from Wikipedia]
 * [|Hardware - SSE Performance Programming] from [|Apple Developer]
 * [|Apple Instruction Cross-Reference] from [|Apple Developer]

Misc

 * [|Explicitly parallel instruction computing (EPIC) from Wikipedia]
 * [|Instruction-level parallelism from Wikipedia]
 * [|MIMD from Wikipedia]
 * [|Parallel Thread Execution from Wikipedia] » GPU, Thread
 * [|SPMD from Wikipedia]
 * [|Very long instruction word (VLIW) from Wikipedia]
