SSE2 (Streaming SIMD Extensions 2) and the further x86 and x86-64 streaming SIMD extensions, like SSE3, SSSE3, SSE4 and AMD's announced SSE5, provide, as a major enhancement of SSE, an instruction set on 128-bit registers, namely on vectors of four floats or two doubles, and, since SSE2, on vectors of 16 bytes, eight words, four double words or two quad words^{[1]}. In 64-bit mode there are 16 xmm registers available, xmm0..xmm15, in 32-bit mode only eight, xmm0..xmm7. SSE is explicitly available through C compiler intrinsics^{[2]} or (inline) assembly. Some compilers implicitly use SSE float and double instructions for floating point data types; others even provide automatic SSE2 vectorization while processing arrays of various integer types. SSE and SSE2 intrinsic functions are available in Visual C^{[3]} or Intel C^{[4]}^{[5]}.
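For illustration (not one of the article's own samples), a single intrinsic already operates on a whole xmm register, i.e. on a vector of two bitboards at once; the function name here is made up:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <cassert>

typedef uint64_t U64;

// intersect two pairs of bitboards with one pand instruction (illustrative sketch)
__m128i intersect2(__m128i a, __m128i b) {
   return _mm_and_si128(a, b); // bitwise and, both 64-bit lanes at once
}
```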

Some sample SSE2 bitboard routines, tested with vc2005:

One Step Only

Since 128-bit xmm registers may be treated as a vector of 16 bytes, shifting techniques such as one step in all eight directions can be done more efficiently with respect to wraps from the a- to the h-file or vice versa. It is recommended to write one's own SSE2 wrapper class with overloaded operators in C++ to encapsulate a vector of two bitboards.

northwest    north    northeast
  noWe       nort       noEa
        +7    +8    +9
            \  |  /
west    -1 <-  0 -> +1    east
            /  |  \
        -9    -8    -7
  soWe       sout       soEa
southwest    south    southeast

Vertical steps, as usual, with a 64-bit shift by eight, one rank each:

__m128i nortOne(__m128i b) {
   b = _mm_slli_epi64(b, 8);
   return b;
}

__m128i soutOne(__m128i b) {
   b = _mm_srli_epi64(b, 8);
   return b;
}

Unfortunately there is no byte-wise shift in the SSE2 instruction set (nor in MMX), but using byte-wise parallel add avoids the wrap masks, which would otherwise need to be loaded from memory or computed. Applying the wrap masks takes two instructions:

__m128i butNotA(__m128i b) {
   b = _mm_srli_epi64(b, 1);
   b = _mm_add_epi8(b, b);
   return b;
}

__m128i butNotH(__m128i b) {
   b = _mm_add_epi8(b, b);
   b = _mm_srli_epi64(b, 1);
   return b;
}

This is how the east directions are computed, based on parallel byte-wise add, with either one or two SSE2 instructions:

__m128i eastOne(__m128i b) {
   b = _mm_add_epi8(b, b);
   return b;
}

__m128i noEaOne(__m128i b) {
   b = _mm_add_epi8(b, b);
   b = _mm_slli_epi64(b, 8);
   return b;
}

__m128i soEaOne(__m128i b) {
   b = _mm_add_epi8(b, b);
   b = _mm_srli_epi64(b, 8);
   return b;
}

West directions need a leading not-A-file and take three instructions each:

__m128i westOne(__m128i b) {
   b = _mm_srli_epi64(b, 1);
   b = _mm_add_epi8(b, b);
   b = _mm_srli_epi64(b, 1);
   return b;
}

__m128i soWeOne(__m128i b) {
   b = _mm_srli_epi64(b, 1);
   b = _mm_add_epi8(b, b);
   b = _mm_srli_epi64(b, 9);
   return b;
}

__m128i noWeOne(__m128i b) {
   b = _mm_srli_epi64(b, 1);
   b = _mm_add_epi8(b, b);
   b = _mm_slli_epi64(b, 7);
   return b;
}

East Attacks

SIMD-wise Fill by Subtraction with byte- or rank-wise arithmetic takes only a few instructions with SSE2:
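A sketch of how such a rank-wise fill by subtraction might look with intrinsics — an illustrative reconstruction, not the original routine, and the function name is an assumption. The byte-wise subtraction lets borrows stop at rank (byte) borders, so no east-wrap masking is needed:

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <cassert>

typedef uint64_t U64;

// east attacks of sliding pieces by rank-wise o ^ ((o ^ r) - r),
// for two bitboards at once (illustrative sketch)
__m128i eastAttacks2(__m128i occ, __m128i rooks) {
   __m128i o = _mm_or_si128(occ, rooks);  // make sure sliders are part of the occupancy
   __m128i t = _mm_xor_si128(o, rooks);   // o ^ r
   t = _mm_sub_epi8(t, rooks);            // (o ^ r) - r, borrows confined to each rank byte
   return _mm_xor_si128(o, t);            // bits changed by the borrow: squares east of each slider
}
```

For a rook on a1 with a blocker on e1 this yields b1..e1, i.e. up to and including the blocker.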

SSE2 Dot Product

The dot product^{[6]} of a vector of bits with a weight vector of bytes might be used in determining mobility for evaluation purposes. The vector of bits is a bitboard of all squares attacked by one (or multiple) piece(s), while the weight vector considers the "importance" of squares, like center control, or even control of squares near the opponent king, e.g. by providing 64 weight vectors, one for each king square.

The 64-bit times 64-byte dot product implements a kind of weighted population count, similar to the following loop approach, but completely unrolled and branchless:
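The mentioned loop approach might look like this scalar sketch (function name assumed):

```cpp
#include <stdint.h>
#include <cassert>

typedef uint64_t U64;

// weighted population count, plain scalar loop version (sketch)
int dotProduct64(U64 bb, const unsigned char weights[64]) {
   int sum = 0;
   for (int sq = 0; sq < 64; ++sq)   // one test and branch per square
      if (bb & (1ULL << sq))
         sum += weights[sq];
   return sum;
}
```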

The SSE2 routine expands a bitboard, a vector of 64 bits, into 64 bytes inside four 128-bit xmm registers, and performs the multiplication with the byte vector by bitwise 'and', before it finally adds all bytes together. The bitboard is therefore replicated eight times with a sequence of seven unpack and interleave instructions, so that the expanded bytes keep the same order as the bits of the bitboard, before masking and comparing them with a vector of bytes, each with one appropriate bit set.
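A possible shape of that expansion — an illustrative reconstruction with an assumed helper name, not the original code. Seven unpacks replicate each rank byte eight times, then mask and compare select one bit per byte, yielding 0xFF for a set bit and 0x00 otherwise:

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <cassert>

typedef uint64_t U64;

// expand the 64 bits of a bitboard into 64 bytes {0,0xFF} in four xmm registers (sketch)
void expandBits(U64 bb, __m128i out[4]) {
   const __m128i bits = _mm_set_epi8(          // one bit per byte, repeating 1,2,4,...,0x80
      (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
      (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
   __m128i x = _mm_loadl_epi64((const __m128i*)&bb); // eight rank bytes in the low qword
   x = _mm_unpacklo_epi8(x, x);                      // 1st unpack: each rank byte doubled
   __m128i lo = _mm_unpacklo_epi16(x, x);            // ranks 1-4, each byte x4
   __m128i hi = _mm_unpackhi_epi16(x, x);            // ranks 5-8, each byte x4
   out[0] = _mm_unpacklo_epi32(lo, lo);              // ranks 1-2, each byte x8
   out[1] = _mm_unpackhi_epi32(lo, lo);              // ranks 3-4
   out[2] = _mm_unpacklo_epi32(hi, hi);              // ranks 5-6
   out[3] = _mm_unpackhi_epi32(hi, hi);              // ranks 7-8
   for (int i = 0; i < 4; i++)                       // select one bit per byte, 0xFF if set
      out[i] = _mm_cmpeq_epi8(_mm_and_si128(out[i], bits), bits);
}
```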

The dot product is designed for unsigned weights in the 0..63 range, so that vertical byte-wise adds of the four weights cannot overflow. Nevertheless, three PADDUSB - packed add unsigned byte with saturation - instructions (_mm_adds_epu8) are used to limit the maximum four-byte sum to 255, to make the routine more "robust" for cases with average weights greater than 63. The horizontal add of both quad words of the 128-bit xmm register is performed by the PSADBW - packed sum of absolute differences of bytes into a word - instruction (_mm_sad_epu8) with zero, while the final add of the two resulting word sums in the high and low quad word of the xmm register is done with general purpose registers.
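The described horizontal add might be sketched as follows (helper name is an assumption):

```cpp
#include <emmintrin.h>
#include <cassert>

// horizontal sum of all 16 bytes of an xmm register via PSADBW against zero (sketch)
int horizontalSum16x8(__m128i v) {
   __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128()); // two word sums, low and high qword
   return _mm_cvtsi128_si32(sad)
        + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8));   // final add in general purpose registers
}
```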

Rotated Dot Product

A little bit cheaper is to expand the bitboard to a vector of 90 degree rotated {0,255} bytes, which requires a rotated weight vector as well^{[7]}.

SSE2 Population Count

The following proposal of a SWAR popcount combined with a dot product might be quite competitive on recent x86-64 processors, with a throughput of up to three SIMD instructions per cycle^{[8]}^{[9]}. It determines the cardinalities of eight bitboards, multiplies each with a corresponding weight, a signed 16-bit word, to finally add everything together as an integer. However, Wojciech Muła's SSSE3 PopCnt would save some more cycles, even more with doubled or fourfold register widths using AVX2 or AVX-512.

/**
 * popCountWeight8
 * @author Gerd Isenberg
 * @param bb vector of eight bitboards
 * @param weight vector of eight short weights
 * @return sum(0,7) popcnt(bb[i]) * weight[i]
 */
int popCountWeight8(const U64 bb[8], const short weight[8]) {
   static const U64 XMM_ALIGN masks[6] = {
      C64(0x5555555555555555), C64(0x5555555555555555),
      C64(0x3333333333333333), C64(0x3333333333333333),
      C64(0x0f0f0f0f0f0f0f0f), C64(0x0f0f0f0f0f0f0f0f)
   };
   const __m128i* pM = (const __m128i*) masks;
   const __m128i* pb = (const __m128i*) bb;
   __m128i v = pb[0], w = pb[1], x = pb[2], y = pb[3];
   v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi64(v, 1), pM[0]));
   w = _mm_sub_epi8(w, _mm_and_si128(_mm_srli_epi64(w, 1), pM[0]));
   x = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi64(x, 1), pM[0]));
   y = _mm_sub_epi8(y, _mm_and_si128(_mm_srli_epi64(y, 1), pM[0]));
   v = _mm_add_epi8(_mm_and_si128(v, pM[1]), _mm_and_si128(_mm_srli_epi64(v, 2), pM[1]));
   w = _mm_add_epi8(_mm_and_si128(w, pM[1]), _mm_and_si128(_mm_srli_epi64(w, 2), pM[1]));
   x = _mm_add_epi8(_mm_and_si128(x, pM[1]), _mm_and_si128(_mm_srli_epi64(x, 2), pM[1]));
   y = _mm_add_epi8(_mm_and_si128(y, pM[1]), _mm_and_si128(_mm_srli_epi64(y, 2), pM[1]));
   v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi64(v, 4)), pM[2]);
   w = _mm_and_si128(_mm_add_epi8(w, _mm_srli_epi64(w, 4)), pM[2]);
   x = _mm_and_si128(_mm_add_epi8(x, _mm_srli_epi64(x, 4)), pM[2]);
   y = _mm_and_si128(_mm_add_epi8(y, _mm_srli_epi64(y, 4)), pM[2]);
   __m128i z = _mm_setzero_si128();
   v = _mm_packs_epi16(_mm_sad_epu8(v, z), _mm_sad_epu8(w, z));
   x = _mm_packs_epi16(_mm_sad_epu8(x, z), _mm_sad_epu8(y, z));
   const __m128i* pW = (const __m128i*) weight;
   v = _mm_madd_epi16(_mm_packs_epi16(v, x), pW[0]);
   v = _mm_add_epi32(v, _mm_srli_si128(v, 4));
   v = _mm_add_epi32(v, _mm_srli_si128(v, 8));
   return _mm_cvtsi128_si32(v);
}

SSE2-Wrapper in C++

C++ quite efficiently allows to wrap all the intrinsics and to write classes with appropriate operators overloaded. The proposal here overloads the + and - operators for byte- or rank-wise arithmetic, while shifts work on 64-bit entities, as often used in the mentioned SSE2 routines or Kogge-Stone fill stuff. A base class for the memory layout and two derivations allow implementing routines with SSE2 or general purpose instructions - or any other available SIMD architecture like AltiVec. For instance a quad-bitboard attack-getter:
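A sketch of how the XMM flavor of such a wrapper might look — class layout, constructors and operator set here are assumptions, modeled on what the eastAttacks routine below requires:

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <cassert>

typedef uint64_t U64;

// sketch of an SSE2 wrapper class; names and layout are assumptions
class XMM {
public:
   __m128i x;
   XMM() {}                                   // uninitialized, like a raw register
   explicit XMM(const U64& b) {               // loads two *adjacent* bitboards, as in a QBB
      x = _mm_loadu_si128((const __m128i*)&b);
   }
   XMM(U64 lo, U64 hi) { x = _mm_set_epi64x(hi, lo); }
   XMM operator|(const XMM& a) const { XMM r; r.x = _mm_or_si128 (x, a.x); return r; }
   XMM operator^(const XMM& a) const { XMM r; r.x = _mm_xor_si128(x, a.x); return r; }
   XMM operator-(const XMM& a) const { XMM r; r.x = _mm_sub_epi8 (x, a.x); return r; } // rank-wise
};
```

With these operators the expression o ^ ((o ^ r) - r) from the attack-getter compiles to the same handful of SSE2 instructions as the hand-written intrinsic version.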

// T is either XMM or GPR
template<class T>
inline void eastAttacks(QBB& t, const QBB& s, U64 occ) {
   T* pt = (T*) &t;
   T r0(s.bb[0]);
   T r1(s.bb[2]);
   T o0(occ, occ);
   T o1 = o0 | r1;
   o0 = o0 | r0;
   pt[0] = o0 ^ ((o0 ^ r0) - r0);
   pt[1] = o1 ^ ((o1 ^ r1) - r1);
}


Integer Instructions

x86 and x86-64 SSE2 integer instructions, per the C intrinsic reference from the MSDN Library; the intrinsic data type __m128i refers to an xmm register or memory location:

pand       bitwise and
pandn      bitwise and not
por        bitwise inclusive or
pxor       bitwise exclusive or
psrlq      shift right logical quad words
psllq      shift left logical quad words
paddb      add bytes
psubb      subtract bytes
psadbw     packed sum of absolute differences of bytes into a word
pmaxsw     maximum of signed words
pmaxub     maximum of unsigned bytes
pminsw     minimum of signed words
pminub     minimum of unsigned bytes
pcmpeqb    compare equal bytes
pmullw     multiply words, low part
pmulhw     multiply signed words, high part
pmulhuw    multiply unsigned words, high part
pmaddwd    multiply words, add adjacent products to double words
punpcklbw  gGhHfFeE:dDcCbBaA := xxxxxxxx:GHFEDCBA # xxxxxxxx:ghfedcba
punpckhbw  gGhHfFeE:dDcCbBaA := GHFEDCBA:xxxxxxxx # ghfedcba:xxxxxxxx
punpcklwd  dDcC:bBaA := xxxx:DCBA # xxxx:dcba
punpckhwd  dDcC:bBaA := DCBA:xxxx # dcba:xxxx
punpckldq  bB:aA := xx:BA # xx:ba
punpckhdq  bB:aA := BA:xx # ba:xx
punpcklqdq a:A := x:A # x:a
punpckhqdq a:A := A:x # a:x
pshuflw    shuffle low words
pshufhw    shuffle high words
pshufd     shuffle double words
movdqa     xmm := *p (aligned load), *p := xmm (aligned store)
movdqu     xmm := *p (unaligned load), *p := xmm (unaligned store)
movq       move quad word
movd       move double word; xmm := gp64 in 64-bit mode
pextrw     extract word
pinsrw     insert word
pmovmskb   gp32 := 16 sign-bits(xmm)
prefetch   prefetch cache line


Rotated Dot Product

A little bit cheaper is to expand the bitboard to a vector of 90 degree rotated {0,255} bytes, which requires a rotated weight vector as well^{[7]}.


## External Links

Calling conventions for different C++ compilers and operating systems (pdf) by Agner Fog
