XOP

**XOP** (eXtended Operations),
an x86-64 SIMD instruction set extension by AMD, released with the Bulldozer microarchitecture. It has the same functionality as the SSE5 instruction set that AMD proposed in August 2007, but with a revised encoding to improve compatibility with Intel's AVX and the VEX coding scheme.

The XOP instructions use a new three-byte XOP prefix preceding the opcode byte. This prefix replaces the 0F, 66, F2 and F3 prefix bytes and the REX prefix, and encodes additional information as well. XOP support is indicated by bit 11 of ECX being set, as returned by CPUID function EAX = 80000001H.

=Instructions=
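Before any of the instructions below is used, a program would normally test the CPUID bit mentioned above. The following is a minimal sketch of such a test. The helper names xop_bit_set and cpu_has_xop are hypothetical, and the cpuid.h header is assumed to be the one shipped with GCC and Clang on x86. A complete check would presumably also confirm that the operating system saves the extended register state, which this sketch omits.

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // assumption: GCC/Clang helper header for CPUID
#endif

// True if the XOP feature flag (bit 11) is set in the given ECX value
// of CPUID function 80000001H.
constexpr bool xop_bit_set(std::uint32_t ecx) {
    return ((ecx >> 11) & 1u) != 0;
}

// Queries the CPU; reports false where CPUID is not available to this sketch.
bool cpu_has_xop() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested function is not supported.
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    return xop_bit_set(ecx);
#else
    return false;
#endif
}
```

Keeping the bit test separate from the CPUID query makes the decision logic easy to exercise on machines without XOP.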

Integer Multiply, Add and Accumulate
XOP has a variety of multiply, add and accumulate instructions that operate on and produce packed signed integer values. These instructions are certainly worthwhile for evaluation purposes, for instance VPMACSSWW:
|| [[image:VPMACSWW.JPG]] ||
|| VPMACSSWW - Packed Multiply Accumulate Signed Word to Signed Word with Saturation ||

Since these instructions have the same performance as typical multiply instructions like PMULLW and PMADDWD and require the same execution resources, they effectively make the add step "free". The primary catch to using these instructions is latency; for example, the following sequence to sum a series of multiplies is extremely slow and will take 16 cycles, because each instruction has to wait for the previous result in xmm0:
||~ Instruction ||~ Starting Cycle ||~ Ending Cycle ||
|| vpmacssww xmm0, xmm1, xmm2, xmm0 ||= 0 ||= 3 ||
|| vpmacssww xmm0, xmm3, xmm4, xmm0 ||= 4 ||= 7 ||
|| vpmacssww xmm0, xmm5, xmm6, xmm0 ||= 8 ||= 11 ||
|| vpmacssww xmm0, xmm7, xmm8, xmm0 ||= 12 ||= 15 ||

Whereas the simple version, without XOP, will take just 8 cycles, albeit with more uops, since the independent multiplies can be pipelined:
||~ Instruction ||~ Starting Cycle ||~ Ending Cycle ||
|| pmullw xmm1, xmm2 ||= 0 ||= 3 ||
|| pmullw xmm3, xmm4 ||= 1 ||= 4 ||
|| pmullw xmm5, xmm6 ||= 2 ||= 5 ||
|| pmullw xmm7, xmm8 ||= 3 ||= 6 ||
|| paddsw xmm0, xmm1 ||= 4 ||= 5 ||
|| paddsw xmm0, xmm3 ||= 5 ||= 6 ||
|| paddsw xmm0, xmm5 ||= 6 ||= 7 ||
|| paddsw xmm0, xmm7 ||= 7 ||= 8 ||

Multiple accumulators can help avoid this problem, as can finding other ways to hide the latency.
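The multiple-accumulator idea can be sketched with a portable scalar model. This is not XOP code: macs_sat, adds_sat and sum4_two_chains are hypothetical names, and the lane semantics of the multiply-accumulate (full product plus accumulator, clamped to a signed word) are an assumption about VPMACSSWW. The point is only that two accumulator chains do not depend on each other, so their multiply-accumulates could overlap.

```cpp
#include <array>
#include <cstdint>

using Vec8 = std::array<std::int16_t, 8>;  // stand-in for one XMM register of words

// Clamp a 32-bit value to the signed 16-bit range.
static std::int16_t sat16(std::int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return static_cast<std::int16_t>(v);
}

// Scalar model of a saturating multiply-accumulate per word lane
// (assumed semantics, see the lead-in).
Vec8 macs_sat(const Vec8& a, const Vec8& b, const Vec8& acc) {
    Vec8 r{};
    for (int i = 0; i < 8; ++i)
        r[i] = sat16(std::int32_t(a[i]) * b[i] + acc[i]);
    return r;
}

// Scalar model of a saturating word add (as done by paddsw).
Vec8 adds_sat(const Vec8& x, const Vec8& y) {
    Vec8 r{};
    for (int i = 0; i < 8; ++i)
        r[i] = sat16(std::int32_t(x[i]) + y[i]);
    return r;
}

// Sum of four products using two independent accumulator chains.
Vec8 sum4_two_chains(const std::array<Vec8, 4>& a, const std::array<Vec8, 4>& b) {
    Vec8 acc0{}, acc1{};
    acc0 = macs_sat(a[0], b[0], acc0);  // chain 0
    acc1 = macs_sat(a[1], b[1], acc1);  // chain 1, independent of chain 0
    acc0 = macs_sat(a[2], b[2], acc0);
    acc1 = macs_sat(a[3], b[3], acc1);
    return adds_sat(acc0, acc1);        // combine the chains once at the end
}
```

With saturation, splitting the sum across chains can clamp at different points than a single chain would, so the results are only guaranteed to agree while no intermediate value overflows.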

Horizontal Add and Subtract
XOP packed horizontal add and subtract signed integer instructions successively add adjacent pairs from the source XMM register and pack the (sign-extended) integer result in the destination. For instance, VPHADDWQ can be used to continue the dot product from a previous Multiply, Add and Accumulate:
|| [[image:VPHADDWQ.JPG]] ||
|| VPHADDWQ - Packed Horizontal Add Signed Word to Signed Quadword ||

While some of these instructions may at first appear to be less powerful than the existing SSSE3 phaddw and phsubw, the latter tend to be rather slow in most implementations, while the XOP variants are all fast, single-uop instructions.
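As a rough scalar illustration of the horizontal add, the sketch below models VPHADDWQ as summing four adjacent sign-extended words into each quadword lane. The function name phaddwq is hypothetical and the code is a portable model, not an intrinsic.

```cpp
#include <array>
#include <cstdint>

// Scalar model of VPHADDWQ: each 64-bit result lane is the sum of four
// adjacent sign-extended 16-bit words of the source.
std::array<std::int64_t, 2> phaddwq(const std::array<std::int16_t, 8>& src) {
    std::array<std::int64_t, 2> dest{};
    for (int q = 0; q < 2; ++q)
        for (int w = 0; w < 4; ++w)
            dest[q] += src[4 * q + w];  // widening add, cannot overflow
    return dest;
}
```

Applied to the word lanes left by a multiply-accumulate, this reduces eight partial sums to two, which is the last-but-one step of a dot product.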

Vector Conditional Moves
The Vector Conditional Move (**VPCMOV**) instruction implements the C/C++ ternary '?:' operator at bit level on 128-bit XMM or 256-bit YMM registers. VPCMOV has four XMM/YMM register operands:
code
VPCMOV dest, src1, src2, selector
code
The 256-bit version executes the following pseudocode in parallel:
code format="cpp"
for (int i = 0; i < 256; i++)
    dest[i] = selector[i] ? src1[i] : src2[i];
code
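The bit-level select above is equivalent to a mask-and-merge on whole machine words. The following portable sketch models the 256-bit form with four 64-bit words; the names Vec256 and pcmov are hypothetical.

```cpp
#include <array>
#include <cstdint>

using Vec256 = std::array<std::uint64_t, 4>;  // stand-in for one YMM register

// Scalar model of VPCMOV: where a selector bit is 1 take the bit from src1,
// otherwise take it from src2.
Vec256 pcmov(const Vec256& src1, const Vec256& src2, const Vec256& selector) {
    Vec256 dest{};
    for (int i = 0; i < 4; ++i)
        dest[i] = (src1[i] & selector[i]) | (src2[i] & ~selector[i]);
    return dest;
}
```

Without VPCMOV, the same merge takes an AND, an AND-NOT and an OR, which is what the single instruction folds together.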

Packed Permute Bytes
The Packed Permute Bytes (**VPPERM**) instruction can shuffle 16 bytes out of 32 bytes of input and perform one of a variety of operations on each byte. VPPERM has four XMM register operands:
code
VPPERM dest, src1, src2, selector
code
For each of the 16 destination bytes, the low five bits of the corresponding selector byte address one of the 32 input bytes (from src1 and src2), and the high three bits choose a logical operation, including bit reversal:
code format="cpp"
signed char   src[32];    // src2:src1
unsigned char select[16];
signed char   dest[16];
for (int i = 0; i < 16; i++) {
    unsigned char opera = select[i] >> 5;  // unsigned shift
    unsigned char idx32 = select[i] & 31;

    switch (opera) {
        case 0: dest[i] =  src[idx32]; break;
        case 1: dest[i] = ~src[idx32]; break;
        case 2: dest[i] =  bitreverse(src[idx32]); break;
        case 3: dest[i] = ~bitreverse(src[idx32]); break;
        case 4: dest[i] = 0x00; break;
        case 5: dest[i] = 0xFF; break;
        case 6: dest[i] =  src[idx32] >> 7; break;  // signed shift
        case 7: dest[i] = ~src[idx32] >> 7; break;  // signed shift
    }
}
code
The "bit reverse" operation is novel on x86 (some other architectures, like ARM, already have fast bit reverse instructions). This allows extremely fast reversal of bitboards. Since VPPERM can simultaneously reverse bits and bytes, it can, for instance, reverse two bitboards in one run, even from different sources, which, besides other applications, makes Hyperbola Quintessence work for all four lines.
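To make the bitboard reversal concrete, the sketch below models VPPERM in portable scalar code and builds a selector that reverses both the byte order and the bits within each byte of one 64-bit bitboard. The names bitreverse8, pperm and reverse_bitboard are hypothetical, and the model simply follows the pseudocode above rather than any intrinsic.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;
using Bytes32 = std::array<std::uint8_t, 32>;

// Reverse the bit order within one byte.
std::uint8_t bitreverse8(std::uint8_t b) {
    b = std::uint8_t(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
    b = std::uint8_t(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = std::uint8_t(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}

// Scalar model of VPPERM following the pseudocode above.
Bytes16 pperm(const Bytes32& src, const Bytes16& select) {
    Bytes16 dest{};
    for (int i = 0; i < 16; ++i) {
        std::uint8_t v = src[select[i] & 31];
        switch (select[i] >> 5) {
            case 0: break;
            case 1: v = std::uint8_t(~v); break;
            case 2: v = bitreverse8(v); break;
            case 3: v = std::uint8_t(~bitreverse8(v)); break;
            case 4: v = 0x00; break;
            case 5: v = 0xFF; break;
            case 6: v = (v & 0x80) ? 0xFF : 0x00; break;  // replicate sign bit
            case 7: v = (v & 0x80) ? 0x00 : 0xFF; break;  // replicate inverted sign bit
        }
        dest[i] = v;
    }
    return dest;
}

// Reverse all 64 bits of a bitboard with a single modelled permute.
std::uint64_t reverse_bitboard(std::uint64_t bb) {
    Bytes32 src{};
    for (int i = 0; i < 8; ++i) src[i] = std::uint8_t(bb >> (8 * i));
    Bytes16 select{};
    for (int i = 0; i < 8; ++i)
        select[i] = std::uint8_t((2 << 5) | (7 - i));  // op 2: bit reverse, bytes swapped
    for (int i = 8; i < 16; ++i)
        select[i] = std::uint8_t(4 << 5);              // op 4: force zero
    Bytes16 dest = pperm(src, select);
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) r |= std::uint64_t(dest[i]) << (8 * i);
    return r;
}
```

Filling the upper eight selector bytes with indices into a second bitboard instead of zeros would reverse two bitboards in the same permute.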

Generalized Shift and Rotate
XOP has general logical (unsigned) and arithmetic (signed) shifts and rotates on 128-bit XMM registers. Unlike the existing SSE shift instructions, the XOP variants allow each element of a byte, word, dword or qword vector to be shifted or rotated by a different amount. If the count value is positive, bits are shifted or rotated to the left, otherwise to the right. All these new instructions require three operands:
code
VPROT* dest, src, fixed-count
VPROT* dest, src, variable-count-src
VPSHL* dest, src, variable-count-src
VPSHA* dest, src, variable-count-src
code
//* either B, W, D, or Q//.


|| [[image:VPSHLB.JPG]] ||
|| VPSHLB - 16 individual left or right shifts ||
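The per-element behaviour of the logical byte shift can be modelled in a few lines of portable code. This is a sketch only: pshlb is a hypothetical name, and counts are assumed to lie in the range -7 to 7, since the model does not try to reproduce what the hardware does for larger magnitudes.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;
using Counts16 = std::array<std::int8_t, 16>;

// Scalar model of VPSHLB: every byte gets its own signed shift count,
// positive counts shift left, negative counts shift right (logical).
// Assumption: each count is within -7..7.
Bytes16 pshlb(const Bytes16& src, const Counts16& count) {
    Bytes16 dest{};
    for (int i = 0; i < 16; ++i) {
        int c = count[i];
        dest[i] = (c >= 0) ? std::uint8_t(src[i] << c)
                           : std::uint8_t(src[i] >> -c);
    }
    return dest;
}
```

Because each byte is shifted in isolation, bits that leave one byte never enter its neighbour.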

Applications
The bytewise shifts allow horizontal one-step shifts of bitboards without wraps over rank boundaries from the A- to the H-file or vice versa. While one bitboard (8 bytes) is shifted left, the other one can be shifted right, for instance for white pawn attacks:
code format="cpp"
#include <x86intrin.h>  /* XOP intrinsics with GCC/Clang, compile with -mxop */

__m128i noEa_noWe_Attacks(__m128i wPawns /* {wp:wp} */) {
    /* high qword: +1 per byte (east), low qword: -1 per byte (west) */
    const __m128i shifts = _mm_set_epi64x(0x0101010101010101LL, -1LL);
    __m128i b = _mm_shl_epi8(wPawns, shifts); /* east:west */
    b = _mm_slli_epi64(b, 8);                 /* north */
    return b;
}
code

=See Also=
 * AltiVec
 * AVX
 * AVX2
 * AVX-512
 * SIMD and SWAR Techniques
 * SSE
 * SSE2
 * SSE3
 * SSSE3
 * SSE4
 * SSE5

=Manuals=
 * Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions (pdf)
 * Software Optimization Guide for AMD Family 15h Processors (pdf)

=External Links=
 * XOP instruction set from Wikipedia
 * Stop the instruction set war by Agner Fog
 * Population count using XOP instructions by Wojciech Muła, December 16, 2016
 * XOP Intrinsics Added for Visual Studio 2010 SP1 from MSDN Library
