
XOP (eXtended Operations),
an x86-64 SIMD instruction set extension by AMD, released with the Bulldozer microarchitecture. It provides essentially the same functionality as the SSE5 instruction set formerly proposed by AMD in August 2007, but with a revised encoding to improve compatibility with Intel's AVX and the VEX coding scheme.

The XOP instructions use a new three-byte XOP prefix preceding the opcode byte. This prefix replaces the 0F, 66, F2 and F3 prefix bytes as well as the REX prefix, and encodes additional information [1]. XOP support is indicated by bit 11 of ECX as returned by CPUID function EAX = 80000001H.
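Support can be probed at runtime. The following is a minimal sketch assuming a GCC/Clang-style toolchain that provides <cpuid.h>; the helper name hasXOP is ours for illustration, not part of any API:

```c
#include <cpuid.h>
#include <stdbool.h>

/* Query CPUID function 0x80000001 and test ECX bit 11 (XOP).
   hasXOP is a hypothetical helper name, not a library function. */
bool hasXOP(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;               /* extended leaf not available */
    return (ecx >> 11) & 1;
}
```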

Instructions


Integer Multiply, Add and Accumulate

XOP has a variety of multiply, add and accumulate instructions that operate on and produce packed signed integer values. These instructions are certainly worthwhile for evaluation purposes, for instance VPMACSSWW:
VPMACSWW.JPG
VPMACSSWW — Packed Multiply Accumulate Signed Word to Signed Word with Saturation
Since these instructions have the same performance as typical multiply instructions like PMULLW and PMADDWD and require the same execution resources, they effectively make the add step "free". The primary catch to using these instructions is latency; for example, the following sequence to sum a series of multiplies is extremely slow and will take 16 cycles:
Instruction                       | Starting Cycle | Ending Cycle
----------------------------------|----------------|-------------
vpmacssww xmm0, xmm1, xmm2, xmm0  |       0        |      3
vpmacssww xmm0, xmm3, xmm4, xmm0  |       4        |      7
vpmacssww xmm0, xmm5, xmm6, xmm0  |       8        |     11
vpmacssww xmm0, xmm7, xmm8, xmm0  |      12        |     15
Whereas the simple version, without XOP, will take just 8 cycles, albeit with more uops:
Instruction        | Starting Cycle | Ending Cycle
-------------------|----------------|-------------
pmullw xmm1, xmm2  |       0        |      3
pmullw xmm3, xmm4  |       1        |      4
pmullw xmm5, xmm6  |       2        |      5
pmullw xmm7, xmm8  |       3        |      6
paddsw xmm0, xmm1  |       1        |      2
paddsw xmm0, xmm3  |       2        |      3
paddsw xmm0, xmm5  |       3        |      4
paddsw xmm0, xmm7  |       4        |      5
Using multiple accumulators, or finding other ways to hide the latency, can avoid this problem.
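The same dependency-breaking idea can be sketched in plain C: a scalar dot product written with four independent partial sums lets consecutive multiply-adds overlap in the pipeline instead of serializing on a single accumulator. The function dot4 below is a hypothetical illustration, not an XOP intrinsic:

```c
#include <stdint.h>

/* Dot product with four independent accumulators; each partial sum
   forms its own dependency chain, so multiply-adds from different
   chains can execute in parallel. */
int64_t dot4(const int16_t *a, const int16_t *b, int n) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += (int32_t)a[i]   * b[i];
        s1 += (int32_t)a[i+1] * b[i+1];
        s2 += (int32_t)a[i+2] * b[i+2];
        s3 += (int32_t)a[i+3] * b[i+3];
    }
    for (; i < n; i++)              /* remainder, if n is not a multiple of 4 */
        s0 += (int32_t)a[i] * b[i];
    return (s0 + s1) + (s2 + s3);   /* combine the partial sums once, at the end */
}
```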

Horizontal Add and Subtract

XOP packed horizontal add and subtract signed integer instructions successively add adjacent pairs from the source XMM register and pack the (sign extended) integer result in the destination. For instance, VPHADDWQ can be used to continue the dot product from a previous Multiply, Add and Accumulate:
VPHADDWQ.JPG
VPHADDWQ - Packed Horizontal Add Signed Word to Signed Quadword
While some of these instructions may at first appear less powerful than the existing SSSE3 phaddw and phsubw, the latter tend to be rather slow in most implementations, while the XOP variants are all fast, single-uop instructions.
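As a sketch of the semantics (not the intrinsic itself), VPHADDWQ can be modeled in portable C: each 64-bit lane of the destination receives the sum of its four sign-extended 16-bit source words. The name phaddwq below is our own illustrative choice:

```c
#include <stdint.h>

/* Portable model of VPHADDWQ: for each 64-bit lane of a 128-bit
   source (8 signed words), sum the four sign-extended words of
   that lane into one signed quadword. */
void phaddwq(const int16_t src[8], int64_t dest[2]) {
    for (int lane = 0; lane < 2; lane++) {
        int64_t sum = 0;
        for (int w = 0; w < 4; w++)
            sum += src[lane * 4 + w];   /* int16_t sign-extends to int64_t */
        dest[lane] = sum;
    }
}
```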

Vector Conditional Moves

The Vector Conditional Moves (VPCMOV) instruction implements the C/C++ ternary '?:' operator at the bit level on 128-bit XMM [2] or 256-bit YMM [3] registers. VPCMOV has four XMM/YMM register operands:
 VPCMOV dest, src1, src2, selector
The 256-bit version executes the following pseudo-code in parallel:
for (int i = 0; i < 256; i++)
   dest[i] = selector[i] ? src1[i] : src2[i]
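Per bit, this is the classic bitwise select. A portable C model of what VPCMOV computes on each 64-bit chunk (cmov64 is an illustrative name, not an intrinsic) might look like:

```c
#include <stdint.h>

/* Bitwise conditional move, as VPCMOV performs on every bit of its
   operands: take the bit from src1 where the selector bit is 1,
   from src2 where it is 0. */
uint64_t cmov64(uint64_t src1, uint64_t src2, uint64_t selector) {
    return (src1 & selector) | (src2 & ~selector);
}
```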

Packed Permute Bytes

The Packed Permute Bytes (VPPERM) instruction can shuffle 16 bytes out of 32 bytes of input and perform a variety of operations on each byte [4]. VPPERM has four XMM register operands:
 VPPERM dest, src1, src2, selector
For each of the 16 destination bytes, the corresponding selector byte addresses one of the 32 input bytes (from src1, src2) and selects a logical operation, which may include bit reversal:
unsigned char src[32];   // src2:src1
unsigned char select[16];
unsigned char dest[16];
for (int i = 0; i < 16; i++) {
   unsigned char opera = select[i] >> 5; // top three bits select the operation
   unsigned char idx32 = select[i] & 31; // low five bits address an input byte
 
   switch ( opera ) {
      case 0: dest[i] =  src[idx32]; break;
      case 1: dest[i] = ~src[idx32]; break;
      case 2: dest[i] =  bitreverse( src[idx32] ); break;
      case 3: dest[i] = ~bitreverse( src[idx32] ); break;
      case 4: dest[i] = 0x00; break;
      case 5: dest[i] = 0xFF; break;
      case 6: dest[i] =   (signed char)src[idx32] >> 7;  break; // replicate sign bit
      case 7: dest[i] = ~((signed char)src[idx32] >> 7); break; // inverted sign bit
   }
}
The "bit reverse" operation is novel on x86 (some other architectures, like ARM, already have fast bit reverse instructions). This allows extremely fast reversal of bitboards. Since VPPERM can simultaneously reverse bits and bytes, it can for instance reverse two bitboards in one run, even from different sources, which, besides other applications, makes Hyperbola Quintessence work for all four line directions.
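For illustration, the bitreverse operation used in the pseudo-code above can be written in portable C as the usual swap-halves sequence; this models the per-byte reversal only, not the instruction itself:

```c
#include <stdint.h>

/* Reverse the bit order of one byte, as VPPERM's operations 2 and 3
   do per input byte: swap adjacent bits, then bit pairs, then nibbles. */
uint8_t bitreverse(uint8_t b) {
    b = (uint8_t)(((b & 0x55) << 1) | ((b >> 1) & 0x55)); /* swap adjacent bits */
    b = (uint8_t)(((b & 0x33) << 2) | ((b >> 2) & 0x33)); /* swap bit pairs     */
    return (uint8_t)((b << 4) | (b >> 4));                /* swap nibbles       */
}
```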

Generalized Shift and Rotate

XOP has generalized logical (unsigned) and arithmetic (signed) shifts and rotates on 128-bit XMM registers. Unlike the existing SSE shift instructions, the XOP variants allow each element of a byte, word, dword or qword vector to be shifted or rotated by a different amount. If the count value is positive, bits are shifted/rotated to the left, otherwise to the right. All these new instructions take three operands:
 VPROT* dest, src, fixed-count
 VPROT* dest, src, variable-count-src
 VPSHL* dest, src, variable-count-src
 VPSHA* dest, src, variable-count-src
* is either B, W, D, or Q.

VPSHLB.JPG
VPSHLB - 16 individual left or right shifts
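The per-element semantics can be modeled in portable C. The sketch below follows the description above: positive counts shift left, negative counts shift logically right, and counts whose magnitude reaches the element width zero the byte; pshlb is our illustrative name, not the intrinsic:

```c
#include <stdint.h>

/* Portable model of VPSHLB: each byte of src is shifted by the signed
   count in the corresponding byte of counts; positive counts shift
   left, negative counts shift logically right. */
void pshlb(const uint8_t src[16], const int8_t counts[16], uint8_t dest[16]) {
    for (int i = 0; i < 16; i++) {
        int c = counts[i];
        if (c >= 8 || c <= -8)
            dest[i] = 0;                      /* all bits shifted out */
        else if (c >= 0)
            dest[i] = (uint8_t)(src[i] << c); /* left shift           */
        else
            dest[i] = (uint8_t)(src[i] >> -c);/* logical right shift  */
    }
}
```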

Applications

The bytewise shifts [5] allow horizontal one-step shifts of bitboards without wraps across rank boundaries from the A- to the H-file or vice versa. While one bitboard (8 bytes) might be shifted left, the other might be shifted right, for instance for white pawn attacks:
__m128i noEa_noWe_Attacks( __m128i wPawns /* wp:wp, white pawns in both halves */ ) {
   const __m128i shifts = _mm_set_epi64x(0x0101010101010101LL, /* east: +1 per byte */
                                         -1LL);                /* west: -1 per byte */
   __m128i b = _mm_shl_epi8(wPawns, shifts); /* east : west */
   b = _mm_slli_epi64(b, 8);                 /* north */
   return b;
}
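Without XOP, the same white pawn attacks are commonly computed on plain 64-bit bitboards with file masks. A portable fallback sketch, assuming the usual little-endian rank mapping (LSB = a1, one rank per byte):

```c
#include <stdint.h>

/* White pawn attack generation on a single 64-bit bitboard:
   shift one rank up and one file east/west, masking off bits
   that would wrap across the A/H file boundary. */
static const uint64_t notAFile = 0xFEFEFEFEFEFEFEFEULL;
static const uint64_t notHFile = 0x7F7F7F7F7F7F7F7FULL;

uint64_t wPawnEastAttacks(uint64_t wp) { return (wp << 9) & notAFile; }
uint64_t wPawnWestAttacks(uint64_t wp) { return (wp << 7) & notHFile; }
```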

References

  1. ^ Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions (pdf)
  2. ^ _mm_cmov_si128
  3. ^ _mm256_cmov_si256
  4. ^ _mm_perm_epi8
  5. ^ _mm_shl_epi8
