AVX2



**Advanced Vector Extensions 2** (AVX2) is an expansion of the AVX instruction set. AVX2 added 256-bit versions of the 128-bit SSE2 integer instructions. Together with BMI2, it has been part of Intel's [|Haswell] architecture since 2013, and of AMD's [|Excavator] microarchitecture since 2015.

=Features=
Besides expanding most integer AVX instructions to 256 bits, AVX2 has 3-operand general-purpose bit manipulation and multiply, vector shifts, doubleword- and quadword-granularity any-to-any permutes, and 3-operand [|fused multiply-accumulate] support. An important catch is that not all of the instructions are simple generalizations of their 128-bit equivalents: many work "in-lane", applying the same 128-bit operation to each 128-bit half of the register rather than a single 256-bit generalization of the operation. For example:
||~ Set ||~ Instruction ||~ Result ||
||~ AVX || **vpunpckldq** xmm0, ABCD, EFGH || xmm0 := AEBF ||
||~ AVX2 || **vpunpckldq** ymm0, ABCDEFGH, IJKLMNOP || ymm0 := AIBJEMFN ||

If vpunpckldq had been expanded in the more intuitive fashion, the result of the AVX2 operation would be AIBJCKDL. The reason for this design might be to allow AVX to be implemented more easily with two separate 128-bit arithmetic units.
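The difference between the two behaviours can be sketched with a scalar model. This is only an illustration, not intrinsic code: each character stands for one 32-bit element, the leftmost character is taken to be the lowest element, and the function names `unpackLowInLane` and `unpackLowFullWidth` are made up for this example.

```cpp
#include <cstddef>
#include <string>

// In-lane behaviour: every group of four elements (one 128-bit lane)
// interleaves its own two lowest elements from a and b.
std::string unpackLowInLane(const std::string& a, const std::string& b) {
    std::string r;
    for (std::size_t lane = 0; lane + 4 <= a.size() && lane + 4 <= b.size(); lane += 4) {
        r += a[lane];
        r += b[lane];
        r += a[lane + 1];
        r += b[lane + 1];
    }
    return r;
}

// Hypothetical full-width generalization: interleave the lower half of
// the whole register, ignoring lane boundaries.
std::string unpackLowFullWidth(const std::string& a, const std::string& b) {
    std::string r;
    for (std::size_t i = 0; i < a.size() / 2 && i < b.size() / 2; ++i) {
        r += a[i];
        r += b[i];
    }
    return r;
}
```

With the operands from the table, `unpackLowInLane("ABCDEFGH", "IJKLMNOP")` yields `AIBJEMFN`, while the full-width variant yields `AIBJCKDL`.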

Some AVX2 instructions, such as type conversion instructions, take both xmm and ymm registers as arguments. For example:
||~ Set ||~ Instruction ||~ Operation ||
||~ AVX || **vpmovzxbw** xmm1, xmm2/m64 || xmm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m64) ||
||~ AVX2 || **vpmovzxbw** ymm1, xmm2/m128 || ymm1 := Packed_Zero_Extend_Byte_To_Word(xmm2/m128) ||

=Individual Vector Shifts=
With AVX2, each data element, such as a bitboard of a quad-bitboard, may be shifted left or right individually, as specified by the second source operand, with the following assembly [|mnemonics] and C intrinsic equivalents:
||~ Instruction ||~ Description ||~ Intrinsic ||
|| || Variable Bit Shift Right Logical || ||
|| || Variable Bit Shift Left Logical || ||
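The semantics of the per-element shifts on 64-bit elements (VPSLLVQ and VPSRLVQ, exposed as `_mm256_sllv_epi64` and `_mm256_srlv_epi64`) can be modelled in portable scalar code. The sketch below is such a model, not the intrinsic itself; the type `Quad` and the function names are assumptions made for illustration. Note that, unlike a plain C++ shift, a count above 63 is well defined for these instructions and clears the element.

```cpp
#include <array>
#include <cstdint>

using U64  = std::uint64_t;
using Quad = std::array<U64, 4>;   // hypothetical stand-in for one ymm register

// Each element is shifted left by its own count; counts above 63 give zero.
Quad shiftLeftVariable(const Quad& v, const Quad& count) {
    Quad r{};
    for (int i = 0; i < 4; ++i)
        r[i] = count[i] < 64 ? v[i] << count[i] : 0;
    return r;
}

// Each element is shifted right (logically) by its own count.
Quad shiftRightVariable(const Quad& v, const Quad& count) {
    Quad r{};
    for (int i = 0; i < 4; ++i)
        r[i] = count[i] < 64 ? v[i] >> count[i] : 0;
    return r;
}
```

Shifting four copies of the same bitboard by `{1, 8, 7, 9}` thus produces four different direction steps in one operation, which is what the applications below rely on.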

=Applications=
With an appropriate quad-bitboard class, one may generate attacks in up to four different directions at once using individual shifts, for instance knight attacks, or sliding piece attacks with Dumb7Fill, which generates all positive or all negative ray attacks in one go when the orthogonal and the diagonal sliding pieces are each passed twice.


Knight Attacks
code
        noNoWe    noNoEa
            +15  +17
             |     |
noWeWe  +6 __|     |__+10  noEaEa
              \   /
               >0<
           __ /   \ __
soWeWe -10   |     |   -6  soEaEa
             |     |
            -17  -15
        soSoWe    soSoEa
code
code format="cpp"
QBB noEaEa_noNoEa_noNoWe_noWeWe(U64 knights) {
   const QBB qmask  (notAB, notA, notH, notGH); /* clears wraps around the board edges */
   const QBB qshift (10, 17, 15, 6);            /* one shift amount per direction */
   QBB qknights (knights);                      /* knights copied into all four bitboards */
   return (qknights << qshift) & qmask;
}

QBB soWeWe_soSoWe_soSoEa_soEaEa(U64 knights) {
   const QBB qmask  (notGH, notH, notA, notAB);
   const QBB qshift (10, 17, 15, 6);
   QBB qknights (knights);
   return (qknights >> qshift) & qmask;
}
code
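The routines above presuppose a `QBB` class and file masks that the page does not define. The following is a minimal scalar stand-in that makes them executable without AVX2 hardware. It is a sketch under stated assumptions: the square mapping a1 = bit 0, h8 = bit 63, the mask values, the `QBB` layout, and the helper `knightAttacks` are all illustrative, and a real implementation would presumably wrap a `__m256i` instead of four separate words.

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// File masks, assuming a1 = bit 0, h1 = bit 7, h8 = bit 63
const U64 notA  = 0xfefefefefefefefeULL;
const U64 notAB = 0xfcfcfcfcfcfcfcfcULL;
const U64 notH  = 0x7f7f7f7f7f7f7f7fULL;
const U64 notGH = 0x3f3f3f3f3f3f3f3fULL;

// Hypothetical scalar quad-bitboard: four bitboards handled element-wise
struct QBB {
    U64 bb[4];
    explicit QBB(U64 b) : bb{b, b, b, b} {}
    QBB(U64 b0, U64 b1, U64 b2, U64 b3) : bb{b0, b1, b2, b3} {}
};

// Element-wise operators; shift counts are assumed to be below 64
inline QBB operator<<(const QBB& a, const QBB& n) {
    return QBB(a.bb[0] << n.bb[0], a.bb[1] << n.bb[1], a.bb[2] << n.bb[2], a.bb[3] << n.bb[3]);
}
inline QBB operator>>(const QBB& a, const QBB& n) {
    return QBB(a.bb[0] >> n.bb[0], a.bb[1] >> n.bb[1], a.bb[2] >> n.bb[2], a.bb[3] >> n.bb[3]);
}
inline QBB operator&(const QBB& a, const QBB& b) {
    return QBB(a.bb[0] & b.bb[0], a.bb[1] & b.bb[1], a.bb[2] & b.bb[2], a.bb[3] & b.bb[3]);
}

QBB noEaEa_noNoEa_noNoWe_noWeWe(U64 knights) {
    const QBB qmask(notAB, notA, notH, notGH);
    const QBB qshift(10, 17, 15, 6);
    return (QBB(knights) << qshift) & qmask;
}

QBB soWeWe_soSoWe_soSoEa_soEaEa(U64 knights) {
    const QBB qmask(notGH, notH, notA, notAB);
    const QBB qshift(10, 17, 15, 6);
    return (QBB(knights) >> qshift) & qmask;
}

// Union of all eight directions as a single bitboard
U64 knightAttacks(U64 knights) {
    const QBB n = noEaEa_noNoEa_noNoWe_noWeWe(knights);
    const QBB s = soWeWe_soSoWe_soSoEa_soEaEa(knights);
    return n.bb[0] | n.bb[1] | n.bb[2] | n.bb[3]
         | s.bb[0] | s.bb[1] | s.bb[2] | s.bb[3];
}
```

Under the assumed mapping, a knight on e4 is `1ULL << 28`, and `knightAttacks` returns its eight target squares as one bitboard. A knight in the corner yields only the two targets that survive the file masks.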

Dumb7Fill
code
  northwest    north   northeast
  noWe         nort         noEa
          +7    +8    +9
              \  |  /
  west    -1 <-  0 -> +1    east
              /  |  \
          -9    -8    -7
  soWe         sout         soEa
  southwest    south   southeast
code
code format="cpp"
/* qsliders holds {rq, rq, bq, bq}: rooks/queens twice, bishops/queens twice */
QBB east_nort_noWe_noEa_Attacks(QBB qsliders, U64 empty) {
   const QBB qmask  (notA, -1, notH, notA);
   const QBB qshift (1, 8, 7, 9);
   QBB qflood (qsliders);
   QBB qempty = QBB(empty) & qmask;   /* masked empty squares prevent wraps */
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |= qsliders = (qsliders << qshift) & qempty;
   qflood |=            (qsliders << qshift) & qempty;
   return               (qflood   << qshift) & qmask;
}

/* qsliders holds {rq, rq, bq, bq} as above */
QBB west_sout_soEa_soWe_Attacks(QBB qsliders, U64 empty) {
   const QBB qmask  (notH, -1, notA, notH);
   const QBB qshift (1, 8, 7, 9);
   QBB qflood (qsliders);
   QBB qempty = QBB(empty) & qmask;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |= qsliders = (qsliders >> qshift) & qempty;
   qflood |=            (qsliders >> qshift) & qempty;
   return               (qflood   >> qshift) & qmask;
}
code
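To see what a single element of the quad-bitboard computes in the fill, the east direction can be written out for one plain bitboard. This is an illustrative scalar sketch, assuming the mapping a1 = bit 0 and h1 = bit 7; the function name `eastAttacks` is made up for this example, and the loop is equivalent to the six unrolled fill steps above.

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// One lane of the quad-bitboard fill: east direction, shift +1.
U64 eastAttacks(U64 sliders, U64 empty) {
    const U64 notA = 0xfefefefefefefefeULL;
    U64 flood = sliders;
    empty &= notA;                         // stops the fill from wrapping h -> a
    for (int i = 0; i < 6; ++i) {          // six fill steps cover a whole rank
        sliders = (sliders << 1) & empty;
        flood |= sliders;
    }
    return (flood << 1) & notA;            // final step includes the first blocker
}
```

For a rook on a1 with an otherwise empty board, the result covers b1 to h1. With a blocker on d1 the attack set ends on d1, the blocker square itself.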

Bitboard Permutation
For each bitboard in a destination quad-bitboard, the Qwords Element Permutation (**VPERMQ**) instruction selects one bitboard of a source quad-bitboard. This permits a bitboard in the source operand to be copied to multiple locations in the destination.
code format="cpp"
destQBB.bb[0] = sourceQBB.bb[ (imm8 >> 0) & 3 ];
destQBB.bb[1] = sourceQBB.bb[ (imm8 >> 2) & 3 ];
destQBB.bb[2] = sourceQBB.bb[ (imm8 >> 4) & 3 ];
destQBB.bb[3] = sourceQBB.bb[ (imm8 >> 6) & 3 ];
code
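A few concrete selector values may make the encoding easier to read. The scalar model below follows the four assignments above; the type `Quad` and the function name `permuteQuad` are assumptions for illustration and do not refer to an existing API.

```cpp
#include <array>
#include <cstdint>

using U64  = std::uint64_t;
using Quad = std::array<U64, 4>;   // hypothetical stand-in for a quad-bitboard

// Two selector bits per destination element pick the source element.
Quad permuteQuad(const Quad& src, unsigned imm8) {
    Quad dest{};
    for (int i = 0; i < 4; ++i)
        dest[i] = src[(imm8 >> (2 * i)) & 3];
    return dest;
}
```

With this model, `imm8 = 0x00` broadcasts element 0 to all four positions, `imm8 = 0x1B` reverses the element order, and `imm8 = 0xE4` leaves the quad-bitboard unchanged.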

Vertical Nibble
The following code extracts the piece code as a "vertical nibble" from a quad-bitboard used as board representation inside a register, "indexed" by square. The idea is to shift the square's bit to the leftmost bit, the [|sign bit] of each bitboard, and then to use the //VPMOVMSKB// instruction to get the sign bits of all 32 bytes into a general-purpose register. Unfortunately, there is no //VPMOVMSKQ// to get only the signs of the four bitboards, so some more masking and mapping is required to get the four-bit piece code ...
code format="cpp"
int getPiece (__m256i qbb, U64 sq) {
   __m128i shift = _mm_cvtsi64x_si128( sq ^ 63 );      /* left shift amount 63-sq */
   qbb = _mm256_sll_epi64( qbb, shift );               /* squares to signs */
   uint32_t qbbsigns = _mm256_movemask_epi8( qbb );    /* get sign bits of 32 bytes */
   return ((qbbsigns & 0x80808080) * 0x00204081) >> 28; /* mask, nibble-map, shift */
}
code
... using these intrinsics ...
 * [|_mm_cvtsi64x_si128]
 * [|_mm256_sll_epi64]
 * [|_mm256_movemask_epi8]

... with seven assembly instructions intended, assuming the quad-bitboard is passed in ymm0 and the square in rcx:
code
xor       rcx, 63          ; left shift amount 63-sq
movd      xmm6, rcx        ; shift amount via xmm
vpsllq    ymm6, ymm0, xmm6 ; squares to signs
vpmovmskb eax, ymm6        ; get sign bits of 32 bytes
and       eax, 0x80808080  ; mask the four bitboard sign bits
imul      eax, 0x00204081  ; map them to the upper nibble
shr       eax, 28          ; nibble as piece code
code
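The mask, multiply and shift step can be checked in isolation with scalar code. The sketch below is an illustration only: `signsToNibble` repeats the arithmetic of the last line of `getPiece`, `byteSignsAfterShift` emulates the shift followed by the byte sign mask, and `getPieceScalar` is a straightforward reference. All three names are made up for this example, and bitboard 0 is assumed to supply the least significant bit of the piece code.

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// Mask the four bitboard sign bits (bits 7, 15, 23, 31 of the byte mask)
// and map them to bits 28..31 by one multiplication, then shift down.
unsigned signsToNibble(std::uint32_t byteSigns) {
    return ((byteSigns & 0x80808080u) * 0x00204081u) >> 28;
}

// Emulates the left shift by 63-sq followed by a 32-bit byte sign mask.
std::uint32_t byteSignsAfterShift(const U64 bb[4], unsigned sq) {
    std::uint32_t mask = 0;
    for (int i = 0; i < 4; ++i) {
        const U64 shifted = bb[i] << (sq ^ 63);   // sq is assumed to be 0..63
        for (int byte = 0; byte < 8; ++byte) {
            const std::uint32_t sign =
                static_cast<std::uint32_t>((shifted >> (8 * byte + 7)) & 1);
            mask |= sign << (8 * i + byte);
        }
    }
    return mask;
}

// Reference: collect bit sq of each bitboard directly.
unsigned getPieceScalar(const U64 bb[4], unsigned sq) {
    unsigned code = 0;
    for (int i = 0; i < 4; ++i)
        code |= static_cast<unsigned>((bb[i] >> sq) & 1) << i;
    return code;
}
```

The multiplier has bits 0, 7, 14 and 21 set, so each of the four masked sign bits lands on its own position in bits 28 to 31, and no two partial products share a bit position, so no carry can disturb the result.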

=See also=
 * AltiVec
 * AVX
 * AVX-512
 * BMI2
 * DirGolem
 * MMX
 * SIMD and SWAR Techniques
 * SSE
 * SSE2
 * SSE3
 * SSSE3
 * SSE4

=Publications=
 * Wojciech Muła, [|Nathan Kurz], [|Daniel Lemire] (**2016**). //Faster Population Counts Using AVX2 Instructions//. [|arXiv:1611.07612] » AVX-512, Population Count
 * Wojciech Muła, [|Daniel Lemire] (**2017**). //Faster Base64 Encoding and Decoding Using AVX2 Instructions//. [|arXiv:1704.00605] » [|Base64]

=Manuals=
 * [|Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)
 * [|Intel® 64 and IA-32 Architectures Optimization Reference Manual] (pdf)

=External Links=
 * [|Advanced Vector Extensions 2 from Wikipedia]
 * [|Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions | Intel® Software]
 * [|Intel Intrinsics Guide - AVX2]
 * [|Intel Software Development Emulator], which can be used to experiment with AVX and AVX2 on a CPU that doesn't support them.
 * [|AMD and Intel incompatible - What to do?] from [|AMD Developer Central]
 * [|Stop the instruction set war] by [|Agner Fog]
 * [|Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2) | Intel® Developer Zone] by [|Thomas Willhalm], May 17, 2013
 * [|Haswell Instructions Latency]
