Home * Hardware * x86-64 * AVX-512
AVX-512,
an expansion of Intel's the AVX and AVX2 instructions using the EVEX prefix, featuring 32 512-bit wide vector SIMD registers zmm0 through zmm31, keeping either eight doubles or integer quad words such as bitboards, and eight (seven) dedicated mask registers which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, then the Nth element of the destination vector is overridden with the result of the operation; otherwise, dependent of the instruction, the element is zeroed, or overridden by an element from another source register (remains unchanged if same source). A vector mask register can be set using vector compare instructions, instructions to move contents from a GP register, or a special subset of vector mask arithmetic instructions.

Extensions

AVX-512 consists of multiple extensions not all meant to be supported by all AVX-512 capable processors. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations [1] AVX-512F and AVX-512CD were first implemented in the Xeon Phi processor and coprocessor known by the code name Knights Landing [2] , launched on June 20, 2016.

Extension

Description
Architecture
CPUID 7
Reg:Bit [3]
AVX-512F

Foundation
Knights Landing
EBX:16
AVX-512CD

Conflict Detection Instructions
EBX:28
AVX-512ER

Exponential and Reciprocal Instructions
EBX:27
AVX-512PF

Prefetch Instructions
EBX:26
AVX-512BW

Byte and Word Instructions
Skylake X
EBX:30
AVX-512DQ

Doubleword and Quadword Instructions
EBX:17
AVX-512VL

Vector Length Extensions
EBX:31
AVX-512IFMA

Integer Fused Multiply Add
Cannonlake
EBX:21
AVX-512VBMI

Vector Byte Manipulation Instructions
ECX:01
AVX-512VPOPCNTDQ

Vector Population Count
Knights Mill
ECX:14
AVX-512-4VNNIW

Vector Neural Network Instructions
Word variable precision
EDX:02
AVX-512-4FMAPS

Fused Multiply Accumulation
Packed Single precision
EDX:03

Selected Instructions

VPTERNLOG

AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise ternary logic, for instance to operate on vectors of bitboards. Three input vectors are bitwise combined by an operation determined by an immediate byte operand (imm8), whose 256 possible values corresponds with the boolean output vector of the truth table for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below [4] [5] :
Input


Output of Operations





imm8
0x00

0x01

0x16

0x17

0x28

0x80

0x88

0x96

0xca

0xe8

0xfe

0xff
#
a
b
c

C-exp
false

~(a|b|c)

a?~(b|c):b^c

minor(a,b,c)

c&(a^b)

a&b&c

b&c

a^b^c

a?b:c

major(a,b,c)

a|b|c

true
0
0
0
0


0

1

0

1

0

0

0

0

0

0

0

1
1
0
0
1


0

0

1

1

0

0

0

1

1

0

1

1
2
0
1
0


0

0

1

1

0

0

0

1

0

0

1

1
3
0
1
1


0

0

0

0

1

0

1

0

1

1

1

1
4
1
0
0


0

0

1

1

0

0

0

1

0

0

1

1
5
1
0
1


0

0

0

0

1

0

0

0

0

1

1

1
6
1
1
0


0

0

0

0

0

0

0

0

1

1

1

1
7
1
1
1


0

0

0

0

0

1

1

1

1

1

1

1
Following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version copies unmasked elements from s:
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64( __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);

VPLZCNT

AVX-512CD has Vector Leading Zero Count - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel [6] - using following intrinsics [7], where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from s:
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 m, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 m, __m512i a);

VPOPCNT

The future AVX-512VPOPCNTDQ extension has a vector population count instruction to count one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel [8] [9].

See also


Manuals


External Links

Blog Postings

Compiler Support


References

  1. ^ AVX-512 from Wikipedia
  2. ^ Additional AVX-512 instructions | Intel® Developer Zone by James Reinders, July 17, 2014
  3. ^ AVX512 table from Heise
  4. ^ AVX512: ternary functions evaluation by Wojciech Muła, March 03, 2015
  5. ^ Intel® Architecture Instruction Set Extensions Programming Reference (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE
  6. ^ Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search
  7. ^ VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values
  8. ^ sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub
  9. ^ Wojciech Muła, Nathan Kurz, Daniel Lemire (2016). Faster Population Counts Using AVX2 Instructions. arXiv:1611.07612

What links here?


Up one Level