Assume following masked occupancy on a file, diagonal or anti-diagonal - for simplicity as a flat byte (in a real bitboard with masked files or diagonals you have 6..8 scratch-bits between the bits of this byte). Thus, vertical flip reverses the bits of this byte.
o' = reverse(o)
r' = reverse(r)
normal reversed
o 11010101 10101011 o' occupancy including slider
r 00010000 00001000 r' slider
o-r 11000101 10100011 o'-r' 1. sub clears the slider
o-2r 10110101 10011011 o'-2r' 2. sub borrows "one" from next blocker
|......| \....../
normal 10110101 \..../
11011001 <--XXXX re-reverse
single
xor 01101100 -> to get the attack set
The first subtraction of (o-2r) is done implicitly by masking off the line, removing the slider from the occupied set. The second subtraction borrows a "one" from the next nearest blocker in msb-direction, falling through all unset bits outside the line. Of course, if no blocker is available, it borrows a "one" in usual arithmetical manner from the hidden 2^N. Only the changed bits (from original o, o') are the appropriate sliding attacks, including the blocker but excluding the slider. The result finally needs to be intersected with the same line mask as previously the occupancy, to clear the wrapped borrow one bits outside the file or diagonal. The fine optimization by Aleks Peshkov covers the final union of positive and negative ray-attacks. Since opposed ray-directions are always disjoint sets, using xor instead of bitwise or safes two instructions per line-attack. That is because bit-reversal or any mirroring or flipping is own inverse and distributive over xor.
Using x86-64bswap makes it quite competitive for bishops and files, on AMDK8 or K10 it has a latency of one cycle with a throughput of 1/3, like other cheap instructions. However, Intel is tad slower - while the recent Core 2 duo processors perform 128-bit SIMD-instructions with 128-bit alus, that is bitwise logical instructions with a latency of one cycle and throughput of 1/3, the general purpose bswap-instruction takes four cycles with a throughput of one. In Intel 64 and IA32 Architectures Optimization Reference Manual[4] , it is therefor recommend (5.6.5. endian conversion) to use the SSSE3pshufb instruction to swap bytes, available through intrinsic [5] , see SSSE3 Hyperbola Quintessence for bishop attacks.
As long there is no fast bit reversal instruction, there is no general solution for all four lines, and the rook attack-getter still needs some standard technique for the rank-attacks. Tim Cooijmans proposed to map the rank to the main diagonal before applying HQ, and to re-map the calculated attacks back to the original rank [6] .
Generalized set-wise attacks
Hyperbola quintessence can be generalized to work on whole sets of sliding pieces instead on individual pieces, whose ranks to be masked. The problem arising, when not masking the rank of the piece is that attacks wrap around the board during subtraction. This is shown below:
........ ........ 11111111
........ ........ 11111111
........ ........ 11111111
r = ........ , o = ........ this leads to o - 2*r = 11111111
........ ........ 11111111
........ ........ 11111111
....1... ....1... 11111...
........ ........ ........
instead of
........
........
........
........
........
........
11111...
........
This is not the intended result. It can be avioded, by bitwise adding an overflow barrier on the right-hand side. Afterwards this barrier needs to be removed from the attack set:
u64 right = 0x0101010101010101ULL;
......1. 1..1..1. 1...1111
....1... 1...1... .1111..1
......1. 11....1. 1.111111
r = .....1.. o = .11..1.. now: ((o | right) - 2*r) = .1.111.1
........ ........ ........
......1. ......1. 1111111.
.......1 .......1 1111111.
1....... 1....... 1......1
Note, that the 4th rank was not flooded by the subtraction! Next, the blockers are removed as usual:
...111.1
1111...1
.11111.1
o ^ ((o | right) - 2*r) = ..111..1
........
111111..
11111111
.......1
The last step is to remove the barrier at the right side that became visible after the last operation.
...111..
1111...
.11111..
(o ^ ((o | right) - 2*r) & ~right = ..111...
........
111111..
1111111.
........
This is the correct attack set for the left direction.
the complete algorithm for the left direction is therefore:
Long.reverse for a generalized attack getter even for ranks is too expensive, except a JVM can use a machine instruction rather than a bit-reversal routine:
/**
* @param occ - occupancy
* line - {0..3} {rank, file, diagonal, antidiagonal}
* sq - from square
* @return attacks from sq on line with occupancy occ
*/staticpubliclong attacks(long occ, int line, int sq){long forward = occ & maskEx[sq][line];long reverse = Long.reverse(forward);
forward -= bitMask[sq];
reverse -= bitMask[sq^63];
forward ^= Long.reverse(reverse);
forward &= maskEx[sq][line];return forward;}
AMD's SIMD Future in 2011
In 2007 AMD announced the hypothetical SSE5 instruction set [7] with a PPERM instruction [8] , which would be able to reverse vectors of two bitboards. In May 2009 AMD revised SSE5, to become more compatible with Intel's Advanced Vector Extensions, and also expanding the SIMD-register width from 128 to 256-bit, the renamed XOP instruction set [9] . However, the successor of the planned PPERM SSE5-instruction is now called VPPERM, but still works on 128-bit XMM registers [10] . VPPERM instruction requires four operands (xmm-registers):
VPPERM dest, src1, src2, selector
For each of 16 destination bytes the corresponding selector-byte addresses one of 32 input bytes (from src1, src2) and a logical operation including bit-reversal:
char src[32];// src2:src1char select[16];char dest[16];for(int i =0; i <16; i++){char opera = select[i]>>>5;// unsigned shiftchar idx32 = select[i]&31;switch( opera ){case0: dest[i]= src[idx32];break;case1: dest[i]= ~src[idx32];break;case2: dest[i]= bitreverse( src[idx32]);break;case3: dest[i]= ~bitreverse( src[idx32]);break;case4: dest[i]=0x00;break;case5: dest[i]=0xFF;break;case6: dest[i]= src[idx32]>>7;break;// signed shiftcase7: dest[i]= ~src[idx32]>>7;break;// signed shift}}
Since VPPERM can simultaneously reverse bits and bytes, it can for instance reverse two bitboards in one run, even from different sources, to make Hyperbola Quintessence work for all four lines.
Table of Contents
Reverse Math
Assume following masked occupancy on a file, diagonal or anti-diagonal - for simplicity as a flat byte (in a real bitboard with masked files or diagonals you have 6..8 scratch-bits between the bits of this byte). Thus, vertical flip reverses the bits of this byte.o' = reverse(o) r' = reverse(r) normal reversed o 11010101 10101011 o' occupancy including slider r 00010000 00001000 r' slider o-r 11000101 10100011 o'-r' 1. sub clears the slider o-2r 10110101 10011011 o'-2r' 2. sub borrows "one" from next blocker |......| \....../ normal 10110101 \..../ 11011001 <--XXXX re-reverse single xor 01101100 -> to get the attack setThe first subtraction of (o-2r) is done implicitly by masking off the line, removing the slider from the occupied set. The second subtraction borrows a "one" from the next nearest blocker in msb-direction, falling through all unset bits outside the line. Of course, if no blocker is available, it borrows a "one" in usual arithmetical manner from the hidden 2^N. Only the changed bits (from original o, o') are the appropriate sliding attacks, including the blocker but excluding the slider. The result finally needs to be intersected with the same line mask as previously the occupancy, to clear the wrapped borrow one bits outside the file or diagonal. The fine optimization by Aleks Peshkov covers the final union of positive and negative ray-attacks. Since opposed ray-directions are always disjoint sets, using xor instead of bitwise or safes two instructions per line-attack. That is because bit-reversal or any mirroring or flipping is own inverse and distributive over xor.thus
and finally
Beside shorter code this reduces register pressure - and clearly outperforms kindergarten bitboards - ipc-wise, in code size and memory requirements.
Source Code
C
The three C-routines only differ by the line-mask applied:For better locality of the line-attacks on the otherwise empty board, we may use an properly aligned array of structs.
Using x86-64 bswap makes it quite competitive for bishops and files, on AMD K8 or K10 it has a latency of one cycle with a throughput of 1/3, like other cheap instructions. However, Intel is tad slower - while the recent Core 2 duo processors perform 128-bit SIMD-instructions with 128-bit alus, that is bitwise logical instructions with a latency of one cycle and throughput of 1/3, the general purpose bswap-instruction takes four cycles with a throughput of one. In Intel 64 and IA32 Architectures Optimization Reference Manual [4] , it is therefor recommend (5.6.5. endian conversion) to use the SSSE3 pshufb instruction to swap bytes, available through intrinsic [5] , see SSSE3 Hyperbola Quintessence for bishop attacks.
As long there is no fast bit reversal instruction, there is no general solution for all four lines, and the rook attack-getter still needs some standard technique for the rank-attacks. Tim Cooijmans proposed to map the rank to the main diagonal before applying HQ, and to re-map the calculated attacks back to the original rank [6] .
Generalized set-wise attacks
Hyperbola quintessence can be generalized to work on whole sets of sliding pieces instead on individual pieces, whose ranks to be masked. The problem arising, when not masking the rank of the piece is that attacks wrap around the board during subtraction. This is shown below:
........ ........ 11111111 ........ ........ 11111111 ........ ........ 11111111 r = ........ , o = ........ this leads to o - 2*r = 11111111 ........ ........ 11111111 ........ ........ 11111111 ....1... ....1... 11111... ........ ........ ........ instead of ........ ........ ........ ........ ........ ........ 11111... ........This is not the intended result. It can be avioded, by bitwise adding an overflow barrier on the right-hand side. Afterwards this barrier needs to be removed from the attack set:u64 right = 0x0101010101010101ULL; ......1. 1..1..1. 1...1111 ....1... 1...1... .1111..1 ......1. 11....1. 1.111111 r = .....1.. o = .11..1.. now: ((o | right) - 2*r) = .1.111.1 ........ ........ ........ ......1. ......1. 1111111. .......1 .......1 1111111. 1....... 1....... 1......1 Note, that the 4th rank was not flooded by the subtraction! Next, the blockers are removed as usual: ...111.1 1111...1 .11111.1 o ^ ((o | right) - 2*r) = ..111..1 ........ 111111.. 11111111 .......1 The last step is to remove the barrier at the right side that became visible after the last operation. ...111.. 1111... .11111.. (o ^ ((o | right) - 2*r) & ~right = ..111... ........ 111111.. 1111111. ........ This is the correct attack set for the left direction.the complete algorithm for the left direction is therefore:For the right-hand direction, the bits need to be reversed rank-wise.
x86-64 assembly
The VC2005 generated x86-64 assembly of bishopAttacks indicates what ipc-monster Hyperbola Quintessence is:Java
Java programmer may try Long.reverseBytes:Long.reverse for a generalized attack getter even for ranks is too expensive, except a JVM can use a machine instruction rather than a bit-reversal routine:
AMD's SIMD Future in 2011
In 2007 AMD announced the hypothetical SSE5 instruction set [7] with a PPERM instruction [8] , which would be able to reverse vectors of two bitboards. In May 2009 AMD revised SSE5, to become more compatible with Intel's Advanced Vector Extensions, and also expanding the SIMD-register width from 128 to 256-bit, the renamed XOP instruction set [9] . However, the successor of the planned PPERM SSE5-instruction is now called VPPERM, but still works on 128-bit XMM registers [10] . VPPERM instruction requires four operands (xmm-registers):For each of 16 destination bytes the corresponding selector-byte addresses one of 32 input bytes (from src1, src2) and a logical operation including bit-reversal:
Since VPPERM can simultaneously reverse bits and bytes, it can for instance reverse two bitboards in one run, even from different sources, to make Hyperbola Quintessence work for all four lines.
See also
Forum Posts
External Links
Hyperbola
Quintessence
Misc
References
What links here?
Up one Level