Efficient generation of Moves and Controls of Sliding Pieces

Sa., 16:00, TU Dresden, WIL C207 Attack-Sets as Base for Bitboard Move-generation: Opposed to “None-sliding” pieces, Knight, King and Pawn, whose attacks are determined by its origin square only, sliding piece attacks like Rook-, Bishop- and Queen-attacks are dependend on other pieces as well, which may block the attacking ray in one particular ray-direction. In Quiescence Search the performance of generating (winning) capture moves is crucial. Opposed to classical square-centric board-representations, which require loops over squares, bitboards permit more efficient algorithms in generating sliding attacks.

Bitboard Basics

Bitboards are 64-bit integers and represent a finite set of up to 64 elements - all the squares of a chessboard with specific boolean properties of those squares. For instance whether squares are empty or occupied, or occupied by a specific kind of piece - or which is the main topic of this talk, whether a square is controlled or attacked (defended) by a specific kind of piece, especially attacked by sliding pieces.

There is a bijective one-to-one correspondence between bits of a bitboard and the squares of a board. There are 64! different mappings, but most commonly bits are enumerated "in order" along ranks or files (orthogonal). Some programs (Rotated Bitboards) keep redundant mappings and also enumerate consecutive bits along the diagonals or anti-diagonals.

If not stated otherwise, we further rely on Little-Endian Rank-File Mapping.

To represent the board we typically need one bitboard for each piece-type and color - likely encapsulated inside a class or structure, or as an array of bitboards as part of a position object. A one-bit inside a bitboard implies the existence of a piece of this piece-type on a certain square - one to one associated by the bit-position.

Boolean algebra is an algebraic structure that captures essential properties of both set operations and logic operations. Specifically, it deals with the set operations of intersection, union, complement - and the bitwise boolean operations of AND, OR, NOT. Bitwise boolean operations on 64-bit words are in fact 64 parallel operations on each bit.

Assume we have an attack set of a queen, and like to know whether the queen attacks opponent pieces it may capture, we need to 'and' the queen-attacks with the set of opponent pieces.

Shift acts like a multiplication (shift left) or division (shift right) by a power of two.
With orthogonal square mapping, shift with appropriate ray directions amounts of {1,7,8,9} are used to "move" the pieces one step in that direction, or to generate their square controls set-wise.

In the 8*8 board centric world with one scalar square-coordinate 0..63, each of the max eight neighboring squares can be determined by adding an offset for each direction. For border squares one has to care about overflows and wraps from a-file to h-file or vice versa. Some conditional code is needed to avoid that. Such code is usually part of move generation for particular pieces.

The mentioned square mapping implies following eight ray-directions as a compass-rose:

northwest north northeast
noWe nort noEa
+7 +8 +9
\ | /
west -1 <- 0 -> +1 east
/ | \
-9 -8 -7
soWe sout soEa
southwest south southeast

In the set-wise world of bitboards, where a square as member of a set is determined by an appropriate one-bit 2^square, the operation to apply such movements is shifting.

One has to consider wraps, which requires one further intersection.

To be aware of their scalar 64-bit origin, we use so far a type defined unsigned integer U64 in our C or C++ source snippets, the scalar 64-bit long in Java. Feel free to define a distinct type or wrap U64 into classes for better abstraction and type-safety during compile time. The macro C64 will append a suffix to 64-bit constants as required by some compilers:

typedef unsigned __int64 U64; // for the old microsoft compilers
typedef unsigned long long U64; // supported by MSC 13.00+ and C99
#define C64(constantU64) constantU64##ULL

One Step Only

The advantage with bitboards is, that the shift applies to all set bits in parallel, e.g. with all pawns. Vertical shifts by +-8 don't need any under- or overflow conditions since bits simply fall out and disappear.

U64 soutOne (U64 b){return b >>8;}
U64 nortOne (U64 b){return b <<8;}

Wraps from a-file to h-file or vice versa may be considered by only shifting subsets which may not wrap. Thus we can mask off the a- or h-file before or after a +-1,7,9 shift:

At some point bitboards require serialization - for instance if we need to process move-target sets to generate moves. Thanks to the Two's Complement, isolation and reset is cheap. To determine the bit- or square-index in the 0..63 range is tad more work ...

while( x ){*list++= log2OfPowerOfTwo (x &-x);// determine bit index, also referred as BitScan
x &= x-1;// reset LS1B}

or alternatively

while( x ){*list++= bitScanForward(x);// determine bit index of least significant one bit
x &= x-1;// reset LS1B}

Despite recent processors have hardware instruction for bitScan, two hashing methods are mentioned to determine the base two logarithm of a single populated Bitboard. One hash-index is based on multiplication and shift, while the second one is determined by modulo. As we will see, both techniques also appear in hashing of multiple bits of a line.

De Bruijn Multiplication

The classical De Bruijn bitscan, as described by Leiserson et al.^{[2]}, to determine the LS1B index by minimal perfect hashing. So called De Bruijn sequences were invented by the Dutch mathematician Nicolaas de Bruijn. A 64-bit De Bruijn sequence contains 64-overlapping unique 6-bit sequences, thus a ring of 64+5 bits. The five hidden "trailing" zeros are in fact common with the five leading zeros. There are 2^26 = 67108864 odd sequences with 6 leading binary zeros and 2^26 even sequences with 5 leading binary zeros, which may be calculated from the odd ones by shifting left one.

A multiplication with a power of two value (the isolated LS1B) acts like a left shift by it's exponent. Thus, if we multiply a 64-bit De Bruijn sequence with the isolated LS1B, we get a unique six bit sequence in the most significant bits. To obtain the bit-index we need to extract these upper six bits by shifting right the product, to lookup an array.

constint index64[64]={63, 0, 58, 1, 59, 47, 53, 2,
60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20,
55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41,
50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12,
44, 24, 15, 8, 23, 7, 6, 5};/**
* bitScanForward
* @author Charles E. Leiserson
* Harald Prokop
* Keith H. Randall
* "Using de Bruijn Sequences to Index a 1 in a Computer Word"
* @param bb bitboard to scan
* @precondition bb != 0
* @return index (0..63) of least significant one bit
*/int bitScanForward(U64 bb){const U64 debruijn64 = C64(0x07EDD5E59A4E28C2);return index64[((bb &-bb)* debruijn64)>>58];}

Bitscan by Modulo

Another idea is to apply a modulo operation of the isolated LS1B by the prime number 67 ^{[3]}^{[4]}^{[5]}. The remainder 0..66 can be used to perfectly hash the bit-index table. Three gaps are 0, 17, and 34, so the mod 67 can make a branchless trailing zero count.

Since div/mod is an expensive instruction, a modulo by constant is likely replaced by reciprocal fixed point multiplication to get the quotient and a second multiplication and difference to get the remainder. Compared with De Bruijn multiplication it is still to slow.

Multiplication with disjoint intermediate results was nominated as Kindergarten Multiplication.

Multiplying the masked A-file with the main-diagonal, maps the file-bits to the 8th rank, similar to a flip about the anti-diagonal A8-H1. Shifting down to the 1st rank, leaves the bits like a 90-degree anti clockwise rotation.

masked bits mapped to 8th rank
bits on A-file * main-diagonal = with garbage -> 1st rank
A . . . . . . . . . . . . . . 1 A B C D E F G H . . . . . . . .
B . . . . . . . . . . . . . 1 . B C D E F G H . . . . . . . . .
C . . . . . . . . . . . . 1 . . C D E F G H . . . . . . . . . .
D . . . . . . . . . . . 1 . . . D E F G H . . . >> . . . . . . . .
E . . . . . . . * . . . 1 . . . . = E F G H . . . . 56 . . . . . . . .
F . . . . . . . . . 1 . . . . . F G H . . . . . . . . . . . . .
G . . . . . . . . 1 . . . . . . G H . . . . . . . . . . . . . .
H . . . . . . . 1 . . . . . . . H . . . . . . . A B C D E F G H

That is straight forward multiplication of a masked diagonal or anti-diagonal with the A-file.
To mask the garbage off, we further shift down by 7 ranks.

masked diagonal * A-file mapped
to 8th rank -> 1st rank
. . . . . . . H 1 . . . . . . . A B C D E F G H . . . . . . . .
. . . . . . G . 1 . . . . . . . A B C D E F G . . . . . . . . .
. . . . . F . . 1 . . . . . . . A B C D E F . . >> . . . . . . . .
. . . . E . . . 1 . . . . . . . A B C D E . . . 56 . . . . . . . .
. . . D . . . . * 1 . . . . . . . = A B C D . . . . . . . . . . . .
. . C . . . . . 1 . . . . . . . A B C . . . . . . . . . . . . .
. B . . . . . . 1 . . . . . . . A B . . . . . . . . . . . . . .
A . . . . . . . 1 . . . . . . . A . . . . . . . A B C D E F G H

Attack-Sets as Base for Bitboard Move-generation

Opposed to "None-sliding" pieces, Knight, King and Pawn, whose controls or attacks are determined by its origin square only, sliding piece attacks like Rook-, Bishop- and Queen-attacks are dependent on other pieces as well, which may block the attacking ray in one particular ray-direction.

In Quiescence Search the performance of generating (winning) capture moves is crucial. Opposed to classical square-centric board-representations, which require loops over squares, bitboards permit more efficient algorithms in generating sliding attacks.

Sliding piece Attacks on the otherwise empty board

Attacks of single sliding pieces on the otherwise empty board or their disjoint subsets on lines or rays are that simple than none sliding pieces. We simply use pre-calculated tables for each piece-type, line or ray, indexed by square-index.

Ray Attacks

The mentioned square mapping implies following eight ray-directions as a compass-rose:

northwest north northeast
noWe nort noEa
+7 +8 +9
\ | /
west -1 <- 0 -> +1 east
/ | \
-9 -8 -7
soWe sout soEa
southwest south southeast

For one single sliding piece, one may consider subtraction from the set of blockers, to determine attack sets. Zeros, that are empty squares, between the nearest blocker and the sliding pieces became "ones", the nearest blocker becomes zero. In fact all flipped bits are the attacked squares in positive ray direction.

occupied 11000101
slider 00000100 subset of occupied
o-s 11000001 resets the slider in occupied
o-2s 10111101 borrows the one from nearest blocker
occupied 11000101
o^(o-2s) 01111000 all flipped bits are the attacks in most significant bit direction

To restrict the operations to a certain line, rank file or diagonal, a leading intersection with the line mask is necessary, same for the final result.

Subtraction and Reverse Subtraction of rooks from blockers

With s (sliding piece) single element set and subset of o (occupied):

positiveRayAttacks = o ^(o - 2s);

The first subtraction resets the siding-piece-bit in occupied.
The second subtraction borrows a one from nearest blocker (if any), and sets all intermediate zero bits (empty squares). The exclusive or gains all flipped bits, which turnes out to be the rook attacks in "positive" attacking direction.

For negative directions one needs "reverse" arithmetic (2 + 2 == 1).

with o' = reverse (o)
and s'= reverse (s)
negativeRayAttacks = reverse (o' ^ (o'- 2s'));

Since bit-reversal or any mirroring or flipping is distributive over xor:

reverse(a ^ b)== reverse (a)^ reverse(b)

One can reformulate negative rays

negativeRayAttacks = o ^ reverse (o' - 2s');

which leads to a simplification in the line-attacks, since positiveRayAttacks and negativeRayAttacks are disjoint sets for one particular sliding piece and may be combined by "xor", and o ^ o == 0.

lineAttacks = positiveRayAttacks ^ negativeRayAttacks;
lineAttacks = o ^(o-2s)^ reverse( o'-2s')^ o
lineAttacks =(o-2s)^ reverse( o'-2s')

Diagonal or vertical line occupancies need leading and trailing intersection with line-masks. Thanks to the little-endian versus big-endian war, recent processors have appropriate instructions to swap bytes (ranks) inside a 64-bit word for the reverse arithmetic, note that masked files or diagonals have only max one bit per rank:

U64 diagonalAttacks(U64 occ, int sqOfSlider){
U64 forward, reverse, slider, lineMask;
lineMask = diagonalMaskEx[sqOfSlider];// excludes square of slider
slider = singleBitboard[sqOfSlider];// single bit 1 << sq, 2^sq
forward = occ & lineMask;// also performs the first subtraction by clearing the s in o
reverse = byteswap( forward );// o'-s'
forward -=( slider );// o -2s
reverse -= byteswap( slider );// o'-2s'
forward ^= byteswap( reverse );return forward & lineMask;// mask the line again}

Flood Fill Techniques with multiple sliders

Fill approaches determine attacks set-wise for multiple pieces (i.e. rooks and queen(s)) in one particular ray-direction. The flood stops on each particular ray (either 8 ranks or files, or 15 diagonals or anti-diagonals), if a square is not empty.

Occluded Fill

see main article Dumb7Fill
For sliding piece-attacks one direction step can be repeated seven times, after each shift, intersecting with empty squares accumulating all intermediate results to a union set, which results in an occluded fill set, including the sliding piece but excluding the blocker.

U64 eastOccluded (U64 rooks, U64 empty){
empty = empty & notAFile;// make A-File all occupied, to consider H-A-wraps after shift
rooks |= empty &(rooks <<1);// 1. fill
rooks |= empty &(rooks <<1);// 2. fill
rooks |= empty &(rooks <<1);// 3. fill
rooks |= empty &(rooks <<1);// 4. fill
rooks |= empty &(rooks <<1);// 5. fill
rooks |= empty &(rooks <<1);// 6. fill
rooks |= empty &(rooks <<1);// 7. fill // not necessary for further attack generationreturn rooks;}

The Kogge-Stone parallel prefix algorithm for sliding piece attack generation was first introduced by Steffan Westcott in CCC^{[6]}. It is a parallel prefix approach of a occluded dumb7flood-fill, propagating sliding piece attacks in software like carries of a kogge-stone hardware adder^{[7]}. We need to pass sliders as generator set and the set of empty squares as propagator set. For appropriate attacks we need to shift one step further, considering wraps.

Ray-wise Attacks in one Direction

For square attack sets, occluded fill sets need to be shifted one more step. Thus, the occluded fill needs one fill cycle less.

U64 eastAttacks (U64 rooks, U64 empty){
empty = empty & notAFile;// make A-File all occupied, to consider H-A-wraps after shift
rooks |= empty &(rooks <<1);// 1. fill
rooks |= empty &(rooks <<1);// 2. fill
rooks |= empty &(rooks <<1);// 3. fill
rooks |= empty &(rooks <<1);// 4. fill
rooks |= empty &(rooks <<1);// 5. fill
rooks |= empty &(rooks <<1);// 6. fillreturn notAFile&(rooks <<1);}

Because fill-approaches keep distinct ray-directions, there is still a unique source-target square relation. Fill approaches with pure calculation are Cpu-intensive, but require no memory reads. They are a domain of SIMD instruction sets like SSE2 with sixteen 128-bit registers, to process two or four distinct sets (e.g. white and black, sliders and king as queen like meta-slider) per register and several ray-directions simultaneously - e.g. three 128-bit instructions per cycle for core2duo.

Keeping distinct ray-directions has some advantages, for instance in determining pinned pieces or discovered checks. Fill-Algorithms are great to feed in line-wise attack sets of single sliders, to determine a progressive mobility on the orthogonal rays, i.e. move targets in two moves, or trajectories against important squares or areas.

Lookup Techniques are used to hash pre-calculated attack-sets of a single pieces by square-index and occupancy of the affected lines.

Rook Attacks on the first Rank - as a base of occupancy lookup

see main article First Rank Attacks
The first rank is the ideal line to introduce occupancy lookups.

One Byte Only

Assume we (temporary) reduce the chess-board to one rank. Occupancy bitboard is one byte with up to 256 states. A rook attack-set from one of the eight squares (file) on this single rank is also only one byte. Thus we can construct an array of bytes[256][8], indexed by all 256 occupancies and 8 files, to lookup the pre-calculated rank-attack bytes.

Occupancy of the first rank = 01001010B

Rank-attacks ::= f (e-file, Occupancy) = 01110110B

If we think about to the occupancy lookup, we may recognize that the outer squares don't matter. There are no more squares behind. The outer squares are either attacked or not - independent from their occupancy state. We can use the six inner bits only as lookup-index with two additional cheap instruction. This reduces the lookup-table by factor of four.

Rotated Bitboards are a bitboard move generation technique invented independently by Robert Hyatt^{[8]} and by Ernst A. Heinz^{[9]} with Peter Gillgasch from the DarkThought team. This variation uses rotated copies of the occupancy in order to place bits along a file or diagonal in adjacent bits. Because of this, these bits can be easily extracted to obtain an occupancy map for a rank, file, or diagonal. This is used, along with the square of a slider, to lookup a bitboard containing attacks in an array.

Rotated bitboards are fast to extract the occupancy index and require 32 KByte for each line direction (rank, file, diagonal and anti-diagonal) for the lookup-tables, considering the inner six bits. However, there is more work in updating four occupied bitboards while making or unmaking moves, instead of one. The fast 64-bit multiplication of recent x64 processors (~4-cycles) with a throughput of up to one cycle, makes on the fly calculations of occupied states on files and diagonals more attractive.

Kindergarten Bitboards

see main article Kindergarten Bitboards
Kindergarten Bitboards perform a fill-multiplication and uses a shared lookup table for ranks and diagonals.

Ranks and Diagonals

Ranks and diagonals - that is their appropriate line-mask by square-index - are first intersected by the occupancy of the whole board. Doesn't matter whether the slider itself is cleared or not - it is redundant anyway, considered by the pre-calculated lookup-table. Since there is only up to one bit per file, the north-fill multiplication by the A-file maps the diagonal to the 8th rank. Or - since we only need the inner six bits, we combine the required shift left one by multiplying with the B-file. Shifting right the product by 58 (64-6) leaves the six-bit occupancy-index in the 0..63 range.

For instance the diagonal-attacks of a bishop on d4. 'A'-'H' represent the masked occupied bits along this diagonal, which are either zero or one. We need 'B'-'G' as six bit number:

masked line * B-File = B-G upper six occupancy 6 bit
. . . . . . . H . 1 . . . . . . . A[B C . E F G] . . . . . . . .
. . . . . . G . . 1 . . . . . . . A B C . E F G . . . . . . . .
. . . . . F . . . 1 . . . . . . . A B C . E F . . . . . . . . .
. . . . E . . . . 1 . . . . . . . A B C . E . . >> . . . . . . . .
. . . . . . . . * . 1 . . . . . . = . A B C . . . . 58 . . . . . . . .
. . C . . . . . . 1 . . . . . . . A B C . . . . . . . . . . . .
. B . . . . . . . 1 . . . . . . . A B . . . . . . . . . . . . .
A . . . . . . . . 1 . . . . . . . A . . . . . . [B C . E F G]. .

The pre-calculated lookup-table contains the attacks of the first rank - but eight copies in each rank or byte. It is indexed by the six bit occupied-state ('B'-'G') and the file of the slider's square. It needs to be intersected with the same line-mask as formerly the occupancy - to map the first rank attack bits to the appropriate line - that's all. Appropriate pre-calculated attack bits are represented by 'a'-'h':

fillUpAttacks[file == 3][B C . E F G]:
8 copies of rank the attack set
attacks & l-mask -> of this line
a b c . e f g h . . . . . . . h
a b c . e f g h . . . . . . g .
a b c . e f g h . . . . . f . .
a b c . e f g h . . . . e . . .
a b c . e f g h . . . . . . . .
a b c . e f g h . . c . . . . .
a b c . e f g h . b . . . . . .
a b c . e f g h a . . . . . . .

Fenner and Levene use masked line modulo 514 for the diagonals, modulo 257 for the anti-diagonals and mod 258 for files, to calculate the occupied index, but they didn't consider the inner six bits for a denser lookup. Of course, tables could made denser by storing indices, but that would require a second indirection. While Fenner and Levene used a Matlab 32-bit implementation to conclude their approach might be competitive, this is how it may be implemented in C by looking up pre-calculated attack-bitboards:

Magic bitboards uses a multiply-right-shift perfect hashing algorithm to index a attack bitboard database - which leaves both line-attacks of a bishop or rook in one run.

The magic bitboard approach was introduced by Lasse Hansen in the Winboard programming forum. He had the idea to hash the up to twelve relevant occupied bits of both directions of a rook- or bishop movement simultaneously ^{[12]}.

Pradu Kannan's improvements to Lasse Hansen's initial approach was to introduce a Java-like, two-dimensional array with individual size for each square and all it's relevant occupancies ^{[13]}. Big savings in table-size - since many squares on either orthogonals or diagonals require less bits than others, especially considering the inner six bits. While center squares are more dense for rooks, it is the opposite for bishops ^{[14]}.

How it works

A magic move-bitboard generation technique consists of four key steps:

Mask the relevant occupancy bits to form a key. For example if you had a rook on a1, the relevant occupancy bits will be from a2-a7 and b1-g1.

Multiply the key by a "magic number" to obtain an index mapping. This magic number can be generated by brute-force trial and error quite easily although it isn't 100% certain that the magic number is the best possible (see step 3).

Right shift the index mapping by 64-n bits to create an index, where n is the number of bits in the index. A better magic number will have less bits required in the index.

Use the index to reference a preinitialized move database.

The above illustration is correct for the b1 bishop, since it has only one ray and one bit per file and works kindergarten like. In general a one to one mapping of N scattered occupied bits to N consecutive bits is not always possible due to possible overflows. A perfect mapping of N scattered bits to N consecutive bits is likely not minimal for most squares. It requires one or two gaps inside the consecutive N bits, to avoid collisions, blowing up the table size.

But the purpose is to perfectly hash attack-sets rather than consecutive occupied bits.

The number of distinct attack-sets is much smaller than the relevant occupancies. Thus, with the help of constructive collisions, some initial guess how to map the bits, and/or trial and error, using exactly N bits is always possible. If we try hard enough to maximize constructive collisions - even less.

Perfect Hashing

Magic bitboards applies perfect hashing. A surjective function, to map the vector of all relevant occupancies to a range of attack-sets per square. The less bits the attack-set - the closer the blockers, the more those attack-sets are shared by occupancies with different, but redundant outer squares.

The cardinality of all relevant occupancies is determined by the number of bits to map, varying from five to twelve - thus, the cardinality is the power of two the number of bits, varying from 32 to 4096.

The cardinality of distinct attack-sets is determined by the product of the length of each of the max four direction rays greater than zero (or one). The rook on d4 has 3*4*3*4 = 144 distinct attack-sets, a bishop on a8 has only 7.

The ratio of both cardinalities, that is all relevant occupancies versus the all distinct attack-sets is illustrated below: As a quarter of a board - due to the symmetry, the other squares may deduced by flipping and mirroring. Noticeable is the huge 4096/49 ratio of 2^12 occupied states versus 7 times 7 attack-sets of the edge rooks - 12 bits instead of 6. Those "expensive" squares make constructive collisions very desirable. To become more "minimal" by saving an index bit - to halve down the table for one square or the other.

bishop on square

#occs/
#attset

rook on square

A

B

C

D

A

B

C

D

8

64/7

32/6

32/10

32/ 12

8

4096/49

2048/42

2048/ 70

2048/ 84

7

32/6

32/6

32/10

32/ 12

7

2048/42

1024/36

1024/ 60

1024/ 72

6

32/10

32/10

128/40

128/ 48

6

2048/70

1024/60

1024/100

1024/120

5

32/12

32/12

128/48

512/108

5

2048/84

1024/72

1024/120

1024/144

bishop on square

rook on square

A

B

C

D

A

B

C

D

8

9.14

5.33

3.20

2.67

8

83.59

48.76

29.26

24.38

7

5.33

5.33

3.20

2.67

7

48.76

28.44

17.07

14.22

6

3.20

3.20

3.20

2.67

6

29.26

17.07

10.24

8.53

5

2.67

2.67

2.67

4.74

5

24.38

14.22

8.53

7.11

The idea to implement minimal perfect hashing by an additional 16-bit indirection turned out to be slower.

Recent table sizes were about 38 KByte for the bishop attacks, but still about 800 KByte for rook attacks. That sounds huge, considering L1 and L2 (L3) cache-sizes and number of cachelines and pages needed - we likely fetch distinct cachelines for each different square or occupancy. On the other hand caches and pages become larger in future processors. And occupancy and squares of the lookups don't change that randomly inside a search that we can still expect a lot of L1-hits. Unfortunately changes in occupancy outside the blockers and therefor not affecting the attack-set will introduce some more cache misses.

Sample Code

Anyway, register usage and code size are important issues as well - and here magic bitboards are unbeatable - specially bishopAttacks, with respect to the relative small table.

U64 attacks_table[...];// ~840K Byte all rook and bishop attacksstruct SMagic {
U64* ptr;// pointer to the attack-table for one particular square
U64 mask;// to mask of both lines
U64 magic;// magic 64-bit factorint shift;// shift right };
SMagic mBishopTbl[64];
SMagic mRookTbl[64];
U64 bishopAttacks(U64 occ, enumSquare sq){
U64* aptr = mBishopTbl[sq].ptr;
occ &= mBishopTbl[sq].mask;
occ *= mBishopTbl[sq].magic;
occ >>= mBishopTbl[sq].shift;return aptr[occ];}
U64 rookAttacks(U64 occ, enumSquare sq){
U64* aptr = mRookTbl[sq].ptr;
occ &= mRookTbl[sq].mask;
occ *= mRookTbl[sq].magic;
occ >>= mRookTbl[sq].shift;return aptr[occ];}

Summary

The bitboard method for holding a board game appears to have been invented in the mid 1950's, by Arthur Samuel and was used in his checkers program. In computer chess bitboards were first described by Georgy Adelson-Velsky et al. in 1967 ^{[15]}, reprinted 1970 ^{[16]}. Bitboards were used both in Kaissa and independently in Chess. The invention and publication of Rotated Bitboards by Robert Hyatt and Peter Gillgasch with Ernst A. Heinz in the 90s was another milestone in the history of bitboards. While rotated bitboards was the mainstream in bitboard chess programs, Magic Bitboards become more populated nowadays. Finally, Steffan Westcott's innovations, too expensive with 32-bit x86 architectures, should be revisited with x86-64 and SIMD instructions in mind.

Home * Conferences * Workshop Chess and Mathematics * Efficient Generation of Sliding Piece Attacks## Table of Contents

Fakultät Mathematik und Naturwissenschaften Institut für Numerische MathematikWorkshop Chess and Mathematics^{[1]}Gerd Isenberg, Hattingen

Efficient generation of Moves and Controls of Sliding PiecesSa., 16:00, TU Dresden, WIL C207

Attack-Sets as Base for Bitboard Move-generation: Opposed to “None-sliding” pieces, Knight, King and Pawn, whose attacks are determined by its origin square only, sliding piece attacks like Rook-, Bishop- and Queen-attacks are dependend on other pieces as well, which may block the attacking ray in one particular ray-direction. In Quiescence Search the performance of generating (winning) capture moves is crucial. Opposed to classical square-centric board-representations, which require loops over squares, bitboards permit more efficient algorithms in generating sliding attacks.## Bitboard Basics

Bitboards are 64-bit integers and represent a finite set of up to 64 elements - all the squares of a chessboard with specific boolean properties of those squares. For instance whether squares are empty or occupied, or occupied by a specific kind of piece - or which is the main topic of this talk, whether a square is controlled or attacked (defended) by a specific kind of piece, especially attacked by sliding pieces.## Squares and Bitindex

see main article Square Mapping ConsiderationsThere is a bijective one-to-one correspondence between bits of a bitboard and the squares of a board. There are 64! different mappings, but most commonly bits are enumerated "in order" along ranks or files (orthogonal). Some programs (Rotated Bitboards) keep redundant mappings and also enumerate consecutive bits along the diagonals or anti-diagonals.

If not stated otherwise, we further rely on Little-Endian Rank-File Mapping.with following relations:

Diagonals may be enumerated in various ways determined by the difference of rankIndex and fileIndex:

Anti-Diagonals may be enumerated in various ways determined by the sum of rankIndex and fileIndex:

## Empty and Universe

The numerical values and set-wise representations of those sets, and how the appear as a board:

## Bitboard Board-Definition

see main article Bitboard Board-DefinitionTo represent the board we typically need one bitboard for each piece-type and color - likely encapsulated inside a class or structure, or as an array of bitboards as part of a position object. A one-bit inside a bitboard implies the existence of a piece of this piece-type on a certain square - one to one associated by the bit-position.

## Setwise Operations

see main article General Setwise OperationsBoolean algebra is an algebraic structure that captures essential properties of both set operations and logic operations. Specifically, it deals with the set operations of

intersection,union,complement- and the bitwise boolean operations ofAND, OR, NOT. Bitwise boolean operations on 64-bit words are in fact 64 parallel operations on each bit.Assume we have an attack set of a queen, and like to know whether the queen attacks opponent pieces it may capture, we need to 'and' the queen-attacks with the set of opponent pieces.

## Shifts

Shift acts like a multiplication (shift left) or division (shift right) by a power of two.With orthogonal square mapping, shift with appropriate ray directions amounts of {1,7,8,9} are used to "move" the pieces one step in that direction, or to generate their square controls set-wise.

In the 8*8 board centric world with one scalar square-coordinate 0..63, each of the max eight neighboring squares can be determined by adding an offset for each direction. For border squares one has to care about overflows and wraps from a-file to h-file or vice versa. Some conditional code is needed to avoid that. Such code is usually part of move generation for particular pieces.

The mentioned square mapping implies following eight ray-directions as a compass-rose:

In the set-wise world of bitboards, where a square as member of a set is determined by an appropriate one-bit 2^square, the operation to apply such movements is shifting.

One has to consider wraps, which requires one further intersection.

To be aware of their scalar 64-bit origin, we use so far a type defined unsigned integer U64 in our C or C++ source snippets, the scalar 64-bit long in Java. Feel free to define a distinct type or wrap U64 into classes for better abstraction and type-safety during compile time. The macro C64 will append a suffix to 64-bit constants as required by some compilers:## One Step Only

The advantage with bitboards is, that the shift applies to all set bits in parallel, e.g. with all pawns. Vertical shifts by +-8 don't need any under- or overflow conditions since bits simply fall out and disappear.Wraps from a-file to h-file or vice versa may be considered by only shifting subsets which may not wrap. Thus we can mask off the a- or h-file before or after a +-1,7,9 shift:

Post-shift masks, ...

## Pawn Attacks

see main article Pawn Attacks (Bitboards)Pawn Attacks set-wise as application of One Step Only.

## Bit-Twiddling relying on the Two's Complement

## LS-Bit-Isolation

Intersection with its Two's Complement isolates the least significant one bit:## LS-Bit-Reset

Intersection with its Ones' Decrement resets the least significant one bit:... since two's complement (-x) and ones' decrement (x-1) are complement sets.

## Converting Bitboards to Lists

see main article Bitboard Serialization and Traversing Subsets of a SetAt some point bitboards require serialization - for instance if we need to process move-target sets to generate moves. Thanks to the Two's Complement, isolation and reset is cheap. To determine the bit- or square-index in the 0..63 range is tad more work ...

or alternatively

## BitScan

see main article BitScanDespite recent processors have hardware instruction for bitScan, two hashing methods are mentioned to determine the base two logarithm of a single populated Bitboard. One hash-index is based on multiplication and shift, while the second one is determined by modulo. As we will see, both techniques also appear in hashing of multiple bits of a line.

## De Bruijn Multiplication

The classicalDe Bruijnbitscan, as described by Leiserson et al.^{[2]}, to determine the LS1B index by minimal perfect hashing. So called De Bruijn sequences were invented by the Dutch mathematician Nicolaas de Bruijn. A 64-bit De Bruijn sequence contains 64-overlapping unique 6-bit sequences, thus a ring of 64+5 bits. The five hidden "trailing" zeros are in fact common with the five leading zeros. There are 2^26 = 67108864 odd sequences with 6 leading binary zeros and 2^26 even sequences with 5 leading binary zeros, which may be calculated from the odd ones by shifting left one.A multiplication with a power of two value (the isolated LS1B) acts like a left shift by it's exponent. Thus, if we multiply a 64-bit De Bruijn sequence with the isolated LS1B, we get a unique six bit sequence in the most significant bits. To obtain the bit-index we need to extract these upper six bits by shifting right the product, to lookup an array.

See also generating your "private" De Bruijn Bitscan routine.

## Bitscan by Modulo

Another idea is to apply a modulo operation of the isolated LS1B by the prime number 67^{[3]}^{[4]}^{[5]}. The remainder 0..66 can be used to perfectly hash the bit-index table. Three gaps are 0, 17, and 34, so the mod 67 can make a branchless trailing zero count.Since div/mod is an expensive instruction, a modulo by constant is likely replaced by reciprocal fixed point multiplication to get the quotient and a second multiplication and difference to get the remainder. Compared with De Bruijn multiplication it is still to slow.

## Kindergarten Multiplication

see main article Flipping Mirroring and RotatingMultiplication with

disjointintermediate results was nominated as Kindergarten Multiplication.Multiplying the masked A-file with the main-diagonal, maps the file-bits to the 8th rank, similar to a flip about the anti-diagonal A8-H1. Shifting down to the 1st rank, leaves the bits like a 90-degree anti clockwise rotation.

That is straight forward multiplication of a masked diagonal or anti-diagonal with the A-file.

To mask the garbage off, we further shift down by 7 ranks.

## Attack-Sets as Base for Bitboard Move-generation

Opposed to "None-sliding" pieces, Knight, King and Pawn, whosecontrolsorattacksare determined by its originsquareonly, sliding piece attacks like Rook-, Bishop- and Queen-attacks are dependent on other pieces as well, which mayblockthe attacking ray in one particular ray-direction.In Quiescence Search the performance of generating (winning) capture moves is crucial. Opposed to classical square-centric board-representations, which require loops over squares, bitboards permit more efficient algorithms in generating sliding attacks.

## Sliding piece Attacks on the otherwise empty board

see main article On an empty BoardAttacks of single sliding pieces on the otherwise empty board or their disjoint subsets on lines or rays are that simple than none sliding pieces. We simply use pre-calculated tables for each piece-type, line or ray, indexed by square-index.

## Ray Attacks

The mentioned square mapping implies following eight ray-directions as a compass-rose:Positive Rays:

Negative Rays:

## Line Attacks

## Piece attacks

Piece attacks are the union of either orthogonal or diagonal lines. Queen attacks are the union of rook- and bishop attacks.## Sliding Attacks by Calculation

For one single sliding piece, one may consider subtraction from the set of blockers, to determine attack sets. Zeros, that are empty squares, between the nearest blocker and the sliding pieces became "ones", the nearest blocker becomes zero. In fact all flipped bits are the attacked squares in positive ray direction.To restrict the operations to a certain line, rank file or diagonal, a leading intersection with the line mask is necessary, same for the final result.

## Subtraction and Reverse Subtraction of rooks from blockers

see main article Subtracting a rook from a blocking piece and Hyperbola QuintessenceWith s (sliding piece) single element set and subset of o (occupied):

The first subtraction resets the siding-piece-bit in occupied.

The second subtraction borrows a one from nearest blocker (if any), and sets all intermediate zero bits (empty squares). The exclusive or gains all flipped bits, which turnes out to be the rook attacks in "positive" attacking direction.

For negative directions one needs "reverse" arithmetic (2 + 2 == 1).

Since bit-reversal or any mirroring or flipping is distributive over xor:

One can reformulate negative rays

which leads to a simplification in the line-attacks, since positiveRayAttacks and negativeRayAttacks are disjoint sets for one particular sliding piece and may be combined by "xor", and o ^ o == 0.

Diagonal or vertical line occupancies need leading and trailing intersection with line-masks. Thanks to the little-endian versus big-endian war, recent processors have appropriate instructions to swap bytes (ranks) inside a 64-bit word for the reverse arithmetic, note that masked files or diagonals have only max one bit per rank:

## Flood Fill Techniques with multiple sliders

Fill approaches determine attacks set-wise for multiple pieces (i.e. rooks and queen(s)) in one particular ray-direction. The flood stops on each particular ray (either 8 ranks or files, or 15 diagonals or anti-diagonals), if a square is not empty.## Occluded Fill

see main article Dumb7FillFor sliding piece-attacks one direction step can be repeated seven times, after each shift, intersecting with empty squares accumulating all intermediate results to a union set, which results in an occluded fill set, including the sliding piece but excluding the blocker.

## Kogge-Stone Parallel Prefix Algorithm

see main article Kogge-Stone Algorithm and Parallel Prefix AlgorithmsOccluded fill might be done more efficiently parallel prefix wise by Kogge-Stone Algorithms:

The Kogge-Stone parallel prefix algorithm for sliding piece attack generation was first introduced by Steffan Westcott in CCC

^{[6]}. It is a parallel prefix approach of a occluded dumb7 flood-fill, propagating sliding piece attacks in software like carries of a kogge-stone hardware adder^{[7]}. We need to pass sliders as generator set and the set of empty squares as propagator set. For appropriate attacks we need to shift one step further, considering wraps.## Ray-wise Attacks in one Direction

For square attack sets, occluded fill sets need to be shifted one more step. Thus, the occluded fill needs one fill cycle less.Because fill-approaches keep distinct ray-directions, there is still a unique source-target square relation. Fill approaches with pure calculation are Cpu-intensive, but require no memory reads. They are a domain of SIMD instruction sets like SSE2 with sixteen 128-bit registers, to process two or four distinct sets (e.g. white and black, sliders and king as queen like meta-slider) per register and several ray-directions simultaneously - e.g. three 128-bit instructions per cycle for core2duo.

Keeping distinct ray-directions has some advantages, for instance in determining pinned pieces or discovered checks. Fill-Algorithms are great to feed in line-wise attack sets of single sliders, to determine a progressive mobility on the orthogonal rays, i.e. move targets in two moves, or trajectories against important squares or areas.

## Lookup Techniques

Lookup Techniques are used to hash pre-calculated attack-sets of a single pieces by square-index and occupancy of the affected lines.## Rook Attacks on the first Rank - as a base of occupancy lookup

see main article First Rank AttacksThe first rank is the ideal line to introduce occupancy lookups.

## One Byte Only

Assume we (temporary) reduce the chess-board to one rank. Occupancy bitboard is one byte with up to 256 states. A rook attack-set from one of the eight squares (file) on this single rank is also only one byte. Thus we can construct an array of bytes[256][8], indexed by all 256 occupancies and 8 files, to lookup the pre-calculated rank-attack bytes.## The Outer Squares

If we think about to the occupancy lookup, we may recognize that the outer squares don't matter. There are no more squares behind. The outer squares are either attacked or not - independent from their occupancy state. We can use thesix inner bitsonly as lookup-index with two additional cheap instruction. This reduces the lookup-table by factor offour.## Other ranks

It is simple to shift other ranks to the first rank, and to shift back the attack set.## Rotated Bitboards

see main article Rotated BitboardsRotated Bitboardsare a bitboard move generation technique invented independently by Robert Hyatt^{[8]}and by Ernst A. Heinz^{[9]}with Peter Gillgasch from the DarkThought team. This variation uses rotated copies of the occupancy in order to place bits along a file or diagonal in adjacent bits. Because of this, these bits can be easily extracted to obtain an occupancy map for a rank, file, or diagonal. This is used, along with the square of a slider, to lookup a bitboard containing attacks in an array.Rotated bitboards are fast to extract the occupancy index and require 32 KByte for each line direction (rank, file, diagonal and anti-diagonal) for the lookup-tables, considering the inner six bits. However, there is more work in updating four occupied bitboards while making or unmaking moves, instead of one. The fast 64-bit multiplication of recent x64 processors (~4-cycles) with a throughput of up to one cycle, makes on the fly calculations of occupied states on files and diagonals more attractive.

## Kindergarten Bitboards

see main article Kindergarten BitboardsKindergarten Bitboards perform a fill-multiplication and uses a shared lookup table for ranks and diagonals.

## Ranks and Diagonals

Ranks and diagonals - that is their appropriate line-mask by square-index - are first intersected by the occupancy of the whole board. Doesn't matter whether the slider itself is cleared or not - it is redundant anyway, considered by the pre-calculated lookup-table. Since there is only up to one bit per file, the north-fill multiplication by the A-file maps the diagonal to the 8th rank. Or - since we only need the inner six bits, we combine the required shift left one by multiplying with the B-file. Shifting right the product by 58 (64-6) leaves the six-bit occupancy-index in the 0..63 range.For instance the diagonal-attacks of a bishop on d4. 'A'-'H' represent the masked occupied bits along this diagonal, which are either zero or one. We need 'B'-'G' as six bit number:

The pre-calculated lookup-table contains the attacks of the first rank - but eight copies in each rank or byte. It is indexed by the six bit occupied-state ('B'-'G') and the file of the slider's square. It needs to be intersected with the same line-mask as formerly the occupancy - to map the first rank attack bits to the appropriate line - that's all. Appropriate pre-calculated attack bits are represented by 'a'-'h':

Code spippets.

## Files

File attacks need tad more work:## Congruent Modulo Bitboards

see main article Congruent Modulo BitboardsCongruent Modulo Bitboardswas introduced by Trevor Fenner and Mark Levene in the ICGA Journal Vol. 31, No. 1 in 2008^{[10]}. While their Perfect Hashing approach provides great mathematical insights in Congruent Modulo arithmetic, their final conclusion in comparison with Hashing Dictionaries, Rotated Bitboards and Magic Bitboards was criticized by the obvious comparison with Kindergarten Bitboards^{[11]}.Fenner and Levene use masked line modulo 514 for the diagonals, modulo 257 for the anti-diagonals and mod 258 for files, to calculate the occupied index, but they didn't consider the inner six bits for a denser lookup. Of course, tables could made denser by storing indices, but that would require a second indirection. While Fenner and Levene used a Matlab 32-bit implementation to conclude their approach might be competitive, this is how it may be implemented in C by looking up pre-calculated attack-bitboards:

## Magic Bitboards

see main article Magic BitboardsMagic bitboards uses a multiply-right-shift perfect hashing algorithm to index a attack bitboard database - which leaves both line-attacks of a bishop or rook in one run.

The magic bitboard approach was introduced by Lasse Hansen in the Winboard programming forum. He had the idea to hash the up to twelve relevant occupied bits of

both directionsof a rook- or bishop movement simultaneously^{[12]}.Pradu Kannan's improvements to Lasse Hansen's initial approach was to introduce a Java-like, two-dimensional array with individual size for each square and all it's relevant occupancies

^{[13]}. Big savings in table-size - since many squares on either orthogonals or diagonals require less bits than others, especially considering the inner six bits. While center squares are more dense for rooks, it is the opposite for bishops^{[14]}.## How it works

A magic move-bitboard generation technique consists of four key steps:The above illustration is correct for the b1 bishop, since it has only one ray and one bit per file and works kindergarten like. In general a one to one mapping of N scattered occupied bits to N consecutive bits is not always possible due to possible overflows. A perfect mapping of N scattered bits to N consecutive bits is likely not minimal for most squares. It requires one or two gaps inside the consecutive N bits, to avoid collisions, blowing up the table size.

But the purpose is to perfectly hash attack-sets rather than consecutive occupied bits.

The number of distinct attack-sets is much smaller than the relevant occupancies. Thus, with the help of constructive collisions, some initial guess how to map the bits, and/or trial and error, using exactly N bits is always possible. If we try hard enough to maximize constructive collisions - even less.

## Perfect Hashing

Magic bitboards applies perfect hashing. A surjective function, to map the vector of all relevant occupancies to a range of attack-sets per square. The less bits the attack-set - the closer the blockers, the more those attack-sets are shared by occupancies with different, but redundant outer squares.cardinalityof allrelevant occupanciesis determined by the number of bits to map, varying from five to twelve - thus, the cardinality is the power of two the number of bits, varying from 32 to 4096.cardinalityofdistinct attack-setsis determined by the product of the length of each of the max four direction rays greater than zero (or one). The rook on d4 has 3*4*3*4 = 144 distinct attack-sets, a bishop on a8 has only 7.The

ratioof both cardinalities, that is allrelevant occupanciesversus the alldistinct attack-setsis illustrated below: As a quarter of a board - due to the symmetry, the other squares may deduced by flipping and mirroring. Noticeable is the huge 4096/49 ratio of 2^12 occupied states versus 7 times 7 attack-sets of the edge rooks - 12 bits instead of 6. Those "expensive" squares make constructive collisions very desirable. To become more "minimal" by saving an index bit - to halve down the table for one square or the other.#attset

The idea to implement minimal perfect hashing by an additional 16-bit indirection turned out to be slower.

Recent table sizes were about

38 KBytefor the bishop attacks, but still about800 KBytefor rook attacks. That sounds huge, considering L1 and L2 (L3) cache-sizes and number of cachelines and pages needed - we likely fetch distinct cachelines for each different square or occupancy. On the other hand caches and pages become larger in future processors. And occupancy and squares of the lookups don't change that randomly inside a search that we can still expect a lot of L1-hits. Unfortunately changes in occupancy outside the blockers and therefor not affecting the attack-set will introduce some more cache misses.## Sample Code

Anyway, register usage and code size are important issues as well - and heremagic bitboardsare unbeatable - specially bishopAttacks, with respect to the relative small table.## Summary

The bitboard method for holding a board game appears to have been invented in the mid 1950's, by Arthur Samuel and was used in his checkers program. In computer chess bitboards were first described by Georgy Adelson-Velsky et al. in 1967^{[15]}, reprinted 1970^{[16]}. Bitboards were used both in Kaissa and independently in Chess. The invention and publication of Rotated Bitboards by Robert Hyatt and Peter Gillgasch with Ernst A. Heinz in the 90s was another milestone in the history of bitboards. While rotated bitboards was the mainstream in bitboard chess programs, Magic Bitboards become more populated nowadays. Finally, Steffan Westcott's innovations, too expensive with 32-bit x86 architectures, should be revisited with x86-64 and SIMD instructions in mind.## References

1998) Using de Bruijn Sequences to Index a 1 in a Computer Word (pdf)1999).Rotated Bitmaps, a New Twist on an Old Idea. ICCA Journal, Vol. 22, No. 4, pp. 213–222.2008).Move Generation with Perfect Hashing Functions.ICGA Journal, Vol. 31, No. 1, pp. 3-12. ISSN 1389-6911.1970).Programming a Computer to Play Chess. Russian Mathematical Surveys, Vol. 25, pp. 221-262Up one Level