Future or recent x8664 processors (AMDK10  SSE4a, IntelNehalem  SSE4.2) provide a 64bit popcount instruction^{[2]}, available via C++ compiler intrinsic ^{[3]}^{[4]}^{[5]}^{[6]} or inline assembly^{[7]}. Despite different Intrinsic prototypes (_mm_popcnt_u64 vs. popcnt64), Intel and AMD popcnt instructions are binary compatible, have same encoding (F3 [REX] 0F B8 /r), and both require bit 23 set in RCX of the CPUID function 0000_0001h.
The recursiverecurrence relation of population counts can be transformed to iteration as well, but lacks an arithmetical sumformula:
However, it is helpful to initialize a lookup table, i.e. for bytes:
unsignedchar popCountOfByte256[];void initpopCountOfByte256(){
popCountOfByte256[0]=0;for(int i =1; i <256; i++)
popCountOfByte256[i]= popCountOfByte256[i /2]+(i &1);}
Empty or Single?
Often one has to deal with sparsely populated or even empty bitboards. To determine whether a bitboard is empty or a single populated power of two value, one may use simple boolean statements rather than a complete population count.
Empty Bitboards
To test a bitboard is empty, one compares it with zero, or use the logical not operator:
if ( x == 0 ) > bitboard is empty
if ( !x ) > bitboard is empty
The inverse condition (not empty) is of course
if ( x != 0 ) > bitboard is not empty
if ( x ) > bitboard is not empty
Single Populated Bitboards
If the bitboard is not empty, we can extract the LS1B to look whether it is equal with the bitboard. Alternatively and faster, we can reset the LS1B to look whether the bitboard becomes empty.
if ( x != 0 && (x & (x1)) == 0 ) > population count is one, power of two value
One can skip the leading x != 0 condition to test popcount <= 1:
if ( (x & (x1)) == 0 ) > population count is less or equal than one
Again the inverse relation tests, whether a bitboard has more than one bit set:
if ( x & (x1) ) > population count is greater than one
An alternative approach to determine single populated sets, aka power of two values is based on Inclusive LS1B separation divided by two equals the ones' decrement ^{[9]}:
if ( ((x ^ (x1)) >> 1) == (x1) ) > population count is one, power of two value
LoopApproaches
Too Slow
Brute force adding all 64bits
int popCount (U64 x){int count =0;for(int i =0; i <64; i++, x >>=1)
count +=(int)x &1;return count;}
Of course, this is a slow algorithm, which might be improved by testing x not empty rather than i < 64. But unrolled in parallel prefix manner it already reminds on this one.
Brian Kernighan's way
Consecutively reset LS1B in a loop body and counting loop cycles until the bitset becomes empty. Brian Kernighan^{[10]} mentioned the trick in his and Ritchie's book The C Programming_Language, 2nd Edition 1988, exercise 29. However, the method was first published in 1960 by Peter Wegner^{[11]}, discovered independently by Derrick Henry Lehmer, published in 1964 ^{[12]}:
int popCount (U64 x){int count =0;while(x){
count++;
x &= x 1;// reset LS1B}return count;}
Despite branches  this is still one of the fastest approaches for sparsely populated bitboards. Of course the more bits that are set, the longer it takes.
Lookup
Of course we can not use the whole bitboard as index to a lookup table  since it's size would be 18,446,744,073,709,551,616 bytes! If it is about the population count of a byte, we can simply construct a table lookup with 256 elements. For a bitboard that takes eight byte lookups we can add together:
unsignedchar popCountOfByte256[];void initpopCountOfByte256(){
popCountOfByte256[0]=0;for(int i =1; i <256; i++)
popCountOfByte256[i]= popCountOfByte256[i /2]+(i &1);}int popCount (U64 x){return popCountOfByte256[ x &0xff]+
popCountOfByte256[(x >>8)&0xff]+
popCountOfByte256[(x >>16)&0xff]+
popCountOfByte256[(x >>24)&0xff]+
popCountOfByte256[(x >>32)&0xff]+
popCountOfByte256[(x >>40)&0xff]+
popCountOfByte256[(x >>48)&0xff]+
popCountOfByte256[ x >>56];}
Looks quite expensive  one may use four 16bit wordlookups with a precalculated 64KByte table though, but that pollutes the memory caches quite a bit. One can also treat the bitboard as array of bytes or words in memory, since endian issues don't care here  that safes all the shifts and 'ands', but has to read byte for byte from memory.
int popCount (U64 x){unsignedchar* p =(unsignedchar*)&x;return popCountOfByte256[p[0]]+
popCountOfByte256[p[1]]+
popCountOfByte256[p[2]]+
popCountOfByte256[p[3]]+
popCountOfByte256[p[4]]+
popCountOfByte256[p[5]]+
popCountOfByte256[p[6]]+
popCountOfByte256[p[7]];}
SWARPopcount
The divide and conquerSWARapproach deals with counting bits of duos, to aggregate the duocounts to nibbles and bytes inside one 64bit register in parallel, to finally sum all bytes together. According to Donald Knuth^{[13]}, a parallel population count routine was already introduced in 1957 due to Donald B. Gillies and Jeffrey C. P. Miller in the first textbook on programming, second edition: The Preparation of Programs for an Electronic Digital Computer, by Maurice Wilkes, David Wheeler and Stanley Gill, pages 191–193 ^{[14]}^{[15]}.
Counting DuoBits
A bitduo (two neighboring bits) can be interpreted with bit 0 = a, and bit 1 = b as
The duo population is
which can be archived by
or
The bitduo has up to four states with population count from zero to two as demonstrated in following table with binary digits:
x

x div 2
→
popcnt(x)
00

00
→
00
01

00
→
01
10

01
→
01
11

01
→
10
Only the lower bit is needed from x div 2  and one don't has to worry about borrows from neighboring duos. SWARwise, one needs to clear all "even" bits of the div 2 subtrahend to perform a 64bit subtraction of all 32 duos:
x = x  ((x >> 1) & 0x5555555555555555);
Note that the popcountresult of the bitduos still takes two bits.
Counting NibbleBits
The next step is to add the duocounts to populations of four neighboring bits, the 16 nibblecounts, which may range from zero to four. SWARwise it is done by masking odd and even duocounts to add them together:
Note that the popcountresult of the nibbles takes only three bits, since 100B is the maximum population (of the nibble 1111B).
ByteCounts
You already got the idea? Now it is about to get the bytepopulations from two nibblepopulations. Maximum bytepopulation of 1000B only takes four bits, so it is safe to mask all those four bits of the sum, rather than to mask the summands:
x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;
Adding the ByteCounts
Parallel Prefix Adds
We may continue with maskless parallel prefix SWARadds for bytecounts, wordcounts and finally doublewordcounts, to mask the least significant 8 (or 7) bits for final result in the 0..64 range:
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return x & 255;
Multiplication
With todays fast 64bit multiplication one can multiply the vector of 8bytecounts with 0x0101010101010101 to get the final result in the most significant byte, which is then shifted right:
x = (x * 0x0101010101010101) >> 56;
Casting out
Interestingly, there is another approach to add the bytes together. As demonstrated with decimal digits (base 10) and Casting out nines^{[16]}, casting out by modulo base minus one is equivalent to taking the digit sum, which might be applied here with low range 0..8 "base 256" digits:
x = x % 255;
However, since division and modulo are usually slow instructions and modulo by constant is likely replaced by reciprocal multiplication anyway by the compiler, the multiplication by 0x0101010101010101 aka the 2adic fraction 1/255 is the preferred method.
The PopCount routine
The Constants
Putting all together, the various SWARMasks and factors as defined by Donald Knuth as 2adic fractions ^{[17]}:
int popCount (U64 x){
x = x ((x >>1)& k1);/* put count of each 2 bits into those 2 bits */
x =(x & k2)+((x >>2)& k2);/* put count of each 4 bits into those 4 bits */
x =(x +(x >>4))& k4 ;/* put count of each 8 bits into those 8 bits */
x =(x * kf)>>56;/* returns 8 most significant bits of x + (x<<8) + (x<<16) + (x<<24) + ... */return(int) x;}
Advantage: no branches, no memory lookups, constant runtime  independent from population Drawback: dependency chain, not much parallel speedup
slowmul_popCount
And as stated before, for computers with relatively slower multiplication, the addition can be done manually:
int slowmul_popCount (U64 x){
x = x ((x >>1)& k1);/* put count of each 2 bits into those 2 bits */
x =(x & k2)+((x >>2)& k2);/* put count of each 4 bits into those 4 bits */
x =(x +(x >>4))& k4 ;/* put count of each 8 bits into those 8 bits */
x += x >>8;/* put count of each 16 bits into their lowest 8 bits */
x += x >>16;/* put count of each 32 bits into their lowest 8 bits */
x += x >>32;/* put count of the final 64 bits into the lowest 8 bits */return(int) x &255;}
For likely sparsely populated bitboards, the loopwise Brian Kernighan's way might be the faster one.
HAKMEM 169
A similar technique was proposed by Bill Gosper et al. from Massachusetts Institute of Technology, as published 1972 in Memo 239 (HAKMEM)^{[18]}^{[19]}, to add bittrio rather than duo populations consecutively, and the 32 bit version relies on casting out 63. Note that the constants in the code below have octal (base8) digits, originally written in Assembly for the PDP6^{[20]}. An expanded 64bit version, casting out 511 or 4095, is slightly less efficient than the binary SWAR version above.
int hakmem169_32(unsignedint x){
x = x ((x >>1)&033333333333)((x >>2)&011111111111);
x =(x +(x >>3))&030707070707;return x %63;/* casting out 63 */}
Miscellaneous
Cardinality of Multiple Sets
If we like to count arrays of sets, we can reduce 2^N1 popcounts to N popcounts, by applying the oddmajortrick. For three sets to count we safe one, with five additional cheap instructions.
The combined popCount3 likely gains more parallel speedup, since there are two independent chains to calculate. Possible Application is to pass the union of both bishops (since usually bishops cover disjoint sets due to different square colors) plus the up to two knight movetarget sets.
// return popCount(x) + popCount(y) + popCount(z)int popCount3 (U64 x, U64 y, U64 z){
U64 maj =((x ^ y )& z)(x & y);
U64 odd =((x ^ y )^ z);
maj = maj ((maj >>1)& k1 );
odd = odd ((odd >>1)& k1 );
maj =(maj & k2)+((maj >>2)& k2);
odd =(odd & k2)+((odd >>2)& k2);
maj =(maj +(maj >>4))& k4;
odd =(odd +(odd >>4))& k4;
odd =((maj + maj + odd)* kf )>>56;return(int) odd;}
Odd and Major 715
Of course  reducing seven popcount to three, or even 15 popcounts to four sounds even more promising.
For N = 2^n  1 it takes N  n oddmajor pairs. Thus one addmajor pair  five instructions  per saved popCount.
Assuming an architecture has a fast popcountinstruction (but no bitscan). One can isolate the LS1B, decrement it and count the remaining trailing "ones" to perform the logarithm dualis:
int hammingDistance (U64 a, U64 b) {return popcnt( a ^ b);}
Hamming distance greater than one or two is an important property of codes to detect or even correct onebit errors.
The minimum and average hamming distance over all Zobrist keys was considered as "quality"measure of the keys. However, as long the minimum hamming distance is greater zero, linear independence (that is a small subset of all keys doesn't xor to zero), is much more important than hamming distance ^{[21]}. Maximizing the minimal hamming distance leads to very poor Zobrist keys ^{[22]}.
Similar to Attacks by Occupancy Lookup to determine attack sets of sliding pieces, we may use precalculated population count or even centerweighted population count as a rough estimate on piece mobility^{[23]}. It may not consider subsets of let say safe target squares.
Piece Attacks Count
As pointed out by Marco Costalba^{[24]}^{[25]}, specialized routines to count the population (Mobility) of attack sets of king, knight and linewise subsets of sliding pieces can be done more efficiently than the general SWARPopcount. This is similar to Flipping Mirroring and Rotating the whole bitboard versus Rank, File and Diagonal, and is based on mapping the up to eight scattered occupied bits to one byte, to perform a single byte lookup. For various mapping techniques, see:
^ Color of each pixel is Hamming distance between the binary representations of its x and y coordinates, modulo 16, in the 16color system, by Josiedraus, June 8, 2007, Richard Hamming from Wikipedia
Future or recent x8664 processors (AMD K10  SSE4a, Intel Nehalem  SSE4.2) provide a 64bit popcount instruction ^{[2]}, available via C++ compiler intrinsic ^{[3]}^{[4]}^{[5]}^{[6]} or inline assembly ^{[7]}. Despite different Intrinsic prototypes (_mm_popcnt_u64 vs. popcnt64), Intel and AMD popcnt instructions are binary compatible, have same encoding (F3 [REX] 0F B8 /r), and both require bit 23 set in RCX of the CPUID function 0000_0001h.
Table of Contents
Recurrence Relation
The recursive recurrence relation of population counts can be transformed to iteration as well, but lacks an arithmetical sumformula:However, it is helpful to initialize a lookup table, i.e. for bytes:
Empty or Single?
Often one has to deal with sparsely populated or even empty bitboards. To determine whether a bitboard is empty or a single populated power of two value, one may use simple boolean statements rather than a complete population count.Empty Bitboards
To test a bitboard is empty, one compares it with zero, or use the logical not operator:The inverse condition (not empty) is of course
Single Populated Bitboards
If the bitboard is not empty, we can extract the LS1B to look whether it is equal with the bitboard. Alternatively and faster, we can reset the LS1B to look whether the bitboard becomes empty.One can skip the leading x != 0 condition to test popcount <= 1:
Again the inverse relation tests, whether a bitboard has more than one bit set:
An alternative approach to determine single populated sets, aka power of two values is based on Inclusive LS1B separation divided by two equals the ones' decrement ^{[9]}:
LoopApproaches
Too Slow
Brute force adding all 64bitsOf course, this is a slow algorithm, which might be improved by testing x not empty rather than i < 64. But unrolled in parallel prefix manner it already reminds on this one.
Brian Kernighan's way
Consecutively reset LS1B in a loop body and counting loop cycles until the bitset becomes empty. Brian Kernighan ^{[10]} mentioned the trick in his and Ritchie's book The C Programming_Language, 2nd Edition 1988, exercise 29. However, the method was first published in 1960 by Peter Wegner ^{[11]}, discovered independently by Derrick Henry Lehmer, published in 1964 ^{[12]}:Despite branches  this is still one of the fastest approaches for sparsely populated bitboards. Of course the more bits that are set, the longer it takes.
Lookup
Of course we can not use the whole bitboard as index to a lookup table  since it's size would be 18,446,744,073,709,551,616 bytes! If it is about the population count of a byte, we can simply construct a table lookup with 256 elements. For a bitboard that takes eight byte lookups we can add together:Looks quite expensive  one may use four 16bit wordlookups with a precalculated 64KByte table though, but that pollutes the memory caches quite a bit. One can also treat the bitboard as array of bytes or words in memory, since endian issues don't care here  that safes all the shifts and 'ands', but has to read byte for byte from memory.
SWARPopcount
The divide and conquer SWARapproach deals with counting bits of duos, to aggregate the duocounts to nibbles and bytes inside one 64bit register in parallel, to finally sum all bytes together. According to Donald Knuth ^{[13]}, a parallel population count routine was already introduced in 1957 due to Donald B. Gillies and Jeffrey C. P. Miller in the first textbook on programming, second edition: The Preparation of Programs for an Electronic Digital Computer, by Maurice Wilkes, David Wheeler and Stanley Gill, pages 191–193 ^{[14]} ^{[15]}.Counting DuoBits
A bitduo (two neighboring bits) can be interpreted with bit 0 = a, and bit 1 = b asThe duo population is
which can be archived by
or
The bitduo has up to four states with population count from zero to two as demonstrated in following table with binary digits:
Note that the popcountresult of the bitduos still takes two bits.
Counting NibbleBits
The next step is to add the duocounts to populations of four neighboring bits, the 16 nibblecounts, which may range from zero to four. SWARwise it is done by masking odd and even duocounts to add them together:Note that the popcountresult of the nibbles takes only three bits, since 100B is the maximum population (of the nibble 1111B).
ByteCounts
You already got the idea? Now it is about to get the bytepopulations from two nibblepopulations. Maximum bytepopulation of 1000B only takes four bits, so it is safe to mask all those four bits of the sum, rather than to mask the summands:Adding the ByteCounts
Parallel Prefix Adds
We may continue with maskless parallel prefix SWARadds for bytecounts, wordcounts and finally doublewordcounts, to mask the least significant 8 (or 7) bits for final result in the 0..64 range:Multiplication
With todays fast 64bit multiplication one can multiply the vector of 8bytecounts with 0x0101010101010101 to get the final result in the most significant byte, which is then shifted right:Casting out
Interestingly, there is another approach to add the bytes together. As demonstrated with decimal digits (base 10) and Casting out nines ^{[16]}, casting out by modulo base minus one is equivalent to taking the digit sum, which might be applied here with low range 0..8 "base 256" digits:However, since division and modulo are usually slow instructions and modulo by constant is likely replaced by reciprocal multiplication anyway by the compiler, the multiplication by 0x0101010101010101 aka the 2adic fraction 1/255 is the preferred method.
The PopCount routine
The Constants
Putting all together, the various SWARMasks and factors as defined by Donald Knuth as 2adic fractions ^{[17]}:represented as bitboards:
popCount
This is how the complete routine looks in C:Advantage: no branches, no memory lookups, constant runtime  independent from population
Drawback: dependency chain, not much parallel speedup
slowmul_popCount
And as stated before, for computers with relatively slower multiplication, the addition can be done manually:For likely sparsely populated bitboards, the loopwise Brian Kernighan's way might be the faster one.
HAKMEM 169
A similar technique was proposed by Bill Gosper et al. from Massachusetts Institute of Technology, as published 1972 in Memo 239 (HAKMEM) ^{[18]} ^{[19]}, to add bittrio rather than duo populations consecutively, and the 32 bit version relies on casting out 63. Note that the constants in the code below have octal (base8) digits, originally written in Assembly for the PDP6 ^{[20]}. An expanded 64bit version, casting out 511 or 4095, is slightly less efficient than the binary SWAR version above.Miscellaneous
Cardinality of Multiple Sets
If we like to count arrays of sets, we can reduce 2^N1 popcounts to N popcounts, by applying the oddmajortrick. For three sets to count we safe one, with five additional cheap instructions.The combined popCount3 likely gains more parallel speedup, since there are two independent chains to calculate. Possible Application is to pass the union of both bishops (since usually bishops cover disjoint sets due to different square colors) plus the up to two knight movetarget sets.
Odd and Major 715
Of course  reducing seven popcount to three, or even 15 popcounts to four sounds even more promising.For N = 2^n  1 it takes N  n oddmajor pairs. Thus one addmajor pair  five instructions  per saved popCount.
That is 7  3 = 4 pairs:
Or 15  4 = 11 pairs:
Odd and Major Digit Counts
OddMajor is probably also useful to determine digit count sets of attacks or other stuff:with following semantics:
Popcount as log2 of LS1B
Assuming an architecture has a fast popcountinstruction (but no bitscan). One can isolate the LS1B, decrement it and count the remaining trailing "ones" to perform the logarithm dualis:For instance, LS1B is 2^44, decrementing leaves a below LSB1 mask with exactly 44 bits set:
Hamming Distance
The hamming distance of two words is defined as the number of corresponding different bits.Hamming distance greater than one or two is an important property of codes to detect or even correct onebit errors.
The minimum and average hamming distance over all Zobrist keys was considered as "quality"measure of the keys. However, as long the minimum hamming distance is greater zero, linear independence (that is a small subset of all keys doesn't xor to zero), is much more important than hamming distance ^{[21]}. Maximizing the minimal hamming distance leads to very poor Zobrist keys ^{[22]}.
Weighted PopCount
For a SIMDwise kind of weighted population count, see the SSE2 dotproduct.Precalculated Mobility
Similar to Attacks by Occupancy Lookup to determine attack sets of sliding pieces, we may use precalculated population count or even centerweighted population count as a rough estimate on piece mobility ^{[23]}. It may not consider subsets of let say safe target squares.Piece Attacks Count
As pointed out by Marco Costalba ^{[24]} ^{[25]}, specialized routines to count the population (Mobility) of attack sets of king, knight and linewise subsets of sliding pieces can be done more efficiently than the general SWARPopcount. This is similar to Flipping Mirroring and Rotating the whole bitboard versus Rank, File and Diagonal, and is based on mapping the up to eight scattered occupied bits to one byte, to perform a single byte lookup. For various mapping techniques, see:Popcount in Hardware
See also
Publications
1949 ...
2000 ...
2010 ...
Postings
1998 ...
2000 ...
2005 ...
2010 ...
2015 ...
External Links
HAKMEMC  HAKMEM Programming hacks in C by Alan Mycroft
John Abercrombie, Bobo Stenson, Lars Danielsson, Jon Christensen
References
What links here?
Up one Level