This article presents several solutions to this problem. Source code in C is available.

1. Iterated Count

int bitcount (unsigned int n)
{
  int count = 0;
  while (n) {
    count += n & 0x1u;
    n >>= 1;
  }
  return count;
}

2. Sparse Ones

int bitcount (unsigned int n)
{
  int count = 0 ;
  while (n) {
    count++ ;
    n &= (n - 1) ;   /* clears the lowest set bit */
  }
  return count ;
}

3. Dense Ones

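int bitcount (unsigned int n)
{
  int count = 8 * sizeof(int) ;   /* start from the total number of bits */
  n ^= (unsigned int) - 1 ;       /* complement n: every 0 bit becomes a 1 */
  while (n) {
    count-- ;
    n &= (n - 1) ;                /* clears the lowest set bit */
  }
  return count ;
}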

Sparse Ones and Dense Ones were first described by Peter Wegner in "A Technique for Counting Ones in a Binary Computer", Communications of the ACM, Volume 3 (1960) Number 5, page 322.

4a. Precompute-8bit

/* bits_in_char [i] must hold the number of 1 bits in i;
   fill the table before the first call */
static int bits_in_char [256] ;

int bitcount (unsigned int n)
{
  // works only for 32-bit ints
  return bits_in_char [n & 0xffu]
      +  bits_in_char [(n >>  8) & 0xffu]
      +  bits_in_char [(n >> 16) & 0xffu]
      +  bits_in_char [(n >> 24) & 0xffu] ;
}

4b. Precompute-16bit

/* bits_in_16bits [i] must hold the number of 1 bits in i;
   fill the table before the first call */
static char bits_in_16bits [0x1u << 16] ;

int bitcount (unsigned int n)
{
  // works only for 32-bit ints
  return bits_in_16bits [n & 0xffffu]
      +  bits_in_16bits [(n >> 16) & 0xffffu] ;
}
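The two precompute listings assume that the lookup table has been filled before the first call. What follows is a minimal sketch of one way to do that for the 8-bit table, assuming a 32-bit unsigned int. The names iterated_bitcount, init_bits_in_char and bitcount_precomp8 are hypothetical and are not taken from the article's source code.

```c
#include <assert.h>

static int bits_in_char[256];

/* Hypothetical helper: count bits one at a time (the Iterated Count method). */
static int iterated_bitcount(unsigned int n)
{
    int count = 0;
    while (n) {
        count += n & 0x1u;
        n >>= 1;
    }
    return count;
}

/* Hypothetical helper: fill the 256-entry table. Call once before any lookup. */
void init_bits_in_char(void)
{
    unsigned int i;
    for (i = 0; i < 256; i++)
        bits_in_char[i] = iterated_bitcount(i);
    /* Sanity check on the two extreme entries. */
    assert(bits_in_char[0] == 0);
    assert(bits_in_char[255] == 8);
}

/* Same lookup as Precompute-8bit; assumes unsigned int is 32 bits wide. */
int bitcount_precomp8(unsigned int n)
{
    return bits_in_char[n & 0xffu]
         + bits_in_char[(n >> 8) & 0xffu]
         + bits_in_char[(n >> 16) & 0xffu]
         + bits_in_char[(n >> 24) & 0xffu];
}
```

The 16-bit table could be filled the same way with a loop over 65536 entries. The one-time setup cost is not part of the per-call cost.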

5. Parallel Count

#define TWO(c)     (0x1u << (c))
#define MASK(c) \
  (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))
/* outer parentheses keep COUNT safe inside larger expressions */
#define COUNT(x,c) \
  (((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c)))

int bitcount (unsigned int n)
{
  n = COUNT(n, 0) ;
  n = COUNT(n, 1) ;
  n = COUNT(n, 2) ;
  n = COUNT(n, 3) ;
  n = COUNT(n, 4) ;
  /* n = COUNT(n, 5) ;    for 64-bit integers */
  return n ;
}
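The MASK(c) divisions in Parallel Count are easier to follow once expanded. The sketch below, assuming a 32-bit unsigned int, checks the five masks and one worked step. The function names are hypothetical, and COUNT is wrapped in an extra pair of parentheses here so that it can be used safely inside a comparison.

```c
#include <assert.h>

#define TWO(c)     (0x1u << (c))
#define MASK(c)    (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))
#define COUNT(x,c) (((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c)))

/* Assumes unsigned int is exactly 32 bits wide. */
void check_parallel_masks(void)
{
    assert(MASK(0) == 0x55555555u);  /* 0xffffffff / 3     : 0101... */
    assert(MASK(1) == 0x33333333u);  /* 0xffffffff / 5     : 0011... */
    assert(MASK(2) == 0x0f0f0f0fu);  /* 0xffffffff / 17    : 00001111... */
    assert(MASK(3) == 0x00ff00ffu);  /* 0xffffffff / 257 */
    assert(MASK(4) == 0x0000ffffu);  /* 0xffffffff / 65537 */

    /* Worked step: n = 1011 in binary. After COUNT(n, 0) each two-bit field
       holds the number of 1 bits in that pair: "10" -> 01, "11" -> 10. */
    assert(COUNT(0xbu, 0) == 0x6u);
}

/* Hypothetical name for the same five-step routine. */
int parallel_bitcount(unsigned int n)
{
    n = COUNT(n, 0);   /* sums of adjacent bits     */
    n = COUNT(n, 1);   /* sums of adjacent pairs    */
    n = COUNT(n, 2);   /* sums of adjacent nibbles  */
    n = COUNT(n, 3);   /* sums of adjacent bytes    */
    n = COUNT(n, 4);   /* sum of the two half-words */
    return (int)n;
}
```

Each step doubles the field width, so five steps suffice for 32 bits.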

6. Nifty Parallel Count

#define MASK_01010101 (((unsigned int)(-1))/3)
#define MASK_00110011 (((unsigned int)(-1))/5)
#define MASK_00001111 (((unsigned int)(-1))/17)

int bitcount (unsigned int n)
{
  n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;
  n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;
  n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;
  return n % 255 ;
}

According to Don Knuth (The Art of Computer Programming Vol IV, p 11), Nifty Parallel Count is due to D B Gillies and J C P Miller. It appeared on pages 191--193 of the first textbook on programming, The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler and Gill (1957, reprinted 1984).

7. MIT HAKMEM Count

int bitcount(unsigned int n)
{
  /* works for 32-bit numbers only    */
  /* fix last line for 64-bit numbers */

  register unsigned int tmp;

  tmp = n - ((n >> 1) & 033333333333)
          - ((n >> 2) & 011111111111);
  return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}

MIT HAKMEM Count is funky. Consider a 3-bit number as being 4a+2b+c. If we shift it right by 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we right-shift the original 3-bit number by two bits, we get a, and so with another subtraction we have a+b+c, which is the number of 1 bits in the original number.

How is this insight employed? The first assignment statement in the routine computes *tmp*. Consider the octal representation of *tmp*. Each digit in the octal representation is simply the number of 1's in the corresponding three bit positions in *n*.

The return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulo 63. This is accomplished by right-shifting *tmp* by three bits, adding it to *tmp* itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1's among those six positions in *n*. This number modulo 63 yields the final answer, because 64 leaves remainder 1 when divided by 63, so the remainder is the sum of the six-bit groups. For 64-bit numbers, we would have to add triples of octal digits and use modulo 1023.

This is HAKMEM 169, as used in X11 sources. Source: MIT AI Lab memo, 1972.
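The two steps described above can be traced on a small input. The sketch below splits the routine into its two halves and checks the intermediate values for n = 0755 (octal), assuming a 32-bit unsigned int. The function names are hypothetical and are used only for illustration.

```c
#include <assert.h>

/* Step 1: each octal digit of the result holds the number of 1 bits in the
   matching three-bit group of n (32-bit n assumed). */
unsigned int hakmem_octal_counts(unsigned int n)
{
    return n - ((n >> 1) & 033333333333u)
             - ((n >> 2) & 011111111111u);
}

/* Step 2: fold adjacent octal digits into six-bit fields, then sum the
   fields by taking the remainder modulo 63. */
int hakmem_bitcount(unsigned int n)
{
    unsigned int tmp = hakmem_octal_counts(n);
    return (int)(((tmp + (tmp >> 3)) & 030707070707u) % 63u);
}

void hakmem_worked_example(void)
{
    unsigned int n = 0755u;                 /* binary 111 101 101 */
    unsigned int tmp = hakmem_octal_counts(n);
    unsigned int folded;

    /* Octal digits 3, 2, 2: the bit counts of 111, 101 and 101. */
    assert(tmp == 0322u);

    /* Adjacent digits added: the low six-bit field holds 2 + 2 = 4 and the
       next field holds 3. */
    folded = (tmp + (tmp >> 3)) & 030707070707u;
    assert(folded == 0304u);

    /* 0304 octal is 3 * 64 + 4 = 196, and 196 mod 63 = 7. */
    assert(folded % 63u == 7u);
    assert(hakmem_bitcount(n) == 7);
}
```

No field can overflow in step 2, since two octal digits of at most 3 each sum to at most 6.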

8. Builtin Instructions

The GNU compiler provides

int __builtin_popcount (unsigned int x);

which translates into a single CPU instruction if the underlying machine architecture supports it. For example, Intel machines have POPCNT (SSE4 instruction set, announced in 2006). Many other GCC builtin functions exist.
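A minimal usage sketch follows, assuming GCC or a compatible compiler such as Clang. The wrapper names are hypothetical. __builtin_popcountll is the GCC variant of the builtin for unsigned long long.

```c
#include <assert.h>

/* Thin wrapper around the GCC builtin for unsigned int. */
int bitcount_builtin(unsigned int n)
{
    return __builtin_popcount(n);
}

/* Variant for unsigned long long, which is at least 64 bits wide. */
int bitcount_builtin64(unsigned long long n)
{
    return __builtin_popcountll(n);
}

/* Small self-check of both wrappers. */
void check_builtin(void)
{
    assert(bitcount_builtin(0u) == 0);
    assert(bitcount_builtin(0xf0u) == 4);
    assert(bitcount_builtin64(0xffffffffffffffffull) == 64);
}
```

On x86, GCC emits the POPCNT instruction only when the target is declared to support it, for example with the -mpopcnt option. Otherwise it falls back to a software routine.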

Performance Measurements

              No             Some           Heavy
              Optimization   Optimization   Optimization

Precomp_16    52.94 Mcps     76.22 Mcps     80.58 Mcps
Precomp_8     29.74 Mcps     49.83 Mcps     51.65 Mcps
Parallel      19.30 Mcps     36.00 Mcps     38.55 Mcps
MIT           16.93 Mcps     17.10 Mcps     31.82 Mcps
Nifty         12.78 Mcps     16.07 Mcps     29.71 Mcps
Sparse         5.70 Mcps     15.01 Mcps     14.62 Mcps
Dense          5.30 Mcps     14.11 Mcps     14.56 Mcps
Iterated       3.60 Mcps      3.84 Mcps      9.24 Mcps

Mcps = Million counts per second

Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table above. "No Optimization" was compiled with plain *gcc*. "Some Optimization" was *gcc -O3*. "Heavy Optimization" corresponds to *gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4*.
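For readers who want to repeat such trials, here is a rough sketch of how a figure in Mcps could be measured with the standard clock() function. It is not the harness used for the table. The names measure_mcps and sparse_bitcount are hypothetical, and results will vary with machine, compiler and flags.

```c
#include <assert.h>
#include <time.h>

/* Routine under test: the Sparse Ones method. */
int sparse_bitcount(unsigned int n)
{
    int count = 0;
    while (n) {
        count++;
        n &= (n - 1);
    }
    return count;
}

/* Calls fn on the inputs 0 .. calls-1 and returns the rate in million counts
   per second, or 0.0 if the run was too short for clock() to measure.
   The sum of all results is stored in *checksum so that the compiler cannot
   discard the loop as dead code. */
double measure_mcps(int (*fn)(unsigned int), unsigned long calls,
                    unsigned long *checksum)
{
    unsigned long i;
    unsigned long sum = 0;
    clock_t start, stop;
    double seconds;

    assert(fn != 0 && checksum != 0);

    start = clock();
    for (i = 0; i < calls; i++)
        sum += (unsigned long)fn((unsigned int)i);
    stop = clock();

    *checksum = sum;
    seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)calls / seconds / 1e6;
}
```

A call such as measure_mcps(sparse_bitcount, 100000000ul, &sum) runs long enough to give a stable figure on most machines. Note that sequential inputs favor Sparse Ones over a uniformly random workload, since small numbers have few set bits.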

Thanks to Seth Robertson, who suggested performing speed trials by extending bitcount.c. Seth also pointed me to the MIT_Hackmem routine. Thanks to Denny Gursky, who suggested the idea of Precompute_11bit, which would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit, which turned out to be even faster.
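The Precompute_11bit idea is not listed in this article, so the following is only a sketch of how its three sums might look, assuming a 32-bit unsigned int. The table and function names are hypothetical.

```c
#include <assert.h>

/* 2048-entry table: bits_in_11bits[i] is the number of 1 bits in i. */
static unsigned char bits_in_11bits[0x1u << 11];

/* Fill the table using the recurrence count(i) = count(i / 2) + (i mod 2).
   Call once before any lookup. */
void init_bits_in_11bits(void)
{
    unsigned int i;
    bits_in_11bits[0] = 0;
    for (i = 1; i < (0x1u << 11); i++)
        bits_in_11bits[i] =
            (unsigned char)(bits_in_11bits[i >> 1] + (i & 0x1u));
    /* Sanity check on the largest entry: eleven 1 bits. */
    assert(bits_in_11bits[0x7ffu] == 11);
}

/* Three lookups: low 11 bits, middle 11 bits, top 10 bits. */
int bitcount_precomp11(unsigned int n)
{
    return bits_in_11bits[n & 0x7ffu]
         + bits_in_11bits[(n >> 11) & 0x7ffu]
         + bits_in_11bits[(n >> 22) & 0x3ffu];
}
```

The table needs only 2 KB, compared with 64 KB for the 16-bit table, at the price of one extra lookup per call.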

If you have niftier solutions up your sleeves, please send me an e-mail or write comments below!

Further Reading

- HAKMEM (bit counting is item number 169), MIT AI Lab, Artificial Intelligence Memo No. 239, February 29, 1972.
- Bit Twiddling Hacks by Sean Anderson at Stanford University.
- Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Part IV).

On 15 March 2009, this article was discussed on Reddit — you might learn quite a bit by reading the comments therein.

© Copyright 2008—2017, Gurmeet Manku.