Introduction

After much searching the internet for a .net implementation for WAH compressed BitArray and not finding one, I decided to write and publish an article here so we the .net community are not left out as all the implementations are in the java language.  This ties into a pretty advanced topic of bitmap indexing techniques for databases and I needed this for my upcoming RaptorDB Document data-store database engine.

What is it? 

A BitArray is a data structure in .net library for storing true/false or bits of data in a compact form within an array of Int32 objects. A WAH or word-aligned hybrid BitArray is a special run length compressed version of a BitArray which saves a lot of space and memory. All the implementations that exist in java essentially duplicate the functionality of a BitSet that is the AND, OR, NOT and XOR operations with the compressed internal storage format.

In my implementation I defer the functionality of the BitArray to itself and just add compression and decompression routines. This is much faster than the java way at the expense of memory usage, to overcome this I have also added a FreeMemory method to release the BitArray contents and keep the compressed contents. Arguably if you are using 100s million bits then a full implementation is more performant than my implementation but for most of our usecases we are at most in the 10s millions of bits range.

This original method was invented at the Berkeley Labs of US Department of Energy, it is a project named FastBit and is used for high energy physics department experiments, you can see it here : FastBit site  

Why should I care?

So what! you ask?, well as mentioned before BitArrays are used in an indexing technique called bitmap indexes (wiki) and Column based databases which store data in columns instead of rows, a example which you might know is Microsoft's PowerPivot for Excel which can process millions of rows in seconds. Interestingly Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform post 2008 R2. It has long been in use by other RDBM vendors like Oracle.

How it works

The WAH compression algorithm is as follows:

  1. take 31 bits from the array.
  2. if all zeros then increment the zero count by 31.
    1. if ones count > 0 then output 32 bits with bit 32 =1 and bit 31=1 and 0-30 = ones count
  3. if all ones then increment the ones count by 31.
    1. if zeros count >0 then output 32 bits with bit 32=1 and bit 31=0 and 0-30 = zeros count
  4. else output the 31 bits as 32 bits with bit 32 = 0.
    1. if zeros or ones count >0 then output as above before.

From the above in the worst case you will get N/31 more bits encoded or about 3% increase in size to the original. 

What you get

WAHBitArray is essentially the same as the standard BitArray in the .net framework with the following additions:

  • FreeMemory() : this will first compress the internal BitArray then free the the memory it used.
  • GetCompressed() : this will compress the current BitArray then return them as an array of uint.

Using the code 

To create a WAHBitArray you can do the following :
WAHBitArray ba1 = new WAHBitArray(1000000); // 1 million bits

WAHBitArray ba2 = new WAHBitArray(new BitArray(2000000)); // 2 million bits from another bitarray

WAHBitArray ba3 = new WAHBitArray(1000000, new uint[] { /* compressed values here */}); // from a compressed list of uint 


To perform operations you can do the following:

BitArray result1 = ba1.And( ba2);
BitArray result2 = ba1.Or( ba3);
BitArray result3 = ba1.Xor( ba2); 

Points of Interest

  • BitArray class is sealed by Microsoft so inheriting for it was not possible.
  • BitArray throws a exception if the length of two BitArrays are not equal on bit operations, WAHBitArray makes then the same as the largest before operations.  

History  

Initial Release v1.0 : 22nd June 2011

推荐.NET配套的通用数据层ORM框架:CYQ.Data 通用数据层框架