(a works-in-progress page.)

Burrows-Wheeler Transform

This is an extremely naive implementation of BTW. Not finished. I'm not really looking for speed or memory efficiency. My purposes in writing it are:

  • Practice and demonstrate template classes
  • Practice and demonstrate operator overloading
  • Learn to make better use of the STL (Hey, C++ is my day job, remember?)
  • Learn how BWT works, and set myself up with some code to experiment with some ideas I had; I don't see why it needs to be specific to 'characters', and the Wikipedia page doesn't describe it more generically.
  • Consider how to safely mix in an EOF condition with an arbitrary data type.


Things remaining to do:

  • I still need to write the reverse transform.
  • Try to compile any of it.
  • Put just about all of it in a static template class, so I can say something like BTW<char>::funcname and have all the templates be more or less predefined for me.

No, I don't require this many comments. I figured not everyone who looks at it will find the code particularly obvious, though.

<lang cpp>// A terribly, terribly naive implementation.

// This allows us to overload a type with an EOF state. BWT requires knowing where EOF is. template<typename tCellType> class { public: typedef bool EOF_STATE; // Constructor for EOF case. Also serves as default constructor. cell(EOF_STATE = false) : bEOF(_bEOF) {}

// Constructor for value-type case cell(tCellType _value) : bEOF(false), value(_value) {}

// Assigning one cell to another. operator =(const cell& o) { if(&o == this) return *this; bEOF = o.bEOF; value = o.value; return *this; }

// Assigning a value to us. We can't use a reference, as C++ disallows taking references to literals, // and someone may assign a literal. operator =(const tCellType o) { // We have a value, we're definitely not an EOF. bEOF = false; value = o;

return *this; }

// Allow this class to be cast to the original type. operator tCellType() const { return value; }

// operator < is used in sorting. operator <(const tCellType& o) const { if(bEOF) // EOF shall always be greater. } return false;

return (value < o.value); }

bool isEOF() const { return bEOF; } private: EOF_STATE bEOF; tCellType value; };

// typedefs make type derivative easer to maintain and track. typedef char cell_core; typedef std::vector< std::dequeue<cell<cell_core> > > nb_type;

// Usinve vector, as it comes with size information. // We'll return the position of the EOF pointer, as we'd otherwise need t void bwt_forward(const std::vector<cell_core>& block, std::vector<cell_core>& before_eof, std::vector<cell_core>& after_eof) { nb_type naive_block;

// The naive implementation requires // as many rows in the transform space // as cells in the original block // The "+ 1" is for our EOF marker, which we assume has not been provided. naive_block.resize(block.size() + 1);

// We need to fill in our naive block. nb_type_iterator iBlock = naive_block.begin();

// Reserve enough entries in our inner dequeue to hold all our original block's data. iBlock->resize(block.size());

// Copy the original block into our naive block. // We can't use a range insert, as our naive block uses our 'cell' type, // while the originally-provided block does not. We're stuck with // manually copying each cell. for(size_t idx = 0; iBlock->size() < idx; ++idx) { // iBlock is an iterator pointing to a dequeue that uses our cell data type. // Doing the assignment in this way will automatically generate the dequeue. (*iBlock)[idx] = block[idx]; }

// Tack on our EOF cell. I use an explicit calling out of the EOF_TYPE type, as this function template // might be instanciated as the 'bool' type, which would result in an ambiguity trying to resolve // which 'cell' constructor to use. iBlock->push_back(cell((cell::EOF_TYPE) true));

// Now we fill the rest of our naive space with rotated forms of our block.

// Create an iterator pointing to the current position, and advance iBlock. nb_type::iterator iLastBlock = iBlock++;

// Until we're all the way through our naive blcok while(naive_block.end() != iBlock) { // Copy the block. *iBlock = *iLastBlock;

// Rotate the block cell tmpCell = iBlock->pop_back(); iBlock->push_front(tmpCell); }

//We need to sort our naive block. sort(naive_block.begin(), naive_block.end());

// The last colum nin each row of our naive block represents our transformed data. // We need to copy that transformed data back into the original block. (sans our EOF pointer) size_t EOF_PTR = 0; const size_t transformed_row = naive_block.size();

// Find where our EOF wound up. for(size_t nb_idx = 0; nb_idx < naive_block.size(); ++nb_idx) { if(naive_block[nb_idx][transformed_row].isEOF()) { // We found our EOF pointer. EOF_PTR = nb_idx; break; } }

// Ensure that our before_EOF and after_EOF vectors are large enough.

// (EOF_PTR + 1 ) // The number of cells required to include EOF_PTR, an index to a 0-based array. // ( (EOF_PTR +1 ) - 1 ) // The number of cells we want // EOF_PTR // Also turns out to be the number of cells we want. Yay, Algebra. before_eof.resize(EOF_PTR); after_eof.resize(block.size() - before_eof.size());

// Fill before_eof size_t nb_idx = 0; for(; nb_idx < before_eof.size(); ++nb_idx) { before_eof[nb_idx] = naive_block[nb_idx][transformed_row]; }

// Bump nb_idx past our EOF marker ++nb_idx;

// Fill after_eof for(size_t aeof_idx = 0 ; aeof_idx < after_eof.size(); ++aeof_idx, ++nb_idx) { after_eof[aeof_idx] = naive_block[nb_idx][transformed_row]; } }

</lang>