User:MikeMol/WIP
(a works-in-progress page.)
Burrows-Wheeler Transform
This is an extremely naive implementation of BTW. Not finished. I'm not really looking for speed or memory efficiency. My purposes in writing it are:
- Practice and demonstrate template classes
- Practice and demonstrate operator overloading
- Learn to make better use of the STL (Hey, C++ is my day job, remember?)
- Learn how BWT works, and set myself up with some code to experiment with some ideas I had; I don't see why it needs to be specific to 'characters', and the Wikipedia page doesn't describe it more generically.
- Consider how to safely mix in an EOF condition with an arbitrary data type.
Things remaining to do:
- I still need to write the reverse transform.
- Try to compile any of it.
- Put just about all of it in a static template class, so I can say something like BTW<char>::funcname and have all the templates be more or less predefined for me.
No, I don't require this many comments. I figured not everyone who looks at it will find the code particularly obvious, though.
<lang cpp>// A terribly, terribly naive implementation.
// This allows us to overload a type with an EOF state. BWT requires knowing where EOF is. template<typename tCellType> class { public: typedef bool EOF_STATE; // Constructor for EOF case. Also serves as default constructor. cell(EOF_STATE = false) : bEOF(_bEOF) {}
// Constructor for value-type case cell(tCellType _value) : bEOF(false), value(_value) {}
// Assigning one cell to another. operator =(const cell& o) { if(&o == this) return *this; bEOF = o.bEOF; value = o.value; return *this; }
// Assigning a value to us. We can't use a reference, as C++ disallows taking references to literals, // and someone may assign a literal. operator =(const tCellType o) { // We have a value, we're definitely not an EOF. bEOF = false; value = o;
return *this; }
// Allow this class to be cast to the original type. operator tCellType() const { return value; }
// operator < is used in sorting. operator <(const tCellType& o) const { if(bEOF) // EOF shall always be greater. } return false;
return (value < o.value); }
bool isEOF() const { return bEOF; } private: EOF_STATE bEOF; tCellType value; };
// typedefs make type derivative easer to maintain and track. typedef char cell_core; typedef std::vector< std::dequeue<cell<cell_core> > > nb_type;
// Usinve vector, as it comes with size information. // We'll return the position of the EOF pointer, as we'd otherwise need t void bwt_forward(const std::vector<cell_core>& block, std::vector<cell_core>& before_eof, std::vector<cell_core>& after_eof) { nb_type naive_block;
// The naive implementation requires // as many rows in the transform space // as cells in the original block // The "+ 1" is for our EOF marker, which we assume has not been provided. naive_block.resize(block.size() + 1);
// We need to fill in our naive block. nb_type_iterator iBlock = naive_block.begin();
// Reserve enough entries in our inner dequeue to hold all our original block's data. iBlock->resize(block.size());
// Copy the original block into our naive block. // We can't use a range insert, as our naive block uses our 'cell' type, // while the originally-provided block does not. We're stuck with // manually copying each cell. for(size_t idx = 0; iBlock->size() < idx; ++idx) { // iBlock is an iterator pointing to a dequeue that uses our cell data type. // Doing the assignment in this way will automatically generate the dequeue. (*iBlock)[idx] = block[idx]; }
// Tack on our EOF cell. I use an explicit calling out of the EOF_TYPE type, as this function template // might be instanciated as the 'bool' type, which would result in an ambiguity trying to resolve // which 'cell' constructor to use. iBlock->push_back(cell((cell::EOF_TYPE) true));
// Now we fill the rest of our naive space with rotated forms of our block.
// Create an iterator pointing to the current position, and advance iBlock. nb_type::iterator iLastBlock = iBlock++;
// Until we're all the way through our naive blcok while(naive_block.end() != iBlock) { // Copy the block. *iBlock = *iLastBlock;
// Rotate the block cell tmpCell = iBlock->pop_back(); iBlock->push_front(tmpCell); }
//We need to sort our naive block. sort(naive_block.begin(), naive_block.end());
// The last colum nin each row of our naive block represents our transformed data. // We need to copy that transformed data back into the original block. (sans our EOF pointer) size_t EOF_PTR = 0; const size_t transformed_row = naive_block.size();
// Find where our EOF wound up. for(size_t nb_idx = 0; nb_idx < naive_block.size(); ++nb_idx) { if(naive_block[nb_idx][transformed_row].isEOF()) { // We found our EOF pointer. EOF_PTR = nb_idx; break; } }
// Ensure that our before_EOF and after_EOF vectors are large enough.
// (EOF_PTR + 1 ) // The number of cells required to include EOF_PTR, an index to a 0-based array. // ( (EOF_PTR +1 ) - 1 ) // The number of cells we want // EOF_PTR // Also turns out to be the number of cells we want. Yay, Algebra. before_eof.resize(EOF_PTR); after_eof.resize(block.size() - before_eof.size());
// Fill before_eof size_t nb_idx = 0; for(; nb_idx < before_eof.size(); ++nb_idx) { before_eof[nb_idx] = naive_block[nb_idx][transformed_row]; }
// Bump nb_idx past our EOF marker ++nb_idx;
// Fill after_eof for(size_t aeof_idx = 0 ; aeof_idx < after_eof.size(); ++aeof_idx, ++nb_idx) { after_eof[aeof_idx] = naive_block[nb_idx][transformed_row]; } }
</lang>