Burrows–Wheeler transform

From Rosetta Code
Revision as of 16:26, 27 July 2018 by Marko (talk | contribs) (Created page with "{{draft task}}The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for co...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Burrows–Wheeler transform is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.

Source: https://en.wikipedia.org/wiki/Burrows–Wheeler_transform

This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows. The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.

Python

<lang Python>


def bwt(s):

   """Apply Burrows-Wheeler transform to input string."""
   assert "\002" not in s and "\003" not in s, "Input string cannot contain STX and ETX characters"
   s = "\002" + s + "\003"  # Add start and end of text marker
   table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
   last_column = [row[-1:] for row in table]  # Last characters of each row
   return "".join(last_column)  # Convert list of characters into string


def ibwt(r):

   """Apply inverse Burrows-Wheeler transform."""
   table = [""] * len(r)  # Make empty table
   for i in range(len(r)):
       table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
   s = [row for row in table if row.endswith("\003")][0]  # Find the correct row (ending in ETX)
   return s.rstrip("\003").strip("\002")  # Get rid of start and end markers

</lang>