Run-length encoding: Difference between revisions

Content deleted Content added

Inline

Revision as of 15:40, 5 June 2009

This page uses content from Wikipedia. The original article was at Run-length_encoding. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)

Given a string containing uppercase characters (A-Z), compress repeated 'runs' of the same character by storing the length of that run, and provide a function to reverse the compression. The output can be anything, as long as you can recreate the input with it.

Example:

Input: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Output: 12W1B12W3B24W1B14W

Ada

<lang Ada> with Ada.Text_IO; use Ada.Text_IO; with Ada.Strings.Fixed; use Ada.Strings.Fixed; procedure Test_Run_Length_Encoding is

  function Encode (Data : String) return String is
  begin
     if Data'Length = 0 then
        return "";
     else
        declare
           Code  : constant Character := Data (Data'First);
           Index : Integer := Data'First + 1;
        begin
           while Index <= Data'Last and then Code = Data (Index) loop
              Index := Index + 1;
           end loop;
           declare
              Prefix : constant String := Integer'Image (Index - Data'First);
           begin
              return Prefix (2..Prefix'Last) & Code & Encode (Data (Index..Data'Last));
           end;
        end;
     end if;
  end Encode;
  function Decode (Data : String) return String is
  begin
     if Data'Length = 0 then
        return "";
     else
        declare
           Index : Integer := Data'First;
           Count : Natural := 0;
        begin
           while Index < Data'Last and then Data (Index) in '0'..'9' loop
              Count := Count * 10 + Character'Pos (Data (Index)) - Character'Pos ('0');
              Index := Index + 1;
           end loop;
           if Index > Data'First then
              return Count * Data (Index) & Decode (Data (Index + 1..Data'Last));
           else
              return Data;
           end if;
        end;
     end if;
  end Decode;

begin

  Put_Line (Encode ("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"));
  Put_Line (Decode ("12W1B12W3B24W1B14W"));

end Test_Run_Length_Encoding; </lang> Sample output:

12W1B12W3B24W1B14W
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

AWK

Works with: gawk

It works with "textual" input. Lines containing numbers are skipped, since they can't be represented in a not ambiguous way in this implementation (e.g. "11AA" would be encoded as "212A", which would be decoded as A repeated 212 times!)

Encoding

<lang awk>BEGIN {

FS=""

} /^[^0-9]+$/ {

 cp = $1; j = 0
 for(i=1; i <= NF; i++) {
   if ( $i == cp ) {
     j++; 
   } else {
     printf("%d%c", j, cp)
     j = 1
   }
   cp = $i
 }
 printf("%d%c", j, cp)

}</lang>

Decoding

<lang awk>BEGIN {

 RS="[0-9]+[^0-9]"
 final = "";

} {

 match(RT, /([0-9]+)([^0-9])/, r)
 for(i=0; i < int(r[1]); i++) {
   final = final r[2]
 }

} END {

 print final

}</lang>

ALGOL 68

Works with: ALGOL 68 version Standard - no extensions to language used

Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

Note: The following uses iterators, eliminating the need of declaring arbitrarily large CHAR arrays for caching. <lang algol>STRING input := "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; STRING output := "12W1B12W3B24W1B14W";

MODE ITERATOR = VOID; MODE YIELDCHAR = PROC(CHAR)ITERATOR; MODE ITERCHAR = PROC(YIELDCHAR)ITERATOR;

PROC char seq = (REF STRING s, YIELDCHAR yield)ITERATOR:

 FOR i FROM LWB s TO UPB s DO
   yield(s[i])
 OD;

Note: The following 2 lines use currying. This not supported by ELLA ALGOL 68RS #

ITERCHAR input seq = char seq(input,),

        output seq = char seq(output,);

PROC for encode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (

 INT count := 0;
 CHAR prev;

FOR c IN seq DO #

 for seq((CHAR c)VOID: (
     IF count = 0 THEN
       count := 1;
       prev := c
     ELIF c /= prev THEN
       STRING str count := whole(count,0);
       char seq(str count, yield); count := 1;
       yield(prev); prev := c
     ELSE
       count +:=1
     FI
   )
 )

OD #;

 IF count /= 0 THEN
   STRING str count := whole(count,0);
   char seq(str count,yield);
   yield(prev)
 FI

);

STRING zero2nine = "0123456789";

PROC for decode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (

 INT repeat := 0;

FOR c IN seq DO #

 for seq((CHAR c)VOID: (
   IF char in string(c, NIL, zero2nine) THEN
     repeat := repeat*10 + ABS c - ABS "0"
   ELSE
     FOR i TO repeat DO
       yield(c)
     OD;
     repeat := 0
   FI
 ))

);

iterate through input string #

print("Encode input: ");

FOR c IN encode(input seq) DO #

 for encode(input seq, (CHAR c)VOID:
   print(c)
 )

OD #;

print(new line);

print("Decode output: ");

FOR c IN decode(output seq) DO #

 for decode(output seq, (CHAR c)VOID:
   print(c)
 )

OD #;

print(new line)</lang>

Encode input: 12W1B12W3B24W1B14W
Decode output: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

AutoHotkey

<lang AutoHotkey> msgbox % key := rle_encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") msgbox % rle_decode(key) rle_encode(message) {

 StringLeft, previous, message, 1
 StringRight, last, message, 1
 message .= asc(chr(last)+1)
 count = 0
 loop, parse, message
 {
   if (previous == A_LoopField)
     count +=1
   else
   {
     output .= previous . count
     previous := A_LoopField 
     count = 1
   }
 }
 return output

}

rle_decode(message) {

 pos = 1
 while, item := RegExMatch(message, "\D", char, pos)
 {
   digpos := RegExMatch(message, "\d+", dig, item)
   loop, % dig
   {
     output .= char
   }
   pos := digpos 
 }
 return output

} </lang>

C

These functions have no check for the size of the output buffers.

Encoding function

Since repeat counter must fit a single byte in this implementation, it can't be greater than 255, so a byte repeated more than 255 times generates in the compressed stream more than 2 bytes (4 bytes if the length of the repeated byte sequence is less than 511 and so on)

<lang c>int rle_encode(char *out, const char *in, int l) {

 int dl, i;
 char cp, c;

 for(cp=c= *in++, i = 0, dl=0; l>0 ; c = *in++, l-- ) {
   if ( c == cp ) {
     i++;
     if ( i > 255 ) {

*out++ = 255; *out++ = c; dl += 2; i = 1;

     }
   } else {
     *out++ = i;
     *out++ = cp; dl += 2;
     i = 1;
   }
   cp = c;
 }
 *out++ = i; *out++ = cp; dl += 2;
 return dl;

}</lang>

Decoding function

<lang c>int rle_decode(char *out, const char *in, int l) {

 int i, j, tb;
 char c;

 for(tb=0 ; l>=0 ; l -= 2 ) {
   i = *in++;
   c = *in++;
   tb += i;
   while(i-- > 0) *out++ = c;
 }
 return tb;

}</lang>

Usage example

<lang c>#include <stdio.h>

include <stdlib.h>
include <string.h>

const char *o = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";

int main() {

 char *d = malloc(2*strlen(o));
 char *oc = malloc(strlen(o));
 
 int rl = rle_encode(d, o, strlen(o));
 /* fwrite(d, 1, rl, stdout); */

 int ocl = rle_decode(oc, d, rl);
 fwrite(oc, 1, ocl, stdout);

 free(d); free(oc);
 return 0;

}</lang>

In the following codes, encoding and decoding are implementend as "filters" which compress/decompress standard input to standard output writing ASCII strings; they will work as long as the input has no ASCII digits in it, and the compressed/original ratio of a "single group" will be less than or equal to 1 as long as the ASCII decimal representation's length of the repeat counter will be shorter than the length of the "group". It should be so except in the case the group is a single isolated character, e.g. B gives 1B (one byte against two compressed bytes)

Encoding filter

<lang c>#include <stdio.h>

int main() {

 int i, c, cp;

 for(cp=c=getchar(), i = 0; c != EOF; c = getchar() ) {
   if ( c == cp ) {
     i++;
   } else {
     printf("%d%c", i, cp);
     i = 1;
   }
   cp = c;
 }
 printf("%d%c", i, cp);

}</lang>

Decoding filter

<lang c>#include <stdio.h>

int main() {

 int i, c, j;

 while( scanf("%d%c", &i, &c) == 2 ) {
   for(j=0; j < i; j++) printf("%c", c);
 }

}</lang>

Final note: since the repeat counter value 0 has no meaning, it could be used as it would be 256, so extending by one the maximum number of repetitions representable with a single byte; or instead it could be used as a special marker to encode in a more efficient way (long) sequences of isolated characters, e.g. "ABCDE" would be encoded as "1A1B1C1D1E"; it could be instead encoded as "05ABCDE".

Common Lisp

<lang lisp>(defun group-similar (sequence &key (test 'eql))

 (loop for x in (rest sequence)
       with temp = (subseq sequence 0 1)
       if (funcall test (first temp) x)
         do (push x temp)
       else
         collect temp
         and do (setf temp (list x))))

(defun run-length-encode (sequence)

 (mapcar (lambda (group) (list (first group) (length group)))
         (group-similar (coerce sequence 'list))))

(defun run-length-decode (sequence)

 (reduce (lambda (s1 s2) (concatenate 'simple-string s1 s2))
         (mapcar (lambda (elem)
                   (make-string (second elem)
                                :initial-element
                                (first elem)))
                 sequence)))

(run-length-encode "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") (run-length-decode '((#\W 12) (#\B 1) (#\W 12) (#\B 3) (#\W 24) (#\B 1)))</lang>

E

<lang e>def rle(string) {

 var seen := null
 var count := 0
 var result := []
 def put() {
   if (seen != null) {
     result with= [count, seen]
   }
 }
 for ch in string {
   if (ch != seen) {
     put()
     seen := ch
     count := 0
   }
   count += 1
 }
 put()
 return result

}

def unrle(coded) {

 var result := ""
 for [count, ch] in coded {
   result += E.toString(ch) * count
 }
 return result

}</lang>

<lang e>? rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")

value: [[12, 'W'], [1, 'B'], [12, 'W'], [3, 'B'], [24, 'W'], [1, 'B'], [14, 'W']]

? unrle(rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"))

value: "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"</lang>

Fan

- Generates a run-length encoding for a string

class RLE {

 Run[] encode(Str s)
 {
   runs := Run[,]

   s.size.times |i|
   {
     ch := s[i]
     if (runs.size==0 || runs.last.char != ch)
       runs.add(Run(ch))
     runs.last.inc
   }
   return runs
 }

 Str decode(Run[] runs)
 {
   buf := StrBuf()
   runs.each |run|
   {
     run.count.times { buf.add(run.char.toChar) }
   }
   return buf.toStr
 }

 Void main()
 {
   echo(decode(encode(

"WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"

       )))
 }

}

internal class Run {

 Int char
 Int count := 0
 new make(Int ch) { char = ch }
 Void inc() { ++count }

 override Str toStr() { return "${count}${char.toChar}" }

} </lang>

Haskell

<lang haskell>import Data.List (group)

-- Datatypes type Encoded = [(Int, Char)] -- An encoded String with form [(times, char), ...] type Decoded = String

-- Takes a decoded string and returns an encoded list of tuples rlencode :: Decoded -> Encoded rlencode = map (\g -> (length g, head g)) . group

-- Takes an encoded list of tuples and returns the associated decoded String rldecode :: Encoded -> Decoded rldecode = concatMap decodeTuple

   where decodeTuple (n,c) = replicate n c

main :: IO () main = do

 -- Get input
 putStr "String to encode: "
 input <- getLine
 -- Output encoded and decoded versions of input
 let encoded = rlencode input
     decoded = rldecode encoded
 putStrLn $ "Encoded: " ++ show encoded ++ "\nDecoded: " ++ show decoded</lang>

Java

<lang java>import java.util.regex.Matcher; import java.util.regex.Pattern; public class RunLengthEncoding {

   public String encode(String source) {
       StringBuffer dest = new StringBuffer();
       for (int i = 0; i < source.length(); i++) {
           int runLength = 1;
           while (i+1 < source.length() && source.charAt(i) == source.charAt(i+1)) {
               runLength++;
               i++;
           }
           dest.append(runLength);
           dest.append(source.charAt(i));
       }
       return dest.toString();
   }

   public String decode(String source) {
       StringBuffer dest = new StringBuffer();
       Pattern pattern = Pattern.compile("[0-9]+|[a-zA-Z]");
       Matcher matcher = pattern.matcher(source);
       while (matcher.find()) {
           int number = Integer.parseInt(matcher.group());
           matcher.find();
           while (number-- != 0) {
               dest.append(matcher.group());
           }
       }
       return dest.toString();
   }

   public static void main(String[] args) {
       RunLengthEncoding RLE = new RunLengthEncoding();
       String example = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
       System.out.println(RLE.encode(example));
       System.out.println(RLE.decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B"));
   }

}</lang>

Objective-C

Works with: GNUstep

The class RCRunLengthEncoder represents internally data with which it was feeded as pair character - repetition counter: it does not implement a binary representation of itself; it is left to another class, so that different input/output encodings are possible starting from the same class.

<lang objc>#import <Foundation/Foundation.h>

@interface RCRunLengthEncoder : NSObject {

 NSMutableArray *contents;
 NSMutableArray *counters;

} + (id)alloc; + (id)encoderWithData: (NSData *)data; - (id)init; - (id)initWithData: (NSData *)data; - (void)dealloc; - (void)addByte: (char)aByte; - (void)addByte: (char)aByte repeated: (int)repetitionCount; - (void)addData: (NSData *)data; - (NSData *)data; - (NSArray *)counters; - (NSArray *)contents; @end

@implementation RCRunLengthEncoder + (id)alloc {

 self = [[super alloc] init];
 return self;

}

+ (id)encoderWithData: (NSData *)data {

 id ne = [[RCRunLengthEncoder alloc] initWithData: data];
 return [ne autorelease];

}

- (id)initWithData: (NSData *)data {

 [self addData: data];
 return self;

}

- (id)init {

 contents = [[NSMutableArray alloc] init];
 counters = [[NSMutableArray alloc] init];
 return self;

}

- (void)dealloc {

 [counters release];
 [contents release];
 [super dealloc];

}

- (void)addByte: (char)aByte {

 [self addByte: aByte repeated: 1];

}

- (void)addByte: (char)aByte repeated: (int)repetitionCount {

 if ( ([contents count] == 0) || ([[contents lastObject] charValue] != aByte) ) {
   [contents addObject: [NSNumber numberWithChar: aByte]];
   [counters addObject: [NSNumber numberWithInt: repetitionCount]];
 } else {
   NSNumber *a = [[counters lastObject] retain];
   [counters removeLastObject];
   [counters addObject: [NSNumber numberWithInt:

([a intValue] + repetitionCount) ]];

   [a release];
 }

}

- (void)addData: (NSData *)data {

 [data retain];
 char *d = (char *)[data bytes];
 NSUInteger i;
 for(i=0; i < [data length]; i++) [self addByte: d[i]];
 [data release];

}

- (NSArray *)contents {

 return contents;

}

- (NSArray *)counters {

 return counters;

}

- (NSData *)data {

 NSMutableData *d = [[NSMutableData alloc] initWithCapacity: 256];
 char buf[256];
 int i;

 for(i=0; i < [contents count]; i++) {
   char c = [[contents objectAtIndex: i] charValue];
   int n = [[counters objectAtIndex: i] intValue];
   memset(buf, c, 256);
   while ( n > 0 ) {
     [d appendBytes: buf length: MIN(256, n)];
     n -= 256;
   }
 }
 return d;

} @end</lang>

The class codecRLE is derived from the previous, adding the methods that allow to binary encode the data internally held, and to create a internal representation from the encoded data. The specification here used are:

byte N >= 128, then the next byte must be repeated N-128 times (if N-128 is 0, the the byte must be repeated 128 times)
byte N < 128, then the next N bytes must be taken as they are (if N is 0, the next 128 bytes are literal)

<lang objc>@interface codecRLE : RCRunLengthEncoder - (NSData *)encode; - (codecRLE *)decode: (NSData *)enc; @end

@implementation codecRLE - (codecRLE *)decode: (NSData *)enc {

 char *buf = (char *)[enc bytes];
 int i;

 for(i = 0; i < [enc length]; ) {
   if ( (buf[i] & 0x80) != 0) {
     [self addByte: buf[i+1] repeated: ( ((buf[i]&0x7f) == 0) ? 128 : (buf[i]&0x7f) ) ];
     i += 2;
   } else {
     int rc = (buf[i] == 0) ? 128 : buf[i];
     [self addData: [NSData dataWithBytes: &buf[i+1] length: rc]];
     i += rc + 1;
   }
 }
 return self;

}

- (NSData *)encode {

 int literalCount=0;
 int i;
 char buf[129];

 NSMutableData *r = [[NSMutableData alloc] initWithCapacity: 256];

 for(i=0; i < [counters count]; i++) {
   char c = [[contents objectAtIndex: i] charValue];
   int howMany = [[counters objectAtIndex: i] intValue];
   if ( literalCount == 128 ) {
     buf[0] = 0;
     [r appendBytes: buf length: 129];
     literalCount = 0;
   }
   if ( howMany == 1 ) {
     buf[literalCount+1] = c;
     literalCount++;
   } else {
     if ( literalCount > 0 ) {

buf[0] = literalCount & 0x7f; [r appendBytes: buf length: (literalCount+1) ]; literalCount = 0;

     }
     buf[1] = c;
     while( howMany > 127 ) {

buf[0] = 0x80; [r appendBytes: buf length: 2]; howMany -= 128;

     }
     if (howMany > 0) {

buf[0] = howMany | 0x80; [r appendBytes: buf length: 2];

     }
   }
 }
 return r;

} @end</lang>

Usage example:

<lang objc>const char *s = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";

int main() {

 NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

 codecRLE *enc = [[codecRLE alloc]

initWithData: [NSData dataWithBytes: s length: strlen(s)] ];

 NSData *repr = [enc encode];
 fwrite([repr bytes], 1, [repr length], stdout);
 [enc release];

 enc = [[codecRLE alloc] decode: repr];
 [repr release];
 NSData *d = [enc data];
 fwrite([d bytes], 1, [d length], stdout);
 [d release];
 [enc release];

 [pool drain];
 return EXIT_SUCCESS;

}</lang>

Notes

The code is not deeply tested yet
Methods encode and data creates a NSData object that must be released by the user (it is not autoreleased nor released when the class is released)

Perl

<lang perl>sub encode

{my $str = shift;
 $str =~ s {(.)(\1*)} {(length($2) + 1) . $1 . ';'}gse;
 return $str;}

sub decode

{my $str = shift;
 $str =~ s {(\d+)(.);} {$2 x $1}gse;
 return $str;}</lang>

The following modified versions of the previous one, encode/decode a bytes sequence in a way compatible with the functions of the C version.

<lang perl>sub encode

{my $str = shift;
 $str =~ s {(.)(\1{0,254})} {pack("C",(length($2) + 1)) . $1 }gse;
 return $str;}

sub decode {

    my @str = split //, shift;
    my $r = "";
    foreach my $i (0 .. scalar(@str)/2-1) {

$r .= $str[2*$i + 1] x unpack("C", $str[2*$i]);

    }
    return $r;

}</lang>

Python

<lang python>def encode(input_string):

   count = 1
   prev = 
   lst = []
   for character in input_string:
       if character != prev:
           if prev:
               entry = (prev,count)
               lst.append(entry)
               #print lst
           count = 1
           prev = character
       else:
           count += 1
   else:
       entry = (character,count)
       lst.append(entry)
   return lst

def decode(lst):

   q = ""
   for character, count in lst:
       q += character * count
   return q

Method call

encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([('a', 5), ('h', 6), ('m', 7), ('u', 1), ('i', 7), ('a', 6)])</lang>

Functional

Works with: Python version 2.4

<lang python>from itertools import groupby def encode(input_string):

   return [(len(list(g)), k) for k,g in groupby(input_string)]

def decode(lst):

   return .join(c * n for n,c in lst)

encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([(5, 'a'), (6, 'h'), (7, 'm'), (1, 'u'), (7, 'i'), (6, 'a')])</lang>

By regular expression
The simplified input range of only uppercase characters allows a simple regular expression to be applied repeatedly for encoding, and another for decoding: <lang python>from re import finditer

def encode(text):

   
   Doctest:
       >>> encode('WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW')
       '12W1B12W3B24W1B14W'    
   
   return .join( str(len(run.group(0))) + run.group(1)
                   for run in finditer(r'(.)\1*', text) )

def decode(text):

   
   Doctest:
       >>> decode('12W1B12W3B24W1B14W')
       'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'
   
   return .join( run.group(2) * int(run.group(1))
                   for run in finditer(r'(\d+)(\D)', text) )

textin = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" assert decode(encode(textin)) == textin</lang>

Ruby

<lang ruby>def encode(s)

 s.scan(/(.)(\1*)/).inject([]) do |encoding, matchgroups|
   char, repeat = matchgroups
   encoding << [char, 1 + repeat.length] 
 end

end

def decode(e)

 e.inject("") do |decoded, pair|
   char, length = pair
   decoded << char * length
 end

end

orig = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" encoded = encode(orig) # => [["W", 12], ["B", 1], ["W", 12], ["B", 3], ["W", 24], ["B", 1], ["W", 14]] decoded = decode(encoded) puts "success!" if decoded == orig</lang>

Smalltalk

Works with: GNU Smalltalk

The class RunLengthEnc holds a representation of a run length encoded sequence of objects.

<lang smalltalk>Object subclass: RunLengthEnc [

 |counters contents|
 <category: 'Compression'>
 <comment: 'My instances are similar to a Bag, except

that items are ordered and counted iff they are adjacent. So that this instance keeps a representation of the added items suitable for performing a RunLengthEncoding, hence the name.'>

 RunLengthEnc class >> new [ ^self basicNew initialize ]

 size [ ^counters size ]

 add: anObject [ ^(self add: anObject withCount: 1) ]

 add: anObject withCount: anInt [
   anObject isNil
       ifTrue: [ 
           SystemExceptions.InvalidArgument signalOn: anObject
             reason: 'RunLengthEnc encodes existing objects, e.g. integers or characters, not nil' 
       ].
   (self size) > 0
   ifTrue: [
     (contents last) == anObject
         ifTrue: [
            self incrementLastCounter: anInt.
         ]
         ifFalse: [

self appendObject: anObject withCount: anInt

         ]
   ] ifFalse: [ self appendObject: anObject withCount: anInt ].
   ^anObject
 ]

 initialize [
   counters := OrderedCollection new.
   contents := OrderedCollection new.
 ]

 appendObject: anObject withCount: anInt [
   contents addLast: anObject.
   counters addLast: anInt
 ]

 appendObject: anObject [
   contents addLast: anObject.
   counters addLast: 1
 ]

 incrementLastCounter: howMuch [ | c |
   c := counters removeLast.
   counters addLast: (c+howMuch)
 ]

 "the 'do:' can be used to let the user store the compressed 'stream' as s/he
  prefers, while 'add:withCount:' can be used to rebuild the informations from
  the custom storage" 
 do: aBlock [
   1 to: (counters size) do: [ :i | | l |
     aBlock value: (contents at: i) value: (counters at: i)
   ]
 ]

 asOrderedCollection [ |oc|
   oc := OrderedCollection new.
   self do: [ :o :c |
     1 to: c do: [ :i| oc add: o ]
   ].
   ^oc
 ]

 printOn: aStream [
     "output a representation of the object:
      counter [object] ... for every object"
     1 to: (counters size) do: [ :i |
        (counters at: i) printOn: aStream.

aStream nextPut: (Character value: 32). (contents at: i) printOn: aStream. aStream nextPut: (Character nl).

     ]
 ]

 asString [ |oc| 
   "'build' a string from the run length encoded objects;
    works only if objects are Characters or Strings"
   oc := self asOrderedCollection.
   ^(oc asString)
 ]

].</lang>

The following extension to the OrderedCollection class allows to run length encode an ordered collection (theoretically of any objects' kind, but the RunLengthEnc class is supposed to work with characters mainly).

<lang smalltalk>OrderedCollection extend [

  asRunLengthEnc [ |rc|
      rc := RunLengthEnc new.
      self do: [ :o |
         rc add: o
      ].
      ^rc
  ]

].</lang>

The following extention to the String class allows to run length encode a string (it is basically a shortcut for aString asOrderedCollection asRunLengthEnc).

<lang smalltalk>String extend [

  asRunLengthEnc [ ^self asOrderedCollection asRunLengthEnc ]

].</lang>

Usage examples

<lang smalltalk>|cs s os|

s := 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'.

"let us run length encode the string" cs := s asRunLengthEnc. cs print. "this shows an ASCII representation of the run length encoded objects collection;

          in this case, of the string"

"let us show that the class is able to return the string back; this really works

iff the objects of the collection are strings or characters"

cs asString displayNl.</lang>

The class does not mandate a way of storing itself into a file that can be loaded later. The following sample code shows how it could be done quickly, but not efficiently from the point of view of a compressor.

<lang smalltalk>|f| "let's write the object and its informations to a file..." f := FileStream open: 'rledump' mode: FileStream write. ObjectDumper dump: cs to: f. f close.

"... and let's read it back" |o| f := FileStream open: 'rledump' mode: FileStream read. o := ObjectDumper loadFrom: f. o print. "show that the object holds the same informations of cs" f close.</lang>

Tcl

The encoding is an even-length list with elements {count char ...} <lang tcl>proc encode {string} {

   set encoding {}
   # use a regular expression to match runs of one character
   foreach {run -} [regexp -all -inline {(.)\1+|.} $string] {
       lappend encoding [string length $run] [string index $run 0]
   }
   return $encoding

}

proc decode {encoding} {

   foreach {count char} $encoding  {
       append decoded [string repeat $char $count]
   }
   return $decoded

}</lang>

<lang tcl>set str "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" set enc [encode $str] ;# ==> {12 W 1 B 12 W 3 B 24 W 1 B 14 W} set dec [decode $enc] if {$str eq $dec} {

   puts "success"

}</lang>