Text processing/1: Difference between revisions

Content added Content deleted

Inline

Revision as of 08:09, 23 March 2009

This task has been flagged for clarification. Code on this page in its current state may be flagged incorrect once this task has been clarified. See this page's Talk page for discussion.

Often data is produced by one program, in the wrong format for later use by another program or person. In these situations another program can be written to parse and transform the original data into a format useful to the other. The term "Data Munging" is often used in programming circles for this task.

A request on the comp.lang.awk newsgroup lead to a typical data munging task:

I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: $1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ...  <val24> <flag24>

Some test data is available at: 
... (nolonger available at original location)

I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e values with
flag<=0)

The data is free to download and use and is of this format:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

Only a sample of the data showing its format is given above. The full example file may be downloaded here.

Structure your program to show statistics for each line of the file, (similar to the original Python, Perl, and AWK examples below), followed by summary statistics for the file. When showing example output just show a few line statistics and the full end summary.

Ada

Library: Simple components for Ada

<lang ada> with Ada.Text_IO; use Ada.Text_IO; with Strings_Edit; use Strings_Edit; with Strings_Edit.Floats; use Strings_Edit.Floats; with Strings_Edit.Integers; use Strings_Edit.Integers;

procedure Data_Munging is

  Syntax_Error : exception;
  type Gap_Data is record
     Count   : Natural := 0;
     Line    : Natural := 0;
     Pointer : Integer;
     Year    : Integer;
     Month   : Integer;
     Day     : Integer;
  end record;
  File    : File_Type;
  Max     : Gap_Data;
  This    : Gap_Data;
  Current : Gap_Data;
  Count   : Natural := 0;
  Sum     : Float   := 0.0;

begin

  Open (File, In_File, "readings.txt");
  loop
     declare
        Line    : constant String := Get_Line (File);
        Pointer : Integer := Line'First;
        Flag    : Integer;
        Data    : Float;
     begin
        Current.Line := Current.Line + 1;
        Get (Line, Pointer, SpaceAndTab);
        Get (Line, Pointer, Current.Year);
        Get (Line, Pointer, Current.Month);
        Get (Line, Pointer, Current.Day);
        while Pointer <= Line'Last loop
           Get (Line, Pointer, SpaceAndTab);
           Current.Pointer := Pointer;
           Get (Line, Pointer, Data);
           Get (Line, Pointer, SpaceAndTab);
           Get (Line, Pointer, Flag);
           if Flag < 0 then
              if This.Count = 0 then
                 This := Current;
              end if;
              This.Count := This.Count + 1;
           else
              if This.Count > 0 and then Max.Count < This.Count then
                 Max := This;
              end if;
              This.Count := 0;
              Count := Count + 1;
              Sum   := Sum + Data;
           end if;
        end loop;
     exception
        when End_Error =>
           raise Syntax_Error;
     end;
  end loop;

exception

  when End_Error =>
     Close (File);
     if This.Count > 0 and then Max.Count < This.Count then
        Max := This;
     end if;
     Put_Line ("Average " & Image (Sum / Float (Count)) & " over " & Image (Count));
     if Max.Count > 0 then
        Put ("Max. " & Image (Max.Count) & " false readings start at ");
        Put (Image (Max.Line) & ':' & Image (Max.Pointer) & " stamped ");
        Put_Line (Image (Max.Year) & Image (Max.Month) & Image (Max.Day));
     end if;
  when others =>
     Close (File);
     Put_Line ("Syntax error at " & Image (Current.Line) & ':' & Image (Max.Pointer));

end Data_Munging; </lang> The implementation performs minimal checks. The average is calculated over all valid data. For the maximal chain of consequent invalid data, the source line number, the column number, and the time stamp of the first invalid data is printed. Sample output:

Average 10.47915 over 129628
Max. 589 false readings start at 1136:20 stamped 1993-2-9

ALGOL 68

Translation of: python

Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

<lang algol>INT no data := 0; # Current run of consecutive flags<0 in lines of file # INT no data max := -1; # Max consecutive flags<0 in lines of file # FLEX[0]STRING no data max line; # ... and line number(s) where it occurs #

REAL tot file := 0; # Sum of file data # INT num file := 0; # Number of file data items with flag>0 #

CHAR fs = " "; #

INT nf = 24;

INT upb list := nf; FORMAT list repr = $n(upb list-1)(g", ")g$;

PROC exception = ([]STRING args)VOID:(

 putf(stand error, ($"Exception"$,$", "g$, args, $l$));
 stop

);

PROC raise io error = (STRING message)VOID:exception(("io error", message));

OP +:= = (REF FLEX []STRING rhs, STRING append)REF FLEX[]STRING: (

 HEAP [UPB rhs+1]STRING out rhs;
 out rhs[:UPB rhs] := rhs;
 out rhs[UPB rhs+1] := append;
 rhs := out rhs;
 out rhs

);

INT upb opts = 3; # these are "a68g" "./Data_Munging.a68" & "-" # [argc - upb opts]STRING in files; FOR arg TO UPB in files DO in files[arg] := argv(upb opts + arg) OD;

MODE FIELD = STRUCT(REAL data, INT flag); FORMAT field repr = $2(g)$;

FOR index file TO UPB in files DO

 STRING file name = in files[index file], FILE file;
 IF open(file, file name, stand in channel) NE 0 THEN
   raise io error("Cannot open """+file name+"""") FI;
 on logical file end(file, (REF FILE f)BOOL: logical file end done);
 REAL tot line, INT num line;
 # make term(file,",") for CSV data #
 STRING date;
 DO
   tot line := 0;             # sum of line data #
   num line := 0;             # number of line data items with flag>0 #
   # extract field info #
   [nf]FIELD data;
   getf(file, ($10a$, date, field repr, data, $l$));
 
   FOR key TO UPB data DO
     FIELD field = data[key];
     IF flag OF field<1 THEN
       no data +:= 1
     ELSE
       # check run of data-absent data #
       IF no data max = no data AND no data>0 THEN
         no data max line +:= date FI;
       IF no data max<no data AND no data>0 THEN
         no data max := no data;
         no data max line := date FI;
       # re-initialise run of no data counter #
       no data := 0; 
       # gather values for averaging #
       tot line +:= data OF field;
       num line +:= 1
     FI
   OD;

   # totals for the file so far #
   tot file +:= tot line;
   num file +:= num line;
 
   printf(($"Line: "g"  Reject: "g(-2)"  Accept: "g(-2)"  Line tot: "g(-14,3)"  Line avg: "g(-14,3)l$,
         date, 
         UPB(data) -num line, 
         num line, tot line, 
         IF num line>0 THEN tot line/num line ELSE 0 FI))
 OD;
 logical file end done:
   close(file)

OD;

FORMAT plural = $b(" ", "s")$,

      p = $b("", "s")$;

upb list := UPB in files; printf(($l"File"f(plural)" = "$,upb list = 1, list repr,in files, $l$,

       $"Total    = "g(-0,3)l$, tot file,
       $"Readings = "g(-0)l$, num file,
       $"Average  = "g(-0,3)l$, tot file / num file));

upb list := UPB no data max line; printf(($l"Maximum run"f(p)" of "g(-0)" consecutive false reading"f(p)" ends at line starting with date"f(p)": "$,

   upb list = 1, no data max, no data max = 0, upb list = 1, list repr, no data max line, $l$))</lang>

Command:

$ a68g ./Data_Munging.a68 - data

Output:

Line: 1991-03-30  Reject:  0  Accept: 24  Line tot:        240.000  Line avg:         10.000
Line: 1991-03-31  Reject:  0  Accept: 24  Line tot:        565.000  Line avg:         23.542
Line: 1991-03-31  Reject: 23  Accept:  1  Line tot:         40.000  Line avg:         40.000
Line: 1991-04-01  Reject:  1  Accept: 23  Line tot:        534.000  Line avg:         23.217
Line: 1991-04-02  Reject:  0  Accept: 24  Line tot:        475.000  Line avg:         19.792
Line: 1991-04-03  Reject:  0  Accept: 24  Line tot:        335.000  Line avg:         13.958

File     = data
Total    = 2189.000
Readings = 120
Average  = 18.242

Maximum run of 24 consecutive false readings ends at line starting with date: 1991-04-01

AWK

<lang c># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{

 nodata = 0;             # Current run of consecutive flags<0 in lines of file
 nodata_max=-1;          # Max consecutive flags<0 in lines of file
 nodata_maxline="!";     # ... and line number(s) where it occurs

} FNR==1 {

 # Accumulate input file names
 if(infiles){
   infiles = infiles "," infiles
 } else {
   infiles = FILENAME
 }

} {

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0

 # extract field info, skipping initial date field
 for(field=2; field<=NF; field+=2){
   datum=$field; 
   flag=$(field+1); 
   if(flag<1){
     nodata++
   }else{
     # check run of data-absent fields
     if(nodata_max==nodata && (nodata>0)){
       nodata_maxline=nodata_maxline ", " $1
     }
     if(nodata_max<nodata && (nodata>0)){
       nodata_max=nodata
       nodata_maxline=$1
     }
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line+=datum
     num_line++;
   }
 }

 # totals for the file so far
 tot_file += tot_line
 num_file += num_line

 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
        $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

 # debug prints of original data plus some of the computed values
 #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
 #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline

}

END{

 printf "\n"
 printf "File(s)  = %s\n", infiles
 printf "Total    = %10.3f\n", tot_file
 printf "Readings = %6i\n", num_file
 printf "Average  = %10.3f\n", tot_file / num_file

 printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline

}</lang> Sample output:

bash$ awk -f readings.awk readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

D

<lang d>// Author Daniel Keep Mar 23 2009 module data_munging;

import std.conv : toInt, toDouble; import std.stdio : writefln; import std.stream : BufferedFile; import std.string : split, join, format;

void main(string[] args) {

   int      noData,
            noDataMax = -1;
   string[] noDataMaxLine;

   double   fileTotal = 0.0;
   int      fileValues;

   foreach( arg ; args[1..$] )
   {
       foreach( char[] line ; new BufferedFile(arg) )
       {
           double lineTotal = 0.0;
           int    lineValues;

           // Extract field info
           auto parts = split(line);
           auto date = parts[0];
           auto fields = parts[1..$];
           assert( (fields.length & 2) == 0,
                   format("Expected even number of fields, not %d.",
                       fields.length) );

           for( auto i=0; i<fields.length; i += 2 )
           {
               auto value = toDouble(fields[i]);
               auto flag = toInt(fields[i+1]);

               if( flag < 1 )
               {
                   ++ noData;
                   continue;
               }

               // Check run of data-absent fields
               if( noDataMax == noData && noData > 0 )
                   noDataMaxLine ~= date;
               
               if( noDataMax < noData && noData > 0 )
               {
                   noDataMax = noData;
                   noDataMaxLine.length = 1;
                   noDataMaxLine[0] = date;
               }

               // Re-initialise run of noData counter
               noData = 0;

               // Gather values for averaging
               lineTotal += value;
               ++ lineValues;
           }

           // Totals for the file so far
           fileTotal += lineTotal;
           fileValues += lineValues;

           writefln("Line: %11s  Reject: %2d  Accept: %2d"
                   "  Line_tot: %10.3f  Line_avg: %10.3f",
                   date,
                   fields.length/2 - lineValues,
                   lineValues,
                   lineTotal,
                   lineValues > 0
                       ? lineTotal/lineValues
                       : 0.0);
       }
   }

   writefln("");
   writefln("File(s)  = ", join(args[1..$], ", "));
   writefln("Total    = %10.3f", fileTotal);
   writefln("Readings = %6d", fileValues);
   writefln("Average  = %10.3f", fileTotal/fileValues);

   writefln("\nMaximum run(s) of %d consecutive false readings ends"
           " at line starting with date(s): %s",
           noDataMax, join(noDataMaxLine, ", "));

}</lang>

Output matches that of the Python version.

Perl

<lang perl># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {

 $nodata = 0;             # Current run of consecutive flags<0 in lines of file
 $nodata_max=-1;          # Max consecutive flags<0 in lines of file
 $nodata_maxline="!";     # ... and line number(s) where it occurs

} foreach (@ARGV) {

 # Accumulate input file names
 if($infiles ne ""){
   $infiles = "$infiles, $_";
 } else {
   $infiles = $_;
 }

}

while (<>){

 $tot_line=0;             # sum of line data
 $num_line=0;             # number of line data items with flag>0

 # extract field info, skipping initial date field
 chomp;
 @fields = split(/\s+/);
 $nf = @fields;
 $date = $fields[0];
 for($field=1; $field<$nf; $field+=2){
   $datum = $fields[$field] +0.0; 
   $flag  = $fields[$field+1] +0; 
   if(($flag+1<2)){
     $nodata++;
   }else{
     # check run of data-absent fields
     if($nodata_max==$nodata and ($nodata>0)){
       $nodata_maxline = "$nodata_maxline, $fields[0]";
     }
     if($nodata_max<$nodata and ($nodata>0)){
       $nodata_max = $nodata;
       $nodata_maxline=$fields[0];
     }
     # re-initialise run of nodata counter
     $nodata = 0; 
     # gather values for averaging
     $tot_line += $datum;
     $num_line++;
   }
 }

 # totals for the file so far
 $tot_file += $tot_line;
 $num_file += $num_line;

 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
        $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n"; printf "File(s) = %s\n", $infiles; printf "Total = %10.3f\n", $tot_file; printf "Readings = %6i\n", $num_file; printf "Average = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",

      $nodata_max, $nodata_maxline;</lang>

Sample output:

bash$ perl -f readings.pl readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

Python

<lang python># Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput import sys

nodata = 0; # Current run of consecutive flags<0 in lines of file nodata_max=-1; # Max consecutive flags<0 in lines of file nodata_maxline=[]; # ... and line number(s) where it occurs

tot_file = 0 # Sum of file data num_file = 0 # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0

 # extract field info
 field = line.split()
 date  = field[0]
 data  = [float(f) for f in field[1::2]]
 flags = [int(f)   for f in field[2::2]]

 for datum, flag in zip(data, flags):
   if flag<1:
     nodata += 1
   else:
     # check run of data-absent fields
     if nodata_max==nodata and nodata>0:
       nodata_maxline.append(date)
     if nodata_max<nodata and nodata>0:
       nodata_max=nodata
       nodata_maxline=[date]
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line += datum
     num_line += 1

 # totals for the file so far
 tot_file += tot_line
 num_file += num_line

 print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
       date, 
       len(data) -num_line, 
       num_line, tot_line, 
       tot_line/num_line if (num_line>0) else 0)

print "" print "File(s) = %s" % (", ".join(infiles),) print "Total = %10.3f" % (tot_file,) print "Readings = %6i" % (num_file,) print "Average = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (

   nodata_max, ", ".join(nodata_maxline))</lang>

Sample output:

bash$ /cygdrive/c/Python26/python readings.py readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

Vedit macro language

Translation of: AWK

Vedit does not have floating point data type, so fixed point calculations are used here.

50 = Buf_Num // Current edit buffer (source data)

File_Open("output.txt")

51 = Buf_Num // Edit buffer for output file

Buf_Switch(#50)

10 = 0 // total sum of file data
11 = 0 // number of valid data items in file
12 = 0 // Current run of consecutive flags<0 in lines of file
13 = -1 // Max consecutive flags<0 in lines of file

Reg_Empty(15) // ... and date tag(s) at line(s) where it occurs

While(!At_EOF) {

   #20 = 0		// sum of line data
   #21 = 0		// number of line data items with flag>0
   #22 = 0		// number of line data items with flag<0
   Reg_Copy_Block(14, Cur_Pos, Cur_Pos+10)	// date field
   
   // extract field info, skipping initial date field
   Repeat(ALL) {

Search("|{|T,|N}", ADVANCE+ERRBREAK) // next Tab or Newline if (Match_Item==2) { Break } // end of line #30 = Num_Eval(ADVANCE) * 1000 // #30 = value Char // fixed point, 3 decimal digits #30 += Num_Eval(ADVANCE+SUPPRESS) #31 = Num_Eval(ADVANCE) // #31 = flag if (#31 < 1) { // not valid field? #12++ #22++ } else { // valid field // check run of data-absent fields if(#13 == #12 && #12 > 0) { Reg_Set(15, ", ", APPEND) Reg_Set(15, @14, APPEND) } if(#13 < #12 && #12 > 0) { #13 = #12 Reg_Set(15, @14) }

// re-initialise run of nodata counter #12 = 0 // gather values for averaging #20 += #30 #21++ }

   // totals for the file so far
   #10 += #20
   #11 += #21
   
   Buf_Switch(#51)	// buffer for output data
   IT("Line: ") Reg_Ins(14)
   IT("  Reject:") Num_Ins(#22, COUNT, 3)
   IT("  Accept:") Num_Ins(#21, COUNT, 3)
   IT("  Line tot:") Num_Ins(#20, COUNT, 8) Char(-3) IC('.') EOL
   IT("  Line avg:") Num_Ins((#20+#21/2)/#21, COUNT, 7) Char(-3) IC('.') EOL IN
   Buf_Switch(#50)	// buffer for input data

}

Buf_Switch(#51) // buffer for output data IN IT("Total: ") Num_Ins(#10, FORCE+NOCR) Char(-3) IC('.') EOL IN IT("Readings: ") Num_Ins(#11, FORCE) IT("Average: ") Num_Ins((#10+#11/2)/#11, FORCE+NOCR) Char(-3) IC('.') EOL IN IN IT("Maximum run(s) of ") Num_Ins(#13, LEFT+NOCR) IT(" consecutive false readings ends at line starting with date(s): ") Reg_Ins(15) IN </lang>

Sample output:

Line: 2004-12-28  Reject:  1  Accept: 23  Line tot:   77.800  Line avg:   3.383
Line: 2004-12-29  Reject:  1  Accept: 23  Line tot:   56.300  Line avg:   2.448
Line: 2004-12-30  Reject:  1  Accept: 23  Line tot:   65.300  Line avg:   2.839
Line: 2004-12-31  Reject:  1  Accept: 23  Line tot:   47.300  Line avg:   2.057

Total:   1358393.400
Readings:     129403
Average:      10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05