Text processing/1
You are encouraged to solve this task according to the task description, using any language you may know.
Often data is produced by one program, in the wrong format for later use by another program or person. In these situations another program can be written to parse and transform the original data into a format useful to the other. The term "Data Munging" is often used in programming circles for this task.
A request on the comp.lang.awk newsgroup led to a typical data munging task:
I have to analyse data files that have the following format: each row corresponds to one day, and the field logic is: $1 is the date, followed by 24 value/flag pairs representing measurements at 01:00, 02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>

Some test data is available at: ... (no longer available at original location)

I have to sum up the values (per day and only valid data, i.e. with flag>0) in order to calculate the mean. That's not too difficult. However, I also need to know what the "maximum data gap" is, i.e. the longest period with successive invalid measurements (i.e. values with flag<=0).
The data is free to download and use and is of this format:
1991-03-30 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1
1991-03-31 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 20.000 1 20.000 1 20.000 1 35.000 1 50.000 1 60.000 1 40.000 1 30.000 1 30.000 1 30.000 1 25.000 1 20.000 1 20.000 1 20.000 1 20.000 1 20.000 1 35.000 1
1991-03-31 40.000 1 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2
1991-04-01 0.000 -2 13.000 1 16.000 1 21.000 1 24.000 1 22.000 1 20.000 1 18.000 1 29.000 1 44.000 1 50.000 1 43.000 1 38.000 1 27.000 1 27.000 1 24.000 1 23.000 1 18.000 1 12.000 1 13.000 1 14.000 1 15.000 1 13.000 1 10.000 1
1991-04-02 8.000 1 9.000 1 11.000 1 12.000 1 12.000 1 12.000 1 27.000 1 26.000 1 27.000 1 33.000 1 32.000 1 31.000 1 29.000 1 31.000 1 25.000 1 25.000 1 24.000 1 21.000 1 17.000 1 14.000 1 15.000 1 12.000 1 12.000 1 10.000 1
1991-04-03 10.000 1 9.000 1 10.000 1 10.000 1 9.000 1 10.000 1 15.000 1 24.000 1 28.000 1 24.000 1 18.000 1 14.000 1 12.000 1 13.000 1 14.000 1 15.000 1 14.000 1 15.000 1 13.000 1 13.000 1 13.000 1 12.000 1 10.000 1 10.000 1
Only a sample of the data showing its format is given above. The full example file may be downloaded here.
Structure your program to show statistics for each line of the file (similar to the original Python, Perl, and AWK examples below), followed by summary statistics for the file. When showing example output, show just a few line statistics and the full end summary.
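All of the solutions below share the same per-line bookkeeping: split off the date, sum the values whose flag is greater than zero, and track runs of consecutive flag<=0 readings. As an illustration only (not one of the listed solutions), a minimal Python 3 sketch of that per-line logic might look like this; the helper name line_stats is invented for the example:

```python
def line_stats(line):
    """Parse one '<date> <val1> <flag1> ... <val24> <flag24>' record.

    Returns (date, mean of valid values, list of lengths of runs of
    consecutive invalid readings, i.e. readings with flag <= 0).
    """
    fields = line.split()
    date = fields[0]
    values = [float(v) for v in fields[1::2]]   # every other field after the date
    flags = [int(f) for f in fields[2::2]]      # the flag paired with each value

    good = [v for v, f in zip(values, flags) if f > 0]

    # Collect the lengths of maximal runs of invalid (flag <= 0) readings.
    runs, run = [], 0
    for f in flags:
        if f < 1:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)

    mean = sum(good) / len(good) if good else 0.0
    return date, mean, runs

# A shortened record (3 pairs instead of 24) for demonstration:
print(line_stats("1991-04-01 0.000 -2 13.000 1 16.000 1"))
```

A full solution extends this by carrying the current run across line boundaries, since the maximum data gap can span several days.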
Ada
<ada>
with Ada.Text_IO;            use Ada.Text_IO;
with Strings_Edit;           use Strings_Edit;
with Strings_Edit.Floats;    use Strings_Edit.Floats;
with Strings_Edit.Integers;  use Strings_Edit.Integers;

procedure Data_Munging is
   Syntax_Error : exception;
   type Gap_Data is record
      Count   : Natural := 0;
      Line    : Natural := 0;
      Pointer : Integer;
      Year    : Integer;
      Month   : Integer;
      Day     : Integer;
   end record;
   File    : File_Type;
   Max     : Gap_Data;
   This    : Gap_Data;
   Current : Gap_Data;
   Count   : Natural := 0;
   Sum     : Float   := 0.0;
begin
   Open (File, In_File, "readings.txt");
   loop
      declare
         Line    : constant String := Get_Line (File);
         Pointer : Integer := Line'First;
         Flag    : Integer;
         Data    : Float;
      begin
         Current.Line := Current.Line + 1;
         Get (Line, Pointer, SpaceAndTab);
         Get (Line, Pointer, Current.Year);
         Get (Line, Pointer, Current.Month);
         Get (Line, Pointer, Current.Day);
         while Pointer < Line'Last loop
            Get (Line, Pointer, SpaceAndTab);
            Current.Pointer := Pointer;
            Get (Line, Pointer, Data);
            Get (Line, Pointer, SpaceAndTab);
            Get (Line, Pointer, Flag);
            if Flag < 0 then
               if This.Count = 0 then
                  This := Current;
               end if;
               This.Count := This.Count + 1;
            else
               if This.Count > 0 and then Max.Count < This.Count then
                  Max := This;
               end if;
               This.Count := 0;
               Count := Count + 1;
               Sum := Sum + Data;
            end if;
         end loop;
      exception
         when End_Error =>
            raise Syntax_Error;
      end;
   end loop;
exception
   when End_Error =>
      Close (File);
      if This.Count > 0 and then Max.Count < This.Count then
         Max := This;
      end if;
      Put_Line ("Average " & Image (Sum / Float (Count)) & " over " & Image (Count));
      if Max.Count > 0 then
         Put ("Max. " & Image (Max.Count) & " false readings start at ");
         Put (Image (Max.Line) & ':' & Image (Max.Pointer) & " stamped ");
         Put_Line (Image (Max.Year) & Image (Max.Month) & Image (Max.Day));
      end if;
   when others =>
      Close (File);
      Put_Line ("Syntax error at " & Image (Current.Line) & ':' & Image (Max.Pointer));
end Data_Munging;
</ada>
The implementation performs minimal checks. The average is calculated over all valid data. For the maximal chain of consecutive invalid data, the source line number, the column number, and the time stamp of the first invalid reading are printed. Sample output:
Average 10.47915 over 129628
Max. 589 false readings start at 1136:20 stamped 1993-2-9
AWK
<c># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  nodata = 0;            # Current run of consecutive flags<0 in lines of file
  nodata_max = -1;       # Max consecutive flags<0 in lines of file
  nodata_maxline = "!";  # ... and line number(s) where it occurs
}
FNR == 1 {
  # Accumulate input file names
  if (infiles) {
    infiles = infiles ", " FILENAME
  } else {
    infiles = FILENAME
  }
}
{
  tot_line = 0;  # sum of line data
  num_line = 0;  # number of line data items with flag>0

  # extract field info, skipping initial date field
  for (field = 2; field <= NF; field += 2) {
    datum = $field;
    flag  = $(field + 1);
    if (flag < 1) {
      nodata++
    } else {
      # check run of data-absent fields
      if (nodata_max == nodata && (nodata > 0)) {
        nodata_maxline = nodata_maxline ", " $1
      }
      if (nodata_max < nodata && (nodata > 0)) {
        nodata_max = nodata
        nodata_maxline = $1
      }
      # re-initialise run of nodata counter
      nodata = 0;
      # gather values for averaging
      tot_line += datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
         $1, ((NF - 1) / 2) - num_line, num_line, tot_line, (num_line > 0) ? tot_line / num_line : 0

  # debug prints of original data plus some of the computed values
  #printf "%s %15.3g %4i\n", $0, tot_line, num_line
  #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}
END {
  printf "\n"
  printf "File(s)  = %s\n", infiles
  printf "Total    = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average  = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}</c>
Sample output:
bash$ awk -f readings.awk readings.txt | tail
Line: 2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line: 2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line: 2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
Perl
<perl># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  $nodata = 0;            # Current run of consecutive flags<0 in lines of file
  $nodata_max = -1;       # Max consecutive flags<0 in lines of file
  $nodata_maxline = "!";  # ... and line number(s) where it occurs
}
foreach (@ARGV) {
  # Accumulate input file names
  if ($infiles ne "") {
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}
while (<>) {
  $tot_line = 0;  # sum of line data
  $num_line = 0;  # number of line data items with flag>0

  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for ($field = 1; $field < $nf; $field += 2) {
    $datum = $fields[$field] + 0.0;
    $flag  = $fields[$field + 1] + 0;
    if ($flag < 1) {
      $nodata++;
    } else {
      # check run of data-absent fields
      if ($nodata_max == $nodata and ($nodata > 0)) {
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if ($nodata_max < $nodata and ($nodata > 0)) {
        $nodata_max = $nodata;
        $nodata_maxline = $fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
         $date, (($nf - 1) / 2) - $num_line, $num_line, $tot_line, ($num_line > 0) ? $tot_line / $num_line : 0;
}
printf "\n";
printf "File(s)  = %s\n", $infiles;
printf "Total    = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average  = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;</perl>
Sample output:
bash$ perl -f readings.pl readings.txt | tail
Line: 2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line: 2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line: 2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
Python
<python># Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0           # Current run of consecutive flags<0 in lines of file
nodata_max = -1      # Max consecutive flags<0 in lines of file
nodata_maxline = []  # ... and line number(s) where it occurs

tot_file = 0  # Sum of file data
num_file = 0  # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
    tot_line = 0  # sum of line data
    num_line = 0  # number of line data items with flag>0

    # extract field info
    field = line.split()
    date  = field[0]
    data  = [float(f) for f in field[1::2]]
    flags = [int(f) for f in field[2::2]]

    for datum, flag in zip(data, flags):
        if flag < 1:
            nodata += 1
        else:
            # check run of data-absent fields
            if nodata_max == nodata and nodata > 0:
                nodata_maxline.append(date)
            if nodata_max < nodata and nodata > 0:
                nodata_max = nodata
                nodata_maxline = [date]
            # re-initialise run of nodata counter
            nodata = 0
            # gather values for averaging
            tot_line += datum
            num_line += 1

    # totals for the file so far
    tot_file += tot_line
    num_file += num_line

    print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
        date, len(data) - num_line, num_line, tot_line,
        tot_line / num_line if num_line > 0 else 0)

print ""
print "File(s)  = %s" % (", ".join(infiles),)
print "Total    = %10.3f" % (tot_file,)
print "Readings = %6i" % (num_file,)
print "Average  = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline))</python>
Sample output:
bash$ /cygdrive/c/Python26/python readings.py readings.txt | tail
Line: 2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line: 2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line: 2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$