Text processing/1: Difference between revisions

From Rosetta Code
Revision as of 13:04, 22 August 2011

Task
Text processing/1
You are encouraged to solve this task according to the task description, using any language you may know.
This task has been flagged for clarification. Code on this page in its current state may be flagged incorrect once this task has been clarified. See this page's Talk page for discussion.


Often data is produced by one program, in the wrong format for later use by another program or person. In these situations another program can be written to parse and transform the original data into a format useful to the other. The term "Data Munging" is often used in programming circles for this task.

A request on the comp.lang.awk newsgroup led to a typical data munging task:

I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: $1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ...  <val24> <flag24>

Some test data is available at: 
... (no longer available at original location)

I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e. values with
flag<=0)
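
The rules in the post above can be sketched in Python, the language of one of the original examples. This is a minimal sketch, not the poster's code: the names `parse_row`, `valid_mean` and `longest_invalid_run` are hypothetical, and it assumes whitespace-separated fields with exactly one flag after each value.

```python
from typing import Iterable, List, Tuple

Reading = Tuple[float, int]  # (value, flag)


def parse_row(line: str) -> Tuple[str, List[Reading]]:
    """Split one row into its date and a list of (value, flag) pairs."""
    fields = line.split()
    date, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected value/flag pairs after the date")
    readings = [(float(rest[i]), int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return date, readings


def valid_mean(readings: Iterable[Reading]) -> float:
    """Mean of the values whose flag is > 0; 0.0 if there are none."""
    values = [value for value, flag in readings if flag > 0]
    return sum(values) / len(values) if values else 0.0


def longest_invalid_run(flags: Iterable[int]) -> int:
    """Length of the longest run of consecutive flags <= 0."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag <= 0 else 0
        longest = max(longest, current)
    return longest
```

Because a data gap can span midnight, `longest_invalid_run` would need to be fed the flags of all rows in file order, not one row at a time.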

The data is free to download and use and is of this format:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

Only a sample of the data showing its format is given above. The full example file may be downloaded here.

Structure your program to show statistics for each line of the file (similar to the original Python, Perl, and AWK examples below), followed by summary statistics for the file. When showing example output, show just a few line statistics and the full end summary.
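
One possible shape for such a program is sketched below in Python. It is an illustration under stated assumptions, not a reference solution: `line_stats` and `report` are hypothetical names, the output layout only loosely follows the AWK example, and the data-gap search is left out to keep the sketch short.

```python
from typing import Iterable, List, Tuple


def line_stats(line: str) -> Tuple[str, int, int, float]:
    """Return (date, rejected, accepted, total) for one row of readings."""
    fields = line.split()
    date, rest = fields[0], fields[1:]
    accepted = 0
    total = 0.0
    for i in range(0, len(rest) - 1, 2):
        value, flag = float(rest[i]), int(rest[i + 1])
        if flag > 0:  # only readings with a positive flag count
            accepted += 1
            total += value
    rejected = len(rest) // 2 - accepted
    return date, rejected, accepted, total


def report(lines: Iterable[str]) -> List[str]:
    """Build per-line statistics followed by a summary for the whole input."""
    out = []
    file_total = 0.0
    file_count = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        date, rejected, accepted, total = line_stats(line)
        avg = total / accepted if accepted else 0.0
        out.append(
            f"Line: {date}  Reject: {rejected:2d}  Accept: {accepted:2d}"
            f"  Line_tot: {total:10.3f}  Line_avg: {avg:10.3f}"
        )
        file_total += total
        file_count += accepted
    out.append("")
    out.append(f"Total    = {file_total:10.3f}")
    out.append(f"Readings = {file_count:6d}")
    file_avg = file_total / file_count if file_count else 0.0
    out.append(f"Average  = {file_avg:10.3f}")
    return out
```

Assuming the data has been saved as `readings.txt`, the report could be printed with `print("\n".join(report(open("readings.txt"))))`.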

Ada

<lang ada>with Ada.Text_IO;            use Ada.Text_IO;
with Strings_Edit;           use Strings_Edit;
with Strings_Edit.Floats;    use Strings_Edit.Floats;
with Strings_Edit.Integers;  use Strings_Edit.Integers;

procedure Data_Munging is

  Syntax_Error : exception;
  type Gap_Data is record
     Count   : Natural := 0;
     Line    : Natural := 0;
     Pointer : Integer;
     Year    : Integer;
     Month   : Integer;
     Day     : Integer;
  end record;
  File    : File_Type;
  Max     : Gap_Data;
  This    : Gap_Data;
  Current : Gap_Data;
  Count   : Natural := 0;
  Sum     : Float   := 0.0;

begin

  Open (File, In_File, "readings.txt");
  loop
     declare
        Line    : constant String := Get_Line (File);
        Pointer : Integer := Line'First;
        Flag    : Integer;
        Data    : Float;
     begin
        Current.Line := Current.Line + 1;
        Get (Line, Pointer, SpaceAndTab);
        Get (Line, Pointer, Current.Year);
        Get (Line, Pointer, Current.Month);
        Get (Line, Pointer, Current.Day);
        while Pointer <= Line'Last loop
           Get (Line, Pointer, SpaceAndTab);
           Current.Pointer := Pointer;
           Get (Line, Pointer, Data);
           Get (Line, Pointer, SpaceAndTab);
           Get (Line, Pointer, Flag);
           if Flag < 0 then
              if This.Count = 0 then
                 This := Current;
              end if;
              This.Count := This.Count + 1;
           else
              if This.Count > 0 and then Max.Count < This.Count then
                 Max := This;
              end if;
              This.Count := 0;
              Count := Count + 1;
              Sum   := Sum + Data;
           end if;
        end loop;
     exception
        when End_Error =>
           raise Syntax_Error;
     end;
  end loop;

exception

  when End_Error =>
     Close (File);
     if This.Count > 0 and then Max.Count < This.Count then
        Max := This;
     end if;
     Put_Line ("Average " & Image (Sum / Float (Count)) & " over " & Image (Count));
     if Max.Count > 0 then
        Put ("Max. " & Image (Max.Count) & " false readings start at ");
        Put (Image (Max.Line) & ':' & Image (Max.Pointer) & " stamped ");
        Put_Line (Image (Max.Year) & Image (Max.Month) & Image (Max.Day));
     end if;
  when others =>
     Close (File);
     Put_Line ("Syntax error at " & Image (Current.Line) & ':' & Image (Current.Pointer));

end Data_Munging;</lang>
The implementation performs minimal checks. The average is calculated over all valid data. For the longest chain of consecutive invalid data, the source line number, the column number, and the time stamp of the first invalid reading are printed. Sample output:

Average 10.47915 over 129628
Max. 589 false readings start at 1136:20 stamped 1993-2-9
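
The bookkeeping described above, remembering where the current run of invalid readings began and keeping it whenever the run becomes the longest seen, might look like this in Python. This is a sketch, not a translation of the Ada: `GapStart` and `longest_gap` are hypothetical names, it reports a 1-based line number and hour index instead of a character column, and it treats any flag <= 0 as invalid, following the task text.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class GapStart(NamedTuple):
    line_no: int  # 1-based line number where the run begins
    hour: int     # 1-based index of the first invalid reading on that line
    date: str     # date stamp of that line


def longest_gap(lines: Iterable[str]) -> Tuple[int, Optional[GapStart]]:
    """Return (length, start) of the longest run of readings with flag <= 0."""
    best_len, best_start = 0, None
    run_len, run_start = 0, None
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        date, rest = fields[0], fields[1:]
        # flags sit at the odd positions of `rest`: value, flag, value, flag, ...
        for hour, i in enumerate(range(1, len(rest), 2), start=1):
            if int(rest[i]) <= 0:
                if run_len == 0:
                    run_start = GapStart(line_no, hour, date)
                run_len += 1
                if run_len > best_len:
                    best_len, best_start = run_len, run_start
            else:
                run_len = 0
    return best_len, best_start
```

Updating the best run while it is still growing means a gap that reaches the end of the file is counted without a separate end-of-input check.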

ALGOL 68

Translation of: python
Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

<lang algol68>INT no data := 0;               # Current run of consecutive flags<0 in lines of file #
INT no data max := -1;          # Max consecutive flags<0 in lines of file #
FLEX[0]STRING no data max line; # ... and line number(s) where it occurs #

REAL tot file := 0;             # Sum of file data #
INT num file := 0;              # Number of file data items with flag>0 #

# CHAR fs = " "; #
INT nf = 24;

INT upb list := nf;
FORMAT list repr = $n(upb list-1)(g", ")g$;

PROC exception = ([]STRING args)VOID:(

 putf(stand error, ($"Exception"$, $", "g$, args, $l$));
 stop

);

PROC raise io error = (STRING message)VOID:exception(("io error", message));

OP +:= = (REF FLEX []STRING rhs, STRING append)REF FLEX[]STRING: (

 HEAP [UPB rhs+1]STRING out rhs;
 out rhs[:UPB rhs] := rhs;
 out rhs[UPB rhs+1] := append;
 rhs := out rhs;
 out rhs

);

INT upb opts = 3; # these are "a68g" "./Data_Munging.a68" & "-" #
[argc - upb opts]STRING in files;
FOR arg TO UPB in files DO in files[arg] := argv(upb opts + arg) OD;

MODE FIELD = STRUCT(REAL data, INT flag);
FORMAT field repr = $2(g)$;

FOR index file TO UPB in files DO

 STRING file name = in files[index file], FILE file;
 IF open(file, file name, stand in channel) NE 0 THEN
   raise io error("Cannot open """+file name+"""") FI;
 on logical file end(file, (REF FILE f)BOOL: logical file end done);
 REAL tot line, INT num line;
 # make term(file, ", ") for CSV data #
 STRING date;
 DO
   tot line := 0;             # sum of line data #
   num line := 0;             # number of line data items with flag>0 #
   # extract field info #
   [nf]FIELD data;
   getf(file, ($10a$, date, field repr, data, $l$));
   FOR key TO UPB data DO
     FIELD field = data[key];
     IF flag OF field<1 THEN
       no data +:= 1
     ELSE
       # check run of data-absent data #
       IF no data max = no data AND no data>0 THEN
         no data max line +:= date FI;
       IF no data max<no data AND no data>0 THEN
         no data max := no data;
         no data max line := date FI;
       # re-initialise run of no data counter #
       no data := 0;
       # gather values for averaging #
       tot line +:= data OF field;
       num line +:= 1
     FI
   OD;
   # totals for the file so far #
   tot file +:= tot line;
   num file +:= num line;
   printf(($"Line: "g"  Reject: "g(-2)"  Accept: "g(-2)"  Line tot: "g(-14, 3)"  Line avg: "g(-14, 3)l$,
         date,
         UPB(data) -num line,
         num line, tot line,
         IF num line>0 THEN tot line/num line ELSE 0 FI))
 OD;
 logical file end done:
   close(file)

OD;

FORMAT plural = $b(" ", "s")$,

      p = $b("", "s")$;

upb list := UPB in files;
printf(($l"File"f(plural)" = "$, upb list = 1, list repr, in files, $l$,

       $"Total    = "g(-0, 3)l$, tot file,
       $"Readings = "g(-0)l$, num file,
       $"Average  = "g(-0, 3)l$, tot file / num file));

upb list := UPB no data max line;
printf(($l"Maximum run"f(p)" of "g(-0)" consecutive false reading"f(p)" ends at line starting with date"f(p)": "$,

   upb list = 1, no data max, no data max = 0, upb list = 1, list repr, no data max line, $l$))</lang>

Command:

$ a68g ./Data_Munging.a68 - data

Output:

Line: 1991-03-30  Reject:  0  Accept: 24  Line tot:        240.000  Line avg:         10.000
Line: 1991-03-31  Reject:  0  Accept: 24  Line tot:        565.000  Line avg:         23.542
Line: 1991-03-31  Reject: 23  Accept:  1  Line tot:         40.000  Line avg:         40.000
Line: 1991-04-01  Reject:  1  Accept: 23  Line tot:        534.000  Line avg:         23.217
Line: 1991-04-02  Reject:  0  Accept: 24  Line tot:        475.000  Line avg:         19.792
Line: 1991-04-03  Reject:  0  Accept: 24  Line tot:        335.000  Line avg:         13.958

File     = data
Total    = 2189.000
Readings = 120
Average  = 18.242

Maximum run of 24 consecutive false readings ends at line starting with date: 1991-04-01


AutoHotkey

<lang AutoHotkey>; Author AlephX Aug 17 2011

SetFormat, float, 4.2
SetFormat, FloatFast, 4.2

data = %A_scriptdir%\readings.txt
result = %A_scriptdir%\results.txt
totvalid := 0
totsum := 0
totavg:= 0

Loop, Read, %data%, %result%
{
    sum := 0
    Valid := 0
    Couples := 0
    Lines := A_Index

    Loop, parse, A_LoopReadLine, %A_Tab%
    {
        ;MsgBox, Field number %A_Index% is %A_LoopField%
        if A_index = 1
        {
            Date := A_LoopField
            Counter := 0
        }
        else
        {
            Counter++
            couples := Couples + 0.5
            if Counter = 1
            {
                value := A_LoopField / 1
            }
            else
            {
                if A_loopfield > 0
                {
                    Sum := Sum + value
                    Valid++

                    if (wrong > maxwrong)
                    {
                        maxwrong := wrong
                        lastwrongdate := currwrongdate
                        startwrongdate := firstwrongdate
                        startoccurrence := firstoccurrence
                        lastoccurrence := curroccurrence
                    }
                    wrong := 0
                }
                else
                {
                    wrong++
                    currwrongdate := date
                    curroccurrence := (A_index-1) / 2
                    if (wrong = 1)
                    {
                        firstwrongdate := date
                        firstoccurrence := curroccurrence
                    }
                }
                Counter := 0
            }
        }
    }
    avg := sum / valid
    TotValid := Totvalid+valid
    TotSum := Totsum+sum
    FileAppend, Day: %date% sum: %sum% avg: %avg% Readings: %valid%/%couples%`n
}

Totavg := TotSum / TotValid
FileAppend, `n`nDays %Lines%`nMaximal wrong readings: %maxwrong% from %startwrongdate% at %startoccurrence% to %lastwrongdate% at %lastoccurrence%`n`n, %result%
FileAppend, Valid readings: %TotValid%`nTotal Value: %TotSUm%`nAverage: %TotAvg%, %result%</lang>
Sample output:

Day: 1990-01-01 sum: 590.00 avg: 26.82 Readings: 22/24.00
Day: 1990-01-02 sum: 410.00 avg: 17.08 Readings: 24/24.00
Day: 1990-01-03 sum: 1415.00 avg: 58.96 Readings: 24/24.00
Day: 1990-01-04 sum: 1800.00 avg: 75.00 Readings: 24/24.00
Day: 1990-01-05 sum: 1130.00 avg: 47.08 Readings: 24/24.00
...
Day: 2004-12-31 sum: 47.30 avg: 2.06 Readings: 23/24.00


Days 5471
Maximal wrong readings: 589 from 1993-02-09 at 2.00 to 1993-03-05 at 14.00

Valid readings: 129403
Total Value: 1358393.40
Average: 10.50

AWK

<lang awk># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{

 nodata = 0;             # Current run of consecutive flags<0 in lines of file
 nodata_max=-1;          # Max consecutive flags<0 in lines of file
 nodata_maxline="!";     # ... and line number(s) where it occurs

} FNR==1 {

 # Accumulate input file names
 if(infiles){
   infiles = infiles "," FILENAME
 } else {
   infiles = FILENAME
 }

} {

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0
 # extract field info, skipping initial date field
 for(field=2; field<=NF; field+=2){
   datum=$field; 
   flag=$(field+1); 
   if(flag<1){
     nodata++
   }else{
     # check run of data-absent fields
     if(nodata_max==nodata && (nodata>0)){
       nodata_maxline=nodata_maxline ", " $1
     }
     if(nodata_max<nodata && (nodata>0)){
       nodata_max=nodata
       nodata_maxline=$1
     }
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line+=datum
     num_line++;
   }
 }
 # totals for the file so far
 tot_file += tot_line
 num_file += num_line
 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
        $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0
 # debug prints of original data plus some of the computed values
 #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
 #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline


}

END{

 printf "\n"
 printf "File(s)  = %s\n", infiles
 printf "Total    = %10.3f\n", tot_file
 printf "Readings = %6i\n", num_file
 printf "Average  = %10.3f\n", tot_file / num_file
 printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline

}</lang> Sample output:

bash$ awk -f readings.awk readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$ 


Batch File

<lang dos>@echo off
setlocal ENABLEDELAYEDEXPANSION
set maxrun= 0
set maxstart=
set maxend=
set notok=0
set inputfile=%1
for /F "tokens=1,*" %%i in (%inputfile%) do (

   set date=%%i
   call :processline %%j

)

echo\
echo max false: %maxrun% from %maxstart% until %maxend%

goto :EOF

:processline
set sum=0000
set count=0
set hour=1

:loop
if "%1"=="" goto :result
set num=%1
if "%2"=="1" (

   if "%notok%" NEQ "0" (
       set notok=     !notok!
       if /I "!notok:~-5!" GTR "%maxrun%" (
           set maxrun=!notok:~-5!
           set maxstart=%nok0date% %nok0hour%
           set maxend=%nok1date% %nok1hour%
       )
       set notok=0
   )
   set /a sum+=%num:.=%
   set /a count+=1

) else (

   if "%notok%" EQU "0" (
       set nok0date=%date%
       set nok0hour=%hour%
   ) else (
       set nok1date=%date%
       set nok1hour=%hour%
   )
   set /a notok+=1

)
shift
shift
set /a hour+=1
goto :loop

:result

if "%count%"=="0" (

   set mean=0

) else (

   set /a mean=%sum%/%count%

)
if "%mean%"=="0" set mean=0000
if "%sum%"=="0" set sum=0000
set mean=%mean:~0,-3%.%mean:~-3%
set sum=%sum:~0,-3%.%sum:~-3%
set count= %count%
set sum= %sum%
set mean= %mean%
echo Line: %date% Accept: %count:~-3% tot: %sum:~-8% avg: %mean:~-8%

goto :EOF

</lang>

sample output:


C:\ >batch-fileparsing.bat readings-2.txt
Line: 1990-01-01 Accept:  22  tot:  590.000  avg:   26.818
Line: 1990-01-02 Accept:  24  tot:  410.000  avg:   17.083
Line: 1990-01-03 Accept:  24  tot: 1415.000  avg:   58.958
Line: 1990-01-04 Accept:  24  tot: 1800.000  avg:   75.000
Line: 1990-01-05 Accept:  24  tot: 1130.000  avg:   47.083
Line: 1990-01-06 Accept:  24  tot: 1820.000  avg:   75.833
...
Line: 1993-12-26 Accept:  24  tot:  195.000  avg:    8.125
Line: 1993-12-27 Accept:  24  tot:  112.000  avg:    4.666
Line: 1993-12-28 Accept:  24  tot:  303.000  avg:   12.625
Line: 1993-12-29 Accept:  24  tot:  339.000  avg:   14.125
Line: 1993-12-30 Accept:  24  tot:  593.000  avg:   24.708
Line: 1993-12-31 Accept:  24  tot:  865.000  avg:   36.041
...
max false:   589  from 1993-02-09 2 until 1993-03-05 14

C

<lang c>#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int badHrs, maxBadHrs;

static double hrsTot = 0.0;
static int rdgsTot = 0;
char bhEndDate[40];

int mungeLine( char *line, int lno, FILE *fout ) {

   char date[40], *tkn;
   int   dHrs, flag, hrs2, hrs;
   double hrsSum;
   int   hrsCnt = 0;
   double avg;
   tkn = strtok(line, ".");
   if (tkn) {
       int n = sscanf(tkn, "%39s %d", date, &hrs2);  /* date is already a char*; bound the read to the buffer */
       if (n<2) {
           printf("badly formated line - %d %s\n", lno, tkn);
           return 0;
       }
       hrsSum = 0.0;
       while( tkn= strtok(NULL, ".")) {
           n = sscanf(tkn,"%d %d %d", &dHrs, &flag, &hrs);
           if (n>=2) {
               if (flag > 0) {
                   hrsSum += 1.0*hrs2 + .001*dHrs;
                   hrsCnt += 1;
                   if (maxBadHrs < badHrs) {
                       maxBadHrs = badHrs;
                       strcpy(bhEndDate, date);
                   }
                   badHrs = 0;
               }
               else {
                   badHrs += 1;
               }
               hrs2 = hrs;
           }
           else {
               printf("bad file syntax line %d: %s\n",lno, tkn);
           }
       }
       avg = (hrsCnt > 0)? hrsSum/hrsCnt : 0.0;
       fprintf(fout, "%s  Reject: %2d  Accept: %2d  Average: %7.3f\n",
               date, 24-hrsCnt, hrsCnt, avg);  /* use avg to avoid dividing by zero */
       hrsTot += hrsSum;
       rdgsTot += hrsCnt;
   }
   return 1;

}

int main() {

   FILE *infile, *outfile;
   int lineNo = 0;
   char line[512];
   const char *ifilename = "readings.txt";
   outfile = fopen("V0.txt", "w");
   infile = fopen(ifilename, "rb");
   if (!infile) {
       printf("Can't open %s\n", ifilename);
       exit(1);
   }
   while (NULL != fgets(line, 512, infile)) {
       lineNo += 1;
       if (0 == mungeLine(line, lineNo, outfile))
           printf("Bad line at %d\n",lineNo);
   }
   fclose(infile);
   fprintf(outfile, "File:     %s\n", ifilename);
   fprintf(outfile, "Total:    %.3f\n", hrsTot);
   fprintf(outfile, "Readings: %d\n", rdgsTot);
   fprintf(outfile, "Average:  %.3f\n", hrsTot/rdgsTot);
   fprintf(outfile, "\nMaximum number of consecutive bad readings is %d\n", maxBadHrs);
   fprintf(outfile, "Ends on date %s\n", bhEndDate);
   fclose(outfile);
   return 0;

}</lang> Sample output

1990-01-01  Reject:  2  Accept: 22  Average:  26.818
1990-01-02  Reject:  0  Accept: 24  Average:  17.083
1990-01-03  Reject:  0  Accept: 24  Average:  58.958
1990-01-04  Reject:  0  Accept: 24  Average:  75.000
1990-01-05  Reject:  0  Accept: 24  Average:  47.083
...
File:     readings.txt
Total:    1358393.400
Readings: 129403
Average:  10.497

Maximum number of consecutive bad readings is 589
Ends on date 1993-03-05

C++

<lang Cpp>#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <iomanip>
#include <boost/lexical_cast.hpp>
#include <boost/algorithm/string.hpp>

using std::cout;
using std::endl;
const int NumFlags = 24;

int main() {

   std::fstream file("readings.txt");
   int badCount = 0;
   std::string badDate;
   int badCountMax = 0;
   while(true)
   {
       std::string line;
       getline(file, line);
       if(!file.good())
           break;
       std::vector<std::string> tokens;
       boost::algorithm::split(tokens, line, boost::is_space());
       if(tokens.size() != NumFlags * 2 + 1)
       {
           cout << "Bad input file." << endl;
           return 0;
       }
       double total = 0.0;
       int accepted = 0;
       for(size_t i = 1; i < tokens.size(); i += 2)
       {
           double val = boost::lexical_cast<double>(tokens[i]);
           int flag = boost::lexical_cast<int>(tokens[i+1]);
           if(flag > 0)
           {
               total += val;
               ++accepted;
               badCount = 0;
           }
           else
           {
               ++badCount;
               if(badCount > badCountMax)
               {
                   badCountMax = badCount;
                   badDate = tokens[0];
               }
           }
       }
       cout << tokens[0];
       cout << "  Reject: " << std::setw(2) << (NumFlags - accepted);
       cout << "  Accept: " << std::setw(2) << accepted;
       cout << "  Average: " << std::setprecision(5) << total / accepted << endl;
   }
   cout << endl;
   cout << "Maximum number of consecutive bad readings is " << badCountMax << endl;
   cout << "Ends on date " << badDate << endl;

}</lang>

Output:

1990-01-01  Reject:  2  Accept: 22  Average: 26.818
1990-01-02  Reject:  0  Accept: 24  Average: 17.083
1990-01-03  Reject:  0  Accept: 24  Average: 58.958
1990-01-04  Reject:  0  Accept: 24  Average: 75
1990-01-05  Reject:  0  Accept: 24  Average: 47.083
...
Maximum number of consecutive bad readings is 589
Ends on date 1993-03-05

Common Lisp

This example is incorrect. Please fix the code and remove this message.

Details: The output is incorrect - worst run

<lang lisp>(defstruct (measurement
             (:conc-name "MEASUREMENT-")
             (:constructor make-measurement (counter line date flag value)))

 (counter 0 :type (integer 0))
 (line 0 :type (integer 0))
 (date nil :type symbol)
 (flag 0 :type integer)
 (value 0 :type real))

(defun measurement-valid-p (m)

 (> (measurement-flag m) 0))

(defun map-data-stream (function stream)

 (flet ((scan (&optional (errorp t)) (read stream errorp nil)))
   (loop 
      :with global-count = 0
      :for date = (scan nil) :then (scan nil)
      :for line-number :upfrom 1
      :while date
      :do (loop
             :for count :upfrom 0 :below 24
             :do (let* ((value (scan))
                        (flag (scan)))
                   (funcall function (make-measurement global-count line-number date flag value))
                   (incf global-count)))

      :finally (return global-count))))

(defun map-data-file (function pathname)

 (with-open-file (stream pathname
                         :element-type 'character
                         :direction :input
                         :if-does-not-exist :error)

   (map-data-stream function stream)))

(defmacro do-data-stream ((variable stream) &body body)

 `(map-data-stream 
    (lambda (,variable) ,@body)
    ,stream))

(defmacro do-data-file ((variable file) &body body)

 `(map-data-file 
    (lambda (,variable) ,@body)
    ,file))

(let ((current-day nil)

     (current-line 0)
     (beginning-of-misreadings nil)
     (current-length 0)
     (worst-beginning nil)
     (worst-length 0)
     (sum-of-day 0)
     (count-of-day 0))
 (flet ((write-end-of-day-report ()
          (when current-day
            (format t "Line ~5D Date ~A: Accepted ~2D Total ~8,3F Average ~8,3F~%"
                    current-line current-day count-of-day sum-of-day
                    (if (> count-of-day 0)
                        (/ sum-of-day count-of-day)
                        sum-of-day)))))

   (do-data-file (m #P"D:/Scratch/data.txt")
     (let* ((date (measurement-date m))
            (line-number (measurement-line m))
            (validp (measurement-valid-p m))
            (day-changed-p (/= current-line line-number))
            (value (measurement-value m)))

       (when day-changed-p
         (write-end-of-day-report)
         (setf current-day date)
         (setf current-line line-number)
         (setf sum-of-day 0)
         (setf count-of-day 0))

       (if (not validp)
           (if beginning-of-misreadings
               (incf current-length)
               (progn
                 (setf beginning-of-misreadings m)
                 (setf current-length 1)))
           (progn
             (when beginning-of-misreadings
               ;; A run of invalid readings has just ended: record it if it is
               ;; the worst so far, then reset the run in either case.
               (when (> current-length worst-length)
                 (setf worst-beginning beginning-of-misreadings)
                 (setf worst-length current-length))
               (setf beginning-of-misreadings nil)
               (setf current-length 0))
             (incf sum-of-day value)
             (incf count-of-day)))))

   (when (and beginning-of-misreadings (> current-length worst-length))
     (setf worst-beginning beginning-of-misreadings)
     (setf worst-length current-length))
   
   (write-end-of-day-report))
 (format t "Worst run started ~A (~D) and has length ~D~%"
         (measurement-date worst-beginning)
         (measurement-counter worst-beginning)
         worst-length))</lang>

Example output:

Line     1 Date 1991-03-30: Accepted 24 Total  240.000 Average   10.000
Line     2 Date 1991-03-31: Accepted 24 Total  565.000 Average   23.542
Line     3 Date 1991-03-31: Accepted  1 Total   40.000 Average   40.000
Line     4 Date 1991-04-01: Accepted 23 Total  534.000 Average   23.217
Line     5 Date 1991-04-02: Accepted 24 Total  475.000 Average   19.792
Line     6 Date 1991-04-03: Accepted 24 Total  335.000 Average   13.958
Worst run started 1991-03-31 (49) and has length 24

D

<lang d>// Author Daniel Keep Mar 23 2009
module data_munging;

import std.conv   : toInt, toDouble;
import std.stdio  : writefln;
import std.stream : BufferedFile;
import std.string : split, join, format;

void main(string[] args) {

   int      noData,
            noDataMax = -1;
   string[] noDataMaxLine;
   double   fileTotal = 0.0;
   int      fileValues;
   foreach( arg ; args[1..$] )
   {
       foreach( char[] line ; new BufferedFile(arg) )
       {
           double lineTotal = 0.0;
           int    lineValues;
           // Extract field info
           auto parts = split(line);
           auto date = parts[0];
           auto fields = parts[1..$];
           assert( (fields.length & 1) == 0,
                   format("Expected even number of fields, not %d.",
                       fields.length) );
           for( auto i=0; i<fields.length; i += 2 )
           {
               auto value = toDouble(fields[i]);
               auto flag = toInt(fields[i+1]);
               if( flag < 1 )
               {
                   ++ noData;
                   continue;
               }
               // Check run of data-absent fields
               if( noDataMax == noData && noData > 0 )
                   noDataMaxLine ~= date;
               
               if( noDataMax < noData && noData > 0 )
               {
                   noDataMax = noData;
                   noDataMaxLine.length = 1;
                   noDataMaxLine[0] = date;
               }
               // Re-initialise run of noData counter
               noData = 0;
               // Gather values for averaging
               lineTotal += value;
               ++ lineValues;
           }
           // Totals for the file so far
           fileTotal += lineTotal;
           fileValues += lineValues;
           writefln("Line: %11s  Reject: %2d  Accept: %2d"
                   "  Line_tot: %10.3f  Line_avg: %10.3f",
                   date,
                   fields.length/2 - lineValues,
                   lineValues,
                   lineTotal,
                   lineValues > 0
                       ? lineTotal/lineValues
                       : 0.0);
       }
   }
   writefln("");
   writefln("File(s)  = ", join(args[1..$], ", "));
   writefln("Total    = %10.3f", fileTotal);
   writefln("Readings = %6d", fileValues);
   writefln("Average  = %10.3f", fileTotal/fileValues);
   writefln("\nMaximum run(s) of %d consecutive false readings ends"
           " at line starting with date(s): %s",
           noDataMax, join(noDataMaxLine, ", "));

}</lang>

Output matches that of the Python version.

Forth

Works with: GNU Forth

<lang forth>\ data munging

\ 1991-03-30[\t10.000\t[-]1]*24

\ 1. mean of valid (flag > 0) values per day and overall
\ 2. length of longest run of invalid values, and when it happened

fvariable day-sum
variable day-n

fvariable total-sum
variable total-n

10 constant date-size	\ yyyy-mm-dd
create cur-date date-size allot

create bad-date date-size allot
variable bad-n

create worst-date date-size allot
variable worst-n

: split ( buf len char -- buf' l2 buf l1 )  \ where buf'[0] = char, l1 = len-l2
  >r 2dup r> scan
  2swap 2 pick - ;

: next-sample ( buf len -- buf' len' fvalue flag )
  #tab split >float   drop    1 /string
  #tab split snumber? drop >r 1 /string r> ;

: ok? 0> ;

: add-sample ( value -- )
  day-sum f@ f+ day-sum f!
  1 day-n +! ;

: add-day
  day-sum f@ total-sum f@ f+ total-sum f!
  day-n @ total-n +! ;

: add-bad-run
  bad-n @ 0= if
    cur-date bad-date date-size move
  then
  1 bad-n +! ;

: check-worst-run
  bad-n @ worst-n @ > if
    bad-n @ worst-n !
    bad-date worst-date date-size move
  then
  0 bad-n ! ;

: hour ( buf len -- buf' len' )
  next-sample ok? if
    add-sample
    check-worst-run
  else
    fdrop
    add-bad-run
  then ;

: .mean ( sum count -- ) 0 d>f f/ f. ;

: day ( line len -- )
  2dup + #tab swap c! 1+			\ append tab for parsing
  #tab split cur-date swap move 1 /string	\ skip date
  0e day-sum f!
  0  day-n !
  24 0 do hour loop 2drop
  cur-date date-size type ."  mean = "
  day-sum f@ day-n @ .mean cr
  add-day ;

stdin value input

: main
 s" input.txt" r/o open-file throw to input
 0e total-sum f!
 0 total-n !
 0 worst-n !
 begin  pad 512 input read-line throw
 while  pad swap day
 repeat
 input close-file throw
 worst-n @ if
   ."  Longest interruption: " worst-n @ .
   ." hours starting " worst-date date-size type cr
 then
 ."  Total mean = "
 total-sum f@ total-n @ .mean cr ;

main bye</lang>

Go

<lang go>package main

import (

   "bufio"
   "fmt"
   "os"
   "strconv"
   "strings"

)

var fn = "readings.txt"

func main() {

   f, err := os.Open(fn)
   if err != nil {
       fmt.Println(err)
       return
   }
   defer f.Close()
   var (
       badRun, maxRun   int
       badDate, maxDate string
       fileSum          float64
       fileAccept       int
   )
   for lr := bufio.NewReader(f); ; {
       line, pref, err := lr.ReadLine()
       if err == os.EOF {
           break
       }
       if err != nil {
           fmt.Println(err)
           return
       }
       if pref {
           fmt.Println("Unexpected long line.")
           return
       }
       f := strings.Fields(string(line))
       if len(f) != 49 {
           fmt.Println("unexpected format,", len(f), "fields.")
           return
       }
       var accept int
       var sum float64
       for i := 1; i < 49; i += 2 {
           flag, err := strconv.Atoi(f[i+1])
           if err != nil {
               fmt.Println(err)
               return
           }
           if flag > 0 { // value is good
               if badRun > 0 { // terminate bad run
                   if badRun > maxRun {
                       maxRun = badRun
                       maxDate = badDate
                   }
                   badRun = 0
               }
               value, err := strconv.Atof64(f[i])
               if err != nil {
                   fmt.Println(err)
                   return
               }
               sum += value
               accept++
           } else { // value is bad
               if badRun == 0 {
                   badDate = f[0]
               }
               badRun++
           }
       }
       fmt.Printf("Line: %s  Reject %2d  Accept: %2d  Line_tot:%9.3f",
           f[0], 24-accept, accept, sum)
       if accept > 0 {
           fmt.Printf("  Line_avg:%8.3f\n", sum/float64(accept))
       } else {
           fmt.Println("")
       }
       fileSum += sum
       fileAccept += accept
   }
   fmt.Println("\nFile     =", fn)
   fmt.Printf("Total    = %.3f\n", fileSum)
   fmt.Println("Readings = ", fileAccept)
   if fileAccept > 0 {
       fmt.Printf("Average  =  %.3f\n", fileSum/float64(fileAccept))
   }
   if badRun > 0 && badRun > maxRun {
       maxRun = badRun
       maxDate = badDate
   }
   if maxRun == 0 {
       fmt.Println("\nAll data valid.")
   } else {
       fmt.Printf("\nMax data gap = %d, beginning on line %s.\n",
           maxRun, maxDate)
   }

}</lang> Output:

...
Line: 2004-12-28  Reject  1  Accept: 23  Line_tot:   77.800  Line_avg:   3.383
Line: 2004-12-29  Reject  1  Accept: 23  Line_tot:   56.300  Line_avg:   2.448
Line: 2004-12-30  Reject  1  Accept: 23  Line_tot:   65.300  Line_avg:   2.839
Line: 2004-12-31  Reject  1  Accept: 23  Line_tot:   47.300  Line_avg:   2.057

File     = readings.txt
Total    = 1358393.400
Readings =  129403
Average  = 10.497

Max data gap = 589, beginning on line 1993-02-09.

Haskell

<lang Haskell>import Data.List
import Numeric
import Control.Arrow
import Control.Monad
import Text.Printf
import System.Environment
import Data.Function

type Date = String
type Value = Double
type Flag = Bool

readFlg :: String -> Flag
readFlg = (> 0).read

readNum :: String -> Value
readNum = fst.head.readFloat

take2 = takeWhile(not.null).unfoldr (Just.splitAt 2)

parseData :: [String] -> (Date,[(Value,Flag)])
parseData = head &&& map(readNum.head &&& readFlg.last).take2.tail

sumAccs :: (Date,[(Value,Flag)]) -> (Date, ((Value,Int),[Flag]))
sumAccs = second (((sum &&& length).concat.uncurry(zipWith(\v f -> [v|f])) &&& snd).unzip)

maxNAseq :: [Flag] -> [(Int,Int)]
maxNAseq = head.groupBy((==) `on` fst).sortBy(flip compare)
           . concat.uncurry(zipWith(\i (r,b)->[(r,i)|not b]))
           . first(init.scanl(+)0). unzip
           . map ((fst &&& id).(length &&& head)). group

main = do

   file:_ <- getArgs
   f <- readFile file
   let dat :: [(Date,((Value,Int),[Flag]))]
       dat      = map (sumAccs. parseData. words).lines $ f
       summ     = ((sum *** sum). unzip *** maxNAseq.concat). unzip $ map snd dat
       totalFmt = "\nSummary\t\t accept: %d\t total: %.3f \taverage: %6.3f\n\n"
       lineFmt  = "%8s\t accept: %2d\t total: %11.3f \taverage: %6.3f\n"
       maxFmt   =  "Maximum of %d consecutive false readings, starting on line /%s/ and ending on line /%s/\n"

-- output statistics

   putStrLn "\nSome lines:\n"
   mapM_ (\(d,((v,n),_)) -> printf lineFmt d n v (v/fromIntegral n)) $ take 4 $ drop 2200 dat 
   (\(t,n) -> printf totalFmt  n t (t/fromIntegral n)) $ fst summ
   mapM_ ((\(l, d1,d2) -> printf maxFmt l d1 d2)
             . (\(a,b)-> (a,(fst.(dat!!).(`div`24))b,(fst.(dat!!).(`div`24))(a+b)))) $ snd summ</lang>

Output: <lang Haskell>*Main> :main ["./RC/readings.txt"]</lang>

Some lines:

1996-01-11       accept: 24      total:     437.000     average: 18.208
1996-01-12       accept: 24      total:     536.000     average: 22.333
1996-01-13       accept: 24      total:    1062.000     average: 44.250
1996-01-14       accept: 24      total:     787.000     average: 32.792

Summary          accept: 129403  total: 1358393.400     average: 10.497

Maximum of 589 consecutive false readings, starting on line /1993-02-09/ and ending on line /1993-03-05/

Icon and Unicon

<lang Icon>record badrun(count,fromdate,todate) # record to track bad runs

procedure main()
   return mungetask1("readings1-input.txt","readings1-output.txt")
end

procedure mungetask1(fin,fout)

fin  := open(fin) | stop("Unable to open input file ",fin)
fout := open(fout,"w") | stop("Unable to open output file ",fout)

F_tot := F_acc := F_rej := 0                      # data set totals
rejmax := badrun(-1)                              # longest reject runs
rejcur := badrun(0)                               # current reject runs

while line := read(fin) do {

  line ? {
     ldate := tab(many(&digits ++ '-'))           # date (poorly checked)
     fields := tot := rej := 0                    # record counters & totals
     while tab(many(' \t')) do {                  # whitespace before every pair
        value := real(tab(many(&digits++'-.')))   | stop("Bad value in ",ldate)
        tab(many(' \t'))
        flag := integer(tab(many(&digits++'-')))  | stop("Bad flag in ",ldate)
        fields +:= 1
        if flag > 0 then {                        # good data, ends a bad run
           if rejcur.count > rejmax.count then rejmax := rejcur
           rejcur := badrun(0)
           tot +:= value
           }
        else {                                    # bad (flagged) data
           if rejcur.count = 0 then rejcur.fromdate := ldate
           rejcur.todate := ldate
           rejcur.count +:= 1 
           rej +:= 1
           }
        }
     }
  F_tot +:= tot
  F_acc +:= acc := fields - rej
  F_rej +:= rej
  write(fout,"Line: ",ldate," Reject: ", rej," Accept: ", acc," Line_tot: ",tot," Line_avg: ", if acc > 0 then tot / acc else 0) 
  }

write(fout,"\nTotal = ",F_tot,"\nReadings = ",F_acc,"\nRejects = ",F_rej,"\nAverage = ",F_tot / F_acc)
if rejmax.count > 0 then
   write(fout,"Maximum run of bad data was ",rejmax.count," readings from ",rejmax.fromdate," to ",rejmax.todate)
else
   write(fout,"No bad runs of data")

end</lang> Sample Output:

...
Line: 2004-12-28 Reject: 1 Accept: 23 Line_tot: 77.80000000000001 Line_avg: 3.382608695652174
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.3 Line_avg: 2.447826086956522
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.3 Line_avg: 2.839130434782609
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.3 Line_avg: 2.056521739130435

Total    = 1358393.399999999
Readings = 129403
Rejects  = 1901
Average  = 10.49738723213526
Maximum run of bad data was 589 readings from 1993-02-09 to 1993-03-05

J

Solution: <lang j> load 'files'

 parseLine=: 10&({. ,&< (_99&".;._1)@:}.)  NB. custom parser
 summarize=: # , +/ , +/ % #               NB. count,sum,mean
 filter=: #~ 0&<                           NB. keep valid measurements
 'Dates dat'=: |: parseLine;._2 CR -.~ fread jpath '~temp/readings.txt'
 Vals=:  (+: i.24){"1 dat
 Flags=: (>: +: i.24){"1 dat
 DailySummary=: Vals summarize@filter"1 Flags
 RunLengths=: ([: #(;.1) 0 , }. *. }:) , 0 >: Flags
  ]MaxRun=: >./ RunLengths

589

  ]StartDates=: Dates {~ (>:@I.@e.&MaxRun (24 <.@%~ +/)@{. ]) RunLengths

1993-03-05</lang> Formatting Output
Define report formatting verbs: <lang j>formatDailySumry=: dyad define

 labels=. , ];.2 'Line: Accept: Line_tot: Line_avg: '
 labels , x ,. 7j0 10j3 10j3 ": y

)

formatFileSumry=: dyad define

 labels=. ];.2 'Total: Readings: Average: '
 sumryvals=. (, %/) 1 0{ +/y
 out=. labels ,. 12j3 12j0 12j3 ":&> sumryvals
 'maxrun dates'=. x
 out=. out,LF,'Maximum run(s) of ',(": maxrun),' consecutive false readings ends at line(s) starting with date(s): ',dates

)</lang>
Show output:
<lang j>   (_4{.Dates) formatDailySumry _4{. DailySummary
Line: Accept: Line_tot: Line_avg:
2004-12-28 23 77.800 3.383
2004-12-29 23 56.300 2.448
2004-12-30 23 65.300 2.839
2004-12-31 23 47.300 2.057

  (MaxRun;StartDates) formatFileSumry DailySummary

Total:    1358393.400
Readings: 129403
Average:  10.497

Maximum run(s) of 589 consecutive false readings ends at line(s) starting with date(s): 1993-03-05</lang>

JavaScript

Works with: JScript

<lang javascript>var filename = 'readings.txt';
var show_lines = 5;
var file_stats = {
   'num_readings': 0,
   'total': 0,
   'reject_run': 0,
   'reject_run_max': 0,
   'reject_run_date': ''
};

var fh = new ActiveXObject("Scripting.FileSystemObject").openTextFile(filename, 1); // 1 = for reading
while ( ! fh.atEndOfStream) {
   var line = fh.ReadLine();
   line_stats(line, (show_lines-- > 0));
}
fh.close();

WScript.echo(

   "\nFile(s)  = " + filename + "\n" +
   "Total    = " + dec3(file_stats.total) + "\n" +
   "Readings = " + file_stats.num_readings + "\n" +
   "Average  = " + dec3(file_stats.total / file_stats.num_readings) + "\n\n" +
   "Maximum run of " + file_stats.reject_run_max + 
   " consecutive false readings ends at " + file_stats.reject_run_date

);

function line_stats(line, print_line) {

   var readings = 0;
   var rejects = 0;
   var total = 0;
   var fields = line.split('\t');
   var date = fields.shift();
   while (fields.length > 0) {
       var value = parseFloat(fields.shift());
       var flag = parseInt(fields.shift(), 10);
       readings++;
       if (flag <= 0) {
           rejects++;
           file_stats.reject_run++;
       }
       else {
           total += value;
           if (file_stats.reject_run > file_stats.reject_run_max) {
               file_stats.reject_run_max = file_stats.reject_run;
               file_stats.reject_run_date = date;
           }
           file_stats.reject_run = 0;
       }
   }
   file_stats.num_readings += readings - rejects;
   file_stats.total += total;
   if (print_line) {
       WScript.echo(
           "Line: " + date + "\t" +
           "Reject: " + rejects + "\t" +
           "Accept: " + (readings - rejects) + "\t" +
           "Line_tot: " + dec3(total) + "\t" +
           "Line_avg: " + ((readings == rejects) ? "0.0" : dec3(total / (readings - rejects)))
       );
   }

}

// round a number to 3 decimal places
function dec3(value) {

   return Math.round(value * 1e3) / 1e3;

}</lang>

outputs:

Line: 1990-01-01        Reject: 2       Accept: 22      Line_tot: 590   Line_avg: 26.818
Line: 1990-01-02        Reject: 0       Accept: 24      Line_tot: 410   Line_avg: 17.083
Line: 1990-01-03        Reject: 0       Accept: 24      Line_tot: 1415  Line_avg: 58.958
Line: 1990-01-04        Reject: 0       Accept: 24      Line_tot: 1800  Line_avg: 75
Line: 1990-01-05        Reject: 0       Accept: 24      Line_tot: 1130  Line_avg: 47.083

File(s)  = readings.txt
Total    = 1358393.4
Readings = 129403
Average  = 10.497

Maximum run of 589 consecutive false readings ends at 1993-03-05

Lua

<lang Lua>filename = "readings.txt"
io.input( filename )

file_sum, file_cnt_data, file_lines = 0, 0, 0
max_rejected, n_rejected = 0, 0
max_rejected_date, rejected_date = "", ""

while true do

   data = io.read("*line")
   if data == nil then break end
   
   date = string.match( data, "%d+%-%d+%-%d+" )
   if date == nil then break end
   val = {}
   for w in string.gmatch( data, "%s%-*%d+[%.%d]*" ) do
       val[#val+1] = tonumber(w)
   end
   
   sum, cnt = 0, 0
   for i = 1, #val, 2 do
   	if val[i+1] > 0 then 
   	    sum = sum + val[i]
   	    cnt = cnt + 1
   	    n_rejected = 0
   	else
   	    if n_rejected == 0 then
   	        rejected_date = date
   	    end
   	    n_rejected = n_rejected + 1
   	    if n_rejected > max_rejected then
   	        max_rejected = n_rejected
   	        max_rejected_date = rejected_date
   	    end
   	end
   end
   
   file_sum = file_sum + sum
   file_cnt_data = file_cnt_data + cnt
   file_lines = file_lines + 1
   
   print( string.format( "%s:\tRejected: %d\tAccepted: %d\tLine_total: %f\tLine_average: %f", date, #val/2-cnt, cnt, sum, sum/cnt ) )

end

print( string.format( "\nFile:\t %s", filename ) )
print( string.format( "Total:\t %f", file_sum ) )
print( string.format( "Readings: %d", file_lines ) )
print( string.format( "Average: %f", file_sum/file_cnt_data ) )
print( string.format( "Maximum %d consecutive false readings starting at %s.", max_rejected, max_rejected_date ) )</lang>

Output:
File:	  readings.txt
Total:	  1358393.400000
Readings: 5471
Average:  10.497387
Maximum 589 consecutive false readings starting at 1993-02-09.

OCaml

<lang ocaml>let input_line ic =

 try Some(input_line ic)
 with End_of_file -> None

let fold_input f ini ic =

 let rec fold ac =
   match input_line ic with
   | Some line -> fold (f ac line)
   | None -> ac
 in
 fold ini

let ic = open_in "readings.txt"

let scan line =

 Scanf.sscanf line "%s\
   \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\
   \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\
   \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\
   \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d"
   (fun date
        v1  f1  v2  f2  v3  f3  v4  f4  v5  f5  v6  f6
        v7  f7  v8  f8  v9  f9  v10 f10 v11 f11 v12 f12
        v13 f13 v14 f14 v15 f15 v16 f16 v17 f17 v18 f18
        v19 f19 v20 f20 v21 f21 v22 f22 v23 f23 v24 f24 ->
     (date),
     [ (v1,  f1 ); (v2,  f2 ); (v3,  f3 ); (v4,  f4 ); (v5,  f5 ); (v6,  f6 );
       (v7,  f7 ); (v8,  f8 ); (v9,  f9 ); (v10, f10); (v11, f11); (v12, f12);
       (v13, f13); (v14, f14); (v15, f15); (v16, f16); (v17, f17); (v18, f18);
       (v19, f19); (v20, f20); (v21, f21); (v22, f22); (v23, f23); (v24, f24); ])

let tot_file, num_file, _, nodata_max, nodata_maxline =

 fold_input
   (fun (tot_file, num_file, nodata, nodata_max, nodata_maxline) line ->
      let date, datas = scan line in
      let _datas = List.filter (fun (_, flag) -> flag > 0) datas in
      let ok = List.length _datas in
      let tot = List.fold_left (fun ac (value, _) -> ac +. value) 0.0 _datas in
      let nodata, nodata_max, nodata_maxline =
        List.fold_left
            (fun (nodata, nodata_max, nodata_maxline) (_, flag) ->
               if flag <= 0
               then (succ nodata, nodata_max, nodata_maxline)
               else
                 if nodata_max = nodata && nodata > 0
                 then (0, nodata_max, date::nodata_maxline)
                 else if nodata_max < nodata && nodata > 0
                 then (0, nodata, [date])
                 else (0, nodata_max, nodata_maxline)
            )
            (nodata, nodata_max, nodata_maxline) datas in
      Printf.printf "Line: %s" date;
      Printf.printf "  Reject: %2d  Accept: %2d" (24 - ok) ok;
      Printf.printf "\tLine_tot: %8.3f" tot;
      Printf.printf "\tLine_avg: %8.3f\n" (tot /. float ok);
      (tot_file +. tot, num_file + ok, nodata, nodata_max, nodata_maxline))
   (0.0, 0, 0, 0, [])
   ic ;;

close_in ic ;;

Printf.printf "Total = %f\n" tot_file;
Printf.printf "Readings = %d\n" num_file;
Printf.printf "Average = %f\n" (tot_file /. float num_file);
Printf.printf "Maximum run(s) of %d consecutive false readings \
               ends at line starting with date(s): %s\n"
              nodata_max (String.concat ", " nodata_maxline);</lang>

Perl

An AWK-like solution

<lang perl>use strict; use warnings;

my $nodata = 0;             # Current run of consecutive flags<0 in lines of file
my $nodata_max = -1;        # Max consecutive flags<0 in lines of file
my $nodata_maxline = "!";   # ... and line number(s) where it occurs

my $infiles = join ", ", @ARGV;

my $tot_file = 0; my $num_file = 0;

while (<>) {

 chomp;
 my $tot_line = 0;             # sum of line data
 my $num_line = 0;             # number of line data items with flag>0
 my $rejects  = 0;

 # extract field info, skipping initial date field
 my ($date, @fields) = split;
 while (@fields and my ($datum, $flag) = splice @fields, 0, 2) {
   if ($flag+1 < 2) {
     $nodata++;
     $rejects++;
     next;
   }
   # check run of data-absent fields
   if($nodata_max == $nodata and $nodata > 0){
     $nodata_maxline = "$nodata_maxline, $date";
   }
   if($nodata_max < $nodata and $nodata > 0){
     $nodata_max = $nodata;
     $nodata_maxline = $date;
   }
   # re-initialise run of nodata counter
   $nodata = 0; 
   # gather values for averaging
   $tot_line += $datum;
   $num_line++;
 }

 # totals for the file so far
 $tot_file += $tot_line;
 $num_file += $num_line;

 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
        $date, $rejects, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n";
printf "File(s)  = %s\n", $infiles;
printf "Total    = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average  = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;</lang>

Sample output:

bash$ perl -f readings.pl readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

An object-oriented solution

<lang perl>use strict; use warnings;

use constant RESULT_TEMPLATE => "%-19s = %12.3f / %-6u = %.3f\n";

my $parser = Parser->new;

# parse lines and print results
printf RESULT_TEMPLATE, $parser->parse(split)
    while <>;

$parser->finish;

# print total and summary
printf "\n".RESULT_TEMPLATE."\n", $parser->result;
printf "the maximum of %u consecutive bad values was reached %u time(s)\n",
    $parser->bad_max, scalar $parser->bad_ranges;

# print bad ranges
print for map { ' '.join(' - ', @$_)."\n" } $parser->bad_ranges;

BEGIN {

   package main::Parser;
   sub new {
       my $obj = {
           SUM => 0,
           COUNT => 0,
           CURRENT_DATE => undef,
           BAD_DATE => undef,
           BAD_RANGES => [],
           BAD_MAX => 0,
           BAD_COUNT => 0
       };
       return bless $obj;
   }
   sub _average {
       my ($sum, $count) = @_;
       return ($sum, $count, $count && $sum / $count);
   }
   sub _push_bad_range_if_necessary {
       my ($parser) = @_;
       my ($count, $max) = @$parser{qw(BAD_COUNT BAD_MAX)};
       return if $count < $max;
       if ($count > $max) {
           $parser->{BAD_RANGES} = [];
           $parser->{BAD_MAX} = $count;
       }
       push @{$parser->{BAD_RANGES}}, [ @$parser{qw(BAD_DATE CURRENT_DATE)} ];
   }
   sub _check {
       my ($parser, $flag) = @_;
       if ($flag <= 0) {
           ++$parser->{BAD_COUNT};
           $parser->{BAD_DATE} = $parser->{CURRENT_DATE}
               unless defined $parser->{BAD_DATE};
           return 0;
       }
       else {
           $parser->_push_bad_range_if_necessary;
           $parser->{BAD_COUNT} = 0;
           $parser->{BAD_DATE} = undef;
           return 1;
       }
   }
   sub bad_max {
       my ($parser) = @_;
       return $parser->{BAD_MAX}
   }
   sub bad_ranges {
       my ($parser) = @_;
       return @{$parser->{BAD_RANGES}}
   }
   sub parse {
       my $parser = shift;
       my $date = shift;
       $parser->{CURRENT_DATE} = $date;
       my $sum = 0;
       my $count = 0;
       while (my ($value, $flag) = splice @_, 0, 2) {
           next unless $parser->_check($flag);
           $sum += $value;
           ++$count;
       }
       $parser->{SUM} += $sum;
       $parser->{COUNT} += $count;
       return ("average($date)", _average($sum, $count));
   }
   sub result {
       my ($parser) = @_;
       return ('total-average', _average(@$parser{qw(SUM COUNT)}));
   }
   sub finish {
       my ($parser) = @_;
       $parser->_push_bad_range_if_necessary
   }

}</lang>

Sample output:

$ perl readings.pl < readings.txt | tail
average(2004-12-27) =       57.100 / 23     = 2.483
average(2004-12-28) =       77.800 / 23     = 3.383
average(2004-12-29) =       56.300 / 23     = 2.448
average(2004-12-30) =       65.300 / 23     = 2.839
average(2004-12-31) =       47.300 / 23     = 2.057

total-average       =  1358393.400 / 129403 = 10.497

the maximum of 589 consecutive bad values was reached 1 time(s)
  1993-02-09 - 1993-03-05

$

Perl 6

<lang perl6>my @gaps; my $previous = 'valid';

for $*IN.lines -> $line {

   my ($date, @readings) = split /\s+/, $line;
   my @valid;
   my $hour = 0;
   for @readings -> $reading, $flag {
       if $flag > 0 {    
           @valid.push($reading);
           if $previous eq 'invalid' {
               @gaps[*-1]{'end'} = "$date $hour:00";
               $previous = 'valid';
           }
       } 
       else
       {
           if $previous eq 'valid' {
               @gaps.push( {start => "$date $hour:00"} );
           }
           @gaps[*-1]{'count'}++;
           $previous = 'invalid';
       }
       $hour++;
   }
   say "$date: { ( +@valid ?? ( ( [+] @valid ) / +@valid ).fmt("%.3f") !! 0 ).fmt("%8s") }",
       " mean from { (+@valid).fmt("%2s") } valid."; 

};

my $longest = @gaps.sort({-$^a<count>})[0];

say "Longest period of invalid readings was {$longest<count>} hours,\n",

   "from {$longest<start>} till {$longest<end>}."</lang>

Output:

1990-01-01:   26.818 mean from 22 valid.
1990-01-02:   17.083 mean from 24 valid.
1990-01-03:   58.958 mean from 24 valid.
1990-01-04:   75.000 mean from 24 valid.
1990-01-05:   47.083 mean from 24 valid.
...
(many lines omitted)
...
2004-12-27:    2.483 mean from 23 valid.
2004-12-28:    3.383 mean from 23 valid.
2004-12-29:    2.448 mean from 23 valid.
2004-12-30:    2.839 mean from 23 valid.
2004-12-31:    2.057 mean from 23 valid.
Longest period of invalid readings was 589 hours,
from 1993-02-09 1:00 till 1993-03-05 14:00.

PicoLisp

Translation of: AWK

Put the following into an executable file "readings": <lang PicoLisp>#!/usr/bin/picolisp /usr/lib/picolisp/lib.l

(let (NoData 0 NoDataMax -1 NoDataMaxline "!" TotFile 0 NumFile 0)

  (let InFiles
     (glue ","
        (mapcar
           '((File)
              (in File
                 (while (split (line) "^I")
                    (let (Len (length @)  Date (car @)  TotLine 0  NumLine 0)
                       (for (L (cdr @)  L  (cddr L))
                          (if (> 1 (format (cadr L)))
                             (inc 'NoData)
                             (when (gt0 NoData)
                                (when (= NoDataMax NoData)
                                   (setq NoDataMaxline (pack NoDataMaxline ", " Date)) )
                                (when (> NoData NoDataMax)
                                   (setq NoDataMax NoData  NoDataMaxline Date) ) )
                             (zero NoData)
                             (inc 'TotLine (format (car L) 3))
                             (inc 'NumLine) ) )
                       (inc 'TotFile TotLine)
                       (inc 'NumFile NumLine)
                       (tab (-7 -12 -7 3 -9 3 -11 11 -11 11)
                          "Line:" Date
                          "Reject:" (- (/ (dec Len) 2) NumLine)
                          "  Accept:" NumLine
                          "  Line_tot:" (format TotLine 3)
                          "  Line_avg:"
                          (and (gt0 NumLine) (format (*/ TotLine @) 3)) ) ) ) )
              File )
           (argv) ) )
     (prinl)
     (prinl "File(s)  = " InFiles)
     (prinl "Total    = " (format TotFile 3))
     (prinl "Readings = " NumFile)
     (prinl "Average  = " (format (*/ TotFile NumFile) 3))
     (prinl)
     (prinl
        "Maximum run(s) of " NoDataMax
        " consecutive false readings ends at line starting with date(s): " NoDataMaxline ) ) )

(bye)</lang> Then it can be called as

$ ./readings readings.txt |tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
$


PL/I

<lang PL/I> text1: procedure options (main); /* 13 May 2010 */

  declare line character (2000) varying;
  declare 1 pairs(24),
              2 value fixed (10,4),
              2 flag  fixed;
  declare date character (12) varying;
  declare no_items fixed decimal (10);
  declare (nv, sum, line_no, ndud_values, max_ndud_values) fixed;
  declare (i, k) fixed binary;
  declare in file input;
  open file (in) title ('/TEXT1.DAT,TYPE(TEXT),RECSIZE(2000)' );
  on endfile (in) go to finish_up;
  line_no = 0;

loop:

  do forever;
     get file (in) edit (line) (L);
     /* put skip list (line); */
     line = translate(line, ' ', '09'x);
     line_no = line_no + 1;
     line = trim(line);
     no_items = tally(line, ' ') - tally(line, '  ') + 1;
     if no_items ^= 49 then
        do; put skip list ('There are not 49 items on this line'); iterate loop; end;
     k = index(line, ' '); /* Find the first blank in the line. */
     date = substr(line, 1, k);
     line = substr(line, k) || ' ';
     on conversion go to loop;
     get string (line) list (pairs);
     sum, nv, ndud_values, max_ndud_values = 0;
     do i = 1 to 24;
        if flag(i) > 0 then
           do; sum = sum + value(i); nv = nv + 1;
           ndud_values = 0; /* reset the counter of dud values */
           end;
        else
           do; /* we have a dud reading. */
               ndud_values = ndud_values + 1;
               if ndud_values > max_ndud_values then
                 max_ndud_values = ndud_values;
           end;
     end;
     if nv = 0 then iterate;
     put skip list ('Line ' || trim(line_no) || ' average=', divide(sum, nv, 10,4) );
     if max_ndud_values > 0 then
        put skip list ('Maximum run of dud readings =', max_ndud_values);
  end;

finish_up:

end text1; </lang>

PureBasic

<lang PureBasic>#TASK="Text processing/1"
Define File$, InLine$, Part$, i, Out$, ErrEnds$, Errcnt, ErrMax
Define lsum.d, tsum.d, rejects, val.d, readings

File$=OpenFileRequester(#TASK,"readings.txt","",0)
If OpenConsole() And ReadFile(0,File$)

 While Not Eof(0)
   InLine$=ReadString(0)
   For i=1 To 1+2*24
     Part$=StringField(InLine$,i,#TAB$)
     If i=1        ; Date
       Out$=Part$: lsum=0: rejects=0
     ElseIf i%2=0  ; Recorded value
       val=ValD(Part$)
     Else          ; Status part
       If Val(Part$)>0
         Errcnt=0 : readings+1
         lsum+val : tsum+val
       Else
         rejects+1: Errcnt+1
         If Errcnt>ErrMax
           ErrMax=Errcnt
           ErrEnds$=Out$
         EndIf
       EndIf
     EndIf
   Next i
   Out$+" Rejects: " + Str(rejects)
   Out$+" Accepts: " + Str(24-rejects)
   Out$+" Line_tot: "+ StrD(lsum,3)
   If rejects<24
     Out$+" Line_avg: "+StrD(lsum/(24-rejects),3)
   Else
     Out$+" Line_avg: N/A"
   EndIf
   PrintN("Line: "+Out$)
 Wend
 PrintN(#CRLF$+"File     = "+GetFilePart(File$))
 PrintN("Total    = "+ StrD(tsum,3))
 PrintN("Readings = "+ Str(readings))
 PrintN("Average  = "+ StrD(tsum/readings,3))
 Print(#CRLF$+"Maximum of "+Str(ErrMax))
 PrintN(" consecutive false readings, ends at "+ErrEnds$)
 CloseFile(0)
 ;
 Print("Press ENTER to exit"): Input()

EndIf</lang> Sample output;

...
Line: 2004-12-27 Rejects: 1 Accepts: 23 Line_tot: 57.100 Line_avg: 2.483
Line: 2004-12-28 Rejects: 1 Accepts: 23 Line_tot: 77.800 Line_avg: 3.383
Line: 2004-12-29 Rejects: 1 Accepts: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Rejects: 1 Accepts: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Rejects: 1 Accepts: 23 Line_tot: 47.300 Line_avg: 2.057

File     = readings.txt
Total    = 1358393.400
Readings = 129403
Average  = 10.497

Maximum of 589 consecutive false readings, ends at 1993-03-05

Python

<lang python># Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0;             # Current run of consecutive flags<0 in lines of file
nodata_max=-1;          # Max consecutive flags<0 in lines of file
nodata_maxline=[];      # ... and line number(s) where it occurs

tot_file = 0            # Sum of file data
num_file = 0            # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0
 # extract field info
 field = line.split()
 date  = field[0]
 data  = [float(f) for f in field[1::2]]
 flags = [int(f)   for f in field[2::2]]
 for datum, flag in zip(data, flags):
   if flag<1:
     nodata += 1
   else:
     # check run of data-absent fields
     if nodata_max==nodata and nodata>0:
       nodata_maxline.append(date)
     if nodata_max<nodata and nodata>0:
       nodata_max=nodata
       nodata_maxline=[date]
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line += datum
     num_line += 1
 # totals for the file so far
 tot_file += tot_line
 num_file += num_line
 print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
       date, 
       len(data) -num_line, 
       num_line, tot_line, 
       tot_line/num_line if (num_line>0) else 0)

print ""
print "File(s)  = %s" % (", ".join(infiles),)
print "Total    = %10.3f" % (tot_file,)
print "Readings = %6i" % (num_file,)
print "Average  = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline))</lang>

Sample output:

bash$ /cygdrive/c/Python26/python readings.py readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
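
The solution above reports only the date of the line on which the longest gap ends, while some other entries on this page report where it begins. As a minimal sketch (not part of the original entry, written for Python 3, and using the hypothetical helper name `longest_gap`), the same single pass can record both the first and the last date of the longest run of readings with flag <= 0. It assumes each line has the date followed by value/flag pairs separated by whitespace, as in the task description.

```python
def longest_gap(lines):
    """Return (length, start_date, end_date) of the longest run of
    consecutive readings whose flag is <= 0.

    Hypothetical helper for illustration; `lines` is any iterable of
    text lines in the <date> <val1> <flag1> ... format.
    """
    best = (0, None, None)          # longest run seen so far
    run, start, last_bad = 0, None, None
    for line in lines:
        fields = line.split()
        if not fields:              # skip blank lines
            continue
        date = fields[0]
        for flag in fields[2::2]:   # every second field after the first value
            if int(flag) > 0:       # good reading: close any open run
                if run > best[0]:
                    best = (run, start, last_bad)
                run = 0
            else:                   # bad reading: extend the current run
                if run == 0:
                    start = date
                run += 1
                last_bad = date
    if run > best[0]:               # the input may end inside a gap
        best = (run, start, last_bad)
    return best


if __name__ == "__main__":
    sample = [
        "2000-01-01\t1.0\t1\t2.0\t0\t3.0\t0",
        "2000-01-02\t4.0\t-1\t5.0\t1\t6.0\t0",
    ]
    print(longest_gap(sample))
```

On the two made-up sample lines, the run of three bad readings starts on 2000-01-01 and its last bad reading falls on 2000-01-02. Unlike the solution above, this sketch keeps only the first run of maximal length rather than listing ties, and its end date is that of the last bad reading rather than of the line holding the next good one.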

R

<lang R>#Read in data from file (the file has no header row)
dfr <- read.delim("readings.txt", header=FALSE)

#Calculate daily means
flags <- as.matrix(dfr[,seq(3,49,2)])>0
vals <- as.matrix(dfr[,seq(2,49,2)])
daily.means <- rowSums(ifelse(flags, vals, 0))/rowSums(flags)

#Calculate time between good measurements
times <- strptime(dfr[1,1], "%Y-%m-%d", tz="GMT") + 3600*seq(1,24*nrow(dfr),1)
hours.between.good.measurements <- diff(times[t(flags)])/3600</lang>

REXX

<lang rexx> /*REXX program to process instrument data from a data file. */

numeric digits 20                      /*allow for bigger numbers.      */
ifid='READINGS.TXT'                    /*the input file.                */
ofid='READINGS.OUT'                    /*the output file.               */
grandSum=0                             /*grand sum of whole file.       */
grandflg=0                             /*grand num of flagged data.     */
grandOKs=0
longFlag=0                             /*longest period of flagged data.*/
contFlag=0                             /*current continuous flagged run.*/
w=16                                   /*width of fields when displayed.*/

 do recs=1 while lines(ifid)\==0      /*read until finished.           */
 rec=linein(ifid)                     /*read the next record (line).   */
 parse var rec datestamp Idata        /*pick off the dateStamp & data. */
 sum=0
 flg=0
 OKs=0
   do j=1 until Idata=''              /*process the instrument data.  */
   parse var Idata data.j flag.j Idata
   if flag.j>0 then do                /*if good data, ...              */
                    OKs=OKs+1
                    sum=sum+data.j
                    if contFlag>longFlag then do
                                              longdate=datestamp
                                              longFlag=contFlag
                                              end
                    contFlag=0
                    end
               else do                /*flagged data ...               */
                    flg=flg+1
                    contFlag=contFlag+1
                    end
   end
 if OKs\==0 then avg=format(sum/OKs,,3)
            else avg='[n/a]'
 grandOKs=grandOKs+OKs
 _=right(comma(avg),w)
 grandSum=grandSum+sum
 grandFlg=grandFlg+flg
 if flg==0 then call sy datestamp ' average='_
           else call sy datestamp ' average='_ '  flagged='right(flg,2)
 end

recs=recs-1                            /*adjust for reading end-of-file.*/
if grandOKs\==0 then Gavg=format(grandsum/grandOKs,,3)
                else Gavg='[n/a]'
call sy
call sy copies('=',60)
call sy ' records read:' right(comma(recs),w)
call sy ' grand sum:' right(comma(grandSum),w+4)
call sy ' grand average:' right(comma(Gavg),w+4)
call sy ' grand OK data:' right(comma(grandOKs),w)
call sy ' grand flagged:' right(comma(grandFlg),w)
if longFlag\==0 then call sy ' longest flagged:' right(comma(longFlag),w) " ending at " longdate
call sy copies('=',60)
call sy
exit

/*─────────────────────────────────────SY subroutine────────────────────*/
sy: procedure; parse arg stuff

   say stuff
   if  1==0  then call lineout ofid,stuff
   return


/*─────────────────────────────────────COMMA subroutine─────────────────*/
comma: procedure; parse arg _,c,p,t;arg ,cu;c=word(c ",",1);

      if cu=='BLANK' then c=' ';o=word(p 3,1);p=abs(o);t=word(t 999999999,1);
      if \datatype(p,'W')|\datatype(t,'W')|p==0|arg()>4 then return _;n=_'.9';
      #=123456789;k=0;if o<0 then do;b=verify(_,' ');if b==0 then return _;
      e=length(_)-verify(reverse(_),' ')+1;end;else do;b=verify(n,#,"M");
      e=verify(n,#'0',,verify(n,#"0.",'M'))-p-1;end;
      do j=e to b by -p while k<t;_=insert(c,_,j);k=k+1;end;return _

</lang>

Output:

   ∙
   ∙
   ∙
1991-10-16  average=           4.167   flagged= 6
1991-10-17  average=          10.867   flagged= 9
1991-10-18  average=           3.083
   ∙
   ∙
   ∙
============================================================
      records read:            5,471
     grand     sum:        1,358,393.400
     grand average:               10.497
     grand OK data:          129,403
     grand flagged:            1,901
   longest flagged:              589  ending at  1993-03-05
============================================================

Ruby

<lang ruby>filename = "readings.txt"
total = { "num_readings" => 0, "num_good_readings" => 0, "sum_readings" => 0.0 }
invalid_count = 0
max_invalid_count = 0
invalid_run_end = ""

File.new(filename).each do |line|

 num_readings = 0
 num_good_readings = 0
 sum_readings = 0.0
 fields = line.split
 fields[1..-1].each_slice(2) do |reading, flag|
   num_readings += 1
   if Integer(flag) > 0
     num_good_readings += 1
     sum_readings += Float(reading)
     invalid_count = 0
   else
     invalid_count += 1
     if invalid_count > max_invalid_count
       max_invalid_count = invalid_count
       invalid_run_end = fields[0]
     end
   end
 end
 printf "Line: %11s  Reject: %2d  Accept: %2d  Line_tot: %10.3f  Line_avg: %10.3f\n",
   fields[0], num_readings - num_good_readings, num_good_readings, sum_readings,
   num_good_readings > 0 ? sum_readings/num_good_readings : 0.0
 total["num_readings"] += num_readings
 total["num_good_readings"] += num_good_readings
 total["sum_readings"] += sum_readings

end

puts ""
puts "File(s) = #{filename}"
printf "Total = %.3f\n", total['sum_readings']
puts "Readings = #{total['num_good_readings']}"
printf "Average = %.3f\n", total['sum_readings']/total['num_good_readings']
puts ""
puts "Maximum run(s) of #{max_invalid_count} consecutive false readings ends at #{invalid_run_end}"</lang>

Scala

Works with: Scala version 2.8

A fully functional solution, minus the fact that it uses iterators:
<lang scala>object DataMunging {

 import scala.io.Source
 
 def spans[A](list: List[A]) = list.tail.foldLeft(List((list.head, 1))) {
   case ((a, n) :: tail, b) if a == b => (a, n + 1) :: tail
   case (l, b) => (b, 1) :: l
 }
 
 type Flag = ((Boolean, Int), String)
 type Flags = List[Flag]
 type LineIterator = Iterator[Option[(Double, Int, Flags)]]
 
 val pattern = """^(\d+-\d+-\d+)""" + """\s+(\d+\.\d+)\s+(-?\d+)""" * 24 + "$" r;
 def linesIterator(file: java.io.File) = Source.fromFile(file).getLines().map(
   pattern findFirstMatchIn _ map (
     _.subgroups match {
       case List(date, rawData @ _*) =>
         val dataset = (rawData map (_ toDouble) iterator) grouped 2 toList;
         val valid = dataset filter (_.last > 0) map (_.head)
         val validSize = valid length;
         val validSum = valid sum;
         val flags = spans(dataset map (_.last > 0)) map ((_, date))
         println("Line: %11s  Reject: %2d  Accept: %2d  Line_tot: %10.3f  Line_avg: %10.3f" format
                 (date, 24 - validSize, validSize, validSum, validSum / validSize))
         (validSum, validSize, flags)
     }
   )
 )
 
 def totalizeLines(fileIterator: LineIterator) =
   fileIterator.foldLeft(0.0, 0, List[Flag]()) {
     case ((totalSum, totalSize, ((flag, size), date) :: tail), Some((validSum, validSize, flags))) =>
       val ((firstFlag, firstSize), _) = flags.last
       if (firstFlag == flag) {
         (totalSum + validSum, totalSize + validSize, flags.init ::: ((flag, size + firstSize), date) :: tail)
       } else {
         (totalSum + validSum, totalSize + validSize, flags ::: ((flag, size), date) :: tail)
       }
     case ((_, _, Nil), Some(partials)) => partials
     case (totals, None) => totals
   }
 
 def main(args: Array[String]) {
   val files = args map (new java.io.File(_)) filter (file => file.isFile && file.canRead)
   val lines =  files.iterator flatMap linesIterator
   val (totalSum, totalSize, flags) = totalizeLines(lines)
   val ((_, invalidCount), startDate) = flags.filter(!_._1._1).max
   val report = """|
                   |File(s)  = %s
                   |Total    = %10.3f
                   |Readings = %6d
                   |Average  = %10.3f
                   |
                   |Maximum run(s) of %d consecutive false readings began at %s""".stripMargin
   println(report format (files mkString " ", totalSum, totalSize, totalSum / totalSize, invalidCount, startDate))
 }

}</lang>

A quick and dirty solution:
<lang scala>object AltDataMunging {

 import scala.io.Source

 def main(args: Array[String]) {
   var totalSum = 0.0
   var totalSize  = 0
   var maxInvalidDate = ""
   var maxInvalidCount = 0
   var invalidDate = ""
   var invalidCount = 0
   val files = args map (new java.io.File(_)) filter (file => file.isFile && file.canRead)
   
   files.iterator flatMap (file => Source fromFile file getLines ()) map (_.trim split "\\s+") foreach {
     case Array(date, rawData @ _*) =>
       val dataset = (rawData map (_ toDouble) iterator) grouped 2 toList;
       val valid = dataset filter (_.last > 0) map (_.head)
       println("Line: %11s  Reject: %2d  Accept: %2d  Line_tot: %10.3f  Line_avg: %10.3f" format
               (date, 24 - valid.size, valid.size, valid.sum, valid.sum / valid.size))
       totalSum += valid.sum
       totalSize += valid.size
       dataset foreach {
         case _ :: flag :: Nil if flag > 0 =>
           if (invalidCount > maxInvalidCount) {
             maxInvalidDate = invalidDate
             maxInvalidCount = invalidCount
           }
           invalidCount = 0
         case _ =>
           if (invalidCount == 0) invalidDate = date
           invalidCount += 1
       }
   }
   
   val report = """|
                   |File(s)  = %s
                   |Total    = %10.3f
                   |Readings = %6d
                   |Average  = %10.3f
                   |
                   |Maximum run(s) of %d consecutive false readings began at %s""".stripMargin
   println(report format (files mkString " ", totalSum, totalSize, totalSum / totalSize, maxInvalidCount, maxInvalidDate))
 }

}</lang>

Last few lines of the sample output (either version):

Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings began at 1993-02-09

Although it is easier to report where a run of consecutive false readings ends, a longest run that is the last thing in the file has not really "ended", so these versions report where the run began.
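The point about run boundaries can be illustrated with a minimal sketch in Ruby. It is not part of either Scala version: `longest_invalid_run` and its input layout (an array of `[date, flags]` pairs) are hypothetical names chosen for this example. It records both the start and the end date, and updates the maximum on every invalid reading, so a run that reaches the end of the input is still counted.

```ruby
# Hypothetical helper (not taken from the solutions on this page): finds the
# longest run of readings with flag <= 0 and reports where it began and ended.
# `rows` is assumed to be an array of [date, flags] pairs, one per input line.
def longest_invalid_run(rows)
  best = { length: 0, start: nil, finish: nil }
  run_length = 0
  run_start = nil
  rows.each do |date, flags|
    flags.each do |flag|
      if flag > 0
        run_length = 0                    # a valid reading closes the current run
      else
        run_start = date if run_length.zero?
        run_length += 1
        # Update on every invalid reading, so a run that reaches the end of
        # the input is counted even though nothing "ends" it.
        if run_length > best[:length]
          best = { length: run_length, start: run_start, finish: date }
        end
      end
    end
  end
  best
end

# Made-up sample data: the longest run spans two lines and reaches end of input.
rows = [
  ["2000-01-01", [1, 0, 0]],
  ["2000-01-02", [0, 1, 0]],
  ["2000-01-03", [0, 0, 0]],
]
result = longest_invalid_run(rows)
puts "Longest run: #{result[:length]} readings, from #{result[:start]} to #{result[:finish]}"
```

With ties, this sketch keeps the first run of maximal length, since it only replaces the stored run on a strictly greater length.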

Tcl

<lang tcl>set max_invalid_run 0
set max_invalid_run_end ""
set tot_file 0
set num_file 0

set linefmt "Line: %11s Reject: %2d Accept: %2d Line_tot: %10.3f Line_avg: %10.3f"

set filename readings.txt
set fh [open $filename]
while {[gets $fh line] != -1} {

   set tot_line [set count [set num_line 0]]
   set fields [regexp -all -inline {\S+} $line]
   set date [lindex $fields 0]
   foreach {val flag} [lrange $fields 1 end] {
       incr count
       if {$flag > 0} {
           incr num_line
           incr num_file
           set tot_line [expr {$tot_line + $val}]
           set invalid_run_count 0
       } else {
           incr invalid_run_count
           if {$invalid_run_count > $max_invalid_run} {
               set max_invalid_run $invalid_run_count
               set max_invalid_run_end $date
           }
       }
   }
   set tot_file [expr {$tot_file + $tot_line}]
   puts [format $linefmt $date [expr {$count - $num_line}] $num_line $tot_line \
                [expr {$num_line > 0 ? $tot_line / $num_line : 0}]]

}
close $fh

puts ""
puts "File(s) = $filename"
puts "Total = [format %.3f $tot_file]"
puts "Readings = $num_file"
puts "Average = [format %.3f [expr {$tot_file / $num_file}]]"
puts ""
puts "Maximum run(s) of $max_invalid_run consecutive false readings ends at $max_invalid_run_end"</lang>

Ursala

The input file is transformed to a list of assignments of character strings to lists of pairs of floats and booleans (type %ebXLm) in the parsed data. The same function is used to compute the daily and the cumulative statistics.
<lang Ursala>#import std
#import nat
#import flo

parsed_data = ^|A(~&,* ^|/%ep@iNC ~&h==`1)*htK27K28pPCS (sep 9%cOi&)*FyS readings_dot_txt

daily_stats =

* ^|A(~&,@rFlS ^/length ^/plus:-0. ||0.! ~&i&& mean); mat` + <.
  ~&n,
  'accept: '--+ @ml printf/'%7.0f'+ float,
  'total: '--+ @mrl printf/'%10.1f',
  'average: '--+ @mrr printf/'%7.3f'>

long_run =

-+

  ~&i&& ^|TNC('maximum of '--@h+ %nP,' consecutive false readings ending on line '--),
  @nmrSPDSL -&~&,leql$^; ^/length ~&zn&-@hrZPF+ rlc both ~&rZ+-

main = ^T(daily_stats^lrNCT/~& @mSL 'summary ':,long_run) parsed_data</lang>

Last few lines of output:

2004-12-29 accept:      23 total:       56.3 average:   2.448
2004-12-30 accept:      23 total:       65.3 average:   2.839
2004-12-31 accept:      23 total:       47.3 average:   2.057
summary    accept:  129403 total:  1358393.4 average:  10.497
maximum of 589 consecutive false readings ending on line 1993-03-05
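For readers unfamiliar with Ursala, the data layout described in this section (each date string mapped to a list of float/boolean pairs, with one statistics function reused for the daily and the cumulative figures) might be sketched in Ruby roughly as follows. This is only an illustration under assumptions, not a translation of the Ursala code: `parse_line` and `stats` are hypothetical names, the sample lines are made up, and the mean is taken to be 0.0 when no reading is accepted.

```ruby
# Assumed line layout: a date followed by whitespace-separated value/flag pairs.
# Returns [date, [[value, accepted?], ...]].
def parse_line(line)
  date, *fields = line.split
  readings = fields.each_slice(2).map { |value, flag| [Float(value), Integer(flag) > 0] }
  [date, readings]
end

# Returns [count, total, mean] over the accepted readings.
# The mean is 0.0 when nothing was accepted (an assumption of this sketch).
def stats(readings)
  accepted = readings.select { |_, ok| ok }.map(&:first)
  total = accepted.sum(0.0)
  mean = accepted.empty? ? 0.0 : total / accepted.size
  [accepted.size, total, mean]
end

# Made-up sample lines with three readings each instead of 24.
sample = {}
["2000-01-01 1.0 1 3.0 1 9.0 -2", "2000-01-02 5.0 1 7.0 0 2.0 1"].each do |line|
  date, readings = parse_line(line)
  sample[date] = readings
end

# The same function serves for each day and for the whole data set.
sample.each do |date, readings|
  count, total, mean = stats(readings)
  printf "%s accept: %d total: %.1f average: %.3f\n", date, count, total, mean
end
count, total, mean = stats(sample.values.flatten(1))
printf "summary accept: %d total: %.1f average: %.3f\n", count, total, mean
```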

Vedit macro language

Translation of: AWK

Vedit does not have floating point data type, so fixed point calculations are used here.
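The fixed-point idea can be sketched in Ruby as below; this is an illustration only, not the Vedit macro. Values are held as integer thousandths, and the mean is rounded to nearest by adding half the count before the integer division, which mirrors the `(#20+#21/2)/#21` expression in the macro. The names `to_fixed`, `fixed_mean` and `fixed_to_s` are hypothetical, and the sketch assumes non-negative values with at most three decimal digits.

```ruby
SCALE = 1000  # three decimal digits

# Converts a non-negative decimal string such as "10.5" to integer thousandths.
# Digits beyond the third decimal place are truncated (assumption of this sketch).
def to_fixed(text)
  whole, frac = text.split(".")
  frac = (frac || "").ljust(3, "0")[0, 3]
  whole.to_i * SCALE + frac.to_i
end

# Integer mean rounded to nearest: add half the count before dividing.
def fixed_mean(sum, count)
  (sum + count / 2) / count
end

# Renders integer thousandths with the decimal point re-inserted.
def fixed_to_s(n)
  format("%d.%03d", n / SCALE, n % SCALE)
end

values = %w[10.000 2.5 0.125].map { |v| to_fixed(v) }
sum = values.sum
puts fixed_to_s(sum)                            # 12.625
puts fixed_to_s(fixed_mean(sum, values.size))   # 4.208
```

Working in integers avoids any dependence on floating point support, at the cost of fixing the precision in advance.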

<lang vedit>#50 = Buf_Num		// Current edit buffer (source data)
File_Open("output.txt")
#51 = Buf_Num		// Edit buffer for output file
Buf_Switch(#50)
#10 = 0			// total sum of file data
#11 = 0			// number of valid data items in file
#12 = 0			// Current run of consecutive flags<=0 in lines of file
#13 = -1		// Max consecutive flags<=0 in lines of file
Reg_Empty(15)		// ... and date tag(s) at line(s) where it occurs

While(!At_EOF) {

   #20 = 0		// sum of line data
   #21 = 0		// number of line data items with flag>0
   #22 = 0		// number of line data items with flag<=0
   Reg_Copy_Block(14, Cur_Pos, Cur_Pos+10)	// date field
   
   // extract field info, skipping initial date field
   Repeat(ALL) {

       Search("|{|T,|N}", ADVANCE+ERRBREAK)	// next Tab or Newline
       if (Match_Item==2) { Break }		// end of line
       #30 = Num_Eval(ADVANCE) * 1000		// #30 = value
       Char					// fixed point, 3 decimal digits
       #30 += Num_Eval(ADVANCE+SUPPRESS)
       #31 = Num_Eval(ADVANCE)			// #31 = flag
       if (#31 < 1) {				// not valid field?
           #12++
           #22++
       } else {					// valid field
           // check run of data-absent fields
           if(#13 == #12 && #12 > 0) {
               Reg_Set(15, ", ", APPEND)
               Reg_Set(15, @14, APPEND)
           }
           if(#13 < #12 && #12 > 0) {
               #13 = #12
               Reg_Set(15, @14)
           }

           // re-initialise run of nodata counter
           #12 = 0
           // gather values for averaging
           #20 += #30
           #21++
       }

   }
   // totals for the file so far
   #10 += #20
   #11 += #21
   
   Buf_Switch(#51)	// buffer for output data
   IT("Line: ") Reg_Ins(14)
   IT("  Reject:") Num_Ins(#22, COUNT, 3)
   IT("  Accept:") Num_Ins(#21, COUNT, 3)
   IT("  Line tot:") Num_Ins(#20, COUNT, 8) Char(-3) IC('.') EOL
   IT("  Line avg:") Num_Ins((#20+#21/2)/#21, COUNT, 7) Char(-3) IC('.') EOL IN
   Buf_Switch(#50)	// buffer for input data

}

Buf_Switch(#51)		// buffer for output data
IN
IT("Total: ") Num_Ins(#10, FORCE+NOCR) Char(-3) IC('.') EOL IN
IT("Readings: ") Num_Ins(#11, FORCE)
IT("Average: ") Num_Ins((#10+#11/2)/#11, FORCE+NOCR) Char(-3) IC('.') EOL IN
IN
IT("Maximum run(s) of ") Num_Ins(#13, LEFT+NOCR)
IT(" consecutive false readings ends at line starting with date(s): ") Reg_Ins(15)
IN</lang>

Sample output:

Line: 2004-12-28  Reject:  1  Accept: 23  Line tot:   77.800  Line avg:   3.383
Line: 2004-12-29  Reject:  1  Accept: 23  Line tot:   56.300  Line avg:   2.448
Line: 2004-12-30  Reject:  1  Accept: 23  Line tot:   65.300  Line avg:   2.839
Line: 2004-12-31  Reject:  1  Accept: 23  Line tot:   47.300  Line avg:   2.057

Total:   1358393.400
Readings:     129403
Average:      10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05