Text processing/1: Difference between revisions
Revision as of 18:36, 12 December 2009
You are encouraged to solve this task according to the task description, using any language you may know.
Often data is produced by one program, in the wrong format for later use by another program or person. In these situations another program can be written to parse and transform the original data into a format useful to the other. The term "Data Munging" is often used in programming circles for this task.
A request on the comp.lang.awk newsgroup led to a typical data munging task:
I have to analyse data files that have the following format: Each row corresponds to 1 day and the field logic is: $1 is the date, followed by 24 value/flag pairs, representing measurements at 01:00, 02:00 ... 24:00 of the respective day. In short: <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24> Some test data is available at: ... (no longer available at original location) I have to sum up the values (per day and only valid data, i.e. with flag>0) in order to calculate the mean. That's not too difficult. However, I also need to know what the "maximum data gap" is, i.e. the longest period with successive invalid measurements (i.e. values with flag<=0)
The data is free to download and use and is of this format:
1991-03-30 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1
1991-03-31 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 20.000 1 20.000 1 20.000 1 35.000 1 50.000 1 60.000 1 40.000 1 30.000 1 30.000 1 30.000 1 25.000 1 20.000 1 20.000 1 20.000 1 20.000 1 20.000 1 35.000 1
1991-03-31 40.000 1 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2
1991-04-01 0.000 -2 13.000 1 16.000 1 21.000 1 24.000 1 22.000 1 20.000 1 18.000 1 29.000 1 44.000 1 50.000 1 43.000 1 38.000 1 27.000 1 27.000 1 24.000 1 23.000 1 18.000 1 12.000 1 13.000 1 14.000 1 15.000 1 13.000 1 10.000 1
1991-04-02 8.000 1 9.000 1 11.000 1 12.000 1 12.000 1 12.000 1 27.000 1 26.000 1 27.000 1 33.000 1 32.000 1 31.000 1 29.000 1 31.000 1 25.000 1 25.000 1 24.000 1 21.000 1 17.000 1 14.000 1 15.000 1 12.000 1 12.000 1 10.000 1
1991-04-03 10.000 1 9.000 1 10.000 1 10.000 1 9.000 1 10.000 1 15.000 1 24.000 1 28.000 1 24.000 1 18.000 1 14.000 1 12.000 1 13.000 1 14.000 1 15.000 1 14.000 1 15.000 1 13.000 1 13.000 1 13.000 1 12.000 1 10.000 1 10.000 1
Only a sample of the data showing its format is given above. The full example file may be downloaded here.
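As a rough illustration of the record format described above, the following Python 3 sketch splits one line into its date and value/flag pairs and computes the mean of the valid values. The function names `parse_line` and `valid_mean` are hypothetical helpers introduced here, not part of any solution on this page, and the sketch assumes whitespace-separated fields.

```python
def parse_line(line):
    """Split one record into its date and a list of (value, flag) pairs."""
    fields = line.split()
    date, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected value/flag pairs after the date")
    values = [float(v) for v in rest[0::2]]
    flags = [int(f) for f in rest[1::2]]
    return date, list(zip(values, flags))


def valid_mean(pairs):
    """Mean of the values whose flag is positive, or None if there are none."""
    good = [value for value, flag in pairs if flag > 0]
    return sum(good) / len(good) if good else None


# A shortened record (3 pairs instead of 24), used only for illustration.
sample = "1991-03-31 40.000 1 0.000 -2 0.000 -2"
date, pairs = parse_line(sample)
print(date, len(pairs), valid_mean(pairs))
```

Returning `None` for a day without valid readings is one possible convention; the solutions below print 0 in that case instead.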
Structure your program to show statistics for each line of the file (similar to the original Python, Perl, and AWK examples below), followed by summary statistics for the file. When showing example output just show a few line statistics and the full end summary.
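The file-level summary can be sketched as a single pass that carries the run of invalid readings across line boundaries. This is a minimal Python 3 sketch under the same assumptions as the task text (a reading is valid when its flag is greater than 0); the name `summarize` is hypothetical. Like the AWK, Perl and Python solutions below, it only closes a run when the next valid reading is seen, so a run that reaches the end of the file is not counted.

```python
def summarize(lines):
    """Return (total, readings, max_run, dates on which a longest run ended)."""
    total = 0.0
    readings = 0
    run = 0            # current run of consecutive invalid readings
    max_run = 0
    max_run_ends = []  # dates of the lines on which a longest run ended
    for line in lines:
        fields = line.split()
        if not fields:
            continue   # skip blank lines
        date = fields[0]
        for value, flag in zip(fields[1::2], fields[2::2]):
            if int(flag) > 0:
                # A valid reading closes any run of invalid ones.
                if run > 0:
                    if run > max_run:
                        max_run, max_run_ends = run, [date]
                    elif run == max_run:
                        max_run_ends.append(date)
                run = 0
                total += float(value)
                readings += 1
            else:
                run += 1
    return total, readings, max_run, max_run_ends


# Two shortened records, used only for illustration.
demo = [
    "2000-01-01 1.0 1 2.0 -1 3.0 0",
    "2000-01-02 4.0 -2 5.0 1 6.0 1",
]
print(summarize(demo))
```

Keeping the run counter outside the per-line loop is what lets a gap span several days, which is the case the task is mainly interested in.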
Ada
<lang ada>with Ada.Text_IO; use Ada.Text_IO; with Strings_Edit; use Strings_Edit; with Strings_Edit.Floats; use Strings_Edit.Floats; with Strings_Edit.Integers; use Strings_Edit.Integers;
procedure Data_Munging is
Syntax_Error : exception; type Gap_Data is record Count : Natural := 0; Line : Natural := 0; Pointer : Integer; Year : Integer; Month : Integer; Day : Integer; end record; File : File_Type; Max : Gap_Data; This : Gap_Data; Current : Gap_Data; Count : Natural := 0; Sum : Float := 0.0;
begin
Open (File, In_File, "readings.txt"); loop declare Line : constant String := Get_Line (File); Pointer : Integer := Line'First; Flag : Integer; Data : Float; begin Current.Line := Current.Line + 1; Get (Line, Pointer, SpaceAndTab); Get (Line, Pointer, Current.Year); Get (Line, Pointer, Current.Month); Get (Line, Pointer, Current.Day); while Pointer <= Line'Last loop Get (Line, Pointer, SpaceAndTab); Current.Pointer := Pointer; Get (Line, Pointer, Data); Get (Line, Pointer, SpaceAndTab); Get (Line, Pointer, Flag); if Flag < 0 then if This.Count = 0 then This := Current; end if; This.Count := This.Count + 1; else if This.Count > 0 and then Max.Count < This.Count then Max := This; end if; This.Count := 0; Count := Count + 1; Sum := Sum + Data; end if; end loop; exception when End_Error => raise Syntax_Error; end; end loop;
exception
when End_Error => Close (File); if This.Count > 0 and then Max.Count < This.Count then Max := This; end if; Put_Line ("Average " & Image (Sum / Float (Count)) & " over " & Image (Count)); if Max.Count > 0 then Put ("Max. " & Image (Max.Count) & " false readings start at "); Put (Image (Max.Line) & ':' & Image (Max.Pointer) & " stamped "); Put_Line (Image (Max.Year) & Image (Max.Month) & Image (Max.Day)); end if; when others => Close (File); Put_Line ("Syntax error at " & Image (Current.Line) & ':' & Image (Max.Pointer));
end Data_Munging;</lang>
The implementation performs minimal checks. The average is calculated over all valid data. For the longest chain of consecutive invalid data, the source line number, the column number, and the time stamp of the first invalid datum are printed. Sample output:
Average 10.47915 over 129628
Max. 589 false readings start at 1136:20 stamped 1993-2-9
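Reporting where the longest run starts, rather than where it ends, needs the position of the first invalid reading to be remembered when a run begins. The Python 3 sketch below illustrates that bookkeeping; it is not a translation of the Ada code. The name `longest_gap` is hypothetical, and it records the hour index within the line instead of a character column.

```python
def longest_gap(lines):
    """Find the longest run of invalid readings and where it started."""
    best = {"length": 0, "line": None, "hour": None, "date": None}
    current = None  # bookkeeping for the run in progress, if any
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        date = fields[0]
        flags = [int(f) for f in fields[2::2]]
        for hour, flag in enumerate(flags, start=1):
            if flag > 0:
                current = None  # a valid reading ends the run
                continue
            if current is None:
                # First invalid reading of a new run: remember its position.
                current = {"length": 0, "line": line_no,
                           "hour": hour, "date": date}
            current["length"] += 1
            if current["length"] > best["length"]:
                best = dict(current)
    return best


# Two shortened records, used only for illustration.
demo_gap = [
    "2000-01-01 1.0 1 2.0 -1 3.0 0",
    "2000-01-02 4.0 -2 5.0 1 6.0 -1",
]
print(longest_gap(demo_gap))
```

Because the best run is updated on every invalid reading, a run that lasts until the end of the input is still taken into account.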
ALGOL 68
<lang algol68>INT no data := 0; # Current run of consecutive flags<0 in lines of file # INT no data max := -1; # Max consecutive flags<0 in lines of file # FLEX[0]STRING no data max line; # ... and line number(s) where it occurs #
REAL tot file := 0; # Sum of file data # INT num file := 0; # Number of file data items with flag>0 #
# CHAR fs = " "; #
INT nf = 24;
INT upb list := nf; FORMAT list repr = $n(upb list-1)(g", ")g$;
PROC exception = ([]STRING args)VOID:(
putf(stand error, ($"Exception"$, $", "g$, args, $l$)); stop
);
PROC raise io error = (STRING message)VOID:exception(("io error", message));
OP +:= = (REF FLEX []STRING rhs, STRING append)REF FLEX[]STRING: (
HEAP [UPB rhs+1]STRING out rhs; out rhs[:UPB rhs] := rhs; out rhs[UPB rhs+1] := append; rhs := out rhs; out rhs
);
INT upb opts = 3; # these are "a68g" "./Data_Munging.a68" & "-" # [argc - upb opts]STRING in files; FOR arg TO UPB in files DO in files[arg] := argv(upb opts + arg) OD;
MODE FIELD = STRUCT(REAL data, INT flag); FORMAT field repr = $2(g)$;
FOR index file TO UPB in files DO
STRING file name = in files[index file], FILE file; IF open(file, file name, stand in channel) NE 0 THEN raise io error("Cannot open """+file name+"""") FI; on logical file end(file, (REF FILE f)BOOL: logical file end done); REAL tot line, INT num line; # make term(file, ", ") for CSV data # STRING date; DO tot line := 0; # sum of line data # num line := 0; # number of line data items with flag>0 # # extract field info # [nf]FIELD data; getf(file, ($10a$, date, field repr, data, $l$));
FOR key TO UPB data DO FIELD field = data[key]; IF flag OF field<1 THEN no data +:= 1 ELSE # check run of data-absent data # IF no data max = no data AND no data>0 THEN no data max line +:= date FI; IF no data max<no data AND no data>0 THEN no data max := no data; no data max line := date FI; # re-initialise run of no data counter # no data := 0; # gather values for averaging # tot line +:= data OF field; num line +:= 1 FI OD;
# totals for the file so far # tot file +:= tot line; num file +:= num line;
printf(($"Line: "g" Reject: "g(-2)" Accept: "g(-2)" Line tot: "g(-14, 3)" Line avg: "g(-14, 3)l$, date, UPB(data) -num line, num line, tot line, IF num line>0 THEN tot line/num line ELSE 0 FI)) OD; logical file end done: close(file)
OD;
FORMAT plural = $b(" ", "s")$,
p = $b("", "s")$;
upb list := UPB in files; printf(($l"File"f(plural)" = "$, upb list = 1, list repr, in files, $l$,
$"Total = "g(-0, 3)l$, tot file, $"Readings = "g(-0)l$, num file, $"Average = "g(-0, 3)l$, tot file / num file));
upb list := UPB no data max line; printf(($l"Maximum run"f(p)" of "g(-0)" consecutive false reading"f(p)" ends at line starting with date"f(p)": "$,
upb list = 1, no data max, no data max = 0, upb list = 1, list repr, no data max line, $l$))</lang>
Command:
$ a68g ./Data_Munging.a68 - data
Output:
Line: 1991-03-30 Reject: 0 Accept: 24 Line tot: 240.000 Line avg: 10.000
Line: 1991-03-31 Reject: 0 Accept: 24 Line tot: 565.000 Line avg: 23.542
Line: 1991-03-31 Reject: 23 Accept: 1 Line tot: 40.000 Line avg: 40.000
Line: 1991-04-01 Reject: 1 Accept: 23 Line tot: 534.000 Line avg: 23.217
Line: 1991-04-02 Reject: 0 Accept: 24 Line tot: 475.000 Line avg: 19.792
Line: 1991-04-03 Reject: 0 Accept: 24 Line tot: 335.000 Line avg: 13.958

File = data
Total = 2189.000
Readings = 120
Average = 18.242

Maximum run of 24 consecutive false readings ends at line starting with date: 1991-04-01
AWK
<lang awk># Author Donald 'Paddy' McCarthy Jan 01 2007
BEGIN{
nodata = 0; # Current run of consecutive flags<0 in lines of file nodata_max=-1; # Max consecutive flags<0 in lines of file nodata_maxline="!"; # ... and line number(s) where it occurs
} FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," FILENAME
  } else {
    infiles = FILENAME
  }
} {
tot_line=0; # sum of line data num_line=0; # number of line data items with flag>0
# extract field info, skipping initial date field for(field=2; field<=NF; field+=2){ datum=$field; flag=$(field+1); if(flag<1){ nodata++ }else{ # check run of data-absent fields if(nodata_max==nodata && (nodata>0)){ nodata_maxline=nodata_maxline ", " $1 } if(nodata_max<nodata && (nodata>0)){ nodata_max=nodata nodata_maxline=$1 } # re-initialise run of nodata counter nodata=0; # gather values for averaging tot_line+=datum num_line++; } }
# totals for the file so far tot_file += tot_line num_file += num_line
printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \ $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0
# debug prints of original data plus some of the computed values #printf "%s %15.3g %4i\n", $0, tot_line, num_line #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}
END{
printf "\n" printf "File(s) = %s\n", infiles printf "Total = %10.3f\n", tot_file printf "Readings = %6i\n", num_file printf "Average = %10.3f\n", tot_file / num_file
printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}</lang> Sample output:
bash$ awk -f readings.awk readings.txt | tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readings.txt
Total = 1358393.400
Readings = 129403
Average = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
C
<lang c>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int badHrs, maxBadHrs;
static double hrsTot = 0.0; static int rdgsTot = 0; char bhEndDate[40];
int mungeLine( char *line, int lno, FILE *fout ) {
char date[40], *tkn; int dHrs, flag, hrs2, hrs; double hrsSum; int hrsCnt = 0; double avg;
    tkn = strtok(line, ".");
    if (tkn) {
        /* date is an array, so it is passed to %s without & */
        int n = sscanf(tkn, "%s %d", date, &hrs2);
        if (n<2) {
            printf("badly formatted line - %d %s\n", lno, tkn);
            return 0;
        }
        hrsSum = 0.0;
        while( (tkn = strtok(NULL, ".")) ) {
            n = sscanf(tkn,"%d %d %d", &dHrs, &flag, &hrs);
            if (n>=2) {
                if (flag > 0) {
                    hrsSum += 1.0*hrs2 + .001*dHrs;
                    hrsCnt += 1;
                    if (maxBadHrs < badHrs) {
                        maxBadHrs = badHrs;
                        strcpy(bhEndDate, date);
                    }
                    badHrs = 0;
                }
                else {
                    badHrs += 1;
                }
                hrs2 = hrs;
            }
            else {
                printf("bad file syntax line %d: %s\n",lno, tkn);
            }
        }
        /* use the guarded average so a day with no valid data prints 0 */
        avg = (hrsCnt > 0)? hrsSum/hrsCnt : 0.0;
        fprintf(fout, "%s Reject: %2d Accept: %2d Average: %7.3f\n",
                date, 24-hrsCnt, hrsCnt, avg);
        hrsTot += hrsSum;
        rdgsTot += hrsCnt;
    }
    return 1;
}
int main(int argc, char*argv[]) {
FILE *infile, *outfile; int lineNo = 0; char line[512]; char *ifilename = "readings.txt"; outfile = fopen("V0.txt", "w");
infile = fopen(ifilename, "rb"); if (!infile) { printf("Can't open %s\n", ifilename); exit(1); } while (NULL != fgets(line, 512, infile)) { lineNo += 1; if (0 == mungeLine(line, lineNo, outfile)) printf("Bad line at %d",lineNo); } fclose(infile);
fprintf(outfile, "File: %s\n", ifilename); fprintf(outfile, "Total: %.3f\n", hrsTot); fprintf(outfile, "Readings: %d\n", rdgsTot); fprintf(outfile, "Average: %.3f\n", hrsTot/rdgsTot); fprintf(outfile, "\nMaximum number of consecutive bad readings is %d\n", maxBadHrs); fprintf(outfile, "Ends on date %s\n", bhEndDate); fclose(outfile); return 0;
}</lang> Sample output
1990-01-01 Reject: 2 Accept: 22 Average: 26.818
1990-01-02 Reject: 0 Accept: 24 Average: 17.083
1990-01-03 Reject: 0 Accept: 24 Average: 58.958
1990-01-04 Reject: 0 Accept: 24 Average: 75.000
1990-01-05 Reject: 0 Accept: 24 Average: 47.083
...
File: readings.txt
Total: 1358393.400
Readings: 129403
Average: 10.497

Maximum number of consecutive bad readings is 589
Ends on date 1993-03-05
Common Lisp
<lang lisp>(defstruct (measurement (:conc-name "MEASUREMENT-") (:constructor make-measurement (counter line date flag value)))
(counter 0 :type (integer 0)) (line 0 :type (integer 0)) (date nil :type symbol) (flag 0 :type integer) (value 0 :type real))
(defun measurement-valid-p (m)
(> (measurement-flag m) 0))
(defun map-data-stream (function stream)
(flet ((scan (&optional (errorp t)) (read stream errorp nil))) (loop :with global-count = 0 :for date = (scan nil) :then (scan nil) :for line-number :upfrom 1 :while date :do (loop
:for count :upfrom 0 :below 24 :do (let* ((value (scan)) (flag (scan))) (funcall function (make-measurement global-count line-number date flag value)) (incf global-count)))
:finally (return global-count))))
(defun map-data-file (function pathname)
(with-open-file (stream pathname
:element-type 'character :direction :input :if-does-not-exist :error)
(map-data-stream function stream)))
(defmacro do-data-stream ((variable stream) &body body)
`(map-data-stream (lambda (,variable) ,@body) ,stream))
(defmacro do-data-file ((variable file) &body body)
`(map-data-file (lambda (,variable) ,@body) ,file))
(let ((current-day nil)
(current-line 0) (beginning-of-misreadings nil) (current-length 0) (worst-beginning nil) (worst-length 0) (sum-of-day 0) (count-of-day 0))
(flet ((write-end-of-day-report ()
(when current-day (format t "Line ~5D Date ~A: Accepted ~2D Total ~8,3F Average ~8,3F~%" current-line current-day count-of-day sum-of-day (if (> count-of-day 0) (/ sum-of-day count-of-day) sum-of-day)))))
(do-data-file (m #P"D:/Scratch/data.txt") (let* ((date (measurement-date m))
(line-number (measurement-line m)) (validp (measurement-valid-p m)) (day-changed-p (/= current-line line-number)) (value (measurement-value m)))
(when day-changed-p (write-end-of-day-report) (setf current-day date) (setf current-line line-number) (setf sum-of-day 0) (setf count-of-day 0))
        (if (not validp)
            (if beginning-of-misreadings
                (incf current-length)
                (progn (setf beginning-of-misreadings m)
                       (setf current-length 1)))
            (progn
              (when beginning-of-misreadings
                (when (> current-length worst-length)
                  (setf worst-beginning beginning-of-misreadings)
                  (setf worst-length current-length))
                ;; a valid reading always ends the current run
                (setf beginning-of-misreadings nil)
                (setf current-length 0))
              (incf sum-of-day value)
              (incf count-of-day)))))
(when (and beginning-of-misreadings (> current-length worst-length)) (setf worst-beginning beginning-of-misreadings) (setf worst-length current-length)) (write-end-of-day-report))
(format t "Worst run started ~A (~D) and has length ~D~%"
(measurement-date worst-beginning) (measurement-counter worst-beginning) worst-length))</lang>
Example output:
Line 1 Date 1991-03-30: Accepted 24 Total 240.000 Average 10.000
Line 2 Date 1991-03-31: Accepted 24 Total 565.000 Average 23.542
Line 3 Date 1991-03-31: Accepted 1 Total 40.000 Average 40.000
Line 4 Date 1991-04-01: Accepted 23 Total 534.000 Average 23.217
Line 5 Date 1991-04-02: Accepted 24 Total 475.000 Average 19.792
Line 6 Date 1991-04-03: Accepted 24 Total 335.000 Average 13.958
Worst run started 1991-03-31 (49) and has length 24
[The output is incorrect - worst run]
D
<lang d>// Author Daniel Keep Mar 23 2009 module data_munging;
import std.conv : toInt, toDouble; import std.stdio : writefln; import std.stream : BufferedFile; import std.string : split, join, format;
void main(string[] args) {
int noData, noDataMax = -1; string[] noDataMaxLine;
double fileTotal = 0.0; int fileValues;
foreach( arg ; args[1..$] ) { foreach( char[] line ; new BufferedFile(arg) ) { double lineTotal = 0.0; int lineValues;
            // Extract field info
            auto parts = split(line);
            auto date = parts[0];
            auto fields = parts[1..$];
            // value/flag pairs: the field count must be even
            assert( (fields.length % 2) == 0,
                    format("Expected even number of fields, not %d.", fields.length) );
for( auto i=0; i<fields.length; i += 2 ) { auto value = toDouble(fields[i]); auto flag = toInt(fields[i+1]);
if( flag < 1 ) { ++ noData; continue; }
// Check run of data-absent fields if( noDataMax == noData && noData > 0 ) noDataMaxLine ~= date; if( noDataMax < noData && noData > 0 ) { noDataMax = noData; noDataMaxLine.length = 1; noDataMaxLine[0] = date; }
// Re-initialise run of noData counter noData = 0;
// Gather values for averaging lineTotal += value; ++ lineValues; }
// Totals for the file so far fileTotal += lineTotal; fileValues += lineValues;
writefln("Line: %11s Reject: %2d Accept: %2d" " Line_tot: %10.3f Line_avg: %10.3f", date, fields.length/2 - lineValues, lineValues, lineTotal, lineValues > 0 ? lineTotal/lineValues : 0.0); } }
writefln(""); writefln("File(s) = ", join(args[1..$], ", ")); writefln("Total = %10.3f", fileTotal); writefln("Readings = %6d", fileValues); writefln("Average = %10.3f", fileTotal/fileValues);
writefln("\nMaximum run(s) of %d consecutive false readings ends" " at line starting with date(s): %s", noDataMax, join(noDataMaxLine, ", "));
}</lang>
Output matches that of the Python version.
Forth
<lang forth>\ data munging
\ 1991-03-30[\t10.000\t[-]1]*24
\ 1. mean of valid (flag > 0) values per day and overall \ 2. length of longest run of invalid values, and when it happened
fvariable day-sum variable day-n
fvariable total-sum variable total-n
10 constant date-size \ yyyy-mm-dd create cur-date date-size allot
create bad-date date-size allot variable bad-n
create worst-date date-size allot variable worst-n
: split ( buf len char -- buf' l2 buf l1 ) \ where buf'[0] = char, l1 = len-l2
>r 2dup r> scan 2swap 2 pick - ;
: next-sample ( buf len -- buf' len' fvalue flag )
#tab split >float drop 1 /string #tab split snumber? drop >r 1 /string r> ;
: ok? 0> ;
: add-sample ( value -- )
day-sum f@ f+ day-sum f! 1 day-n +! ;
: add-day
day-sum f@ total-sum f@ f+ total-sum f! day-n @ total-n +! ;
: add-bad-run
bad-n @ 0= if cur-date bad-date date-size move then 1 bad-n +! ;
: check-worst-run
bad-n @ worst-n @ > if bad-n @ worst-n ! bad-date worst-date date-size move then 0 bad-n ! ;
: hour ( buf len -- buf' len' )
next-sample ok? if add-sample check-worst-run else fdrop add-bad-run then ;
: .mean ( sum count -- ) 0 d>f f/ f. ;
: day ( line len -- )
2dup + #tab swap c! 1+ \ append tab for parsing #tab split cur-date swap move 1 /string \ skip date 0e day-sum f! 0 day-n ! 24 0 do hour loop 2drop cur-date date-size type ." mean = " day-sum f@ day-n @ .mean cr add-day ;
stdin value input
: main
s" input.txt" r/o open-file throw to input 0e total-sum f! 0 total-n ! 0 worst-n ! begin pad 512 input read-line throw while pad swap day repeat input close-file throw worst-n @ if ." Longest interruption: " worst-n @ . ." hours starting " worst-date date-size type cr then ." Total mean = " total-sum f@ total-n @ .mean cr ;
main bye</lang>
Haskell
<lang Haskell>import Data.List import Numeric import Control.Arrow import Control.Monad import Text.Printf import System.Environment import Data.Function
type Date = String type Value = Double type Flag = Bool
readFlg :: String -> Flag readFlg = (> 0).read
readNum :: String -> Value readNum = fst.head.readFloat
take2 = takeWhile(not.null).unfoldr (Just.splitAt 2)
parseData :: [String] -> (Date,[(Value,Flag)]) parseData = head &&& map(readNum.head &&& readFlg.last).take2.tail
sumAccs :: (Date,[(Value,Flag)]) -> (Date, ((Value,Int),[Flag])) sumAccs = second (((sum &&& length).concat.uncurry(zipWith(\v f -> [v|f])) &&& snd).unzip)
maxNAseq :: [Flag] -> [(Int,Int)] maxNAseq = head.groupBy((==) `on` fst).sortBy(flip compare)
. concat.uncurry(zipWith(\i (r,b)->[(r,i)|not b])) . first(init.scanl(+)0). unzip . map ((fst &&& id).(length &&& head)). group
main = do
file:_ <- getArgs f <- readFile file let dat :: [(Date,((Value,Int),[Flag]))] dat = map (sumAccs. parseData. words).lines $ f summ = ((sum *** sum). unzip *** maxNAseq.concat). unzip $ map snd dat totalFmt = "\nSummary\t\t accept: %d\t total: %.3f \taverage: %6.3f\n\n" lineFmt = "%8s\t accept: %2d\t total: %11.3f \taverage: %6.3f\n" maxFmt = "Maximum of %d consecutive false readings, starting on line /%s/ and ending on line /%s/\n"
-- output statistics
putStrLn "\nSome lines:\n" mapM_ (\(d,((v,n),_)) -> printf lineFmt d n v (v/fromIntegral n)) $ take 4 $ drop 2200 dat (\(t,n) -> printf totalFmt n t (t/fromIntegral n)) $ fst summ mapM_ ((\(l, d1,d2) -> printf maxFmt l d1 d2) . (\(a,b)-> (a,(fst.(dat!!).(`div`24))b,(fst.(dat!!).(`div`24))(a+b)))) $ snd summ</lang>
Output: <lang Haskell>*Main> :main ["./RC/readings.txt"]</lang>
Some lines:

1996-01-11 accept: 24 total: 437.000 average: 18.208
1996-01-12 accept: 24 total: 536.000 average: 22.333
1996-01-13 accept: 24 total: 1062.000 average: 44.250
1996-01-14 accept: 24 total: 787.000 average: 32.792

Summary accept: 129403 total: 1358393.400 average: 10.497

Maximum of 589 consecutive false readings, starting on line /1993-02-09/ and ending on line /1993-03-05/
J
Solution: <lang j> load 'files'
parseLine=: 10&({. ,&< (_99&".;._1)@:}.) NB. custom parser summarize=: # , +/ , +/ % # NB. count,sum,mean filter=: #~ 0&< NB. keep valid measurements
'Dates dat'=: |: parseLine;._2 CR -.~ fread jpath '~temp/readings.txt' Vals=: (+: i.24){"1 dat Flags=: (>: +: i.24){"1 dat DailySummary=: Vals summarize@filter"1 Flags RunLengths=: ([: #(;.1) 0 , }. *. }:) , 0 >: Flags ]MaxRun=: >./ RunLengths
589
]StartDates=: Dates {~ (>:@I.@e.&MaxRun (24 <.@%~ +/)@{. ]) RunLengths
1993-03-05</lang>
Formatting Output
Define report formatting verbs:
<lang j>formatDailySumry=: dyad define
labels=. , ];.2 'Line: Accept: Line_tot: Line_avg: ' labels , x ,. 7j0 10j3 10j3 ": y
) formatFileSumry=: dyad define
labels=. ];.2 'Total: Readings: Average: ' sumryvals=. (, %/) 1 0{ +/y out=. labels ,. 12j3 12j0 12j3 ":&> sumryvals 'maxrun dates'=. x out=. out,LF,'Maximum run(s) of ',(": maxrun),' consecutive false readings ends at line(s) starting with date(s): ',dates
)</lang> Show output: <lang j> (_4{.Dates) formatDailySumry _4{. DailySummary Line: Accept: Line_tot: Line_avg: 2004-12-28 23 77.800 3.383 2004-12-29 23 56.300 2.448 2004-12-30 23 65.300 2.839 2004-12-31 23 47.300 2.057
(MaxRun;StartDates) formatFileSumry DailySummary
Total: 1358393.400 Readings: 129403 Average: 10.497
Maximum run(s) of 589 consecutive false readings ends at line(s) starting with date(s): 1993-03-05</lang>
JavaScript
<lang javascript>var filename = 'readings.txt'; var show_lines = 5; var file_stats = {
    'num_readings': 0,
    'total': 0,
    'reject_run': 0,
    'reject_run_max': 0,
    'reject_run_date': ''
};
var fh = new ActiveXObject("Scripting.FileSystemObject").openTextFile(filename, 1); // 1 = for reading while ( ! fh.atEndOfStream) {
var line = fh.ReadLine(); line_stats(line, (show_lines-- > 0));
} fh.close();
WScript.echo(
"\nFile(s) = " + filename + "\n" + "Total = " + dec3(file_stats.total) + "\n" + "Readings = " + file_stats.num_readings + "\n" + "Average = " + dec3(file_stats.total / file_stats.num_readings) + "\n\n" + "Maximum run of " + file_stats.reject_run_max + " consecutive false readings ends at " + file_stats.reject_run_date
);
function line_stats(line, print_line) {
var readings = 0; var rejects = 0; var total = 0; var fields = line.split('\t'); var date = fields.shift();
while (fields.length > 0) { var value = parseFloat(fields.shift()); var flag = parseInt(fields.shift(), 10); readings++; if (flag <= 0) { rejects++; file_stats.reject_run++; } else { total += value; if (file_stats.reject_run > file_stats.reject_run_max) { file_stats.reject_run_max = file_stats.reject_run; file_stats.reject_run_date = date; } file_stats.reject_run = 0; } }
file_stats.num_readings += readings - rejects; file_stats.total += total;
if (print_line) { WScript.echo( "Line: " + date + "\t" + "Reject: " + rejects + "\t" + "Accept: " + (readings - rejects) + "\t" + "Line_tot: " + dec3(total) + "\t" + "Line_avg: " + ((readings == rejects) ? "0.0" : dec3(total / (readings - rejects))) ); }
}
// round a number to 3 decimal places function dec3(value) {
return Math.round(value * 1e3) / 1e3;
}</lang>
outputs:
Line: 1990-01-01 Reject: 2 Accept: 22 Line_tot: 590 Line_avg: 26.818
Line: 1990-01-02 Reject: 0 Accept: 24 Line_tot: 410 Line_avg: 17.083
Line: 1990-01-03 Reject: 0 Accept: 24 Line_tot: 1415 Line_avg: 58.958
Line: 1990-01-04 Reject: 0 Accept: 24 Line_tot: 1800 Line_avg: 75
Line: 1990-01-05 Reject: 0 Accept: 24 Line_tot: 1130 Line_avg: 47.083

File(s) = readings.txt
Total = 1358393.4
Readings = 129403
Average = 10.497

Maximum run of 589 consecutive false readings ends at 1993-03-05
OCaml
<lang ocaml>let input_line ic =
try Some(input_line ic) with End_of_file -> None
let fold_input f ini ic =
let rec fold ac = match input_line ic with | Some line -> fold (f ac line) | None -> ac in fold ini
let ic = open_in "readings.txt"
let scan line =
Scanf.sscanf line "%s\ \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\ \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\ \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\ \t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d\t%f\t%d" (fun date v1 f1 v2 f2 v3 f3 v4 f4 v5 f5 v6 f6 v7 f7 v8 f8 v9 f9 v10 f10 v11 f11 v12 f12 v13 f13 v14 f14 v15 f15 v16 f16 v17 f17 v18 f18 v19 f19 v20 f20 v21 f21 v22 f22 v23 f23 v24 f24 -> (date), [ (v1, f1 ); (v2, f2 ); (v3, f3 ); (v4, f4 ); (v5, f5 ); (v6, f6 ); (v7, f7 ); (v8, f8 ); (v9, f9 ); (v10, f10); (v11, f11); (v12, f12); (v13, f13); (v14, f14); (v15, f15); (v16, f16); (v17, f17); (v18, f18); (v19, f19); (v20, f20); (v21, f21); (v22, f22); (v23, f23); (v24, f24); ])
let tot_file, num_file, _, nodata_max, nodata_maxline =
fold_input (fun (tot_file, num_file, nodata, nodata_max, nodata_maxline) line -> let date, datas = scan line in let _datas = List.filter (fun (_, flag) -> flag > 0) datas in let ok = List.length _datas in let tot = List.fold_left (fun ac (value, _) -> ac +. value) 0.0 _datas in let nodata, nodata_max, nodata_maxline = List.fold_left (fun (nodata, nodata_max, nodata_maxline) (_, flag) -> if flag <= 0 then (succ nodata, nodata_max, nodata_maxline) else if nodata_max = nodata && nodata > 0 then (0, nodata_max, date::nodata_maxline) else if nodata_max < nodata && nodata > 0 then (0, nodata, [date]) else (0, nodata_max, nodata_maxline) ) (nodata, nodata_max, nodata_maxline) datas in Printf.printf "Line: %s" date; Printf.printf " Reject: %2d Accept: %2d" (24 - ok) ok; Printf.printf "\tLine_tot: %8.3f" tot; Printf.printf "\tLine_avg: %8.3f\n" (tot /. float ok); (tot_file +. tot, num_file + ok, nodata, nodata_max, nodata_maxline)) (0.0, 0, 0, 0, []) ic ;;
close_in ic ;;
Printf.printf "Total = %f\n" tot_file; Printf.printf "Readings = %d\n" num_file; Printf.printf "Average = %f\n" (tot_file /. float num_file); Printf.printf "Maximum run(s) of %d consecutive false readings \
ends at line starting with date(s): %s\n" nodata_max (String.concat ", " nodata_maxline);</lang>
Perl
<lang perl># Author Donald 'Paddy' McCarthy Jan 01 2007
BEGIN {
$nodata = 0; # Current run of consecutive flags<0 in lines of file $nodata_max=-1; # Max consecutive flags<0 in lines of file $nodata_maxline="!"; # ... and line number(s) where it occurs
} foreach (@ARGV) {
# Accumulate input file names if($infiles ne ""){ $infiles = "$infiles, $_"; } else { $infiles = $_; }
}
while (<>){
$tot_line=0; # sum of line data $num_line=0; # number of line data items with flag>0
# extract field info, skipping initial date field chomp; @fields = split(/\s+/); $nf = @fields; $date = $fields[0]; for($field=1; $field<$nf; $field+=2){ $datum = $fields[$field] +0.0; $flag = $fields[$field+1] +0; if(($flag+1<2)){ $nodata++; }else{ # check run of data-absent fields if($nodata_max==$nodata and ($nodata>0)){ $nodata_maxline = "$nodata_maxline, $fields[0]"; } if($nodata_max<$nodata and ($nodata>0)){ $nodata_max = $nodata; $nodata_maxline=$fields[0]; } # re-initialise run of nodata counter $nodata = 0; # gather values for averaging $tot_line += $datum; $num_line++; } }
# totals for the file so far $tot_file += $tot_line; $num_file += $num_line;
printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;
}
printf "\n"; printf "File(s) = %s\n", $infiles; printf "Total = %10.3f\n", $tot_file; printf "Readings = %6i\n", $num_file; printf "Average = %10.3f\n", $tot_file / $num_file;
printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
$nodata_max, $nodata_maxline;</lang>
Sample output:
bash$ perl -f readings.pl readings.txt | tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readings.txt
Total = 1358393.400
Readings = 129403
Average = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
Python
<lang python># Author Donald 'Paddy' McCarthy Jan 01 2007
import fileinput import sys
nodata = 0; # Current run of consecutive flags<0 in lines of file nodata_max=-1; # Max consecutive flags<0 in lines of file nodata_maxline=[]; # ... and line number(s) where it occurs
tot_file = 0 # Sum of file data num_file = 0 # Number of file data items with flag>0
infiles = sys.argv[1:]
for line in fileinput.input():
tot_line=0; # sum of line data num_line=0; # number of line data items with flag>0
# extract field info field = line.split() date = field[0] data = [float(f) for f in field[1::2]] flags = [int(f) for f in field[2::2]]
for datum, flag in zip(data, flags): if flag<1: nodata += 1 else: # check run of data-absent fields if nodata_max==nodata and nodata>0: nodata_maxline.append(date) if nodata_max<nodata and nodata>0: nodata_max=nodata nodata_maxline=[date] # re-initialise run of nodata counter nodata=0; # gather values for averaging tot_line += datum num_line += 1
# totals for the file so far tot_file += tot_line num_file += num_line
print "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f" % ( date, len(data) -num_line, num_line, tot_line, tot_line/num_line if (num_line>0) else 0)
print "" print "File(s) = %s" % (", ".join(infiles),) print "Total = %10.3f" % (tot_file,) print "Readings = %6i" % (num_file,) print "Average = %10.3f" % (tot_file / num_file,)
print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
nodata_max, ", ".join(nodata_maxline))</lang>
Sample output:
bash$ /cygdrive/c/Python26/python readings.py readings.txt|tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readings.txt
Total = 1358393.400
Readings = 129403
Average = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$
=={{header|R}}==
<lang R>#Read in data from file
dfr <- read.delim("readings.txt")
#Calculate daily means
flags <- as.matrix(dfr[,seq(3,49,2)])>0
vals <- as.matrix(dfr[,seq(2,49,2)])
daily.means <- rowSums(ifelse(flags, vals, 0))/rowSums(flags)
#Calculate time between good measurements
times <- strptime(dfr[1,1], "%Y-%m-%d", tz="GMT") + 3600*seq(1,24*nrow(dfr),1)
hours.between.good.measurements <- diff(times[t(flags)])/3600</lang>
=={{header|Ruby}}==
<lang ruby>filename = "readings.txt"
total = { "num_readings" => 0, "num_good_readings" => 0, "sum_readings" => 0.0 }
invalid_count = 0
max_invalid_count = 0
invalid_run_end = ""

File.new(filename).each do |line|
  num_readings = 0
  num_good_readings = 0
  sum_readings = 0.0

  fields = line.split
  fields[1..-1].each_slice(2) do |reading, flag|
    num_readings += 1
    if Integer(flag) > 0
      num_good_readings += 1
      sum_readings += Float(reading)
      invalid_count = 0
    else
      invalid_count += 1
      if invalid_count > max_invalid_count
        max_invalid_count = invalid_count
        invalid_run_end = fields[0]
      end
    end
  end

  printf "Line: %11s Reject: %2d Accept: %2d Line_tot: %10.3f Line_avg: %10.3f\n",
    fields[0], num_readings - num_good_readings, num_good_readings, sum_readings,
    num_good_readings > 0 ? sum_readings/num_good_readings : 0.0

  total["num_readings"] += num_readings
  total["num_good_readings"] += num_good_readings
  total["sum_readings"] += sum_readings
end

puts ""
puts "File(s) = #{filename}"
printf "Total = %.3f\n", total['sum_readings']
puts "Readings = #{total['num_good_readings']}"
printf "Average = %.3f\n", total['sum_readings']/total['num_good_readings']
puts ""
puts "Maximum run(s) of #{max_invalid_count} consecutive false readings ends at #{invalid_run_end}"</lang>
=={{header|Tcl}}==
<lang tcl>set max_invalid_run 0
set max_invalid_run_end ""
set tot_file 0
set num_file 0

set linefmt "Line: %11s Reject: %2d Accept: %2d Line_tot: %10.3f Line_avg: %10.3f"

set filename readings.txt
set fh [open $filename]
while {[gets $fh line] != -1} {
    set tot_line [set count [set num_line 0]]
    set fields [regexp -all -inline {\S+} $line]
    set date [lindex $fields 0]
    foreach {val flag} [lrange $fields 1 end] {
        incr count
        if {$flag > 0} {
            incr num_line
            incr num_file
            set tot_line [expr {$tot_line + $val}]
            set invalid_run_count 0
        } else {
            incr invalid_run_count
            if {$invalid_run_count > $max_invalid_run} {
                set max_invalid_run $invalid_run_count
                set max_invalid_run_end $date
            }
        }
    }
    set tot_file [expr {$tot_file + $tot_line}]
    puts [format $linefmt $date [expr {$count - $num_line}] $num_line $tot_line \
            [expr {$num_line > 0 ? $tot_line / $num_line : 0}]]
}
close $fh

puts ""
puts "File(s) = $filename"
puts "Total = [format %.3f $tot_file]"
puts "Readings = $num_file"
puts "Average = [format %.3f [expr {$tot_file / $num_file}]]"
puts ""
puts "Maximum run(s) of $max_invalid_run consecutive false readings ends at $max_invalid_run_end"</lang>
=={{header|Vedit macro language}}==
Vedit does not have a floating point data type, so fixed point calculations are used here.
<lang vedit>#50 = Buf_Num           // Current edit buffer (source data)
File_Open("output.txt")
#51 = Buf_Num           // Edit buffer for output file
Buf_Switch(#50)
#10 = 0                 // total sum of file data
#11 = 0                 // number of valid data items in file
#12 = 0                 // Current run of consecutive flags<0 in lines of file
#13 = -1                // Max consecutive flags<0 in lines of file
Reg_Empty(15)           // ... and date tag(s) at line(s) where it occurs

While(!At_EOF) {
    #20 = 0             // sum of line data
    #21 = 0             // number of line data items with flag>0
    #22 = 0             // number of line data items with flag<0
    Reg_Copy_Block(14, Cur_Pos, Cur_Pos+10)     // date field

    // extract field info, skipping initial date field
    Repeat(ALL) {
        Search("|{|T,|N}", ADVANCE+ERRBREAK)    // next Tab or Newline
        if (Match_Item==2) { Break }            // end of line
        #30 = Num_Eval(ADVANCE) * 1000          // #30 = value
        Char                                    // fixed point, 3 decimal digits
        #30 += Num_Eval(ADVANCE+SUPPRESS)
        #31 = Num_Eval(ADVANCE)                 // #31 = flag
        if (#31 < 1) {                          // not valid field?
            #12++
            #22++
        } else {                                // valid field
            // check run of data-absent fields
            if(#13 == #12 && #12 > 0) {
                Reg_Set(15, ", ", APPEND)
                Reg_Set(15, @14, APPEND)
            }
            if(#13 < #12 && #12 > 0) {
                #13 = #12
                Reg_Set(15, @14)
            }

            // re-initialise run of nodata counter
            #12 = 0
            // gather values for averaging
            #20 += #30
            #21++
        }
    }

    // totals for the file so far
    #10 += #20
    #11 += #21

    Buf_Switch(#51)     // buffer for output data
    IT("Line: ") Reg_Ins(14)
    IT(" Reject:") Num_Ins(#22, COUNT, 3)
    IT(" Accept:") Num_Ins(#21, COUNT, 3)
    IT(" Line tot:") Num_Ins(#20, COUNT, 8) Char(-3) IC('.') EOL
    IT(" Line avg:") Num_Ins((#20+#21/2)/#21, COUNT, 7) Char(-3) IC('.') EOL IN
    Buf_Switch(#50)     // buffer for input data
}

Buf_Switch(#51)         // buffer for output data
IN
IT("Total: ") Num_Ins(#10, FORCE+NOCR) Char(-3) IC('.') EOL IN
IT("Readings: ") Num_Ins(#11, FORCE)
IT("Average: ") Num_Ins((#10+#11/2)/#11, FORCE+NOCR) Char(-3) IC('.') EOL IN
IN
IT("Maximum run(s) of ") Num_Ins(#13, LEFT+NOCR)
IT(" consecutive false readings ends at line starting with date(s): ") Reg_Ins(15)
IN</lang>
Sample output:
<pre>Line: 2004-12-28 Reject: 1 Accept: 23 Line tot: 77.800 Line avg: 3.383
Line: 2004-12-29 Reject: 1 Accept: 23 Line tot: 56.300 Line avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line tot: 65.300 Line avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line tot: 47.300 Line avg: 2.057

Total: 1358393.400
Readings: 129403
Average: 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05</pre>
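
The fixed point arithmetic used by the Vedit solution can be illustrated with a small Python 3 sketch. It is not a translation of the macro: the helper names <code>to_fixed</code>, <code>rounded_mean</code> and <code>format_fixed</code> are hypothetical, and it assumes readings have at most three decimal digits and that totals are non-negative. Values are held as integer thousandths, and the mean is rounded to nearest by adding half the divisor before dividing, mirroring the expression <code>(#20+#21/2)/#21</code>.

```python
SCALE = 1000  # three decimal digits, so values are stored as thousandths


def to_fixed(text):
    """Convert a decimal string such as '47.300' to integer thousandths."""
    negative = text.startswith("-")
    whole, _, frac = text.lstrip("+-").partition(".")
    # Pad or truncate the fraction to exactly three digits (assumption:
    # the input never carries more precision than that).
    magnitude = int(whole or "0") * SCALE + int((frac + "000")[:3])
    return -magnitude if negative else magnitude


def rounded_mean(total, count):
    """Integer mean rounded to nearest; assumes total >= 0 and count > 0."""
    return (total + count // 2) // count


def format_fixed(value):
    """Render integer thousandths with the decimal point re-inserted."""
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), SCALE)
    return "%s%d.%03d" % (sign, whole, frac)


# Last line of the sample output: total 47.300 over 23 accepted readings.
line_total = to_fixed("47.300")
print(format_fixed(rounded_mean(line_total, 23)))  # 2.057
```

Keeping everything in integers avoids any dependence on floating point support; the decimal point is only re-inserted when a value is printed.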