Text processing/2
You are encouraged to solve this task according to the task description, using any language you may know.
The following data shows a few lines from the file readings.txt (as used in the Data Munging task).
The data comes from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file; each record is a line of 49 whitespace-separated fields, where whitespace can be one or more space or tab characters.
The fields (from the left) are:
DATESTAMP [ VALUEn FLAGn ] * 24
i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with that instrument, in which case that instrument's value should be ignored.
A sample from the full data file readings.txt is:
1991-03-30 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1
1991-03-31 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 20.000 1 20.000 1 20.000 1 35.000 1 50.000 1 60.000 1 40.000 1 30.000 1 30.000 1 30.000 1 25.000 1 20.000 1 20.000 1 20.000 1 20.000 1 20.000 1 35.000 1
1991-03-31 40.000 1 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2 0.000 -2
1991-04-01 0.000 -2 13.000 1 16.000 1 21.000 1 24.000 1 22.000 1 20.000 1 18.000 1 29.000 1 44.000 1 50.000 1 43.000 1 38.000 1 27.000 1 27.000 1 24.000 1 23.000 1 18.000 1 12.000 1 13.000 1 14.000 1 15.000 1 13.000 1 10.000 1
1991-04-02 8.000 1 9.000 1 11.000 1 12.000 1 12.000 1 12.000 1 27.000 1 26.000 1 27.000 1 33.000 1 32.000 1 31.000 1 29.000 1 31.000 1 25.000 1 25.000 1 24.000 1 21.000 1 17.000 1 14.000 1 15.000 1 12.000 1 12.000 1 10.000 1
1991-04-03 10.000 1 9.000 1 10.000 1 10.000 1 9.000 1 10.000 1 15.000 1 24.000 1 28.000 1 24.000 1 18.000 1 14.000 1 12.000 1 13.000 1 14.000 1 15.000 1 14.000 1 15.000 1 13.000 1 13.000 1 13.000 1 12.000 1 10.000 1 10.000 1
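As a rough illustration of the record layout described above, the following Python sketch splits one line into its datestamp and twenty-four (value, flag) pairs. The names parse_record and good_values are hypothetical helpers introduced only for this example; they are not part of any solution on this page.

```python
def parse_record(line):
    """Split one record into (datestamp, [(value, flag), ...]).

    Raises ValueError if the line does not have exactly 49 fields or if a
    value or flag cannot be converted.
    """
    fields = line.split()  # splits on runs of whitespace (spaces, tabs)
    if len(fields) != 49:
        raise ValueError(f"expected 49 fields, got {len(fields)}")
    datestamp = fields[0]
    values = [float(v) for v in fields[1::2]]  # VALUE1 .. VALUE24
    flags = [int(f) for f in fields[2::2]]     # FLAG1 .. FLAG24
    return datestamp, list(zip(values, flags))


def good_values(readings):
    """Keep only the values whose flag is >= 1 (instrument working)."""
    return [value for value, flag in readings if flag >= 1]


# Assumed test input: one good reading followed by 23 flagged-bad readings.
sample = "1991-03-31 40.000 1" + " 0.000 -2" * 23
stamp, readings = parse_record(sample)
print(stamp, len(readings), good_values(readings))
```

Only the first reading of this sample record survives the flag check, which is the behaviour the description asks for.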
The task:
- Confirm the general field format of the file.
- Identify any DATESTAMPs that are duplicated.
- Report the number of records that have good readings for all instruments.
Ada
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO; use Ada.Text_IO;
with Strings_Edit; use Strings_Edit;
with Strings_Edit.Floats; use Strings_Edit.Floats;
with Strings_Edit.Integers; use Strings_Edit.Integers;
with Generic_Map;
procedure Data_Munging_2 is
package Time_To_Line is new Generic_Map (Time, Natural);
use Time_To_Line;
File : File_Type;
Line_No : Natural := 0;
Count : Natural := 0;
Stamps : Map;
begin
Open (File, In_File, "readings.txt");
loop
declare
Line : constant String := Get_Line (File);
Pointer : Integer := Line'First;
Flag : Integer;
Year, Month, Day : Integer;
Data : Float;
Stamp : Time;
Valid : Boolean := True;
begin
Line_No := Line_No + 1;
Get (Line, Pointer, SpaceAndTab);
Get (Line, Pointer, Year);
Get (Line, Pointer, Month);
Get (Line, Pointer, Day);
Stamp := Time_Of (Year_Number (Year), Month_Number (-Month), Day_Number (-Day));
begin
Add (Stamps, Stamp, Line_No);
exception
when Constraint_Error =>
Put (Image (Year) & Image (Month) & Image (Day) & ": record at " & Image (Line_No));
Put_Line (" duplicates record at " & Image (Get (Stamps, Stamp)));
end;
Get (Line, Pointer, SpaceAndTab);
for Reading in 1..24 loop
Get (Line, Pointer, Data);
Get (Line, Pointer, SpaceAndTab);
Get (Line, Pointer, Flag);
Get (Line, Pointer, SpaceAndTab);
Valid := Valid and then Flag >= 1;
end loop;
if Pointer <= Line'Last then
Put_Line ("Unrecognized tail at " & Image (Line_No) & ':' & Image (Pointer));
elsif Valid then
Count := Count + 1;
end if;
exception
when End_Error | Data_Error | Constraint_Error | Time_Error =>
Put_Line ("Syntax error at " & Image (Line_No) & ':' & Image (Pointer));
end;
end loop;
exception
when End_Error =>
Close (File);
Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");
end Data_Munging_2;
Sample output
1990-3-25: record at 85 duplicates record at 84
1991-3-31: record at 456 duplicates record at 455
1992-3-29: record at 820 duplicates record at 819
1993-3-28: record at 1184 duplicates record at 1183
1995-3-26: record at 1911 duplicates record at 1910
Valid records 5017 of 5471 total
AutoHotkey
; Author: AlephX Aug 17 2011
data = %A_scriptdir%\readings.txt
Loop, Read, %data%
{
Lines := A_Index
StringReplace, dummy, A_LoopReadLine, %A_Tab%,, All UseErrorLevel
Loop, parse, A_LoopReadLine, %A_Tab%
{
wrong := 0
if A_index = 1
{
Date := A_LoopField
if (Date == OldDate)
{
WrongDates = %WrongDates%%OldDate% at %Lines%`n
TotwrongDates++
Wrong := 1
break
}
}
else
{
if (A_loopfield/1 < 0)
{
Wrong := 1
break
}
}
}
if (wrong == 1)
totwrong++
else
valid++
if (errorlevel <> 48)
{
if (wrong == 0)
{
totwrong++
valid--
}
unvalidformat++
}
olddate := date
}
msgbox, Duplicate Dates:`n%wrongDates%`nRead Lines: %lines%`nValid Lines: %valid%`nwrong lines: %totwrong%`nDuplicates: %TotWrongDates%`nWrong Formatted: %unvalidformat%`n
Sample Output:
Duplicate Dates:
1990-03-25 at 85
1991-03-31 at 456
1992-03-29 at 820
1993-03-28 at 1184
1995-03-26 at 1911
Read Lines: 5471
Valid Lines: 5129
wrong lines: 342
Duplicates: 5
Wrong Formatted: 0
AWK
A series of AWK one-liners is shown, as this is how such checks are often done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created that combines multi-line versions of the scripts below.
Gradually tie down the format.
(In each case offending lines will be printed)
If there are any scientific notation fields then there will be an e in the file:
bash$ awk '/[eE]/' readings.txt
bash$
Quick check on the number of fields:
bash$ awk 'NF != 49' readings.txt
bash$
Full check on the file format using a regular expression:
bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt
bash$
Full check on the file format as above but using regular expressions allowing intervals (gnu awk):
bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}+$/ )' readings.txt
bash$
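For comparison, the same whole-line format check can be sketched in Python. This is only an illustration under the assumption that a record is a datestamp followed by exactly 24 value/flag pairs; RECORD_RE and is_well_formed are hypothetical names chosen for this example.

```python
import re

# DATESTAMP followed by exactly 24 (value, flag) pairs, with fields
# separated by one or more spaces or tabs.
RECORD_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}"
    r"(?:[ \t]+-?\d+\.\d+[ \t]+-?\d+){24}"
)


def is_well_formed(line):
    """True if the whole line (minus its line ending) matches the format."""
    return RECORD_RE.fullmatch(line.rstrip("\r\n")) is not None


# Assumed test inputs: a complete record and one that is a pair short.
good = "1991-03-30" + "\t10.000\t1" * 24
short = "1991-03-30" + "\t10.000\t1" * 23
print(is_well_formed(good), is_well_formed(short))  # True False
```

Using fullmatch plays the role of the ^ and $ anchors in the AWK expressions, so trailing junk on a line is rejected as well.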
Identify any DATESTAMPs that are duplicated.
Accomplished by counting how many times the first field occurs and noting any second occurrences.
bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$
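The counting idea behind that one-liner, reporting a datestamp at the moment it is seen for the second time, might look like this in Python. The function name duplicated_datestamps and the shortened sample lines are assumptions made for the example.

```python
from collections import Counter


def duplicated_datestamps(lines):
    """Return each datestamp once, in the order its second occurrence appears."""
    counts = Counter()
    duplicates = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1
        if counts[fields[0]] == 2:  # mirrors ++count[$1]==2
            duplicates.append(fields[0])
    return duplicates


# Shortened records (one reading each) used purely as sample input.
lines = [
    "1991-03-30 10.000 1",
    "1991-03-31 10.000 1",
    "1991-03-31 40.000 1",
    "1991-04-01 0.000 -2",
]
print(duplicated_datestamps(lines))  # ['1991-03-31']
```

Because the test is for a count of exactly 2, a datestamp that occurs three or more times is still reported only once.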
What number of records have good readings for all instruments.
bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' readings.txt
Total records 5471 OK records 5017 or 91.7017 %
bash$
C
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
typedef struct { char *s; int ln, bad; } rec_t;
int cmp_rec(const void *aa, const void *bb)
{
const rec_t *a = aa, *b = bb;
return a->s == b->s ? 0 : !a->s ? 1 : !b->s ? -1 : strncmp(a->s, b->s, 10);
}
int read_file(char *fn)
{
int fd = open(fn, O_RDONLY);
if (fd == -1) return 0;
struct stat s;
fstat(fd, &s);
char *txt = malloc(s.st_size);
read(fd, txt, s.st_size);
close(fd);
int i, j, lines = 0, k, di, bad;
for (i = lines = 0; i < s.st_size; i++)
if (txt[i] == '\n') {
txt[i] = '\0';
lines++;
}
rec_t *rec = calloc(sizeof(rec_t), lines);
char *ptr, *end;
rec[0].s = txt;
rec[0].ln = 1;
for (i = 0; i < lines; i++) {
if (i + 1 < lines) {
rec[i + 1].s = rec[i].s + strlen(rec[i].s) + 1;
rec[i + 1].ln = i + 2;
}
if (sscanf(rec[i].s, "%4d-%2d-%2d", &di, &di, &di) != 3) {
printf("bad line %d: %s\n", i, rec[i].s);
rec[i].s = 0;
continue;
}
ptr = rec[i].s + 10;
for (j = k = 0; j < 25; j++) {
if (!strtod(ptr, &end) && end == ptr) break;
k++, ptr = end;
if (!(di = strtol(ptr, &end, 10)) && end == ptr) break;
k++, ptr = end;
if (di < 1) rec[i].bad = 1;
}
if (k != 48) {
printf("bad format at line %d: %s\n", i, rec[i].s);
rec[i].s = 0;
}
}
qsort(rec, lines, sizeof(rec_t), cmp_rec);
for (i = 1, bad = rec[0].bad, j = 0; i < lines && rec[i].s; i++) {
if (rec[i].bad) bad++;
if (strncmp(rec[i].s, rec[j].s, 10)) {
j = i;
} else
printf("dup line %d: %.10s\n", rec[i].ln, rec[i].s);
}
free(rec);
free(txt);
printf("\n%d out %d lines good\n", lines - bad, lines);
return 0;
}
int main()
{
read_file("readings.txt");
return 0;
}
Output:
dup line 85: 1990-03-25
dup line 456: 1991-03-31
dup line 820: 1992-03-29
dup line 1184: 1993-03-28
dup line 1911: 1995-03-26
5017 out 5471 lines good
C++
Library: Boost
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <cstdlib>
#include <algorithm>
#include <iterator> // for ostream_iterator
using namespace std ;
boost::regex e ( "\\s+" ) ;
int main( int argc , char *argv[ ] ) {
ifstream infile( argv[ 1 ] ) ;
vector<string> duplicates ;
set<string> datestamps ; //for the datestamps
if ( ! infile.is_open( ) ) {
cerr << "Can't open file " << argv[ 1 ] << '\n' ;
return 1 ;
}
int all_ok = 0 ;//all_ok for lines in the given pattern e
int pattern_ok = 0 ; //overall field pattern of record is ok
while ( infile ) {
string eingabe ;
getline( infile , eingabe ) ;
boost::sregex_token_iterator i ( eingabe.begin( ), eingabe.end( ) , e , -1 ), j ;//we tokenize on empty fields
vector<string> fields( i, j ) ;
if ( fields.size( ) == 49 ) //we expect 49 fields in a record
pattern_ok++ ;
else
cout << "Format not ok!\n" ;
if ( datestamps.insert( fields[ 0 ] ).second ) { //not duplicated
int howoften = ( fields.size( ) - 1 ) / 2 ;//number of measurement
//devices and values
for ( int n = 1 ; atoi( fields[ 2 * n ].c_str( ) ) >= 1 ; n++ ) {
if ( n == howoften ) {
all_ok++ ;
break ;
}
}
}
else {
duplicates.push_back( fields[ 0 ] ) ;//first field holds datestamp
}
}
infile.close( ) ;
cout << "The following " << duplicates.size() << " datestamps were duplicated:\n" ;
copy( duplicates.begin( ) , duplicates.end( ) ,
ostream_iterator<string>( cout , "\n" ) ) ;
cout << all_ok << " records were complete and ok!\n" ;
return 0 ;
}
The program produces the following output:
Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31
C#
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.IO;
namespace TextProc2
{
class Program
{
static void Main(string[] args)
{
Regex multiWhite = new Regex(@"\s+");
Regex dateEx = new Regex(@"^\d{4}-\d{2}-\d{2}$");
Regex valEx = new Regex(@"^\d+\.{1}\d{3}$");
Regex flagEx = new Regex(@"^[1-9]{1}$");
int missformcount = 0, totalcount = 0;
Dictionary<int, string> dates = new Dictionary<int, string>();
using (StreamReader sr = new StreamReader("readings.txt"))
{
string line = sr.ReadLine();
while (line != null)
{
line = multiWhite.Replace(line, @" ");
string[] splitLine = line.Split(' ');
if (splitLine.Length != 49)
missformcount++;
if (!dateEx.IsMatch(splitLine[0]))
missformcount++;
else
dates.Add(totalcount + 1, dateEx.Match(splitLine[0]).ToString());
int err = 0;
for (int i = 1; i < splitLine.Length; i++)
{
if (i%2 != 0)
{
if (!valEx.IsMatch(splitLine[i]))
err++;
}
else
{
if (!flagEx.IsMatch(splitLine[i]))
err++;
}
}
if (err != 0) missformcount++;
line = sr.ReadLine();
totalcount++;
}
}
int goodEntries = totalcount - missformcount;
Dictionary<string,List<int>> dateReverse = new Dictionary<string,List<int>>();
foreach (KeyValuePair<int, string> kvp in dates)
{
if (!dateReverse.ContainsKey(kvp.Value))
dateReverse[kvp.Value] = new List<int>();
dateReverse[kvp.Value].Add(kvp.Key);
}
Console.WriteLine(goodEntries + " valid Records out of " + totalcount);
foreach (KeyValuePair<string, List<int>> kvp in dateReverse)
{
if (kvp.Value.Count > 1)
Console.WriteLine("{0} is duplicated at Lines : {1}", kvp.Key, string.Join(",", kvp.Value));
}
}
}
}
Output:
5017 valid Records out of 5471
1990-03-25 is duplicated at Lines : 84,85
1991-03-31 is duplicated at Lines : 455,456
1992-03-29 is duplicated at Lines : 819,820
1993-03-28 is duplicated at Lines : 1183,1184
1995-03-26 is duplicated at Lines : 1910,1911
D
import std.stdio, std.array, std.string, std.regex, std.conv,
std.algorithm;
void main() {
// works but eats lot of RAM in DMD 2.059
//const rxDate = ctRegex!(`^\d\d\d\d-\d\d-\d\d$`);
auto rxDate = regex(`^\d\d\d\d-\d\d-\d\d$`);
int[string] repeatedDates;
int goodReadings;
foreach (string line; lines(File("readings.txt"))) {
try {
auto parts = line.split();
if (parts.length != 49)
throw new Exception("Wrong column count");
if (match(parts[0], rxDate).empty)
throw new Exception("Date is wrong");
repeatedDates[parts[0]]++;
bool noProblem = true;
for (int i = 1; i < 48; i += 2) {
if (to!int(parts[i + 1]) < 1)
// don't break loop because it's validation too.
noProblem = false;
if (!isNumeric(parts[i]))
throw new Exception("Reading is wrong: "~parts[i]);
}
if (noProblem)
goodReadings++;
} catch(Exception ex) {
writefln(`Problem in line "%s": %s`, line, ex);
}
}
writeln("Duplicated timestamps: ",
repeatedDates.keys.filter!(k => repeatedDates[k] > 1)().
join(", "));
writeln("Good reading records: ", goodReadings);
}
- Output:
Duplicated timestamps: 1990-03-25, 1991-03-31, 1992-03-29, 1993-03-28, 1995-03-26
Good reading records: 5017
F#
let file = @"readings.txt"
let dates = HashSet(HashIdentity.Structural)
let mutable ok = 0
do
for line in System.IO.File.ReadAllLines file do
match String.split [' '; '\t'] line with
| [] -> ()
| date::xys ->
if dates.Contains date then
printf "Date %s is duplicated\n" date
else
dates.Add date
let f (b, t) h = not b, if b then int h::t else t
let _, states = Seq.fold f (false, []) xys
if Seq.forall (fun s -> s >= 1) states then
ok <- ok + 1
printf "%d records were ok\n" ok
Prints:
Date 1990-03-25 is duplicated
Date 1991-03-31 is duplicated
Date 1992-03-29 is duplicated
Date 1993-03-28 is duplicated
Date 1995-03-26 is duplicated
5017 records were ok
Go
package main
import (
"bufio"
"fmt"
"io"
"os"
"strconv"
"strings"
)
var fn = "readings.txt"
func main() {
f, err := os.Open(fn)
if err != nil {
fmt.Println(err)
return
}
defer f.Close()
var allGood, uniqueGood int
// map records not only dates seen, but also if an all-good record was
// seen for the key date.
m := make(map[string]bool)
for lr := bufio.NewReader(f); ; {
line, pref, err := lr.ReadLine()
if err == io.EOF {
break
}
if err != nil {
fmt.Println(err)
return
}
if pref {
fmt.Println("Unexpected long line.")
return
}
f := strings.Fields(string(line))
if len(f) != 49 {
fmt.Println("unexpected format,", len(f), "fields.")
return
}
good := true
for i := 1; i < 49; i += 2 {
flag, err := strconv.Atoi(f[i+1])
if err != nil {
fmt.Println(err)
return
}
if flag > 0 { // value is good
_, err := strconv.ParseFloat(f[i], 64)
if err != nil {
fmt.Println(err)
return
}
} else { // value is bad
good = false
}
}
if good {
allGood++
}
previouslyGood, seen := m[f[0]]
if seen {
fmt.Println("Duplicate datestamp:", f[0])
if !previouslyGood && good {
m[string([]byte(f[0]))] = true
uniqueGood++
}
} else {
m[string([]byte(f[0]))] = good
if good {
uniqueGood++
}
}
}
fmt.Println("\nData format valid.")
fmt.Println(allGood, "records with good readings for all instruments.")
fmt.Println(uniqueGood,
"unique dates with good readings for all instruments.")
}
Output:
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26

Data format valid.
5017 records with good readings for all instruments.
5013 unique dates with good readings for all instruments.
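The Go entry reports two different counts: records whose flags are all good, and distinct dates for which at least one all-good record was seen. A minimal Python sketch of that bookkeeping follows; count_good and the two-flag sample records are assumptions for illustration, not a translation of the Go program.

```python
def count_good(records):
    """records: iterable of (datestamp, flags) pairs.

    Returns (all_good, unique_good): the number of records whose flags are
    all >= 1, and the number of distinct datestamps that have at least one
    such record.
    """
    seen_good = {}  # datestamp -> True once an all-good record was seen
    all_good = 0
    for stamp, flags in records:
        good = all(flag >= 1 for flag in flags)
        if good:
            all_good += 1
        seen_good[stamp] = seen_good.get(stamp, False) or good
    unique_good = sum(seen_good.values())
    return all_good, unique_good


# Shortened sample records with two flags each instead of twenty-four.
records = [
    ("1991-03-30", [1, 1]),
    ("1991-03-31", [1, 1]),
    ("1991-03-31", [1, -2]),
    ("1991-04-01", [-2, 1]),
    ("1991-04-02", [1, 1]),
    ("1991-04-02", [1, 1]),
]
print(count_good(records))  # (4, 3)
```

The two counts differ whenever a duplicated date has more than one all-good record, as 1991-04-02 does in this sample.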
Haskell
import Data.List (nub, (\\))
data Record = Record {date :: String, recs :: [(Double, Int)]}
duplicatedDates rs = rs \\ nub rs
goodRecords = filter ((== 24) . length . filter ((>= 1) . snd) . recs)
parseLine l = let ws = words l in Record (head ws) (mapRecords (tail ws))
mapRecords [] = []
mapRecords [_] = error "invalid data"
mapRecords (value:flag:tail) = (read value, read flag) : mapRecords tail
main = do
inputs <- (map parseLine . lines) `fmap` readFile "readings.txt"
putStr (unlines ("duplicated dates:": duplicatedDates (map date inputs)))
putStrLn ("number of good records: " ++ show (length $ goodRecords inputs))
This script outputs:
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017
J
require 'tables/dsv dates'
dat=: TAB readdsv jpath '~temp/readings.txt'
Dates=: getdate"1 >{."1 dat
Vals=: _99 ". >(1 + +: i.24){"1 dat
Flags=: _99 ". >(2 + +: i.24){"1 dat
# Dates NB. Total # lines
5471
+/ *./"1 ] 0 = Dates NB. # lines with invalid date formats
0
+/ _99 e."1 Vals,.Flags NB. # lines with invalid value or flag formats
0
+/ *./"1 [0 < Flags NB. # lines with only valid flags
5017
~. (#~ (i.~ ~: i:~)) Dates NB. Duplicate dates
1990 3 25
1991 3 31
1992 3 29
1993 3 28
1995 3 26
Java
import java.util.*;
import java.util.regex.*;
import java.io.*;
public class DataMunging2 {
public static final Pattern e = Pattern.compile("\\s+");
public static void main(String[] args) {
try {
BufferedReader infile = new BufferedReader(new FileReader(args[0]));
List<String> duplicates = new ArrayList<String>();
Set<String> datestamps = new HashSet<String>(); //for the datestamps
String eingabe;
int all_ok = 0;//all_ok for lines in the given pattern e
while ((eingabe = infile.readLine()) != null) {
String[] fields = e.split(eingabe); //we tokenize on empty fields
if (fields.length != 49) //we expect 49 fields in a record
System.out.println("Format not ok!");
if (datestamps.add(fields[0])) { //not duplicated
int howoften = (fields.length - 1) / 2 ; //number of measurement
//devices and values
for (int n = 1; Integer.parseInt(fields[2*n]) >= 1; n++) {
if (n == howoften) {
all_ok++ ;
break ;
}
}
} else {
duplicates.add(fields[0]); //first field holds datestamp
}
}
infile.close();
System.out.println("The following " + duplicates.size() + " datestamps were duplicated:");
for (String x : duplicates)
System.out.println(x);
System.out.println(all_ok + " records were complete and ok!");
} catch (IOException e) {
System.err.println("Can't open file " + args[0]);
System.exit(1);
}
}
}
The program produces the following output:
The following 5 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
5013 records were complete and ok!
JavaScript
// wrap up the counter variables in a closure.
function analyze_func(filename) {
var dates_seen = {};
var format_bad = 0;
var records_all = 0;
var records_good = 0;
return function() {
var fh = new ActiveXObject("Scripting.FileSystemObject").openTextFile(filename, 1); // 1 = for reading
while ( ! fh.atEndOfStream) {
records_all ++;
var allOK = true;
var line = fh.ReadLine();
var fields = line.split('\t');
if (fields.length != 49) {
format_bad ++;
continue;
}
var date = fields.shift();
if (has_property(dates_seen, date))
WScript.echo("duplicate date: " + date);
else
dates_seen[date] = 1;
while (fields.length > 0) {
var value = parseFloat(fields.shift());
var flag = parseInt(fields.shift(), 10);
if (isNaN(value) || isNaN(flag)) {
format_bad ++;
}
else if (flag <= 0) {
allOK = false;
}
}
if (allOK)
records_good ++;
}
fh.close();
WScript.echo("total records: " + records_all);
WScript.echo("Wrong format: " + format_bad);
WScript.echo("records with no bad readings: " + records_good);
}
}
function has_property(obj, propname) {
return typeof(obj[propname]) == "undefined" ? false : true;
}
var analyze = analyze_func('readings.txt');
analyze();
Lua
filename = "readings.txt"
io.input( filename )
dates = {}
duplicated, bad_format = {}, {}
num_good_records, lines_total = 0, 0
while true do
line = io.read( "*line" )
if line == nil then break end
lines_total = lines_total + 1
date = string.match( line, "%d+%-%d+%-%d+" )
if dates[date] ~= nil then
duplicated[#duplicated+1] = date
end
dates[date] = 1
count_pairs, bad_values = 0, false
for v, w in string.gmatch( line, "%s(%d+[%.%d+]*)%s(%-?%d)" ) do
count_pairs = count_pairs + 1
if tonumber(w) <= 0 then
bad_values = true
end
end
if count_pairs ~= 24 then
bad_format[#bad_format+1] = date
end
if not bad_values then
num_good_records = num_good_records + 1
end
end
print( "Lines read:", lines_total )
print( "Valid records: ", num_good_records )
print( "Duplicate dates:" )
for i = 1, #duplicated do
print( " ", duplicated[i] )
end
print( "Bad format:" )
for i = 1, #bad_format do
print( " ", bad_format[i] )
end
Output:
Lines read: 5471
Valid records: 5017
Duplicate dates:
    1990-03-25
    1991-03-31
    1992-03-29
    1993-03-28
    1995-03-26
Bad format:
Mathematica
data = Import["Readings.txt","TSV"]; Print["duplicated dates: "];
Select[Tally@data[[;;,1]], #[[2]]>1&][[;;,1]]//Column
Print["number of good records: ", Count[(Times@@#[[3;;All;;2]])& /@ data, 1],
" (out of a total of ", Length[data], ")"]
Output:
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017 (out of a total of 5471)
MATLAB / Octave
function [val,count] = readdat(filename)
% READDAT reads readings.txt file
%
% The value of boolean parameters can be tested with
% exist(parameter,'var')
if nargin<1,
filename = 'readings.txt';
end;
fid = fopen(filename);
if fid<0, error('cannot open file %s\n',filename); end;
[val,count] = fscanf(fid,'%04d-%02d-%02d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d \n');
fclose(fid);
count = count/51;
if (count<1) || count~=floor(count),
error('file has incorrect format\n')
end;
val = reshape(val,51,count)'; % make matrix with 51 rows and count columns, then transpose it.
d = datenum(val(:,1:3)); % compute timestamps
printf('The following records are followed by a duplicate:');
dix = find(diff(d)==0) % check for two consecutive timestamps with zero difference
printf('number of valid records: %i\n ', sum( all( val(:,5:2:end) >= 1, 2) ) );
>> [val,count]=readdat;
The following records are followed by a duplicate:dix =
84
455
819
1183
1910
number of valid records: 5017
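The diff-based test in this entry only detects duplicates that sit on adjacent lines, which happens to be the case in readings.txt. A small Python sketch of the same idea is given below; followed_by_duplicate is a hypothetical name, and the datestamps are assumed to be in ISO YYYY-MM-DD form.

```python
from datetime import date


def followed_by_duplicate(datestamps):
    """Return 1-based record numbers whose next record has the same date."""
    days = [date.fromisoformat(s).toordinal() for s in datestamps]
    return [
        i
        for i, (a, b) in enumerate(zip(days, days[1:]), start=1)
        if b - a == 0  # zero difference between consecutive day numbers
    ]


stamps = ["1991-03-30", "1991-03-31", "1991-03-31", "1991-04-01"]
print(followed_by_duplicate(stamps))  # [2]
```

A repeated date that is separated from its first occurrence by other records would be missed by this approach; a dictionary or set of seen dates avoids that limitation.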
OCaml
#load "str.cma"
open Str
let strip_cr str =
let last = pred(String.length str) in
if str.[last] <> '\r' then (str) else (String.sub str 0 last)
let map_records =
let rec aux acc = function
| value::flag::tail ->
let e = (float_of_string value, int_of_string flag) in
aux (e::acc) tail
| [_] -> invalid_arg "invalid data"
| [] -> (List.rev acc)
in
aux [] ;;
let duplicated_dates =
let same_date (d1,_) (d2,_) = (d1 = d2) in
let date (d,_) = d in
let rec aux acc = function
| a::b::tl when same_date a b ->
aux (date a::acc) tl
| _::tl ->
aux acc tl
| [] ->
(List.rev acc)
in
aux [] ;;
let record_ok (_,record) =
let is_ok (_,v) = (v >= 1) in
let sum_ok =
List.fold_left (fun sum this ->
if is_ok this then succ sum else sum) 0 record
in
(sum_ok = 24)
let num_good_records =
List.fold_left (fun sum record ->
if record_ok record then succ sum else sum) 0 ;;
let parse_line line =
let li = split (regexp "[ \t]+") line in
let records = map_records (List.tl li)
and date = (List.hd li) in
(date, records)
let () =
let ic = open_in "readings.txt" in
let rec read_loop acc =
try
let line = strip_cr(input_line ic) in
read_loop ((parse_line line) :: acc)
with End_of_file ->
close_in ic;
(List.rev acc)
in
let inputs = read_loop [] in
Printf.printf "%d total lines\n" (List.length inputs);
Printf.printf "duplicated dates:\n";
let dups = duplicated_dates inputs in
List.iter print_endline dups;
Printf.printf "number of good records: %d\n" (num_good_records inputs);
;;
This script outputs:
5471 total lines
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017
Perl
use List::MoreUtils 'natatime';
use constant FIELDS => 49;
binmode STDIN, ':crlf';
# Read the newlines properly even if we're not running on
# Windows.
my ($line, $good_records, %dates) = (0, 0);
while (<>)
{++$line;
my @fs = split /\s+/;
@fs == FIELDS or die "$line: Bad number of fields.\n";
for (shift @fs)
{/\d{4}-\d{2}-\d{2}/ or die "$line: Bad date format.\n";
++$dates{$_};}
my $iterator = natatime 2, @fs;
my $all_flags_okay = 1;
while ( my ($val, $flag) = $iterator->() )
{$val =~ /\d+\.\d+/ or die "$line: Bad value format.\n";
$flag =~ /\A-?\d+/ or die "$line: Bad flag format.\n";
$flag < 1 and $all_flags_okay = 0;}
$all_flags_okay and ++$good_records;}
print "Good records: $good_records\n",
"Repeated timestamps:\n",
map {" $_\n"}
grep {$dates{$_} > 1}
sort keys %dates;
Output:
Good records: 5017
Repeated timestamps:
 1990-03-25
 1991-03-31
 1992-03-29
 1993-03-28
 1995-03-26
Perl 6
my $fields = 49;
my ($good-records, %dates) = 0;
for 1 .. * Z $*IN.lines -> $line, $s {
my @fs = split /\s+/, $s;
@fs == $fields or die "$line: Bad number of fields";
given shift @fs {
m/\d**4 \- \d**2 \- \d**2/ or die "$line: Bad date format";
++%dates{$_};
}
my $all-flags-okay = True;
for @fs -> $val, $flag {
$val ~~ /\d+ \. \d+/ or die "$line: Bad value format";
$flag ~~ /^ \-? \d+/ or die "$line: Bad flag format";
$flag < 1 and $all-flags-okay = False;
}
$all-flags-okay and ++$good-records;
}
say 'Good records: ', $good-records;
say 'Repeated timestamps:';
say ' ', $_ for grep { %dates{$_} > 1 }, sort keys %dates;
Output:
Good records: 5017
Repeated timestamps:
 1990-03-25
 1991-03-31
 1992-03-29
 1993-03-28
 1995-03-26
The first version demonstrates that you can program Perl 6 almost like Perl 5. Here's a more idiomatic Perl 6 version that runs several times faster:
my $good-records;
my $line;
my %dates;
for lines() {
$line++;
/ ^
(\d ** 4 '-' \d\d '-' \d\d)
[ \h+ \d+'.'\d+ \h+ ('-'?\d+) ] ** 24
$ /
or note "Bad format at line $line" and next;
%dates.push: $0 => $line;
$good-records++ if $1.all >= 1;
}
say "$good-records good records out of $line total";
say 'Repeated timestamps (with line numbers):';
.say for sort %dates.pairs.grep: *.value.elems > 1;
Output:
5017 good records out of 5471 total
Repeated timestamps (with line numbers):
1990-03-25 84 85
1991-03-31 455 456
1992-03-29 819 820
1993-03-28 1183 1184
1995-03-26 1910 1911
Note how this version does validation with a single Perl 6 regex that is much more readable than the typical regex, and arguably expresses the data structure more straightforwardly. Here we use normal quotes for literals, and \h for horizontal whitespace.
Variables like $good-records that are going to be autoincremented do not need to be initialized. (Perl 6 allows hyphens in variable names, as you can see.)
The .push method on a hash is magical and loses no information; if a duplicate key is found in the pushed pair, an array of values is automatically created of the old value and the new value pushed. Hence we can easily track all the lines that a particular duplicate occurred at.
The .all method does "junctional" logic: it autothreads through comparators as any English speaker would expect. Junctions can also short-circuit as soon as they find a value that doesn't match, and the evaluation order is up to the computer, so it can be optimized or parallelized.
The final line simply greps out the pairs from the hash whose value is an array with more than 1 element. (Those values that are not arrays nevertheless have a .elems method that always reports 1.) The .pairs is merely there for clarity; grepping a hash directly has the same effect. Note that we sort the pairs after we've grepped them, not before; this works fine in Perl 6, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case.
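For readers more familiar with Python, the two ideas described above (collecting every line number per datestamp, and testing all flags at once) can be approximated as follows. This is a loose analogue rather than a translation: defaultdict(list) stands in for the hash .push behaviour and all() for the junction, and the function name scan is an assumption of this sketch.

```python
from collections import defaultdict


def scan(lines):
    """Return (good_count, repeated) for non-empty, well-formed lines.

    repeated maps each datestamp seen more than once to the list of
    1-based line numbers on which it occurred.
    """
    lines_by_date = defaultdict(list)
    good = 0
    for number, line in enumerate(lines, start=1):
        stamp, *rest = line.split()
        lines_by_date[stamp].append(number)  # keeps every occurrence
        # Flags are every second field after the datestamp.
        if all(int(flag) >= 1 for flag in rest[1::2]):
            good += 1
    repeated = {
        stamp: numbers
        for stamp, numbers in sorted(lines_by_date.items())
        if len(numbers) > 1
    }
    return good, repeated


# Shortened sample records with two readings each.
lines = [
    "1991-03-30 10.000 1 10.000 1",
    "1991-03-31 10.000 1 20.000 1",
    "1991-03-31 40.000 1 0.000 -2",
]
print(scan(lines))  # (2, {'1991-03-31': [2, 3]})
```

Like a short-circuiting junction, all() stops at the first flag below 1, although Python always evaluates the flags in order.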
PHP
$handle = fopen("readings.txt", "rb");
$missformcount = 0;
$totalcount = 0;
$dates = array();
while (!feof($handle)) {
$buffer = fgets($handle);
$line = preg_replace('/\s+/',' ',$buffer);
$line = explode(' ',trim($line));
$datepattern = '/^\d{4}-\d{2}-\d{2}$/';
$valpattern = '/^\d+\.{1}\d{3}$/';
$flagpattern = '/^[1-9]{1}$/';
if(count($line) != 49) $missformcount++;
if(!preg_match($datepattern,$line[0],$check)) $missformcount++;
else $dates[$totalcount+1] = $check[0];
$errcount = 0;
for($i=1;$i<count($line);$i++){
if($i%2!=0){
if(!preg_match($valpattern,$line[$i],$check)) $errcount++;
}else{
if(!preg_match($flagpattern,$line[$i],$check)) $errcount++;
}
}
if($errcount != 0) $missformcount++;
$totalcount++;
}
fclose ($handle);
$good = $totalcount - $missformcount;
$duplicates = array_diff_key( $dates , array_unique( $dates ));
echo 'Valid records ' . $good . ' of ' . $totalcount . ' total<br>';
echo 'Duplicates : <br>';
foreach ($duplicates as $key => $val){
echo $val . ' at Line : ' . $key . '<br>';
}
Output:
Valid records 5017 of 5471 total
Duplicates :
1990-03-25 at Line : 85
1991-03-31 at Line : 456
1992-03-29 at Line : 820
1993-03-28 at Line : 1184
1995-03-26 at Line : 1911
PL/I
/* To process readings produced by automatic reading stations. */
check: procedure options (main);
declare 1 date, 2 (yy, mm, dd) character (2),
(j1, j2) character (1);
declare old_date character (6);
declare line character (330) varying;
declare R(24) fixed decimal, Machine(24) fixed binary;
declare (i, k, n, faulty static initial (0)) fixed binary;
declare input file;
open file (input) title ('/READINGS.TXT,TYPE(CRLF),RECSIZE(300)');
on endfile (input) go to done;
old_date = '';
k = 0;
do forever;
k = k + 1;
get file (input) edit (line) (L);
get string(line) edit (yy, j1, mm, j2, dd) (a(4), a(1), a(2), a(1), a(2));
line = substr(line, 11);
do i = 1 to length(line);
if substr(line, i, 1) = '09'x then substr(line, i, 1) = ' ';
end;
line = trim(line);
n = tally(line, ' ') - tally (line, '  ') + 1; /* runs of blanks + 1 = number of fields */
if n ^= 48 then
do;
put skip list ('There are ' || n || ' readings in line ' || k);
end;
n = n/2;
line = line || ' ';
get string(line) list ((R(i), Machine(i) do i = 1 to n));
if any(Machine < 1) ^= '0'B then
faulty = faulty + 1;
if old_date ^= ' ' then if old_date = string(date) then
put skip list ('Dates are the same at line' || k);
old_date = string(date);
end;
done:
put skip list ('There were ' || k || ' sets of readings');
put skip list ('There were ' || faulty || ' faulty readings' );
put skip list ('There were ' || k-faulty || ' good readings' );
end check;
PicoLisp
Put the following into an executable file "checkReadings":
#!/usr/bin/picolisp /usr/lib/picolisp/lib.l
(load "@lib/misc.l")
(in (opt)
(until (eof)
(let Lst (split (line) "^I")
(unless
(and
(= 49 (length Lst)) # Check total length
($dat (car Lst) "-") # Check for valid date
(not
(find # Check data format
'((L F)
(not
(if F # Alternating:
(format L 3) # Number
(>= 9 (format L) -9) ) ) ) # or flag
(cdr Lst)
'(T NIL .) ) ) )
(prinl "Bad line format: " (glue " " Lst))
(bye 1) ) ) ) )
(bye)
Then it can be called as
$ ./checkReadings readings.txt
PowerShell
$dateHash = @{}
$goodLineCount = 0
get-content c:\temp\readings.txt |
ForEach-Object {
$line = $_.split(" |`t",2)
if ($dateHash.containskey($line[0])) {
$line[0] + " is duplicated"
} else {
$dateHash.add($line[0], $line[1])
}
$readings = $line[1].split()
$goodLine = $true
if ($readings.count -ne 48) { $goodLine = $false; "incorrect line length : $($line[0])" }
for ($i=0; $i -lt $readings.count; $i++) {
if ($i % 2 -ne 0) {
if ([int]$readings[$i] -lt 1) {
$goodLine = $false
}
}
}
if ($goodLine) { $goodLineCount++ }
}
[string]$goodLineCount + " good lines"
Output:
1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017 good lines
An alternative using regular expression syntax:
$dateHash = @{}
$goodLineCount = 0
ForEach ($rawLine in ( get-content c:\temp\readings.txt) ){
$line = $rawLine.split(" |`t",2)
if ($dateHash.containskey($line[0])) {
$line[0] + " is duplicated"
} else {
$dateHash.add($line[0], $line[1])
}
$readings = [regex]::matches($line[1],"\d+\.\d+\s-?\d")
if ($readings.count -ne 24) { "incorrect number of readings for date " + $line[0] }
$goodLine = $true
foreach ($flagMatch in [regex]::matches($line[1],"\d\.\d*\s(?<flag>-?\d)")) {
if ([int][string]$flagMatch.groups["flag"].value -lt 1) {
$goodLine = $false
}
}
if ($goodLine) { $goodLineCount++}
}
[string]$goodLineCount + " good lines"
Output:
1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017 good lines
PureBasic
Using regular expressions.
Define filename.s = "readings.txt"
#instrumentCount = 24
Enumeration
#exp_date
#exp_instruments
#exp_instrumentStatus
EndEnumeration
Structure duplicate
date.s
firstLine.i
line.i
EndStructure
NewMap dates() ;records line date occurs first
NewList duplicated.duplicate()
NewList syntaxError()
Define goodRecordCount, totalLines, line.s, i
Dim inputDate.s(0)
Dim instruments.s(0)
If ReadFile(0, filename)
CreateRegularExpression(#exp_date, "\d+-\d+-\d+")
CreateRegularExpression(#exp_instruments, "(\t|\x20)+(\d+\.\d+)(\t|\x20)+\-?\d")
CreateRegularExpression(#exp_instrumentStatus, "(\t|\x20)+(\d+\.\d+)(\t|\x20)+")
Repeat
line = ReadString(0, #PB_Ascii)
If line = "": Break: EndIf
totalLines + 1
ExtractRegularExpression(#exp_date, line, inputDate())
If FindMapElement(dates(), inputDate(0))
AddElement(duplicated())
duplicated()\date = inputDate(0)
duplicated()\firstLine = dates()
duplicated()\line = totalLines
Else
dates(inputDate(0)) = totalLines
EndIf
ExtractRegularExpression(#exp_instruments, Mid(line, Len(inputDate(0)) + 1), instruments())
Define pairsCount = ArraySize(instruments()), containsBadValues = #False
For i = 0 To pairsCount
If Val(ReplaceRegularExpression(#exp_instrumentStatus, instruments(i), "")) < 1
containsBadValues = #True
Break
EndIf
Next
If pairsCount <> #instrumentCount - 1
AddElement(syntaxError()): syntaxError() = totalLines
EndIf
If Not containsBadValues
goodRecordCount + 1
EndIf
ForEver
CloseFile(0)
If OpenConsole()
ForEach duplicated()
PrintN("Duplicate date: " + duplicated()\date + " occurs on lines " + Str(duplicated()\line) + " and " + Str(duplicated()\firstLine) + ".")
Next
ForEach syntaxError()
PrintN( "Syntax error in line " + Str(syntaxError()))
Next
PrintN(#CRLF$ + Str(goodRecordCount) + " of " + Str(totalLines) + " lines read were valid records.")
Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
CloseConsole()
EndIf
EndIf
Sample output:
Duplicate date: 1990-03-25 occurs on lines 85 and 84.
Duplicate date: 1991-03-31 occurs on lines 456 and 455.
Duplicate date: 1992-03-29 occurs on lines 820 and 819.
Duplicate date: 1993-03-28 occurs on lines 1184 and 1183.
Duplicate date: 1995-03-26 occurs on lines 1911 and 1910.

5017 of 5471 lines read were valid records.
Python
import re
import zipfile
import StringIO

def munge2(readings):
    datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
    valuPat = re.compile(r'[-+]?\d+\.\d+')
    statPat = re.compile(r'-?\d+')
    allOk, totalLines = 0, 0
    datestamps = set([])
    for line in readings:
        totalLines += 1
        fields = line.split('\t')
        date = fields[0]
        pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]

        lineFormatOk = datePat.match(date) and \
                       all( valuPat.match(p[0]) for p in pairs ) and \
                       all( statPat.match(p[1]) for p in pairs )
        if not lineFormatOk:
            print 'Bad formatting', line
            continue

        if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
            print 'Missing values', line
            continue

        if date in datestamps:
            print 'Duplicate datestamp', line
            continue
        datestamps.add(date)
        allOk += 1

    print 'Lines with all readings: ', allOk
    print 'Total records: ', totalLines

#zfs = zipfile.ZipFile('readings.zip','r')
#readings = StringIO.StringIO(zfs.read('readings.txt'))
readings = open('readings.txt','r')
munge2(readings)
The results indicate 5013 good records, which differs from the Awk implementation because this version also skips records whose datestamp duplicates an earlier accepted one. The final few lines of the output are as follows:
Missing values 2004-12-29 2.900 1 2.700 1 2.800 1 3.300 1 2.900 1 2.300 1 0.000 0 1.700 1 1.900 1 2.300 1 2.600 1 2.900 1 2.600 1 2.600 1 2.600 1 2.700 1 2.300 1 2.200 1 2.100 1 2.000 1 2.100 1 2.100 1 2.300 1 2.400 1
Missing values 2004-12-30 2.400 1 2.600 1 2.600 1 2.600 1 3.000 1 0.000 0 3.300 1 2.600 1 2.900 1 2.400 1 2.300 1 2.900 1 3.500 1 3.700 1 3.600 1 4.000 1 3.400 1 2.400 1 2.500 1 2.600 1 2.600 1 2.800 1 2.400 1 2.200 1
Missing values 2004-12-31 2.400 1 2.500 1 2.500 1 2.400 1 0.000 0 2.400 1 2.400 1 2.400 1 2.200 1 2.400 1 2.500 1 2.000 1 1.700 1 1.400 1 1.500 1 1.900 1 1.700 1 2.000 1 2.000 1 2.200 1 1.700 1 1.500 1 1.800 1 1.800 1
Lines with all readings: 5013
Total records: 5471
Second Version
Modification of the version above to:
- Remove continue statements so it counts as the AWK example does.
- Generate mostly summary information that is easier to compare to other solutions.
import re
import zipfile
import StringIO

def munge2(readings, debug=False):
    datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
    valuPat = re.compile(r'[-+]?\d+\.\d+')
    statPat = re.compile(r'-?\d+')
    totalLines = 0
    dupdate, badform, badlen, badreading = set(), set(), set(), 0
    datestamps = set([])
    for line in readings:
        totalLines += 1
        fields = line.split('\t')
        date = fields[0]
        pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]

        lineFormatOk = datePat.match(date) and \
                       all( valuPat.match(p[0]) for p in pairs ) and \
                       all( statPat.match(p[1]) for p in pairs )
        if not lineFormatOk:
            if debug: print 'Bad formatting', line
            badform.add(date)

        if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
            if debug: print 'Missing values', line
        if len(pairs)!=24: badlen.add(date)
        if any( int(p[1]) < 1 for p in pairs ): badreading += 1

        if date in datestamps:
            if debug: print 'Duplicate datestamp', line
            dupdate.add(date)

        datestamps.add(date)

    print 'Duplicate dates:\n ', '\n '.join(sorted(dupdate))
    print 'Bad format:\n ', '\n '.join(sorted(badform))
    print 'Bad number of fields:\n ', '\n '.join(sorted(badlen))
    print 'Records with good readings: %i = %5.2f%%\n' % (
        totalLines-badreading, (totalLines-badreading)/float(totalLines)*100 )
    print 'Total records: ', totalLines

readings = open('readings.txt','r')
munge2(readings)
bash$ /cygdrive/c/Python26/python munge2.py
Duplicate dates:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26
Bad format:

Bad number of fields:

Records with good readings: 5017 = 91.70%

Total records:  5471
bash$
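As a rough illustration of why the two versions report different totals (5013 versus 5017), the following Python 3 sketch applies both counting rules to the same records. The function name `count_good`, its `skip_duplicates` parameter, and the miniature data set are invented for this illustration and are not part of either listing above; the records are assumed to be already parsed into (datestamp, flags) pairs.

```python
def count_good(records, skip_duplicates):
    """Count records whose flags are all >= 1.

    records: iterable of (datestamp, flags) pairs (hypothetical, pre-parsed).
    skip_duplicates=True mimics the first version, where `continue` drops
    faulty records and repeated datestamps before anything is counted.
    skip_duplicates=False mimics the second version, where a record is
    judged on its flags alone and duplicates are only reported.
    """
    seen = set()
    good = 0
    for date, flags in records:
        all_ok = all(flag >= 1 for flag in flags)
        if skip_duplicates:
            if not all_ok or date in seen:
                continue          # skipped: faulty reading or repeated date
            seen.add(date)        # only accepted records are remembered
            good += 1
        else:
            seen.add(date)
            if all_ok:
                good += 1
    return good


# Made-up miniature data set: the second 1991-03-31 record is a duplicate
# with good flags; the 1991-04-01 record has one faulty instrument.
sample = [
    ("1991-03-30", [1, 1, 1]),
    ("1991-03-31", [1, 1, 1]),
    ("1991-03-31", [1, 1, 1]),
    ("1991-04-01", [-2, 1, 1]),
]

print(count_good(sample, skip_duplicates=True))   # 2
print(count_good(sample, skip_duplicates=False))  # 3
```

The only record counted differently is the duplicate whose readings are all good, which is the kind of record that separates the two totals.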
R
# Read in data from file
dfr <- read.delim("d:/readings.txt", header=FALSE, colClasses=c("character", rep(c("numeric", "integer"), 24)))
dates <- strptime(dfr[,1], "%Y-%m-%d")
# Any bad values?
dfr[!complete.cases(dfr), ]
# Any duplicated dates
dates[duplicated(dates)]
# Number of rows with no bad values
flags <- as.matrix(dfr[,seq(3,49,2)])>0
sum(apply(flags, 1, all))
REXX
This REXX program processes the file mentioned in "text processing 1" and performs
further validation of the dates, flags, and data.
Some of the checks performed are:
∙ checks for duplicated date records.
∙ checks for a bad date (YYYY-MM-DD) format, among:
∙ ∙ wrong length
∙ ∙ year > current year
∙ ∙ year < 1970 (to allow for posthumous data)
∙ ∙ mm < 1 or mm > 12
∙ ∙ dd < 1 or dd > days for the month
∙ ∙ yyyy, mm, or dd isn't numeric
∙ missing data (or flags)
∙ flag isn't an integer
∙ flag contains a decimal point
∙ data isn't numeric
In addition, any of the displayed numbers may have commas inserted.
The program has (disabled) code to write the report to a file in addition
to the console.
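The date checks listed above can be sketched in Python 3 as follows. This is a minimal illustration, not a translation of the REXX program: the function names and the `min_year`/`max_year` parameters are assumptions made for the example, with the 1970 lower bound and the current-year upper bound taken from the list.

```python
import datetime


def is_leap_year(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year, month):
    """Number of days in the given month; February depends on the year."""
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def is_valid_datestamp(stamp, min_year=1970, max_year=None):
    """Return True if stamp is a plausible YYYY-MM-DD date.

    min_year and max_year are illustrative parameters; max_year
    defaults to the current year.
    """
    if max_year is None:
        max_year = datetime.date.today().year
    parts = stamp.split("-")
    if len(stamp) != 10 or len(parts) != 3:
        return False                      # wrong overall length or shape
    yyyy, mm, dd = parts
    if (len(yyyy), len(mm), len(dd)) != (4, 2, 2):
        return False                      # wrong field widths
    if not all(ch in "0123456789" for ch in yyyy + mm + dd):
        return False                      # non-numeric component
    year, month, day = int(yyyy), int(mm), int(dd)
    if not min_year <= year <= max_year:
        return False
    if not 1 <= month <= 12:
        return False
    return 1 <= day <= days_in_month(year, month)
```

For example, `is_valid_datestamp("1992-02-29")` is accepted because 1992 is a leap year, while `is_valid_datestamp("1991-02-29")` is rejected.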
/*REXX program to process instrument data from a data file. */
numeric digits 20 /*allow for bigger numbers. */
ifid='READINGS.TXT' /*the input file. */
ofid='READINGS.OUT' /*the output file. */
grandSum=0 /*grand sum of whole file. */
grandflg=0 /*grand num of flagged data. */
grandOKs=0
longFlag=0 /*longest period of flagged data.*/
contFlag=0 /*longest continuous flagged data*/
oldDate =0 /*placeholder of penultimate date*/
w =16 /*width of fields when displayed.*/
dupDates=0 /*count of duplicated timestamps.*/
badflags=0 /*count of bad flags (¬ integer).*/
badDates=0 /*count of bad dates (bad format)*/
badData =0 /*count of bad datas (¬ numeric).*/
ignoredR=0 /*count of ignored records (bad).*/
maxInstruments=24 /*maximum number of instruments. */
yyyyCurr=right(date(),4) /*get the current year (today). */
monDD. =31 /*number of days in every month. */
/*February is figured on the fly.*/
monDD.4 =30
monDD.6 =30
monDD.9 =30
monDD.11=30
do records=1 while lines(ifid)\==0 /*read until finished. */
rec=linein(ifid) /*read the next record (line). */
parse var rec datestamp Idata /*pick off the dateStamp & data. */
if datestamp==oldDate then do /*found a duplicate timestamp. */
dupDates=dupDates+1 /*bump the counter.*/
call sy datestamp copies('~',30),
'is a duplicate of the',
"previous datestamp."
ignoredR=ignoredR+1 /*bump ignoredRecs.*/
iterate /*ignore this duplicate record. */
end
parse var datestamp yyyy '-' mm '-' dd /*obtain YYYY, MM, and DD. */
monDD.2=28+leapyear(yyyy) /*how long is February in YYYY ? */
/*check for various bad formats. */
if verify(yyyy||mm||dd,1234567890)\==0 |,
length(datestamp)\==10 |,
length(yyyy)\==4 |,
length(mm )\==2 |,
length(dd )\==2 |,
yyyy<1970 |,
yyyy>yyyyCurr |,
mm=0 | dd=0 |,
mm>12 | dd>monDD.mm then do
badDates=badDates+1
call sy datestamp copies('~',30),
'has an illegal format.'
ignoredR=ignoredR+1 /*bump ignoredRecs.*/
iterate /*ignore this bad date record. */
end
oldDate=datestamp /*save datestamp for next read. */
sum=0
flg=0
OKs=0
do j=1 until Idata='' /*process the instrument data. */
parse var Idata data.j flag.j Idata
if pos('.',flag.j)\==0 |, /*does flag have a decimal point?*/
\datatype(flag.j,'W') then do /*is the flag not a whole number?*/
badflags=badflags+1 /*bump counter.*/
call sy datestamp copies('~',30),
'instrument' j "has a bad flag:",
flag.j
iterate /*ignore it & its data.*/
end
if \datatype(data.j,'N') then do /*is the data not numeric? */
badData=badData+1 /*bump counter.*/
call sy datestamp copies('~',30),
'instrument' j "has bad data:",
data.j
iterate /*ignore it & its flag.*/
end
if flag.j>0 then do /*if good data, ... */
OKs=OKs+1
sum=sum+data.j
if contFlag>longFlag then do
longdate=datestamp
longFlag=contFlag
end
contFlag=0
end
else do /*flagged data ... */
flg=flg+1
contFlag=contFlag+1
end
end
if j>maxInstruments then do
badData=badData+1 /*bump counter.*/
call sy datestamp copies('~',30),
'too many instrument datum'
end
if OKs\==0 then avg=format(sum/OKs,,3)
else avg='[n/a]'
grandOKs=grandOKs+OKs
_=right(comma(avg),w)
grandSum=grandSum+sum
grandFlg=grandFlg+flg
if flg==0 then call sy datestamp ' average='_
else call sy datestamp ' average='_ ' flagged='right(flg,2)
end
records=records-1 /*adjust for reading end-of-file.*/
if grandOKs\==0 then grandAvg=format(grandsum/grandOKs,,3)
else grandAvg='[n/a]'
call sy
call sy copies('=',60)
call sy ' records read:' right(comma(records ),w)
call sy ' records ignored:' right(comma(ignoredR),w)
call sy ' grand sum:' right(comma(grandSum),w+4)
call sy ' grand average:' right(comma(grandAvg),w+4)
call sy ' grand OK data:' right(comma(grandOKs),w)
call sy ' grand flagged:' right(comma(grandFlg),w)
call sy ' duplicate dates:' right(comma(dupDates),w)
call sy ' bad dates:' right(comma(badDates),w)
call sy ' bad data:' right(comma(badData ),w)
call sy ' bad flags:' right(comma(badflags),w)
if longFlag\==0 then
call sy ' longest flagged:' right(comma(longFlag),w) " ending at " longdate
call sy copies('=',60)
call sy
exit
/*─────────────────────────────────────LEAPYEAR subroutine──────────────*/
leapyear: procedure; arg y /*year could be: Y, YY, YYY, YYYY*/
if length(y)==2 then y=left(right(date(),4),2)y /*adjust for YY year.*/
if y//4\==0 then return 0 /* not ÷ by 4? Not a leapyear.*/
return y//100\==0 | y//400==0 /*apply 100 and 400 year rule. */
/*─────────────────────────────────────SY subroutine────────────────────*/
sy: procedure; parse arg stuff
say stuff
if 1==0 then call lineout ofid,stuff
return
/*─────────────────────────────────────COMMA subroutine─────────────────*/
comma: procedure; parse arg _,c,p,t;arg ,cu;c=word(c ",",1);
if cu=='BLANK' then c=' ';o=word(p 3,1);p=abs(o);t=word(t 999999999,1);
if \datatype(p,'W')|\datatype(t,'W')|p==0|arg()>4 then return _;n=_'.9';
#=123456789;k=0;if o<0 then do;b=verify(_,' ');if b==0 then return _;
e=length(_)-verify(reverse(_),' ')+1;end;else do;b=verify(n,#,"M");
e=verify(n,#'0',,verify(n,#"0.",'M'))-p-1;end;
do j=e to b by -p while k<t;_=insert(c,_,j);k=k+1;end;return _
Output:
∙
∙
∙
1991-03-31 average= 23.542
1991-03-31 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ is a duplicate of the previous datestamp.
1991-04-01 average= 23.217 flagged= 1
1991-04-02 average= 19.792
1991-04-03 average= 13.958
∙
∙
∙
============================================================
records read: 5,471
records ignored: 5
grand sum: 1,357,152.400
grand average: 10.496
grand OK data: 129,306
grand flagged: 1,878
duplicate dates: 5
bad dates: 0
bad data: 0
bad flags: 0
longest flagged: 589 ending at 1993-03-05
============================================================
Ruby
require 'set'
def munge2(readings, debug=false)
datePat = /^\d{4}-\d{2}-\d{2}/
valuPat = /^[-+]?\d+\.\d+/
statPat = /^-?\d+/
totalLines = 0
dupdate, badform, badlen, badreading = Set[], Set[], Set[], 0
datestamps = Set[]
for line in readings
totalLines += 1
fields = line.split(/\t/)
date = fields.shift
pairs = fields.each_slice(2).to_a
lineFormatOk = date =~ datePat &&
pairs.all? { |x,y| x =~ valuPat && y =~ statPat }
if !lineFormatOk
puts 'Bad formatting ' + line if debug
badform << date
end
if pairs.length != 24 ||
pairs.any? { |x,y| y.to_i < 1 }
puts 'Missing values ' + line if debug
end
if pairs.length != 24
badlen << date
end
if pairs.any? { |x,y| y.to_i < 1 }
badreading += 1
end
if datestamps.include?(date)
puts 'Duplicate datestamp ' + line if debug
dupdate << date
end
datestamps << date
end
puts 'Duplicate dates:', dupdate.sort.map { |x| ' ' + x }
puts 'Bad format:', badform.sort.map { |x| ' ' + x }
puts 'Bad number of fields:', badlen.sort.map { |x| ' ' + x }
puts 'Records with good readings: %i = %5.2f%%' % [
totalLines-badreading, (totalLines-badreading)/totalLines.to_f*100 ]
puts
puts 'Total records: %d' % totalLines
end
open('readings.txt','r') do |readings|
munge2(readings)
end
Scala
object DataMunging2 {
import scala.io.Source
import scala.collection.immutable.{TreeMap => Map}
val pattern = """^(\d+-\d+-\d+)""" + """\s+(\d+\.\d+)\s+(-?\d+)""" * 24 + "$" r;
def main(args: Array[String]) {
val files = args map (new java.io.File(_)) filter (file => file.isFile && file.canRead)
val (numFormatErrors, numValidRecords, dateMap) =
files.iterator.flatMap(file => Source fromFile file getLines ()).
foldLeft((0, 0, new Map[String, Int] withDefaultValue 0)) {
case ((nFE, nVR, dM), line) => pattern findFirstMatchIn line map (_.subgroups) match {
case Some(List(date, rawData @ _*)) =>
val allValid = (rawData map (_ toDouble) iterator) grouped 2 forall (_.last > 0)
(nFE, nVR + (if (allValid) 1 else 0), dM(date) += 1)
case None => (nFE + 1, nVR, dM)
}
}
dateMap foreach {
case (date, repetitions) if repetitions > 1 => println(date+": "+repetitions+" repetitions")
case _ =>
}
println("""|
|Valid records: %d
|Duplicated dates: %d
|Duplicated records: %d
|Data format errors: %d
|Invalid data records: %d
|Total records: %d""".stripMargin format (
numValidRecords,
dateMap filter { case (_, repetitions) => repetitions > 1 } size,
dateMap.valuesIterable filter (_ > 1) map (_ - 1) sum,
numFormatErrors,
dateMap.valuesIterable.sum - numValidRecords,
dateMap.valuesIterable.sum))
}
}
Sample output:
1990-03-25: 2 repetitions
1991-03-31: 2 repetitions
1992-03-29: 2 repetitions
1993-03-28: 2 repetitions
1995-03-26: 2 repetitions

Valid records: 5017
Duplicated dates: 5
Duplicated records: 5
Data format errors: 0
Invalid data records: 454
Total records: 5471
Tcl
set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1]
set total [llength $data]
set correct $total
set datestamps {}
foreach line $data {
set formatOk true
set hasAllMeasurements true
set date [lindex $line 0]
if {[llength $line] != 49} { set formatOk false }
if {![regexp {\d{4}-\d{2}-\d{2}} $date]} { set formatOk false }
if {[lsearch $datestamps $date] != -1} { puts "Duplicate datestamp: $date" } {lappend datestamps $date}
foreach {value flag} [lrange $line 1 end] {
if {$flag < 1} { set hasAllMeasurements false }
if {![regexp -- {[-+]?\d+\.\d+} $value] || ![regexp -- {-?\d+} $flag]} {set formatOk false}
}
if {!$hasAllMeasurements} { incr correct -1 }
if {!$formatOk} { puts "line \"$line\" has wrong format" }
}
puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"
$ tclsh munge2.tcl
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26
5017 records with good readings = 91.7016998721%
Total records: 5471
Second version
To demonstrate a different method of iterating over the file, and different ways to verify data types:
set total [set good 0]
array set seen {}
set fh [open readings.txt]
while {[gets $fh line] != -1} {
incr total
set fields [regexp -inline -all {[^ \t\r\n]+} $line]
if {[llength $fields] != 49} {
puts "bad format: not 49 fields on line $total"
continue
}
if { ! [regexp {^(\d{4}-\d\d-\d\d)$} [lindex $fields 0] -> date]} {
puts "bad format: invalid date on line $total: '[lindex $fields 0]'"
continue
}
if {[info exists seen($date)]} {
puts "duplicate date on line $total: $date"
}
incr seen($date)
set line_format_ok true
set readings_ignored 0
foreach {value flag} [lrange $fields 1 end] {
if { ! [string is double -strict $value]} {
puts "bad format: value not a float on line $total: '$value'"
set line_format_ok false
}
if { ! [string is int -strict $flag]} {
puts "bad format: flag not an integer on line $total: '$flag'"
set line_format_ok false
}
if {$flag < 1} {incr readings_ignored}
}
if {$line_format_ok && $readings_ignored == 0} {incr good}
}
close $fh
puts "total: $total"
puts [format "good: %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]
Results:
duplicate date on line 85: 1990-03-25
duplicate date on line 456: 1991-03-31
duplicate date on line 820: 1992-03-29
duplicate date on line 1184: 1993-03-28
duplicate date on line 1911: 1995-03-26
total: 5471
good: 5017 = 91.70%
Ursala
Compiled and run in a single step, with the input file accessed as a list of strings pre-declared in readings_dot_txt.
#import std
#import nat
readings = (*F ~&c/;digits+ rlc ==+ ~~ -={` ,9%cOi&,13%cOi&}) readings_dot_txt
valid_format = all -&length==49,@tK27 all ~&w/`.&& ~&jZ\digits--'-.',@tK28 all ~&jZ\digits--'-'&-
duplicate_dates = :/'duplicated dates:'+ ~&hK2tFhhPS|| -[(none)]-!
good_readings = --' good readings'@h+ %nP+ length+ *~ @tK28 all ~='0'&& ~&wZ/`-
#show+
main = valid_format?(^C/good_readings duplicate_dates,-[invalid format]-!) readings
Output:
5017 good readings
duplicated dates:
1995-03-26
1993-03-28
1992-03-29
1991-03-31
1990-03-25
Vedit macro language
This implementation does the following checks:
- Checks for duplicate date fields. Note: duplicates can still be counted as valid records, as in other implementations.
- Checks date format.
- Checks that value fields have 1 or more digits followed by a decimal point followed by 3 digits.
- Reads the flag value and checks that it is positive.
- Requires 24 value/flag pairs on each line.
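For comparison, the per-record checks in this list could be expressed in Python 3 roughly as below. This is a sketch under stated assumptions rather than an equivalent of the macro: the name `record_is_valid` and the three patterns are chosen for illustration, and the duplicate-date check is omitted because it needs state shared across lines.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")   # YYYY-MM-DD shape only
VALUE_RE = re.compile(r"\d+\.\d{3}")         # digits, point, exactly 3 digits
FLAG_RE = re.compile(r"-?\d+")               # optionally signed integer


def record_is_valid(line):
    """Return True if the line has a well-formed date and 24 good pairs."""
    fields = line.split()                    # any run of spaces or tabs
    if len(fields) != 49:                    # datestamp + 24 value/flag pairs
        return False
    if not DATE_RE.fullmatch(fields[0]):
        return False
    for value, flag in zip(fields[1::2], fields[2::2]):
        if not VALUE_RE.fullmatch(value):
            return False
        if not FLAG_RE.fullmatch(flag) or int(flag) < 1:
            return False                     # malformed or non-positive flag
    return True
```

A record made of a datestamp followed by 24 repetitions of `10.000 1` passes, while replacing any one pair with `0.000 -2` makes it fail.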
#50 = Buf_Num // Current edit buffer (source data)
File_Open("|(PATH_ONLY)\output.txt")
#51 = Buf_Num // Edit buffer for output file
Buf_Switch(#50)
#11 = #12 = #13 = #14 = #15 = 0
Reg_Set(15, "xxx")
While(!At_EOF) {
#10 = 0
#12++
// Check for repeated date field
if (Match(@15) == 0) {
#20 = Cur_Line
Buf_Switch(#51) // Output file
Reg_ins(15) IT(": duplicate record at ") Num_Ins(#20)
Buf_Switch(#50) // Input file
#13++
}
// Check format of date field
if (Match("|d|d|d|d-|d|d-|d|d|w", ADVANCE) != 0) {
#10 = 1
#14++
}
Reg_Copy_Block(15, BOL_pos, Cur_Pos-1)
// Check data fields and flags:
Repeat(24) {
if ( Match("|d|*.|d|d|d|w", ADVANCE) != 0 || Num_Eval(ADVANCE) < 1) {
#10 = 1
#15++
Break
}
Match("|W", ADVANCE)
}
if (#10 == 0) { #11++ } // record was OK
Line(1, ERRBREAK)
}
Buf_Switch(#51) // buffer for output data
IN
IT("Valid records: ") Num_Ins(#11)
IT("Duplicates: ") Num_Ins(#13)
IT("Date format errors: ") Num_Ins(#14)
IT("Invalid data records:") Num_Ins(#15)
IT("Total records: ") Num_Ins(#12)
Sample output:
1990-03-25: duplicate record at 85
1991-03-31: duplicate record at 456
1992-03-29: duplicate record at 820
1993-03-28: duplicate record at 1184
1995-03-26: duplicate record at 1911
Valid records: 5017
Duplicates: 5
Date format errors: 0
Invalid data records: 454
Total records: 5471