Text processing/2

From Rosetta Code
You are encouraged to solve this task according to the task description, using any language you may know.

The following data shows a few lines from the file readings.txt (as used in the Data Munging task).

The data comes from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, constituting a line of 49 white-space separated fields, where white-space can be one or more space or tab characters.

The fields (from the left) are:

 DATESTAMP [ VALUEn FLAGn ] * 24

i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with that instrument, in which case that instrument's value should be ignored.
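As a minimal sketch (not one of the task solutions below), the field grammar above can be turned into a line parser; `parse_record` is a hypothetical helper name, and it assumes fields are separated by runs of spaces/tabs as described:

```python
# Sketch: split one record per the DATESTAMP [ VALUEn FLAGn ] * 24 format.
# Assumes any run of spaces/tabs separates the 49 fields.
def parse_record(line):
    fields = line.split()                       # splits on whitespace runs
    if len(fields) != 49:
        raise ValueError("expected 49 fields, got %d" % len(fields))
    date = fields[0]
    pairs = [(float(fields[i]), int(fields[i + 1]))   # (VALUEn, FLAGn)
             for i in range(1, 49, 2)]
    return date, pairs

date, pairs = parse_record("1991-03-30" + "\t10.000\t1" * 24)
```

A record parsed this way yields the datestamp string plus twenty-four (value, flag) tuples, ready for the duplicate and flag checks the task asks for.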

A sample from the full data file readings.txt is:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

The task:

  1. Confirm the general field format of the file.
  2. Identify any DATESTAMPs that are duplicated.
  3. Report the number of records that have good readings for all instruments.
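The duplicate and good-record checks (tasks 2 and 3) can be sketched compactly; this is a hypothetical helper, assuming the file has already been parsed into (date, [(value, flag), ...]) tuples:

```python
# Sketch of tasks 2 and 3 over pre-parsed records.
from collections import Counter

def analyse(records):
    dates = Counter(date for date, _ in records)
    duplicates = sorted(d for d, n in dates.items() if n > 1)
    good = sum(1 for _, pairs in records
               if len(pairs) == 24 and all(flag >= 1 for _, flag in pairs))
    return duplicates, good

recs = [("1991-03-30", [(10.0, 1)] * 24),
        ("1991-03-31", [(0.0, -2)] * 24),
        ("1991-03-31", [(10.0, 1)] * 24)]
dups, good = analyse(recs)   # dups == ["1991-03-31"], good == 2
```

Note that, as in most of the solutions below, a duplicated datestamp does not by itself disqualify a record from being counted as good.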

Ada

<lang ada>with Ada.Calendar;           use Ada.Calendar;
with Ada.Text_IO;            use Ada.Text_IO;
with Strings_Edit;           use Strings_Edit;
with Strings_Edit.Floats;    use Strings_Edit.Floats;
with Strings_Edit.Integers;  use Strings_Edit.Integers;

with Generic_Map;

procedure Data_Munging_2 is

  package Time_To_Line is new Generic_Map (Time, Natural);
  use Time_To_Line;
  File    : File_Type;
  Line_No : Natural := 0;
  Count   : Natural := 0;
  Stamps  : Map;

begin

  Open (File, In_File, "readings.txt");
  loop
     declare
        Line    : constant String := Get_Line (File);
        Pointer : Integer := Line'First;
        Flag    : Integer;
        Year, Month, Day : Integer;
        Data    : Float;
        Stamp   : Time;
        Valid   : Boolean := True;
     begin
        Line_No := Line_No + 1;
        Get (Line, Pointer, SpaceAndTab);
        Get (Line, Pointer, Year);
        Get (Line, Pointer, Month);
        Get (Line, Pointer, Day);
        Stamp := Time_Of (Year_Number (Year), Month_Number (-Month), Day_Number (-Day));
        begin
           Add (Stamps, Stamp, Line_No);
        exception
           when Constraint_Error =>
              Put (Image (Year) & Image (Month) & Image (Day) & ": record at " & Image (Line_No));
              Put_Line (" duplicates record at " & Image (Get (Stamps, Stamp)));
        end;
        Get (Line, Pointer, SpaceAndTab);
        for Reading in 1..24 loop
           Get (Line, Pointer, Data);
           Get (Line, Pointer, SpaceAndTab);
           Get (Line, Pointer, Flag);
           Get (Line, Pointer, SpaceAndTab);
           Valid := Valid and then Flag >= 1;
        end loop;
        if Pointer <= Line'Last then
           Put_Line ("Unrecognized tail at " & Image (Line_No) & ':' & Image (Pointer));
        elsif Valid then
           Count := Count + 1;
        end if;
     exception
        when End_Error | Data_Error | Constraint_Error | Time_Error =>
           Put_Line ("Syntax error at " & Image (Line_No) & ':' & Image (Pointer));
     end;
  end loop;

exception

  when End_Error =>
     Close (File);
     Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");

end Data_Munging_2; </lang>

Sample output

1990-3-25: record at 85 duplicates record at 84
1991-3-31: record at 456 duplicates record at 455
1992-3-29: record at 820 duplicates record at 819
1993-3-28: record at 1184 duplicates record at 1183
1995-3-26: record at 1911 duplicates record at 1910
Valid records 5017 of 5471 total

AWK

A series of AWK one-liners is shown, as this is often how such checks are done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created combining multi-line versions of the scripts below.

Gradually tie down the format.

(In each case offending lines will be printed)

If there are any scientific-notation fields then there will be an e in the file:

bash$ awk '/[eE]/' readings.txt
bash$ 

Quick check on the number of fields:

bash$ awk 'NF != 49' readings.txt
bash$ 

Full check on the file format using a regular expression:

bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt         
bash$ 

Full check on the file format as above but using regular expressions allowing intervals (gnu awk):

bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}$/)' readings.txt
bash$ 


Identify any DATESTAMPs that are duplicated.

Accomplished by counting how many times the first field occurs and noting any second occurrences.

bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$ 


Report the number of records that have good readings for all instruments.

bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}'  readings.txt                        
Total records 5471 OK records 5017 or 91.7017 %
bash$ 

C++

#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <vector>
#include <set>
#include <cstdlib>
#include <algorithm>
#include <iterator>
using namespace std ;

boost::regex e ( "\\s+" ) ; 

int main( int argc , char *argv[ ] ) { 
   ifstream infile( argv[ 1 ] , ios::in ) ; 
   vector<string> filelines , fields, duplicates ;
   set<string> datestamps ; //for the datestamps
   if ( ! ( infile.is_open( ) ) ) { 
      cerr << "Can't open file " << argv[ 1 ] << '\n' ;
      return 1 ; 
   }   
   else {
      string eingabe ;
      while ( infile ) { 
         getline( infile , eingabe ) ;
         filelines.push_back( eingabe ) ;//store file in list
      }
      infile.close( ) ;
   }
   vector<string>::iterator lsi , fi ;
   int all_ok = 0 , duplicated = 0 ;//all_ok for lines in the given pattern
                                    //e, duplicated for datestamps mentioned
                                    //twice
   int pattern_ok = 0 ; //overall field pattern of record is ok
   for ( lsi = filelines.begin( ) ; lsi != filelines.end( ) ; ++lsi ) {
      boost::sregex_token_iterator i ( lsi->begin( ), lsi->end( ) , e , -1 ), j ; //split the line on whitespace
      while ( i != j ) {
         fields.push_back(*i) ;
         ++i ;
      }
      if ( fields.size( ) != 49 )//we expect 49 fields in a record
         cout << "Format not ok!\n" ;
      if ( fields.size( ) == 49 )
         pattern_ok++ ;
      fi = fields.begin( ) ;
      if ( datestamps.insert( *fi ).second ) { //not duplicated
         int n = 1 ;
         int howoften = ( fields.size( ) - 1 ) / 2 ; //number of measurement
                                                     //devices and values
         while ( atoi( (fields.begin( ) + 2 * n )->c_str( ) ) >= 1 ) {
            n++ ;
            if ( n == howoften + 1 ) {
               all_ok++ ;
               break ;
            }
         }
      }
      else {
         duplicated++ ;
         duplicates.push_back( *(fields.begin( ) ) ) ;//first field holds datestamp
      }
      fields.clear( ) ; //must be cleared before next line is read
   }
   cout << "The following " << duplicated << " datestamps were duplicated:\n" ;
   copy( duplicates.begin( ) , duplicates.end( ) ,
         ostream_iterator<string>( cout , "\n" ) ) ;
   cout << all_ok << " records were complete and ok!\n" ;
   return 0 ;
}

The program produces the following output:

Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31

Haskell

Translation of: OCaml
type Date = String
type Value = Double
type Flag = Int
type Record = (Date, [(Value,Flag)])

duplicatedDates :: [Record] -> [Date]
duplicatedDates [] = []
duplicatedDates [_] = []
duplicatedDates (a:b:tl)
    | sameDate a b = date a : duplicatedDates tl
    | otherwise    = duplicatedDates (b:tl)
    where sameDate a b = date a == date b
          date = fst

numGoodRecords :: [Record] -> Int
numGoodRecords = length . filter recordOk
    where recordOk :: Record -> Bool
          recordOk (_,record) = sumOk == 24
              where sumOk = length $ filter isOk record
                    isOk (_,v) = v >= 1

parseLine :: String -> Record
parseLine line = (date, records')
    where (date:records) = words line
          records' = mapRecords records
          
          mapRecords :: [String] -> [(Value,Flag)]
          mapRecords [] = []
          mapRecords [_] = error "invalid data"
          mapRecords (value:flag:tail) =
              (read value, read flag) : mapRecords tail

main :: IO ()
main = do
  contents <- readFile "readings.txt"
  let inputs = map parseLine $ lines contents
  putStrLn $ show (length inputs) ++ " total lines"
  putStrLn "duplicated dates:"
  mapM_ putStrLn $ duplicatedDates inputs
  putStrLn $ "number of good records: " ++ show (numGoodRecords inputs)

This script outputs:

5471 total lines
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017

OCaml

<lang ocaml>#load "str.cma"
open Str

let strip_cr str =

 let last = pred(String.length str) in
 if str.[last] <> '\r' then (str) else (String.sub str 0 last)

let map_records =

 let rec aux acc = function
   | value::flag::tail ->
       let e = (float_of_string value, int_of_string flag) in
       aux (e::acc) tail
   | _::[] -> invalid_arg "invalid data"
   | [] -> (List.rev acc)
 in
 aux [] ;;

let duplicated_dates =

 let same_date (d1,_) (d2,_) = (d1 = d2) in
 let date (d,_) = d in
 let rec aux acc = function
   | a::b::tl when same_date a b ->
       aux (date a::acc) tl
   | _::tl ->
       aux acc tl
   | [] ->
       (List.rev acc)
 in
 aux [] ;;

let record_ok (_,record) =

 let is_ok (_,v) = (v >= 1) in
 let sum_ok =
   List.fold_left (fun sum this ->
     if is_ok this then succ sum else sum) 0 record
 in
 (sum_ok = 24)

let num_good_records =

 List.fold_left  (fun sum record ->
   if record_ok record then succ sum else sum) 0 ;;

let parse_line line =

 let li = split (regexp "[ \t]+") line in
 let records = map_records (List.tl li)
 and date = (List.hd li) in
 (date, records)

let () =

 let ic = open_in "readings.txt" in
 let rec read_loop acc =
   try
     let line = strip_cr(input_line ic) in
     read_loop ((parse_line line) :: acc)
   with End_of_file ->
     close_in ic;
     (List.rev acc)
 in
 let inputs = read_loop [] in
 Printf.printf "%d total lines\n" (List.length inputs);
 Printf.printf "duplicated dates:\n";
 let dups = duplicated_dates inputs in
 List.iter print_endline dups;
 Printf.printf "number of good records: %d\n" (num_good_records inputs);
</lang>

This script outputs:

5471 total lines
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017

Perl

<lang perl>use List::MoreUtils 'natatime';
use constant FIELDS => 49;

binmode STDIN, ':crlf';
  # Read the newlines properly even if we're not running on
  # Windows.

my ($line, $good_records, %dates) = (0, 0);
while (<>)
  {++$line;
   my @fs = split /\s+/;
   @fs == FIELDS or die "$line: Bad number of fields.\n";
   for (shift @fs)
      {/\d{4}-\d{2}-\d{2}/ or die "$line: Bad date format.\n";
       ++$dates{$_};}
   my $iterator = natatime 2, @fs;
   my $all_flags_okay = 1;
   while ( my ($val, $flag) = $iterator->() )
      {$val =~ /\d+\.\d+/ or die "$line: Bad value format.\n";
       $flag =~ /\A-?\d+/ or die "$line: Bad flag format.\n";
       $flag < 1 and $all_flags_okay = 0;}
   $all_flags_okay and ++$good_records;}

print "Good records: $good_records\n",
  "Repeated timestamps:\n",
  map {"  $_\n"}
  grep {$dates{$_} > 1}
  sort keys %dates;</lang>

Output:

Good records: 5017
Repeated timestamps:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26

Python

<lang python>import re
import zipfile
import StringIO

def munge2(readings):

  datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
  valuPat = re.compile(r'[-+]?\d+\.\d+')
  statPat = re.compile(r'-?\d+')
  allOk, totalLines = 0, 0
  datestamps = set([])
  for line in readings:
     totalLines += 1
     fields = line.split('\t')
     date = fields[0]
     pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]
     lineFormatOk = datePat.match(date) and \
        all( valuPat.match(p[0]) for p in pairs ) and \
        all( statPat.match(p[1]) for p in pairs )
     if not lineFormatOk:
        print 'Bad formatting', line
        continue
     if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
        print 'Missing values', line
        continue
     if date in datestamps:
        print 'Duplicate datestamp', line
        continue
     datestamps.add(date)
     allOk += 1
  print 'Lines with all readings: ', allOk
  print 'Total records: ', totalLines
#zfs = zipfile.ZipFile('readings.zip','r')
#readings = StringIO.StringIO(zfs.read('readings.txt'))
readings = open('readings.txt','r')
munge2(readings)</lang>

The results indicate 5013 good records, which differs from the AWK implementation. The final few lines of the output are as follows:

Missing values 2004-12-29	2.900	1	2.700	1	2.800	1	3.300	1	2.900	1	2.300	1	0.000	0	1.700	1	1.900	1	2.300	1	2.600	1	2.900	1	2.600	1	2.600	1	2.600	1	2.700	1	2.300	1	2.200	1	2.100	1	2.000	1	2.100	1	2.100	1	2.300	1	2.400	1

Missing values 2004-12-30	2.400	1	2.600	1	2.600	1	2.600	1	3.000	1	0.000	0	3.300	1	2.600	1	2.900	1	2.400	1	2.300	1	2.900	1	3.500	1	3.700	1	3.600	1	4.000	1	3.400	1	2.400	1	2.500	1	2.600	1	2.600	1	2.800	1	2.400	1	2.200	1

Missing values 2004-12-31	2.400	1	2.500	1	2.500	1	2.400	1	0.000	0	2.400	1	2.400	1	2.400	1	2.200	1	2.400	1	2.500	1	2.000	1	1.700	1	1.400	1	1.500	1	1.900	1	1.700	1	2.000	1	2.000	1	2.200	1	1.700	1	1.500	1	1.800	1	1.800	1

Lines with all readings:  5013
Total records:  5471

Second Version

Modification of the version above to:

  • Remove continue statements so it counts as the AWK example does.
  • Generate mostly summary information that is easier to compare to other solutions.

<lang python>import re
import zipfile
import StringIO

def munge2(readings, debug=False):

  datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
  valuPat = re.compile(r'[-+]?\d+\.\d+')
  statPat = re.compile(r'-?\d+')
  totalLines = 0
  dupdate, badform, badlen, badreading = set(), set(), set(), 0
  datestamps = set([])
  for line in readings:
     totalLines += 1
     fields = line.split('\t')
     date = fields[0]
     pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]

     lineFormatOk = datePat.match(date) and \
        all( valuPat.match(p[0]) for p in pairs ) and \
        all( statPat.match(p[1]) for p in pairs )
     if not lineFormatOk:
        if debug: print 'Bad formatting', line
        badform.add(date)
        
     if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
        if debug: print 'Missing values', line
     if len(pairs)!=24: badlen.add(date)
     if any( int(p[1]) < 1 for p in pairs ): badreading += 1

     if date in datestamps:
        if debug: print 'Duplicate datestamp', line
        dupdate.add(date)
     datestamps.add(date)
  print 'Duplicate dates:\n ', '\n  '.join(sorted(dupdate)) 
  print 'Bad format:\n ', '\n  '.join(sorted(badform)) 
  print 'Bad number of fields:\n ', '\n  '.join(sorted(badlen)) 
  print 'Records with good readings: %i = %5.2f%%\n' % (
     totalLines-badreading, (totalLines-badreading)/float(totalLines)*100 )
  print 'Total records: ', totalLines

readings = open('readings.txt','r')
munge2(readings)</lang>

bash$  /cygdrive/c/Python26/python  munge2.py 
Duplicate dates:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26
Bad format:
  
Bad number of fields:
  
Records with good readings: 5017 = 91.70%

Total records:  5471
bash$ 

Tcl

set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1]
set total [llength $data]
set correct $total
set datestamps {}

foreach line $data {
    set formatOk true
    set hasAllMeasurements true

    set date [lindex $line 0]
    if {[llength $line] != 49} { set formatOk false }
    if {![regexp {\d{4}-\d{2}-\d{2}} $date]} { set formatOk false }
    if {[lsearch $datestamps $date] != -1} { puts "Duplicate datestamp: $date" } else { lappend datestamps $date }

    foreach {value flag} [lrange $line 1 end] {
        if {$flag < 1} { set hasAllMeasurements false }

        if {![regexp -- {[-+]?\d+\.\d+} $value] || ![regexp -- {-?\d+} $flag]} {set formatOk false}
    }   
    if {!$hasAllMeasurements} { incr correct -1 }
    if {!$formatOk} { puts "line \"$line\" has wrong format" }
}

puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"
$ tclsh munge2.tcl 
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26
5017 records with good readings = 91.7016998721%
Total records: 5471

Second version

To demonstrate a different method of iterating over the file, and different ways to verify data types:

<lang tcl>set total [set good 0]
array set seen {}
set fh [open readings.txt]
while {[gets $fh line] != -1} {

   incr total
   set fields [regexp -inline -all {[^ \t\r\n]+} $line]
   if {[llength $fields] != 49} {
       puts "bad format: not 49 fields on line $total"
       continue
   }
   if { ! [regexp {^(\d{4}-\d\d-\d\d)$} [lindex $fields 0] -> date]} {
       puts "bad format: invalid date on line $total: '[lindex $fields 0]'"
       continue
   }
   }
   if {[info exists seen($date)]} {
       puts "duplicate date on line $total: $date"
   }
   incr seen($date)
   
   set line_format_ok true
   set readings_ignored 0
   foreach {value flag} [lrange $fields 1 end] {
       if { ! [string is double -strict $value]} {
           puts "bad format: value not a float on line $total: '$value'"
           set line_format_ok false
       }
       if { ! [string is int -strict $flag]} {
           puts "bad format: flag not an integer on line $total: '$flag'"
           set line_format_ok false
       }
       if {$flag < 1} {incr readings_ignored}
   }
   if {$line_format_ok && $readings_ignored == 0} {incr good}

}
close $fh

puts "total: $total"
puts [format "good: %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]</lang>

Results:

duplicate date on line 85: 1990-03-25
duplicate date on line 456: 1991-03-31
duplicate date on line 820: 1992-03-29
duplicate date on line 1184: 1993-03-28
duplicate date on line 1911: 1995-03-26
total: 5471
good:  5017 = 91.70%

Vedit macro language

This implementation does the following checks:

  • Checks for duplicate date fields. Note: duplicates can still be counted as valid records, as in other implementations.
  • Checks date format.
  • Checks that value fields have 1 or more digits followed by decimal point followed by 3 digits
  • Reads flag value and checks if it is positive
  • Requires 24 value/flag pairs on each line
#50 = Buf_Num           // Current edit buffer (source data)
File_Open("|(PATH_ONLY)\output.txt")
#51 = Buf_Num           // Edit buffer for output file
Buf_Switch(#50)

#11 = #12 = #13 = #14 = #15 = 0
Reg_Set(15, "xxx")

While(!At_EOF) {
    #10 = 0
    #12++

    // Check for repeated date field
    if (Match(@15) == 0) {
        #20 = Cur_Line
        Buf_Switch(#51)   // Output file
        Reg_ins(15) IT(": duplicate record at ") Num_Ins(#20)
        Buf_Switch(#50)   // Input file
        #13++
    }

    // Check format of date field
    if (Match("|d|d|d|d-|d|d-|d|d|w", ADVANCE) != 0) {
        #10 = 1
        #14++
    }
    Reg_Copy_Block(15, BOL_pos, Cur_Pos-1)

    // Check data fields and flags:
    Repeat(24) {
        if ( Match("|d|*.|d|d|d|w", ADVANCE) != 0 || Num_Eval(ADVANCE) < 1) {
            #10 = 1
            #15++
            Break
        }
        Match("|W", ADVANCE)
    }
    if (#10 == 0) { #11++ }             // record was OK
    Line(1, ERRBREAK)
}

Buf_Switch(#51)         // buffer for output data
IN
IT("Valid records:       ") Num_Ins(#11)
IT("Duplicates:          ") Num_Ins(#13)
IT("Date format errors:  ") Num_Ins(#14)
IT("Invalid data records:") Num_Ins(#15)
IT("Total records:       ") Num_Ins(#12)

Sample output:

1990-03-25: duplicate record at    85
1991-03-31: duplicate record at   456
1992-03-29: duplicate record at   820
1993-03-28: duplicate record at  1184
1995-03-26: duplicate record at  1911

Valid records:        5017
Duplicates:              5
Date format errors:      0
Invalid data records:  454
Total records:        5471