Text processing/2

From Rosetta Code
Revision as of 22:21, 13 November 2008 by rosettacode>Paddy3118 (More data munging)

The following data shows a few lines from the file readings.txt (as used in the Data Munging task).

The data comes from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, constituting a line of 49 whitespace-separated fields, where whitespace can be one or more space or tab characters.

The fields (from the left) are:

 DATESTAMP [ VALUEn FLAGn ] * 24

i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with that instrument, in which case that instrument's value should be ignored.
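To make the layout concrete: instrument n (1 to 24) has its value in field 2n and its flag in field 2n+1. The following is a minimal sketch of walking those pairs in AWK. The function name `bad_flags` and the generated one-record sample are illustrative assumptions, not part of the task.

```shell
# Hypothetical helper: list every reading whose flag marks it as unusable.
# Instrument n (1..24) has its value in field 2n and its flag in field 2n+1.
bad_flags() {
    awk '{
        for (n = 1; n <= 24; n++) {
            value = $(2 * n)
            flag  = $(2 * n + 1)
            if (flag < 1)
                print $1, "instrument", n, "value", value, "flag", flag
        }
    }' "$@"
}

# Made-up sample record: instrument 1 is flagged -2, the other 23 are good.
sample=$(awk 'BEGIN {
    line = "1991-04-01\t0.000\t-2"
    for (n = 2; n <= 24; n++) line = line "\t10.000\t1"
    print line
}')

printf '%s\n' "$sample" | bad_flags
# prints: 1991-04-01 instrument 1 value 0.000 flag -2
```

Run against the real file, the same function would be called as `bad_flags readings.txt`.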

A sample from the full data file readings.txt is:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

The task:

  1. Confirm the general field format of the file.
  2. Identify any DATESTAMPs that are duplicated.
  3. Report the number of records that have good readings for all instruments.

AWK

A series of AWK one-liners is shown, as this is how such checks are often done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created that combines multi-line versions of the scripts below.

Gradually tie down the format.

(In each case offending lines will be printed)

If there are any scientific notation fields then there will be an e in the file:

bash$ awk '/[eE]/' readings.txt
bash$ 

Quick check on the number of fields:

bash$ awk 'NF != 49' readings.txt
bash$ 

Full check on the file format using a regular expression:

bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt         
bash$ 

Full check on the file format as above but using regular expressions allowing intervals (gnu awk):

bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}$/ )' readings.txt
bash$ 
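The regular-expression checks above only print an offending line; they do not say what is wrong with it. As a possible refinement, the sketch below checks field by field and reports the first problem found in each record. The function name `check_format`, the message wording, and the generated sample lines are assumptions made for illustration. It avoids interval expressions so that it should not need `--re-interval`.

```shell
# Hypothetical helper: report the first malformed field of each bad record.
check_format() {
    awk '
    NF != 49 { print "line " NR ": expected 49 fields, found " NF; next }
    $1 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
        print "line " NR ": bad datestamp " $1; next
    }
    {
        # Even fields hold values, the following odd fields hold flags.
        for (i = 2; i <= NF; i += 2) {
            if ($i !~ /^-?[0-9]+\.[0-9]+$/) {
                print "line " NR ": field " i ": bad value " $i; next
            }
            if ($(i + 1) !~ /^-?[0-9]+$/) {
                print "line " NR ": field " (i + 1) ": bad flag " $(i + 1); next
            }
        }
    }' "$@"
}

# Made-up sample: one good record, one with a scientific-notation value,
# and one with too few fields.
check_format_demo=$(awk 'BEGIN {
    good = "1991-03-30"
    for (n = 1; n <= 24; n++) good = good "\t10.000\t1"
    print good
    bad = good; sub(/10\.000/, "1e1", bad)
    print bad
    print "1991-03-31\t10.000\t1"
}')

printf '%s\n' "$check_format_demo" | check_format
# prints:
# line 2: field 2: bad value 1e1
# line 3: expected 49 fields, found 3
```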


Identify any DATESTAMPs that are duplicated.

This is accomplished by counting how many times the first field occurs and printing the datestamp on its second occurrence.

bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$ 
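If it is also useful to know where the duplicates sit in the file, one possible variation remembers the line number of the first occurrence of each datestamp. This is a sketch only: the function name `dup_stamps` is hypothetical, the sample records are shortened to three fields because only the first field matters here, and unlike the one-liner above it reports every repeat rather than just the second occurrence.

```shell
# Hypothetical helper: show the line numbers of a repeated datestamp.
dup_stamps() {
    awk '
    $1 in first { print $1 ": line " first[$1] " and line " NR; next }
    { first[$1] = NR }' "$@"
}

# Shortened, made-up records; only the datestamp field is inspected.
printf '%s\n' \
    '1991-03-30 10.000 1' \
    '1991-03-31 10.000 1' \
    '1991-03-31 40.000 1' \
    '1991-04-01 0.000 -2' | dup_stamps
# prints: 1991-03-31: line 2 and line 3
```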


Report the number of records that have good readings for all instruments.

bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}'  readings.txt 
Total records 5471 OK records 5017 or 91.7017 %
bash$
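
The introduction to this section mentions that the one-liners might be combined into a more permanent script. A minimal sketch of what that could look like follows. The function name `check_readings` and the message wording are assumptions; it takes the flags to be the odd-numbered fields 3 to 49, as in the field description, and it leaves malformed lines out of the record counts, which is a design choice rather than anything the task requires.

```shell
# Hypothetical wrapper combining the three checks in one pass over the file.
check_readings() {
    awk '
    # 1. Format: a datestamp followed by 24 value/flag pairs.
    NF != 49 || $0 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+-?[0-9]+\.[0-9]+[ \t]+-?[0-9]+)+$/ {
        print "Bad format, line " NR
        next                      # malformed lines are not counted below
    }
    # 2. Duplicates: report the second occurrence of a datestamp.
    ++count[$1] == 2 { print "Duplicate datestamp " $1 }
    # 3. Good records: every flag (fields 3, 5, ..., 49) must be >= 1.
    {
        ok = 1
        for (f = 3; f <= NF; f += 2)
            if ($f < 1) ok = 0
        records++
        good += ok
    }
    END { print "Total records", records + 0, "OK records", good + 0 }' "$@"
}

# Made-up sample: two good records, one duplicate datestamp whose last
# instrument is flagged -2, and one line with too few fields.
demo=$(awk 'BEGIN {
    pairs = ""
    for (n = 1; n <= 24; n++) pairs = pairs "\t10.000\t1"
    print "1991-03-30" pairs
    print "1991-03-31" pairs
    bad = pairs; sub(/1$/, "-2", bad)
    print "1991-03-31" bad
    print "1991-04-01\t10.000"
}')

printf '%s\n' "$demo" | check_readings
# prints:
# Duplicate datestamp 1991-03-31
# Bad format, line 4
# Total records 3 OK records 2
```

On the real data it would be invoked as `check_readings readings.txt`.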