Text processing/2: Difference between revisions
Content added Content deleted
m (ce lead paragraphs) |
(jq) |
||
Line 1,340: | Line 1,340: | ||
var analyze = analyze_func('readings.txt'); |
var analyze = analyze_func('readings.txt'); |
||
analyze();</lang> |
analyze();</lang> |
||
=={{header|jq}}== |
|||
{{works with|jq|with regex support}} |
|||
For this task it is convenient to use jq in a pipeline. The first invocation of jq converts the text file into a stream of JSON arrays, one array per line: |
|||
<lang sh>$ jq -R '[splits("[ \t]+")]' Text_processing_2.txt</lang> |
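As a rough illustration of this first stage, the sketch below feeds one shortened, made-up record (a date followed by two value/flag pairs rather than 24) through the same filter. The sample record and the variable name <code>line</code> are assumptions for illustration only; the -c option merely prints each array on a single line.

```shell
# One made-up, tab-separated record: a date and two value/flag pairs.
line=$(printf '1991-03-30\t10.000\t1\t20.500\t-2\n' | jq -R -c '[splits("[ \t]+")]')
echo "$line"
# prints ["1991-03-30","10.000","1","20.500","-2"]
```

Every field arrives as a JSON string, which is why the program that follows checks fields with regular expressions and converts flags with tonumber.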
|||
The second invocation of jq, run with the -s option so that it reads the whole stream as a single array of records, performs the task itself. It uses the following program. |
|||
'''Generic Utilities''' |
|||
<lang jq># Given any array, produce an array of [item, count] pairs for each run. |
|||
def runs: |
|||
reduce .[] as $item |
|||
( []; |
|||
if . == [] then [ [ $item, 1] ] |
|||
else .[length-1] as $last |
|||
| if $last[0] == $item then (.[0:length-1] + [ [$item, $last[1] + 1] ] ) |
|||
else . + [[$item, 1]] |
|||
end |
|||
end ) ; |
|||
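# A float needs a decimal point and at least one digit. |
|||
def is_float: test("^[-+]?([0-9]+[.][0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)?$"); |
|||
def is_integral: test("^[-+]?[0-9]+$"); |
|||
# Anchored so that the whole field must be a date. |
|||
def is_date: test("^[12][0-9]{3}-[0-9][0-9]-[0-9][0-9]$");</lang> |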
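To make the behaviour of <code>runs</code> concrete, here is a small sketch that applies it to a made-up array. The definition is repeated so that the command is self-contained; the input array and the variable name <code>result</code> are assumptions for illustration.

```shell
# Run-length encode a small array with the runs filter defined above.
result=$(jq -n -c '
  def runs:
    reduce .[] as $item
      ( [];
        if . == [] then [[$item, 1]]
        else .[length-1] as $last
        | if $last[0] == $item then .[0:length-1] + [[$item, $last[1] + 1]]
          else . + [[$item, 1]]
          end
        end );
  ["a", "a", "b", "a"] | runs')
echo "$result"
# prints [["a",2],["b",1],["a",1]]
```

Only adjacent equal items are merged, so the final "a" starts a new run. This is why the input has to be sorted before <code>runs</code> can be used to count duplicates.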
|||
'''Validation''': |
|||
<lang jq># Report line and field numbers using conventional numbering (index origin 1). |
|||
def validate_line(nr): |
|||
def validate_date: |
|||
if is_date then empty else "field 1 in line \(nr) has an invalid date: \(.)" end; |
|||
def validate_length(n): |
|||
if length == n then empty else "line \(nr) has \(length) fields" end; |
|||
def validate_pair(i): |
|||
( .[2*i + 1] as $n |
|||
| if ($n | is_float) then empty else "field \(2*i + 2) in line \(nr) is not a float: \($n)" end), |
|||
( .[2*i + 2] as $n |
|||
| if ($n | is_integral) then empty else "field \(2*i + 3) in line \(nr) is not an integer: \($n)" end); |
|||
(.[0] | validate_date), |
|||
(validate_length(49)), |
|||
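# floor guards against an unpaired trailing field, which validate_length already reports. |
|||
(range(0; ((length-1) / 2 | floor)) as $i | validate_pair($i)) ; |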
|||
def validate_lines: |
|||
. as $in |
|||
| range(0; length) as $i | ($in[$i] | validate_line($i + 1));</lang> |
|||
'''Check for duplicate timestamps''' |
|||
<lang jq>def duplicate_timestamps: |
|||
[.[][0]] | sort | runs | map( select(.[1]>1) );</lang> |
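The following sketch runs the whole two-stage pipeline on three made-up, shortened records to show how sorting the dates and then applying <code>runs</code> exposes duplicates. The sample records and the variable name <code>dups</code> are assumptions; <code>runs</code> is repeated so that the command is self-contained.

```shell
# Three made-up records, two of which share a date.
dups=$(printf '1991-03-31 1.000 1\n1991-03-30 2.000 1\n1991-03-31 3.000 1\n' |
  jq -R '[splits("[ \t]+")]' |
  jq -s -c '
    def runs:
      reduce .[] as $item
        ( [];
          if . == [] then [[$item, 1]]
          else .[length-1] as $last
          | if $last[0] == $item then .[0:length-1] + [[$item, $last[1] + 1]]
            else . + [[$item, 1]]
            end
          end );
    def duplicate_timestamps:
      [.[][0]] | sort | runs | map(select(.[1] > 1));
    duplicate_timestamps')
echo "$dups"
# prints [["1991-03-31",2]]
```

The -s option on the second invocation gathers the per-line arrays into one array, so <code>.[][0]</code> yields the first field of every record.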
|||
'''Number of records with valid readings for all instruments''': |
|||
<lang jq># The following ignores any issues with respect to duplicate dates, |
|||
# but does check the validity of the record, including the date format: |
|||
def number_of_valid_readings: |
|||
def check: |
|||
. as $in |
|||
| (.[0] | is_date) |
|||
and length == 49 |
|||
and all(range(0; 24); $in[2*. + 1] | is_float) |
|||
and all(range(0; 24); $in[2*. + 2] | (is_integral and tonumber >= 1) ); |
|||
map(select(check)) | length ;</lang> |
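The row check can be tried out on a scaled-down example. The sketch below is not the program above: it introduces a hypothetical <code>valid_rows($pairs)</code> that takes the number of value/flag pairs as a parameter (2 here instead of 24), restates the predicates in a fully anchored form, and uses made-up records.

```shell
# Count rows whose date, field count, values and flags are all acceptable.
count=$(jq -n '
  def is_float: test("^[-+]?([0-9]+[.][0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)?$");
  def is_integral: test("^[-+]?[0-9]+$");
  def is_date: test("^[12][0-9]{3}-[0-9][0-9]-[0-9][0-9]$");
  # Hypothetical generalisation: $pairs value/flag pairs per row.
  def valid_rows($pairs):
    def check:
      . as $in
      | (.[0] | is_date)
        and length == 2 * $pairs + 1
        and all(range(0; $pairs); $in[2*. + 1] | is_float)
        and all(range(0; $pairs); $in[2*. + 2] | (is_integral and tonumber >= 1));
    map(select(check)) | length;
  [ ["1991-03-30", "10.000", "1", "20.500", "1"],
    ["1991-03-31", "10.000", "1", "20.500", "-2"],
    ["1991-04-01", "10000", "1", "20.500", "1"] ]
  | valid_rows(2)')
echo "$count"
# prints 1
```

The second row is rejected because one flag is below 1, and the third because "10000" has no decimal point. Since <code>and</code> short-circuits, the later tests are never applied to a row that has already failed an earlier one.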
|||
'''Generate Report''' |
|||
<lang jq>validate_lines, |
|||
"\nChecking for duplicate timestamps:", |
|||
duplicate_timestamps, |
|||
"\nThere are \(number_of_valid_readings) valid rows altogether."</lang> |
|||
{{out}} |
|||
'''Part 1: Simple demonstration''' |
|||
To illustrate that the program does report invalid lines, we first run it on the six sample lines from the top of this page, saved as Text_processing_2.txt with the last line mangled. |
|||
<lang sh>$ jq -R '[splits("[ \t]+")]' Text_processing_2.txt | jq -s -r -f Text_processing_2.jq |
|||
field 1 in line 6 has an invalid date: 991-04-03 |
|||
line 6 has 47 fields |
|||
field 2 in line 6 is not a float: 10000 |
|||
field 3 in line 6 is not an integer: 1.0 |
|||
field 47 in line 6 is not an integer: x |
|||
Checking for duplicate timestamps: |
|||
[ |
|||
[ |
|||
"1991-03-31", |
|||
2 |
|||
] |
|||
] |
|||
There are 5 valid rows altogether.</lang> |
|||
'''Part 2: readings.txt''' |
|||
<lang sh>$ jq -R '[splits("[ \t]+")]' readings.txt | jq -s -r -f Text_processing_2.jq |
|||
Checking for duplicate timestamps: |
|||
[ |
|||
[ |
|||
"1990-03-25", |
|||
2 |
|||
], |
|||
[ |
|||
"1991-03-31", |
|||
2 |
|||
], |
|||
[ |
|||
"1992-03-29", |
|||
2 |
|||
], |
|||
[ |
|||
"1993-03-28", |
|||
2 |
|||
], |
|||
[ |
|||
"1995-03-26", |
|||
2 |
|||
] |
|||
] |
|||
There are 5017 valid rows altogether.</lang> |
|||
=={{header|Lua}}== |
=={{header|Lua}}== |
||
<lang lua>filename = "readings.txt" |
<lang lua>filename = "readings.txt" |