Text processing/1: Difference between revisions
Maximum run of 589 consecutive false readings ends at 1993-03-05</pre> |
=={{header|jq}}==
{{works with|jq|with foreach}}

This entry highlights jq's "foreach" and "inputs" filters (both available since jq 1.5), which allow the input file to be processed efficiently, one line at a time, with minimal memory requirements.

The "foreach" syntax is:
<lang jq>foreach STREAM as $row ( INITIAL; EXPRESSION; VALUE )</lang>

The basic idea is that the state starts out as INITIAL; then, for each $row in STREAM, EXPRESSION is evaluated to update the state, and the value specified by VALUE is emitted.
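
As a minimal sketch of how the three parts of "foreach" interact, the following filter emits the running total of a stream of numbers. The name "running_sums" is hypothetical and is not used elsewhere in this entry.

<lang jq># Hypothetical helper, for illustration only:
# emit the running total of a stream of numbers.
def running_sums(stream):
  foreach stream as $x
    ( 0;        # INITIAL: the state before any item has been seen
      . + $x;   # EXPRESSION: the new state, computed from the old state and $x
      . );      # VALUE: what is emitted for this item, given the new state

[running_sums(1, 2, 3)]   # => [1,3,6]</lang>

One value is emitted per item of the stream, which is why "foreach" rather than "reduce" suits per-line output.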

If we wished only to produce per-line synopses of the "readings.txt" file, the following pattern could be used:
<lang jq>foreach (inputs | split("\t")) as $line (INITIAL; EXPRESSION; VALUE)</lang>

In order to emit a whole-file synopsis after the per-line synopses, and to tell the two apart, we will use the following pattern instead:
<lang jq>foreach ((inputs | split("\t")), null) as $line (INITIAL; EXPRESSION; VALUE)</lang>

The "null" is appended to the stream as an end-of-input marker, so that the program can tell when the last line has been processed and emit the whole-file synopsis instead of a per-line value.
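
The sentinel pattern can be sketched on a small stream of numbers. The helper name "squares_then_total" and the "done", "last" and "total" keys are hypothetical; the sketch also assumes that the stream itself never contains null, as is the case for the arrays produced by split("\t").

<lang jq># Hypothetical helper, for illustration only:
# emit the square of each item, then a final total once the sentinel is seen.
def squares_then_total(stream):
  foreach (stream, null) as $x     # the trailing null signals END
    ( {"total": 0};
      if $x == null then .done = true
      else .last = $x * $x | .total += .last
      end;
      # per-item value for ordinary items, summary for the sentinel
      if .done then "total: \(.total)" else .last end );

[squares_then_total(1, 2, 3)]   # => [1,4,9,"total: 14"]</lang>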

In this section, the whole-file synopsis focuses on runs of consecutive lines having at least one flag<=0. The maximal length of such runs is computed, and the starting line number(s) and date(s) of all runs of that maximal length are recorded.

One point of interest in the following program is the use of a JSON object to hold the state. This allows mnemonic key names to be used in place of local variables.
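
The run-tracking idea, with the state held in a JSON object, can be sketched in isolation on a stream of booleans. The helper name "longest_runs" and the key "i" are hypothetical, and "reduce" is used here only because no per-item output is wanted; the bookkeeping mirrors that of the full program.

<lang jq># Hypothetical helper, for illustration only: given a stream of booleans,
# return the length and the 0-based start indices of the longest runs of true.
def longest_runs(stream):
  reduce stream as $b
    ( {"i": -1, "length": 0, "max": 0, "starts": []};
      .i += 1
      | if $b then
          (if .length == 0 then .start = .i else . end)  # a new run begins here
          | .length += 1
          | if .max < .length then .max = .length | .starts = [.start]
            elif .max == .length then .starts += [.start]
            else .
            end
        else .length = 0                                  # the run is broken
        end )
  | {max, starts};

longest_runs(true, true, false, true, true, false, true)
# => {"max":2,"starts":[0,3]}</lang>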

<lang jq># Input: { "max": max_run_length,
#          "starts": array_of_start_line_values, # of all the maximal runs
#          "start_dates": array_of_start_dates   # of all the maximal runs
#        }
def report:
  (.starts | length) as $l
  | if $l == 1 then
      "There is one maximal run of lines with flag<=0.",
      "The maximal run has length \(.max) and starts at line \(.starts[0]) and has start date \(.start_dates[0])."
    elif $l == 0 then
      "There are no lines with flag<=0."
    else
      "There are \($l) maximal runs of lines with flag<=0.",
      "These runs have length \(.max) and start at the following line numbers:",
      "\(.starts)",
      "The corresponding dates are:",
      "\(.start_dates)"
    end;

# "process" processes "tab-separated string values" on stdin
def process:

  # Given a line in the form of an array [date, datum1, flag1, ...],
  # "synopsis" returns [ number of data items on the line with flag>0, sum, number of data items on the line with flag<=0 ]
  def synopsis: # of a line
    . as $row
    | reduce range(0; (length - 1) / 2) as $i
        ( [0,0,0];
          ($row[1+ (2*$i)] | tonumber) as $datum
          | ($row[2+(2*$i)] | tonumber) as $flag
          | if ($flag>0) then .[0] += 1 | .[1] += $datum else .[2] += 1 end );

  # state: {"line": line_number,                         # (first line is line 0)
  #         "synopsis": _,                               # value returned by "synopsis"
  #         "start": line_number_of_start_of_current_run,
  #         "start_date": date_of_start_of_current_run,
  #         "length": length_of_current_run,             # so far
  #         "max": max_run_length,                       # so far
  #         "starts": array_of_start_values,             # of all the maximal runs
  #         "start_dates": array_of_start_dates          # of all the maximal runs
  #        }
  foreach ((inputs | split("\t")), null) as $line   # null signals END
    # Slots are effectively initialized by default to null
    ( { "line": -1, "length": 0, "max": 0, "starts": [], "start_dates": [] };
      if $line == null then .line = null
      else
        .line += 1
        # | debug
        # synopsis returns [number with flag>0, sum, number with flag<=0 ]
        | .synopsis = ($line | synopsis)
        | if .synopsis[2] > 0 then
            if .start then . else .start = .line | .start_date = $line[0] end
            | .length += 1
            | if .max < .length then
                (.max = .length)
                | .starts = [ .start ]
                | .start_dates = [ .start_date ]
              elif .max == .length then
                .starts += [ .start ]
                | .start_dates += [ .start_date ]
              else .
              end
          else .start = null | .length = 0
          end
      end;
      .)
  | if .line == null then {max, starts, start_dates} | report
    else .synopsis
    end;

process</lang>

{{out}}
<lang sh>$ jq -c -n -R -r -f Text_processing_1.jq readings.txt
[22,590,2]
[24,410,0]
...
[23,47.3,1]
There is one maximal run of lines with flag<=0.
The maximal run has length 93 and starts at line 5378 and has start date 2004-09-30.</lang>

=={{header|Lua}}== |