Text processing/2: Difference between revisions

Revision as of 11:41, 14 February 2024

Revision as of 23:09, 11 September 2015 by Trizen (Added the Sidef language); latest revision as of 11:41, 14 February 2024 by PureFox ({{header|Wren}}: Minor tidy)
(35 intermediate revisions by 20 users not shown)
{{task|Text processing}}
The following task concerns data that came from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, each record being a line of 49 fields separated by white-space, which can be one or more space or tab characters.

Line 6 ⟶ 7:
i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with it, in which case that instrument's value should be ignored.

A sample from the full data file [http://rosettacode.org/resources/readings.zip readings.txt], which is also used in the [[Text processing/1]] task, follows:

Data is no longer available at that link. Zipped mirror available [https://github.com/thundergnat/rc/blob/master/resouces/readings.zip here]
<pre>
1991-03-30 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1
1991-03-31 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 10.000 1 20.000 1 20.000 1 20.000 1 35.000 1 50.000 1 60.000 1 40.000 1 30.000 1 30.000 1 30.000 1 25.000 1 20.000 1 20.000 1 20.000 1 20.000 1 20.000 1 35.000 1
Line 16 ⟶ 19:
</pre>

;Task:
# Confirm the general field format of the file.
# Identify any DATESTAMPs that are duplicated.
# Report the number of records that have good readings for all instruments.
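The three task steps above can be sketched compactly before looking at the per-language entries. The following Python is only an illustrative sketch, not one of the page's submitted solutions: the function name <code>analyze</code> and the pattern <code>RECORD_RE</code> are hypothetical, and the pattern assumes a <code>YYYY-MM-DD</code> datestamp followed by exactly 24 value/flag pairs with a decimal point in every value.

```python
import re
from collections import Counter

# Assumed record shape: a YYYY-MM-DD datestamp, then 24 (value, flag) pairs
# separated by runs of spaces or tabs. This pattern is an assumption.
RECORD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(?:[ \t]+[-+]?\d+\.\d+[ \t]+-?\d+){24}$"
)


def analyze(lines):
    """Return (malformed_line_numbers, duplicated_dates, good_record_count)."""
    malformed = []      # 1-based numbers of lines that fail the format check
    seen = Counter()    # how often each datestamp occurs
    good = 0            # records whose 24 flags are all >= 1
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue    # blank lines are not treated as records
        if not RECORD_RE.match(line):
            malformed.append(number)
            continue
        fields = line.split()
        seen[fields[0]] += 1
        # Flags sit at indices 2, 4, ..., 48; a flag < 1 marks a bad reading.
        if all(int(flag) >= 1 for flag in fields[2::2]):
            good += 1
    duplicates = sorted(date for date, count in seen.items() if count > 1)
    return malformed, duplicates, good
```

A call such as <code>analyze(open("readings.txt"))</code> (the file name is taken from the task text) would return the three results in one pass. Like several entries below, this sketch counts a duplicated record as good if all of its flags are good; excluding duplicates would be a different design choice.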
<br><br>

=={{header|11l}}==
{{trans|Python}}

<syntaxhighlight lang="11l">V debug = 0B
V datePat = re:‘\d{4}-\d{2}-\d{2}’
V valuPat = re:‘[-+]?\d+\.\d+’
V statPat = re:‘-?\d+’
V totalLines = 0
Set[String] dupdate
Set[String] badform
Set[String] badlen
V badreading = 0
Set[String] datestamps

L(line) File(‘readings.txt’).read().rtrim("\n").split("\n")
   totalLines++
   V fields = line.split("\t")
   V date = fields[0]
   V pairs = (1 .< fields.len).step(2).map(i -> (@fields[i], @fields[i + 1]))

   V lineFormatOk = datePat.match(date)
                    & all(pairs.map(p -> :valuPat.match(p[0])))
                    & all(pairs.map(p -> :statPat.match(p[1])))
   I !lineFormatOk
      I debug
         print(‘Bad formatting ’line)
      badform.add(date)

   I pairs.len != 24 | any(pairs.map(p -> Int(p[1]) < 1))
      I debug
         print(‘Missing values ’line)
   I pairs.len != 24
      badlen.add(date)
   I any(pairs.map(p -> Int(p[1]) < 1))
      badreading++

   I date C datestamps
      I debug
         print(‘Duplicate datestamp ’line)
      dupdate.add(date)

   datestamps.add(date)

print("Duplicate dates:\n "sorted(Array(dupdate)).join("\n "))
print("Bad format:\n "sorted(Array(badform)).join("\n "))
print("Bad number of fields:\n "sorted(Array(badlen)).join("\n "))
print("Records with good readings: #. = #2.2%\n".format(
   totalLines - badreading, (totalLines - badreading) / Float(totalLines) * 100))
print(‘Total records: ’totalLines)</syntaxhighlight>

{{out}}
<pre>
Duplicate dates:
 1990-03-25
 1991-03-31
 1992-03-29
 1993-03-28
 1995-03-26
Bad format:

Bad number of fields:

Records with good readings: 5017 = 91.70%

Total records: 5471
</pre>

=={{header|Ada}}==
{{libheader|Simple components for Ada}}
<syntaxhighlight lang="ada">with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO; use Ada.Text_IO;
with Strings_Edit; use Strings_Edit;
Line 85 ⟶ 156:
Close (File);
Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");
end Data_Munging_2;</syntaxhighlight>
Sample output
<pre>
Line 97 ⟶ 168:

=={{header|Aime}}==
<syntaxhighlight lang="aime">check_format(list l)
{
    integer i;
    text s;

    if (~l != 49) {
        error("bad field count");
    }

    s = l[0];
    if (match("????-??-??", s)) {
        error("bad date format");
    }

    l[0] = s.delete(7).delete(4).atoi;

    i = 1;
    while (i < 49) {
        atof(l[i]);
        i += 1;
        l[i >> 1] = atoi(l[i]);
        i += 1;
    }

    l.erase(25, -1);
}

main(void)
{
    integer goods, i, v;
    file f;
    list l;
    index x;

    goods = 0;

    f.affix("readings.txt");

    while (f.list(l, 0) != -1) {
        if (!trap(check_format, l)) {
            if ((x[v = lf_x_integer(l)] += 1) != 1) {
                v_form("duplicate ~ line\n", v);
            }

            i = 1;
            l.ucall(min_i, 1, i);
            goods += iclip(0, i, 1);
        }
    }

    o_(goods, " good lines\n");

    0;
}</syntaxhighlight>
{{out}} (the "readings.txt" file needs to be converted to UNIX end-of-line)
<pre>duplicate 19900325 line
duplicate 19910331 line
duplicate 19920329 line
duplicate 19930328 line
duplicate 19950326 line
5017 good lines</pre>

=={{header|Amazing Hopper}}==
{{Trans|AWK}}
<syntaxhighlight lang="c">
#include <basico.h>

algoritmo

    número de campos correcto = `awk 'NF != 49' basica/readings.txt`

    fechas repetidas = `awk '++count[$1] >= 2{print $1, "(",count[$1],")"}' basica/readings.txt`

    resultados buenos = `awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' basica/readings.txt`

    "Check field number by line: ", #( !(number(número de campos correcto)) ? "Ok\n" : "Nok\n";),\
    "\nCheck duplicated dates:\n", fechas repetidas,NL, \
    "Number of records have good readings for all instruments:\n",resultados buenos,\
    "(including " fijar separador( NL ) contar tokens en 'fechas repetidas' " duplicated records)\n", luego imprime todo

terminar
</syntaxhighlight>
{{out}}
<pre>
Check field number by line: Ok

Check duplicated dates:
1990-03-25 ( 2 )
1991-03-31 ( 2 )
1992-03-29 ( 2 )
1993-03-28 ( 2 )
1995-03-26 ( 2 )
Number of records have good readings for all instruments:
Total records 5471 OK records 5017 or 91,7017 %
(including 5 duplicated records)
</pre>

=={{header|AutoHotkey}}==
<syntaxhighlight lang="autohotkey">; Author: AlephX Aug 17 2011
data = %A_scriptdir%\readings.txt
Line 223 ⟶ 325:
msgbox, Duplicate Dates:`n%wrongDates%`nRead Lines: %lines%`nValid Lines: %valid%`nwrong lines: %totwrong%`nDuplicates: %TotWrongDates%`nWrong Formatted: %unvalidformat%`n
</syntaxhighlight>

Sample Output:
Line 250 ⟶ 352:
If there are any scientific notation fields then there will be an e in the file:
<syntaxhighlight lang="awk">bash$ awk '/[eE]/' readings.txt
bash$</syntaxhighlight>
Quick check on the number of fields:
<syntaxhighlight lang="awk">bash$ awk 'NF != 49' readings.txt
bash$</syntaxhighlight>
Full check on the file format using a regular expression:
<syntaxhighlight lang="awk">bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt
bash$</syntaxhighlight>
Full check on the file format as above but using regular expressions allowing intervals (gnu awk):
<syntaxhighlight lang="awk">bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}+$/ )' readings.txt
bash$</syntaxhighlight>
Line 266 ⟶ 368:
Accomplished by counting how many times the first field occurs and noting any second occurrences.
<syntaxhighlight lang="awk">bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
Line 272 ⟶ 374:
1993-03-28
1995-03-26
bash$</syntaxhighlight>
Line 278 ⟶ 380:
<div style="width:100%;overflow:scroll">
<syntaxhighlight lang="awk">bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' readings.txt
Total records 5471 OK records 5017 or 91.7017 %
bash$</syntaxhighlight>
</div>

=={{header|C}}==
<syntaxhighlight lang="c">#include <stdio.h>
#include <string.h>
#include <stdlib.h>
Line 367 ⟶ 469:
read_file("readings.txt");
return 0;
}</syntaxhighlight>

{{out}}
Line 378 ⟶ 480:
5017 out 5471 lines good
</pre>

=={{header|C sharp|C#}}==
<syntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
Line 519 ⟶ 554:
}
}
}</syntaxhighlight>
<pre>
Line 528 ⟶ 563:
1993-03-28 is duplicated at Lines : 1183,1184
1995-03-26 is duplicated at Lines : 1910,1911
</pre>

=={{header|C++}}==
{{libheader|Boost}}
<syntaxhighlight lang="cpp">#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <cstdlib>
#include <algorithm>
#include <iterator>
using namespace std ;

boost::regex e ( "\\s+" ) ;

int main( int argc , char *argv[ ] ) {
   ifstream infile( argv[ 1 ] ) ;
   vector<string> duplicates ;
   set<string> datestamps ; //for the datestamps
   if ( ! infile.is_open( ) ) {
      cerr << "Can't open file " << argv[ 1 ] << '\n' ;
      return 1 ;
   }
   int all_ok = 0 ;//all_ok for lines in the given pattern e
   int pattern_ok = 0 ; //overall field pattern of record is ok
   while ( infile ) {
      string eingabe ;
      getline( infile , eingabe ) ;
      boost::sregex_token_iterator i ( eingabe.begin( ), eingabe.end( ) , e , -1 ), j ;//we tokenize on empty fields
      vector<string> fields( i, j ) ;
      if ( fields.size( ) == 49 ) //we expect 49 fields in a record
         pattern_ok++ ;
      else
         cout << "Format not ok!\n" ;
      if ( datestamps.insert( fields[ 0 ] ).second ) { //not duplicated
         int howoften = ( fields.size( ) - 1 ) / 2 ;//number of measurement
                                                    //devices and values
         for ( int n = 1 ; atoi( fields[ 2 * n ].c_str( ) ) >= 1 ; n++ ) {
            if ( n == howoften ) {
               all_ok++ ;
               break ;
            }
         }
      }
      else {
         duplicates.push_back( fields[ 0 ] ) ;//first field holds datestamp
      }
   }
   infile.close( ) ;
   cout << "The following " << duplicates.size() << " datestamps were duplicated:\n" ;
   copy( duplicates.begin( ) , duplicates.end( ) ,
         ostream_iterator<string>( cout , "\n" ) ) ;
   cout << all_ok << " records were complete and ok!\n" ;
   return 0 ;
}</syntaxhighlight>

{{out}}
<pre>
Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31
</pre>

=={{header|Clojure}}==
<syntaxhighlight lang="clojure">
(defn parse-line [s]
  (let [[date & data-toks] (str/split s #"\s+")
        data-fields (map read-string data-toks)
        valid-date? (fn [s] (re-find #"\d{4}-\d{2}-\d{2}" s))
        valid-line? (and (valid-date? date)
                         (= 48 (count data-toks))
                         (every? number? data-fields))
        readings (for [[v flag] (partition 2 data-fields)]
                   {:val v :flag flag})]
    (when (not valid-line?)
      (println "Malformed Line: " s))
    {:date date
     :no-missing-readings? (and (= 48 (count data-toks))
                                (every? pos? (map :flag readings)))}))

(defn analyze-file [path]
  (reduce (fn [m line]
            (let [{:keys [all-dates dupl-dates n-full-recs invalid-lines]} m
                  this-date (:date line)
                  dupl? (contains? all-dates this-date)
                  full? (:no-missing-readings? line)]
              (cond-> m
                dupl? (update-in [:dupl-dates] conj this-date)
                full? (update-in [:n-full-recs] inc)
                true (update-in [:all-dates] conj this-date))))
          {:dupl-dates #{} :all-dates #{} :n-full-recs 0}
          (->> (slurp path)
               clojure.string/split-lines
               (map parse-line))))

(defn report-summary [path]
  (let [m (analyze-file path)]
    (println (format "%d unique dates" (count (:all-dates m))))
    (println (format "%d duplicated dates [%s]"
                     (count (:dupl-dates m))
                     (clojure.string/join " " (sort (:dupl-dates m)))))
    (println (format "%d lines with no missing data" (:n-full-recs m)))))
</syntaxhighlight>

{{out}}
<pre>
5466 unique dates
5 duplicated dates [1990-03-25 1991-03-31 1992-03-29 1993-03-28 1995-03-26]
5017 lines with no missing data
</pre>

=={{header|COBOL}}==
{{works with|OpenCOBOL}}
<syntaxhighlight lang="cobol">       IDENTIFICATION DIVISION.
       PROGRAM-ID. text-processing-2.
Line 694 ⟶ 845:
           INSPECT input-data (offset:) TALLYING data-len
               FOR CHARACTERS BEFORE delim
           .</syntaxhighlight>

{{out}}
Line 709 ⟶ 860:

=={{header|D}}==
<syntaxhighlight lang="d">void main() {
    import std.stdio, std.array, std.string, std.regex, std.conv, std.algorithm;
Line 745 ⟶ 896:
            repeatedDates.byKey.filter!(k => repeatedDates[k] > 1));
    writeln("Good reading records: ", goodReadings);
}</syntaxhighlight>
{{out}}
<pre>Duplicated timestamps: 1990-03-25, 1991-03-31, 1992-03-29, 1993-03-28, 1995-03-26
Line 751 ⟶ 902:

=={{header|Eiffel}}==
<syntaxhighlight lang="eiffel">
class
	APPLICATION
Line 867 ⟶ 1,018:
end
</syntaxhighlight>
{{out}}
<pre>
Line 887 ⟶ 1,038:

=={{header|Erlang}}==
Uses function from [[Text_processing/1]]. It does some correctness checks for us.
<syntaxhighlight lang="erlang">
-module( text_processing2 ).
Line 919 ⟶ 1,070:
value_flag_records() -> 24.
</syntaxhighlight>
{{out}}
<pre>
Line 928 ⟶ 1,079:

=={{header|F Sharp|F#}}==
<syntaxhighlight lang="fsharp">
let file = @"readings.txt"
Line 948 ⟶ 1,099:
ok <- ok + 1
printf "%d records were ok\n" ok
</syntaxhighlight>
Prints:
<syntaxhighlight lang="fsharp">
Date 1990-03-25 is duplicated
Date 1991-03-31 is duplicated
Line 957 ⟶ 1,108:
Date 1995-03-26 is duplicated
5017 records were ok
</syntaxhighlight>

=={{header|Factor}}==
{{works with|Factor|0.99 2020-03-02}}
<syntaxhighlight lang="factor">USING: io io.encodings.ascii io.files kernel math math.parser
prettyprint sequences sequences.extras sets splitting ;

: check-format ( seq -- )
    [ " \t" split length 49 = ] all?
    "Format okay." "Format not okay." ? print ;

"readings.txt" ascii file-lines [ check-format ] keep
[ "Duplicates:" print [ "\t" split1 drop ] map duplicates . ]
[ [ " \t" split rest <odds> [ string>number 0 <= ] none? ] count ]
bi pprint " records were good." print</syntaxhighlight>
{{out}}
<pre>
Format okay.
Duplicates:
{ "1990-03-25" "1991-03-31" "1992-03-29" "1993-03-28" "1995-03-26" }
5017 records were good.
</pre>

=={{header|Fortran}}==
The trouble with the dates rather suggests that they should be checked for correctness in themselves, and that the sequence check should be that each new record advances the date by one day. Daynumber calculations were long ago presented by H. F. Fliegel and T.C. van Flandern, in Communications of the ACM, Vol. 11, No. 10 (October, 1968).
Line 963 ⟶ 1,142:
Rather than copy today's data to a PDATA holder so that on the next read the new data may be compared to the old, a two-row array is used, with IT flip-flopping 1,2,1,2,1,2,... Comparison of the data as numerical values rather than text strings means that different texts that evoke the same value will not be regarded as different. If the data format were invalid, there would be horrible messages. There aren't, so ... the values should be read and plotted...
<syntaxhighlight lang="fortran">
Crunches a set of hourly data. Starts with a date, then 24 pairs of value,indicator for that day, on one line.
      INTEGER Y,M,D		!Year, month, and day.
Line 1,026 ⟶ 1,205:
  900 CLOSE(IN)		!Done.
      END	!Spaghetti rules.
</syntaxhighlight>

Output:
Line 1,038 ⟶ 1,217:

=={{header|Go}}==
<syntaxhighlight lang="go">package main

import (
Line 1,112 ⟶ 1,291:
    fmt.Println(uniqueGood,
        "unique dates with good readings for all instruments.")
}</syntaxhighlight>
{{out}}
<pre>
Line 1,127 ⟶ 1,306:

=={{header|Haskell}}==
<syntaxhighlight lang="haskell">
import Data.List (nub, (\\))
Line 1,146 ⟶ 1,325:
  putStr (unlines ("duplicated dates:": duplicatedDates (map date inputs)))
  putStrLn ("number of good records: " ++ show (length $ goodRecords inputs))
</syntaxhighlight>

this script outputs:
Line 1,162 ⟶ 1,341:
duplicated timestamps that are on well-formed records.
<syntaxhighlight lang="unicon">procedure main(A)
    dups := set()
    goodRecords := 0
Line 1,194 ⟶ 1,373:
        }
end</syntaxhighlight>

Sample run:
Line 1,207 ⟶ 1,386:

=={{header|J}}==
<syntaxhighlight lang="j">   require 'tables/dsv dates'
   dat=: TAB readdsv jpath '~temp/readings.txt'
   Dates=: getdate"1 >{."1 dat
Line 1,226 ⟶ 1,405:
1992 3 29
1993 3 28
1995 3 26</syntaxhighlight>

=={{header|Java}}==
{{trans|C++}}
{{works with|Java|1.5+}}
<syntaxhighlight lang="java5">import java.util.*;
import java.util.regex.*;
import java.io.*;
Line 1,274 ⟶ 1,453:
      }
   }
}</syntaxhighlight>
The program produces the following output:
<pre>
Line 1,288 ⟶ 1,467:

=={{header|JavaScript}}==
{{works with|JScript}}
<syntaxhighlight lang="javascript">// wrap up the counter variables in a closure.
function analyze_func(filename) {
    var dates_seen = {};
Line 1,337 ⟶ 1,516:
var analyze = analyze_func('readings.txt');
analyze();</syntaxhighlight>

=={{header|jq}}==
{{works with|jq|with regex support}}

For this problem, it is convenient to use jq in a pipeline: the first invocation of jq will convert the text file into a stream of JSON arrays (one array per line):
<syntaxhighlight lang="sh">$ jq -R '[splits("[ \t]+")]' Text_processing_2.txt</syntaxhighlight>

The second part of the pipeline performs the task requirements. The following program is used in the second invocation of jq.

'''Generic Utilities'''
<syntaxhighlight lang="jq"># Given any array, produce an array of [item, count] pairs for each run.
def runs:
  reduce .[] as $item
Line 1,362 ⟶ 1,542:

def is_integral: test("^[-+]?[0-9]+$");

def is_date: test("[12][0-9]{3}-[0-9][0-9]-[0-9][0-9]");</syntaxhighlight>

'''Validation''':
<syntaxhighlight lang="jq"># Report line and column numbers using conventional numbering (IO=1).
def validate_line(nr):
  def validate_date:
Line 1,383 ⟶ 1,563:

def validate_lines:
 . as $in
 | range(0; length) as $i
 | ($in[$i] | validate_line($i + 1));</syntaxhighlight>

'''Check for duplicate timestamps'''
<syntaxhighlight lang="jq">def duplicate_timestamps:
  [.[][0]] | sort | runs | map( select(.[1]>1) );</syntaxhighlight>

'''Number of valid readings for all instruments''':
<syntaxhighlight lang="jq"># The following ignores any issues with respect to duplicate dates,
# but does check the validity of the record, including the date format:
def number_of_valid_readings:
Line 1,400 ⟶ 1,580:
    and all(range(0; 24) | $in[2 * . + 2] | (is_integral and tonumber >= 1) );

  map(select(check)) | length ;</syntaxhighlight>

'''Generate Report'''
<syntaxhighlight lang="jq">validate_lines,
"\nChecking for duplicate timestamps:",
duplicate_timestamps,
"\nThere are \(number_of_valid_readings) valid rows altogether."</syntaxhighlight>
{{out}}
'''Part 1: Simple demonstration'''

To illustrate that the program does report invalid lines, we first use the six lines at the top but mangle the last line.
<syntaxhighlight lang="sh">$ jq -R '[splits("[ \t]+")]' Text_processing_2.txt | jq -s -r -f Text_processing_2.jq
field 1 in line 6 has an invalid date: 991-04-03
line 6 has 47 fields
Line 1,426 ⟶ 1,606:
]

There are 5 valid rows altogether.</syntaxhighlight>

'''Part 2: readings.txt'''
<syntaxhighlight lang="sh">$ jq -R '[splits("[ \t]+")]' readings.txt | jq -s -r -f Text_processing_2.jq
Checking for duplicate timestamps:
[
Line 1,454 ⟶ 1,634:
]

There are 5017 valid rows altogether.</syntaxhighlight>

=={{header|Julia}}==
Refer to the code at https://rosettacode.org/wiki/Text_processing/1#Julia.
Add at the end of that code the following:
<syntaxhighlight lang="julia">
dupdate = df[nonunique(df[:,[:Date]]),:][:Date]
println("The following rows have duplicate DATESTAMP:")
println(df[df[:Date] .== dupdate,:])
println("All values good in these rows:")
println(df[df[:ValidValues] .== 24,:])
</syntaxhighlight>
{{output}}
<pre>
The following rows have duplicate DATESTAMP:
2×29 DataFrames.DataFrame
│ Row │ Date                │ Mean    │ ValidValues │ MaximumGap │ GapPosition │ 0:00 │ 1:00 │ 2:00 │ 3:00 │ 4:00 │
├─────┼─────────────────────┼─────────┼─────────────┼────────────┼─────────────┼──────┼──────┼──────┼──────┼──────┤
│ 1   │ 1991-03-31T00:00:00 │ 23.5417 │ 24          │ 0          │ 0           │ 10.0 │ 10.0 │ 10.0 │ 10.0 │ 10.0 │
│ 2   │ 1991-03-31T00:00:00 │ 40.0    │ 1           │ 23         │ 2           │ 40.0 │ NaN  │ NaN  │ NaN  │ NaN  │

│ Row │ 5:00 │ 6:00 │ 7:00 │ 8:00 │ 9:00 │ 10:00 │ 11:00 │ 12:00 │ 13:00 │ 14:00 │ 15:00 │ 16:00 │ 17:00 │ 18:00 │
├─────┼──────┼──────┼──────┼──────┼──────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 10.0 │ 10.0 │ 20.0 │ 20.0 │ 20.0 │ 35.0  │ 50.0  │ 60.0  │ 40.0  │ 30.0  │ 30.0  │ 30.0  │ 25.0  │ 20.0  │
│ 2   │ NaN  │ NaN  │ NaN  │ NaN  │ NaN  │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │

│ Row │ 19:00 │ 20:00 │ 21:00 │ 22:00 │ 23:00 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 20.0  │ 20.0  │ 20.0  │ 20.0  │ 35.0  │
│ 2   │ NaN   │ NaN   │ NaN   │ NaN   │ NaN   │
All values good in these rows:
4×29 DataFrames.DataFrame
│ Row │ Date                │ Mean    │ ValidValues │ MaximumGap │ GapPosition │ 0:00 │ 1:00 │ 2:00 │ 3:00 │ 4:00 │
├─────┼─────────────────────┼─────────┼─────────────┼────────────┼─────────────┼──────┼──────┼──────┼──────┼──────┤
│ 1   │ 1991-03-30T00:00:00 │ 10.0    │ 24          │ 0          │ 0           │ 10.0 │ 10.0 │ 10.0 │ 10.0 │ 10.0 │
│ 2   │ 1991-03-31T00:00:00 │ 23.5417 │ 24          │ 0          │ 0           │ 10.0 │ 10.0 │ 10.0 │ 10.0 │ 10.0 │
│ 3   │ 1991-04-02T00:00:00 │ 19.7917 │ 24          │ 0          │ 0           │ 8.0  │ 9.0  │ 11.0 │ 12.0 │ 12.0 │
│ 4   │ 1991-04-03T00:00:00 │ 13.9583 │ 24          │ 0          │ 0           │ 10.0 │ 9.0  │ 10.0 │ 10.0 │ 9.0  │

│ Row │ 5:00 │ 6:00 │ 7:00 │ 8:00 │ 9:00 │ 10:00 │ 11:00 │ 12:00 │ 13:00 │ 14:00 │ 15:00 │ 16:00 │ 17:00 │ 18:00 │
├─────┼──────┼──────┼──────┼──────┼──────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 10.0 │ 10.0 │ 10.0 │ 10.0 │ 10.0 │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │
│ 2   │ 10.0 │ 10.0 │ 20.0 │ 20.0 │ 20.0 │ 35.0  │ 50.0  │ 60.0  │ 40.0  │ 30.0  │ 30.0  │ 30.0  │ 25.0  │ 20.0  │
│ 3   │ 12.0 │ 27.0 │ 26.0 │ 27.0 │ 33.0 │ 32.0  │ 31.0  │ 29.0  │ 31.0  │ 25.0  │ 25.0  │ 24.0  │ 21.0  │ 17.0  │
│ 4   │ 10.0 │ 15.0 │ 24.0 │ 28.0 │ 24.0 │ 18.0  │ 14.0  │ 12.0  │ 13.0  │ 14.0  │ 15.0  │ 14.0  │ 15.0  │ 13.0  │

│ Row │ 19:00 │ 20:00 │ 21:00 │ 22:00 │ 23:00 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 10.0  │ 10.0  │ 10.0  │ 10.0  │ 10.0  │
│ 2   │ 20.0  │ 20.0  │ 20.0  │ 20.0  │ 35.0  │
│ 3   │ 14.0  │ 15.0  │ 12.0  │ 12.0  │ 10.0  │
│ 4   │ 13.0  │ 13.0  │ 12.0  │ 10.0  │ 10.0  │
</pre>

=={{header|Kotlin}}==
<syntaxhighlight lang="scala">// version 1.2.31

import java.io.File

fun main(args: Array<String>) {
    val rx = Regex("""\s+""")
    val file = File("readings.txt")
    var count = 0
    var invalid = 0
    var allGood = 0
    var map = mutableMapOf<String, Int>()
    file.forEachLine { line ->
        count++
        val fields = line.split(rx)
        val date = fields[0]
        if (fields.size == 49) {
            if (map.containsKey(date))
                map[date] = map[date]!! + 1
            else
                map.put(date, 1)
            var good = 0
            for (i in 2 until fields.size step 2) {
                if (fields[i].toInt() >= 1) {
                    good++
                }
            }
            if (good == 24) allGood++
        }
        else invalid++
    }

    println("File = ${file.name}")
    println("\nDuplicated dates:")
    for ((k,v) in map) {
        if (v > 1) println(" $k ($v times)")
    }
    println("\nTotal number of records : $count")
    var percent = invalid.toDouble() / count * 100.0
    println("Number of invalid records : $invalid (${"%5.2f".format(percent)}%)")
    percent = allGood.toDouble() / count * 100.0
    println("Number which are all good : $allGood (${"%5.2f".format(percent)}%)")
}</syntaxhighlight>

{{out}}
<pre>
File = readings.txt

Duplicated dates:
 1990-03-25 (2 times)
 1991-03-31 (2 times)
 1992-03-29 (2 times)
 1993-03-28 (2 times)
 1995-03-26 (2 times)

Total number of records : 5471
Number of invalid records : 0 ( 0.00%)
Number which are all good : 5017 (91.70%)
</pre>

=={{header|Lua}}==
<syntaxhighlight lang="lua">filename = "readings.txt"
io.input( filename )
Line 1,499 ⟶ 1,791:
for i = 1, #bad_format do
    print( " ", bad_format[i] )
end</syntaxhighlight>
Output:
<pre>Lines read: 5471
Line 1,512 ⟶ 1,804:
</pre>

=={{header|M2000 Interpreter}}==
File is in user dir.
Use Win Dir$ to open the Explorer window, then copy readings.txt there.
<syntaxhighlight lang="m2000 interpreter">Module TestThis {
      Document a$, exp$
      \\ automatic find the encoding and the line break
      Load.doc a$, "readings.txt"
      m=0
      n=doc.par(a$)
      k=list
      nl$={
      }
      l=0
      exp$=format$("Records: {0}", n)+nl$
      For i=1 to n
            b$=paragraph$(a$, i)
            If exist(k,Left$(b$, 10)) then
                  m++ : where=eval(k)
                  exp$=format$("Duplicate for {0} at {1}",where, i)+nl$
            Else
                  Append k, Left$(b$, 10):=i
            End if
            Stack New {
                  Stack Mid$(Replace$(chr$(9)," ", b$), 11)
                  while not empty {
                        Read a, b
                        if b<=0 then l++ : exit
                  }
            }
      Next
      exp$= format$("Duplicates {0}",m)+nl$
      exp$= format$("Valid Records {0}",n-l)+nl$
      clipboard exp$
      report exp$
}
TestThis
</syntaxhighlight>
{{out}}
<pre>
Records: 5471
Duplicate for 84 at 85
Duplicate for 455 at 456
Duplicate for 819 at 820
Duplicate for 1183 at 1184
Duplicate for 1910 at 1911
Duplicates 5
Valid Records 5017
</pre>

=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">data = Import["Readings.txt","TSV"]; Print["duplicated dates: "];
Select[Tally@data[[;;,1]], #[[2]]>1&][[;;,1]]//Column
Print["number of good records: ", Count[(Times@@#[[3;;All;;2]])& /@ data, 1],
" (out of a total of ", Length[data], ")"]</syntaxhighlight>
{{out}}
<pre>duplicated dates:
1990-03-25
Line 1,525 ⟶ 1,866:
1993-03-28
1995-03-26
number of good records: 5017 (out of a total of 5471)</pre>

=={{header|MATLAB}} / {{header|Octave}}==
<syntaxhighlight lang="matlab">function [val,count] = readdat(configfile)
% READDAT reads readings.txt file
%
Line 1,558 ⟶ 1,898:
dix = find(diff(d)==0) % check for two consecutive timestamps with zero difference

printf('number of valid records: %i\n ', sum( all( val(:,5:2:end) >= 1, 2) ) );</syntaxhighlight>

<pre>>> [val,count]=readdat;
Line 1,571 ⟶ 1,911:
number of valid records: 5017
</pre>

=={{header|Nim}}==
<syntaxhighlight lang="nim">import strutils, tables

const NumFields = 49
const DateField = 0
const FlagGoodValue = 1

var badRecords: int       # Number of records that have invalid formatted values.
var totalRecords: int     # Total number of records in the file.
var badInstruments: int   # Total number of records that have at least one instrument showing error.
var seenDates: Table[string, bool]  # Table to keep track of what dates we have seen.


proc checkFloats(floats: seq[string]): bool =
  ## Ensure we can parse all records as floats (except the date stamp).
  for index in 1..<NumFields:
    try:
      # We're assuming all instrument flags are floats not integers.
      discard parseFloat(floats[index])
    except ValueError:
      return false
  true


proc areAllFlagsOk(instruments: seq[string]): bool =
  ## Ensure that all sensor flags are ok.

  # Flags start at index 2, and occur every 2 fields.
  for index in countup(2, NumFields, 2):
    # We're assuming all instrument flags are floats not integers
    var flag = parseFloat(instruments[index])
    if flag < FlagGoodValue: return false
  true


# Note: we're not checking the format of the date stamp.

# Main.

var currentLine = 0
for line in "readings.txt".lines:
  currentLine.inc
  if line.len == 0: continue  # Empty lines don't count as records.

  var tokens = line.split({' ', '\t'})
  totalRecords.inc

  if tokens.len != NumFields:
    badRecords.inc
    continue

  if not checkFloats(tokens):
    badRecords.inc
    continue

  if not areAllFlagsOk(tokens):
    badInstruments.inc

  if seenDates.hasKeyOrPut(tokens[DateField], true):
    echo tokens[DateField], " duplicated on line ", currentLine

let goodRecords = totalRecords - badRecords
let goodInstruments = goodRecords - badInstruments

echo "Total Records: ", totalRecords
echo "Records with wrong format: ", badRecords
echo "Records where all instruments were OK: ", goodInstruments</syntaxhighlight>

{{out}}
<pre>1990-03-25 duplicated on line 85
1991-03-31 duplicated on line 456
1992-03-29 duplicated on line 820
1993-03-28 duplicated on line 1184
1995-03-26 duplicated on line 1911
Total Records: 5471
Records with wrong format: 0
Records where all instruments were OK: 5017</pre>

=={{header|OCaml}}==
<syntaxhighlight lang="ocaml">#load "str.cma"
open Str

let strip_cr str =
  let last = pred (String.length str) in
  if str.[last] <> '\r' then (str) else (String.sub str 0 last)

let map_records =
Line 1,586 ⟶ 2,002:
        aux (e::acc) tail
    | [_] -> invalid_arg "invalid data"
    | [] -> (List.rev acc)
  in
  aux [] ;;
Line 1,599 ⟶ 2,015:
        aux acc tl
    | [] -> (List.rev acc)
  in
  aux [] ;;

let record_ok (_,record) =
  let is_ok (_,v) = (v >= 1) in
  let sum_ok =
    List.fold_left (fun sum this ->
      if is_ok this then succ sum else sum) 0 record in
  (sum_ok = 24)

let num_good_records =
Line 1,618 ⟶ 2,034:
  let li = split (regexp "[ \t]+") line in
  let records = map_records (List.tl li)
  and date = (List.hd li) in
  (date, records)
Line 1,624 ⟶ 2,040:
  let ic = open_in "readings.txt" in
  let rec read_loop acc =
    let line_opt = try Some (strip_cr (input_line ic))
                   with End_of_file -> None
    in
    match line_opt with
      None -> close_in ic; List.rev acc
    | Some line -> read_loop (parse_line line :: acc)
  in
  let inputs = read_loop [] in
Line 1,640 ⟶ 2,056:
  Printf.printf "number of good records: %d\n" (num_good_records inputs);
;;</syntaxhighlight>

this script outputs:
Line 1,654 ⟶ 2,070:

=={{header|Perl}}==
<syntaxhighlight lang="perl">use List::MoreUtils 'natatime';
use constant FIELDS => 49;
Line 1,681 ⟶ 2,097:
  map {" $_\n"}
  grep {$dates{$_} > 1}
  sort keys %dates;</syntaxhighlight>

Output:
Line 1,692 ⟶ 2,108:
1995-03-26</pre>

=={{header|Phix}}==
<!--<syntaxhighlight lang="phix">(phixonline)-->
 <span style="color: #000080;font-style:italic;">-- demo\rosetta\TextProcessing2.exw</span>
 <span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span> <span style="color: #000080;font-style:italic;">-- (include version/first of next three lines only)</span>
 <span style="color: #008080;">include</span> <span style="color: #000000;">readings</span><span style="color: #0000FF;">.</span><span style="color: #000000;">e</span> <span style="color: #000080;font-style:italic;">-- global constant lines, or:
 --assert(write_lines("readings.txt",lines)!=-1) -- first run, then:
 --constant lines = read_lines("readings.txt")</span>
 
 <span style="color: #008080;">include</span> <span style="color: #000000;">builtins</span><span style="color: #0000FF;">\</span><span style="color: #004080;">timedate</span><span style="color: #0000FF;">.</span><span style="color: #000000;">e</span>
 
 <span style="color: #004080;">integer</span> <span style="color: #000000;">all_good</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span>
 
 <span style="color: #004080;">string</span> <span style="color: #000000;">fmt</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">"%d-%d-%d\t"</span><span style="color: #0000FF;">&</span><span style="color: #7060A8;">join</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">repeat</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"%f"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">48</span><span style="color: #0000FF;">),</span><span style="color: #008000;">'\t'</span><span style="color: #0000FF;">)</span>
 <span style="color: #004080;">sequence</span> <span style="color: #000000;">extset</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">sq_mul</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">tagset</span><span style="color: #0000FF;">(</span><span style="color: #000000;">24</span><span style="color: #0000FF;">),</span><span style="color: #000000;">2</span><span style="color: #0000FF;">),</span> <span style="color: #000080;font-style:italic;">-- {2,4,6,..48}</span>
          <span style="color: #000000;">curr</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">last</span>
 
 <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">lines</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
     <span style="color: #004080;">string</span> <span style="color: #000000;">li</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">lines</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
     <span style="color: #004080;">sequence</span> <span style="color: #000000;">r</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">scanf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">li</span><span style="color: #0000FF;">,</span><span style="color: #000000;">fmt</span><span style="color: #0000FF;">)</span>
     <span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">)!=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
         <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"bad line [%d]:%s\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">i</span><span style="color: #0000FF;">,</span><span style="color: #000000;">li</span><span style="color: #0000FF;">})</span>
     <span style="color: #008080;">else</span>
         <span style="color: #000000;">curr</span> <span style="color: 
#0000FF;">=</span> <span style="color: #000000;">r</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">][</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..</span><span style="color: #000000;">3</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">></span><span style="color: #000000;">1</span> <span style="color: #008080;">and</span> <span style="color: #000000;">curr</span><span style="color: #0000FF;">=</span><span style="color: #000000;">last</span> <span style="color: #008080;">then</span> ~~say 'Good records: ', $good-records;~~ <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"duplicate line for %04d/%02d/%02d\n"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">last</span><span style="color: #0000FF;">)</span> ~~say 'Repeated timestamps:';~~ <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> ~~say ' ', $_ for grep { %dates{$_} > 1 }, sort keys %dates;</lang>~~ <span style="color: #000000;">last</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">curr</span> <span style="color: #000000;">all_good</span> <span style="color: #0000FF;">+=</span> <span style="color: #7060A8;">sum</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">sq_le</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">extract</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">][</span><span style="color: #000000;">4</span><span style="color: #0000FF;">..$],</span><span style="color: #000000;">extset</span><span 
style="color: #0000FF;">),</span><span style="color: #000000;">0</span><span style="color: #0000FF;">))=</span><span style="color: #000000;">0</span> ~~Output:~~ <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> ~~<pre>Good records: 5017~~ <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> ~~Repeated timestamps:~~ ~~1990-03-25~~ <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"Valid records %d of %d total\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">all_good</span><span style="color: #0000FF;">,</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">lines</span><span style="color: #0000FF;">)})</span> ~~1991-03-31~~ ~~1992-03-29~~ <span style="color: #0000FF;">?</span><span style="color: #008000;">"done"</span> ~~1993-03-28~~ <span style="color: #0000FF;">{}</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">wait_key</span><span style="color: #0000FF;">()</span> ~~1995-03-26</pre>~~ <!--</syntaxhighlight>--> ~~The first version demonstrates that you can program Perl 6 almost like Perl 5. 
Here's a more idiomatic Perl 6 version that runs several times faster:~~ {{out}} ~~<lang perl6>my $good-records;~~ <pre> ~~my $line;~~ duplicate line for 1990/03/25 ~~my %dates;~~ duplicate line for 1991/03/31 duplicate line for 1992/03/29 ~~for lines() {~~ duplicate line for 1993/03/28 ~~$line++;~~ duplicate line for 1995/03/26 ~~/ ^~~ Valid records 5017 of 5471 total ~~(\d * 4 '-' \d\d '-' \d\d)~~ </pre> ~~[ \h+ \d+'.'\d+ \h+ ('-'?\d+) ] ** 24~~ ~~$ /~~ ~~or note "Bad format at line $line" and next;~~ ~~%dates.push: $0 => $line;~~ ~~$good-records++ if $1.all >= 1;~~ } ~~say "$good-records good records out of $line total";~~ ~~say 'Repeated timestamps (with line numbers):';~~ ~~.say for sort %dates.pairs.grep: .value.elems > 1;</lang>~~ ~~Output:~~ ~~<pre>5017 good records out of 5471 total~~ ~~Repeated timestamps (with line numbers):~~ ~~1990-03-25 84 85~~ ~~1991-03-31 455 456~~ ~~1992-03-29 819 820~~ ~~1993-03-28 1183 1184~~ ~~1995-03-26 1910 1911</pre>~~ ~~Note how this version does validation with a single Perl 6 regex that is much more readable than the typical regex, and arguably expresses the data structure more straightforwardly.~~ ~~Here we use normal quotes for literals, and <tt>\h</tt> for horizontal whitespace.~~ ~~Variables like <tt>$good-record</tt> that are going to be autoincremented do not need to be initialized. (Perl 6 allows hyphens in variable names, as you can see.)~~ The <tt>.push</tt> method on a hash is magical and loses no information; if a duplicate key is found in the pushed pair, an array of values is automatically created of the old value and the new value pushed. Hence we can easily track all the lines that a particular duplicate occurred at. The <tt>.all</tt> method does "junctional" logic: it autothreads through comparators as any English speaker would expect. Junctions can also short-circuit as soon as they find a value that doesn't match, and the evaluation order is up to the computer, so it can be optimized or parallelized. 
The final line simply greps out the pairs from the hash whose value is an array with more than 1 element. (Those values that are not arrays nevertheless have a <tt>.elems</tt> method that always reports <tt>1</tt>.) The <tt>.pairs</tt> is merely there for clarity; grepping a hash directly has the same effect. Note that we sort the pairs after we've grepped them, not before; this works fine in Perl 6, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case. =={{header\|PHP}}== <~~lang~~syntaxhighlight lang="php">$handle = fopen("readings.txt", "rb"); $missformcount = 0; $totalcount = 0; Line 1,802 ⟶ 2,189: foreach ($duplicates as $key => $val){ echo $val . ' at Line : ' . $key . '<br>'; }</~~lang~~syntaxhighlight> <pre>Valid records 5017 of 5471 total Duplicates : Line 1,810 ⟶ 2,197: 1993-03-28 at Line : 1184 1995-03-26 at Line : 1911</pre> =={{header\|Picat}}== <syntaxhighlight lang="picat">import util. go => Readings = [split(Record) : Record in read_file_lines("readings.txt")], DateStamps = new_map(), GoodReadings = 0, foreach({Rec,Id} in zip(Readings,1..Readings.length)) if Rec.length != 49 then printf("Entry %d has bad_length %d\n", Id, Rec.length) end, Date = Rec[1], if DateStamps.has_key(Date) then printf("Entry %d (date %w) is a duplicate of entry %w\n", Id, Date, DateStamps.get(Date)) else if sum([1: I in 3..2..49, check_field(Rec[I])]) == 0 then GoodReadings := GoodReadings + 1 end end, DateStamps.put(Date, Id) end, nl, printf("Total readings: %d\n",Readings.len), printf("Good readings: %d\n",GoodReadings), nl. 
check_field(Field) => Field == "-2" ; Field == "-1" ; Field == "0".</syntaxhighlight> {{out}} <pre>Entry 85 (date 1990-03-25) is a duplicate of entry 84 Entry 456 (date 1991-03-31) is a duplicate of entry 455 Entry 820 (date 1992-03-29) is a duplicate of entry 819 Entry 1184 (date 1993-03-28) is a duplicate of entry 1183 Entry 1911 (date 1995-03-26) is a duplicate of entry 1910 Total readings: 5471 Good readings: 5013</pre> =={{header\|PicoLisp}}== Put the following into an executable file "checkReadings": <syntaxhighlight lang="picolisp">#!/usr/bin/picolisp /usr/lib/picolisp/lib.l (load "@lib/misc.l") (in (opt) (until (eof) (let Lst (split (line) "^I") (unless (and (= 49 (length Lst)) # Check total length ($dat (car Lst) "-") # Check for valid date (fully # Check data format '((L F) (if F # Alternating: (format L 3) # Number (>= 9 (format L) -9) ) ) # or flag (cdr Lst) '(T NIL .) ) ) (prinl "Bad line format: " (glue " " Lst)) (bye 1) ) ) ) ) (bye)</syntaxhighlight> Then it can be called as <pre>$ ./checkReadings readings.txt</pre> =={{header\|PL/I}}== <~~lang~~syntaxhighlight lang="pli"> /* To process readings produced by automatic reading stations. */ Line 1,865 ⟶ 2,317: put skip list ('There were ' \|\| k-faulty \|\| ' good readings' ); end check; </syntaxhighlight> ~~</lang>~~ ~~=={{header\|PicoLisp}}==~~ ~~Put the following into an executable file "checkReadings":~~ ~~<lang PicoLisp>#!/usr/bin/picolisp /usr/lib/picolisp/lib.l~~ ~~(load "@lib/misc.l")~~ ~~(in (opt)~~ ~~(until (eof)~~ ~~(let Lst (split (line) "^I")~~ ~~(unless~~ ~~(and~~ ~~(= 49 (length Lst)) # Check total length~~ ~~($dat (car Lst) "-") # Check for valid date~~ ~~(fully # Check data format~~ ~~'((L F)~~ ~~(if F # Alternating:~~ ~~(format L 3) # Number~~ ~~(>= 9 (format L) -9) ) ) # or flag~~ ~~(cdr Lst)~~ ~~'(T NIL .) 
) )~~ ~~(prinl "Bad line format: " (glue " " Lst))~~ ~~(bye 1) ) ) ) )~~ ~~(bye)</lang>~~ ~~Then it can be called as~~ ~~<pre>$ ./checkReadings readings.txt</pre>~~ =={{header\|PowerShell}}== <~~lang~~syntaxhighlight lang="powershell">$dateHash = @{} $goodLineCount = 0 get-content c:\temp\readings.txt \| Line 1,918 ⟶ 2,343: } [string]$goodLineCount + " good lines" </syntaxhighlight> ~~</lang>~~ Output: Line 1,929 ⟶ 2,354: An alternative using regular expression syntax: <~~lang~~syntaxhighlight lang="powershell"> $dateHash = @{} $goodLineCount = 0 Line 1,950 ⟶ 2,375: } [string]$goodLineCount + " good lines" </syntaxhighlight> ~~</lang>~~ Output: Line 1,961 ⟶ 2,386: 5017 good lines </pre> =={{header\|PureBasic}}== Using regular expressions. <~~lang~~syntaxhighlight ~~PureBasic~~lang="purebasic">Define filename.s = "readings.txt" #instrumentCount = 24 Line 2,034 ⟶ 2,460: CloseConsole() EndIf EndIf</~~lang~~syntaxhighlight> Sample output: <pre>Duplicate date: 1990-03-25 occurs on lines 85 and 84. Line 2,045 ⟶ 2,471: =={{header\|Python}}== <~~lang~~syntaxhighlight lang="python">import re import zipfile import StringIO Line 2,085 ⟶ 2,511: #readings = StringIO.StringIO(zfs.read('readings.txt')) readings = open('readings.txt','r') munge2(readings)</~~lang~~syntaxhighlight> The results indicate 5013 good records, which differs from the Awk implementation. The final few lines of the output are as follows <pre style="height:10ex;overflow:scroll"> Line 2,104 ⟶ 2,530: Generate mostly summary information that is easier to compare to other solutions. 
<~~lang~~syntaxhighlight lang="python">import re import zipfile import StringIO Line 2,148 ⟶ 2,574: readings = open('readings.txt','r') munge2(readings)</~~lang~~syntaxhighlight> <pre>bash$ /cygdrive/c/Python26/python munge2.py Duplicate dates: Line 2,166 ⟶ 2,592: =={{header\|R}}== <~~lang~~syntaxhighlight ~~R~~lang="r"># Read in data from file dfr <- read.delim("d:/readings.txt", colClasses=c("character", rep(c("numeric", "integer"), 24))) dates <- strptime(dfr[,1], "%Y-%m-%d") Line 2,178 ⟶ 2,604: # Number of rows with no bad values flags <- as.matrix(dfr[,seq(3,49,2)])>0 sum(apply(flags, 1, all))</~~lang~~syntaxhighlight> =={{header\|Racket}}== <~~lang~~syntaxhighlight lang="racket">#lang racket (read-decimal-as-inexact #f) ;; files to read is a sequence, so it could be either a list or vector of files Line 2,230 ⟶ 2,656: (printf "~a records have good readings for all instruments~%" (text-processing/2 (current-command-line-arguments)))</~~lang~~syntaxhighlight> Example session: <pre>$ racket 2.rkt readings/readings.txt Line 2,239 ⟶ 2,665: duplicate datestamp: 1995-03-26 at line: 1911 (first seen at: 1910) 5013 records have good readings for all instruments</pre> =={{header\|Raku}}== (formerly Perl 6) {{trans\|Perl}} {{works with\|Rakudo\|2018.03}} This version does validation with a single Raku regex that is much more readable than the typical regex, and arguably expresses the data structure more straightforwardly. Here we use normal quotes for literals, and <tt>\h</tt> for horizontal whitespace. Variables like <tt>$good-records</tt> that are going to be autoincremented do not need to be initialized. The <tt>.push</tt> method on a hash is magical and loses no information; if a duplicate key is found in the pushed pair, an array holding the old value and the newly pushed value is created automatically. Hence we can easily track every line on which a particular duplicate occurred. 
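As a rough illustration of the approach described so far (one pattern that validates the whole record, plus a date-to-line-numbers mapping that keeps every occurrence), here is a minimal Python sketch. The names <tt>RECORD</tt> and <tt>scan</tt> are invented for this illustration and are not part of the Raku solution; a <tt>defaultdict(list)</tt> stands in for the hash <tt>.push</tt> behaviour, and the pattern assumes unsigned values, as the Raku regex does.

```python
import re
from collections import defaultdict

# Assumed record shape: a datestamp, then 24 (value, flag) pairs,
# separated by horizontal whitespace (spaces or tabs) only.
RECORD = re.compile(
    r"^(\d{4}-\d\d-\d\d)((?:[ \t]+\d+\.\d+[ \t]+-?\d+){24})[ \t]*$"
)


def scan(lines):
    """Return (good_records, total_records, repeated_dates)."""
    dates = defaultdict(list)  # date -> every line number it was seen on
    good = 0
    total = 0
    for number, line in enumerate(lines, start=1):
        total = number
        match = RECORD.match(line.rstrip("\r\n"))
        if match is None:
            print(f"Bad format at line {number}")
            continue
        # Appending never overwrites, so duplicates keep all line numbers.
        dates[match.group(1)].append(number)
        # Fields alternate value, flag, value, flag, ...: keep the flags.
        flags = match.group(2).split()[1::2]
        if all(int(flag) >= 1 for flag in flags):
            good += 1
    repeated = {d: nums for d, nums in dates.items() if len(nums) > 1}
    return good, total, repeated
```

Calling <tt>scan(open("readings.txt"))</tt> would return the three summary values; the file name is taken from the task description.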
The <tt>.all</tt> method does "junctional" logic: it autothreads through comparators as any English speaker would expect. Junctions can also short-circuit as soon as they find a value that doesn't match, and the evaluation order is up to the computer, so it can be optimized or parallelized. The final line simply greps out the pairs from the hash whose value is an array with more than 1 element. (Those values that are not arrays nevertheless have a <tt>.elems</tt> method that always reports <tt>1</tt>.) The <tt>.pairs</tt> is merely there for clarity; grepping a hash directly has the same effect. Note that we sort the pairs after we've grepped them, not before; this works fine in Raku, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case. <syntaxhighlight lang="raku" line>my $good-records; my $line; my %dates; for lines() { $line++; / ^ (\d ** 4 '-' \d\d '-' \d\d) [ \h+ \d+'.'\d+ \h+ ('-'?\d+) ] ** 24 $ / or note "Bad format at line $line" and next; %dates.push: $0 => $line; $good-records++ if $1.all >= 1; } say "$good-records good records out of $line total"; say 'Repeated timestamps (with line numbers):'; .say for sort %dates.pairs.grep: .value.elems > 1;</syntaxhighlight> Output: <pre>5017 good records out of 5471 total Repeated timestamps (with line numbers): 1990-03-25 => [84 85] 1991-03-31 => [455 456] 1992-03-29 => [819 820] 1993-03-28 => [1183 1184] 1995-03-26 => [1910 1911]</pre> =={{header\|REXX}}== Line 2,259 ⟶ 2,730: <br><br> The program has (negated) code to write the report to a file in addition to the console. <~~lang~~syntaxhighlight lang="rexx">/*REXX program to process instrument data from a data file. */ numeric digits 20 /*allow for bigger numbers. */ ifid='READINGS.TXT' /*name of the input file. 
*/ Line 2,398 ⟶ 2,869: return y//100\==0 \| y//400==0 /*apply the 100 and the 400 year rule.*/ /*────────────────────────────────────────────────────────────────────────────*/ sy: say arg(1); call lineout ofid,arg(1); return</~~lang~~syntaxhighlight> '''output'''   when using the default input file: <pre style="height:35ex"> Line 2,428 ⟶ 2,899: =={{header\|Ruby}}== <~~lang~~syntaxhighlight lang="ruby">require 'set' def munge2(readings, debug=false) Line 2,480 ⟶ 2,951: open('readings.txt','r') do \|readings\| munge2(readings) end</~~lang~~syntaxhighlight> =={{header\|Scala}}== {{works with\|Scala\|2.8}} <~~lang~~syntaxhighlight lang="scala">object DataMunging2 { import scala.io.Source import scala.collection.immutable.{TreeMap => Map} Line 2,522 ⟶ 2,993: dateMap.valuesIterable.sum)) } }</~~lang~~syntaxhighlight> Sample output: Line 2,541 ⟶ 3,012: =={{header\|Sidef}}== {{trans\|~~Perl 6~~Raku}} <~~lang~~syntaxhighlight lang="ruby">var good_records = 0; var dates = Hash~~.new -> default~~(0); ARGF.each { \|line\| ~~line~~var m ~~= /^(\d\d\d\d-\d\d-\d\d)((?:\h+\d+\.\d+\h+-?\d+){24})\s$/.match(line); m \|\| (warn "Bad format at line #{$.}"; next); ~~dates[$1]++;~~ dates{m[0]} := 0 ++; var i = 0; $2m[1].words.all{\|n\| i++ .is_even \|\| (n.to_num >= 1) } && ~~good_records~~++good_records; } say "#{good_records} good records out of #{$.} total"; say 'Repeated timestamps:'; say dates.~~pairs~~to_a.grep{ .~~second~~value > 1 }.map { .~~first~~key }.sort.join("\n");</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,565 ⟶ 3,037: 1993-03-28 1995-03-26 </pre> =={{header\|Snobol4}}== Developed using the Snobol4 dialect Spitbol for Linux, version 4.0 <syntaxhighlight lang="snobol4">* Read text/2 v = array(24) f = array(24) tos = char(9) " " ;* break characters are both tab and space pat1 = break(tos) . dstamp pat2 = span(tos) break(tos) . v[i] span(tos) (break(tos) \| (len(1) rem)) . 
f[i] rowcount = 0 hold_dstamp = "" num_bad_rows = 0 num_invalid_rows = 0 in0 row = input :f(endinput) rowcount = rowcount + 1 row ? pat1 = :f(invalid_row) * duplicated datestamp? * if dstamp = hold_dstamp then duplicated hold_dstamp = differ(hold_dstamp,dstamp) dstamp :s(nodup) output = dstamp ": datestamp at row " rowcount " duplicates datestamp at " rowcount - 1 nodup i = 1 in1 row ? pat2 = :f(invalid_row) i = lt(i,24) i + 1 :s(in1) * Is this a goodrow? * if any flag is < 1 then row has bad data c = 0 goodrow c = lt(c,24) c + 1 :f(goodrow2) num_bad_rows = lt(f[c],1) num_bad_rows + 1 :s(goodrow2)f(goodrow) goodrow2 :(in0) invalid_row num_invalid_rows = num_invalid_rows + 1 :(in0) endinput output = output = "Total number of rows : " rowcount output = "Total number of rows with invalid format: " num_invalid_rows output = "Total number of rows with bad data : " num_bad_rows output = "Total number of good rows : " rowcount - num_invalid_rows - num_bad_rows end </syntaxhighlight> {{out}} <pre>1990-03-25: datestamp at row 85 duplicates datestamp at 84 1991-03-31: datestamp at row 456 duplicates datestamp at 455 1992-03-29: datestamp at row 820 duplicates datestamp at 819 1993-03-28: datestamp at row 1184 duplicates datestamp at 1183 1995-03-26: datestamp at row 1911 duplicates datestamp at 1910 Total number of rows : 5471 Total number of rows with invalid format: 0 Total number of rows with bad data : 454 Total number of good rows : 5017 </pre> =={{header\|Tcl}}== <~~lang~~syntaxhighlight lang="tcl">set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1] set total [llength $data] set correct $total Line 2,593 ⟶ 3,134: puts "$correct records with good readings = [expr $correct * 100.0 / $total]%" puts "Total records: $total"</~~lang~~syntaxhighlight> <pre>$ tclsh munge2.tcl Duplicate datestamp: 1990-03-25 Line 2,608 ⟶ 3,149: To demonstrate a different way to iterate over the file, and different ways to verify data types: <~~lang~~syntaxhighlight 
lang="tcl">set total [set good 0] array set seen {} set fh [open readings.txt] Line 2,646 ⟶ 3,187: puts "total: $total" puts [format "good: %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]</~~lang~~syntaxhighlight> Results: <pre>duplicate date on line 85: 1990-03-25 Line 2,659 ⟶ 3,200: compiled and run in a single step, with the input file accessed as a list of strings pre-declared in readings_dot_txt <~~lang~~syntaxhighlight ~~Ursala~~lang="ursala">#import std #import nat Line 2,672 ⟶ 3,213: #show+ main = valid_format?(^C/good_readings duplicate_dates,-[invalid format]-!) readings</~~lang~~syntaxhighlight> output: <pre>5017 good readings Line 2,683 ⟶ 3,224: =={{header\|VBScript}}== <~~lang~~syntaxhighlight lang="vb">Set objFSO = CreateObject("Scripting.FileSystemObject") Set objFile = objFSO.OpenTextFile(objFSO.GetParentFolderName(WScript.ScriptFullName) &_ "\readings.txt",1) Line 2,736 ⟶ 3,277: objFile.Close Set objFSO = Nothing</~~lang~~syntaxhighlight> {{Out}} Line 2,757 ⟶ 3,298: * Reads flag value and checks if it is positive * Requires 24 value/flag pairs on each line <~~lang~~syntaxhighlight lang="vedit">#50 = Buf_Num // Current edit buffer (source data) File_Open("\|(PATH_ONLY)\output.txt") #51 = Buf_Num // Edit buffer for output file Line 2,804 ⟶ 3,345: IT("Date format errors: ") Num_Ins(#14) IT("Invalid data records:") Num_Ins(#15) IT("Total records: ") Num_Ins(#12)</~~lang~~syntaxhighlight> Sample output: <~~lang~~syntaxhighlight lang="vedit">1990-03-25: duplicate record at 85 1991-03-31: duplicate record at 456 1992-03-29: duplicate record at 820 Line 2,816 ⟶ 3,357: Date format errors: 0 Invalid data records: 454 Total records: 5471</~~lang~~syntaxhighlight> =={{header\|Wren}}== {{trans\|Kotlin}} {{libheader\|Wren-pattern}} {{libheader\|Wren-fmt}} {{libheader\|Wren-sort}} <syntaxhighlight lang="wren">import "io" for File import "./pattern" for Pattern import "./fmt" for Fmt import "./sort" for Sort var p = Pattern.new("+1/s") var fileName = 
"readings.txt" var lines = File.read(fileName).trimEnd().split("\r\n") var count = 0 var invalid = 0 var allGood = 0 var map = {} for (line in lines) { count = count + 1 var fields = p.splitAll(line) var date = fields[0] if (fields.count == 49) { map[date] = map.containsKey(date) ? map[date] + 1 : 1 var good = 0 var i = 2 while (i < fields.count) { if (Num.fromString(fields[i]) >= 1) good = good + 1 i = i + 2 } if (good == 24) allGood = allGood + 1 } else { invalid = invalid + 1 } } Fmt.print("File = $s", fileName) System.print("\nDuplicated dates:") var keys = map.keys.toList Sort.quick(keys) for (k in keys) { var v = map[k] if (v > 1) Fmt.print(" $s ($d times)", k, v) } Fmt.print("\nTotal number of records : $d", count) var percent = invalid/count * 100 Fmt.print("Number of invalid records : $d ($5.2f)\%", invalid, percent) percent = allGood/count * 100 Fmt.print("Number which are all good : $d ($5.2f)\%", allGood, percent)</syntaxhighlight> {{out}} <pre> File = readings.txt Duplicated dates: 1990-03-25 (2 times) 1991-03-31 (2 times) 1992-03-29 (2 times) 1993-03-28 (2 times) 1995-03-26 (2 times) Total number of records : 5471 Number of invalid records : 0 ( 0.00)% Number which are all good : 5017 (91.70)% </pre> =={{header\|zkl}}== <~~lang~~syntaxhighlight lang="zkl"> // the RegExp engine has a low limit on groups so // I can't use it to select all fields, only verify them re:=RegExp(0'\|^(\d+-\d+-\d+)\| + 0'\|\s+\d+\.\d+\s+-\d+\| 24 + ".+$"); w:=~~Utils~~[1.~~Helpers~~.~~zipW~~].zip(File("readings.txt")~~,[1..]~~); //-->lazy (~~line,~~line #,line) reg datep,N, good=0, dd=0; foreach ~~line,~~n,line in (w){ N=n; // since n is local to this scope if (not re.search(line)){ println("Line %d: malformed".fmt(n)); continue; } Line 2,837 ⟶ 3,443: good+=1; } println("%d records read, %d duplicate dates, %d valid".fmt(N,dd,good));</~~lang~~syntaxhighlight> {{out}} <pre>