Text processing/2: Difference between revisions

Rename Perl 6 -> Raku, alphabetize, minor clean-up
Line 371:
 
5017 out 5471 lines good
</pre>
 
Line 521 ⟶ 454:
1993-03-28 is duplicated at Lines : 1183,1184
1995-03-26 is duplicated at Lines : 1910,1911
</pre>
 
=={{header|C++}}==
{{libheader|Boost}}
<lang cpp>#include <boost/regex.hpp>
#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>
using namespace std ;

boost::regex e ( "\\s+" ) ; //fields are separated by runs of whitespace

int main( int argc , char *argv[ ] ) {
   if ( argc < 2 ) {
      cerr << "Usage: give the readings file as the first argument\n" ;
      return 1 ;
   }
   ifstream infile( argv[ 1 ] ) ;
   if ( ! infile.is_open( ) ) {
      cerr << "Can't open file " << argv[ 1 ] << '\n' ;
      return 1 ;
   }
   vector<string> duplicates ; //datestamps that occur more than once
   set<string> datestamps ; //all datestamps seen so far
   int all_ok = 0 ; //records in which every flag is >= 1
   int pattern_ok = 0 ; //records with the expected number of fields
   string eingabe ; //the current input line
   while ( getline( infile , eingabe ) ) { //ends cleanly at end of file
      boost::sregex_token_iterator i ( eingabe.begin( ), eingabe.end( ) , e , -1 ), j ; //split on whitespace
      vector<string> fields( i, j ) ;
      if ( fields.size( ) == 49 ) //we expect 49 fields in a record
         pattern_ok++ ;
      else
         cout << "Format not ok!\n" ;
      if ( fields.size( ) < 3 ) //no datestamp plus value/flag pair to examine
         continue ;
      if ( datestamps.insert( fields[ 0 ] ).second ) { //not duplicated
         int howoften = ( fields.size( ) - 1 ) / 2 ; //number of value/flag pairs
         bool flags_ok = true ;
         for ( int n = 1 ; n <= howoften && flags_ok ; n++ ) //flags sit at even indices
            flags_ok = atoi( fields[ 2 * n ].c_str( ) ) >= 1 ;
         if ( flags_ok )
            all_ok++ ;
      }
      else {
         duplicates.push_back( fields[ 0 ] ) ; //first field holds datestamp
      }
   }
   infile.close( ) ;
   cout << "The following " << duplicates.size() << " datestamps were duplicated:\n" ;
   copy( duplicates.begin( ) , duplicates.end( ) ,
         ostream_iterator<string>( cout , "\n" ) ) ;
   cout << all_ok << " records were complete and ok!\n" ;
   return 0 ;
}</lang>
 
{{out}}
<pre>
Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31
</pre>
 
Line 951:
5017 records were ok
</lang>
 
=={{header|Fortran}}==
The trouble with the dates suggests that each date should be checked for validity in its own right, and that the sequence check should verify that each new record advances the date by exactly one day. Day-number calculations were presented long ago by H. F. Fliegel and T. C. Van Flandern in Communications of the ACM, Vol. 11, No. 10 (October 1968).
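
The sequence check described above can be sketched in a few lines. The following Python is only an illustration, not the Fortran solution itself: the names <tt>day_number</tt> and <tt>find_sequence_breaks</tt> are made up for this example, and the formula is the Fliegel and Van Flandern integer expression for the Julian day number, which relies on division that truncates toward zero (as Fortran integer division does). It assumes datestamps of the form YYYY-MM-DD and does not by itself reject impossible dates such as 1990-02-30.

```python
def _tdiv(a, b):
    """Integer division truncating toward zero, like Fortran's integer '/'."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def day_number(year, month, day):
    """Julian day number of a Gregorian date (Fliegel & Van Flandern, 1968)."""
    t = _tdiv(month - 14, 12)  # -1 for January and February, otherwise 0
    return (day - 32075
            + _tdiv(1461 * (year + 4800 + t), 4)
            + _tdiv(367 * (month - 2 - t * 12), 12)
            - _tdiv(3 * _tdiv(year + 4900 + t, 100), 4))


def find_sequence_breaks(datestamps):
    """Return (line number, datestamp) for every record that does not fall
    exactly one day after the record before it."""
    breaks = []
    previous = None
    for line_no, stamp in enumerate(datestamps, start=1):
        year, month, day = (int(part) for part in stamp.split("-"))
        current = day_number(year, month, day)
        if previous is not None and current != previous + 1:
            breaks.append((line_no, stamp))
        previous = current
    return breaks
```

Called on the datestamps 1990-03-24, 1990-03-25, 1990-03-25, 1990-03-26, <tt>find_sequence_breaks</tt> reports line 3, the repeated date, because that record fails to advance the day number by one.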
Line 1,331 ⟶ 1,332:
var analyze = analyze_func('readings.txt');
analyze();</lang>
 
=={{header|jq}}==
{{works with|jq|with regex support}}
Line 1,676 ⟶ 1,678:
number of valid records: 5017
</pre>
 
=={{header|Nim}}==
<lang Nim>
Line 1,866 ⟶ 1,868:
1993-03-28
1995-03-26</pre>
 
=={{header|Phix}}==
Line 1,998 ⟶ 1,956:
1993-03-28 at Line : 1184
1995-03-26 at Line : 1911</pre>
 
=={{header|PicoLisp}}==
Put the following into an executable file "checkReadings":
<lang PicoLisp>#!/usr/bin/picolisp /usr/lib/picolisp/lib.l
 
(load "@lib/misc.l")
 
(in (opt)
   (until (eof)
      (let Lst (split (line) "^I")
         (unless
            (and
               (= 49 (length Lst))     # Check total length
               ($dat (car Lst) "-")    # Check for valid date
               (fully                  # Check data format
                  '((L F)
                     (if F                         # Alternating:
                        (format L 3)               # Number
                        (>= 9 (format L) -9) ) )   # or flag
                  (cdr Lst)
                  '(T NIL .) ) )
            (prinl "Bad line format: " (glue " " Lst))
            (bye 1) ) ) ) )
 
(bye)</lang>
Then it can be called as
<pre>$ ./checkReadings readings.txt</pre>
 
=={{header|PL/I}}==
Line 2,054 ⟶ 2,039:
end check;
</lang>
 
=={{header|PowerShell}}==
Line 2,149 ⟶ 2,107:
5017 good lines
</pre>
 
=={{header|PureBasic}}==
Using regular expressions.
Line 2,427 ⟶ 2,386:
duplicate datestamp: 1995-03-26 at line: 1911 (first seen at: 1910)
5013 records have good readings for all instruments</pre>
 
=={{header|Raku}}==
(formerly Perl 6)
{{trans|Perl}}
{{works with|Rakudo|2018.03}}
 
This version does validation with a single Raku regex that is much more readable than the typical regex, and arguably expresses the data structure more straightforwardly.
Here we use normal quotes for literals, and <tt>\h</tt> for horizontal whitespace.
 
Variables like <tt>$good-records</tt> and <tt>$line</tt> that are going to be autoincremented do not need to be initialized.
 
The <tt>.push</tt> method on a hash is magical and loses no information: if the key of the pushed pair is already present, the existing value and the new value are automatically collected into an array. Hence we can easily track all the lines at which a particular duplicate occurred.
 
The <tt>.all</tt> method does "junctional" logic: it autothreads through comparators as any English speaker would expect. Junctions can also short-circuit as soon as they find a value that doesn't match, and the evaluation order is up to the computer, so it can be optimized or parallelized.
 
The final line simply greps out the pairs from the hash whose value is an array with more than 1 element. (Those values that are not arrays nevertheless have a <tt>.elems</tt> method that always reports <tt>1</tt>.) The <tt>.pairs</tt> is merely there for clarity; grepping a hash directly has the same effect.
Note that we sort the pairs after we've grepped them, not before; this works fine in Raku, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case.
 
<lang perl6>my $good-records;
my $line;
my %dates;
 
for lines() {
    $line++;
    / ^
    (\d ** 4 '-' \d\d '-' \d\d)
    [ \h+ \d+'.'\d+ \h+ ('-'?\d+) ] ** 24
    $ /
        or note "Bad format at line $line" and next;
    %dates.push: $0 => $line;
    $good-records++ if $1.all >= 1;
}
 
say "$good-records good records out of $line total";
 
say 'Repeated timestamps (with line numbers):';
.say for sort %dates.pairs.grep: *.value.elems > 1;</lang>
{{out}}
<pre>5017 good records out of 5471 total
Repeated timestamps (with line numbers):
1990-03-25 => [84 85]
1991-03-31 => [455 456]
1992-03-29 => [819 820]
1993-03-28 => [1183 1184]
1995-03-26 => [1910 1911]</pre>
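
For readers who do not know Raku, the three idioms described above (one regex that validates the whole record, a hash whose values accumulate line numbers, and an "all flags are at least 1" test) can be approximated in Python. This is a rough sketch for comparison only: the function name <tt>analyze</tt> and the pattern <tt>RECORD</tt> are invented for this example, and it returns its results instead of printing them.

```python
import re
import sys
from collections import defaultdict

# One datestamp followed by exactly 24 (value, flag) pairs,
# separated by horizontal whitespace, mirroring the Raku regex.
RECORD = re.compile(r"^(\d{4}-\d\d-\d\d)((?:[ \t]+\d+\.\d+[ \t]+-?\d+){24})$")


def analyze(lines):
    """Return (good record count, total line count, repeated datestamps)."""
    good = 0
    total = 0
    seen = defaultdict(list)  # datestamp -> every line number it occurs on
    for total, line in enumerate(lines, start=1):
        match = RECORD.match(line.rstrip("\n"))
        if match is None:
            print(f"Bad format at line {total}", file=sys.stderr)
            continue
        seen[match.group(1)].append(total)
        # Tokens alternate value, flag, value, flag, ...; keep the flags.
        flags = [int(token) for token in match.group(2).split()[1::2]]
        if all(flag >= 1 for flag in flags):
            good += 1
    repeats = {stamp: nums for stamp, nums in sorted(seen.items())
               if len(nums) > 1}
    return good, total, repeats
```

Here <tt>defaultdict(list)</tt> stands in for the hash <tt>.push</tt> behaviour and the built-in <tt>all()</tt> for the junction; unlike a junction, <tt>all()</tt> always evaluates left to right.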
 
=={{header|REXX}}==