Word frequency: Difference between revisions
(→{{header|Raku}}: use % operator, add type constraint and properly align output)
m (→{{header|Wren}}: Minor tidy and rerun)
(8 intermediate revisions by 5 users not shown)
Line 2,812:
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' to read the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the counts.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
    URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("(\\w+)");
        Matcher matcher;
        String line;
        String word;
        Map<String, Integer> map = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while (matcher.find()) {
                word = matcher.group().toLowerCase();
                if (map.containsKey(word)) {
                    map.put(word, map.get(word) + 1);
                } else {
                    map.put(word, 1);
                }
            }
        }
        /* print out top 10 */
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        list.sort(Map.Entry.comparingByValue());
        Collections.reverse(list);
        int count = 1;
        for (Map.Entry<String, Integer> value : list) {
            System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
            if (count++ == 10) break;
        }
    }
}
</syntaxhighlight>
<pre>
the                  41,043
of                   19,952
and                  14,938
a                    14,539
to                   13,942
in                   11,208
he                    9,646
was                   8,620
that                  7,922
it                    6,659
</pre>
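The counting and ranking steps above can also be tried without a network connection. The following is only a minimal sketch, not part of the original entry: the class name <code>WordFrequencySketch</code>, the helper names <code>countWords</code> and <code>top</code>, the sample sentence, and the alphabetical tie-break are all assumptions made for illustration. It swaps the ''containsKey'' test for ''Map.merge'' and the sort-then-reverse step for a descending comparator.

<syntaxhighlight lang="java">
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordFrequencySketch {
    // Count lower-cased \w+ tokens in the given text (hypothetical helper).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = Pattern.compile("\\w+").matcher(text);
        while (matcher.find()) {
            // merge() inserts 1 for a new word, otherwise adds 1 to the stored count
            counts.merge(matcher.group().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Return the n most frequent entries, highest count first.
    // Assumption: ties are broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the downloaded book
        String sample = "The cat and the dog and the bird";
        for (Map.Entry<String, Integer> e : top(countWords(sample), 3)) {
            System.out.printf("%-20s%,7d%n", e.getKey(), e.getValue());
        }
    }
}
</syntaxhighlight>

For the sample sentence this lists ''the'' (3), ''and'' (2) and ''bird'' (1), in that order. Counts on the real text may differ from the figures above if the tie-break or the input encoding differs.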
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
Line 2,929 ⟶ 2,993:
"he" │ 6816
"had" │ 6140</pre>
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
("of";19898)
("and";14658)
(,"a";14517)
("to";13695)
("in";11134)
("he";9405)
("was";8361)
("that";7592)
("his";6446))</syntaxhighlight>
(The relatively easy-to-read output format here is arguably less useful than the table produced by <code>common</code>, but it would have been more concise to have <code>common</code> generate it directly.)
=={{header|KAP}}==
Line 3,325 ⟶ 3,405:
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang="perl">
use warnings;
use utf8;
my $top = 10;
open my $fh, '<', 'ref/word-count.txt' or die "Cannot open ref/word-count.txt: $!";
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;
my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
Line 3,337 ⟶ 3,420:
);
for my $reg (@matcher) {
print "\nTop $top using regex: " . $reg . "\n";
my @matches = $text =~ /$reg/g;
my %words;
for my $w (@matches) { $words{$w}++ };
my $c = 0;
for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
printf "%-7s %6d\n", $w, $words{$w};
last if ++$c >= $top;
Line 3,350 ⟶ 3,433:
{{out}}
<pre>
Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
Line 3,384 ⟶ 3,468:
was 8621
that 7924
it 6661
</pre>
=={{header|Phix}}==
Line 4,001 ⟶ 4,086:
for @matcher -> $reg {
say "\nTop $top using regex: ", $reg.raku;
printf "%-{$length}s %d\n", .key, .value for @words;
}
Line 4,958 ⟶ 5,043:
6 garbage collection(s) in 0.2 seconds.
</pre>
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
=={{header|Swift}}==
Line 5,333 ⟶ 5,437:
I've taken the view that 'letter' means either a letter or a digit for Unicode codepoints up to 255. I haven't included underscore, hyphen or apostrophe, as these usually separate compound words.
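To make that word definition concrete, here is a minimal sketch in Java rather than Wren. It is an illustration under stated assumptions, not the code of this entry: the class name <code>Latin1Tokenizer</code> and the helpers <code>isWordChar</code> and <code>tokenize</code> are hypothetical, the letter/digit test relies on Java's ''Character.isLetterOrDigit'' (which may classify a few Latin-1 characters differently from the Wren pattern), and lower-casing is assumed because the task counts words case-insensitively.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;

public class Latin1Tokenizer {
    // Assumption: a word character is a letter or digit with code point <= 255.
    // Underscore, hyphen and apostrophe fail this test, so they act as separators.
    static boolean isWordChar(int cp) {
        return cp <= 255 && Character.isLetterOrDigit(cp);
    }

    // Split text into lower-cased runs of word characters.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isWordChar(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A separator ends the current word
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample; \u00e9 is 'e' with an acute accent (code point 233)
        System.out.println(tokenize("Jean Valjean's well-known caf\u00e9_2"));
    }
}
</syntaxhighlight>

With this rule the apostrophe, hyphen and underscore split the sample into ''jean'', ''valjean'', ''s'', ''well'', ''known'', ''café'' and ''2''.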
Not very quick (runs in about
If the Go example is re-run today (
<syntaxhighlight lang="
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
var fileName = "135-0.txt"