Word frequency: Difference between revisions

Revision as of 09:35, 24 September 2022 by Grondilu (→{{header|Raku}}: use % operator, add type constraint and properly align output)
Latest revision as of 10:35, 17 February 2024 by PureFox m (→{{header|Wren}}: Minor tidy and rerun)
(8 intermediate revisions by 5 users not shown)
Line 2,812:
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to examine the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the values.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
    URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("(\\w+)");
        Matcher matcher;
        String line;
        String word;
        Map<String, Integer> map = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while (matcher.find()) {
                word = matcher.group().toLowerCase();
                if (map.containsKey(word)) {
                    map.put(word, map.get(word) + 1);
                } else {
                    map.put(word, 1);
                }
            }
        }
        /* print out top 10 */
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        list.sort(Map.Entry.comparingByValue());
        Collections.reverse(list);
        int count = 1;
        for (Map.Entry<String, Integer> value : list) {
            System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
            if (count++ == 10) break;
        }
    }
}
</syntaxhighlight>
<pre>
the                  41,043
of                   19,952
and                  14,938
a                    14,539
to                   13,942
in                   11,208
he                    9,646
was                   8,620
that                  7,922
it                    6,659
</pre>
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
Line 2,929 ⟶ 2,993:
"he" │ 6816
"had" │ 6140</pre>
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight 
lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
 ("of";19898)
 ("and";14658)
 (,"a";14517)
 ("to";13695)
 ("in";11134)
 ("he";9405)
 ("was";8361)
 ("that";7592)
 ("his";6446))</syntaxhighlight>
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
=={{header|KAP}}==
Line 3,325 ⟶ 3,405:
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang="perl">~~$top = 10;~~
use strict;
use warnings;
use utf8;

my $top = 10;

~~open $fh, "<", '135-0.txt';~~
~~($text = join '', <$fh>) =~ tr/A-Z/a-z/~~
~~or die "Can't open '135-0.txt': $!\n";~~
open my $fh, '<', 'ref/word-count.txt';
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;

~~@matcher = (~~
my @matcher = (
    qr/[a-z]+/,     # simple 7-bit ASCII
    qr/\w+/,        # word characters with underscore
Line 3,337 ⟶ 3,420:
);

for my $reg (@matcher) {
    print "\nTop $top using regex: " . $reg ~~. "~~\n";
    my @matches = $text =~ /$reg/g;
    my %words;
    for my $w (@matches) { $words{$w}++ };
    my $c = 0;
    for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
        printf "%-7s %6d\n", $w, $words{$w};
        last if ++$c >= $top;
Line 3,350 ⟶ 3,433:
{{out}}
<pre>
~~<pre>Top 10 using regex: (?^:[a-z]+)~~
Top 10 using regex: (?^:[a-z]+)
the      41089
of       19949
Line 3,384 ⟶ 3,468:
was       8621
that      7924
it        6661~~</pre>~~
</pre>
=={{header|Phix}}==
Line 4,001 ⟶ 4,086:
for @matcher -> $reg {
    say "\nTop $top using regex: ", $reg.raku;
    my @words = $file.comb($reg).Bag.sort(-.value)[^$top];
    my $length = max @words».key».chars;
    printf "%-{$length}s %d\n", .key, .value for @words;
}
Line 4,958 ⟶ 5,043:
6 garbage collection(s) in 0.2 seconds.
</pre>
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt. 
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false) contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his')
</pre>
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt') contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'}
</pre>
=={{header|Swift}}==
Line 5,333 ⟶ 5,437:
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.

Not very quick (runs in about ~~47~~15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.

If the Go example is re-run today (~~21~~17 ~~October~~February ~~2020~~2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than ~~2~~5 years ago.

<syntaxhighlight lang="~~ecmascript~~wren">import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern

var fileName = "135-0.txt"