Word frequency: Difference between revisions

→‎{{header|Perl 6}}: Add a Perl 6 example
["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])
</pre>
 
=={{header|Perl 6}}==
{{works with|Rakudo|2017.07}}
This is slightly trickier than it first appears. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. We might first reach for a regex matcher like /\w+/, but \w includes the underscore, which is not a letter but connector punctuation, and this text is '''full''' of underscores, since that is how Project Gutenberg texts denote italicized text. The underscores are not actually part of the words; they are markup.
 
We might try /<[A-Za-z]>+/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those '''are''' letters, so such words will be split incorrectly (Misérables will be counted as 'mis' and 'rables', probably not what we want).
 
In this specific case /<[A-Za-z]>+/ happens to return the correct answer, since neither the accented French words nor their incorrectly split fragments are in the top 10, but that is only by accident. The correct regex matcher is some kind of Unicode-aware \w minus underscore.
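The difference between the three matchers can be seen on a small sample. The following is a minimal sketch in Python, given only for comparison with the Perl 6 entry; the sample sentence and the helper name <code>words</code> are made up for illustration, and <code>[^\W_]+</code> is used as a rough Python counterpart of "\w minus underscore".

```python
import re

# Hypothetical sample in the style of a Project Gutenberg text:
# underscores mark italics, and \u00e9 is a precomposed accented letter.
sample = "_Les Mis\u00e9rables_ was _not_ a short book."

def words(pattern, text):
    """Return every match of pattern in the lowercased text."""
    return re.findall(pattern, text.lower())

# \w+ keeps the underscores attached to the words.
with_underscore = words(r"\w+", sample)

# An ASCII-only class splits the accented word in two.
ascii_only = words(r"[a-z]+", sample)

# Word characters minus underscore: [^\W_] means "not a non-word
# character and not an underscore".
unicode_no_underscore = words(r"[^\W_]+", sample)

print(with_underscore)
print(ascii_only)
print(unicode_no_underscore)
```

Only the last matcher yields both 'les' and 'misérables' as clean words; the first leaves the italics markup attached and the second produces 'mis' and 'rables'.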
 
(Really, a better regex would allow for contractions and embedded apostrophes, but that is beyond the scope of this task. There are words like cat-o'-nine-tails and will-o'-the-wisps in there too, to make your day even more interesting.)
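For the curious, one possible shape of such a matcher is sketched below in Python. It is an illustration only, not part of the task solution: the pattern, the name <code>WORD</code> and the sample sentence are assumptions made for this example. It allows a single apostrophe, a single hyphen, or an apostrophe followed by a hyphen between runs of word characters, so a double hyphen used as a dash still separates words.

```python
import re

# Hypothetical matcher: runs of word characters (minus underscore),
# optionally joined by "'", "-" or "'-". A doubled hyphen "--" is
# not a valid joiner, so it still splits words.
WORD = re.compile(r"[^\W_]+(?:(?:'?-|')[^\W_]+)*")

# Made-up sample sentence for illustration.
text = "The cat-o'-nine-tails didn't frighten the will-o'-the-wisps--not at all."

tokens = WORD.findall(text.lower())
print(tokens)
```

Leading and trailing apostrophes are dropped by this pattern, which may or may not be what a given text needs.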
 
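<lang perl6>sub MAIN ($filename, $count = 10) {
    # Read the whole file into memory
    my $file = $filename.IO.slurp;
    # A word: one or more word characters, excluding underscore
    my $word = rx/ [ <[\w] - [_]> ]+ /;
    # Match every word in the lowercased text
    $file.lc ~~ m:g/ <$word> /;
    # Tally the matches and print the $count most frequent
    .say for $/».Str.Bag.sort( -*.value )[^$count];
}</lang>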
 
{{out}}
Passing in the file name and 10:
<pre>the => 41088
of => 19949
and => 14942
a => 14596
to => 13951
in => 11214
he => 9648
was => 8621
that => 7924
it => 6661</pre>
 
=={{header|Python}}==