Word frequency: Difference between revisions

→‎{{header|Perl 6}}: Note problem with /A-Za-z/
(Made draft task.)
Line 45:
We might try /A-Za-z/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those '''are''' letters, so words will be incorrectly split up. (Misérables will be counted as 'mis' and 'rables', probably not what we want.)
 
Actually, in this specific case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text and gets incorrectly split into "Al" and "the", so the matcher incorrectly reports 41089 occurrences of "the". The other 9 of the top 10 are "correct" using /A-Za-z/, but that is mostly by accident. The more accurate regex matcher is some kind of Unicode aware /\w/ minus underscore.
 
(Really, a better regex would allow for contractions and embedded apostrophes, but that is beyond the scope of this task as it stands. There are words like cat-o'-nine-tails and will-o'-the-wisps in there too, to make your day even more interesting.)
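
As a small illustration of the difference between the two matchers, here is a minimal sketch. The sample string and the variable names are made up for this example and are not taken from the task's text file; the character class <[\w]-[_]> is one possible way to write "Unicode aware \w minus underscore".

<lang perl6># Hypothetical sample line, chosen only to show the splitting problem.
my $sample = 'Les Misérables: Alèthe saw the cat';

# ASCII-only matcher: accented letters are treated as word breaks,
# so "misérables" and "alèthe" are each cut into two pieces.
my @ascii-words = $sample.lc.comb(/ <[a..z]>+ /);

# Unicode-aware matcher: any word character except the underscore.
my @unicode-words = $sample.lc.comb(/ <[\w]-[_]>+ /);

say @ascii-words;
say @unicode-words;

# The stray "the" split off from "alèthe" inflates the ASCII count by one.
say 'ASCII count of "the":   ', @ascii-words.Bag<the>;
say 'Unicode count of "the": ', @unicode-words.Bag<the>;</lang>

The same off-by-one effect is what turns the count of "the" into 41089 when the ASCII-only matcher is run over the full text.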
 
<lang perl6>sub MAIN ($filename, $top = 10) {