Word frequency: Difference between revisions
Thundergnat (talk | contribs) m (→{{header|Perl 6}}: Note problem with /A-Za-z/)
Thundergnat (talk | contribs) (→{{header|Perl 6}}: More problems with /A-Za-z/)
Line 45:
We might try /A-Za-z/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those '''are''' letters, so words will be incorrectly split up. (Misérables will be counted as 'mis' and 'rables', probably not what we want.)
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text and gets incorrectly split into "Al" and "the", so the matcher incorrectly reports 41089 occurrences of "the".
The text has several words like "Panathenæa", "ça" and "Keksekça", so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /A-Za-z/, but mostly by accident. A more accurate regex matcher is some kind of Unicode-aware /\w/ minus underscore.
(Really, a better regex would allow for contractions and embedded apostrophes, but that is beyond the scope of this task as it stands. There are words like cat-o'-nine-tails and will-o'-the-wisps in there too, to make your day even more interesting.)
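The difference between the two matchers described above can be illustrated with a small sketch. It is written in Python rather than Perl 6, purely as an illustration; the sample sentence, the pattern names and the `count_words` helper are all made up for this example and are not taken from the task or from any posted solution. Python's `[^\W_]` stands in for "Unicode-aware \w minus underscore".

```python
import re
from collections import Counter

# Hypothetical sample, not taken from the actual text.
# \u00e8 = è, \u00e9 = é, \u00e7 = ç (precomposed characters).
sample = "Al\u00e8the read Les Mis\u00e9rables; \u00e7a va, the end."

# ASCII-only matcher: accented letters act as word separators.
ASCII_WORD = re.compile(r"[a-z]+")

# Unicode-aware matcher: any word character except underscore.
# In Python 3, \w on str patterns is Unicode-aware by default.
UNICODE_WORD = re.compile(r"[^\W_]+")


def count_words(text, pattern):
    """Lowercase the text and count every match of the given pattern."""
    return Counter(pattern.findall(text.lower()))


ascii_counts = count_words(sample, ASCII_WORD)
unicode_counts = count_words(sample, UNICODE_WORD)

# Compare the words the note above singles out.
for word in ("the", "a", "mis", "rables"):
    print(word, ascii_counts[word], unicode_counts[word])
```

With the ASCII-only pattern the sample yields a spurious extra "the" (from "Alèthe"), a spurious "a" (from "ça"), and the fragments "mis" and "rables"; the Unicode-aware pattern keeps those words whole. Note that this sketch only handles precomposed accented characters; text that uses combining marks would need normalizing first.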