Talk:Word frequency: Difference between revisions

m (Amakukha moved page Talk:Word count to Talk:Word frequency: word count (wc) is a completely different algorithm and Unix utility)
 
Line 74:
:::Really, the only thing at all in question to my mind is: is underscore a letter or not? On the face of it, it seems clear; no, of course not. So should the word "_The" in the text be counted as "_the" or "the"? The \w assertion in PCRE (which most languages use directly or emulate) includes underscore for historical reasons, so "_the" and "the" are counted as different words. On the other hand, does the word "Alèthe" contain any non-letter characters? It is awfully narrow-minded to insist that "if you can't fit it in 7 bits, it ain't a character." That being said, I think disregarding hyphenated words and contractions with embedded apostrophes when counting words is ridiculous too, so I added a second version which accounts for them, but it doesn't meet the task requirements as written (hence it being a ''second'' version). --[[User:Thundergnat|Thundergnat]] ([[User talk:Thundergnat|talk]]) 23:44, 18 August 2017 (UTC)
::::For the purposes of this task, which requires the top 10 words, defining a word as [A-z0-9À-ÿ]+ works well and gives the same answers as Perl6 (41088 for "the" and 14596 for "a"). Obviously "Nigel's" is one word, but it would require a lot of possession to promote the single character "s" into the top 10 words.--[[User:Nigel Galloway|Nigel Galloway]] ([[User talk:Nigel Galloway|talk]]) 13:23, 19 August 2017 (UTC)
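
To make the competing word definitions above concrete, here is a minimal Python sketch that applies three patterns to one short string. The sample string and the variable names are made up for illustration; the letters-only pattern is one possible reading of "underscore is not a letter", not something the task prescribes. One thing it shows is that the range A-z also covers the ASCII characters between "Z" and "a", underscore included, so on this sample the quoted class behaves like \w.

```python
import re

# Hypothetical sample; \u00e8 is the precomposed letter e-grave in "Alethe".
sample = "_The the Al\u00e8the Nigel's well-known".lower()

# \w matches letters, digits and underscore, so "_the" stays distinct from "the".
w_words = re.findall(r"\w+", sample)

# The character class quoted in the discussion. The range A-z also spans the
# ASCII punctuation between "Z" and "a", which includes the underscore.
range_words = re.findall(r"[A-z0-9\u00c0-\u00ff]+", sample)

# Letters only: word characters minus digits and underscore (an assumption
# about what "letter" should mean, chosen here for illustration).
letter_words = re.findall(r"[^\W\d_]+", sample)

print(w_words)
print(range_words)
print(letter_words)
```

None of the three patterns keeps "nigel's" or "well-known" as a single word; that would need a separate rule for embedded apostrophes and hyphens, as in the second version mentioned above.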
 
----
 
"The n most common words" is only meaningful where there are exactly n words with the highest numbers of instances. --[[User:Nig|Nig]] ([[User talk:Nig|talk]]) 10:06, 29 March 2020 (UTC)
 
: I take it to mean ''the N (top-most) common words''. -- [[User:Gerard Schildberger|Gerard Schildberger]] ([[User talk:Gerard Schildberger|talk]]) 10:23, 29 March 2020 (UTC)
 
:: But when the nth most common word ties with others such that there's a choice of more than n qualifying words, either more than n have to be returned or some have to be arbitrarily left out. --[[User:Nig|Nig]] ([[User talk:Nig|talk]]) 10:41, 29 March 2020 (UTC)
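
The two options in the tie case can be sketched in Python as follows. The function names top_n_with_ties and top_n_truncated are invented for this illustration and the task does not mandate either behaviour; the sketch only assumes that the input is already a list of normalised words.

```python
from collections import Counter

def top_n_with_ties(words, n):
    """Return every (word, count) pair whose count reaches the n-th highest count."""
    ranked = Counter(words).most_common()  # sorted by count, descending
    if n <= 0 or not ranked:
        return []
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # count of the n-th ranked word
    return [(word, count) for word, count in ranked if count >= cutoff]

def top_n_truncated(words, n):
    """Return exactly n pairs (or fewer); which tied words survive is arbitrary."""
    return Counter(words).most_common(n)

words = "a a a b b c c d".split()
print(top_n_with_ties(words, 2))
print(top_n_truncated(words, 2))
```

With this input, "b" and "c" tie for second place, so the first function returns three entries while the second drops one of the tied words.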
 
==Using Microsoft Word 2010 to count words==
Line 1,131 ⟶ 1,139:
 
:::::::I agree that freedom to choose is the best, so I'm going with your suggestion, Tigerofdarkness. Thanks! --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:27, 31 August 2017 (UTC)
 
==Code Golf mention==
*[https://codegolf.stackexchange.com/questions/188133/bentleys-coding-challenge-k-most-frequent-words Bentley's coding challenge: k most frequent words] at Code Golf Stack Exchange mentions this task.