Talk:Word frequency: Difference between revisions

 
==Note from original author==
When in doubt, assume you have the freedom to define the requirements as whatever you feel is most appropriate in your language of choice. --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:31, 31 August 2017 (UTC)
 
==why entered as a ''task'' instead of ''draft task''?==
Why was this entry entered as a   ''task''   instead of a   ''draft task''?   -- [[User:Gerard Schildberger|Gerard Schildberger]] ([[User talk:Gerard Schildberger|talk]]) 03:08, 16 August 2017 (UTC)
==task clarification==
I assume we are to code programs to handle the general case, not just the file specified/mandated to be used as a test case.
:True. Originally, I suggested a specific text file. I've taken that off and now leave it to the example writer. --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:32, 31 August 2017 (UTC)
 
What is a "word"?
:::Really, the only thing at all in question to my mind is, is underscore a letter or not? On the face of it, it seems clear; no, of course not. So should the word "_The" in the text be counted as "_the" or "the"? The \w assertion in PCRE (which most languages use directly or emulate) includes underscore for historical reasons, so "_the" and "the" are counted as different words. On the other hand, does the word "Alèthe" contain any non-letter characters? It is awfully narrow-minded to insist that "if you can't fit it in 7 bits, it ain't a character." That being said, I think disregarding hyphenated words and contractions with embedded apostrophes when counting words is ridiculous too so I added a second version which accounts for them, but it doesn't meet the task requirements as written (hence it being a ''second'' version). --[[User:Thundergnat|Thundergnat]] ([[User talk:Thundergnat|talk]]) 23:44, 18 August 2017 (UTC)
::::For the purposes of this task, which requires the top 10 words, defining a word as [A-z0-9À-ÿ]+ works well and gives the same answers as Perl6 (41088 for the and 14596 for a). Obviously Nigel's is one word, but it would require a lot of possession to promote the single character s into the top 10 words.--[[User:Nigel Galloway|Nigel Galloway]] ([[User talk:Nigel Galloway|talk]]) 13:23, 19 August 2017 (UTC)
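
The underscore question above can be illustrated with a minimal Python sketch. The sample sentence and the function names <code>tokens_w</code> and <code>tokens_letters</code> are made up for illustration and are not taken from any example on the task page. It assumes Python 3, where <code>\w</code> on a <code>str</code> pattern is Unicode-aware and includes the underscore. Note also that a range written as <code>A-z</code> spans the ASCII characters between <code>Z</code> and <code>a</code>, which include the underscore.

```python
import re

def tokens_w(text):
    # \w+ treats letters, digits and underscore as word characters,
    # so "_the" stays distinct from "the".
    return re.findall(r"\w+", text.lower())

def tokens_letters(text):
    # [^\W\d_] is "word characters minus decimal digits and underscore",
    # which is approximately "Unicode letters" (an approximation, not
    # an exact match for the Unicode Letter category).
    return re.findall(r"[^\W\d_]+", text.lower())

# Hypothetical sample text, chosen only to show the difference.
sample = "_The cat saw the Al\u00e8the; the cat's hat."

print(tokens_w(sample).count("the"))        # "_the" is counted separately
print(tokens_letters(sample).count("the"))  # leading underscore is dropped
```

Under either definition the accented word survives intact and the contraction is split at the apostrophe, so the only difference on this sample is how the leading underscore is handled.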
 
----
 
"The n most common words" is only meaningful where there are exactly n words with the highest numbers of instances. --[[User:Nig|Nig]] ([[User talk:Nig|talk]]) 10:06, 29 March 2020 (UTC)
 
: I take it to mean   ''the N (top-most) common words''.     -- [[User:Gerard Schildberger|Gerard Schildberger]] ([[User talk:Gerard Schildberger|talk]]) 10:23, 29 March 2020 (UTC)
 
:: But when the nth most common word ties with others such that there's a choice of more than n qualifying words, either more than n have to be returned or some have to be arbitrarily left out. --[[User:Nig|Nig]] ([[User talk:Nig|talk]]) 10:41, 29 March 2020 (UTC)
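
One possible way to handle the tie case described above is to return more than n words whenever the nth count is shared. This is only a sketch of that one choice, not what the task requires; the function name <code>top_n_with_ties</code> and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter

def top_n_with_ties(words, n):
    """Return (word, count) pairs for the n most common words,
    plus every further word whose count equals the nth count."""
    counts = Counter(words)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if n <= 0 or not ranked:
        return []
    if n >= len(ranked):
        return ranked
    cutoff = ranked[n - 1][1]  # count of the nth ranked word
    return [(w, c) for w, c in ranked if c >= cutoff]

# Hypothetical input: "b" and "c" tie for second place.
print(top_n_with_ties("a a a b b c c d".split(), 2))
```

The alternative, truncating to exactly n, needs some documented tie-break rule so that the omitted words are not chosen arbitrarily.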
 
==Using Microsoft Word 2010 to count words==
:FWIW, The article cited is not free on the ACM website, but it is free from the Princeton CS (Donald Knuth) site. Just type "Programming pearls: a literate program" into Google and press "I'm feeling lucky".
:Also, that UNIX shell script example is already on the task page. It was one of the first ones added. --[[User:Thundergnat|Thundergnat]] ([[User talk:Thundergnat|talk]]) 17:41, 23 August 2017 (UTC)
 
::Thanks for the reference. The Unix example on the page does not acknowledge that it is McIlroy's solution. Reading Knuth's version it is clear that it would also have given 41089 as the answer. My point is that the task description relies on these two articles, both of which return 41089 as the answer when applied to the mandated test input, and examples in Clojure and Python which both give 41036 as the answer. I think that the answer should be 41088. This can be explained:
:::The Python and Clojure examples are wrong;
:::The references were never designed to be run using Unicode, which apparently traces its origins to 1987, but which I don't think was widely used before the late 90's.
:::The task author has never run the cited references using the mandated input.
::::The task's description has been updated so as to give freedom for example writers. McIlroy's solution has been more explicitly acknowledged as having come from the cited article in the history. --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:42, 31 August 2017 (UTC)
::The original author does not seem to be taking any further interest in this task. Perhaps you would like to update the description and mark the Clojure and Python examples as wrong to resolve this.--[[User:Nigel Galloway|Nigel Galloway]] ([[User talk:Nigel Galloway|talk]]) 13:41, 25 August 2017 (UTC)
:::The original author apologizes profusely for not elaborating sooner. I've updated the description, and I think I've addressed each concern. --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:38, 31 August 2017 (UTC)
 
::Knuth's paper clarifies another issue: 'Let us agree that a word is a sequence of one or more contiguous letters; “Bentley” is a word, but ain't isn’t'.--[[User:Nigel Galloway|Nigel Galloway]] ([[User talk:Nigel Galloway|talk]]) 14:26, 25 August 2017 (UTC)
 
::: That's a lot of good detective work, but if the task leaves the definition of what a word is up to the example writer, then the Python can't be wrong. --[[User:Paddy3118|Paddy3118]] ([[User talk:Paddy3118|talk]]) 17:38, 25 August 2017 (UTC)
 
::::The task defines a word as a sequence of contiguous letters (as in McIlroy's solution) without defining what a letter is. How about leaving it up to the sample writer what a letter is? Samples could then use the Unicode definition or the ASCII definition (or even some other character set) as convenient? --[[User:Tigerofdarkness|Tigerofdarkness]] ([[User talk:Tigerofdarkness|talk]]) 18:09, 25 August 2017 (UTC)
 
:::::One could, for a laugh, but not seriously. For a laugh I asked MS Word to open the mandated input using US-ASCII. It then thinks the book is Les MisC)rables. Knuth defined the task assuming it was going to read US-ASCII, and clearly defines what a letter is in that context. It makes no sense to write a task for US-ASCII (e.g. Unix on the task page) and then run it on an example in UTF-8. Obviously an alternative is to mandate an example written in US-ASCII. --[[User:Nigel Galloway|Nigel Galloway]] ([[User talk:Nigel Galloway|talk]]) 10:32, 26 August 2017 (UTC)
 
::::::The task as currently defined does not specify what a letter is. I suggested a definition that would allow both the "classic" (pre-RC) solutions and the new Python etc. samples to be accepted. --[[User:Tigerofdarkness|Tigerofdarkness]] ([[User talk:Tigerofdarkness|talk]]) 12:31, 26 August 2017 (UTC)
 
:::::::I agree that freedom to choose is the best, so I'm going with your suggestion, Tigerofdarkness. Thanks! --[[User:Kentros|Kentros]] ([[User talk:Kentros|talk]]) 01:27, 31 August 2017 (UTC)
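
As a rough illustration of why the choice of letter definition changes the counts, here is a small Python sketch comparing an ASCII-only definition with a Unicode-aware one. The pattern names and the helper <code>split_words</code> are hypothetical, and the Unicode pattern is only an approximation of "letter".

```python
import re

# ASCII definition: only the 52 unaccented Latin letters count.
ASCII_LETTERS = re.compile(r"[A-Za-z]+")

# Unicode-aware approximation: word characters minus digits and underscore.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

def split_words(text, pattern):
    """Return the lower-cased words found in text under the given definition."""
    return [w.lower() for w in pattern.findall(text)]

title = "Les Mis\u00e9rables"
print(split_words(title, ASCII_LETTERS))    # accented letter acts as a separator
print(split_words(title, UNICODE_LETTERS))  # accented letter is part of the word
```

With the ASCII definition the accented letter breaks the title word into two fragments, which is one way totals can differ between samples run on the same UTF-8 text.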
 
==Code Golf mention==
*[https://codegolf.stackexchange.com/questions/188133/bentleys-coding-challenge-k-most-frequent-words Bentley's coding challenge: k most frequent words] at Code Golf Stack Exchange mentions this task.