Word frequency: Difference between revisions

m added highlighting and whitespace, added periods to some end-of-sentences in the task preamble.
m favored highlighting and whitespace over the use of double-quoted text, split a series of directives into separate bullets,
Given a text file and an integer   '''n''',   print/display the   '''n'''   most
common words in the file   (and the number of their occurrences)   in decreasing frequency.



For the purposes of this task:
* A word is a sequence of one or more contiguous letters.
*   A word is a sequence of one or more contiguous letters.
*   You are free to define what a   ''letter''   is.
* You are free to define what a letter is.   Underscores, accented letters, apostrophes, and other special characters can be handled at the example writer's discretion.   For example, you may treat a compound word like "well-dressed" as either one word or two.   The word "it's" could also be one or two words as you see fit.   You may also choose not to support non US-ASCII characters.   Feel free to explicitly state the thoughts behind the program decisions.
*   Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
* Assume words will not span multiple lines.
*   You may treat a compound word like   '''well-dressed'''   as either one word or two.
* Do not worry about normalization of word spelling differences.   Treat "color" and "colour" as two distinct words.
*   The word   '''it's'''   could also be one or two words as you see fit.
* Uppercase letters are considered equivalent to their lowercase counterparts.
*   You may also choose not to support non US-ASCII characters.
* Words of equal frequency can be listed in any order.
*   Assume words will not span multiple lines.
*   Don't worry about normalization of word spelling differences.
*   Treat   '''color'''   and   '''colour'''   as two distinct words.
*   Uppercase letters are considered equivalent to their lowercase counterparts.
*   Words of equal frequency can be listed in any order.
*   Feel free to explicitly state the thoughts behind the program decisions.




Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top 10 most used words.
Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top   '''10'''   most used words.
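
The rules above can be sketched in Python as follows. This is only one possible reading of the task, not a reference solution: it assumes a letter is an ASCII character a–z, so '''it's''' and '''well-dressed''' each count as two words and accented letters act as separators. The function name <code>top_words</code> is made up for this illustration.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs."""
    # Assumption: a "letter" is ASCII a-z only. Lowercasing first makes
    # uppercase letters equivalent to their lowercase counterparts.
    words = re.findall(r"[a-z]+", text.lower())
    # most_common sorts by decreasing count; the task allows words of
    # equal frequency to appear in any order.
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "The cat and the hat. The END, the end."
    for word, count in top_words(sample, 2):
        print(f"{word:<10}{count}")
```

To apply it to a file, read the whole text first, for example with <code>open(path, encoding="utf-8").read()</code>, and pass the result along with '''n'''. A different definition of a letter only requires changing the regular expression.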