Word frequency
- Task
Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.
For the purposes of this task:
- A word is a sequence of one or more contiguous letters
- Uppercase letters are considered equivalent to their lowercase counterparts
- Words of equal frequency can be listed in any order
Show example output using Les Misérables from Project Gutenberg
as the text file input, and display the 10 most frequently used words.
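The rules above can be sketched on a small in-memory string before applying them to a whole file. This is a minimal illustration, not a reference solution: the names `WORD_RE`, `top_words` and `sample` are hypothetical, and it assumes that "letter" means any Unicode letter, so that "Misérables" counts as a single word.

```python
import re
from collections import Counter

# Runs of letters only: \w with digits and the underscore excluded.
# Assumption: any Unicode letter counts as a letter.
WORD_RE = re.compile(r"[^\W\d_]+")

def top_words(text, n):
    """Return the n most common (word, count) pairs in text."""
    # Lowercasing first makes "The" and "the" the same word.
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common(n)

sample = "The cat saw the dog. The dog's owner, aged 42, saw the dog!"
print(top_words(sample, 3))
# prints [('the', 4), ('dog', 3), ('saw', 2)]
```

Note that "42" is not counted, because digits are not letters, and that "dog's" contributes the two words "dog" and "s".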
- History
This task was originally taken from the "Programming Pearls" column in Communications of the ACM, June 1986, Volume 29, Number 6,
in which Donald Knuth solves the problem using literate programming and Doug McIlroy then critiques the solution,
demonstrating how to solve the problem with a six-line Unix shell script.
Clojure
<lang clojure>(defn count-words [file n]
  (->> file slurp clojure.string/lower-case (re-seq #"\w+") frequencies (sort-by val >) (take n)))</lang>
- Output:
user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939] ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])
Python
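<lang python>import collections
import re
import sys

def main():
    # Read the file named by the first argument and lowercase its contents.
    with open(sys.argv[1]) as f:
        text = f.read().lower()
    # Note: \w+ also matches digits and underscores, not only letters.
    counter = collections.Counter(re.findall(r"\w+", text))
    # Print the n most common words, where n is the second argument.
    print(counter.most_common(int(sys.argv[2])))

if __name__ == "__main__":
    main()</lang>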
- Output:
$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939), ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]
UNIX Shell
<lang bash>#!/bin/sh
cat "${1}" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${2}q"</lang>
- Output:
$ ./wordcount.sh 135-0.txt 10
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it