Word frequency

From Rosetta Code
Revision as of 22:13, 14 August 2017 by rosettacode>Kentros (Adds word count task)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Task
Word frequency
You are encouraged to solve this task according to the task description, using any language you may know.
Task

Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

  • A word is a sequence of one or more contiguous letters
  • Uppercase letters are considered equivalent to their lowercase counterparts
  • Words of equal frequency can be listed in any order


Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.

History

This task was originally taken from programming pearls from Communications of the ACM June 1986 Volume 29 Number 6 where this problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy, demonstrating solving the problem in a 6 line Unix shell script.

Clojure

<lang clojure>(defn count-words [file n]

 (->> file
   slurp
   clojure.string/lower-case
   (re-seq #"\w+")
   frequencies
   (sort-by val >)
   (take n)))</lang>
Output:
user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939]
 ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])

Python

<lang python>import collections import re import string import sys

def main():

 counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
 print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":

 main()</lang>
Output:
$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939),
 ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

UNIX Shell

Works with: Bash
Works with: zsh

<lang bash>#!/bin/sh cat ${1} | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${2}q</lang>


Output:
$ ./wordcount.sh 135-0.txt 10 
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it