Word count

Word count is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
Task

Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

  • A word is a sequence of one or more contiguous letters
  • You are free to define what a letter is. Underscores, accented letters, apostrophes, and other special characters can be handled at the example writer's discretion. For example, you may treat a compound word like "well-dressed" as either one word or two. The word "it's" could also be one or two words as you see fit. You may also choose not to support non US-ASCII characters. Feel free to explicitly state the thoughts behind the program decisions.
  • Assume words will not span multiple lines.
  • Do not worry about normalization of word spelling differences. Treat "color" and "colour" as two distinct words.
  • Uppercase letters are considered equivalent to their lowercase counterparts
  • Words of equal frequency can be listed in any order

Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.
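Whatever the language, the task reduces to four steps: extract words, fold case, count occurrences, and sort by descending count. A minimal sketch of that pipeline in Python follows (it treats a word as a run of ASCII letters, one of the permitted simplifications above; the function name is our own):

```python
import collections
import re

def top_words(path, n):
    """Return the n most common words in the file at `path`, with counts."""
    with open(path, encoding="utf-8") as f:
        # A "word" here is a maximal run of ASCII letters, folded to lower case.
        words = re.findall(r"[a-z]+", f.read().lower())
    # Counter.most_common sorts by descending count; ties keep insertion
    # order, which the task explicitly allows.
    return collections.Counter(words).most_common(n)
```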


History

This task was originally taken from the Programming Pearls column in Communications of the ACM, June 1986, Volume 29, Number 6, where the problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy, who demonstrates a solution to the same problem in a six-line Unix shell script (provided as an example below).

Clojure

(defn count-words [file n]
  (->> file
       slurp
       clojure.string/lower-case
       (re-seq #"\w+")
       frequencies
       (sort-by val >)
       (take n)))
Output:
user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939]
 ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])


F#

 
open System.IO
open System.Text.RegularExpressions
let g = Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]
|> List.countBy id
|> List.sortBy (fun n -> -(snd n))
|> List.take 10
|> List.iter (fun n -> printfn "%A" n)
 
Output:
("the", 41088)
("of", 19949)
("and", 14942)
("a", 14596)
("to", 13951)
("in", 11214)
("he", 9648)
("was", 8621)
("that", 7924)
("it", 6661)

Factor

This program reads from stdin, redirected from a file on the command line (e.g., invoking the program on Windows: >factor word-count.factor < input.txt). The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.

 
USING: ascii io math.statistics prettyprint sequences
splitting ;
IN: rosetta-code.word-count
 
lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram reverse 10 head .
 
Output:
{
    { "the" 41021 }
    { "of" 19945 }
    { "and" 14938 }
    { "a" 14522 }
    { "to" 13938 }
    { "in" 11201 }
    { "he" 9600 }
    { "was" 8618 }
    { "that" 7822 }
    { "it" 6532 }
}

Julia

# v0.6
 
using FreqTables
 
txt = readstring("les-mis.txt")
words = split(replace(txt, r"\P{L}"i, " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])
Output:
10-element Named Array{Int64,1}
Dim1   │
───────┼──────
"the"  │ 36671
"of"   │ 19618
"and"  │ 14081
"to"   │ 13541
"a"    │ 13529
"in"   │ 10265
"was"  │  8545
"that" │  7326
"he"   │  6816
"had"  │  6140

Kotlin

The author of the Perl 6 entry has given a good account of the difficulties with this task and, in the absence of any clarification on the various issues, I've followed a similar 'literal' approach.

So, after first converting the text to lower case, I've assumed that a word is any sequence of one or more lower-case Unicode letters and obtained the same results as the Perl 6 version.

There is no change in the results if the numerals 0-9 are also regarded as letters.

// version 1.1.3
 
import java.io.File
 
fun main(args: Array<String>) {
    val text = File("135-0.txt").readText().toLowerCase()
    val r = Regex("""\p{javaLowerCase}+""")
    val matches = r.findAll(text)
    val wordGroups = matches.map { it.value }
                            .groupBy { it }
                            .map { Pair(it.key, it.value.size) }
                            .sortedByDescending { it.second }
                            .take(10)
    println("Rank Word Frequency")
    println("==== ==== =========")
    var rank = 1
    for ((word, freq) in wordGroups)
        System.out.printf("%2d  %-4s  %5d\n", rank++, word, freq)
}
Output:
Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

Objeck

use System.IO.File;
use Collection;
use RegEx;
 
class Rosetta {
  function : Main(args : String[]) ~ Nil {
    if(args->Size() <> 1) {
      return;
    };

    input := FileReader->ReadFile(args[0]);
    filter := RegEx->New("\\w+");
    words := filter->Find(input);

    word_counts := StringMap->New();
    each(i : words) {
      word := words->Get(i)->As(String);
      if(word <> Nil & word->Size() > 0) {
        word := word->ToLower();
        if(word_counts->Has(word)) {
          count := word_counts->Find(word)->As(IntHolder);
          count->Set(count->Get() + 1);
        }
        else {
          word_counts->Insert(word, IntHolder->New(1));
        };
      };
    };

    count_words := IntMap->New();
    words := word_counts->GetKeys();
    each(i : words) {
      word := words->Get(i)->As(String);
      count := word_counts->Find(word)->As(IntHolder);
      count_words->Insert(count->Get(), word);
    };

    counts := count_words->GetKeys();
    counts->Sort();

    index := 1;
    "Rank\tWord\tFrequency"->PrintLine();
    "====\t====\t===="->PrintLine();
    for(i := count_words->Size() - 1; i >= 0; i -= 1;) {
      if(count_words->Size() - 10 <= i) {
        count := counts->Get(i);
        word := count_words->Find(count)->As(String);
        "{$index}\t{$word}\t{$count}"->PrintLine();
        index += 1;
      };
    };
  }
}

Output:

Rank    Word    Frequency
====    ====    ====
1       the     41036
2       of      19946
3       and     14940
4       a       14589
5       to      13939
6       in      11204
7       he      9645
8       was     8619
9       that    7922
10      it      6659

Perl 6

Works with: Rakudo version 2017.07

Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.

This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is full of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.

We might try /[A-Za-z]/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those are letters, so words will be incorrectly split up; Misérables, for example, would be counted as 'mis' and 'rables', which is probably not what we want.

Actually, in this case /[A-Za-z]/ returns very nearly the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the". The text has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça", so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /[A-Za-z]/, but it is mostly by accident.

A more accurate regex matcher would be some kind of Unicode aware /\w/ minus underscore. It may also be useful, depending on your requirements, to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.
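The underscore point is easy to demonstrate. This quick illustration uses Python rather than Perl 6, since its re module shows the same effect: \w+ keeps Gutenberg-style underscore markup attached to words, while a "word characters minus digits and underscore" class keeps only letters (the sample string is invented):

```python
import re

sample = "the _italicized_ name Misérables"
# \w+ includes the underscore, so markup sticks to the word:
print(re.findall(r"\w+", sample))        # ['the', '_italicized_', 'name', 'Misérables']
# [^\W\d_] is Unicode \w minus digits and underscore, i.e. letters only:
print(re.findall(r"[^\W\d_]+", sample))  # ['the', 'italicized', 'name', 'Misérables']
```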

Here is a sample that shows the result when using various different matchers.

sub MAIN ($filename, $top = 10) {
    my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
    my @matcher = (
        rx/ <[a..z]>+ /,       # simple 7-bit ASCII
        rx/ \w+ /,             # word characters with underscore
        rx/ <[\w]-[_]>+ /,     # word characters without underscore
        rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* / # word characters without underscore but with hyphens and contractions
    );
    for @matcher -> $reg {
        say "\nTop $top using regex: ", $reg.perl;
        .put for $file.comb( $reg ).Bag.sort(-*.value)[^$top];
    }
}
Output:

Passing in the file name and 10:

Top 10 using regex: rx/ <[a..z]>+ /
the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ \w+ /
the	41035
of	19946
and	14940
a	14577
to	13939
in	11204
he	9645
was	8619
that	7922
it	6659

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088
of	19949
and	14942
a	14596
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081
of	19930
and	14934
a	14587
to	13735
in	11204
he	9607
was	8620
that	7825
it	6535

Phix

?"loading..."
constant subs = "\t\r\n_.,\"\'!;:?][()|=<>#/*{}@%&$",
         reps = repeat(' ',length(subs)),
         fn = open("135-0.txt","r")
string text = lower(substitute_all(get_text(fn),subs,reps))
close(fn)
sequence words = append(sort(split(text,no_empty:=true)),"")
constant wf = new_dict()
string last = words[1]
integer count = 1
for i=2 to length(words) do
    if words[i]!=last then
        setd({count,last},0,wf)
        count = 0
        last = words[i]
    end if
    count += 1
end for
count = 10
function visitor(object key, object /*data*/, object /*user_data*/)
    ?key
    count -= 1
    return count>0
end function
traverse_dict(routine_id("visitor"),0,wf,true)
Output:
loading...
{40743,"the"}
{19925,"of"}
{14881,"and"}
{14474,"a"}
{13704,"to"}
{11174,"in"}
{9623,"he"}
{8613,"was"}
{7867,"that"}
{6612,"it"}

PicoLisp

(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))
 
(de word+ NIL
   (prog1
      (lowc (till *Delim T))
      (while (member (peek) *Skip) (char)) ) )

(off B)
(in "135-0.txt"
   (until (eof)
      (let W (word+)
         (if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
   (println L (val L)) )
Output:
"the" 41088
"of" 19949
"and" 14942
"a" 14545
"to" 13950
"in" 11214
"he" 9647
"was" 8620
"that" 7924
"it" 6661

Python

Python2.7

import collections
import re
import string
import sys
 
def main():
    counter = collections.Counter(re.findall(r"\w+", open(sys.argv[1]).read().lower()))
    print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":
    main()
Output:
$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939),
 ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

Python3.6

from collections import Counter
from re import findall
 
les_mis_file = 'les_mis_135-0.txt'
 
def _count_words(fname):
    with open(fname) as f:
        text = f.read()
    words = findall(r'\w+', text.lower())
    return Counter(words)

def most_common_words_in_file(fname, n):
    counts = _count_words(fname)
    for word, count in [['WORD', 'COUNT']] + counts.most_common(n):
        print(f'{word:>10} {count:>6}')

if __name__ == "__main__":
    n = int(input('How many?: '))
    most_common_words_in_file(les_mis_file, n)
Output:
How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659

Racket

#lang racket
 
(define (all-words f (case-fold string-downcase))
  (map case-fold (regexp-match* #px"\\w+" (file->string f))))

(define (l.|l| l) (cons (car l) (length l)))

(define (counts l (>? >)) (sort (map l.|l| (group-by values l)) >? #:key cdr))

(module+ main
  (take (counts (all-words "data/les-mis.txt")) 10))
Output:
'(("the" . 41036)
  ("of" . 19946)
  ("and" . 14940)
  ("a" . 14589)
  ("to" . 13939)
  ("in" . 11204)
  ("he" . 9645)
  ("was" . 8619)
  ("that" . 7922)
  ("it" . 6659))

REXX

version 1

This REXX version doesn't need to sort the list of words.

Currently, this version recognizes all the accented (non-Latin) letters that are present in the specified text file (and some other non-Latin letters as well). This means that the word Alèthe is treated as one word rather than as the two words Al and the.

This version also supports words that contain embedded apostrophes (') [that is, within a word, but not words that start or end with an apostrophe; for those words, the apostrophe is elided].

Thus,   it's   is counted separately from   it   or   its.

Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file.

/*REXX pgm displays top 10 words in a file (includes foreign letters),  case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.TXT' /*None specified? Then use the default.*/
if top=='' | top=="," then top= 10 /* " " " " " " */
@.=0; c=0; abcL="abcdefghijklmnopqrstuvwxyz'" /*initialize word list, count; alphabet*/
q= "'"; abcU= abcL; upper abcU /*define uppercase version of alphabet*/
totW=0; accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /* " " of some accented chrs*/
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " lowercase accented characters.*/
accG= 'αßΓπΣσµτΦΘΩδφε' /* " some upper/lower Greek letters*/
a=abcL || abcL ||accL ||accL || accG /* " char string of after letters.*/
b=abcL || abcU ||accL ||accU || accG || xrange() /* " char string of before " */
x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü' /*list of 16-bit chars.*/
xs= words(x) /*num. " " " */
!.= /*define the original word instances. */
do #=0 while lines(fID)\==0; $=linein(fID) /*loop whilst there are lines in file. */
if pos('├', $)\==0 then do k=1 for xs; _=word(x, k) /*any 16-bit chars? */
$=changestr('├'left(_, 1), $, right(_, 1) ) /*convert.*/
end /*k*/
$=translate( $, a, b) /*remove superfluous blanks in the line*/
do while $\=''; parse var $ z $ /*now, process each word in the $ list.*/
parse var z z1 2 zr '' -1 zL /*extract: first, middle, & last char.*/
if z1==q then do; z=zr; if z=='' then iterate; end /*starts with apostrophe? */
if zL==q then z=strip(left(z, length(z) - 1)) /*ends " " */
if z=='' then iterate /*if Z is now null, skip.*/
if @.z==0 then do; c=c+1; !.c=z; end /*bump word count; assign word to array*/
totW=totW + 1; @.[email protected].z + 1 /*bump total words & count of the word.*/
end /*while*/
end /*#*/
say commas(totW) ' words found ('commas(c) "unique) in " commas(#),
' records read from file: ' fID; say
say right('word', 40) " " center(' rank ', 6) " count " /*display title for output*/
say right('════', 40) " " center('══════', 6) " ═══════" /* " title separator.*/
tops=1
do until otops==tops | tops>top /*process enough words to satisfy TOP.*/
WL=; mk=0; otops=tops /*initialize the word list (to a NULL).*/
do n=1 for c; z=!.n; [email protected].z /*process the list of words in the file*/
if k==mk then WL=WL z /*handle cases of tied number of words.*/
if k> mk then do; mk=k; WL=z; end /*this word count is the current max. */
end /*n*/
wr=max( length(' rank '), length(top) ) /*find the maximum length of the rank #*/
do d=1 for words(WL); _=word(WL, d) /*process all words in the word list. */
if d==1 then w=max(10, length(@._) ) /*use length of the first number used. */
say right(@._, 40) right(commas(tops), wr) right(commas(@._), w)
@._= -1 /*nullify word count for next go around*/
end /*d*/ /* [↑] this allows a non-sorted list. */
tops=tops + words(WL) /*correctly handle any tied rankings.*/
end /*until*/
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: procedure; parse arg _; n=_'.9'; #=123456789; b=verify(n, #, "M")
e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4
do j=e to b by -3; _=insert(',', _, j); end /*j*/; return _
output   when using the default inputs:
574,122  words found  (23,414 unique)  in  67,663  records read from file:  les_mes.TXT

                                    word    rank    count
                                    ════   ══════  ═══════
                                     the      1     41,088
                                      of      2     19,949
                                     and      3     14,942
                                       a      4     14,595
                                      to      5     13,950
                                      in      6     11,214
                                      he      7      9,607
                                     was      8      8,620
                                    that      9      7,826
                                      it     10      6,535

To see a list of the top 1,000 words that show (among other things) words like   it's   and other accented words, see the discussion page.

version 2

Inspired by version 1 and adapted for ooRexx. It ignores all characters other than a-z and A-Z (which are translated to a-z).

/*REXX program reads and displays a count of words in a file. Word case is ignored. */
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
abcABC=abc||translate(abc)
parse arg fID_top                      /*obtain optional arguments from the CL*/
Parse Var fid_top fid ',' top
if fID=='' then fID= 'mis.TXT'         /* Use default if not specified        */
if top=='' then top= 10                /* Use default if not specified        */
occ.=0                                 /* occurrences of word (stem) in file  */
wn=0
Do While lines(fid)>0                  /*loop whilst there are lines in file. */
  line=linein(fID)
  line=translate(line,abc||abc,abcABC||xrange('00'x,'ff'x)) /*use only lowercase letters*/
  Do While line<>''
    Parse Var line word line           /* take a word                         */
    If occ.word=0 Then Do              /* not yet in word list                */
      wn=wn+1
      word.wn=word
      End
    occ.word=occ.word+1
    End
  End
Say 'We found' wn 'different words'
say right('word',40) ' rank count '    /* header                              */
say right('----',40) '------ -------'  /* separator                           */
tops=0
Do Until tops>=top | tops>=wn          /*process enough words to satisfy TOP. */
  max_occ=0
  tl=''                                /*initialize (possibly) a list of words*/
  Do wi=1 To wn                        /*process the list of words in the file*/
    word=word.wi                       /* take a word from the list           */
    Select
      When occ.word>max_occ Then Do    /* most occurrences so far             */
        tl=word                        /* candidate for output                */
        max_occ=occ.word               /* current maximum occurrences         */
        End
      When occ.word=max_occ Then Do    /* tied                                */
        tl=tl word                     /* add to output candidate             */
        End
      Otherwise                        /* no candidate (yet)                  */
        Nop
      End
    End
  do d=1 for words(tl)
    word=word(tl,d)
    say right(word,40) right(tops+1,4) right(occ.word,8)
    occ.word=0                         /*nullify this word count for next time*/
    End
  tops=tops+words(tl)                  /*correctly handle the tied rankings.  */
  end
Say time('E') 'seconds elapsed'
Output:
We found 22820 different words
                                    word  rank   count
                                    ---- ------ -------
                                     the    1    41089
                                      of    2    19949
                                     and    3    14942
                                       a    4    14608
                                      to    5    13951
                                      in    6    11214
                                      he    7     9648
                                     was    8     8621
                                    that    9     7924
                                      it   10     6661
1.750000 seconds elapsed

Ruby

 
class String
  def wc
    n = Hash.new(0)
    downcase.scan(/[a-zà-ÿ]+/) { |w| n[w] += 1 }
    n.sort { |a, b| a[1] <=> b[1] }
  end
end

open('135-0.txt') { |f| f.read.wc[-10, 10].each { |pair| puts pair[0].to_s + "->" + pair[1].to_s } }
 
Output:
it->6661
that->7924
was->8621
he->9648
in->11214
to->13951
a->14596
and->14942
of->19949
the->41088

Sidef

var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')
 
file.open_r.each { |line|
    line.lc.scan(/[\pL]+/).each { |word|
        count{word} := 0 ++
    }
}

var top = count.sort_by {|_,v| v }.last(10).flip

top.each { |pair|
    say "#{pair.key}\t-> #{pair.value}"
}
Output:
the	-> 41088
of	-> 19949
and	-> 14942
a	-> 14596
to	-> 13951
in	-> 11214
he	-> 9648
was	-> 8621
that	-> 7924
it	-> 6661

Simula

COMMENT COMPILE WITH
  $ cim -m64 word-count.sim
;
BEGIN

    COMMENT ----- CLASSES FOR GENERAL USE ;

    ! ABSTRACT HASH KEY TYPE ;
    CLASS HASHKEY;
    VIRTUAL:
        PROCEDURE HASH IS
            INTEGER PROCEDURE HASH;;
        PROCEDURE EQUALTO IS
            BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;;
    BEGIN
    END HASHKEY;

    ! ABSTRACT HASH VALUE TYPE ;
    CLASS HASHVAL;
    BEGIN
        ! THERE IS NOTHING REQUIRED FOR THE VALUE TYPE ;
    END HASHVAL;

    CLASS HASHMAP;
    BEGIN
        CLASS INNERHASHMAP(N); INTEGER N;
        BEGIN

            INTEGER PROCEDURE INDEX(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.INDEX: NONE IS NOT A VALID KEY");
                I := MOD(K.HASH,N);
            LOOP:
                IF KEYTABLE(I) == NONE OR ELSE KEYTABLE(I).EQUALTO(K) THEN
                    INDEX := I
                ELSE BEGIN
                    I := IF I+1 = N THEN 0 ELSE I+1;
                    GO TO LOOP;
                END;
            END INDEX;

            ! PUT SOMETHING IN ;
            PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
            BEGIN
                INTEGER I;
                IF V == NONE THEN
                    ERROR("HASHMAP.PUT: NONE IS NOT A VALID VALUE");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN BEGIN
                    IF SIZE = N THEN
                        ERROR("HASHMAP.PUT: TABLE FILLED COMPLETELY");
                    KEYTABLE(I) :- K;
                    VALTABLE(I) :- V;
                    SIZE := SIZE+1;
                END ELSE
                    VALTABLE(I) :- V;
            END PUT;

            ! GET SOMETHING OUT ;
            REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.GET: NONE IS NOT A VALID KEY");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN
                    GET :- NONE ! ERROR("HASHMAP.GET: KEY NOT FOUND");
                ELSE
                    GET :- VALTABLE(I);
            END GET;

            PROCEDURE CLEAR;
            BEGIN
                INTEGER I;
                FOR I := 0 STEP 1 UNTIL N-1 DO BEGIN
                    KEYTABLE(I) :- NONE;
                    VALTABLE(I) :- NONE;
                END;
                SIZE := 0;
            END CLEAR;

            ! DATA MEMBERS OF CLASS HASHMAP ;
            REF(HASHKEY) ARRAY KEYTABLE(0:N-1);
            REF(HASHVAL) ARRAY VALTABLE(0:N-1);
            INTEGER SIZE;

        END INNERHASHMAP;

        PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
        BEGIN
            IF IMAP.SIZE >= 0.75 * IMAP.N THEN
            BEGIN
                COMMENT RESIZE HASHMAP ;
                REF(INNERHASHMAP) NEWIMAP;
                REF(ITERATOR) IT;
                NEWIMAP :- NEW INNERHASHMAP(2 * IMAP.N);
                IT :- NEW ITERATOR(THIS HASHMAP);
                WHILE IT.MORE DO
                BEGIN
                    REF(HASHKEY) KEY;
                    KEY :- IT.NEXT;
                    NEWIMAP.PUT(KEY, IMAP.GET(KEY));
                END;
                IMAP.CLEAR;
                IMAP :- NEWIMAP;
            END;
            IMAP.PUT(K, V);
        END;

        REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            GET :- IMAP.GET(K);

        PROCEDURE CLEAR;
            IMAP.CLEAR;

        INTEGER PROCEDURE SIZE;
            SIZE := IMAP.SIZE;

        REF(INNERHASHMAP) IMAP;

        IMAP :- NEW INNERHASHMAP(16);
    END HASHMAP;

    CLASS ITERATOR(H); REF(HASHMAP) H;
    BEGIN
        INTEGER POS,KEYCOUNT;

        BOOLEAN PROCEDURE MORE;
            MORE := KEYCOUNT < H.SIZE;

        REF(HASHKEY) PROCEDURE NEXT;
        BEGIN
            INSPECT H DO
            INSPECT IMAP DO
            BEGIN
                WHILE KEYTABLE(POS) == NONE DO
                    POS := POS+1;
                NEXT :- KEYTABLE(POS);
                KEYCOUNT := KEYCOUNT+1;
                POS := POS+1;
            END;
        END NEXT;

    END ITERATOR;

    COMMENT ----- PROBLEM SPECIFIC CLASSES ;

    HASHKEY CLASS TEXTHASHKEY(T); VALUE T; TEXT T;
    BEGIN
        INTEGER PROCEDURE HASH;
        BEGIN
            INTEGER I;
            T.SETPOS(1);
            WHILE T.MORE DO
                I := 31*I+RANK(T.GETCHAR);
            HASH := I;
        END HASH;
        BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;
            EQUALTO := T = K QUA TEXTHASHKEY.T;
    END TEXTHASHKEY;

    HASHVAL CLASS COUNTER;
    BEGIN
        INTEGER COUNT;
    END COUNTER;

    REF(INFILE) INF;
    REF(HASHMAP) MAP;
    REF(TEXTHASHKEY) KEY;
    REF(COUNTER) VAL;
    REF(ITERATOR) IT;
    TEXT LINE, WORD;
    INTEGER I, J, MAXCOUNT, LINENO;
    INTEGER ARRAY MAXCOUNTS(1:10);
    REF(TEXTHASHKEY) ARRAY MAXWORDS(1:10);

    WORD :- BLANKS(1000);
    MAP :- NEW HASHMAP;

    COMMENT MAP WORDS TO COUNTERS ;

    INF :- NEW INFILE("135-0.txt");
    INF.OPEN(BLANKS(4096));
    WHILE NOT INF.LASTITEM DO
    BEGIN
        BOOLEAN INWORD;

        PROCEDURE SAVE;
        BEGIN
            IF WORD.POS > 1 THEN
            BEGIN
                KEY :- NEW TEXTHASHKEY(WORD.SUB(1, WORD.POS - 1));
                VAL :- MAP.GET(KEY);
                IF VAL == NONE THEN
                BEGIN
                    VAL :- NEW COUNTER;
                    MAP.PUT(KEY, VAL);
                END;
                VAL.COUNT := VAL.COUNT + 1;
                WORD := " ";
                WORD.SETPOS(1);
            END;
        END SAVE;

        LINENO := LINENO + 1;
        LINE :- COPY(INF.IMAGE).STRIP; INF.INIMAGE;

        COMMENT SEARCH WORDS IN LINE ;
        COMMENT A WORD IS ANY SEQUENCE OF LETTERS ;

        INWORD := FALSE;
        LINE.SETPOS(1);
        WHILE LINE.MORE DO
        BEGIN
            CHARACTER CH;
            CH := LINE.GETCHAR;
            IF CH >= 'a' AND CH <= 'z' THEN
                CH := CHAR(RANK(CH) - RANK('a') + RANK('A'));
            IF CH >= 'A' AND CH <= 'Z' THEN
            BEGIN
                IF NOT INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := TRUE;
                END;
                WORD.PUTCHAR(CH);
            END ELSE
            BEGIN
                IF INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := FALSE;
                END;
            END;
        END;
        SAVE; COMMENT LAST WORD ;
    END;
    INF.CLOSE;

    COMMENT FIND 10 MOST COMMON WORDS ;

    IT :- NEW ITERATOR(MAP);
    WHILE IT.MORE DO
    BEGIN
        KEY :- IT.NEXT;
        VAL :- MAP.GET(KEY);
        FOR I := 1 STEP 1 UNTIL 10 DO
        BEGIN
            IF VAL.COUNT >= MAXCOUNTS(I) THEN
            BEGIN
                FOR J := 10 STEP -1 UNTIL I + 1 DO
                BEGIN
                    MAXCOUNTS(J) := MAXCOUNTS(J - 1);
                    MAXWORDS(J) :- MAXWORDS(J - 1);
                END;
                MAXCOUNTS(I) := VAL.COUNT;
                MAXWORDS(I) :- KEY;
                GO TO BREAK;
            END;
        END;
    BREAK:
    END;

    COMMENT OUTPUT 10 MOST COMMON WORDS ;

    FOR I := 1 STEP 1 UNTIL 10 DO
    BEGIN
        IF MAXWORDS(I) =/= NONE THEN
        BEGIN
            OUTINT(MAXCOUNTS(I), 10);
            OUTTEXT(" ");
            OUTTEXT(MAXWORDS(I) QUA TEXTHASHKEY.T);
            OUTIMAGE;
        END;
    END;

END
 
Output:
     41089 THE
     19949 OF
     14942 AND
     14608 A
     13951 TO
     11214 IN
      9648 HE
      8621 WAS
      7924 THAT
      6661 IT

6 garbage collection(s) in 0.2 seconds.

UNIX Shell

Works with: Bash
Works with: zsh

This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.

#!/bin/sh
tr -cs A-Za-z '\n' < "${1}" | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${2}q"


Output:
$ ./wordcount.sh 135-0.txt 10 
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it

zkl

fname,count := vm.arglist;	// grab command line args
 
// words may have leading or trailing "_", ie "the" and "_the"
File(fname).pump(Void,"toLower", // read the file line by line and hash words
RegExp("[a-z]+").pump.fp1(Dictionary().incV)) // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();
Output:
$ zkl bbb ~/Documents/Les\ Miserables.txt 10
the,41089
of,19949
and,14942
a,14608
to,13951
in,11214
he,9648
was,8621
that,7924
it,6661