Word frequency: Difference between revisions

m (→‎version 1: added words and highlighting to the note in the output section.)
(93 intermediate revisions by 39 users not shown)
Line 33:
*[http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ McIlroy's program]
<syntaxhighlight lang="11l">DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
   cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])</syntaxhighlight>
[(the, 41045), (of, 19953), (and, 14939), (a, 14527), (to, 13942), (in, 11210), (he, 9646), (was, 8620), (that, 7922), (it, 6659)]
Line 41 ⟶ 56:
{{works with|Ada|Ada|2012}}
<syntaxhighlight lang="ada">with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
Line 128 ⟶ 143:
end loop;
end Word_Frequency;
Line 143 ⟶ 158:
6661 it
=={{header|ALGOL 68}}==
{{works with|ALGOL 68G|Any - tested with release 2.8.3.win32}}
Uses the associative array implementations in [[ALGOL_68/prelude]].
<syntaxhighlight lang="algol68"># find the n most common words in a file #
# use the associative array in the Associate array/iteration task #
# but with integer values #
PR read "aArrayBase.a68" PR
AAVALUE init element value = 0;
# returns text converted to upper case #
STRING result := text;
FOR ch pos FROM LWB result TO UPB result DO
IF is lower( result[ ch pos ] ) THEN result[ ch pos ] := to upper( result[ ch pos ] ) FI
# returns text converted to an INT or -1 if text is not a number #
INT result := 0;
BOOL is numeric := TRUE;
FOR ch pos FROM UPB text BY -1 TO LWB text WHILE is numeric DO
CHAR c = text[ ch pos ];
is numeric := is numeric AND c >= "0" AND c <= "9";
IF is numeric THEN ( result *:= 10 ) +:= ABS c - ABS "0" FI
IF is numeric THEN result ELSE -1 FI
# returns TRUE if c is a letter, FALSE otherwise #
IF ( c >= "a" AND c <= "z" )
OR ( c >= "A" AND c <= "Z" )
ELSE char in string( c, NIL, "ÇåçêëÆôöÿÖØáóÔ" )
# get the file name and number of words from the command line #
STRING file name := "pg-les-misrables.txt";
INT number of words := 10;
FOR arg pos TO argc - 1 DO
STRING arg upper = TOUPPER argv( arg pos );
IF arg upper = "FILE" THEN
file name := argv( arg pos + 1 )
ELIF arg upper = "NUMBER" THEN
number of words := TOINT argv( arg pos + 1 )
IF FILE input file;
open( input file, file name, stand in channel ) /= 0
# failed to open the file #
print( ( "Unable to open """ + file name + """", newline ) )
# file opened OK #
print( ( "Processing: ", file name, newline ) );
BOOL at eof := FALSE;
BOOL at eol := FALSE;
# set the EOF handler for the file #
on logical file end( input file, ( REF FILE f )BOOL:
# note that we reached EOF on the #
# latest read #
at eof := TRUE;
# return TRUE so processing can continue #
# set the end-of-line handler for the file so get word can see line boundaries #
on line end( input file
# note we reached end-of-line #
at eol := TRUE;
# return FALSE to use the default eol handling #
# i.e. just get the next character #
# get the words from the file and store the counts in an associative array #
INT word count := 0;
CHAR c := " ";
WHILE get( input file, ( c ) );
NOT at eof
WHILE NOT ISLETTER c AND NOT at eof DO get( input file, ( c ) ) OD;
STRING word := "";
at eol := FALSE;
WHILE ISLETTER c AND NOT at eol AND NOT at eof DO word +:= c; get( input file, ( c ) ) OD;
word count +:= 1;
words // TOUPPER word +:= 1
close( input file );
print( ( file name, " contains ", whole( word count, 0 ), " words", newline ) );
# find the most used words #
[ number of words ]STRING top words;
[ number of words ]INT top counts;
FOR i TO number of words DO top words[ i ] := ""; top counts[ i ] := 0 OD;
WHILE w ISNT nil element DO
INT count = value OF w;
STRING word = key OF w;
BOOL found := FALSE;
FOR i TO number of words WHILE NOT found DO
IF count > top counts[ i ] THEN
# found a word that is used more than a current #
# most used word #
found := TRUE;
# move the other words down one place #
FOR move pos FROM number of words BY - 1 TO i + 1 DO
top counts[ move pos ] := top counts[ move pos - 1 ];
top words [ move pos ] := top words [ move pos - 1 ]
# install the new word #
top counts[ i ] := count;
top words [ i ] := word
w := NEXT words
print( ( whole( number of words, 0 ), " most used words:", newline ) );
print( ( " count word", newline ) );
FOR i TO number of words DO
print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
Processing: pg-les-misrables.txt
pg-les-misrables.txt contains 578381 words
10 most used words:
count word
39333: THE
19154: OF
14628: AND
14229: A
13431: TO
11275: HE
10879: IN
8236: WAS
7527: THAT
6491: IT
{{works with|GNU APL}}
<syntaxhighlight lang="apl">
⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
⍝⍝ Project Gutenberg is in UTF-8.)
⍝⍝ Use Unix 'iconv' if required
∇r ← lowerAndStrip s;stripped;mixedCase
⍝⍝ Convert text to lowercase, punctuation and newlines to spaces
stripped ← ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
r ← stripped[mixedCase ⍳ s]
⍝⍝ Return the _n_ most frequent words and a count of their occurrences
∇r ← n wordCount fname ;D;wl;sidx;swv;pv;wc;uw;sortOrder
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text with newlines
wl ← (~ D ∊ ' ') ⊂ D
sidx ← ⍒wl
swv ← wl[sidx]
pv ← +\ 1,~2 ≡/ swv
wc ← ∊ ⍴¨ pv ⊂ pv
uw ← 1 ⊃¨ pv ⊂ swv
sortOrder ← ⍒wc
r ← n↑[2] uw[sortOrder],[0.5]wc[sortOrder]
5 wordCount '135-0.txt'
the of and a to
41042 19952 14938 14526 13942
<syntaxhighlight lang="applescript">(*
For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
The set text is too messy to warrant faffing around with anything more sophisticated.
Line 227 ⟶ 424:
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)</syntaxhighlight>
<syntaxhighlight lang="applescript">"The 10 most frequently occurring words in the file are:
The: 41092
Of: 19954
Line 240 ⟶ 437:
Was: 8622
That: 7924
It: 6661"</syntaxhighlight>
<syntaxhighlight lang="rebol">findFrequency: function [file, count][
freqs: #[]
r: {/[[:alpha:]]+/}
loop flatten map split.lines read file 'l -> match lower l r 'word [
if not? key? freqs word -> freqs\[word]: 0
freqs\[word]: freqs\[word] + 1
freqs: sort.values.descending freqs
result: new []
loop 0..dec count 'x [
'result ++ @[@[get keys freqs x, get values freqs x]]
return result
loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
print pair
<pre>the 41096
of 19955
and 14939
a 14558
to 13954
in 11218
he 9649
was 8622
that 7924
it 6661</pre>
<syntaxhighlight lang="autohotkey">URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete, % A_temp "\tempfile.txt"
Line 259 ⟶ 490:
MsgBox % "Freq`tWord`n" result
Outputs:<pre>Freq Word
41036 The
Line 273 ⟶ 504:
<syntaxhighlight lang="awk">
Line 302 ⟶ 533:
Line 321 ⟶ 552:
This is a rather long program. I fulfilled the requirement with QB64. It "cleans" each word, so a word is anything that begins and ends with a letter. It works with arrays. QB64 is impressively fast at this job, even with a file as big as Les Miserables.txt.
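The "cleaning" rule described above can be sketched briefly in Python. This is only an illustration of the idea, not a translation of the QB64 program: the names <code>clean</code> and <code>count_words</code> are made up for this sketch, and it assumes that tokens are separated by whitespace and that "letter" means whatever Python's <code>str.isalpha</code> accepts.
<syntaxhighlight lang="python">
from collections import Counter


def clean(token):
    """Trim non-letters from both ends so the word begins and ends with a letter."""
    start, end = 0, len(token)
    while start < end and not token[start].isalpha():
        start += 1
    while end > start and not token[end - 1].isalpha():
        end -= 1
    # Characters in the middle (e.g. the apostrophe in "cat's") are kept.
    return token[start:end].lower()


def count_words(text):
    """Count cleaned, lowercased words; tokens with no letters are dropped."""
    counts = Counter()
    for token in text.split():
        word = clean(token)
        if word:
            counts[word] += 1
    return counts


sample = 'The "cat" sat; the cat\'s hat -- the end.'
print(count_words(sample).most_common(3))
</syntaxhighlight>
Trimming only the ends keeps inner punctuation, so "cat" and "cat's" are counted as different words under this rule.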
<syntaxhighlight lang="qbasic">
Line 889 ⟶ 1,120:
Line 929 ⟶ 1,160:
Try again? (Y/n)
This solution removes all punctuation, digits, tabs and carriage returns, so "This", "this" and "this." count as the same word. UTF-8 characters in words are fully supported. The code itself could be shorter, but for the sake of clarity everything has been written out explicitly.
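For comparison, the same sequence of steps (strip unwanted characters, treat newlines as spaces, lowercase, count, sort descending) might look as follows in Python. This is a sketch under stated assumptions: the function name <code>top_words</code> is hypothetical, and only ASCII punctuation from <code>string.punctuation</code> is removed, which may differ from what the <code>[[:punct:]]</code> class matches in the BaCon version.
<syntaxhighlight lang="python">
import string
from collections import Counter

# Characters to delete: ASCII punctuation, digits, tabs and carriage returns.
# Assumption: ASCII-only punctuation; typographic quotes and dashes are left in.
DELETE = str.maketrans("", "", string.punctuation + string.digits + "\t\r")


def top_words(text, n=10):
    """Return the n most frequent lowercased words as (word, count) pairs."""
    cleaned = text.translate(DELETE)
    # Newlines separate words just like spaces; split() also skips
    # superfluous spaces, so they are never counted as words.
    words = cleaned.replace("\n", " ").split()
    return Counter(word.lower() for word in words).most_common(n)


print(top_words("This, this and this.\r\nThat 42 that", 2))
</syntaxhighlight>
Deleting punctuation rather than replacing it with a space means that a contraction such as "don't" is counted as "dont".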
<syntaxhighlight lang="bacon">' We do not count superfluous spaces as words
' Optional: use TRE regex library to speed up the program
PRAGMA RE tre INCLUDE <tre/regex.h> LDFLAGS -ltre
' We're using associative arrays
' Load the text and remove all punctuation, digits, tabs and cr
book$ = EXTRACT$(LOAD$("miserables.txt"), "[[:punct:]]|[[:digit:]]|[\t\r]", TRUE)
' Count each word in lowercase
FOR word$ IN REPLACE$(book$, NL$, CHR$(32))
INCR frequency(LCASE$(word$))
' Sort the associative array and then map the index to a string array
LOOKUP frequency TO term$ SIZE x SORT DOWN
' Show results
FOR i = 0 TO 9
PRINT term$[i], " : ", frequency(term$[i])
the : 40440
of : 19903
and : 14738
a : 14306
to : 13630
in : 11083
he : 9452
was : 8605
that : 7535
his : 6434
=={{header|Batch File}}==
Line 936 ⟶ 1,208:
You could cut the length of this down drastically if you didn't need to be able to recall the word at nth position and wished only to display the top 10 words.
<syntaxhighlight lang="dos">
@echo off
Line 982 ⟶ 1,254:
Line 1,001 ⟶ 1,273:
101. - 0000000004 ears
This solution assumes that words consist of characters that exist in both a lowercase and an uppercase version, so it won't work with many non-European alphabets.
The built-in <code>vap</code> function can take either two or three arguments. The first argument must be the name of a function or a function definition. The second argument must be a string. The two-argument version maps the function to each character in the string. The three-argument version splits the string at each occurrence of the third argument, which must be a single character, and applies the function to the intervening substrings. The output of <code>vap</code> is a space-separated list of results from the function argument.
The expression <code>!('($arg:?A [($pivot) ?Z))</code> must be read as follows:
The subexpression <code>'($arg:?A [($pivot) ?Z)</code> is a macro expression. The symbols <code>arg</code> and <code>pivot</code>, which are the right hand sides of <code>$</code> operators with empty left hand side, are replaced by the actual values of <code>!arg</code> and <code>!pivot</code>. The whole subexpression is made the right hand side of a <code>=</code> operator with empty left hand side, e.g.
<code>=a b c d e:?A [2 ?Z</code>. The <code>=</code> operator protects the subexpression against evaluation. By prefixing the expression with the <code>!</code> unary operator (which normally is used to obtain the value of a variable), the pattern match operation <code>a b c d e:?A [2 ?Z</code> is executed, assigning <code>a b</code> to <code>A</code> and assigning <code>c d e</code> to <code>Z</code>.
The reason for using a macro expression is that the evaluation of a pattern match operation with pattern variable as in <code>!arg:?A [!pivot ?Z</code> is unnecessarily slow, since <code>!pivot</code> is evaluated up to <code>!pivot+1</code> times.
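For readers unfamiliar with Bracmat, the behaviour described above can be approximated in Python. This is a loose analogue, not a port: the Python functions <code>vap</code>, <code>letter_or_newline</code>, <code>words</code> and <code>split_at</code> are names invented for this sketch, the results are Python lists rather than Bracmat's space-separated lists, and dropping empty chunks is an assumption made here for convenience.
<syntaxhighlight lang="python">
def vap(func, text, sep=None):
    """Two arguments: map func over each character.
    Three arguments: split text at each sep and map func over the chunks."""
    if sep is None:
        return [func(ch) for ch in text]
    return [func(chunk) for chunk in text.split(sep)]


def letter_or_newline(ch):
    # A character whose uppercase and lowercase forms coincide is treated
    # as non-alphabetic and replaced by a newline.
    return "\n" if ch.upper() == ch.lower() else ch


def words(text):
    """Lowercased words, using the same case-based test for letters."""
    marked = "".join(vap(letter_or_newline, text))
    return [w for w in vap(str.lower, marked, "\n") if w]


def split_at(items, pivot):
    """Counterpart of the pattern  ?A [pivot ?Z : the first pivot elements, then the rest."""
    return items[:pivot], items[pivot:]


print(words("Hello, World! hello"))
print(split_at(list("abcde"), 2))
</syntaxhighlight>
The case-based test is what limits the approach to alphabets that have case: a character without distinct upper and lower forms is indistinguishable from punctuation here.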
<syntaxhighlight lang="bracmat"> ( 10-most-frequent-words
= MergeSort { Local variable declarations. }
. ( MergeSort { Definition of function MergeSort. }
= A N Z pivot
. !arg:? [?N { [?N is a subpattern that counts the number of preceding elements }
& ( !N:>1 { if N at least 2 ... }
& div$(!N.2):?pivot { divide N by 2 ... }
& !('($arg:?A [($pivot) ?Z)) { split list in two halves A and Z ... }
& MergeSort$!A+MergeSort$!Z { sort each of A and Z and return sum }
| !arg { else just return a single element}
& MergeSort { Sort }
$ ( vap { Split second argument at each occurrence of third character and apply first argument to each chunk. }
$ ( (=.low$!arg) { Return input, lowercased. }
. str
$ ( vap { Vaporize second argument in UTF-8 or Latin-1 characters and apply first argument to each of them. }
$ ( (
. upp$!arg:low$!arg&\n { Return newline instead of non-alphabetic character. }
| !arg { Return (Euro-centric) alphabetic character.}
. get$(!arg,NEW STR) { Read input text as a single string. }
. \n { Split at newlines }
: ?sorted-words { Assign sum of (frequency*lowercasedword) terms to sorted-words. }
& :?types { Initialize types as an empty list. }
& whl { Loop until right hand side fails. }
' ( !sorted-words:#?frequency*%@?type+?sorted-words { Extract first frequency*type term from sum. }
& (!frequency.!type) !types:?types { Prepend (frequency.type) pair to types list}
& MergeSort$!types { Sort the list of (frequency.type) pairs. }
: (?+[-11+?most-frequent-words|?most-frequent-words) { Pick the last 10 terms from the sum returned by MergeSort. }
& !most-frequent-words { Return the last 10 terms. }
& out$(10-most-frequent-words$"135-0.txt") { Call 10-most-frequent-words with name of input file and print result to screen. }</syntaxhighlight>
<pre> (6661.it)
+ (7924.that)
+ (8622.was)
+ (9649.he)
+ (11219.in)
+ (13953.to)
+ (14546.a)
+ (14943.and)
+ (19954.of)
+ (41092.the)</pre>
Words are defined by the regular expression "\w+".
<syntaxhighlight lang="c">#include <stdbool.h>
#include <stdio.h>
#include <glib.h>
Line 1,093 ⟶ 1,434:
if (!get_top_words(argv[1], 10))
Top 10 words
Rank Count Word
1 41039 the
Line 1,112 ⟶ 1,453:
9 7922 that
10 6659 it
=={{header|C sharp|C#}}==
<syntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
Line 1,153 ⟶ 1,489:
<pre>Rank Word Frequency
Line 1,169 ⟶ 1,505:
<syntaxhighlight lang="cpp">#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
int main(int ac, char** av) {
int head = (ac > 1) ? std::atoi(av[1]) : 10;
std::istreambuf_iterator<char> it(std::cin), eof;
std::filebuf file;
if (ac > 2) {
if (file.open(av[2], std::ios::in), file.is_open()) {
it = std::istreambuf_iterator<char>(&file);
} else return std::cerr << "file " << av[2] << " open failed\n", 1;
auto alpha = [](unsigned c) { return c-'A' < 26 || c-'a' < 26; };
auto lower = [](unsigned c) { return c | '\x20'; };
std::unordered_map<std::string, int> counts;
std::string word;
for (; it != eof; ++it) {
if (alpha(*it)) {
} else if (!word.empty()) {
if (!word.empty()) ++counts[word]; // if file ends w/o ws
std::vector<std::pair<const std::string,int> const*> out;
for (auto& count : counts) out.push_back(&count);
out.size() < head ? out.end() : out.begin() + head,
out.end(), [](auto const* a, auto const* b) {
return a->second > b->second;
if (out.size() > head) out.resize(head);
for (auto const& count : out) {
std::cout << count->first << ' ' << count->second << '\n';
return 0;
$ ./a.out 10 135-0.txt
the 41093
of 19954
and 14943
a 14558
to 13953
in 11219
he 9649
was 8622
that 7924
it 6661
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <fstream>
Line 1,179 ⟶ 1,577:
int main() {
std::regex wordRgx("\\w+");
std::map<std::string, int> freq;
std::string line;
const int top = 10;
std::ifstream in("135-0.txt");
if (!in.is_open()) {
std::cerr << "Failed to open file\n";
return 1;
while (std::getline(in, line)) {
auto words_itr = std::sregex_iterator(line.cbegin(), line.cend(), wordRgx);
auto words_end = std::sregex_iterator();
while (words_itr != words_end) {
auto match = *words_itr;
auto word = match.str();
if (word.size() > 0) {
transform (word.begin(), word.end(), word.begin(), ::tolower);
auto entry = freq.find(word);
if (entry != freq.end()) {
} else {
freq.insert(std::make_pair(word, 1));
words_itr = std::next(words_itr);
std::vector<std::pair<std::string, int>> pairs;
for (auto iter = freq.cbegin(); iter != freq.cend(); ++iter) {
std::sort(pairs.begin(), pairs.end(), [=](auto& a, auto& b) {
return a.second > b.second;
std::cout << "Rank Word Frequency\n";
std::cout << "==== ==== =========\n";
int rank = 1;
for (auto iter = pairs.cbegin(); iter != pairs.cend() && rank <= top; ++iter) {
std::printf("%2d %4s %5d\n", rank++, iter->first.c_str(), iter->second);
return 0;
<pre>Rank Word Frequency
Line 1,238 ⟶ 1,638:
9 he 6814
10 had 6139</pre>
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <format>
#include <fstream>
#include <map>
#include <ranges>
#include <regex>
#include <string>
#include <vector>
int main() {
std::ifstream in("135-0.txt");
std::string text{
std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}
std::regex word_rx("\\w+");
std::map<std::string, int> freq;
for (const auto& a : std::ranges::subrange(
std::sregex_iterator{ text.cbegin(),text.cend(), word_rx }, std::sregex_iterator{}
auto word = a.str();
transform(word.begin(), word.end(), word.begin(), ::tolower);
std::vector<std::pair<std::string, int>> pairs;
for (const auto& elem : freq)
std::ranges::sort(pairs, std::ranges::greater{}, &std::pair<std::string, int>::second);
std::cout << "Rank Word Frequency\n"
"==== ==== =========\n";
for (int rank=1; const auto& [word, count] : pairs | std::views::take(10))
std::cout << std::format("{:2} {:>4} {:5}\n", rank++, word, count);
<pre>Rank Word Frequency
==== ==== =========
0 the 41043
1 of 19952
2 and 14938
3 a 14539
4 to 13942
5 in 11208
6 he 9646
7 was 8620
8 that 7922
9 it 6659</pre>
<syntaxhighlight lang="clojure">(defn count-words [file n]
(->> file
Line 1,247 ⟶ 1,706:
(sort-by val >)
(take n)))</syntaxhighlight>
Line 1,257 ⟶ 1,716:
<syntaxhighlight lang="cobol">
PROGRAM-ID. WordFrequency.
Line 1,471 ⟶ 1,930:
CLOSE Word-File Output-File.
Line 1,494 ⟶ 1,953:
=={{header|Common Lisp}}==
<syntaxhighlight lang="lisp">
(defun count-word (n pathname)
(with-open-file (s pathname :direction :input)
Line 1,515 ⟶ 1,974:
(dolist (word words) (incf (gethash word hash 0)))
(maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)
Line 1,525 ⟶ 1,984:
<syntaxhighlight lang="ruby">require "http/client"
require "regex"
Line 1,543 ⟶ 2,002:
.sort { |a, b| b[1] <=> a[1] }[0..9] # sort and get the first 10 elements
.each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result
Line 1,560 ⟶ 2,019:
<syntaxhighlight lang="d">import std.algorithm : sort;
import std.array : appender, split;
import std.range : take;
Line 1,595 ⟶ 2,054:
writefln("%4s %-10s %9s", rank++, word.k, word.v);
Line 1,609 ⟶ 2,068:
9 that 7251
10 his 6414</pre>
{{libheader| System.SysUtils}}
{{libheader| System.IOUtils}}
{{libheader| System.Generics.Collections}}
{{libheader| System.Generics.Defaults}}
{{libheader| System.RegularExpressions}}
<syntaxhighlight lang="delphi">
program Word_frequency;
TWords = TDictionary<string, Integer>;
TFreqPair = TPair<string, Integer>;
TFreq = TArray<TFreqPair>;
function CreateValueCompare: IComparer<TFreqPair>;
Result := TComparer<TFreqPair>.Construct(
function(const Left, Right: TFreqPair): Integer
Result := Right.Value - Left.Value;
function WordFrequency(const Text: string): TFreq;
words: TWords;
match: TMatch;
w: string;
words := TWords.Create();
match := TRegEx.Match(Text, '\w+');
while match.Success do
w := match.Value;
if words.ContainsKey(w) then
words[w] := words[w] + 1
words.Add(w, 1);
match := match.NextMatch();
Result := words.ToArray;
TArray.Sort<TFreqPair>(Result, CreateValueCompare);
Text: string;
rank: integer;
Freq: TFreq;
w: TFreqPair;
Text := TFile.ReadAllText('135-0.txt').ToLower();
Freq := WordFrequency(Text);
Writeln('Rank Word Frequency');
Writeln('==== ==== =========');
for rank := 1 to 10 do
w := Freq[rank - 1];
Writeln(format('%2d %6s %5d', [rank, w.Key, w.Value]));
Rank Word Frequency
==== ==== =========
1 the 41040
2 of 19951
3 and 14942
4 a 14539
5 to 13941
6 in 11209
7 he 9646
8 was 8620
9 that 7922
10 it 6659
=={{header|F Sharp}}==
<syntaxhighlight lang="fsharp">
open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)
Line 1,633 ⟶ 2,187:
This program expects stdin to read from a file via the command line. ( e.g. invoking the program in Windows: <tt>>factor word-count.factor < input.txt</tt> ) The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.
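The word definition used here (any run of characters between spaces, punctuation, or newlines) can be illustrated with a short Python sketch. The delimiter set is copied from the Factor code; everything else, including the function name <code>top_words</code>, is an assumption of this sketch. The function takes any iterable of lines, so <code>sys.stdin</code> can be passed to it when input is redirected from a file.
<syntaxhighlight lang="python">
import re
from collections import Counter

# The same delimiter characters the Factor solution splits on.
DELIMITERS = " .,?!:;()\"-"
SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]")


def top_words(lines, n=10):
    """Join the lines with spaces, split on the delimiters, drop empty pieces,
    lowercase the rest and return the n most common (word, count) pairs."""
    text = " ".join(line.rstrip("\n") for line in lines)
    pieces = SPLITTER.split(text)
    words = [piece.lower() for piece in pieces if piece]
    return Counter(words).most_common(n)


print(top_words(["The cat. The dog-cat!\n", "the end\n"], 2))
</syntaxhighlight>
Because only the listed characters act as delimiters, anything else (apostrophes, digits, typographic quotes) stays attached to the neighbouring word.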
<syntaxhighlight lang="factor">
USING: ascii io math.statistics prettyprint sequences
splitting ;
Line 1,640 ⟶ 2,194:
lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .
Line 1,655 ⟶ 2,209:
{ "it" 6532 }
<syntaxhighlight lang="freebasic">
#Include "file.bi"
type tally
as string s
as long l
end type
Sub quicksort(array() As String,begin As Long,Finish As Long)
Dim As Long i=begin,j=finish
Dim As String x =array(((I+J)\2))
While I <= J
While array(I) < X :I+=1:Wend
While array(J) > X :J-=1:Wend
If I<=J Then Swap array(I),array(J): I+=1:J-=1
If J >begin Then quicksort(array(),begin,J)
If I <Finish Then quicksort(array(),I,Finish)
End Sub
Sub tallysort(array() As tally,begin As Long,Finish As long)
Dim As Long i=begin,j=finish
Dim As tally x =array(((I+J)\2))
While I <= J
While array(I).l > X .l:I+=1:Wend
While array(J).l < X .l:J-=1:Wend
If I<=J Then Swap array(I),array(J): I+=1:J-=1
If J >begin Then tallysort(array(),begin,J)
If I <Finish Then tallysort(array(),I,Finish)
End Sub
Function loadfile(file As String) As String
If Fileexists(file)=0 Then Print file;" not found":Sleep:End
Dim As Long f=Freefile
Open file For Binary Access Read As #f
Dim As String text
If Lof(f) > 0 Then
text = String(Lof(f), 0)
Get #f, , text
End If
Close #f
Return text
End Function
Function String_Split(s_in As String,chars As String,result() As String) As Long
Dim As Long ctr,ctr2,k,n,LC=Len(chars)
Dim As boolean tally(Len(s_in))
#macro check_instring()
While n<Lc
If chars[n]=s_in[k] Then
If (ctr2-1) Then ctr+=1
Exit While
End If
#macro splice()
If tally(k) Then
If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)
End If
'================== LOOP TWICE =======================
For k =0 To Len(s_in)-1
Next k
If ctr=0 Then
If Len(s_in) Andalso Instr(chars,Chr(s_in[0])) Then ctr=1':
End If
If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else Return 0
For k =0 To Len(s_in)-1
Next k
'===================== Last one ========================
If ctr2>0 Then
Redim Preserve result(1 To ctr+1)
End If
Return Ubound(result)
End Function
Redim As String s()
redim as tally t()
dim as string p1,p2,deliminators
dim as long count,jmp
dim as double tm=timer
Var L=loadfile("rosettalesmiserables.txt")
'get deliminators
for n as long=1 to 96
for n as long=123 to 255
For n As Long=lbound(s) To ubound(s)-1
if s(n+1)=s(n) then jmp+=1
if s(n+1)<>s(n) then
redim preserve t(1 to count)
end if
tallysort(t(),lbound(t),ubound(t))'sort by frequency
print "frequency","word"
for n as long=lbound(t) to lbound(t)+9
print t(n).l,t(n).s
print "time for operation ";timer-tm;" seconds"
I saved and reloaded the file as ascii text.
frequency word
41098 the
19955 of
14939 and
14557 a
13953 to
11219 in
9648 he
8621 was
7923 that
6660 it
time for operation 1.099869600031525 seconds
Line 1,660 ⟶ 2,366:
This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>wordList[''str'']</CODE>, which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.) It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out. These are likely punctuation and spaces. There is also a two-argument function, <CODE>wordList[''str'', ''lang'']</CODE> which allows you to specify a language code ''e.g.'' <CODE>"fr"</CODE> to use the rules of French (or many other human languages) to perform correct word-breaking according to the rules of that language!
* The file fetched from Project Gutenberg is supposed to be encoded in UTF-8, but their servers incorrectly report it as Windows-1252 encoded or send no character encoding at all, so this program fixes that.
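The two points above (forcing the correct encoding and handling Unicode consistently) can be sketched in Python as follows. This is a much cruder approximation than Frink's <code>wordList</code>: it uses the regular expression <code>\w+</code> instead of language-aware word breaking, and the function name <code>top_words</code> and the choice of NFC normalization are assumptions of this sketch.
<syntaxhighlight lang="python">
import re
import unicodedata
from collections import Counter


def top_words(raw_bytes, n=10):
    """Count words in UTF-8 encoded bytes, ignoring any declared charset."""
    # Decode explicitly as UTF-8 instead of trusting what a server reports.
    text = raw_bytes.decode("utf-8")
    # Normalize so that composed and decomposed accents compare equal,
    # then lowercase so that capitalization does not split the counts.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", text)
    # Keep only tokens that contain at least one alphanumeric character.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return Counter(words).most_common(n)


raw = "Misérables mise\u0301rables MISÉRABLES ___ été".encode("utf-8")
print(top_words(raw, 2))
</syntaxhighlight>
Without the normalization step, the second spelling in the sample (a plain "e" followed by a combining accent) would be split at the accent and counted separately.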
Line 1,673 ⟶ 2,379:
There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:
<syntaxhighlight lang="frink">d = new dict
for w = select[wordList[read[normalizeUnicode["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
d.increment[lc[w], 1]
println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]</syntaxhighlight>
Line 1,695 ⟶ 2,401:
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and still performs full Unicode-aware normalization, capitalization, and word-breaking:
<syntaxhighlight lang="frink">formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</syntaxhighlight>
Line 1,709 ⟶ 2,415:
he 6812
had 6133
Task said: "Feel free to explicitly state the thoughts behind the program decisions." Thus the heavy comments.
<syntaxhighlight lang="futurebasic">
include "NSLog.incl"
local fn WordFrequency( textStr as CFStringRef, caseSensitive as Boolean, ascendingOrder as Boolean ) as CFStringRef
CFStringRef wrd
CFDictionaryRef dict
// Depending on the value of the caseSensitive Boolean function parameter above, lowercase incoming text
if caseSensitive == NO then textStr = fn StringLowercaseString( textStr )
// Trim non-alphabetic characters from string, and separate individual words with a space
CFStringRef tempStr = fn ArrayComponentsJoinedByString( fn StringComponentsSeparatedByCharactersInSet( textStr, fn CharacterSetInvertedSet( fn CharacterSetLetterSet ) ), @" " )
// Prepare separators to parse string into array
CFMutableCharacterSetRef separators = fn MutableCharacterSetInit
// Informally, this set is the set of all non-whitespace characters used to separate linguistic units in scripts, such as periods, dashes, parentheses, and so on.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetPunctuationSet )
// A character set containing all the whitespace and newline characters including characters in Unicode General Category Z*, U+000A U+000D, and U+0085.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetWhitespaceAndNewlineSet )
// Create array of separated words
CFArrayRef tempArr = fn StringComponentsSeparatedByCharactersInSet( tempStr, separators )
// Create a counted set with each word and its frequency
CountedSetRef freqencies = fn CountedSetWithArray( tempArr )
// Enumerate each word-frequency pair in the counted set...
EnumeratorRef enumRef = fn CountedSetObjectEnumerator( freqencies )
// .. and use it to create array of words in counted set
CFArrayRef array = fn EnumeratorAllObjects( enumRef )
// Create an empty mutable array
CFMutableArrayRef wordArr = fn MutableArrayWithCapacity( 0 )
// Create word counter
NSInteger totalWords = 0
// Enumerate each unique word, get its frequency, create its own key/value pair dictionary, add each dictionary into master array
for wrd in array
// Create dictionary with frequency and matching word
dict = @{ @"count":fn NumberWithUnsignedInteger( fn CountedSetCountForObject( freqencies, wrd ) ), @"object":wrd }
// Add each dictionary to the master mutable array, checking for a valid word by length
if ( fn StringLength( wrd ) != 0 )
MutableArrayAddObject( wordArr, dict )
end if
// Store the total words as a global application property
AppSetProperty( @"totalWords", fn StringWithFormat( @"%d", totalWords - 1 ) )
// Sort the array in ascending or descending order as determined by the ascendingOrder Boolean function input parameter
SortDescriptorRef descriptors = fn SortDescriptorWithKey( @"count", ascendingOrder )
CFArrayRef sortedArray = fn ArraySortedArrayUsingDescriptors( wordArr, @[descriptors] )
// Create an empty mutable string
CFMutableStringRef mutStr = fn MutableStringWithCapacity( 0 )
// Use each dictionary in sorted array to build the formatted output string
NSInteger count = 1
for dict in sortedArray
MutableStringAppendString( mutStr, fn StringWithFormat( @"%-7d %-7lu %@\n", count, fn StringIntegerValue( fn DictionaryValueForKey( dict, @"count" ) ), fn DictionaryValueForKey( dict, @"object" ) ) )
// Create an immutable output string from the mutable string
CFStringRef resultStr = fn StringWithFormat( @"%@", mutStr )
end fn = resultStr
local fn ParseTextFromWebsite( webSite as CFStringRef )
// Convert incoming string to URL
CFURLRef textURL = fn URLWithString( webSite )
// Read contents of URL into a string
CFStringRef textStr = fn StringWithContentsOfURL( textURL, NSUTF8StringEncoding, NULL )
// Start timer
CFAbsoluteTime startTime = fn CFAbsoluteTimeGetCurrent
// Calculate frequency of words in text and sort by occurrence
CFStringRef frequencyStr = fn WordFrequency( textStr, NO, NO )
// Log results and post processing time
NSLog( @"%@", frequencyStr )
NSLog( @"Total unique words in document: %@", fn AppProperty( @"totalWords" ) )
// Stop timer and log elapsed processing time
NSLog( @"Elapsed time: %f milliseconds.", ( fn CFAbsoluteTimeGetCurrent - startTime ) * 1000.0 )
end fn
// Pass url for Les Misérables on Project Gutenberg and parse in background
fn ParseTextFromWebsite( @"https://www.gutenberg.org/files/135/135-0.txt" )
1 41095 the
2 19955 of
3 14939 and
4 14546 a
5 13954 to
6 11218 in
7 9649 he
8 8622 was
9 7924 that
10 6661 it
11 6470 his
12 6193 is
22900 1 millstones
22901 1 fumbles
22902 1 shunned
22903 1 avoids
22904 1 poitevin
22905 1 muleteer
22906 1 idolizes
22907 1 lapsed
22908 1 reptitalmus
22909 1 bled
22910 1 isabella
Total unique words in document: 22910
Elapsed time: 595.407963 milliseconds.
<syntaxhighlight lang="go">package main
import (
Line 1,755 ⟶ 2,594:
fmt.Printf("%2d %-4s %5d\n", rank, word, freq)
Line 1,775 ⟶ 2,614:
<syntaxhighlight lang="groovy">def topWordCounts = { String content, int n ->
def mapCounts = [:]
content.toLowerCase().split(/\W+/).each {
Line 1,783 ⟶ 2,622:
println "Rank Word Frequency\n==== ==== ========="
(0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
<syntaxhighlight lang="groovy">def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)</syntaxhighlight>
Line 1,804 ⟶ 2,643:
===Lazy IO with pure Map, arrows===
<syntaxhighlight lang="haskell">module Main where
import Control.Category -- (>>>)
import Data.Char -- toLower, isSpace
import Data.List -- sortBy, (Foldable(foldl')), filter -- '
import Data.Ord -- Down
import System.IO -- stdin, ReadMode, openFile, hClose
Line 1,825 ⟶ 2,665:
frequencies :: Ord a => [a] -> Map a Integer
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty -- '
{-# SPECIALIZE frequencies :: [Text] -> Map Text Integer #-}
Line 1,845 ⟶ 2,685:
>>> take n
>>> print)
when filep (hClose hand)</syntaxhighlight>
Line 1,852 ⟶ 2,692:
===Lazy IO, map of IORefs===
Using IORefs as values in the map seems to give a ~2x speedup on large files. The below code is based on https://github.com/composewell/streamly-examples/blob/master/examples/WordFrequency.hs , but still using lazy IO to avoid the extra library dependency (in production you should [https://stackoverflow.com/questions/5892653/whats-so-bad-about-lazy-i-o use a streaming library] like streamly/conduit/io-streams):
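The structure of the idea (the map stores a mutable cell per word, and repeated words update the cell in place instead of writing a new value into the map) can be shown with a small Python analogue. It only illustrates the data layout: the names <code>frequencies</code> and <code>top</code> mirror the Haskell code loosely, a one-element list stands in for an <code>IORef</code>, and no speedup is claimed for Python.
<syntaxhighlight lang="python">
def frequencies(words):
    """Map each word to a one-element list acting as a mutable counter cell."""
    table = {}
    for word in words:
        cell = table.get(word)
        if cell is None:
            table[word] = [1]  # first occurrence: allocate a new cell
        else:
            cell[0] += 1       # later occurrences: mutate the cell, map untouched
    return table


def top(table, n):
    """Read every cell once and return the n highest counts, largest first."""
    counts = [(word, cell[0]) for word, cell in table.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]


text = "The cat and the hat and THE bat"
table = frequencies(w for w in text.lower().split() if w)
print(top(table, 2))
</syntaxhighlight>
As in the Haskell version, the input is lowercased and split on whitespace only, so punctuation stays attached to words.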
<syntaxhighlight lang="haskell">
module Main where
import Control.Monad (foldM, when)
import Data.Char (isSpace, toLower)
import Data.List (sortOn, filter)
import Data.Ord (Down(..))
import System.IO (stdin, IOMode(..), openFile, hClose)
import System.Environment (getArgs)
import Data.IORef (IORef(..), newIORef, readIORef, modifyIORef') -- '
-- containers
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as M
-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
frequencies :: [Text] -> IO (HashMap Text (IORef Int))
frequencies = foldM (flip (M.alterF alter)) M.empty
alter Nothing = Just <$> newIORef (1 :: Int)
alter (Just ref) = modifyIORef' ref (+ 1) >> return (Just ref) -- '
main :: IO ()
main = do
args <- getArgs
when (length args /= 1) (error "expecting 1 arg (number of words to print)")
let maxw = read $ head args -- no error handling, to simplify the example
T.hGetContents stdin >>= \contents -> do
freqtable <- frequencies $ filter (not . T.null) $ T.split isSpace $ T.map toLower contents
counts <-
let readRef (w, ref) = do
cnt <- readIORef ref
return (w, cnt)
in mapM readRef $ M.toList freqtable
print $ take maxw $ sortOn (Down . snd) counts
</syntaxhighlight>
$ ./word_count 10 < ~/doc/les_miserables*
===Lazy IO, short code, but not streaming===
Or, perhaps a little more simply, though not streaming (this reads everything into memory, so don't use it on big files):
<syntaxhighlight lang="haskell">import qualified Data.Text.IO as T
import qualified Data.Text as T
main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)</syntaxhighlight>
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to examine the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the values.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
    URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("(\\w+)");
        Matcher matcher;
        String line;
        String word;
        Map<String, Integer> map = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while (matcher.find()) {
                word = matcher.group().toLowerCase();
                if (map.containsKey(word)) {
                    map.put(word, map.get(word) + 1);
                } else {
                    map.put(word, 1);
                }
            }
        }
        /* print out top 10 */
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        /* order the entries by descending count */
        list.sort(Map.Entry.comparingByValue());
        Collections.reverse(list);
        int count = 1;
        for (Map.Entry<String, Integer> value : list) {
            System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
            if (count++ == 10) break;
        }
    }
}
</syntaxhighlight>
<pre>
the                  41,043
of                   19,952
and                  14,938
a                    14,539
to                   13,942
in                   11,208
he                    9,646
was                   8,620
that                  7,922
it                    6,659
</pre>
<br />
An alternate demonstration
<syntaxhighlight lang="java">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
<pre>Rank Word Frequency
9 that 7924
10 it 6661</pre>
The following solution uses the concept of a "bag of words" (bow), here realized as a JSON object
with the words as keys and the frequency of a word as the corresponding value.
To avoid issues with case folding, the "letters" here are just the ASCII alphabet and the hyphen, but a "word"
may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.
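As a rough illustration of the word rule just described, here is a minimal sketch in Python (it is not a translation of the jq program). The names <code>bag_of_words</code> and <code>WORD</code> are hypothetical and exist only for this example: every character other than an ASCII letter or a hyphen is treated as a separator, and a token counts as a word only if it begins with a letter.

```python
import re
from collections import Counter

# A "word" starts with a letter and may continue with letters or hyphens.
WORD = re.compile(r"^[a-z][-a-z]*$")


def bag_of_words(lines):
    """Count words line by line using the letters-and-hyphen rule."""
    counts = Counter()
    for line in lines:
        # Replace everything except ASCII letters and hyphens with a space.
        cleaned = re.sub(r"[^-a-zA-Z]", " ", line)
        for token in cleaned.lower().split():
            # Tokens that begin with a hyphen (such as "-the") are skipped.
            if WORD.match(token):
                counts[token] += 1
    return counts


sample = ["The-the cat -the THE", "the end."]
bow = bag_of_words(sample)
print(bow["the"], bow["the-the"], "-the" in bow)
```

In this small sample "the-the" is kept as a single word and "-the" is dropped, matching the two cases mentioned in the text.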
<syntaxhighlight lang="jq">
< 135-0.txt jq -nR --argjson n 10 '
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
bow(inputs | gsub("[^-a-zA-Z]"; " ") | splits(" *") | ascii_downcase | select(test("^[a-z][-a-z]*$")))
| to_entries
| sort_by(.value)
| .[- $n :]
| reverse
| from_entries
<syntaxhighlight lang="jq">
"the": 41087,
"of": 19937,
"and": 14932,
"a": 14552,
"to": 13738,
"in": 11209,
"he": 9649,
"was": 8621,
"that": 7923,
"it": 6661
{{works with|Julia|1.0}}
<syntaxhighlight lang="julia">
using FreqTables
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
"he" │ 6816
"had" │ 6140</pre>
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
The program below defines the function <code>stats</code>, which takes the name of the file containing the text.
<syntaxhighlight lang="kap">∇ stats (file) {
content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
sorted ← (⍋⊇⊢) content
selection ← 1,2≢/sorted
words ← selection / sorted
{⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
┃ "the" 40387┃
┃ "of" 19913┃
┃ "and" 14742┃
┃ "a" 14289┃
┃ "to" 13819┃
┃ "in" 11088┃
┃ "he" 9430┃
┃ "was" 8597┃
┃"that" 7516┃
┃ "his" 6435┃
There is no change in the results if the numerals 0-9 are also regarded as letters.
<syntaxhighlight lang="scala">// version 1.1.3
import java.io.File
for ((word, freq) in wordGroups)
System.out.printf("%2d %-4s %5d\n", rank++, word, freq)
=={{header|Liberty BASIC}}==
<syntaxhighlight lang="lb">dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
dim lines$(150000)
open "135-0.txt" for input as #txt
close #txt
<pre>Count Word
{{works with|lua|5.3}}
<syntaxhighlight lang="lua">
-- This program takes two optional command line arguments. The first (arg[1])
-- specifies the input file, or defaults to standard input. The second
-- (arg[2]) specifies the number of results to show, or defaults to 10.
-- in freq, each key is a word and each value is its count
local freq = {}
for line in io.lines(arg[1]) do
-- %a stands for any letter
for word in string.gmatch(string.lower(line), "%a+") do
if not freq[word] then
freq[word] = 1
-- in array, each entry is an array whose first value is the count and whose
-- second value is the word
local array = {}
for word, count in pairs(freq) do
table.insert(array, {count, word})
end
table.sort(array, function (a, b) return a[1] > b[1] end)
for i = 1, arg[2] or 10 do
io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
7924 that
6661 it
Relevant documentation:
[https://www.lua.org/manual/5.3/manual.html#pdf-io.lines io.lines]
[https://www.lua.org/manual/5.3/manual.html#pdf-string.gmatch gmatch]
[https://www.lua.org/manual/5.3/manual.html#6.4.1 patterns like %a]
=={{header|Mathematica}} / {{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset</syntaxhighlight>
the 41088
of 19936
and 14931
a 14536
to 13738
in 11208
he 9607
was 8621
that 7825
it 6535
=={{header|MATLAB}} / {{header|Octave}}==
<syntaxhighlight lang="matlab">
function [result,count] = word_frequency()
DELIMITER={' ', ',', ';', ':', '.', '/', '*', '!', '?', '<', '>', '(', ')', '[', ']','{', '}', '&','$','§','"','”','“','-','—','‘','\t','\n','\r'};
words = sort(strsplit(lower(text),DELIMITER));
flag = [find(~strcmp(words(1:end-1),words(2:end))),length(words)];
dwords = words(flag); % get distinct words, and ...
count = diff([0,flag]); % ... the corresponding occurrence frequency
[tmp,idx] = sort(-count); % sort according to occurrence
result = dwords(idx);
count = count(idx);
for k = 1:10,
41039 the
19950 of
14942 and
14523 a
13941 to
11208 in
9605 he
8620 was
7824 that
6533 it
<syntaxhighlight lang="nim">import tables, strutils, sequtils, httpclient
proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
echo alignLeft($count, 8), word</syntaxhighlight>
<pre>40377   the
19870   of
14469   and
14278   a
13590   to
11025   in
9213    he
8347    was
7249    that
6414    his</pre>
<syntaxhighlight lang="objeck">use System.IO.File;
use Collection;
use RegEx;
<syntaxhighlight lang="ocaml">let () =
let n =
try int_of_string Sys.argv.(1)
List.iter (fun (word, count) ->
Printf.printf "%d %s\n" count word
) r</syntaxhighlight>
7924 that
6661 it
<syntaxhighlight lang="delphi">
.OrderByDescending(w -> w.Value).Take(10).PrintLines
<syntaxhighlight lang="perl">use strict;
use warnings;
use utf8;
my $top = 10;
open my $fh, '<', 'ref/word-count.txt';
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;
my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
for my $reg (@matcher) {
print "\nTop $top using regex: " . $reg . "\n";
my @matches = $text =~ /$reg/g;
my %words;
for my $w (@matches) { $words{$w}++ };
my $c = 0;
for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
printf "%-7s %6d\n", $w, $words{$w};
last if ++$c >= $top;
<pre>Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
was 8621
that 7924
it 6661</pre>
<syntaxhighlight lang="phix">
without javascript_semantics
?"loading..."
constant subs = '\t'&"\r\n_.,\"\'!;:?][()|=<>#/*{}+@%&$",
         reps = repeat(' ',length(subs)),
         fn = open("135-0.txt","r")
string text = lower(substitute_all(get_text(fn),subs,reps))
close(fn)
sequence words = append(sort(split(text,no_empty:=true)),"")
constant wf = new_dict()
string last = words[1]
integer count = 1
for i=2 to length(words) do
    if words[i]!=last then
        setd({count,last},0,wf)
        count = 0
        last = words[i]
    end if
    count += 1
end for
count = 10
function visitor(object key, object /*data*/, object /*user_data*/)
    ?key
    count -= 1
    return count>0
end function
traverse_dict(routine_id("visitor"),0,wf,true)
</syntaxhighlight>
<syntaxhighlight lang="phixmonti">include ..\Utilitys.pmt
"loading..." ?
"135-0.txt" "r" fopen var fn
" "
fn fgets number? if drop fn fclose false else lower " " chain chain true endif
"process..." ?
len for
var i
i get dup 96 > swap 123 < and not if 32 i set endif
split sort
"count..." ?
( ) var words
"" var prev
1 var n
len for
var i
i get dup prev ==
drop n 1 + var n
words ( n prev ) 0 put var words var prev 1 var n
words sort
10 for
-1 * get ?
[41093, "the"]
[19954, "of"]
[14943, "and"]
[14558, "a"]
[13953, "to"]
[11219, "in"]
[9649, "he"]
[8622, "was"]
[7924, "that"]
[6661, "it"]
=== Press any key to exit ===</pre>
<syntaxhighlight lang="php">
10 had 6139
To get the book proper, the header and footer are removed. Here are some tests with different sets of characters to split the words (<code>split_char/1</code>).
<syntaxhighlight lang="picat">main =>
NTop = 10,
File = "les_miserables.txt",
Chars = read_file_chars(File),
% Remove the Project Gutenberg header/footer
Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],
% Split into words (different set of split characters)
Words = split(Book,SplitChars),
freq(L) = Freq =>
Freq = new_map(),
foreach(E in L)
% different set of split chars
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</syntaxhighlight>
<pre>split_type = all
[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]
split_type = space_punct
[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]
split_type = space
[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]</pre>
The result is slightly different if the header/footer are not removed:
<pre>split_type = all
[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]
split_type = space_punct
[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]
split_type = space
[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]</pre>
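The effect of the delimiter set can be reproduced on a small scale with the following sketch in Python. It is only an illustration under stated assumptions: <code>SPLIT_CHARS</code> and <code>top_words</code> are hypothetical names, only two of the delimiter sets are modelled, and the header/footer removal is not attempted.

```python
import re
from collections import Counter

# Hypothetical delimiter sets, loosely modelled on two of the split sets above.
SPLIT_CHARS = {
    "space": "\n\r \t",
    "space_punct": "\n\r \t,;!.?",
}


def top_words(text, delimiters, n=10):
    """Split on any run of the given delimiter characters and count the pieces."""
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [w for w in re.split(pattern, text.lower()) if w]
    return Counter(words).most_common(n)


sample = "The cat. The dog; the end!"
for name, chars in SPLIT_CHARS.items():
    print(name, top_words(sample, chars, 3))
```

With whitespace alone, punctuation stays glued to the neighbouring word ("cat." and "cat" would be different words), which is one reason why the counts differ between the split types.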
<syntaxhighlight lang="picolisp">(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))
(if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
(println L (val L)) )</syntaxhighlight>
{{works with|SWI Prolog}}
<syntaxhighlight lang="prolog">print_top_words(File, N):-
read_file_to_string(File, String, [encoding(utf8)]),
re_split("\\w+", String, Words),
print_top_words("135-0.txt", 10).</syntaxhighlight>
9 7922 that
10 6659 it
<syntaxhighlight lang="purebasic">EnableExplicit
Structure wordcount
Define token.c, word$, idx.i, start.i, arg$
NewMap wordmap.i()
NewList wordlist.wordcount()
If OpenConsole("")
arg$ = ProgramParameter(0)
If arg$ = "" : End 1 : EndIf
start = ElapsedMilliseconds()
If ReadFile(0, arg$, #PB_Ascii)
While Not Eof(0)
token = ReadCharacter(0, #PB_Ascii)
Select token
Case 'A' To 'Z', 'a' To 'z'
word$ + LCase(Chr(token))
If word$
wordmap(word$) + 1
word$ = ""
ForEach wordmap()
wordlist()\wkey$ = MapKey(wordmap())
wordlist()\count = wordmap()
SortStructuredList(wordlist(), #PB_Sort_Descending, OffsetOf(wordcount\count), TypeOf(wordcount\count))
PrintN("Elapsed milliseconds: " + Str(ElapsedMilliseconds() - start))
PrintN("File: " + GetFilePart(arg$))
PrintN(~"Rank\tCount\t\t Word")
If FirstElement(wordlist())
For idx = 1 To 10
Print(RSet(Str(idx), 2))
PrintN(RSet(Str(wordlist()\count), 6))
If NextElement(wordlist()) = 0
Elapsed milliseconds: 462
File: 135-0.txt
Rank Count Word
1 the 41093
2 of 19954
3 and 14943
4 a 14558
5 to 13953
6 in 11219
7 he 9649
8 was 8622
9 that 7924
10 it 6661
<syntaxhighlight lang="python">import collections
import re
import string
if __name__ == "__main__":
<syntaxhighlight lang="python">from collections import Counter
from re import findall
if __name__ == "__main__":
n = int(input('How many?: '))
most_common_words_in_file(les_mis_file, n)</syntaxhighlight>
===Sorted and groupby===
{{Works with|Python|3.7}}
<syntaxhighlight lang="python">"""
Word count task from Rosetta Code
if __name__ == '__main__':
<pre>('the', 40372)
('that', 7250)
('his', 6414)</pre>
===Collections, Sorted and Lambda===
<syntaxhighlight lang="python">
import collections
import re
count = 10
with open("135-0.txt") as f:
text = f.read()
word_freq = sorted(
collections.Counter(sorted(re.split(r"\W+", text.lower()))).items(),
key=lambda c: c[1],
for i in range(len(word_freq)):
print("[{:2d}] {:>10} : {}".format(i + 1, word_freq[i][0], word_freq[i][1]))
if i == count - 1:
<pre>[ 1] the : 41039
[ 2] of : 19951
[ 3] and : 14942
[ 4] a : 14527
[ 5] to : 13941
[ 6] in : 11209
[ 7] he : 9646
[ 8] was : 8620
[ 9] that : 7922
[10] it : 6659</pre>
===Version 1===
I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.
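The apostrophe rule described here can be sketched in Python as follows. This is an illustration only, not the R code: <code>TOKEN</code>, <code>normalise</code> and <code>count_words</code> are hypothetical names, and the sketch assumes that a trailing "'s" is dropped entirely (so that "mom's" becomes "mom"), which is one possible reading of the rule.

```python
import re
from collections import Counter

# Letters, optionally joined by single apostrophes or hyphens (hyphens are kept).
TOKEN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def normalise(token):
    # Assumption: only a trailing "'s" is removed; other contractions stay intact.
    return re.sub(r"'s$", "", token)


def count_words(text):
    return Counter(normalise(t) for t in TOKEN.findall(text.lower()))


counts = count_words("Mom's hat; mom said they're well-to-do. They left.")
print(counts["mom"], counts["they"], counts["they're"], counts["well-to-do"])
```

Under this reading "mom" and "mom's" are merged, while "they" and "they're" remain separate entries and hyphenated words are left whole.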
<syntaxhighlight lang="r">
9 it 2308
10 i 1845
===Version 2===
This version is purely functional using the native pipe operator in R 4.1+ and runs in less than a second.
<syntaxhighlight lang="r">
word_frequency_pipeline <- function(file=NULL, n=10) {
file |>
vroom::vroom_lines() |>
stringi::stri_split_boundaries(type="word", skip_word_none=T, skip_word_number=T) |>
unlist() |>
tolower() |>
table() |>
sort(decreasing = T) |>
(\(.) .[1:n])() |>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
<syntaxhighlight lang="racket">#lang racket
(define (all-words f (case-fold string-downcase))
(module+ main
(take (counts (all-words "data/les-mis.txt")) 10))</syntaxhighlight>
(formerly Perl 6)
{{works with|Rakudo|2022.07}}
Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.
This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is '''full''' of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.
We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various [[wp:diacritic|diacritic]]s. Those '''are''' letters, so words will be incorrectly split up (Misérables will be counted as 'mis' and 'rables', which is probably not what we want).
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text and gets incorrectly split into "Al" and "the", so this matcher incorrectly reports 41089 occurrences of "the".
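The pitfalls described above can be reproduced on a tiny sample with the following Python sketch. It is an illustration rather than a port of the Raku code: the sample sentence, <code>MATCHERS</code> and <code>count_with</code> are made up for this example, and <code>[^\W_]+</code> is used as an assumed Python counterpart of "word characters without underscore".

```python
import re
from collections import Counter

# Hypothetical sample containing Gutenberg-style italics markup and accents.
text = "Les _Misérables_ by Victor Hugo. Alèthe met the man; the end."

MATCHERS = {
    "ascii": r"[a-z]+",            # simple 7-bit ASCII letters
    "word": r"\w+",                # word characters, underscore included
    "no_underscore": r"[^\W_]+",   # word characters without the underscore
}


def count_with(pattern, source):
    """Lowercase the text and count every match of the pattern."""
    return Counter(re.findall(pattern, source.lower()))


for name, pattern in MATCHERS.items():
    counts = count_with(pattern, text)
    print(name, counts["the"], sorted(w for w in counts if "rables" in w))
```

The ASCII matcher splits the accented words and so finds one extra "the" inside "Alèthe", the plain <code>\w+</code> matcher keeps the underscores attached to "misérables", and only the third matcher yields the intended words.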
Here is a sample that shows the result when using various different matchers.
<syntaxhighlight lang="raku" line>sub MAIN ($filename, UInt $top = 10) {
my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
my @matcher = (
rx/ <[a..z]>+ /, # simple 7-bit ASCII
rx/ \w+ /, # word characters with underscore
rx/ <[\w]-[_]>+ /, # word characters without underscore
rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* / # word characters without underscore but with hyphens and contractions
);
for @matcher -> $reg {
say "\nTop $top using regex: ", $reg.raku;
my @words = $file.comb( $reg ).Bag.sort(-*.value)[^$top];
my $length = max @words».key».chars;
printf "%-{$length}s %d\n", .key, .value for @words;
that 7825
it 6535</pre>
It can be difficult to figure out what words the different regexes do or don't match. Here are the three more complex regexes, each with a list of "words" that it treats differently from /a..z/; that is, each listed token is lumped in with one of the top 10 words when using /a..z/ but not when using this regex.
<pre>Top 10 using regex: rx/ \w+ /
the 41035 alèthe _the _the_
of 19946 of_ _of_
and 14940 _and_ paternoster_and
a 14577 _ça aïe ça keksekça aérostiers _a poréa panathenæa
to 13939 to_ _to
in 11204 _in
he 9645 _he
was 8619 _was
that 7922 _that
it 6659 _it
Top 10 using regex: rx/ <[\w]-[_]>+ /
the 41088 alèthe
of 19949
and 14942
a 14596 poréa ça aérostiers panathenæa aïe keksekça
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the 41081 will-o'-the-wisps alèthe skip-the-gutter police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change jean-the-screw will-o'-the-wisp
of 19930 chromate-of-lead-colored die-of-hunger die-of-cold-if-you-have-bread police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change unheard-of die-of-hunger-if-you-have-a-fire
and 14934 come-and-see so-and-so cock-and-bull hide-and-seek sambre-and-meuse
a 14587 keksekça l'a ça now-a-days vis-a-vis a-dreaming police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change poréa panathenæa aérostiers a-hunting aïe die-of-hunger-if-you-have-a-fire
to 13735 to-morrow to-day hand-to-hand to-night well-to-do face-to-face
in 11204 in-pace son-in-law father-in-law whippers-in general-in-chief sons-in-law
he 9607 he's he'll
was 8620 police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change
that 7825 that's pick-me-down-that
it 6535 it's it'll</pre>
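Lists like the ones above can be derived mechanically. The following Python sketch shows one possible way to do it; <code>absorbed_tokens</code> and the sample sentence are hypothetical, and the two patterns are assumed Python approximations of the Raku matchers, not the code used to produce the lists.

```python
import re


def absorbed_tokens(text, pattern, target):
    """Tokens matched by `pattern` that a plain [a-z]+ matcher would split
    into pieces including `target` (and that are not `target` itself)."""
    found = set()
    for token in re.findall(pattern, text.lower()):
        if token != target and target in re.findall(r"[a-z]+", token):
            found.add(token)
    return sorted(found)


sample = "The will-o'-the-wisp and Alèthe saw _the_ other theatre."
hyphenated = r"[^\W_]+(?:(?:'-|'|-)[^\W_]+)*"

print(absorbed_tokens(sample, r"\w+", "the"))
print(absorbed_tokens(sample, hyphenated, "the"))
```

Each token returned is one that the simple ASCII matcher would have counted towards the target word, while the richer matcher keeps it as a word of its own.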
One nice thing is this isn't special cased. It will work out of the box for any text / language.
[https://www.gutenberg.org/files/14741/14741-0.txt Russian]? No problem.
<pre>$ raku wf 14741-0.txt 5</pre>
<pre>Top 5 using regex: rx/ <[a..z]>+ /
the 176
of 119
gutenberg 93
project 87
to 80
Top 5 using regex: rx/ \w+ /
и 860
в 579
не 290
на 222
ты 195
Top 5 using regex: rx/ <[\w]-[_]>+ /
и 860
в 579
не 290
на 222
ты 195
Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
и 860
в 579
не 290
на 222
ты 195</pre>
[https://www.gutenberg.org/files/39963/39963-0.txt Greek]? Sure, why not.
<pre>$ raku wf 39963-0.txt 5</pre>
<pre>Top 5 using regex: rx/ <[a..z]>+ /
the 187
of 123
gutenberg 93
project 87
to 82
Top 5 using regex: rx/ \w+ /
και 1628
εις 986
δε 982
του 895
των 859
Top 5 using regex: rx/ <[\w]-[_]>+ /
και 1628
εις 986
δε 982
του 895
των 859
Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
και 1628
εις 986
δε 982
του 895
των 859</pre>
Of course, for the first matcher, we are asking specifically to match Latin ASCII, so we end up with... well... Latin ASCII; but the other 3 match any Unicode characters.
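The same contrast can be seen in a small Python sketch (an illustration only; the sample sentence and the helper <code>top</code> are made up for this example). Python's string patterns are Unicode-aware by default, so <code>[^\W_]+</code> matches Cyrillic letters while <code>[a-z]+</code> sees only the ASCII words.

```python
import re
from collections import Counter


def top(text, pattern, n=3):
    """Most common matches of the pattern in the lowercased text."""
    return Counter(re.findall(pattern, text.lower())).most_common(n)


# Hypothetical sample mixing Russian text with a few English words.
russian = "И в лесу, и в поле: the Project Gutenberg и не только."

print(top(russian, r"[a-z]+"))       # only the ASCII words are found
print(top(russian, r"[^\W_]+", 2))   # Unicode letters are counted as well
```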
This REXX version doesn't need to sort the list of words.
Extra code was added to handle some foreign &nbsp; (non-Latin) &nbsp; letters and also to handle most accented letters.

This version recognizes all the accented letters that are present in the required/specified text (file) &nbsp; (and some other non-Latin letters as well). &nbsp; This means that the word &nbsp; &nbsp; <big><big> Alèthe </big></big> &nbsp; &nbsp; is treated as one word, &nbsp; <u>not</u> as two words &nbsp; &nbsp; <big><big> Al &nbsp; the </big></big> &nbsp; &nbsp; (and thereby does not add two separate words).

This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>) &nbsp; <big>[</big>that is, within a word, &nbsp; but <u>not</u> those words that start or end with an apostrophe; &nbsp; for those encapsulated words, &nbsp; the apostrophe is elided<big>]</big>.

Thus, &nbsp; ''' it's ''' &nbsp; is counted separately from &nbsp; '''it''' &nbsp; and/or &nbsp; '''its'''.

Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file.
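The idea of listing the top words without sorting can be sketched in Python as repeated maximum selection. This is only an illustration of the general approach, not a translation of the REXX program; <code>top_without_sorting</code> is a hypothetical name, and tied words are assumed to share a rank, with the following rank skipped accordingly.

```python
def top_without_sorting(counts, top=10):
    """Yield (rank, word, count) by repeatedly picking the current maximum.

    Words with the same count share a rank, and the ranks they consume
    are skipped afterwards.
    """
    remaining = dict(counts)          # work on a copy of the caller's counts
    rank = 1
    while remaining and rank <= top:
        best = max(remaining.values())
        tied = [w for w, c in remaining.items() if c == best]
        for word in tied:
            yield rank, word, best
            del remaining[word]       # never consider this word again
        rank += len(tied)


sample_counts = {"the": 5, "of": 3, "and": 3, "a": 1}
for rank, word, count in top_without_sorting(sample_counts, top=3):
    print(rank, word, count)
```

Each pass scans all remaining words once, so for a small fixed number of ranks this avoids sorting the whole word list.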
<syntaxhighlight lang="rexx">/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.txt' /*None specified? Then use the default.*/
if top=='' | top=="," then top= 10 /* " " " " " " */
call init /*initialize varied bunch of variables.*/
call rdr
say right('word', 40) " " center(' rank ', 6) " count " /*display title for output*/
say right('════', 40) " " center('══════', 6) " ═══════" /* " title separator.*/
do until otops==tops | tops>top /*process enough words to satisfy TOP.*/
WL=; mk= 0; otops= tops /*initialize the word list (to a NULL).*/
do n=1 for c; z= !.n; k= @.z /*process the list of words in the file*/
if k==mk then WL= WL z /*handle cases of tied number of words.*/
if k> mk then do; mk=k; WL=z; end /*this word count is the current max. */
end /*n*/
wr= max( length(' rank '), length(top) ) /*find the maximum length of the rank #*/
do d=1 for words(WL); y= word(WL, d) /*process all words in the word list. */
if d==1 then w= max(10, length(@.y) ) /*use length of the first number used. */
say right(y, 40) right( commas(tops), wr) right(commas(@.y), w)
@.y= . /*nullify word count for next go 'round*/
end /*d*/ /* [↑] this allows a non-sorted list. */
tops= tops + words(WL) /*correctly handle any tied rankings.*/
end /*until*/
exit /*stick a fork in it, we're all done. */
commas: procedure; parse arg ?; do jc=length(?)-3 to 1 by -3; ?=insert(',', ?, jc); end; return ?
16bit: do k=1 for xs; _=word(x,k); $=changestr('├'left(_,1),$,right(_,1)); end; return
init: x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü'; xs= words(x)
abcL="abcdefghijklmnopqrstuvwxyz'" /*lowercase letters of Latin alphabet. */
abcU= abcL; upper abcU /*uppercase version of Latin alphabet. */
accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /*some lowercase accented characters. */
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " uppercase " " */
accG= 'αßΓπΣσµτΦΘΩδφε' /* " upper/lowercase Greek letters. */
ll= abcL || abcL ||accL ||accL || accG /*chars of after letters. */
uu= abcL || abcU ||accL ||accU || accG || xrange() /* " " before " */
@.= 0; q= "'"; totW= 0; !.= @.; c= 0; tops= 1; return
rdr: do #=0 while lines(fID)\==0; $=linein(fID) /*loop whilst there're lines in file.*/
if pos('├', $) \== 0 then call 16bit /*are there any 16-bit characters ?*/
$= translate( $, ll, uu) /*trans. uppercase letters to lower. */
do while $ \= ''; parse var $ z $ /*process each word in the $ line. */
parse var z z1 2 zr '' -1 zL /*obtain: first, middle, & last char.*/
if z1==q then do; z=zr; if z=='' then iterate; end /*starts with apostrophe?*/
if zL==q then z= strip(left(z, length(z) - 1)) /*ends " " ?*/
if z=='' then iterate /*if Z is now null, skip.*/
if @.z==0 then do; c=c+1; !.c=z; end /*bump word cnt; assign word to array*/
totW= totW + 1; @.z= @.z + 1 /*bump total words; bump a word count*/
end /*while*/
end /*#*/
say commas(totW) ' words found ('commas(c) "unique) in " commas(#),
' records read from file: ' fID; say; return</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
574,122 words found (23,414 unique) in 67,663 records read from file: les_mes.txt
word rank count
Line 2,890 ⟶ 4,368:
Inspired by version 1 and adapted for ooRexx.
It ignores all characters other than a-z and A-Z (which are translated to a-z).
<syntaxhighlight lang="text">/*REXX program reads and displays a count of words in a file. Word case is ignored.*/
Call time 'R'
Line 2,940 ⟶ 4,418:
tops=tops+words(tl) /*correctly handle the tied rankings. */
Say time('E') 'seconds elapsed'</syntaxhighlight>
<pre>We found 22820 different words
Line 2,958 ⟶ 4,436:
<syntaxhighlight lang="ring">
# project : Word count
Line 3,017 ⟶ 4,495:
b = temp
return [a, b]
Line 3,033 ⟶ 4,511:
<syntaxhighlight lang="ruby">
class String
def wc
Line 3,043 ⟶ 4,521:
open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }
Line 3,059 ⟶ 4,537:
===Tally and max_by===
{{Works with|Ruby|2.7}}
<syntaxhighlight lang="ruby">RE = /[[:alpha:]]+/
count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }
Line 3,074 ⟶ 4,552:
===Chain of Enumerables===
<syntaxhighlight lang="ruby">wf = File.read("135-0.txt", :encoding => "UTF-8")
.each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
.sort_by { |k, v| v }
.each_with_index { |w, i|
printf "[%2d] %10s : %d\n",
i += 1,
<pre>[ 1] the : 41040
[ 2] of : 19951
[ 3] and : 14942
[ 4] a : 14539
[ 5] to : 13941
[ 6] in : 11209
[ 7] he : 9646
[ 8] was : 8620
[ 9] that : 7922
[10] it : 6659
<syntaxhighlight lang="rust">use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
Line 3,108 ⟶ 4,613:
fn main() {
word_count(File::open("135-0.txt").expect("File open error"), 10)
Line 3,128 ⟶ 4,633:
Best seen running in your browser [https://scastie.scala-lang.org/EP2Fm6HXQrC1DwtSNvnUzQ Scastie (remote JVM)].
<syntaxhighlight lang="scala">import scala.io.Source
object WordCount extends App {
Line 3,151 ⟶ 4,656:
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
<pre>Rank Word Frequency
Line 3,175 ⟶ 4,680:
to get words from a file. The words are [http://seed7.sourceforge.net/libraries/string.htm#lower(in_string) converted to lower case], to assure that "The" and "the" are considered the same.
<syntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "gethttp.s7i";
include "strifile.s7i";
Line 3,216 ⟶ 4,721:
end for;
end for;
end func;</syntaxhighlight>
Line 3,234 ⟶ 4,739:
<syntaxhighlight lang="ruby">var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')
Line 3,247 ⟶ 4,752:
top.each { |pair|
say "#{pair.key}\t-> #{pair.value}"
Line 3,263 ⟶ 4,768:
<syntaxhighlight lang="simula">COMMENT COMPILE WITH
$ cim -m64 word-count.sim
Line 3,542 ⟶ 5,047:
Line 3,557 ⟶ 5,062:
6 garbage collection(s) in 0.2 seconds.
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
<syntaxhighlight lang="swift">import Foundation
func printTopWords(path: String, count: Int) throws {
// load file contents into a string
let text = try String(contentsOfFile: path, encoding: String.Encoding.utf8)
var dict = Dictionary<String, Int>()
// split text into words, convert to lowercase and store word counts in dict
let regex = try NSRegularExpression(pattern: "\\w+")
regex.enumerateMatches(in: text, range: NSRange(text.startIndex..., in: text)) {
(match, _, _) in
guard let match = match else { return }
let word = String(text[Range(match.range, in: text)!]).lowercased()
dict[word, default: 0] += 1
// sort words by number of occurrences
let wordCounts = dict.sorted(by: {$0.1 > $1.1})
// print the top count words
for (i, (word, n)) in wordCounts.prefix(count).enumerated() {
print("\(i + 1)\t\(word)\t\(n)")
do {
try printTopWords(path: "135-0.txt", count: 10)
} catch {
Rank Word Count
1 the 41039
2 of 19951
3 and 14942
4 a 14527
5 to 13941
6 in 11209
7 he 9646
8 was 8620
9 that 7922
10 it 6659
<syntaxhighlight lang="tcl">lassign $argv head
while { [gets stdin line] >= 0 } {
foreach word [regexp -all -inline {[A-Za-z]+} $line] {
dict incr wordcount [string tolower $word]
set sorted [lsort -stride 2 -index 1 -int -decr $wordcount]
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
puts "$count\t$word"
./wordcount-di.tcl 10 < 135-0.txt
41093 the
19954 of
14943 and
14558 a
13953 to
11219 in
9649 he
8622 was
7924 that
6661 it
McIlroy's Unix TMG:
<syntaxhighlight lang="unixtmg">/* Input format: N text */
/* Only lowercase letters can constitute a word in text. */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence. */
Line 3,622 ⟶ 5,219:
/* Character classes */
letter: <<abcdefghijklmnopqrstuvwxyz>>;
other: !<<abcdefghijklmnopqrstuvwxyz>>;</syntaxhighlight>
Unix TMG didn't have a <tt>tolower</tt> builtin. Therefore, you would use it together with <tt>tr</tt>:
<syntaxhighlight lang="bash">cat file | tr A-Z a-z > file1; ./a.out file1</syntaxhighlight>
Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e):
<syntaxhighlight lang="bash">cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</syntaxhighlight>
<syntaxhighlight lang="Scheme">#lang transd
MainModule: {
_start: (λ locals: cnt 0
(with fs FileStream() words String()
(open-r fs "/mnt/text/Literature/Miserables.txt")
(textin fs words)
(with v ( -|
(split (tolower words))
(regroup-by (λ v Vector<String>() -> Int() (size v))))
(for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i))
(+= cnt 1) (if (> cnt 10) break))
the: 40379
of: 19869
and: 14468
a: 14278
to: 13590
in: 11025
he: 9213
was: 8347
that: 7249
his: 6414
had: 6051
=={{header|UNIX Shell}}==
Line 3,634 ⟶ 5,264:
{{works with|zsh}}
This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.
<syntaxhighlight lang="bash">#!/bin/sh
<"${1}" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | head -n "${2}"</syntaxhighlight>
Line 3,652 ⟶ 5,282:
6661 it
=== Original + URL import ===
This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.
<syntaxhighlight lang="zsh">curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q</syntaxhighlight>
<pre>41096 the
19955 of
14939 and
14558 a
13954 to
11218 in
9649 he
8622 was
7924 that
6661 it</pre>
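For readers less familiar with the Unix tools, the stages of the pipeline above can be sketched in Python as follows. This is only an approximate equivalent under stated assumptions: the function name <tt>top_words</tt> is hypothetical, and words with equal counts may come out in a different order than <tt>sort -rn</tt> would produce.
<syntaxhighlight lang="python">import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    words = re.findall(r"[A-Za-z]+", text)   # tr -cs A-Za-z '\n'
    words = [w.lower() for w in words]       # tr A-Z a-z
    counts = Counter(words)                  # sort | uniq -c
    return counts.most_common(n)             # sort -rn | sed "${n}q"

if __name__ == "__main__":
    import sys
    # Read the text from standard input, as the shell pipeline does.
    for word, count in top_words(sys.stdin.read(), 10):
        print(f"{count:7d} {word}")</syntaxhighlight>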
In order to use it, you have to adapt the PATHFILE Const.
<syntaxhighlight lang="vb">
Option Explicit
Line 3,772 ⟶ 5,425:
If d.Exists(Word) Then _
DisplayFrequencyOf = d(Word)
End Function</syntaxhighlight>
<pre>Words different in this book : 25884
Line 3,795 ⟶ 5,448:
Execution Time : 7,785 sec.</pre>
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.
Not very quick (runs in about 15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.
If the Go example is re-run today (17 February 2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 5 years ago.
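The word definition described above can be illustrated with a short Python sketch. It assumes that "letter" means any alphanumeric character with a code point below 256, which only approximates what the Wren pattern matches; the names <tt>is_letter</tt> and <tt>words</tt> are hypothetical.
<syntaxhighlight lang="python">from itertools import groupby

def is_letter(ch):
    # Assumed rule: alphanumeric and within the first 256 code points.
    # Underscore, hyphen and apostrophe are not alphanumeric, so they split words.
    return ord(ch) < 256 and ch.isalnum()

def words(text):
    """Yield lowercase runs of characters accepted by is_letter."""
    for keep, run in groupby(text.lower(), key=is_letter):
        if keep:
            yield "".join(run)

print(list(words("Jean_Valjean's well-known caf\u00e9")))</syntaxhighlight>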
<syntaxhighlight lang="wren">import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
var fileName = "135-0.txt"
var text = File.read(fileName).trimEnd()
var groups = {}
// match runs of A-Z, a-z, 0-9 and any non-ASCII letters with code-points < 256
var p = Pattern.new("+1&w")
var lines = text.split("\n")
for (line in lines) {
var ms = p.findAll(line)
for (m in ms) {
var t = Str.lower(m.text)
groups[t] = groups.containsKey(t) ? groups[t] + 1 : 1
var keyVals = groups.toList
Sort.quick(keyVals, 0, keyVals.count - 1) { |i, j| (j.value - i.value).sign }
System.print("Rank Word Frequency")
System.print("==== ==== =========")
for (rank in 1..10) {
var word = keyVals[rank-1].key
var freq = keyVals[rank-1].value
Fmt.print("$2d $-4s $5d", rank, word, freq)
Rank Word Frequency
==== ==== =========
1 the 41092
2 of 19954
3 and 14943
4 a 14546
5 to 13953
6 in 11219
7 he 9649
8 was 8622
9 that 7924
10 it 6661
<syntaxhighlight lang="xquery">let $maxentries := 10,
$uri := 'https://www.gutenberg.org/files/135/135-0.txt'
Line 3,817 ⟶ 5,526:
return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
<syntaxhighlight lang="xml"><words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
<word key="the" count="41092"/>
<word key="of" count="19954"/>
Line 3,830 ⟶ 5,539:
<word key="that" count="7924"/>
<word key="it" count="6661"/>
<syntaxhighlight lang="zkl">fname,count := vm.arglist; // grab command line args
// words may have leading or trailing "_", ie "the" and "_the"
Line 3,839 ⟶ 5,548:
RegExp("[a-z]+").pump.fp1(Dictionary().incV)) // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
Line 3,854 ⟶ 5,563:
{{omit from|6502 Assembly|The text file is much larger than the CPU's address space.}}
{{omit from|Z80 Assembly}}
{{omit from|8080 Assembly}}
