=={{header|11l}}==
<langsyntaxhighlight lang="11l">DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])</syntaxhighlight>
 
{{out}}
=={{header|Ada}}==
{{works with|Ada|Ada|2012}}
 
<syntaxhighlight lang="ada">with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
end loop;
end Word_Frequency;
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|ALGOL 68}}==
{{works with|ALGOL 68G|Any - tested with release 2.8.3.win32}}
Uses the associative array implementations in [[ALGOL_68/prelude]].
<langsyntaxhighlight lang="algol68"># find the n most common words in a file #
# use the associative array in the Associate array/iteration task #
# but with integer values #
print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
OD
FI</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|APL}}==
{{works with|GNU APL}}
 
<syntaxhighlight lang="apl">
⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
the of and a to
41042 19952 14938 14526 13942
</syntaxhighlight>
 
=={{header|AppleScript}}==
 
<langsyntaxhighlight lang="applescript">(*
For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
The set text is too messy to warrant faffing around with anything more sophisticated.
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)</syntaxhighlight>
 
{{output}}
<langsyntaxhighlight lang="applescript">"The 10 most frequently occurring words in the file are:
The: 41092
Of: 19954
Was: 8622
That: 7924
It: 6661"</langsyntaxhighlight>
 
=={{header|Arturo}}==
 
<langsyntaxhighlight lang="rebol">findFrequency: function [file, count][
freqs: #[]
r: {/[[:alpha:]]+/}
loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
print pair
]</syntaxhighlight>
 
{{out}}
 
=={{header|AutoHotkey}}==
<syntaxhighlight lang="autohotkey">URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete, % A_temp "\tempfile.txt"
}
MsgBox % "Freq`tWord`n" result
return</syntaxhighlight>
Outputs:<pre>Freq Word
41036 The
</pre>
 
=={{header|AWK}}==
<syntaxhighlight lang="awk">
# syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
#
exit(0)
}
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|BASIC}}==
==={{header|QB64}}===
This is rather long code. It fulfills the task requirement with QB64. It "cleans" each word, treating as a word anything that begins and ends with a letter. It works with arrays. The speed at which QB64 does this job on a file as big as Les Miserables.txt is amazing.
<syntaxhighlight lang="qbasic">
OPTION _EXPLICIT
 
 
END SUB
</syntaxhighlight>
 
{{output}}
==={{header|BaCon}}===
All punctuation, digits, tabs and carriage returns are removed, so "This", "this" and "this." are counted as the same word. There is full support for UTF8 characters in words. The code itself could be smaller, but for the sake of clarity everything has been written out explicitly.
<langsyntaxhighlight lang="bacon">' We do not count superfluous spaces as words
OPTION COLLAPSE TRUE
 
FOR i = 0 TO 9
PRINT term$[i], " : ", frequency(term$[i])
NEXT</syntaxhighlight>
{{output}}
<pre>
</pre>

=={{header|Batch File}}==
You could cut the length of this down drastically if you didn't need to be able to recall the word at the nth position and only wished to display the top 10 words.
 
<langsyntaxhighlight lang="dos">
@echo off
 
goto:eof
</syntaxhighlight>
 
 
=={{header|Bracmat}}==
 
 
<langsyntaxhighlight lang="bracmat"> ( 10-most-frequent-words
= MergeSort { Local variable declarations. }
types
& !most-frequent-words { Return the last 10 terms. }
)
& out$(10-most-frequent-words$"135-0.txt") { Call 10-most-frequent-words with name of input file and print result to screen. }</syntaxhighlight>
'''Output'''
<pre> (6661.it)
</pre>

=={{header|C}}==
{{libheader|GLib}}
Words are defined by the regular expression "\w+".
<langsyntaxhighlight lang="c">#include <stdbool.h>
#include <stdio.h>
#include <glib.h>
return EXIT_FAILURE;
return EXIT_SUCCESS;
}</syntaxhighlight>
 
{{out}}
=={{header|C sharp|C#}}==
{{trans|D}}
<langsyntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>
 
=={{header|C++}}==
<langsyntaxhighlight lang="cpp">#include <algorithm>
#include <cstdlib>
#include <fstream>
return 0;
}
</syntaxhighlight>
 
{{out}}
===Alternative===
{{trans|C#}}
<langsyntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <fstream>
 
return 0;
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
9 he 6814
10 had 6139</pre>
 
===C++20===
{{trans|C#}}
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <format>
#include <fstream>
#include <map>
#include <ranges>
#include <regex>
#include <string>
#include <vector>
 
int main() {
std::ifstream in("135-0.txt");
std::string text{
std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}
};
in.close();
 
std::regex word_rx("\\w+");
std::map<std::string, int> freq;
for (const auto& a : std::ranges::subrange(
std::sregex_iterator{ text.cbegin(),text.cend(), word_rx }, std::sregex_iterator{}
))
{
auto word = a.str();
transform(word.begin(), word.end(), word.begin(), ::tolower);
freq[word]++;
}
 
std::vector<std::pair<std::string, int>> pairs;
for (const auto& elem : freq)
{
pairs.push_back(elem);
}
 
std::ranges::sort(pairs, std::ranges::greater{}, &std::pair<std::string, int>::second);
 
std::cout << "Rank Word Frequency\n"
"==== ==== =========\n";
for (int rank=1; const auto& [word, count] : pairs | std::views::take(10))
{
std::cout << std::format("{:2} {:>4} {:5}\n", rank++, word, count);
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 41043
2 of 19952
3 and 14938
4 a 14539
5 to 13942
6 in 11208
7 he 9646
8 was 8620
9 that 7922
10 it 6659</pre>
 
=={{header|Clojure}}==
<langsyntaxhighlight lang="clojure">(defn count-words [file n]
(->> file
slurp
frequencies
(sort-by val >)
(take n)))</syntaxhighlight>
 
{{Out}}
 
=={{header|COBOL}}==
<syntaxhighlight lang="cobol">
IDENTIFICATION DIVISION.
PROGRAM-ID. WordFrequency.
CLOSE Word-File Output-File.
END-PROGRAM.
</syntaxhighlight>
 
{{Out}}
 
=={{header|Common Lisp}}==
<langsyntaxhighlight lang="lisp">
(defun count-word (n pathname)
(with-open-file (s pathname :direction :input)
(dolist (word words) (incf (gethash word hash 0)))
(maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)
</syntaxhighlight>
 
{{Out}}
 
=={{header|Crystal}}==
<langsyntaxhighlight lang="ruby">require "http/client"
require "regex"
 
.sort { |a, b| b[1] <=> a[1] }[0..9] # sort and get the first 10 elements
.each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result
</syntaxhighlight>
 
{{out}}
 
=={{header|D}}==
<syntaxhighlight lang="d">import std.algorithm : sort;
import std.array : appender, split;
import std.range : take;
writefln("%4s %-10s %9s", rank++, word.k, word.v);
}
}</syntaxhighlight>
 
{{out}}
=={{header|Delphi}}==
{{libheader| System.RegularExpressions}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Word_frequency;
 
readln;
end.
</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|F Sharp}}==
<langsyntaxhighlight lang="fsharp">
open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)
</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Factor}}==
This program expects stdin to read from a file via the command line (e.g. invoking the program in Windows: <tt>>factor word-count.factor < input.txt</tt>). The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.
<langsyntaxhighlight lang="factor">
USING: ascii io math.statistics prettyprint sequences
splitting ;
lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .
</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|FreeBASIC}}==
<langsyntaxhighlight lang="freebasic">
#Include "file.bi"
type tally
print "time for operation ";timer-tm;" seconds"
sleep
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|Frink}}==
There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:
<langsyntaxhighlight lang="frink">d = new dict
for w = select[wordList[read[normalizeUnicode["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
d.increment[lc[w], 1]
 
println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]</syntaxhighlight>
 
{{out}}
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and still performs full Unicode-aware normalization, capitalization, and word-breaking:
 
<langsyntaxhighlight lang="frink">formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</langsyntaxhighlight>
 
{{out}}
=={{header|FutureBasic}}==
Task said: "Feel free to explicitly state the thoughts behind the program decisions." Thus the heavy comments.
<langsyntaxhighlight lang="futurebasic">
 
include "NSLog.incl"
 
CFDictionaryRef dict
 
// Depending on the value of the caseSensitive Boolean function parameter above, lowercase the incoming text
if caseSensitive == NO then textStr = fn StringLowercaseString( textStr )
 
// Trim non-alphabetic characters from string, and separate individual words with a space
CFStringRef tempStr = fn ArrayComponentsJoinedByString( fn StringComponentsSeparatedByCharactersInSet( textStr, fn CharacterSetInvertedSet( fn CharacterSetLetterSet ) ), @" " )
 
CountedSetRef freqencies = fn CountedSetWithArray( tempArr )
 
// Enumerate each word-frequency pair in the counted set...
EnumeratorRef enumRef = fn CountedSetObjectEnumerator( freqencies )
 
CFMutableArrayRef wordArr = fn MutableArrayWithCapacity( 0 )
 
// Create word counter
NSInteger totalWords = 0
// Enumerate each unique word, get its frequency, create its own key/value pair dictionary, add each dictionary into master array
for wrd in array
totalWords++
next
 
// Create an immutable output string from the mutable string
CFStringRef resultStr = fn StringWithFormat( @"%@", mutStr )
end fn = resultStr
 
HandleEvents
</syntaxhighlight>
{{output}}
<pre>
1 41095 the
22910 1 isabella
 
Total unique words in document: 22910
Elapsed time: 595.407963 milliseconds.
</pre>
=={{header|Go}}==
{{trans|Kotlin}}
<langsyntaxhighlight lang="go">package main
 
import (
fmt.Printf("%2d %-4s %5d\n", rank, word, freq)
}
}</syntaxhighlight>
 
{{out}}
=={{header|Groovy}}==
Solution:
<langsyntaxhighlight lang="groovy">def topWordCounts = { String content, int n ->
def mapCounts = [:]
content.toLowerCase().split(/\W+/).each {
println "Rank Word Frequency\n==== ==== ========="
(0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
}</syntaxhighlight>
 
Test:
<langsyntaxhighlight lang="groovy">def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)</langsyntaxhighlight>
 
Output:
=={{header|Haskell}}==
===Lazy IO with pure Map, arrows===
{{trans|Clojure}}
<syntaxhighlight lang="haskell">module Main where
 
import Control.Category -- (>>>)
>>> take n
>>> print)
when filep (hClose hand)</syntaxhighlight>
{{Out}}
<pre>
</pre>
===Lazy IO, map of IORefs===
Using IORefs as values in the map seems to give a ~2x speedup on large files. The below code is based on https://github.com/composewell/streamly-examples/blob/master/examples/WordFrequency.hs , but still using lazy IO to avoid the extra library dependency (in production you should [https://stackoverflow.com/questions/5892653/whats-so-bad-about-lazy-i-o use a streaming library] like streamly/conduit/io-streams):
<langsyntaxhighlight lang="haskell">
module Main where
 
in mapM readRef $ M.toList freqtable
print $ take maxw $ sortOn (Down . snd) counts
</syntaxhighlight>
{{Out}}
<pre>
</pre>
===Lazy IO, short code, but not streaming===
Or, perhaps a little more simply, though not streaming (will read everything into memory, don't use on big files):
<langsyntaxhighlight lang="haskell">import qualified Data.Text.IO as T
import qualified Data.Text as T
 
 
main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)</syntaxhighlight>
{{Out}}
<pre>(40370,"the")
</pre>
 
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to examine the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the values.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
 
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
Pattern pattern = Pattern.compile("(\\w+)");
Matcher matcher;
String line;
String word;
Map<String, Integer> map = new HashMap<>();
while ((line = reader.readLine()) != null) {
matcher = pattern.matcher(line);
while (matcher.find()) {
word = matcher.group().toLowerCase();
if (map.containsKey(word)) {
map.put(word, map.get(word) + 1);
} else {
map.put(word, 1);
}
}
}
/* print out top 10 */
List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
list.sort(Map.Entry.comparingByValue());
Collections.reverse(list);
int count = 1;
for (Map.Entry<String, Integer> value : list) {
System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
if (count++ == 10) break;
}
}
}
</syntaxhighlight>
<pre>
the 41,043
of 19,952
and 14,938
a 14,539
to 13,942
in 11,208
he 9,646
was 8,620
that 7,922
it 6,659
</pre>
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>

=={{header|jq}}==
For this entry, a word may contain internal hyphens but may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.
 
<syntaxhighlight lang="jq">
< 135-0.txt jq -nR --argjson n 10 '
def bow(stream):
| from_entries
'
</syntaxhighlight>
====Output====
<syntaxhighlight lang="jq">
{
"the": 41087,
"it": 6661
}
</syntaxhighlight>
 
=={{header|Julia}}==
{{works with|Julia|1.0}}
<langsyntaxhighlight lang="julia">
using FreqTables
 
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])</syntaxhighlight>
 
{{out}}
"he" │ 6816
"had" │ 6140</pre>
 
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
("of";19898)
("and";14658)
(,"a";14517)
("to";13695)
("in";11134)
("he";9405)
("was";8361)
("that";7592)
("his";6446))</syntaxhighlight>
 
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
 
=={{header|KAP}}==
The program below defines the function 'stats', which accepts the name of a file containing the text.
 
<langsyntaxhighlight lang="kap">∇ stats (file) {
content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
sorted ← (⍋⊇⊢) content
words ← selection / sorted
{⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
}</syntaxhighlight>
{{out}}
<pre>┏━━━━━━━━━━━━┓
</pre>

=={{header|Kotlin}}==
 
There is no change in the results if the numerals 0-9 are also regarded as letters.
<langsyntaxhighlight lang="scala">// version 1.1.3
 
import java.io.File
for ((word, freq) in wordGroups)
System.out.printf("%2d %-4s %5d\n", rank++, word, freq)
}</syntaxhighlight>
 
{{out}}
 
=={{header|Liberty BASIC}}==
<langsyntaxhighlight lang="lb">dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
dim lines$(150000)
open "135-0.txt" for input as #txt
close #txt
end
</syntaxhighlight>
{{out}}
<pre>Count Word
</pre>
=={{header|Lua}}==
{{works with|lua|5.3}}
<langsyntaxhighlight lang="lua">
-- This program takes two optional command line arguments. The first (arg[1])
-- specifies the input file, or defaults to standard input. The second
io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
end
</syntaxhighlight>
 
{{Out}}
 
=={{header|Mathematica}} / {{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|MATLAB}} / {{header|Octave}}==
<syntaxhighlight lang="matlab">
function [result,count] = word_frequency()
URL='https://www.gutenberg.org/files/135/135-0.txt';
fprintf(1,'%d\t%s\n',count(k),result{k})
end
</syntaxhighlight>
 
{{out}}
 
=={{header|Nim}}==
<syntaxhighlight lang="nim">import tables, strutils, sequtils, httpclient
 
proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]
wordFrequencies.sort
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
echo alignLeft($count, 8), word</syntaxhighlight>
{{out}}
<pre>40377 the
</pre>
 
=={{header|Objeck}}==
<langsyntaxhighlight lang="objeck">use System.IO.File;
use Collection;
use RegEx;
};
}
}</syntaxhighlight>
 
Output:
=={{header|OCaml}}==
 
<langsyntaxhighlight lang="ocaml">let () =
let n =
try int_of_string Sys.argv.(1)
List.iter (fun (word, count) ->
Printf.printf "%d %s\n" count word
) r</syntaxhighlight>
 
{{out}}
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang ="perl">$top =use 10strict;
use warnings;
use utf8;
 
my $top = 10;
open $fh, "<", '135-0.txt';
($text = join '', <$fh>) =~ tr/A-Z/a-z/
or die "Can't open '135-0.txt': $!\n";
 
open my $fh, '<', 'ref/word-count.txt';
@matcher = (
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;
 
my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
);
 
for my $reg (@matcher) {
print "\nTop $top using regex: " . $reg . "\n";
my @matches = $text =~ /$reg/g;
my %words;
for my $w (@matches) { $words{$w}++ };
my $c = 0;
for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
printf "%-7s %6d\n", $w, $words{$w};
last if ++$c >= $top;
}
}</syntaxhighlight>
 
{{out}}
<pre>
Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
was 8621
that 7924
it 6661
</pre>
 
=={{header|Phix}}==
<!--<syntaxhighlight lang="phix">(notonline)-->
<span style="color: #008080;">without</span> <span style="color: #008080;">javascript_semantics</span>
<span style="color: #0000FF;">?</span><span style="color: #008000;">"loading..."</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #7060A8;">traverse_dict</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">routine_id</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"visitor"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #000000;">wf</span><span style="color: #0000FF;">,</span><span style="color: #004600;">true</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
</pre>
 
=={{header|Phixmonti}}==
<syntaxhighlight lang="phixmonti">include ..\Utilitys.pmt
 
"loading..." ?
-1 * get ?
endfor
drop</syntaxhighlight>
{{out}}
<pre>loading...
</pre>
 
=={{header|PHP}}==
<langsyntaxhighlight lang="php">
<?php
 
}
$i++;
}</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Picat}}==
To get the book proper, the header and footer are removed. Here are some tests with different sets of characters to split the words (<code>split_char/1</code>).
<syntaxhighlight lang="picat">main =>
NTop = 10,
File = "les_miserables.txt",
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</syntaxhighlight>
 
{{out}}
 
=={{header|PicoLisp}}==
<syntaxhighlight lang="picolisp">(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))
 
(if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
(println L (val L)) )</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Prolog}}==
{{works with|SWI Prolog}}
<langsyntaxhighlight lang="prolog">print_top_words(File, N):-
read_file_to_string(File, String, [encoding(utf8)]),
re_split("\\w+", String, Words),
 
main:-
print_top_words("135-0.txt", 10).</langsyntaxhighlight>
 
{{out}}
 
=={{header|PureBasic}}==
<syntaxhighlight lang="purebasic">EnableExplicit
 
Structure wordcount
EndIf
 
End</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|Python}}==
===Collections===
====Python2.7====
<langsyntaxhighlight lang="python">import collections
import re
import string
 
if __name__ == "__main__":
main()</syntaxhighlight>
 
{{Out}}
 
====Python3.6====
<langsyntaxhighlight lang="python">from collections import Counter
from re import findall
 
if __name__ == "__main__":
n = int(input('How many?: '))
most_common_words_in_file(les_mis_file, n)</syntaxhighlight>
 
{{Out}}
===Sorted and groupby===
{{Works with|Python|3.7}}
<langsyntaxhighlight lang="python">"""
Word count task from Rosetta Code
http://www.rosettacode.org/wiki/Word_count#Python
if __name__ == '__main__':
main()
</syntaxhighlight>
{{Out}}
<pre>('the', 40372)
Line 3,790 ⟶ 3,933:
 
===Collections, Sorted and Lambda===
<langsyntaxhighlight lang="python">
#!/usr/bin/python3
import collections
if i == count - 1:
break
</syntaxhighlight>
{{Out}}
<pre>[ 1] the : 41039
Line 3,824 ⟶ 3,967:
 
=={{header|R}}==
===Version 1===
I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.
<syntaxhighlight lang="r">
wordcount<-function(file,n){
punctuation=c("`","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
return(df[1:n,])
}
</syntaxhighlight>
{{Out}}
<pre>
9 it 2308
10 i 1845
</pre>
 
===Version 2===
This version is purely functional using the native pipe operator in R 4.1+ and runs in less than a second.
<syntaxhighlight lang="r">
word_frequency_pipeline <- function(file=NULL, n=10) {
file |>
vroom::vroom_lines() |>
stringi::stri_split_boundaries(type="word", skip_word_none=T, skip_word_number=T) |>
unlist() |>
tolower() |>
table() |>
sort(decreasing = T) |>
(\(.) .[1:n])() |>
data.frame()
}
</syntaxhighlight>
{{Out}}
<pre>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
</pre>
 
=={{header|Racket}}==
<langsyntaxhighlight lang="racket">#lang racket
 
(define (all-words f (case-fold string-downcase))
 
(module+ main
(take (counts (all-words "data/les-mis.txt")) 10))</syntaxhighlight>
 
{{out}}
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2022.07}}
Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.
 
This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/, but \w includes underscore, which is not a letter but a punctuation connector; and this text is '''full''' of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.
 
We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various [[wp:diacritic|diacritic]]s. Those '''are''' letters, so words will be incorrectly split up (Misérables will be counted as 'mis' and 'rables', probably not what we want).
 
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the".
 
Here is a sample that shows the result when using various different matchers.
<syntaxhighlight lang="raku" perl6line>sub MAIN ($filename, UInt $top = 10) {
my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
my @matcher = (
rx/ <[a..z]>+ /, # simple 7-bit ASCII
rx/ \w+ /, # word characters with underscore
rx/ <[\w]-[_]>+ /, # word characters without underscore
rx/ <[\w]-[_]>+ [ ["'"|'-'|"'-"] <[\w]-[_]>+ ]* / # word characters without underscore but with hyphens and contractions
);
for @matcher -> $reg {
say "\nTop $top using regex: ", $reg.raku;
my @words = $file.comb( $reg ).Bag.sort(-*.value)[^$top];
my $length = max @words».key».chars;
printf "%-{$length}s %d\n", .key, .value for @words;
}
}</syntaxhighlight>
 
{{out}}
=={{header|REXX}}==
Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to
support the accented letters in the mandated input file.
<langsyntaxhighlight lang="rexx">/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.txt' /*None specified? Then use the default.*/
end /*#*/
say commas(totW) ' words found ('commas(c) "unique) in " commas(#),
' records read from file: ' fID; say; return</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
<pre>
</pre>

===Version 2===
Inspired by version 1 and adapted for ooRexx.
It ignores all characters other than a-z and A-Z (which are translated to a-z).
<syntaxhighlight lang="text">/*REXX program reads and displays a count of words a file. Word case is ignored.*/
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
tops=tops+words(tl) /*correctly handle the tied rankings. */
end
Say time('E') 'seconds elapsed'</syntaxhighlight>
{{out}}
<pre>We found 22820 different words
</pre>
 
=={{header|Ring}}==
<langsyntaxhighlight lang="ring">
# project : Word count
 
b = temp
return [a, b]
</syntaxhighlight>
Output:
<pre>
</pre>
 
=={{header|Ruby}}==
<langsyntaxhighlight lang="ruby">
class String
def wc
 
open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }
</syntaxhighlight>
{{out}}
<pre>
</pre>
===Tally and max_by===
{{Works with|Ruby|2.7}}
<langsyntaxhighlight lang="ruby">RE = /[[:alpha:]]+/
count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }
</syntaxhighlight>
{{out}}
<pre>the->41092
</pre>
===Chain of Enumerables===
<langsyntaxhighlight lang="ruby">wf = File.read("135-0.txt", :encoding => "UTF-8")
.downcase
.scan(/\w+/)
w[1]
}
</syntaxhighlight>
{{out}}
<pre>[ 1] the : 41040
 
=={{header|Rust}}==
<syntaxhighlight lang="rust">use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
fn main() {
word_count(File::open("135-0.txt").expect("File open error"), 10)
}</syntaxhighlight>
 
{{out}}
=={{header|Scala}}==
{{Out}}
Best seen running in your browser [https://scastie.scala-lang.org/EP2Fm6HXQrC1DwtSNvnUzQ Scastie (remote JVM)].
<syntaxhighlight lang="scala">import scala.io.Source
 
object WordCount extends App {
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
 
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>

=={{header|Seed7}}==
to get words from a file. The words are [http://seed7.sourceforge.net/libraries/string.htm#lower(in_string) converted to lower case], to ensure that "The" and "the" are considered the same.
 
<langsyntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "gethttp.s7i";
include "strifile.s7i";
end for;
end for;
end func;</syntaxhighlight>
 
{{out}}
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')
 
top.each { |pair|
say "#{pair.key}\t-> #{pair.value}"
}</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|Simula}}==
<langsyntaxhighlight lang="simula">COMMENT COMPILE WITH
$ cim -m64 word-count.sim
;
 
END
</syntaxhighlight>
{{out}}
<pre>
Line 4,864 ⟶ 5,043:
6 garbage collection(s) in 0.2 seconds.
</pre>
 
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
 
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
 
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
 
=={{header|Swift}}==
<langsyntaxhighlight lang="swift">import Foundation
 
func printTopWords(path: String, count: Int) throws {
} catch {
print(error.localizedDescription)
}</syntaxhighlight>
 
{{out}}
 
=={{header|Tcl}}==
<syntaxhighlight lang="tcl">lassign $argv head
while { [gets stdin line] >= 0 } {
foreach word [regexp -all -inline {[A-Za-z]+} $line] {
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
puts "$count\t$word"
}</syntaxhighlight>
 
./wordcount-di.tcl 10 < 135-0.txt
=={{header|TMG}}==
McIlroy's Unix TMG:
<syntaxhighlight lang="unixtmg">/* Input format: N text */
/* Only lowercase letters can constitute a word in text. */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence. */
/* Character classes */
letter: <<abcdefghijklmnopqrstuvwxyz>>;
other: !<<abcdefghijklmnopqrstuvwxyz>>;</syntaxhighlight>
 
Unix TMG didn't have a <tt>tolower</tt> builtin. Therefore, you would use it together with <tt>tr</tt>:
<langsyntaxhighlight lang="bash">cat file | tr A-Z a-z > file1; ./a.out file1</langsyntaxhighlight>
 
Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e):
<langsyntaxhighlight lang="bash">cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</langsyntaxhighlight>
 
=={{header|Transd}}==
<syntaxhighlight lang="Scheme">#lang transd
 
MainModule: {
_start: (λ locals: cnt 0
(with fs FileStream() words String()
(open-r fs "/mnt/text/Literature/Miserables.txt")
(textin fs words)
 
(with v ( -|
(split (tolower words))
(group-by)
(regroup-by (λ v Vector<String>() -> Int() (size v))))
 
(for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i))
(+= cnt 1) (if (> cnt 10) break))
)))
}</syntaxhighlight>
{{out}}
<pre>
the: 40379
of: 19869
and: 14468
a: 14278
to: 13590
in: 11025
he: 9213
was: 8347
that: 7249
his: 6414
had: 6051
</pre>
 
=={{header|UNIX Shell}}==
{{works with|zsh}}
This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.
<langsyntaxhighlight lang="bash">#!/bin/sh
<"$1" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | head -n "$2"</langsyntaxhighlight>
 
 
This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.
 
<langsyntaxhighlight lang="zsh">curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q</langsyntaxhighlight>
 
{{Out}}
=={{header|VBA}}==
In order to use it, you have to adapt the PATHFILE Const.
 
<syntaxhighlight lang="vb">
Option Explicit
 
If d.Exists(Word) Then _
DisplayFrequencyOf = d(Word)
End Function</syntaxhighlight>
{{out}}
<pre>Words different in this book : 25884
</pre>

=={{header|Wren}}==
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.
 
Not very quick (runs in about 15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.
 
If the Go example is re-run today (17 February 2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 5 years ago.
<syntaxhighlight lang="wren">import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
 
var fileName = "135-0.txt"
Line 5,236 ⟶ 5,467:
var freq = keyVals[rank-1].value
Fmt.print("$2d $-4s $5d", rank, word, freq)
}</syntaxhighlight>
 
{{out}}
Line 5,256 ⟶ 5,487:
=={{header|XQuery}}==
 
<langsyntaxhighlight lang="xquery">let $maxentries := 10,
$uri := 'https://www.gutenberg.org/files/135/135-0.txt'
return
return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
}</words></syntaxhighlight>
{{out}}
<langsyntaxhighlight lang="xml"><words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
<word key="the" count="41092"/>
<word key="of" count="19954"/>
Line 5,288 ⟶ 5,519:
<word key="that" count="7924"/>
<word key="it" count="6661"/>
</words></syntaxhighlight>
 
=={{header|zkl}}==
<langsyntaxhighlight lang="zkl">fname,count := vm.arglist; // grab cammand line args
 
// words may have leading or trailing "_", ie "the" and "_the"
RegExp("[a-z]+").pump.fp1(Dictionary().incV)) // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();</syntaxhighlight>
{{out}}
<pre>
</pre>