=={{header|11l}}==
<langsyntaxhighlight lang="11l">DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])</syntaxhighlight>
 
{{out}}
=={{header|Ada}}==
{{works with|Ada|Ada|2012}}
 
<syntaxhighlight lang="ada">with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
end loop;
end Word_Frequency;
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|ALGOL 68}}==
{{works with|ALGOL 68G|Any - tested with release 2.8.3.win32}}
Uses the associative array implementations in [[ALGOL_68/prelude]].
<langsyntaxhighlight lang="algol68"># find the n most common words in a file #
# use the associative array in the Associate array/iteration task #
# but with integer values #
print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
OD
FI</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|APL}}==
{{works with|GNU APL}}
 
<syntaxhighlight lang="apl">
⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
the of and a to
41042 19952 14938 14526 13942
</syntaxhighlight>
 
=={{header|AppleScript}}==
 
<langsyntaxhighlight lang="applescript">(*
For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
The set text is too messy to warrant faffing around with anything more sophisticated.
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)</syntaxhighlight>
 
{{output}}
<langsyntaxhighlight lang="applescript">"The 10 most frequently occurring words in the file are:
The: 41092
Of: 19954
Was: 8622
That: 7924
It: 6661"</langsyntaxhighlight>
 
=={{header|Arturo}}==
 
<langsyntaxhighlight lang="rebol">findFrequency: function [file, count][
freqs: #[]
r: {/[[:alpha:]]+/}
loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
print pair
]</syntaxhighlight>
 
{{out}}
 
=={{header|AutoHotkey}}==
<syntaxhighlight lang="autohotkey">URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete, % A_temp "\tempfile.txt"
}
MsgBox % "Freq`tWord`n" result
return</syntaxhighlight>
Outputs:<pre>Freq Word
41036 The
</pre>
 
=={{header|AWK}}==
<syntaxhighlight lang="awk">
# syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
#
exit(0)
}
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|BASIC}}==
==={{header|QB64}}===
This is rather long code. It fulfills the task requirement with QB64. It "cleans" each word, treating as a word anything that begins and ends with a letter. It works with arrays. The speed at which QB64 does this job on a file as big as Les Miserables.txt is amazing.
<syntaxhighlight lang="qbasic">
OPTION _EXPLICIT
 
 
END SUB
</syntaxhighlight>
 
{{output}}
==={{header|BaCon}}===
All punctuation, digits, tabs and carriage returns are removed, so "This", "this" and "this." are counted as the same word. There is full support for UTF8 characters in words. The code itself could be smaller, but for the sake of clarity everything has been written out explicitly.
<langsyntaxhighlight lang="bacon">' We do not count superfluous spaces as words
OPTION COLLAPSE TRUE
 
FOR i = 0 TO 9
PRINT term$[i], " : ", frequency(term$[i])
NEXT</syntaxhighlight>
{{output}}
<pre>
</pre>

=={{header|Batch File}}==
You could cut the length of this down drastically if you didn't need to be able to recall the word at the nth position and only wished to display the top 10 words.
 
<langsyntaxhighlight lang="dos">
@echo off
 
goto:eof
</syntaxhighlight>
 
 
=={{header|Bracmat}}==
 
 
<langsyntaxhighlight lang="bracmat"> ( 10-most-frequent-words
= MergeSort { Local variable declarations. }
types
& !most-frequent-words { Return the last 10 terms. }
)
& out$(10-most-frequent-words$"135-0.txt") { Call 10-most-frequent-words with name of input file and print result to screen. }</syntaxhighlight>
'''Output'''
<pre> (6661.it)
</pre>

=={{header|C}}==
{{libheader|GLib}}
Words are defined by the regular expression "\w+".
<langsyntaxhighlight lang="c">#include <stdbool.h>
#include <stdio.h>
#include <glib.h>
return EXIT_FAILURE;
return EXIT_SUCCESS;
}</syntaxhighlight>
 
{{out}}
=={{header|C sharp|C#}}==
{{trans|D}}
<langsyntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>
 
=={{header|C++}}==
<langsyntaxhighlight lang="cpp">#include <algorithm>
#include <cstdlib>
#include <fstream>
return 0;
}
</syntaxhighlight>
 
{{out}}
===Alternative===
{{trans|C#}}
<langsyntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <fstream>
 
return 0;
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
9 he 6814
10 had 6139</pre>
 
===C++20===
{{trans|C#}}
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <format>
#include <fstream>
#include <map>
#include <ranges>
#include <regex>
#include <string>
#include <vector>
 
int main() {
std::ifstream in("135-0.txt");
std::string text{
std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}
};
in.close();
 
std::regex word_rx("\\w+");
std::map<std::string, int> freq;
for (const auto& a : std::ranges::subrange(
std::sregex_iterator{ text.cbegin(),text.cend(), word_rx }, std::sregex_iterator{}
))
{
auto word = a.str();
transform(word.begin(), word.end(), word.begin(), ::tolower);
freq[word]++;
}
 
std::vector<std::pair<std::string, int>> pairs;
for (const auto& elem : freq)
{
pairs.push_back(elem);
}
 
std::ranges::sort(pairs, std::ranges::greater{}, &std::pair<std::string, int>::second);
 
std::cout << "Rank Word Frequency\n"
"==== ==== =========\n";
for (int rank=1; const auto& [word, count] : pairs | std::views::take(10))
{
std::cout << std::format("{:2} {:>4} {:5}\n", rank++, word, count);
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 41043
2 of 19952
3 and 14938
4 a 14539
5 to 13942
6 in 11208
7 he 9646
8 was 8620
9 that 7922
10 it 6659</pre>
 
=={{header|Clojure}}==
<langsyntaxhighlight lang="clojure">(defn count-words [file n]
(->> file
slurp
frequencies
(sort-by val >)
(take n)))</syntaxhighlight>
 
{{Out}}
 
=={{header|COBOL}}==
<syntaxhighlight lang="cobol">
IDENTIFICATION DIVISION.
PROGRAM-ID. WordFrequency.
CLOSE Word-File Output-File.
END-PROGRAM.
</syntaxhighlight>
 
{{Out}}
 
=={{header|Common Lisp}}==
<langsyntaxhighlight lang="lisp">
(defun count-word (n pathname)
(with-open-file (s pathname :direction :input)
(dolist (word words) (incf (gethash word hash 0)))
(maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)
</syntaxhighlight>
 
{{Out}}
 
=={{header|Crystal}}==
<langsyntaxhighlight lang="ruby">require "http/client"
require "regex"
 
.sort { |a, b| b[1] <=> a[1] }[0..9] # sort and get the first 10 elements
.each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result
</syntaxhighlight>
 
{{out}}
 
=={{header|D}}==
<syntaxhighlight lang="d">import std.algorithm : sort;
import std.array : appender, split;
import std.range : take;
writefln("%4s %-10s %9s", rank++, word.k, word.v);
}
}</syntaxhighlight>
 
{{out}}
=={{header|Delphi}}==
{{libheader| System.RegularExpressions}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Word_frequency;
 
readln;
end.
</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|F Sharp}}==
<langsyntaxhighlight lang="fsharp">
open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)
</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Factor}}==
This program expects stdin to read from a file via the command line (e.g. invoking the program in Windows: <tt>>factor word-count.factor < input.txt</tt>). The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.
<langsyntaxhighlight lang="factor">
USING: ascii io math.statistics prettyprint sequences
splitting ;
lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .
</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|FreeBASIC}}==
<langsyntaxhighlight lang="freebasic">
#Include "file.bi"
type tally
print "time for operation ";timer-tm;" seconds"
sleep
</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|Frink}}==
There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:
<langsyntaxhighlight lang="frink">d = new dict
for w = select[wordList[read[normalizeUnicode["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
d.increment[lc[w], 1]
 
println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]</syntaxhighlight>
 
{{out}}
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and still performs full Unicode-aware normalization, capitalization, and word-breaking:
 
<langsyntaxhighlight lang="frink">formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</langsyntaxhighlight>
 
{{out}}
=={{header|FutureBasic}}==
Task said: "Feel free to explicitly state the thoughts behind the program decisions." Thus the heavy comments.
<langsyntaxhighlight lang="futurebasic">
 
include "NSLog.incl"
 
CFDictionaryRef dict
 
// Depending on the value of the caseSensitive Boolean function parameter above, lowercase the incoming text
if caseSensitive == NO then textStr = fn StringLowercaseString( textStr )
 
// Trim non-alphabetic characters from string, and separate individual words with a space
CFStringRef tempStr = fn ArrayComponentsJoinedByString( fn StringComponentsSeparatedByCharactersInSet( textStr, fn CharacterSetInvertedSet( fn CharacterSetLetterSet ) ), @" " )
 
CountedSetRef freqencies = fn CountedSetWithArray( tempArr )
 
// Enumerate each word-frequency pair in the counted set...
EnumeratorRef enumRef = fn CountedSetObjectEnumerator( freqencies )
 
CFMutableArrayRef wordArr = fn MutableArrayWithCapacity( 0 )
 
// Create word counter
NSInteger totalWords = 0
// Enumerate each unique word, get its frequency, create its own key/value pair dictionary, add each dictionary into master array
for wrd in array
totalWords++
next
 
// Create an immutable output string from the mutable string
CFStringRef resultStr = fn StringWithFormat( @"%@", mutStr )
end fn = resultStr
 
HandleEvents
</syntaxhighlight>
{{output}}
<pre>
1 41095 the
22910 1 isabella
 
Total unique words in document: 22910
Elapsed time: 595.407963 milliseconds.
</pre>
=={{header|Go}}==
{{trans|Kotlin}}
<langsyntaxhighlight lang="go">package main
 
import (
fmt.Printf("%2d %-4s %5d\n", rank, word, freq)
}
}</syntaxhighlight>
 
{{out}}
=={{header|Groovy}}==
Solution:
<langsyntaxhighlight lang="groovy">def topWordCounts = { String content, int n ->
def mapCounts = [:]
content.toLowerCase().split(/\W+/).each {
println "Rank Word Frequency\n==== ==== ========="
(0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
}</syntaxhighlight>
 
Test:
<langsyntaxhighlight lang="groovy">def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)</langsyntaxhighlight>
 
Output:
=={{header|Haskell}}==
===Lazy IO with pure Map, arrows===
{{trans|Clojure}}
<syntaxhighlight lang="haskell">module Main where
 
import Control.Category -- (>>>)
>>> take n
>>> print)
when filep (hClose hand)</syntaxhighlight>
{{Out}}
<pre>
</pre>
===Lazy IO, map of IORefs===
Using IORefs as values in the map seems to give a ~2x speedup on large files. The below code is based on https://github.com/composewell/streamly-examples/blob/master/examples/WordFrequency.hs , but still using lazy IO to avoid the extra library dependency (in production you should [https://stackoverflow.com/questions/5892653/whats-so-bad-about-lazy-i-o use a streaming library] like streamly/conduit/io-streams):
<langsyntaxhighlight lang="haskell">
module Main where
 
in mapM readRef $ M.toList freqtable
print $ take maxw $ sortOn (Down . snd) counts
</syntaxhighlight>
{{Out}}
<pre>
</pre>
===Lazy IO, short code, but not streaming===
Or, perhaps a little more simply, though not streaming (will read everything into memory, don't use on big files):
<langsyntaxhighlight lang="haskell">import qualified Data.Text.IO as T
import qualified Data.Text as T
 
 
main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)</syntaxhighlight>
{{Out}}
<pre>(40370,"the")
</pre>
 
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to examine the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the values.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
 
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
Pattern pattern = Pattern.compile("(\\w+)");
Matcher matcher;
String line;
String word;
Map<String, Integer> map = new HashMap<>();
while ((line = reader.readLine()) != null) {
matcher = pattern.matcher(line);
while (matcher.find()) {
word = matcher.group().toLowerCase();
if (map.containsKey(word)) {
map.put(word, map.get(word) + 1);
} else {
map.put(word, 1);
}
}
}
/* print out top 10 */
List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
list.sort(Map.Entry.comparingByValue());
Collections.reverse(list);
int count = 1;
for (Map.Entry<String, Integer> value : list) {
System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
if (count++ == 10) break;
}
}
}
</syntaxhighlight>
<pre>
the 41,043
of 19,952
and 14,938
a 14,539
to 13,942
in 11,208
he 9,646
was 8,620
that 7,922
it 6,659
</pre>
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>

=={{header|jq}}==
For this entry, a word may contain internal hyphens but may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.
 
<syntaxhighlight lang="jq">
< 135-0.txt jq -nR --argjson n 10 '
def bow(stream):
| from_entries
'
</syntaxhighlight>
====Output====
<syntaxhighlight lang="jq">
{
"the": 41087,
"it": 6661
}
</syntaxhighlight>
 
=={{header|Julia}}==
{{works with|Julia|1.0}}
<langsyntaxhighlight lang="julia">
using FreqTables
 
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])</syntaxhighlight>
 
{{out}}
"he" │ 6816
"had" │ 6140</pre>
 
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
("of";19898)
("and";14658)
(,"a";14517)
("to";13695)
("in";11134)
("he";9405)
("was";8361)
("that";7592)
("his";6446))</syntaxhighlight>
 
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
 
=={{header|KAP}}==
The program below defines the function 'stats', which accepts the name of a file containing the text.
 
<langsyntaxhighlight lang="kap">∇ stats (file) {
content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
sorted ← (⍋⊇⊢) content
words ← selection / sorted
{⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
}</syntaxhighlight>
{{out}}
<pre>┏━━━━━━━━━━━━┓
</pre>

=={{header|Kotlin}}==
 
There is no change in the results if the numerals 0-9 are also regarded as letters.
<langsyntaxhighlight lang="scala">// version 1.1.3
 
import java.io.File
for ((word, freq) in wordGroups)
System.out.printf("%2d %-4s %5d\n", rank++, word, freq)
}</syntaxhighlight>
 
{{out}}
 
=={{header|Liberty BASIC}}==
<langsyntaxhighlight lang="lb">dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
dim lines$(150000)
open "135-0.txt" for input as #txt
close #txt
end
</syntaxhighlight>
{{out}}
<pre>Count Word
</pre>
=={{header|Lua}}==
{{works with|lua|5.3}}
<langsyntaxhighlight lang="lua">
-- This program takes two optional command line arguments. The first (arg[1])
-- specifies the input file, or defaults to standard input. The second
io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
end
</syntaxhighlight>
 
{{Out}}
 
=={{header|Mathematica}} / {{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|MATLAB}} / {{header|Octave}}==
<syntaxhighlight lang="matlab">
function [result,count] = word_frequency()
URL='https://www.gutenberg.org/files/135/135-0.txt';
fprintf(1,'%d\t%s\n',count(k),result{k})
end
</syntaxhighlight>
 
{{out}}
 
=={{header|Nim}}==
<syntaxhighlight lang="nim">import tables, strutils, sequtils, httpclient
 
proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]
wordFrequencies.sort
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
echo alignLeft($count, 8), word</syntaxhighlight>
{{out}}
<pre>40377 the
</pre>
 
=={{header|Objeck}}==
<langsyntaxhighlight lang="objeck">use System.IO.File;
use Collection;
use RegEx;
};
}
}</syntaxhighlight>
 
Output:
=={{header|OCaml}}==
 
<langsyntaxhighlight lang="ocaml">let () =
let n =
try int_of_string Sys.argv.(1)
List.iter (fun (word, count) ->
Printf.printf "%d %s\n" count word
) r</syntaxhighlight>
 
{{out}}
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang ="perl">$top =use 10strict;
use warnings;
use utf8;
 
my $top = 10;
open $fh, "<", '135-0.txt';
($text = join '', <$fh>) =~ tr/A-Z/a-z/
or die "Can't open '135-0.txt': $!\n";
 
open my $fh, '<', 'ref/word-count.txt';
@matcher = (
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;
 
my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
);
 
for my $reg (@matcher) {
print "\nTop $top using regex: " . $reg . "\n";
my @matches = $text =~ /$reg/g;
my %words;
for my $w (@matches) { $words{$w}++ };
my $c = 0;
for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
printf "%-7s %6d\n", $w, $words{$w};
last if ++$c >= $top;
}
}</syntaxhighlight>
 
{{out}}
<pre>
Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
was 8621
that 7924
it 6661
</pre>
 
=={{header|Phix}}==
<!--<syntaxhighlight lang="phix">(notonline)-->
<span style="color: #008080;">without</span> <span style="color: #008080;">javascript_semantics</span>
<span style="color: #0000FF;">?</span><span style="color: #008000;">"loading..."</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #7060A8;">traverse_dict</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">routine_id</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"visitor"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #000000;">wf</span><span style="color: #0000FF;">,</span><span style="color: #004600;">true</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
</pre>
 
=={{header|Phixmonti}}==
<syntaxhighlight lang="phixmonti">include ..\Utilitys.pmt
 
"loading..." ?
-1 * get ?
endfor
drop</syntaxhighlight>
{{out}}
<pre>loading...
</pre>
 
=={{header|PHP}}==
<langsyntaxhighlight lang="php">
<?php
 
}
$i++;
}</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Picat}}==
To get the book proper, the header and footer are removed. Here are some tests with different sets of characters to split the words (<code>split_char/1</code>).
<syntaxhighlight lang="picat">main =>
NTop = 10,
File = "les_miserables.txt",
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</syntaxhighlight>
 
{{out}}
 
=={{header|PicoLisp}}==
<syntaxhighlight lang="picolisp">(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))
 
(if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
(println L (val L)) )</syntaxhighlight>
{{out}}
<pre>
</pre>
=={{header|Prolog}}==
{{works with|SWI Prolog}}
<langsyntaxhighlight lang="prolog">print_top_words(File, N):-
read_file_to_string(File, String, [encoding(utf8)]),
re_split("\\w+", String, Words),
 
main:-
print_top_words("135-0.txt", 10).</langsyntaxhighlight>
 
{{out}}
 
=={{header|PureBasic}}==
<syntaxhighlight lang="purebasic">EnableExplicit
 
Structure wordcount
EndIf
 
End</syntaxhighlight>
{{out}}
<pre>
</pre>

=={{header|Python}}==
===Collections===
====Python2.7====
<langsyntaxhighlight lang="python">import collections
import re
import string
 
if __name__ == "__main__":
main()</syntaxhighlight>
 
{{Out}}
 
====Python3.6====
<langsyntaxhighlight lang="python">from collections import Counter
from re import findall
 
if __name__ == "__main__":
n = int(input('How many?: '))
most_common_words_in_file(les_mis_file, n)</syntaxhighlight>
 
{{Out}}
===Sorted and groupby===
{{Works with|Python|3.7}}
<langsyntaxhighlight lang="python">"""
Word count task from Rosetta Code
http://www.rosettacode.org/wiki/Word_count#Python
if __name__ == '__main__':
main()
</syntaxhighlight>
{{Out}}
<pre>('the', 40372)
Line 3,790 ⟶ 3,933:
 
===Collections, Sorted and Lambda===
<langsyntaxhighlight lang="python">
#!/usr/bin/python3
import collections
if i == count - 1:
break
</syntaxhighlight>
{{Out}}
<pre>[ 1] the : 41039
Line 3,824 ⟶ 3,967:
 
=={{header|R}}==
===Version 1===
I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.
<syntaxhighlight lang="r">
wordcount<-function(file,n){
punctuation=c("`","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
return(df[1:n,])
}
</syntaxhighlight>
{{Out}}
<pre>
9 it 2308
10 i 1845
</pre>
 
===Version 2===
This version is purely functional using the native pipe operator in R 4.1+ and runs in less than a second.
<syntaxhighlight lang="r">
word_frequency_pipeline <- function(file=NULL, n=10) {
file |>
vroom::vroom_lines() |>
stringi::stri_split_boundaries(type="word", skip_word_none=T, skip_word_number=T) |>
unlist() |>
tolower() |>
table() |>
sort(decreasing = T) |>
(\(.) .[1:n])() |>
data.frame()
}
</syntaxhighlight>
{{Out}}
<pre>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
</pre>
 
=={{header|Racket}}==
<langsyntaxhighlight lang="racket">#lang racket
 
(define (all-words f (case-fold string-downcase))
 
(module+ main
(take (counts (all-words "data/les-mis.txt")) 10))</syntaxhighlight>
 
{{out}}
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2022.07}}
Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.
 
This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/, but \w includes underscore, which is not a letter but a punctuation connector; and this text is '''full''' of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.
 
We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various [[wp:diacritic|diacritic]]s. Those '''are''' letters, so words will be incorrectly split up (Misérables will be counted as 'mis' and 'rables', probably not what we want).
 
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the".
 
Here is a sample that shows the result when using various different matchers.
<syntaxhighlight lang="raku" perl6line>sub MAIN ($filename, UInt $top = 10) {
my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
my @matcher = (
rx/ <[a..z]>+ /, # simple 7-bit ASCII
rx/ \w+ /, # word characters with underscore
rx/ <[\w]-[_]>+ /, # word characters without underscore
rx/ <[\w]-[_]>+ [ ["'"|'-'|"'-"] <[\w]-[_]>+ ]* / # word characters without underscore but with hyphens and contractions
);
for @matcher -> $reg {
say "\nTop $top using regex: ", $reg.raku;
my @words = $file.comb( $reg ).Bag.sort(-*.value)[^$top];
my $length = max @words».key».chars;
printf "%-{$length}s %d\n", .key, .value for @words;
}
}</syntaxhighlight>
 
{{out}}
=={{header|REXX}}==
Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to
support the accented letters in the mandated input file.
<langsyntaxhighlight lang="rexx">/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.txt' /*None specified? Then use the default.*/
end /*#*/
say commas(totW) ' words found ('commas(c) "unique) in " commas(#),
' records read from file: ' fID; say; return</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
<pre>
</pre>

===Version 2===
Inspired by version 1 and adapted for ooRexx.
It ignores all characters other than a-z and A-Z (which are translated to a-z).
<syntaxhighlight lang="text">/*REXX program reads and displays a count of words a file. Word case is ignored.*/
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
tops=tops+words(tl) /*correctly handle the tied rankings. */
end
Say time('E') 'seconds elapsed'</syntaxhighlight>
{{out}}
<pre>We found 22820 different words
</pre>
 
=={{header|Ring}}==
<langsyntaxhighlight lang="ring">
# project : Word count
 
b = temp
return [a, b]
</syntaxhighlight>
Output:
<pre>
</pre>
 
=={{header|Ruby}}==
<langsyntaxhighlight lang="ruby">
class String
def wc
 
open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }
</syntaxhighlight>
{{out}}
<pre>
</pre>
===Tally and max_by===
{{Works with|Ruby|2.7}}
<langsyntaxhighlight lang="ruby">RE = /[[:alpha:]]+/
count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }
</syntaxhighlight>
{{out}}
<pre>the->41092
</pre>
===Chain of Enumerables===
<langsyntaxhighlight lang="ruby">wf = File.read("135-0.txt", :encoding => "UTF-8")
.downcase
.scan(/\w+/)
w[1]
}
</syntaxhighlight>
{{out}}
<pre>[ 1] the : 41040
 
=={{header|Rust}}==
<syntaxhighlight lang="rust">use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
fn main() {
word_count(File::open("135-0.txt").expect("File open error"), 10)
}</syntaxhighlight>
 
{{out}}
=={{header|Scala}}==
{{Out}}
Best seen running in your browser [https://scastie.scala-lang.org/EP2Fm6HXQrC1DwtSNvnUzQ Scastie (remote JVM)].
<syntaxhighlight lang="scala">import scala.io.Source
 
object WordCount extends App {
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
 
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
</pre>

=={{header|Seed7}}==
to get words from a file. The words are [http://seed7.sourceforge.net/libraries/string.htm#lower(in_string) converted to lower case], to ensure that "The" and "the" are considered the same.
 
<langsyntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "gethttp.s7i";
include "strifile.s7i";
end for;
end for;
end func;</syntaxhighlight>
 
{{out}}
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')
 
top.each { |pair|
say "#{pair.key}\t-> #{pair.value}"
}</syntaxhighlight>
{{out}}
<pre>
</pre>
 
=={{header|Simula}}==
<langsyntaxhighlight lang="simula">COMMENT COMPILE WITH
$ cim -m64 word-count.sim
;
 
END
</syntaxhighlight>
{{out}}
<pre>
Line 4,864 ⟶ 5,043:
6 garbage collection(s) in 0.2 seconds.
</pre>
 
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
 
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
 
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
 
=={{header|Swift}}==
<langsyntaxhighlight lang="swift">import Foundation
 
func printTopWords(path: String, count: Int) throws {
} catch {
print(error.localizedDescription)
}</syntaxhighlight>
 
{{out}}
 
=={{header|Tcl}}==
<syntaxhighlight lang="tcl">lassign $argv head
while { [gets stdin line] >= 0 } {
foreach word [regexp -all -inline {[A-Za-z]+} $line] {
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
puts "$count\t$word"
}</syntaxhighlight>
 
./wordcount-di.tcl 10 < 135-0.txt
=={{header|TMG}}==
McIlroy's Unix TMG:
<syntaxhighlight lang="unixtmg">/* Input format: N text */
/* Only lowercase letters can constitute a word in text. */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence. */
/* Character classes */
letter: <<abcdefghijklmnopqrstuvwxyz>>;
other: !<<abcdefghijklmnopqrstuvwxyz>>;</syntaxhighlight>
 
Unix TMG didn't have a <tt>tolower</tt> builtin. Therefore, you would use it together with <tt>tr</tt>:
<langsyntaxhighlight lang="bash">cat file | tr A-Z a-z > file1; ./a.out file1</langsyntaxhighlight>
 
Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e):
<langsyntaxhighlight lang="bash">cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</langsyntaxhighlight>
 
=={{header|Transd}}==
<syntaxhighlight lang="Scheme">#lang transd
 
MainModule: {
_start: (λ locals: cnt 0
(with fs FileStream() words String()
(open-r fs "/mnt/text/Literature/Miserables.txt")
(textin fs words)
 
(with v ( -|
(split (tolower words))
(group-by)
(regroup-by (λ v Vector<String>() -> Int() (size v))))
 
(for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i))
(+= cnt 1) (if (> cnt 10) break))
)))
}</syntaxhighlight>
{{out}}
<pre>
the: 40379
of: 19869
and: 14468
a: 14278
to: 13590
in: 11025
he: 9213
was: 8347
that: 7249
his: 6414
had: 6051
</pre>
 
=={{header|UNIX Shell}}==
{{works with|zsh}}
This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.
<langsyntaxhighlight lang="bash">#!/bin/sh
<"$1" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | head -n "$2"</langsyntaxhighlight>
 
 
This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.
 
<langsyntaxhighlight lang="zsh">curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q</langsyntaxhighlight>
 
{{Out}}
=={{header|VBA}}==
In order to use it, you have to adapt the PATHFILE Const.
 
<syntaxhighlight lang="vb">
Option Explicit
 
If d.Exists(Word) Then _
DisplayFrequencyOf = d(Word)
End Function</syntaxhighlight>
{{out}}
<pre>Words different in this book : 25884
</pre>

=={{header|Wren}}==
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.
 
Not very quick (runs in about 15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.
 
If the Go example is re-run today (17 February 2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 5 years ago.
<syntaxhighlight lang="wren">import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
 
var fileName = "135-0.txt"
Line 5,236 ⟶ 5,467:
var freq = keyVals[rank-1].value
Fmt.print("$2d $-4s $5d", rank, word, freq)
}</syntaxhighlight>
 
{{out}}
Line 5,256 ⟶ 5,487:
=={{header|XQuery}}==
 
<langsyntaxhighlight lang="xquery">let $maxentries := 10,
$uri := 'https://www.gutenberg.org/files/135/135-0.txt'
return
return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
}</words></syntaxhighlight>
{{out}}
<langsyntaxhighlight lang="xml"><words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
<word key="the" count="41092"/>
<word key="of" count="19954"/>
Line 5,288 ⟶ 5,519:
<word key="that" count="7924"/>
<word key="it" count="6661"/>
</words></syntaxhighlight>
 
=={{header|zkl}}==
<langsyntaxhighlight lang="zkl">fname,count := vm.arglist; // grab cammand line args
 
// words may have leading or trailing "_", ie "the" and "_the"
RegExp("[a-z]+").pump.fp1(Dictionary().incV)) // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();</syntaxhighlight>
{{out}}
<pre>
</pre>