
Word frequency: Difference between revisions

→‎version 1: added support for most accented letters, words that contain an apostrophe, optimized the reading of the file to support non-Latin letters, added verbiage to the REXX's section header and output section.
Line 236:
This REXX version doesn't need to sort the list of words.
 
Currently, this version recognizes all the accented (non-Latin) letters that are present in the text that is specified to be used (and some other non-Latin letters as well).
 
<lang rexx>/*REXX program reads and displays a count of words in a file. Word case is ignored.*/
This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>), &nbsp; that is, apostrophes within a word. For words that start or end with an apostrophe, that apostrophe is elided.
 
Thus, &nbsp; ''' it's ''' &nbsp; is counted separately from &nbsp; '''it''' &nbsp; or &nbsp; ''' its'''.
<lang rexx>/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.TXT' /*None specified? Then use the default.*/
if top=='' | top=="," then top= 10 /* " " " " " " */
c=0; @.=0; abcL="abcdefghijklmnopqrstuvwxyz'" /*initialize word count, word list; alphabet*/
q= "'"; abcU= abcL; upper abcU /*define uppercase version of alphabet*/
accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /* " " " of some accented chrs*/
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " lowercase accented characters.*/
accG= 'αßΓπΣσµτΦΘΩδφε' /* " some lower/upper Greek letters*/
a=abcL || abcL ||accL ||accL || accG /* " char string of "after" letters.*/
b=abcL || abcU ||accL ||accU || accG || xrange() /* " char string of "before" letters.*/
x= 'Çà åå çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü' /*list of 16-bit chars.*/
xs= words(x) /*number of 16-bit char pairs in X. */
!.= /*initialize the original word instances.*/
do #=1 while lines(fID)\==0; $=linein(fID) /*loop whilst there are lines in file. */
   if pos('├', $)\==0 then do k=1 for xs; _=word(x, k) /*any 16-bit chars? */
                           $=changestr('├'left(_, 1), $, right(_, 1) ) /*convert.*/
                           end /*k*/
   $=space( translate( $, a, b) ) /*remove superfluous blanks in the line*/
      do while $\=''; parse var $ z $ /*now, process each word in the $ list.*/
      if left(z, 1)==q then z=substr(z, 2) /*starts with an apostrophe?*/
      if right(z, 1)==q then z=left(z, length(z) - 1) /*ends " " " */
      if z=='' then iterate /*nothing left of the word? Skip it. */
      if @.z==0 then do; c=c+1; !.c=z; end /*bump word count; assign word to array*/
      @@.z=z /*save the (lowercased) form of the word*/
      @.z=@.z + 1 /*bump the count of occurrences of word*/
      end /*while*/
   end /*#*/
say right('word', 40) " " center(' rank ', 6) " count " /*display a title for output*/
say right('════', 40) " " center('══════', 6) "═══════" /* " a title separator.*/
 
do tops=1 by 0 until otops==tops|tops>top /*process enough words to satisfy TOP.*/
Line 267 ⟶ 278:
z=!.n /*get the name of the capitalized word.*/
if count==mc then tl=tl z /*handle cases of tied number of words.*/
if count>mc then do; mc=count /*this word count is the current max. */
tl=z /* " word " " " " */
end
Line 275 ⟶ 286:
do d=1 for words(tl); _=word(tl, d)
if d==1 then w=max(8, length(@._)) /*use the length of the first word used*/
say right(@@._, 40 ) right(commas(tops), wr) right(commas(@._), w)
@._=0 /*nullify this word count for next time*/
end /*d*/
tops=tops + words(tl) /*correctly handle the tied rankings. */
end /*tops*/ /*stick a fork in it, we're all done. */</lang>
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: procedure; parse arg _; n=_'.9'; #=123456789; b=verify(n, #, "M")
e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4
do j=e to b by -3; _=insert(',', _, j); end /*j*/; return _</lang>
{{out|output|text=&nbsp; when using the default inputs:}}
 
This output agrees with '''UNIX Shell'''.
<pre>
                                    word    rank   count
                                    ════   ══════ ═══════
                                     the      1    41,088
                                      of      2    19,949
                                     and      3    14,942
                                       a      4    14,595
                                      to      5    13,950
                                      in      6    11,214
                                      he      7     9,607
                                     was      8     8,620
                                    that      9     7,826
                                      it     10     6,535
</pre>
To see a list of the top 5,000 words, which shows (among other things) words like &nbsp; '''it's''' &nbsp; as well as words with accented letters, see the discussion page. <br><br>
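The word-recognition rules described for version 1 (case is ignored, accented and other non-Latin letters count as letters, embedded apostrophes are kept, leading and trailing apostrophes are elided) can be illustrated with a minimal sketch. This is a Python analogue written for illustration only, not the REXX author's code: the names <code>tokenize</code> and <code>top_words</code> are hypothetical, it relies on Python's Unicode-aware <code>str.isalpha</code> instead of explicit translate tables, it strips ''all'' leading and trailing apostrophes (the REXX program removes one of each), and it does not reproduce the REXX program's handling of tied ranks.

```python
from collections import Counter


def tokenize(text):
    """Yield lowercased words; letters of any alphabet and apostrophes form a word."""
    # Assumption: lower-case first, then blank out every character that is
    # neither a letter nor an apostrophe (digits and punctuation split words).
    cleaned = "".join(
        ch if ch.isalpha() or ch == "'" else " " for ch in text.lower()
    )
    for word in cleaned.split():
        word = word.strip("'")  # elide leading/trailing apostrophes
        if word:                # skip tokens that were only apostrophes
            yield word


def top_words(text, top=10):
    """Return the `top` most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(top)


if __name__ == "__main__":
    sample = "It's 'its' it IT; Été 42"
    for word, count in top_words(sample, 5):
        print(f"{word:>10} {count:>6}")
```

With these rules, &nbsp; '''it's''' &nbsp; is counted separately from &nbsp; '''it''' &nbsp; and &nbsp; '''its''', &nbsp; while a quoted &nbsp; ''''its'''' &nbsp; is counted as &nbsp; '''its'''.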
 
===version 2===