Word frequency: Difference between revisions

Line 236:

This REXX version doesn't need to sort the list of words.

Currently, this version ~~treats~~ accented (non-Latin) ~~letters~~ ~~as non-~~letters. ~~ ~~ ~~Additional~~ ~~support~~ of ~~accented~~ ~~letters~~ is ~~waiting~~ ~~for~~ ~~clarification~~ ~~from~~ ~~the~~ ~~task's~~ ~~author.~~

Currently, this version recognizes all the accented (non-Latin) accented letters that are present in the text that is specified to be used   (and some other non-Latin letters as well).

<lang rexx>/*REXX program reads and displays a count of words a file. Word case is ignored.*/

This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>)     [that is, within a word, but not those words that start or end with an apostrophe, for those words, the apostrophe is elided).

Thus,   ''' it's '''   is counted separately from   '''it'''   or   ''' its'''.

<lang rexx>/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/

parse arg fID top . /*obtain optional arguments from the CL*/

if fID=='' | fID=="," then fID= 'les_mes.TXT' /*None specified? Then use the default.*/

if top=='' | top=="," then top= 10 /* " " " " " " */

c=0; @.=0 /*initialize word list~~; word~~ count. */

c=0; @.=0; abcL="abcdefghijklmnopqrstuvwxyz'" /*initialize word list, count; alphabet*/

!.= /* ~~" the original word instance~~*/

q= "'"; abcU= abcL; upper abcU /*define uppercase version of alphabet*/

do ~~#=1~~ ~~while~~ ~~lines(fID)\==0~~ ~~/*loop~~ ~~whilst~~ ~~there~~ ~~are~~ ~~lines~~ in ~~file.~~ */

accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /* " " of some accented chrs*/

accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " lowercase accented characters.*/

⚫

y=space( ~~linein~~(~~fID) )~~ /*remove superfluous blanks in the line*/

$= /*$: is a ~~list~~ of ~~words~~ ~~in this~~ ~~line.~~ */

accG= 'αßΓπΣσµτΦΘΩδφε' /* " some lower/upper Greek letters*/

do ~~j=1~~ ~~for~~ ~~length(y);~~ ~~_=substr(y,j,1)~~ /*~~obtain~~ a ~~character~~ of ~~the~~ ~~word~~ ~~found~~.*/

a=abcL || abcL ||accL ||accL || accG /* " char string of after letters.*/

~~if datatype~~(~~_, 'M'~~) ~~then~~ ~~$=$ || _ /*Is~~ it a ~~letter?~~ ~~Append~~ to $. */

b=abcL || abcU ||accL ||accU || accG || xrange() /* " char string of before " */

x= 'Çà åå çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü' /*list of 16-bit chars.*/

⚫

~~else~~ ~~$=$~~ || ' ' /*Is it ~~not~~ a ~~letter?~~ ~~Append~~ ~~blank~~. */

xs= words(d) /*num. " " " */

⚫

end /*j*/

~~$=strip($)~~ /*~~strip~~ ~~any~~ ~~leading~~ ~~and~~ ~~trailing~~ ~~blanks~~*/

!.= /* " the original word instances. */

do while $\=''; ~~parse var~~ $ ~~z $~~ /*~~now,~~ ~~process~~ ~~each~~ ~~word~~ in ~~the~~ ~~$ list.~~*/

do #=1 while lines(fID)\==0; $=linein(fID) /*loop whilst there are lines in file. */

oz=~~z; upper z~~ /*~~obtain~~ an ~~uppercase version of word.~~ */

if pos('├', $)\==0 then do k=1 for xs; _=word(x, k) /*any 16-bit chars? */

$=changestr('├'left(_, 1), $, right(_, 1) ) /*convert.*/

⚫

end /*k*/

⚫

$=space( translate( $, a, b) ) /*remove superfluous blanks in the line*/

do while $\=''; parse var $ z $ /*now, process each word in the $ list.*/

if left(z, 1)==q then z=substr(z, 2) /*starts with an apostrophe?*/

if right(z, 1)==q then z=left(z, length(z) - 1) /*ends " " " */

if z='' then iterate

if @.z==0 then do; c=c+1; !.c=z; end /*bump word count; assign word to array*/

@@.z=oz /*save the original case of the word. */

@@.z=z /*save the original case of the word. */

@.z=@.z + 1 /*bump the count of occurrences of word*/

end /*while*/

end /*#*/

say right('word',40) " " center(' rank ',6) " count " /*display a title for output*/

say right('word', 40) " " center(' rank ', 6) " count " /*display title for output*/

say right('════',40) " " center('══════',6) "═══════" /* " a title separator.*/

say right('════', 40) " " center('══════', 6) "═══════" /* " title separator.*/

do tops=1 by 0 until otops==tops|tops>top /*process enough words to satisfy TOP.*/

Line 267:

Line 278:

z=!.n /*get the name of the capitalized word.*/

if count==mc then tl=tl z /*handle cases of tied number of words.*/

if count>mc then do; mc=count /*this word count is the current max. */

if count> mc then do; mc=count /*this word count is the current max. */

tl=z /* " word " " " " */

end

Line 275:

Line 286:

do d=1 for words(tl); _=word(tl, d)

if d==1 then w=max(8, length(@._)) /*use the length of the first word used*/

say right(@@._, 40 ) right(tops, wr) right(@._, w)

say right(@@._, 40) right(commas(tops), wr) right(commas(@._), w)

@._=0 /*nullify this word count for next time*/

end /*d*/

tops=tops + words(tl) /*correctly handle the tied rankings. */

end /*tops*/ ~~/*stick a fork in it, we're all done. */</lang>~~

end /*tops*/

⚫

exit /*stick a fork in it, we're all done. */

/*──────────────────────────────────────────────────────────────────────────────────────*/

commas: procedure; parse arg _; n=_'.9'; #=123456789; b=verify(n, #, "M")

e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4

do j=e to b by -3; _=insert(',', _, j); end /*j*/; return _</lang>

This output agrees with '''UNIX Shell'''.

<pre>

word rank count

════ ══════ ═══════

the 1 ~~41089~~

the 1 41,088

of 2 ~~19949~~

of 2 19,949

and 3 ~~14942~~

and 3 14,942

a 4 ~~14608~~

a 4 14,595

to 5 ~~13951~~

to 5 13,950

in 6 ~~11214~~

in 6 11,214

he 7 ~~9648~~

he 7 9,607

was 8 ~~8621~~

was 8 8,620

that 9 ~~7924~~

that 9 7,826

it 10 ~~6661~~

it 10 6,535

</pre>

To see a list of the top 5,000 words that show (among other things) words like   '''it's'''   and other accented words, see the discussion page. <br><br>

===version 2===