Word frequency: Difference between revisions
→{{header|zkl}}: rewrite
→version 1: added support for most accented letters and for words that contain an apostrophe, optimized the reading of the file to support non-Latin letters, added verbiage to the REXX section header and output section.
Line 236:
This REXX version doesn't need to sort the list of words.
Currently, this version
This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>), that is, apostrophes within a word. For words that start or end with an apostrophe, that leading or trailing apostrophe is elided.
Thus, '''it's''' is counted separately from '''it''' or '''its'''.
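The apostrophe rule described above can be sketched in isolation as follows. This is a minimal illustration rather than the program's actual code path: it assumes that <code>q</code> holds a single apostrophe (its definition is not shown in this diff), and the sample text and the <code>out</code> variable are hypothetical.
<lang rexx>/*REXX sketch: elide a leading or trailing apostrophe, keep embedded apostrophes.       */
q= "'"                                           /*assumption:  Q  is a single apostrophe.*/
$= "it's 'tis its' it"                           /*hypothetical sample text.              */
out=                                             /*hypothetical list of the cleaned words.*/
   do while $\=''; parse var $ z $               /*process each word in the   $   list.   */
   if  left(z, 1)==q  then z= substr(z, 2)       /*starts with an apostrophe?  Elide it.  */
   if right(z, 1)==q  then z= left(z, length(z) - 1)      /*ends    "    "     "    "     */
   if z==''  then iterate                        /*nothing left of the word?  Skip it.    */
   out= out z                                    /*append the cleaned word to the list.   */
   end   /*while*/
say strip(out)                                   /*displays:   it's tis its it            */</lang>
The embedded apostrophe in '''it's''' survives, so that word keeps its own count, while '''its'''' (with a trailing apostrophe) is folded into '''its'''.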
<lang rexx>/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.TXT' /*None specified? Then use the default.*/
if top=='' | top=="," then top= 10 /* " " " " " " */
c=0;
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " lowercase accented characters.*/
y=space( linein(fID) ) /*remove superfluous blanks in the line*/
a=abcL || abcL ||accL ||accL || accG
b=abcL || abcU ||accL ||accU || accG ||
x= 'Çà åå çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü' /*list of 16-bit chars.*/
else $=$ || ' ' /*Is it not a letter? Append blank. */
xs= words(d) /*num. " " " */
end /*j*/
!.=
do #=1
if pos('├',
$=changestr('├'left(_, 1), $, right(_, 1) ) /*convert.*/
do while $\=''; parse var $ z $ /*now, process each word in the $ list.*/
if left(z, 1)==q then z=substr(z, 2) /*starts with an apostrophe?*/
if right(z, 1)==q then z=left(z, length(z) - 1) /*ends " " " */
if z='' then iterate
if @.z==0 then do; c=c+1; !.c=z; end /*bump word count; assign word to array*/
@@.z=
@.z=@.z + 1 /*bump the count of occurrences of word*/
end /*while*/
end /*#*/
say right('word', 40) " " center(' rank ', 6) " count " /*display
say right('════', 40) " " center('══════', 6) "═══════" /* "
do tops=1 by 0 until otops==tops|tops>top /*process enough words to satisfy TOP.*/
Line 267 ⟶ 278:
z=!.n /*get the name of the capitalized word.*/
if count==mc then tl=tl z /*handle cases of tied number of words.*/
if count>
tl=z /* " word " " " " */
end
Line 275 ⟶ 286:
do d=1 for words(tl); _=word(tl, d)
if d==1 then w=max(8, length(@._)) /*use the length of the first word used*/
say right(@@._, 40
@._=0 /*nullify this word count for next time*/
end /*d*/
tops=tops + words(tl) /*correctly handle the tied rankings. */
end /*tops*/
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: procedure; parse arg _; n=_'.9'; #=123456789; b=verify(n, #, "M")
e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4
do j=e to b by -3; _=insert(',', _, j); end /*j*/; return _</lang>
{{out|output|text= when using the default inputs:}}
<pre>
                                    word    rank    count
                                    ════   ══════ ═══════
                                     the     1
                                      of     2
                                     and     3
                                       a     4
                                      to     5
                                      in     6
                                      he     7
                                     was     8
                                    that     9
                                      it     10     6,535
</pre>
To see a list of the top 5,000 words, which shows (among other things) words like '''it's''' as well as accented words, see the discussion page. <br><br>
===version 2===