Word frequency: Difference between revisions


Revision as of 05:33, 23 August 2017

1,746 bytes added

→‎version 1: added support for most accented letters, words that contain an apostrophe, optimized the reading of the file to support non-Latin letters, added verbiage to the REXX's section header and output section.


rosettacode>Gerard Schildberger

Revision as of 23:40, 22 August 2017 rosettacode>Craigd (→{{header|zkl}}: rewrite)		Revision as of 05:33, 23 August 2017 rosettacode>Gerard Schildberger
Line 236:
This REXX version doesn't need to sort the list of words. Currently, this version recognizes all the accented (non-Latin) letters that are present in the text that is specified to be used (and some other non-Latin letters as well).

This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>) [that is, within a word, but not those words that start or end with an apostrophe; for those words, the apostrophe is elided]. Thus, ''' it's ''' is counted separately from '''it''' or ''' its'''.
<lang rexx>/*REXX pgm displays top 10 words in a file (includes foreign letters),  case is ignored.*/
parse arg fID top .                              /*obtain optional arguments from the CL*/
if fID=='' | fID==","  then fID= 'les_mes.TXT'   /*None specified? Then use the default.*/
if top=='' | top==","  then top= 10              /*  "      "        "   "   "     "    */
c=0;  @.=0;  abcL="abcdefghijklmnopqrstuvwxyz'"  /*initialize word list, count; alphabet*/
q= "'";      abcU= abcL;       upper abcU        /*define uppercase version of alphabet.*/
accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ'               /*   "       "     of some accented chrs*/
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ'               /*   "   lowercase accented characters.*/
accG= 'αßΓπΣσµτΦΘΩδφε'                           /*   " of some lower/upper Greek letters*/
a=abcL || abcL ||accL ||accL || accG             /*   " char string of  after  letters. */
b=abcL || abcU ||accL ||accU || accG || xrange() /*   " char string of  before    "     */
x= 'Çà åå çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü'  /*list of 16-bit chars.*/
xs= words(x)                                                      /*num.  "   "    "    */
!.=                                              /*   "  the original word instances.   */
     do #=1  while lines(fID)\==0;  $=linein(fID)     /*loop whilst there are lines in file.*/
     if pos('├', $)\==0  then do k=1  for xs;  _=word(x, k)        /*any 16-bit chars?  */
                              $=changestr('├'left(_, 1), $, right(_, 1) )    /*convert. */
                              end   /*k*/
     $=space( translate( $, a, b) )              /*remove superfluous blanks in the line*/
         do while $\='';    parse var  $  z  $   /*now, process each word in the $ list.*/
         if  left(z, 1)==q  then z=substr(z, 2)              /*starts with an apostrophe?*/
         if right(z, 1)==q  then z=left(z, length(z) - 1)    /*ends     "   "      "     */
         if z=''  then iterate
         if @.z==0  then do;  c=c+1;  !.c=z;  end /*bump word count; assign word to array*/
         @@.z=z                                  /*save the original case of the word.  */
         @.z=@.z + 1                             /*bump the count of occurrences of word*/
         end   /*while*/
     end       /*#*/
say right('word', 40)  " "  center(' rank ', 6)  " count "   /*display a title for output*/
say right('════', 40)  " "  center('══════', 6)  "═══════"   /*   "    a title separator.*/
     do tops=1  by 0  until otops==tops | tops>top   /*process enough words to satisfy TOP.*/
Line 267 ⟶ 278:
          z=!.n                                  /*get the name of the capitalized word.*/
          if count==mc  then tl=tl z             /*handle cases of tied number of words.*/
          if count>mc   then do;  mc=count       /*this word count is the current max.  */
                                  tl=z           /*  "    word   "    "   "     "       */
                             end
Line 275 ⟶ 286:
          do d=1  for words(tl);  _=word(tl, d)
          if d==1  then w=max(8, length(@._))    /*use the length of the first word used*/
          say right(@@._, 40 )  right(commas(tops), wr)  right(commas(@._), w)
          @._=0                                  /*nullify this word count for next time*/
          end   /*d*/
     tops=tops + words(tl)                       /*correctly handle the tied rankings.  */
     end        /*tops*/
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: procedure;  parse arg _;   n=_'.9';    #=123456789;    b=verify(n, #, "M")
        e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4
          do j=e  to b  by -3;  _=insert(',', _, j);  end  /*j*/;      return _</lang>
{{out|output|text=  when using the default inputs:}}
<pre>
                                    word    rank    count
                                    ════   ══════  ═══════
                                     the        1   41,088
                                      of        2   19,949
                                     and        3   14,942
                                       a        4   14,595
                                      to        5   13,950
                                      in        6   11,214
                                      he        7    9,607
                                     was        8    8,620
                                    that        9    7,826
                                      it       10    6,535
</pre>
To see a list of the top 5,000 words that show (among other things) words like '''it's''' and other accented words, see the discussion page.
<br><br>
===version 2===