Word frequency: Difference between revisions
10 had 6139
</pre>
=={{header|Picat}}==
To get the book proper, the Project Gutenberg header and footer are removed. The tests below split the text into words using three different sets of split characters (<code>split_chars/2</code>).
<lang Picat>main =>
  NTop = 10,
  File = "les_miserables.txt",
  Chars = read_file_chars(File),

  % Remove the Project Gutenberg header/footer
  find(Chars,"*** START OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",_,HeaderEnd),
  find(Chars,"*** END OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",FooterStart,_),
  Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],

  % Split into words (different set of split characters)
  member(SplitType,[all,space_punct,space]),
  println(split_type=SplitType),
  split_chars(SplitType,SplitChars),
  Words = split(Book,SplitChars),
  println(freq(Words).to_list.sort_down(2).take(NTop)),
  nl,
  fail.

freq(L) = Freq =>
  Freq = new_map(),
  foreach(E in L)
    Freq.put(E,Freq.get(E,0)+1)
  end.

% different set of split chars
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</lang>
{{out}}
<pre>split_type = all
[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]

split_type = space_punct
[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]

split_type = space
[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]</pre>
The result differs slightly if the header and footer are not removed:
<pre>split_type = all
[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]

split_type = space_punct
[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]

split_type = space
[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]</pre>
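The <code>” = 6578</code> entry in the <code>space_punct</code> results is likely a side effect of the split set: when <code>.</code> and the space are split characters but the closing quote is not, a quote that follows a full stop is left as a token of its own. The sketch below reproduces that effect on a tiny sample. It is written in Python rather than Picat, purely as an illustration; the sample sentence and the helper names <code>split_words</code> and <code>top_words</code> are made up for this example, and the helper assumes that empty tokens between adjacent split characters are dropped.

```python
from collections import Counter

# Hypothetical sample text (not taken from the book).
SAMPLE = "“Go,” he said. “Go now.” He went."

# Split-character sets copied from the three split_chars/2 clauses above.
SPLIT_SETS = {
    "all": "\n\r \t,;!.?()[]”\"-“—-__‘’*",
    "space_punct": "\n\r \t,;!.?",
    "space": "\n\r \t",
}

def split_words(text, split_chars):
    """Split text on any character in split_chars, dropping empty tokens (an assumption)."""
    words, current = [], []
    for ch in text:
        if ch in split_chars:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

def top_words(text, split_chars, n=3):
    """Return the n most frequent lowercased tokens with their counts."""
    return Counter(split_words(text.lower(), split_chars)).most_common(n)

if __name__ == "__main__":
    for name, chars in SPLIT_SETS.items():
        print(name, top_words(SAMPLE, chars))
```

With the <code>space_punct</code> set, the closing quote after <code>,</code> or <code>.</code> ends up between two split characters and is counted as a word; the <code>all</code> set removes it, and the <code>space</code> set leaves it attached to the preceding word.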
=={{header|PicoLisp}}==