
Word frequency: Difference between revisions

10 had 6139
</pre>
 
=={{header|Picat}}==
To count only the text of the book itself, the Project Gutenberg header and footer are removed first. The program is then run with three different sets of word-splitting characters (<code>split_chars/2</code>).
<lang Picat>main =>
  NTop = 10,
  File = "les_miserables.txt",
  Chars = read_file_chars(File),

  % Remove the Project Gutenberg header/footer
  find(Chars,"*** START OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",_,HeaderEnd),
  find(Chars,"*** END OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",FooterStart,_),

  Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],

  % Split into words (different set of split characters)
  member(SplitType,[all,space_punct,space]),
  println(split_type=SplitType),
  split_chars(SplitType,SplitChars),
  Words = split(Book,SplitChars),

  println(freq(Words).to_list.sort_down(2).take(NTop)),
  nl,
  fail.

freq(L) = Freq =>
  Freq = new_map(),
  foreach(E in L)
    Freq.put(E,Freq.get(E,0)+1)
  end.

% different set of split chars
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</lang>
 
{{out}}
<pre>split_type = all
[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]
 
split_type = space_punct
[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]
 
split_type = space
[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]</pre>
 
The results differ slightly if the header and footer are not removed:
<pre>split_type = all
[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]
 
split_type = space_punct
[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]
 
split_type = space
[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]</pre>
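
The second set of figures presumably comes from running the same program without the trimming step. The following is a minimal sketch of that variant, not necessarily the exact code used: it assumes the same local file <code>les_miserables.txt</code> and reuses the <code>freq/1</code> and <code>split_chars/2</code> definitions unchanged, dropping only the two <code>find/4</code> calls and the <code>slice</code>.
<lang Picat>main =>
  NTop = 10,
  File = "les_miserables.txt",  % assumed: same local copy as in the program above

  % No header/footer removal: lowercase every character in the file
  Book = [to_lowercase(C) : C in read_file_chars(File)],

  % Try each set of split characters in turn (failure-driven loop)
  member(SplitType,[all,space_punct,space]),
  println(split_type=SplitType),
  split_chars(SplitType,SplitChars),
  Words = split(Book,SplitChars),

  println(freq(Words).to_list.sort_down(2).take(NTop)),
  nl,
  fail.

% Count the occurrences of each element of L in a map
freq(L) = Freq =>
  Freq = new_map(),
  foreach(E in L)
    Freq.put(E,Freq.get(E,0)+1)
  end.

% different set of split chars (same as above)
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</lang>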
 
 
=={{header|PicoLisp}}==