Word frequency: Difference between revisions

@@ Line 3,324: / Line 3,324: @@
 had    6139
 </pre>
+=={{header|Picat}}==
+To isolate the text of the book, the Project Gutenberg header and footer are removed. The program is then run with three different sets of word-splitting characters (<code>split_chars/2</code>).
+<lang Picat>import util.
+
+main =>
+  NTop = 10,
+  File = "les_miserables.txt",
+  Chars = read_file_chars(File),
+  % Remove the Project Gutenberg header/footer
+  find(Chars,"*** START OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",_,HeaderEnd),
+  find(Chars,"*** END OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",FooterStart,_),
+  Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],
+  % Split into words; member/2 together with the final fail
+  % backtracks through each set of split characters in turn
+  member(SplitType,[all,space_punct,space]),
+  println(split_type=SplitType),
+  split_chars(SplitType,SplitChars),
+  Words = split(Book,SplitChars),
+  println(freq(Words).to_list.sort_down(2).take(NTop)),
+  nl,
+  fail.
+freq(L) = Freq =>
+  Freq = new_map(),
+  foreach(E in L)
+    Freq.put(E,Freq.get(E,0)+1)
+  end.
+% different set of split chars
+split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
+split_chars(space_punct,"\n\r \t,;!.?").
+split_chars(space,"\n\r \t").</lang>
+{{out}}
+<pre>split_type = all
+[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]
+split_type = space_punct
+[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]
+split_type = space
+[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]</pre>
+The results differ slightly if the header and footer are not removed:
+<pre>split_type = all
+[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]
+split_type = space_punct
+[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]
+split_type = space
+[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]</pre>
 =={{header|PicoLisp}}==