Word frequency: Difference between revisions
Line 3,823: | Line 3,823: | ||
=={{header|R}}==
===='''Version 1'''====
I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word, but "they" and "they're" won't). I also chose not to remove hyphens.
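As a rough illustration of the apostrophe rule described above (a minimal sketch, not the code of this entry), a trailing "'s" can be stripped with a single regular expression while every other apostrophe and all hyphens are left alone. The helper name <code>strip_possessive</code> is hypothetical, and treating the rule as "drop a word-final 's" is an assumption.

<lang R>
# Hypothetical helper: lower-case the words and drop a trailing "'s" only.
# Other apostrophes ("they're") and hyphens ("well-known") are kept.
strip_possessive <- function(words) {
  sub("'s$", "", tolower(words))
}

strip_possessive(c("mom", "Mom's", "they", "they're", "well-known"))
</lang>

With this rule "mom" and "Mom's" collapse to the same token, while "they" and "they're" stay distinct.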
<lang R>
Line 3,856: | Line 3,857: | ||
9 it 2308
10 i 1845
</pre>
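===='''Version 2'''====
This version is written in a purely functional style with the native pipe operator (<code>|></code>, available from R 4.1) and runs in less than a second. It relies on the <code>vroom</code> and <code>stringi</code> packages.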
<lang R>
word_frequency_pipeline <- function(file=NULL, n=10) {
  file |>
    # read the file as a character vector, one element per line
    vroom::vroom_lines() |>
    # split each line at word boundaries, skipping non-word tokens and numbers
    stringi::stri_split_boundaries(type="word", skip_word_none=TRUE, skip_word_number=TRUE) |>
    unlist() |>
    tolower() |>
    # count occurrences and order from most to least frequent
    table() |>
    sort(decreasing = TRUE) |>
    # keep the first n entries
    (\(.) .[1:n])() |>
    data.frame()
}
</lang>
{{Out}}
<pre>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
</pre>
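For readers without the two packages installed, the same pipeline shape can be sketched in base R. This is only an approximation under stated assumptions: the function name <code>word_frequency_base</code> is hypothetical, <code>readLines</code> stands in for <code>vroom::vroom_lines</code>, and a plain <code>[[:alpha:]]+</code> match stands in for the word-boundary splitting, so apostrophes and hyphens split words here and the counts will not match the output above exactly.

<lang R>
# Base-R sketch of the pipeline (assumption: words are runs of letters).
word_frequency_base <- function(file, n = 10) {
  file |>
    readLines(warn = FALSE) |>
    tolower() |>
    # extract every run of alphabetic characters from each line
    (\(x) regmatches(x, gregexpr("[[:alpha:]]+", x)))() |>
    unlist() |>
    table() |>
    sort(decreasing = TRUE) |>
    # keep at most n entries, without padding with NA when there are fewer
    (\(x) x[seq_len(min(n, length(x)))])() |>
    as.data.frame()
}

# Small self-contained check on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("The cat and the dog.", "The dog sat."), tmp)
word_frequency_base(tmp, n = 2)
</lang>

On the two-line sample file the top two words are "the" (3 occurrences) and "dog" (2). Wrapping either function in <code>system.time()</code> is one way to check the running time on a given machine.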