N-grams: Difference between revisions
Content deleted Content added
m →{{header|jq}}: fix |
Drkameleon (talk | contribs) added Arturo |
||
Line 23: | Line 23: | ||
;* [[Sorensen–Dice_coefficient|Related task: Sorensen–Dice coefficient]] |
;* [[Sorensen–Dice_coefficient|Related task: Sorensen–Dice coefficient]] |
||
=={{header|Arturo}}== |
|||
<syntaxhighlight lang="arturo">ngrams: function [s :string n :integer][ |
|||
0..sub size s n | map 'i -> slice upper s i i+n-1 |
|||
| tally |
|||
] |
|||
loop [2 3 4] 'n [ |
|||
print ~"|n|-grams:" |
|||
loop ngrams "Live and let live" n [k v] -> print [~{"|k|"} v] |
|||
print "" |
|||
]</syntaxhighlight> |
|||
{{out}} |
|||
<pre>2-grams: |
|||
"LI" 2 |
|||
"IV" 2 |
|||
"VE" 2 |
|||
"E " 1 |
|||
" A" 1 |
|||
"AN" 1 |
|||
"ND" 1 |
|||
"D " 1 |
|||
" L" 2 |
|||
"LE" 1 |
|||
"ET" 1 |
|||
"T " 1 |
|||
3-grams: |
|||
"LIV" 2 |
|||
"IVE" 2 |
|||
"VE " 1 |
|||
"E A" 1 |
|||
" AN" 1 |
|||
"AND" 1 |
|||
"ND " 1 |
|||
"D L" 1 |
|||
" LE" 1 |
|||
"LET" 1 |
|||
"ET " 1 |
|||
"T L" 1 |
|||
" LI" 1 |
|||
4-grams: |
|||
"LIVE" 2 |
|||
"IVE " 1 |
|||
"VE A" 1 |
|||
"E AN" 1 |
|||
" AND" 1 |
|||
"AND " 1 |
|||
"ND L" 1 |
|||
"D LE" 1 |
|||
" LET" 1 |
|||
"LET " 1 |
|||
"ET L" 1 |
|||
"T LI" 1 |
|||
" LIV" 1</pre> |
|||
=={{header|Common Lisp}}== |
=={{header|Common Lisp}}== |
Revision as of 09:05, 22 April 2023
N-grams is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
An N-gram is a sequence of N contiguous elements of a given text. Although N-grams refer sometimes to words or syllables, in this task we will consider only sequences of characters. The task consists in, given a text and an integer size of the desired N-grams, find all the different contiguous sequences of N characters, together with the number of times they appear in the text. For example, the 2-grams of the text "Live and let live" are:
"LI" - 2 "IV" - 2 "VE" - 2 " L" - 2 "E " - 1 " A" - 1 "AN" - 1 "ND" - 1 "D " - 1 "LE" - 1 "ET" - 1 "T " - 1
Note that space and other non-alphanumeric characters are taken into account.
- See also
Arturo
ngrams: function [s :string n :integer][
0..sub size s n | map 'i -> slice upper s i i+n-1
| tally
]
loop [2 3 4] 'n [
print ~"|n|-grams:"
loop ngrams "Live and let live" n [k v] -> print [~{"|k|"} v]
print ""
]
- Output:
2-grams: "LI" 2 "IV" 2 "VE" 2 "E " 1 " A" 1 "AN" 1 "ND" 1 "D " 1 " L" 2 "LE" 1 "ET" 1 "T " 1 3-grams: "LIV" 2 "IVE" 2 "VE " 1 "E A" 1 " AN" 1 "AND" 1 "ND " 1 "D L" 1 " LE" 1 "LET" 1 "ET " 1 "T L" 1 " LI" 1 4-grams: "LIVE" 2 "IVE " 1 "VE A" 1 "E AN" 1 " AND" 1 "AND " 1 "ND L" 1 "D LE" 1 " LET" 1 "LET " 1 "ET L" 1 "T LI" 1 " LIV" 1
Common Lisp
A hash table is used to store and retrieve the n-grams fast.
(defun n-grams (text n)
"Return a list of all the N-grams of length n in the text, together with their frequency"
(let* (res (*ht-n-grams* (make-hash-table :test 'equal)) )
(loop for i from 0 to (- (length text) n) do
(let* ((n-gram (string-upcase (subseq text i (+ i n))))
(freq (gethash n-gram *ht-n-grams*)))
(setf (gethash n-gram *ht-n-grams*) (if (null freq) 1 (1+ freq))) ))
(maphash #'(lambda (key val)
(push (cons key val) res) )
*ht-n-grams* )
(sort res #'> :key #'cdr) ))
- Output:
> (n-grams "Live and let live" 2)
(("LI" . 2) ("IV" . 2) ("VE" . 2) (" L" . 2) ("E " . 1) (" A" . 1) ("AN" . 1)
("ND" . 1) ("D " . 1) ("LE" . 1) ("ET" . 1) ("T " . 1))
Factor
USING: ascii grouping kernel math.statistics prettyprint ;
: n-grams ( str n -- assoc ) [ >upper ] dip clump histogram ;
"Live and let live" 2 n-grams .
- Output:
H{ { "ET" 1 } { "IV" 2 } { "T " 1 } { " A" 1 } { "VE" 2 } { "LI" 2 } { "E " 1 } { "D " 1 } { " L" 2 } { "ND" 1 } { "LE" 1 } { "AN" 1 } }
jq
Works with jq and gojq, that is, the C and Go implementations of jq.
# Generic "bag of words" utility:
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
# The ngrams as a bow
def ngrams($n):
ascii_upcase as $text
| bow( range(0;$text|1+ length - $n) as $i | $text[$i:$i+$n]);
# The task
# Sort by increasing frequency, then by lexicographical order
def ngrams($text; $n):
($text|ngrams($n)) as $ngrams
| "\nAll \($n)-grams of '\($text)' and their frequencies:",
($ngrams|to_entries|sort_by(.value,.key)[] | "\(.key): \(.value)" ) ;
ngrams("Live and let live"; 2,3,4)
- Output:
All 2-grams of 'Live and let live' and their frequencies: A: 1 AN: 1 D : 1 E : 1 ET: 1 LE: 1 ND: 1 T : 1 L: 2 IV: 2 LI: 2 VE: 2 All 3-grams of 'Live and let live' and their frequencies: AN: 1 LE: 1 LI: 1 AND: 1 D L: 1 E A: 1 ET : 1 LET: 1 ND : 1 T L: 1 VE : 1 IVE: 2 LIV: 2 All 4-grams of 'Live and let live' and their frequencies: AND: 1 LET: 1 LIV: 1 AND : 1 D LE: 1 E AN: 1 ET L: 1 IVE : 1 LET : 1 ND L: 1 T LI: 1 VE A: 1 LIVE: 2
Raku
sub n-gram ($this, $N=2) { Bag.new( flat $this.uc.map: { .comb.rotor($N => -($N-1))».join } ) }
dd 'Live and let live'.&n-gram; # bi-gram
dd 'Live and let live'.&n-gram(3); # tri-gram
- Output:
("IV"=>2,"T "=>1,"VE"=>2,"E "=>1,"LE"=>1,"AN"=>1,"LI"=>2,"ND"=>1,"ET"=>1," L"=>2," A"=>1,"D "=>1).Bag ("ET "=>1,"AND"=>1,"LIV"=>2," LI"=>1,"ND "=>1," LE"=>1,"IVE"=>2,"E A"=>1,"VE "=>1,"T L"=>1,"D L"=>1,"LET"=>1," AN"=>1).Bag
Wren
import "./str" for Str
import "./maputil" for MapUtil
import "./fmt" for Fmt
var findNgrams = Fn.new { |n, text|
text = Str.upper(text)
var ngrams = {}
for (i in 0..text.count-n) {
var ngram = text[i...i+n]
MapUtil.increase(ngrams, ngram)
}
return ngrams
}
// sort by decreasing frequency, then by lexicographical order
var comparer = Fn.new { |i, j|
if (i.value != j.value) return j.value < i.value
return Str.lt(i.key, j.key)
}
var text = "Live and let live"
for (n in [2, 3, 4]) {
var ngrams = findNgrams.call(n, text)
System.print("All %(n)-grams of '%(text)' and their frequencies:")
var ng = ngrams.toList.sort(comparer).map { |me| "(\"%(me.key)\" : %(me.value))"}
Fmt.tprint("$s ", ng, 5)
System.print()
}
- Output:
All 2-grams of 'Live and let live' and their frequencies: (" L" : 2) ("IV" : 2) ("LI" : 2) ("VE" : 2) (" A" : 1) ("AN" : 1) ("D " : 1) ("E " : 1) ("ET" : 1) ("LE" : 1) ("ND" : 1) ("T " : 1) All 3-grams of 'Live and let live' and their frequencies: ("IVE" : 2) ("LIV" : 2) (" AN" : 1) (" LE" : 1) (" LI" : 1) ("AND" : 1) ("D L" : 1) ("E A" : 1) ("ET " : 1) ("LET" : 1) ("ND " : 1) ("T L" : 1) ("VE " : 1) All 4-grams of 'Live and let live' and their frequencies: ("LIVE" : 2) (" AND" : 1) (" LET" : 1) (" LIV" : 1) ("AND " : 1) ("D LE" : 1) ("E AN" : 1) ("ET L" : 1) ("IVE " : 1) ("LET " : 1) ("ND L" : 1) ("T LI" : 1) ("VE A" : 1)