N-grams: Difference between revisions

Content deleted Content added

Inline

Revision as of 09:05, 22 April 2023

An N-gram is a sequence of N contiguous elements of a given text. Although N-grams refer sometimes to words or syllables, in this task we will consider only sequences of characters. The task consists in, given a text and an integer size of the desired N-grams, find all the different contiguous sequences of N characters, together with the number of times they appear in the text. For example, the 2-grams of the text "Live and let live" are:

 "LI" - 2
 "IV" - 2
 "VE" - 2
 " L" - 2
 "E " - 1
 " A" - 1
 "AN" - 1
 "ND" - 1
 "D " - 1
 "LE" - 1
 "ET" - 1
 "T " - 1

Note that space and other non-alphanumeric characters are taken into account.

See also

Related task: Sorensen–Dice coefficient

Arturo

ngrams: function [s :string n :integer][
    0..sub size s n | map 'i -> slice upper s i i+n-1
                    | tally
]

loop [2 3 4] 'n [
    print ~"|n|-grams:"
    loop ngrams "Live and let live" n [k v] -> print [~{"|k|"} v]
    print ""
]

Output:

2-grams:
"LI" 2 
"IV" 2 
"VE" 2 
"E " 1 
" A" 1 
"AN" 1 
"ND" 1 
"D " 1 
" L" 2 
"LE" 1 
"ET" 1 
"T " 1 

3-grams:
"LIV" 2 
"IVE" 2 
"VE " 1 
"E A" 1 
" AN" 1 
"AND" 1 
"ND " 1 
"D L" 1 
" LE" 1 
"LET" 1 
"ET " 1 
"T L" 1 
" LI" 1 

4-grams:
"LIVE" 2 
"IVE " 1 
"VE A" 1 
"E AN" 1 
" AND" 1 
"AND " 1 
"ND L" 1 
"D LE" 1 
" LET" 1 
"LET " 1 
"ET L" 1 
"T LI" 1 
" LIV" 1

Common Lisp

A hash table is used to store and retrieve the n-grams fast.

(defun n-grams (text n)
"Return a list of all the N-grams of length n in the text, together with their frequency"
	(let* (res (*ht-n-grams* (make-hash-table :test 'equal)) )
		(loop for i from 0 to (- (length text) n) do
			(let* ((n-gram (string-upcase (subseq text i (+ i n))))
			       (freq (gethash n-gram *ht-n-grams*)))
		  		(setf (gethash n-gram *ht-n-grams*) (if (null freq) 1 (1+ freq))) ))
			(maphash #'(lambda (key val)
									(push (cons key val) res) )
				*ht-n-grams* )
			(sort res #'> :key #'cdr) ))

Output:

> (n-grams "Live and let live" 2)

(("LI" . 2) ("IV" . 2) ("VE" . 2) (" L" . 2) ("E " . 1) (" A" . 1) ("AN" . 1)
 ("ND" . 1) ("D " . 1) ("LE" . 1) ("ET" . 1) ("T " . 1))

Factor

Works with: Factor version 0.99 2022-04-03

USING: ascii grouping kernel math.statistics prettyprint ;

: n-grams ( str n -- assoc ) [ >upper ] dip clump histogram ;

"Live and let live" 2 n-grams .

Output:

H{
    { "ET" 1 }
    { "IV" 2 }
    { "T " 1 }
    { " A" 1 }
    { "VE" 2 }
    { "LI" 2 }
    { "E " 1 }
    { "D " 1 }
    { " L" 2 }
    { "ND" 1 }
    { "LE" 1 }
    { "AN" 1 }
}

jq

Works with jq and gojq, that is, the C and Go implementations of jq.

# Generic "bag of words" utility:
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# The ngrams as a bow
def ngrams($n):
  ascii_upcase as $text
  | bow( range(0;$text|1+ length - $n) as $i | $text[$i:$i+$n]);

# The task
# Sort by increasing frequency, then by lexicographical order
def ngrams($text; $n):
  ($text|ngrams($n)) as $ngrams
  | "\nAll \($n)-grams of '\($text)' and their frequencies:",
    ($ngrams|to_entries|sort_by(.value,.key)[] | "\(.key): \(.value)" ) ;

ngrams("Live and let live"; 2,3,4)

Output:


All 2-grams of 'Live and let live' and their frequencies:
 A: 1
AN: 1
D : 1
E : 1
ET: 1
LE: 1
ND: 1
T : 1
 L: 2
IV: 2
LI: 2
VE: 2

All 3-grams of 'Live and let live' and their frequencies:
 AN: 1
 LE: 1
 LI: 1
AND: 1
D L: 1
E A: 1
ET : 1
LET: 1
ND : 1
T L: 1
VE : 1
IVE: 2
LIV: 2

All 4-grams of 'Live and let live' and their frequencies:
 AND: 1
 LET: 1
 LIV: 1
AND : 1
D LE: 1
E AN: 1
ET L: 1
IVE : 1
LET : 1
ND L: 1
T LI: 1
VE A: 1
LIVE: 2

Raku

sub n-gram ($this, $N=2) { Bag.new( flat $this.uc.map: { .comb.rotor($N => -($N-1))».join } ) }
dd 'Live and let live'.&n-gram; # bi-gram
dd 'Live and let live'.&n-gram(3); # tri-gram

Output:

("IV"=>2,"T "=>1,"VE"=>2,"E "=>1,"LE"=>1,"AN"=>1,"LI"=>2,"ND"=>1,"ET"=>1," L"=>2," A"=>1,"D "=>1).Bag
("ET "=>1,"AND"=>1,"LIV"=>2," LI"=>1,"ND "=>1," LE"=>1,"IVE"=>2,"E A"=>1,"VE "=>1,"T L"=>1,"D L"=>1,"LET"=>1," AN"=>1).Bag

Wren

Library: Wren-str

Library: Wren-maputil

Library: Wren-fmt

import "./str" for Str
import "./maputil" for MapUtil
import "./fmt" for Fmt

var findNgrams = Fn.new { |n, text|
    text = Str.upper(text)
    var ngrams = {}
    for (i in 0..text.count-n) {
        var ngram = text[i...i+n]
        MapUtil.increase(ngrams, ngram)
    }
    return ngrams
}

// sort by decreasing frequency, then by lexicographical order
var comparer = Fn.new { |i, j|
    if (i.value != j.value) return j.value < i.value
    return Str.lt(i.key, j.key)
}

var text = "Live and let live"
for (n in [2, 3, 4]) {
    var ngrams = findNgrams.call(n, text)
    System.print("All %(n)-grams of '%(text)' and their frequencies:")
    var ng = ngrams.toList.sort(comparer).map { |me| "(\"%(me.key)\" : %(me.value))"}
    Fmt.tprint("$s  ", ng, 5)
    System.print()
}

Output:

All 2-grams of 'Live and let live' and their frequencies:
(" L" : 2)   ("IV" : 2)   ("LI" : 2)   ("VE" : 2)   (" A" : 1)   
("AN" : 1)   ("D " : 1)   ("E " : 1)   ("ET" : 1)   ("LE" : 1)   
("ND" : 1)   ("T " : 1)   

All 3-grams of 'Live and let live' and their frequencies:
("IVE" : 2)   ("LIV" : 2)   (" AN" : 1)   (" LE" : 1)   (" LI" : 1)   
("AND" : 1)   ("D L" : 1)   ("E A" : 1)   ("ET " : 1)   ("LET" : 1)   
("ND " : 1)   ("T L" : 1)   ("VE " : 1)   

All 4-grams of 'Live and let live' and their frequencies:
("LIVE" : 2)   (" AND" : 1)   (" LET" : 1)   (" LIV" : 1)   ("AND " : 1)   
("D LE" : 1)   ("E AN" : 1)   ("ET L" : 1)   ("IVE " : 1)   ("LET " : 1)   
("ND L" : 1)   ("T LI" : 1)   ("VE A" : 1)