N-grams: Difference between revisions

Content added Content deleted

Inline

Revision as of 12:15, 22 April 2023

An N-gram is a sequence of N contiguous elements of a given text. Although N-grams refer sometimes to words or syllables, in this task we will consider only sequences of characters. The task consists in, given a text and an integer size of the desired N-grams, find all the different contiguous sequences of N characters, together with the number of times they appear in the text. For example, the 2-grams of the text "Live and let live" are:

 "LI" - 2
 "IV" - 2
 "VE" - 2
 " L" - 2
 "E " - 1
 " A" - 1
 "AN" - 1
 "ND" - 1
 "D " - 1
 "LE" - 1
 "ET" - 1
 "T " - 1

Note that space and other non-alphanumeric characters are taken into account.

See also

Related task: Sorensen–Dice coefficient

Arturo

ngrams: function [s :string n :integer][
    0..sub size s n | map 'i -> slice upper s i i+n-1
                    | tally
]

loop [2 3 4] 'n [
    print ~"|n|-grams:"
    loop ngrams "Live and let live" n [k v] -> print [~{"|k|"} v]
    print ""
]

Output:

2-grams:
"LI" 2 
"IV" 2 
"VE" 2 
"E " 1 
" A" 1 
"AN" 1 
"ND" 1 
"D " 1 
" L" 2 
"LE" 1 
"ET" 1 
"T " 1 

3-grams:
"LIV" 2 
"IVE" 2 
"VE " 1 
"E A" 1 
" AN" 1 
"AND" 1 
"ND " 1 
"D L" 1 
" LE" 1 
"LET" 1 
"ET " 1 
"T L" 1 
" LI" 1 

4-grams:
"LIVE" 2 
"IVE " 1 
"VE A" 1 
"E AN" 1 
" AND" 1 
"AND " 1 
"ND L" 1 
"D LE" 1 
" LET" 1 
"LET " 1 
"ET L" 1 
"T LI" 1 
" LIV" 1

Common Lisp

A hash table is used to store and retrieve the n-grams fast.

(defun n-grams (text n)
"Return a list of all the N-grams of length n in the text, together with their frequency"
	(let* (res (*ht-n-grams* (make-hash-table :test 'equal)) )
		(loop for i from 0 to (- (length text) n) do
			(let* ((n-gram (string-upcase (subseq text i (+ i n))))
			       (freq (gethash n-gram *ht-n-grams*)))
		  		(setf (gethash n-gram *ht-n-grams*) (if (null freq) 1 (1+ freq))) ))
			(maphash #'(lambda (key val)
									(push (cons key val) res) )
				*ht-n-grams* )
			(sort res #'> :key #'cdr) ))

Output:

> (n-grams "Live and let live" 2)

(("LI" . 2) ("IV" . 2) ("VE" . 2) (" L" . 2) ("E " . 1) (" A" . 1) ("AN" . 1)
 ("ND" . 1) ("D " . 1) ("LE" . 1) ("ET" . 1) ("T " . 1))

Factor

Works with: Factor version 0.99 2022-04-03

USING: ascii grouping kernel math.statistics prettyprint ;

: n-grams ( str n -- assoc ) [ >upper ] dip clump histogram ;

"Live and let live" 2 n-grams .

Output:

H{
    { "ET" 1 }
    { "IV" 2 }
    { "T " 1 }
    { " A" 1 }
    { "VE" 2 }
    { "LI" 2 }
    { "E " 1 }
    { "D " 1 }
    { " L" 2 }
    { "ND" 1 }
    { "LE" 1 }
    { "AN" 1 }
}

jq

Works with jq and gojq, that is, the C and Go implementations of jq.

# Generic "bag of words" utility:
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# The ngrams as a bow
def ngrams($n):
  ascii_upcase as $text
  | bow( range(0;$text|1+ length - $n) as $i | $text[$i:$i+$n]);

# The task
# Sort by increasing frequency, then by lexicographical order
def ngrams($text; $n):
  ($text|ngrams($n)) as $ngrams
  | "\nAll \($n)-grams of '\($text)' and their frequencies:",
    ($ngrams|to_entries|sort_by(.value,.key)[] | "\(.key): \(.value)" ) ;

ngrams("Live and let live"; 2,3,4)

Output:


All 2-grams of 'Live and let live' and their frequencies:
 A: 1
AN: 1
D : 1
E : 1
ET: 1
LE: 1
ND: 1
T : 1
 L: 2
IV: 2
LI: 2
VE: 2

All 3-grams of 'Live and let live' and their frequencies:
 AN: 1
 LE: 1
 LI: 1
AND: 1
D L: 1
E A: 1
ET : 1
LET: 1
ND : 1
T L: 1
VE : 1
IVE: 2
LIV: 2

All 4-grams of 'Live and let live' and their frequencies:
 AND: 1
 LET: 1
 LIV: 1
AND : 1
D LE: 1
E AN: 1
ET L: 1
IVE : 1
LET : 1
ND L: 1
T LI: 1
VE A: 1
LIVE: 2

Phix

A dictionary is used to find the index of already-seen n-grams, even though a simpler find() would be good enough for this task.
I have replicated most orderings found on this page, the task description order corresponds to orig/freq,
jq is alpha/freq but high last, but there is no equivalent for the Factor or Raku orderings here ;-).

with javascript_semantics
function n_grams(integer len, string txt, sequence orders)
    sequence ng = {}, ngc = {}
    integer d = new_dict()
    txt = upper(txt)
    for i=1 to length(txt)-len+1 do
        string tn = txt[i..i+len-1]
        integer ndx = getdd(tn,0,d)
        if ndx=0 then
            ng = append(ng,tn)
            ngc = append(ngc,1)
            ndx = length(ng)
            setd(tn,ndx,d)
        else
            ngc[ndx] += 1
        end if
    end for
    destroy_dict(d)
    integer l = length(ng)
    sequence ares = columnize({ng,ngc,tagset(l)}),
              res = {}
    for c in orders do
        if c="original" then -- original/first found order
            res = append(res,ares)
        elsif c="orig/freq" then -- "" but higher freq first
            res = append(res,sort_columns(ares,{-2,3}))
        elsif c="alphabetic" then -- alphabetical order
            res = append(res,sort(deep_copy(ares)))
        elsif c="alpha/freq" then -- "" but higher freq first
            res = append(res,sort_columns(ares,{-2,1}))
        else
            ?9/0 -- (unknown ordering requested)
        end if
    end for
    return res
end function

constant src = "Live and let live",
         orders = {"original","orig/freq",
                 "alphabetic","alpha/freq"}
printf(1,"For \"%s\":\n",src)
for l=2 to 4 do
    sequence res = n_grams(l,src,orders)
    string count = ordinal(length(res[1]),true)
    printf(1,"There are %s unique %d-grams:\n",{count,l})
    for i,r in res do
        printf(1,"%12s: %s\n",{orders[i],join(r,", ",fmt:="%s %d")})
    end for
end for

Output:

For "Live and let live":
There are twelve unique 2-grams:
    original: LI 2, IV 2, VE 2, E  1,  A 1, AN 1, ND 1, D  1,  L 2, LE 1, ET 1, T  1
   orig/freq: LI 2, IV 2, VE 2,  L 2, E  1,  A 1, AN 1, ND 1, D  1, LE 1, ET 1, T  1
  alphabetic:  A 1,  L 2, AN 1, D  1, E  1, ET 1, IV 2, LE 1, LI 2, ND 1, T  1, VE 2
  alpha/freq:  L 2, IV 2, LI 2, VE 2,  A 1, AN 1, D  1, E  1, ET 1, LE 1, ND 1, T  1
There are thirteen unique 3-grams:
    original: LIV 2, IVE 2, VE  1, E A 1,  AN 1, AND 1, ND  1, D L 1,  LE 1, LET 1, ET  1, T L 1,  LI 1
   orig/freq: LIV 2, IVE 2, VE  1, E A 1,  AN 1, AND 1, ND  1, D L 1,  LE 1, LET 1, ET  1, T L 1,  LI 1
  alphabetic:  AN 1,  LE 1,  LI 1, AND 1, D L 1, E A 1, ET  1, IVE 2, LET 1, LIV 2, ND  1, T L 1, VE  1
  alpha/freq: IVE 2, LIV 2,  AN 1,  LE 1,  LI 1, AND 1, D L 1, E A 1, ET  1, LET 1, ND  1, T L 1, VE  1
There are thirteen unique 4-grams:
    original: LIVE 2, IVE  1, VE A 1, E AN 1,  AND 1, AND  1, ND L 1, D LE 1,  LET 1, LET  1, ET L 1, T LI 1,  LIV 1
   orig/freq: LIVE 2, IVE  1, VE A 1, E AN 1,  AND 1, AND  1, ND L 1, D LE 1,  LET 1, LET  1, ET L 1, T LI 1,  LIV 1
  alphabetic:  AND 1,  LET 1,  LIV 1, AND  1, D LE 1, E AN 1, ET L 1, IVE  1, LET  1, LIVE 2, ND L 1, T LI 1, VE A 1
  alpha/freq: LIVE 2,  AND 1,  LET 1,  LIV 1, AND  1, D LE 1, E AN 1, ET L 1, IVE  1, LET  1, ND L 1, T LI 1, VE A 1

Python

This example generates n-grams lazily, much like the sliding_window recipe from the Python itertools docs.

from collections import Counter
from collections import deque
from itertools import islice


def n_grams(text, n):
    """Generate contiguous sequences of _n_ characters from _text_."""
    it = iter(text.upper())
    ngram = deque(islice(it, n), maxlen=n)
    if len(ngram) == n:
        yield "".join(ngram)
    for ch in it:
        ngram.append(ch)
        yield "".join(ngram)


if __name__ == "__main__":
    import pprint

    example = "Live and let live"

    for n in range(2, 5):
        result = Counter(n_grams(example, n)).most_common()
        print(
            f"{len(result)} {n}-grams of {example!r}:\n",
            pprint.pformat(result, compact=True),
            end="\n\n",
        )

Output:

12 2-grams of 'Live and let live':
 [('LI', 2), ('IV', 2), ('VE', 2), (' L', 2), ('E ', 1), (' A', 1), ('AN', 1),
 ('ND', 1), ('D ', 1), ('LE', 1), ('ET', 1), ('T ', 1)]

13 3-grams of 'Live and let live':
 [('LIV', 2), ('IVE', 2), ('VE ', 1), ('E A', 1), (' AN', 1), ('AND', 1),
 ('ND ', 1), ('D L', 1), (' LE', 1), ('LET', 1), ('ET ', 1), ('T L', 1),
 (' LI', 1)]

13 4-grams of 'Live and let live':
 [('LIVE', 2), ('IVE ', 1), ('VE A', 1), ('E AN', 1), (' AND', 1), ('AND ', 1),
 ('ND L', 1), ('D LE', 1), (' LET', 1), ('LET ', 1), ('ET L', 1), ('T LI', 1),
 (' LIV', 1)]

Raku

sub n-gram ($this, $N=2) { Bag.new( flat $this.uc.map: { .comb.rotor($N => -($N-1))».join } ) }
dd 'Live and let live'.&n-gram; # bi-gram
dd 'Live and let live'.&n-gram(3); # tri-gram

Output:

("IV"=>2,"T "=>1,"VE"=>2,"E "=>1,"LE"=>1,"AN"=>1,"LI"=>2,"ND"=>1,"ET"=>1," L"=>2," A"=>1,"D "=>1).Bag
("ET "=>1,"AND"=>1,"LIV"=>2," LI"=>1,"ND "=>1," LE"=>1,"IVE"=>2,"E A"=>1,"VE "=>1,"T L"=>1,"D L"=>1,"LET"=>1," AN"=>1).Bag

Wren

Library: Wren-str

Library: Wren-maputil

Library: Wren-fmt

import "./str" for Str
import "./maputil" for MapUtil
import "./fmt" for Fmt

var findNgrams = Fn.new { |n, text|
    text = Str.upper(text)
    var ngrams = {}
    for (i in 0..text.count-n) {
        var ngram = text[i...i+n]
        MapUtil.increase(ngrams, ngram)
    }
    return ngrams
}

// sort by decreasing frequency, then by lexicographical order
var comparer = Fn.new { |i, j|
    if (i.value != j.value) return j.value < i.value
    return Str.lt(i.key, j.key)
}

var text = "Live and let live"
for (n in [2, 3, 4]) {
    var ngrams = findNgrams.call(n, text)
    System.print("All %(n)-grams of '%(text)' and their frequencies:")
    var ng = ngrams.toList.sort(comparer).map { |me| "(\"%(me.key)\" : %(me.value))"}
    Fmt.tprint("$s  ", ng, 5)
    System.print()
}

Output:

All 2-grams of 'Live and let live' and their frequencies:
(" L" : 2)   ("IV" : 2)   ("LI" : 2)   ("VE" : 2)   (" A" : 1)   
("AN" : 1)   ("D " : 1)   ("E " : 1)   ("ET" : 1)   ("LE" : 1)   
("ND" : 1)   ("T " : 1)   

All 3-grams of 'Live and let live' and their frequencies:
("IVE" : 2)   ("LIV" : 2)   (" AN" : 1)   (" LE" : 1)   (" LI" : 1)   
("AND" : 1)   ("D L" : 1)   ("E A" : 1)   ("ET " : 1)   ("LET" : 1)   
("ND " : 1)   ("T L" : 1)   ("VE " : 1)   

All 4-grams of 'Live and let live' and their frequencies:
("LIVE" : 2)   (" AND" : 1)   (" LET" : 1)   (" LIV" : 1)   ("AND " : 1)   
("D LE" : 1)   ("E AN" : 1)   ("ET L" : 1)   ("IVE " : 1)   ("LET " : 1)   
("ND L" : 1)   ("T LI" : 1)   ("VE A" : 1)