N-grams

From Rosetta Code
Revision as of 16:10, 21 April 2023 by Thundergnat (talk | contribs) (Add draft markup, related task, Raku example)
N-grams is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

An N-gram is a sequence of N contiguous elements of a given text. Although N-grams refer sometimes to words or syllables, in this task we will consider only sequences of characters. The task consists in, given a text and an integer size of the desired N-grams, find all the different contiguous sequences of N characters, together with the number of times they appear in the text. For example, the 2-grams of the text "Live and let live" are:

 "LI" - 2
 "IV" - 2
 "VE" - 2
 " L" - 2
 "E " - 1
 " A" - 1
 "AN" - 1
 "ND" - 1
 "D " - 1
 "LE" - 1
 "ET" - 1
 "T " - 1

Note that space and other non-alphanumeric characters are taken into account.


See also


Common Lisp

A hash table is used to store and retrieve the n-grams fast.

(defun n-grams (text n)
"Return a list of all the N-grams of length n in the text, together with their frequency"
	(let* (res (*ht-n-grams* (make-hash-table :test 'equal)) )
		(loop for i from 0 to (- (length text) n) do
			(let* ((n-gram (string-upcase (subseq text i (+ i n))))
			       (freq (gethash n-gram *ht-n-grams*)))
		  		(setf (gethash n-gram *ht-n-grams*) (if (null freq) 1 (1+ freq))) ))
			(maphash #'(lambda (key val)
									(push (cons key val) res) )
				*ht-n-grams* )
			(sort res #'> :key #'cdr) ))
Output:
> (n-grams "Live and let live" 2)

(("LI" . 2) ("IV" . 2) ("VE" . 2) (" L" . 2) ("E " . 1) (" A" . 1) ("AN" . 1)
 ("ND" . 1) ("D " . 1) ("LE" . 1) ("ET" . 1) ("T " . 1))

Raku

sub n-gram ($this, $N=2) { Bag.new( flat $this.uc.map: { .comb.rotor($N => -($N-1))».join } ) }
dd 'Live and let live'.&n-gram; # bi-gram
dd 'Live and let live'.&n-gram(3); # tri-gram
Output:
("IV"=>2,"T "=>1,"VE"=>2,"E "=>1,"LE"=>1,"AN"=>1,"LI"=>2,"ND"=>1,"ET"=>1," L"=>2," A"=>1,"D "=>1).Bag
("ET "=>1,"AND"=>1,"LIV"=>2," LI"=>1,"ND "=>1," LE"=>1,"IVE"=>2,"E A"=>1,"VE "=>1,"T L"=>1,"D L"=>1,"LET"=>1," AN"=>1).Bag