Talk:N-grams

Python

I like the use of `deque` and `islice`, but I'm not sure they are needed. Perhaps a more Pythonic implementation could feed a generator expression to `Counter`:

from collections import Counter

def n_grams(text, n=2, topk=10):
    """ Count top-k character n-grams, returning list of 2-tuples (ngram, count)

    >>> text = "Live and let live"
    >>> n_grams(text.upper(), 1)
    [('L', 3), ('E', 3), (' ', 3), ('I', 2), ('V', 2), ('A', 1),
     ('N', 1), ('D', 1), ('T', 1)]
    >>> n_grams(text.upper(), 2)
    [('LI', 2), ('IV', 2), ('VE', 2), (' L', 2), ('E ', 1),
     (' A', 1), ('AN', 1), ('ND', 1), ('D ', 1), ('LE', 1)]
    >>> n_grams(text.upper(), 3)
    [('LIV', 2), ('IVE', 2), ('VE ', 1), ('E A', 1),
     (' AN', 1), ('AND', 1), ('ND ', 1), ('D L', 1), (' LE', 1), ('LET', 1)]
    >>> n_grams(text.upper(), 4)
    [('LIVE', 2), ('IVE ', 1), ('VE A', 1), ('E AN', 1), (' AND', 1),
     ('AND ', 1), ('ND L', 1), ('D LE', 1), (' LET', 1), ('LET ', 1)]
    """
    return Counter(
        text[i:(i + n)] for i in range(len(text) - n + 1)
    ).most_common(topk)

To run the doctests:

>>> import doctest
>>> doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
TestResults(failed=0, attempted=5)

I agree: the existing Python solution is needlessly complicated for most situations, and your solution would be a welcome addition. I suggest we place your solution first, followed by the existing solution under a suitable subheading, much like the Python entries on the Canonicalize CIDR task page.

"Lazy generator function" might be a good subheading. --Jgrprior (talk) 16:02, 30 March 2024 (UTC)
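
For anyone comparing the two approaches, here is a rough sketch of what a lazy, `deque`/`islice`-based generator might look like. It is not a copy of the existing solution on the task page. The names `sliding_window` and `lazy_n_grams` are made up for illustration, and the window logic follows the `sliding_window` recipe in the itertools documentation.

```python
from collections import Counter, deque
from itertools import islice


def sliding_window(iterable, n):
    """Lazily yield successive overlapping n-item windows as tuples."""
    it = iter(iterable)
    # Pre-fill with the first n - 1 items; maxlen drops the oldest item
    # automatically once the window is full.
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)


def lazy_n_grams(text, n=2):
    """Yield character n-grams one at a time (hypothetical helper)."""
    for window in sliding_window(text, n):
        yield "".join(window)


# Counter accepts any iterable, so the generator can be counted directly.
counts = Counter(lazy_n_grams("LIVE AND LET LIVE", 2))
```

The practical difference is that this version also works on iterables that cannot be sliced, such as a file read character by character, whereas the slicing version needs the whole string in memory.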

For consistency with other solutions to this task, may I suggest changing `topk` to default to `None` instead of 10, so that all n-grams are returned by default? It might also be worth uppercasing `text` inside the function rather than expecting the caller to call `text.upper()`, again for consistency with the other task solutions. --Jgrprior (talk) 16:24, 30 March 2024 (UTC)
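
To make the two suggestions above concrete, here is a minimal sketch of how the function might look with both changes applied. It is only an illustration of the proposal, not necessarily the version that ended up on the task page.

```python
from collections import Counter


def n_grams(text, n=2, topk=None):
    """Count character n-grams in the uppercased text.

    Returns a list of (ngram, count) pairs, most common first.
    With topk=None (the suggested default), every n-gram is returned.
    """
    text = text.upper()  # normalise case here instead of in the caller
    return Counter(
        text[i:i + n] for i in range(len(text) - n + 1)
    ).most_common(topk)


all_bigrams = n_grams("Live and let live", 2)         # all 12 distinct bigrams
top_bigrams = n_grams("Live and let live", 2, topk=4)  # only the 4 most common
```

`Counter.most_common(None)` returns every element, so the default no longer truncates the result, and callers who want a short list can still pass `topk` explicitly.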

@Hobson, I've gone ahead and added a Python solution inspired by your implementation. Feel free to edit it and make it your own. --Jgrprior (talk) 14:37, 8 April 2024 (UTC)