Talk:N-grams
Python
I like the use of `deque` and `islice`, but I'm not sure they are needed. Perhaps a more Pythonic implementation could use a generator expression:
from collections import Counter

def n_grams(text, n=2, topk=10):
    """ Count top-k character n-grams, returning list of 2-tuples (ngram, count)

    >>> text = "Live and let live"
    >>> n_grams(text.upper(), 1)
    [('L', 3), ('E', 3), (' ', 3), ('I', 2), ('V', 2), ('A', 1),
     ('N', 1), ('D', 1), ('T', 1)]
    >>> n_grams(text.upper(), 2)
    [('LI', 2), ('IV', 2), ('VE', 2), (' L', 2), ('E ', 1),
     (' A', 1), ('AN', 1), ('ND', 1), ('D ', 1), ('LE', 1)]
    >>> n_grams(text.upper(), 3)
    [('LIV', 2), ('IVE', 2), ('VE ', 1), ('E A', 1),
     (' AN', 1), ('AND', 1), ('ND ', 1), ('D L', 1), (' LE', 1), ('LET', 1)]
    >>> n_grams(text.upper(), 4)
    [('LIVE', 2), ('IVE ', 1), ('VE A', 1), ('E AN', 1), (' AND', 1),
     ('AND ', 1), ('ND L', 1), ('D LE', 1), (' LET', 1), ('LET ', 1)]
    """
    return Counter(
        text[i:(i + n)] for i in range(len(text) - n + 1)
    ).most_common(topk)
To run the doctests:
>>> import doctest
>>> doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
TestResults(failed=0, attempted=5)
- I agree, the existing Python solution is needlessly complicated for most situations and your solution would be a welcome addition. I suggest we place your solution first, followed by the existing solution under a suitable subheading, a bit like the Python entries on the Canonicalize CIDR task page, for example.
- For consistency with other solutions to this task, may I suggest that you change `topk` to default to `None` instead of `10`, so all n-grams are returned by default. And maybe force `text` to uppercase inside your function rather than expecting the caller to do `text.upper()`, again, just for consistency with the other task solutions. --Jgrprior (talk) 16:24, 30 March 2024 (UTC)
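For reference, the two changes suggested above might look something like the sketch below. This is only one possible reading of the suggestion, not the version that ended up on the task page: it assumes `topk=None` is passed straight through to `Counter.most_common`, which then returns every n-gram, and that the upper-casing happens on the first line of the function.

```python
from collections import Counter


def n_grams(text, n=2, topk=None):
    """Count character n-grams in `text`, most common first.

    Sketch of the suggested variant: `topk=None` returns every n-gram,
    and the text is upper-cased here rather than by the caller.
    """
    text = text.upper()  # assumption: case folding moves inside the function
    return Counter(
        text[i:i + n] for i in range(len(text) - n + 1)
    ).most_common(topk)  # most_common(None) returns all entries


# All bigrams by default, or only the top three when topk is given.
print(n_grams("Live and let live", 2))
print(n_grams("Live and let live", 2, topk=3))
```

Callers who still want only the ten most common n-grams can pass `topk=10` explicitly, so the earlier behaviour remains available.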