Tokenize a string with escaping: Difference between revisions

 
The following code also illustrates an important feature of Python: nested functions with closures.
Owing to this feature, the inner functions, such as <code>start_new_token</code>, are able to access the local variable <code>tokens</code> of their enclosing function <code>tokenize</code>. For the inner function, the name <code>tokens</code> is ''nonlocal'', and is in the ''enclosing scope'' of the inner function (as opposed to the parameters <code>scanner</code> and <code>substring</code>, which are in the local scope).
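The scoping behaviour described above can be sketched in isolation (the names <code>outer</code> and <code>inner</code> here are illustrative, not part of the task's code):

<lang python>def outer():
    tokens = []                # local to outer, in inner's enclosing scope

    def inner(substring):
        # 'tokens' is nonlocal here: it is found in the enclosing
        # scope of inner, not in inner's own local scope
        tokens.append(substring)

    inner('a')                 # the closure mutates outer's variable
    inner('b')
    return tokens              # ['a', 'b']
</lang>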
 
<lang python>import re
# … (intermediate lines elided in this revision view) …
 
if __name__ == '__main__':
print(list(tokenize()))</lang>
 
Output is the same as in the functional Python version above.
 
 
====Simpler version with preprocessing====
 
This version does not require any extra state, such as the <code>tokens</code> list in the Scanner-based version above.
It first preprocesses the input to allow for a simpler regex pattern; then it works only with the primitive regex operations <code>re.findall</code> and <code>re.sub</code>.
Note that the regex used here is compiled with the <code>re.VERBOSE</code> flag.
This allows us to write the regex on several lines (since unescaped whitespace is ignored in this mode), and use comments inside the regex (starting with <code>#</code>).
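As a minimal illustration (separate from the task's code below), <code>re.VERBOSE</code> lets whitespace and <code>#</code> comments appear in a pattern without changing what it matches:

<lang python>import re

pattern = re.compile(r'''
    \d+    # one or more digits
    ''', flags=re.VERBOSE)

print(pattern.findall('a1b22c333'))  # prints ['1', '22', '333']
</lang>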
 
<lang python>import re

STRING = 'one^|uno||three^^^^|four^^^|^cuatro|'

def tokenize(string=STRING, escape='^', separator='|'):

    re_escape, re_separator = map(re.escape, (escape, separator))

    # token regex
    regex = re.compile(fr'''
        # lookbehind: a token must be preceded by a separator
        (?<={re_separator})

        # a token consists either of an escape sequence,
        # or a regular (non-escape, non-separator) character,
        # repeated arbitrarily many times (even zero)
        (?:{re_escape}.|[^{re_escape}{re_separator}])*
        ''',
        flags=re.VERBOSE
    )

    # since each token must start with a separator,
    # we must add an extra separator at the beginning of input
    preprocessed_string = separator + string

    for almost_token in regex.findall(preprocessed_string):
        # now get rid of escape characters: '^^' -> '^' etc.
        token = re.sub(fr'{re_escape}(.)', r'\1', almost_token)
        yield token

if __name__ == '__main__':
    print(list(tokenize()))</lang>
 
=={{header|Racket}}==