Category talk:Wren-pattern: Difference between revisions
(Added limited support for 'lazy' matches.)
m (→Source code: Now uses Wren S/H lexer.)
(4 intermediate revisions by the same user not shown)
Line 1:
===Wren patterns===
Wren doesn't have direct access to a regular expression library (but see the Wren-regex section below), and writing one from scratch in a small scripting language such as Wren may not be viable given the library's likely size and performance limitations.
I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.
Line 72:
Although the ''standard'' character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.
An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z normally just matches Z itself as it's not possible, of course, to match no characters. However, it does have a special meaning for 'lazy' and 'quantified group' matches which will be covered later.
/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as ''literal'' characters without their special meaning. So /| is a literal vertical bar.
Line 101:
The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.
& followed by any other letter or character behaves exactly the same as if it were preceded by / except that &Z has a special meaning for 'lazy' and 'quantified group' matches which will be covered later.
;Complements
Line 162:
These are patterns which are used in the appropriate methods as replacements for a matching pattern. They are treated as normal text except that they can contain back-references ($0 always refers to the whole match) and a literal $ must be escaped with $$ (not the usual /$).
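To make the escaping rule concrete, here is a minimal sketch in plain Wren of how a replacement template might be expanded. It is not the module's own code: the function name 'expand', the 'groups' list (index 0 holding the whole match) and the restriction to single-digit back-references are all assumptions made for illustration.

<syntaxhighlight lang="wren">
// Hypothetical helper, not part of Wren-pattern.
// 'groups[0]' is assumed to be the whole match and 'groups[n]' the nth capture.
var expand = Fn.new { |template, groups|
    var cs = template.toList          // list of single-character strings
    var out = ""
    var i = 0
    while (i < cs.count) {
        var c = cs[i]
        var next = (i + 1 < cs.count) ? cs[i + 1] : ""
        if (c == "$" && next == "$") {
            out = out + "$"           // '$$' yields a literal dollar sign
            i = i + 2
        } else if (c == "$" && next != "" && "0123456789".contains(next)) {
            out = out + groups[Num.fromString(next)]  // '$n' yields group n
            i = i + 2
        } else {
            out = out + c             // everything else is ordinary text
            i = i + 1
        }
    }
    return out
}

System.print(expand.call("$1 costs $$$0", ["42", "tea"]))  // tea costs $42
</syntaxhighlight>

A back-reference to a group that does not exist would abort with an index error in this sketch; how the module itself handles that case is not stated here.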
;Examples
Line 199 ⟶ 191:
These methods do not actually change the 'greedy' nature of the engine but use a hack (replacing text with rarely used control characters and back again after matching) to simulate lazy matching to a limited extent.
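The idea behind the hack can be sketched in plain Wren without the module. The example below is only an illustration under stated assumptions: the function name 'takeUntil' and the choice of "\x02" as the placeholder are hypothetical, and the real methods may use different control characters. It shows why collapsing a multi-character terminator into one rarely used character lets a greedy, single-character scan stop at the first occurrence.

<syntaxhighlight lang="wren">
// Hypothetical sketch, not the module's own code.
var STX = "\x02"   // assumed placeholder; must not occur in the input

// Returns the text of 's' up to (not including) the first occurrence of 't'.
var takeUntil = Fn.new { |s, t|
    var s2 = s.replace(t, STX)   // the terminator becomes a single character
    var out = ""
    for (c in s2) {
        if (c == STX) break      // a greedy scan stops at the first placeholder
        out = out + c
    }
    return out
}

System.print(takeUntil.call("<b>bold</b> and <b>more</b>", "</b>"))  // <b>bold
</syntaxhighlight>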
;Quantification of groups of characters
Quantifiers always qualify ''singles''. Any departure from this is an error and groups of characters cannot therefore be directly quantified.
There may sometimes be ways to quantify them indirectly either by simply repeating the pattern or by using captures and back-references. For example "[abab|ab|]" would match 'ab' repeated 2, 1 or 0 times and "[abcd]$1$1" would match 'abcd' repeated exactly three times. However, this sort of approach clearly has its limitations.
A different and usually better approach is to use the 'findWithGroup' or 'findWithGroup2' methods. These work in an analogous way to the 'findLazy' and 'findLazy2' methods. However, this time '/Z' and '&Z' respectively match the parameter strings 't' and 'u' themselves rather than any characters other than these strings and we can therefore quantify them as though they were 'singles'.
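Because each occurrence of the group is collapsed to a single character before matching, any index found in the collapsed string has to be translated back to the original string. The following stand-alone sketch mirrors that bookkeeping; the function name 'buildIndexMap' is hypothetical and the input is assumed to be ASCII and free of the SO character.

<syntaxhighlight lang="wren">
// Hypothetical helper, not part of Wren-pattern.
// Returns the collapsed string and a list mapping each index in it
// to the corresponding index in the original string.
var buildIndexMap = Fn.new { |s, t|
    var SO = "\x0e"
    var s2 = s.replace(t, SO)               // each 't' becomes one character
    var indexMap = List.filled(s2.count, 0)
    var i = 0                               // index in the collapsed string
    var j = 0                               // index in the original string
    for (c in s2) {
        indexMap[i] = j
        j = j + ((c == SO) ? t.count : 1)   // a placeholder spans all of 't'
        i = i + 1
    }
    return [s2, indexMap]
}

var res = buildIndexMap.call("xxababy", "ab")
System.print(res[0].count)  // 5
System.print(res[1])        // [0, 1, 2, 4, 6]
</syntaxhighlight>

Here the two occurrences of 'ab' sit at indices 2 and 3 of the collapsed string, which map back to indices 2 and 4 of the original.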
;Wren-regex
Since this module was written, I've created a Wren-regex module which wraps Go's 'regexp' package. Whilst this addresses some shortcomings in Wren-pattern, it requires a special Go executable to use it and doesn't therefore work with Wren-cli.
===Source code===
<
/* Match represents a single successful match made by methods in the Pattern class.
Line 960 ⟶ 963:
        var text2 = m.text.replace(rep1, t).replace(rep2, u)
        return Match.new_(text2, m.index, captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
    findWithGroup(s, t) {
        var SO = "\x0e"
        s = s.replace(t, SO)
        // Map each index in the substituted string back to its index in the original string.
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d = t.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) j = j + d
            i = i + 1
            j = j + 1
        }
        // Both '/Z' and any literal occurrence of 't' in the pattern become the placeholder.
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        // Restore 't' in the matched text and captures, and translate their indices.
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t)
        return Match.new_(text2, indexMap[m.index], captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't' and '&Z' within the pattern as matching
    // the string of literal characters 'u'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
    // SI (shift in) character '0x0f'.
    findWithGroup2(s, t, u) {
        var SO = "\x0e"
        var SI = "\x0f"
        s = s.replace(t, SO).replace(u, SI)
        // Map each index in the substituted string back to its index in the original string.
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d1 = t.count - 1
        var d2 = u.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) {
                j = j + d1
            } else if (c == SI) {
                j = j + d2
            }
            i = i + 1
            j = j + 1
        }
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
                               .replace("&Z", SI).replace(Pattern.escape(u), SI)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        // Restore 't' and 'u' in the matched text and captures, and translate their indices.
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t).replace(SI, u), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t).replace(SI, u)
        return Match.new_(text2, indexMap[m.index], captures2)
    }
Line 1,036 ⟶ 1,104:
}
Pattern.init_()</