Category talk:Wren-pattern: Difference between revisions

Added limited support for 'lazy' matches.
(→‎Source code: Performance improvement for large strings.)
(Added limited support for 'lazy' matches.)
Line 1:
===Wren patterns===
 
Wren doesn't have direct access to a regular expression library and, even if it did, they would be tedious to use as all meta-characters would need to be ''double escaped'' due to the language not supporting ''raw'' strings either. Moreover, writing such a library from scratch in a small scripting language such as Wren may not be a viable proposition due to its likely size and performance limitations.
 
I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.
Line 13:
Multiples, minima, ranges and optionals are referred to as ''quantifiers'' as they specify how many times a ''single'' needs to be repeated for a match to occur.
 
Note that quantifiers are always ''greedy''; they match as many times as they can, never take into account whether the next symbol in the pattern would match and never backtrack to try and change an unsuccessful match into a successful one. Clearly, this needs to be borne in mind when using them in a pattern. However, there is a way to make them act ''lazily'' which will be discussed later.
 
Captures match a group of 0 or more singles and may include alternatives which are considered in turn until a match occurs. Finally, back-references refer to text matched by a previous capture.
Line 72:
Although the ''standard'' character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.
 
An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z normally just matches Z itself as it's not possible, of course, to match no characters. However, it does have a special meaning for 'lazy' matches which will be covered later.
 
/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as ''literal'' characters without their special meaning. So /| is a literal vertical bar.
Line 101:
The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.
 
& followed by any other letter or character behaves exactly the same as if it were preceded by / except that &Z has a special meaning for 'lazy' matches which will be covered later.
 
;Complements
Line 187:
 
"The quick brown [fox|c/l/l]" matches: The quick brown fox|cat|cow.
 
;Lazy matching
 
The pattern "<a>[+0/z]<//a>" fails to match the string "<a>b < c</a>" because +0/z is 'greedy' and matches all characters to the end of the string including the final "</a>".
 
If you change the pattern to "<a>[+0^<]<//a>" it still fails to match the string because of the presence of '<' in the text between the tags.
 
However, if you change the pattern to "<a>[+0/Z]<//a>" and use the 'findLazy' method with a 't' parameter of "</a>", then a match will occur and the captured text will be 'b < c'. What we did here was to replace '/z' or '^<' in the pattern with '/Z' which, in a lazy method, means match any characters other than the parameter string 't'. In the non-lazy 'find' method '/Z' would just match a literal 'Z' as noted earlier.
 
The 'findLazy2' method works in a similar fashion but can carry out 2 lazy matches by using '/Z' for the first and '&Z' for the second which match any characters other than the parameter strings 't' and 'u' respectively.
 
These methods do not actually change the 'greedy' nature of the engine but use a hack (replacing text with rarely used control characters and back again after matching) to simulate lazy matching to a limited extent.
 
===Source code===
Line 799 ⟶ 811:
 
return Match.new_(wm, start, captures)
}
 
// Converts any metacharacters in 'p' to their literal equivalents.
static escape(p) {
var mcs = "/&^@=+#~[]|$"
for (mc in mcs) {
p = p.replace(mc, "/%(mc)")
}
return p
}
 
Line 896 ⟶ 917:
}
return matches
}
 
// As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
// as matching any characters other than the string of literal characters 't'.
// Should not be used if 's' might contain the SO (shift out) character '0x0e'.
findLazy(s, t) {
var SO = "\x0e"
var rep = SO * t.count
s = s.replace(t, rep)
var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep)
var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
var m = p2.find(s)
if (!m) return null
var captures2 = []
for (c in m.captures) {
captures2.add(Capture.new_(c.text.replace(rep, t), c.index))
}
var text2 = m.text.replace(rep, t)
return Match.new_(text2, m.index, captures2)
}
 
// As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
// as matching any characters other than the string of literal characters 't' and '&Z' within
// the pattern as matching any characters other than the string of literal characters 'u'.
// Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
// SI (shift in) character '0x0f'.
findLazy2(s, t, u) {
var SO = "\x0e"
var SI = "\x0f"
var rep1 = SO * t.count
var rep2 = SI * u.count
s = s.replace(t, rep1).replace(u, rep2)
var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep1)
.replace("&Z", "^%(SI)").replace(Pattern.escape(u), rep2)
var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
var m = p2.find(s)
if (!m) return null
var captures2 = []
for (c in m.captures) {
captures2.add(Capture.new_(c.text.replace(rep1, t).replace(rep2, u), c.index))
}
var text2 = m.text.replace(rep1, t).replace(rep2, u)
return Match.new_(text2, m.index, captures2)
}
 
9,479

edits