Category talk:Wren-pattern: Difference between revisions

Revision as of 12:00, 3 November 2023

3,135 bytes added , 7 months ago

Revision as of 16:49, 31 March 2022, PureFox (Added limited support for 'lazy' matches.)
Latest revision as of 12:00, 3 November 2023, PureFox m (→‎Source code: Now uses Wren S/H lexer.)
(4 intermediate revisions by the same user not shown)
Line 1:
===Wren patterns===
Wren doesn't have direct access to a regular expression library (but see Wren-regex section below) and writing such a library from scratch in a small scripting language such as Wren may not be a viable proposition due to its likely size and performance limitations. I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.

Line 72:
Although the ''standard'' character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.

An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z normally just matches Z itself as it's not possible, of course, to match no characters. However, it does have a special meaning for 'lazy' and 'quantified group' matches which will be covered later.

/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as ''literal'' characters without their special meaning. So /| is a literal vertical bar.

Line 101:
The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.

& followed by any other letter or character behaves exactly the same as if it were preceded by / except that &Z has a special meaning for 'lazy' and 'quantified group' matches which will be covered later.

;Complements

Line 162:
These are patterns which are used in the appropriate methods as replacements for a matching pattern. They are treated as normal text except that they can contain back-references ($0 always refers to the whole match) and a literal $ must be escaped with $$ (not the usual /$).

;Examples

Line 199 ⟶ 191:
These methods do not actually change the 'greedy' nature of the engine but use a hack (replacing text with rarely used control characters and back again after matching) to simulate lazy matching to a limited extent.

;Quantification of groups of characters
Quantifiers always qualify ''singles''. Any departure from this is an error and groups of characters cannot therefore be directly quantified.

There may sometimes be ways to quantify them indirectly either by simply repeating the pattern or by using captures and back-references. For example "[abab|ab|]" would match 'ab' repeated 2, 1 or 0 times and "[abcd]$1$1" would match 'abcd' repeated exactly three times. However, this sort of approach clearly has its limitations.

A different and usually better approach is to use the 'findWithGroup' or 'findWithGroup2' methods. These work in an analogous way to the 'findLazy' and 'findLazy2' methods. However, this time '/Z' and '&Z' respectively match the parameter strings 't' and 'u' themselves rather than any characters other than these strings and we can therefore quantify them as though they were 'singles'.
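
The placeholder idea behind 'findWithGroup' can be sketched on its own, without the Pattern class. The following is only an illustration under stated assumptions: 'collapse' is a hypothetical helper that is not part of the module. It replaces each occurrence of 't' by a single SO character, so that the group can be treated as a 'single', and builds a map from positions in the shortened string back to positions in the original string.

<syntaxhighlight lang="wren">// Hypothetical stand-alone helper, not part of "pattern.wren".
// Replaces every occurrence of 't' in 's' by the single placeholder SO and
// returns the shortened string together with a list which maps each
// position in the shortened string to the corresponding position in 's'.
var collapse = Fn.new { |s, t|
    var SO = "\x0e"
    var s2 = s.replace(t, SO)
    var indexMap = List.filled(s2.count, 0)
    var i = 0  // position in the shortened string
    var j = 0  // position in the original string
    for (c in s2) {
        indexMap[i] = j
        // a placeholder stands for t.count original characters
        if (c == SO) j = j + t.count - 1
        i = i + 1
        j = j + 1
    }
    return [s2, indexMap]
}

var res = collapse.call("xxababy", "ab")
System.print(res[0].count)  // 5
System.print(res[1])        // [0, 1, 2, 4, 6]</syntaxhighlight>

Once a match has been found in the shortened string, its index can be translated through the map and its text restored by replacing SO with 't' again.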
;Wren-regex
Since this module was written, I've created a Wren-regex module which wraps Go's 'regexp' package. Whilst this addresses some shortcomings in Wren-pattern, it requires a special Go executable to use it and doesn't therefore work with Wren-cli.

===Source code===
<syntaxhighlight lang="wren">/* Module "pattern.wren" */

/* Match represents a single successful match made by methods in the Pattern class. */
Line 960 ⟶ 963:
        var text2 = m.text.replace(rep1, t).replace(rep2, u)
        return Match.new_(text2, m.index, captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
    findWithGroup(s, t) {
        var SO = "\x0e"
        s = s.replace(t, SO)
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d = t.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) j = j + d
            i = i + 1
            j = j + 1
        }
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t)
        return Match.new_(text2, indexMap[m.index], captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't' and '&Z' within the pattern as matching
    // the string of literal characters 'u'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
    // SI (shift in) character '0x0f'.
    findWithGroup2(s, t, u) {
        var SO = "\x0e"
        var SI = "\x0f"
        s = s.replace(t, SO).replace(u, SI)
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d1 = t.count - 1
        var d2 = u.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) {
                j = j + d1
            } else if (c == SI) {
                j = j + d2
            }
            i = i + 1
            j = j + 1
        }
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
                               .replace("&Z", SI).replace(Pattern.escape(u), SI)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t).replace(SI, u), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t).replace(SI, u)
        return Match.new_(text2, indexMap[m.index], captures2)
    }
Line 1,036 ⟶ 1,104:
}

Pattern.init_()</syntaxhighlight>
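
The warnings in the comments above about 's' containing the SO or SI characters can be illustrated with a small stand-alone sketch. It is not part of the module, and 'roundTrip' is a hypothetical name used only here. It performs the same substitution and restoration that the 'findWithGroup' method relies on and shows that a string which already contains SO does not survive the round trip.

<syntaxhighlight lang="wren">// Stand-alone illustration, not part of "pattern.wren".
var SO = "\x0e"
var t = "ab"

// Substitute SO for 't' and then restore it, as 'findWithGroup' does
// around the actual matching step.
var roundTrip = Fn.new { |s| s.replace(t, SO).replace(SO, t) }

var clean = "xababy"
var dirty = "x" + SO + "aby"  // already contains the placeholder character

System.print(roundTrip.call(clean) == clean)  // true
System.print(roundTrip.call(dirty) == dirty)  // false: the original SO comes back as "ab"</syntaxhighlight>

In the second case the SO character that was already present is indistinguishable from a placeholder, so both the restored text and any indices derived from it would be wrong.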