Category talk:Wren-pattern

Wren patterns

Wren doesn't have direct access to a regular expression library and, even if it did, regular expressions would be tedious to use because every meta-character would need to be double-escaped, as the language doesn't support raw strings either. Moreover, writing such a library from scratch in a small scripting language such as Wren may not be a viable proposition due to its likely size and performance limitations.

I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.

Pattern

A pattern is a Wren string (i.e. a sequence of Unicode code-points) which in addition to normal characters can contain one or more of the following: character classes, extended classes, complements, either-cases, multiples, minima, ranges, optionals, captures and back-references.

Normal characters, character classes, extended classes, complements and either-cases are referred to collectively as singles since they always match a single character if they can.

Multiples, minima, ranges and optionals are referred to as quantifiers as they specify how many times a single needs to be repeated for a match to occur.

Note that quantifiers are always greedy; they match as many times as they can, never take into account whether the next symbol in the pattern would match and never backtrack to try and change an unsuccessful match into a successful one. Clearly, this needs to be borne in mind when using them in a pattern.
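The effect of greedy, non-backtracking quantifiers can be illustrated with a small stand-alone sketch. It does not use the module's code; greedyCount and isDigit are hypothetical helpers written only for this illustration.

```wren
// Hypothetical helper (not part of the module): counts how many code points,
// starting at index 'start', satisfy 'pred'. It never gives any of them back.
var greedyCount = Fn.new { |codes, start, pred|
    var n = 0
    while (start + n < codes.count && pred.call(codes[start + n])) n = n + 1
    return n
}

var isDigit = Fn.new { |c| c >= 48 && c <= 57 }

// A pattern such as "+1/d5" asks for one or more digits followed by a literal 5.
var codes = "12345".codePoints.toList
var taken = greedyCount.call(codes, 0, isDigit)

// All five digits are consumed by the minimum, so nothing is left for the
// literal 5 (code point 53) and, with no backtracking, the match fails.
var literalMatches = taken < codes.count && codes[taken] == 53
System.print(taken)          // 5
System.print(literalMatches) // false
```

In a case like this, a pattern which names the final character separately from the repeated class (for example a literal that the class cannot match) avoids the problem.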

Captures match a group of 0 or more singles and may include alternatives which are considered in turn until a match occurs. Finally, back-references refer to text matched by a previous capture.

Twelve meta-characters are needed and have been chosen so as not to need escaping in Wren itself (which rules out \ and %) nor be commonly used punctuation characters in English text.

/ character class
& extended class
^ complement
@ either-case
= multiple
+ minimum
# range
~ optional
[ capture start
] capture end
| separates alternatives within a capture
$ back-reference

When used literally, all meta-characters must be escaped by preceding them with / (or &) unless otherwise noted below.

There are 4 kinds of patterns depending on whether they match: anywhere within a string, from the start of the string, up to the end of the string or the whole string.
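As a rough sketch of how the four kinds differ, the following stand-alone function decides whether a candidate match would be acceptable for each kind. The function accepts and its string labels are hypothetical and chosen for illustration; they are not the module's API.

```wren
// Hypothetical helper: given where a candidate match starts (code point index),
// how long it is and the length of the whole string, decides whether a pattern
// of the given kind would accept it.
var accepts = Fn.new { |kind, index, length, total|
    if (kind == "start") return index == 0
    if (kind == "end")   return index + length == total
    if (kind == "whole") return index == 0 && length == total
    return true  // "within": anywhere will do
}

// The text "dog" found at index 4 of "hot dog" (7 code points):
System.print(accepts.call("within", 4, 3, 7))  // true
System.print(accepts.call("start",  4, 3, 7))  // false
System.print(accepts.call("end",    4, 3, 7))  // true
System.print(accepts.call("whole",  4, 3, 7))  // false
```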

Character classes

These are single letters preceded by the / meta-character. The following always match exactly one character:

class   name       contents

 a   alphabetic   a-z A-Z
 b   binary       0-1
 c   control      \x00-\x1f and \x7f (codes 0-31 and 127)
 d   decimal      0-9
 e   exponent     0-9 and .+-eE
 f   float        0-9 and .
 g   graphic      all printable ASCII except space (codes 33-126)
 h   hex          0-9 a-f A-F
 i   i class      0-2 (but can be redefined) 
 j   j class      0-3 (but can be redefined)
 k   k class      0-4 (but can be redefined)
 l   lower case   a-z
 m   m class      a-z 0-9
 n   n class      +- (sign)
 o   octal        0-7
 p   punctuation  as graphic but excluding alphabetic and decimal
 q   quote        ", ', ` (double, single or back quotes)
 r   r class      all ASCII (codes 0-127)
 s   space        \t, \n, \f, \v, \r and space
 t   t class      punctuation plus space
 u   upper case   A-Z
 v   v class      A-Z 0-9
 w   word         a-z A-Z 0-9
 x   x class      a-z A-Z 0-9 and _ (word plus underscore)
 y   y class      a-z A-Z 0-9 and _'- (x class plus apostrophe and hyphen)
 z   z class      any character at all including non-ASCII

Although the standard character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.

An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z just matches Z itself as it's not possible, of course, to match no characters.

/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as literal characters without their special meaning. So /| is a literal vertical bar.
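The relationship between a lower case class and its upper case complement can be sketched as follows. This is a simplified stand-alone illustration covering only three classes; the names classes and matchesClass are hypothetical and it ignores the special case of /Z.

```wren
// Hypothetical stand-alone sketch: a few class predicates keyed by lower case letter.
var classes = {
    "a": Fn.new { |c| (c >= 65 && c <= 90) || (c >= 97 && c <= 122) },
    "d": Fn.new { |c| c >= 48 && c <= 57 },
    "s": Fn.new { |c| c == 32 || (c >= 9 && c <= 13) }
}

// Returns whether code point 'c' matches the class named by 'letter' ("a", "A", "d" ...).
var matchesClass = Fn.new { |letter, c|
    var code = letter.codePoints.toList[0]
    if (code >= 65 && code <= 90) {
        // upper case letter: complement of the lower case class
        var lower = String.fromCodePoint(code + 32)
        return !classes[lower].call(c)
    }
    return classes[letter].call(c)
}

System.print(matchesClass.call("d", 55))   // true:  '7' is a decimal digit
System.print(matchesClass.call("D", 55))   // false: /D excludes digits
System.print(matchesClass.call("A", 233))  // true:  code 233 is not in a-z or A-Z
```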

Extended classes

These are single letters preceded by the & meta-character. They extend character classes to include controls, punctuation and upper/lower case letters with code-points in the range 128-255. This is useful when matching text written in Western European languages other than English. The following always match exactly one character.

class   name       contents

 a   alphabetic   a-z A-Z and codes: 181, 192-214, 216-222, 223-246 and 248-255
 c   control      \x00-\x1f, \x7f and \x80-\x9f (codes 0-31, 127 and 128-159)
 g   graphic      all printable except space and non-breaking space (codes 33-126 and 161-255)  
 l   lower case   a-z and codes: 181, 223-246, 248-255
 m   m class      as class l plus 0-9
 n   n class      +-± (sign plus code 177)
 p   punctuation  as graphic but excluding alphabetic and decimal
 q   quote        ", ', `, «, » (double, single, back and double-angle quotes)
 r   r class      all extended (codes 0-255)
 s   space        \t, \n, \f, \v, \r, space and non-breaking space (code 160)
 t   t class      punctuation plus space
 u   upper case   A-Z and codes: 192-214, 216-222
 v   v class      as class u plus 0-9
 w   word         as class a plus 0-9
 x   x class      as class w plus _ (word plus underscore)
 y   y class      as class x plus '- and code 173 (x class plus apostrophe, hyphen and soft hyphen)

The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.

& followed by any other letter or character behaves exactly the same as if it were preceded by /.

Complements

A complement is any literal character (but no other kind of single) preceded by ^. It matches any character other than the literal character itself. For example, ^a matches any character other than a.

When ^ is followed by a meta-character, it treats that character as a literal. So ^/ matches any character except / and ^^ matches anything other than ^.

Either-cases

An either-case is any literal character (but no other kind of single) preceded by @.

If it's a lower case letter, it matches either the character itself or its upper case equivalent provided that both have code-points less than 256. For example @a matches either a or A.

If it's an upper case letter, it matches any character other than the letter itself and its lower case equivalent, provided that both have code-points less than 256. For example @A matches neither A nor a but would match b.

Non-letters simply match themselves.
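The three either-case rules can be summarised in a short stand-alone sketch. It is restricted to ASCII letters for brevity and eitherCase is a hypothetical helper, not the module's own code.

```wren
// Hypothetical stand-alone sketch of the either-case rules for ASCII letters only.
// 'd' is the code point following @ in the pattern, 'c' the code point in the string.
var eitherCase = Fn.new { |d, c|
    if (d >= 97 && d <= 122) return c == d || c == d - 32   // lower: itself or its upper case
    if (d >= 65 && d <= 90)  return c != d && c != d + 32   // upper: anything but the letter
    return c == d                                           // non-letter: itself only
}

System.print(eitherCase.call(97, 65))  // true:  @a matches A
System.print(eitherCase.call(65, 97))  // false: @A does not match a
System.print(eitherCase.call(65, 98))  // true:  @A matches b
System.print(eitherCase.call(45, 45))  // true:  @- matches -
```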

Multiples

A multiple specifies the exact number of times the following single should be repeated for a match to occur. It is represented by = followed by a digit then by the single to be repeated. If the digit is between 2 and 9 then this is the number of repeats. If the digit is 0 or 1 then the number of repeats is 10 or 11 respectively. For example =3x matches 3 consecutive x's and =2/d matches 2 consecutive digits. (With regard to the last example, it would be just as easy and perhaps clearer to write /d/d).
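The mapping from the digit to the number of repeats is simple enough to state in code. The helper repeatsFor below is hypothetical and only restates the rule just described.

```wren
// Hypothetical helper: a digit from 2 to 9 is the repeat count itself,
// while 0 and 1 stand for 10 and 11 respectively.
var repeatsFor = Fn.new { |digit|
    if (digit < 2) return digit + 10
    return digit
}

System.print(repeatsFor.call(3))  // 3
System.print(repeatsFor.call(0))  // 10
System.print(repeatsFor.call(1))  // 11
```

So =0/d would ask for exactly ten consecutive digits.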

Minima

A minimum specifies the minimum number of times the following single should be repeated for a match to occur. It is represented by + followed by a digit then by the single to be repeated. The digit can be between 0 and 9 and is the minimum number of repetitions. So +0/w matches 0 or more word characters and +1a matches one or more a's.

Ranges

A range specifies the range of times the following single should be repeated for a match to occur. It is represented by # followed by two digits and then the single to be repeated.

The first digit must be between 0 and 8.

The second digit must be between 1 and 9 and must be more than the first digit.

For example #13b matches a sequence of between 1 and 3 b's and #04C matches between 0 and 4 C's.
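A minimal sketch of how a range such as #13b might be applied to a string is shown below. It handles a literal single only and matchRange is a hypothetical helper written for illustration, not the module's implementation.

```wren
// Hypothetical helper: applies a range lo..hi of the literal code point 'ch'
// starting at index 'start'. Returns the number of code points consumed,
// or -1 if fewer than 'lo' repetitions were available.
var matchRange = Fn.new { |codes, start, ch, lo, hi|
    var n = 0
    while (n < hi && start + n < codes.count && codes[start + n] == ch) n = n + 1
    if (n < lo) return -1
    return n
}

var b = 98 // code point of 'b'
System.print(matchRange.call("bbbbb".codePoints.toList, 0, b, 1, 3))  // 3  (stops at the upper bound)
System.print(matchRange.call("bxb".codePoints.toList, 0, b, 1, 3))    // 1
System.print(matchRange.call("xbb".codePoints.toList, 0, b, 1, 3))    // -1 (lower bound not met)
```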

Optionals

An optional matches the following single zero or one times. It is represented by ~ followed by the single to be matched. For example ~/d matches zero or one digits.

It is shorthand for the range #01 but is so common that it was felt it should have its own meta-character.

Captures

These are sequences of 1 or more mini-patterns to be matched in order. At least one of the mini-patterns needs to be matched successfully for the capture as a whole to match. A mini-pattern can contain any of the elements which a normal pattern can contain except another capture and may be empty. Captures cannot overlap each other.

Captures begin with [ and end with ]. Individual mini-patterns are separated by |. For example [cat|dog] matches either cat or dog and [/d/d|/a/a|/p/p] matches two digits, letters or punctuation symbols.
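The way alternatives are considered in turn can be sketched as follows. For simplicity the alternatives here are plain literal strings rather than full mini-patterns, and firstAlternative is a hypothetical helper, not part of the module.

```wren
// Hypothetical stand-alone sketch: tries each literal alternative in order at
// code point index 'start' and returns the first one that fits, or null.
var firstAlternative = Fn.new { |s, start, alternatives|
    var codes = s.codePoints.toList
    for (alt in alternatives) {
        var ac = alt.codePoints.toList
        var ok = start + ac.count <= codes.count
        var i = 0
        while (ok && i < ac.count) {
            if (codes[start + i] != ac[i]) ok = false
            i = i + 1
        }
        if (ok) return alt
    }
    return null
}

System.print(firstAlternative.call("hotdog", 3, ["cat", "dog"]))  // dog
System.print(firstAlternative.call("hotdog", 0, ["cat", "dog"]))  // null
```

Note that an empty alternative always fits in this sketch, which is why it is normally placed last.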

Captures cannot be qualified by a quantifier and are stored separately with each match. There can be a maximum of 9 captures within the overall pattern.

Back-references

A back-reference is a $ followed by a digit from 1 to 9 which identifies a previous capture by its number. So $1 refers to the first capture's text.

$0 refers to the whole of the text matched so far (before the current capture started if the back-reference is within a capture mini-pattern). It is an error to refer back to captures which have not yet taken place or been completed.

Back-references cannot be qualified by a quantifier but can appear within a mini-pattern of a subsequent capture.

Replacement patterns

These are patterns which are used in the appropriate methods as replacements for a matching pattern. They are treated as normal text except that they can contain back-references ($0 always refers to the whole match) and a literal $ must be escaped with $$ (not the usual /$).
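The following stand-alone sketch shows one way such a replacement pattern could be expanded once a match and its captures are known. The function expand is hypothetical, and its treatment of a $ that is not followed by $ or a usable digit (it is kept as it is) is an assumption rather than the module's documented behaviour.

```wren
// Hypothetical stand-alone sketch: expands $0 to $9 and $$ in a replacement string.
// 'whole' is the text of the match and 'caps' a list of capture texts.
var expand = Fn.new { |replacement, whole, caps|
    var rc = replacement.codePoints.toList
    var out = ""
    var i = 0
    while (i < rc.count) {
        var c = rc[i]
        if (c == 36 && i + 1 < rc.count) {      // dollar sign with a following character
            var d = rc[i + 1]
            if (d == 36) {                      // $$ is a literal dollar
                out = out + "$"
                i = i + 1
            } else if (d == 48) {               // $0 is the whole match
                out = out + whole
                i = i + 1
            } else if (d >= 49 && d <= 57 && d - 48 <= caps.count) {
                out = out + caps[d - 49]        // $1 to $9 are capture texts
                i = i + 1
            } else {
                out = out + "$"                 // assumption: anything else is kept as it is
            }
        } else {
            out = out + String.fromCodePoint(c)
        }
        i = i + 1
    }
    return out
}

System.print(expand.call("$2 and $1 cost $$5 ($0)", "cat dog", ["cat", "dog"]))
// dog and cat cost $5 (cat dog)
```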

Quantification of groups of characters

Quantifiers always qualify singles. Any departure from this is an error and groups of characters cannot therefore be directly quantified.

There may sometimes be ways to quantify them indirectly either by simply repeating the pattern or by using captures and back-references. For example "[abab|ab|]" would match 'ab' repeated 2, 1 or 0 times and "[abcd]$1$1" would match 'abcd' repeated exactly three times.

However, this sort of approach clearly has its limitations and there is no way to match a group of characters repeated an indefinite number of times.

Examples

"/d/d/d/d-/d/d-/d/d" matches a date in yyyy-mm-dd format. It could also be written: "=4/d-/d/d-/d/d".

"/u/u=9/v/d" matches an ISIN though doesn't of course validate the check-digit.

If we define:

i = "BCDFGHJKLMNPQRSTVWXYZ" (all letters excluding vowels)

j = i + "0123456789" (as i plus digits)

"/i=5/j/d" should then match a new SEDOL though again without validating the check digit.

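To make the SEDOL example concrete, here is a stand-alone sketch of what "/i=5/j/d" is intended to accept when the whole string must match. It does not use the module; allIn and looksLikeSedol are hypothetical helpers and the digits string is introduced only for this illustration.

```wren
var i = "BCDFGHJKLMNPQRSTVWXYZ"   // all letters excluding vowels
var j = i + "0123456789"          // as i plus digits
var digits = "0123456789"

// Returns whether every code point at indices from..to (inclusive) occurs in 'allowed'.
var allIn = Fn.new { |codes, from, to, allowed|
    var ac = allowed.codePoints.toList
    for (n in from..to) {
        if (!ac.contains(codes[n])) return false
    }
    return true
}

// Mirrors "/i=5/j/d": one i character, five j characters, then one decimal digit.
var looksLikeSedol = Fn.new { |s|
    var codes = s.codePoints.toList
    if (codes.count != 7) return false
    return allIn.call(codes, 0, 0, i) && allIn.call(codes, 1, 5, j) && allIn.call(codes, 6, 6, digits)
}

System.print(looksLikeSedol.call("B0YBKJ7"))  // true
System.print(looksLikeSedol.call("A0YBKJ7"))  // false (A is a vowel)
```

As with the pattern itself, the check digit is not validated.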
"The quick brown [fox|c/l/l]" matches: The quick brown fox|cat|cow.

Source code

<lang ecmascript>/* Module "pattern.wren" */

/* Match represents a single successful match made by methods in the Pattern class.

  Match objects are immutable.

*/

class Match {

   // Constructs a Match object from the text of the match, its starting index as a codepoint offset
   // from the start of the string and its capture list. This is a private constructor
   // intended to be called from the Pattern class as there should be no need for the user
   // to construct Match objects directly.
   construct new_(text, index, captures) {
       if (!(text is String)) Fiber.abort("Match text must be a string.")
       if (!((index is Num) && index.isInteger && index >= 0)) {
           Fiber.abort("Match index must be a non-negative integer.")
       }
       if (!(captures is List)) Fiber.abort("Match captures must be a list of Capture objects.")
       _text = text
       _index = index
       _captures = captures
   }

   // Properties.
   text     { _text }                                // the text of the match
   index    { _index }                               // its starting index (codepoints)
   length   { _text.count }                          // its length
   span     { [_index, index + length - 1] }         // a list of its starting and ending indices
   captures { _captures.toList }                     // the Capture objects associated with the match
   capsText { _captures.map { |c| c.text }.toList }  // a list of each capture's text property

   // String representation (excluding captures)
   toString { "{ text = %(_text), index = %(_index), length = %(length) }" }

}

/* Capture represents a single successful capture made by methods in the Pattern class.

  Capture objects are immutable.

*/

class Capture {

   // Constructs a capture object from the text of the capture and its starting index
   // as a codepoint offset from the start of the string. This is a private constructor
   // intended to be called from the Pattern class as there should be no need for the user
   // to construct Capture objects directly.
   construct new_(text, index) {
       if (!(text is String)) Fiber.abort("Capture text must be a string.")
       if (!((index is Num) && index.isInteger && index >= 0)) {
           Fiber.abort("Capture index must be a non-negative integer.")
       }
       _text = text
       _index = index
   }

   // Properties.
   text     { _text }                        //  the text of the capture
   index    { _index }                       //  its starting index (codepoints)
   length   { _text.count }                  //  its length
   span     { [_index, index + length - 1] } //  a list of its starting and ending indices

   // String representation.
   toString { "{ text = %(_text), index = %(_index), length = %(length) }" }

}

/* Pattern represents a pattern to be used for matching characters within a string.

  A Pattern object is immutable.

*/

class Pattern {

   // Constant pattern types.
   static within { 0 }   // matches anywhere within a string
   static start  { 1 }   // matches only at the start of a string
   static end    { 2 }   // matches only at the end of a string
   static whole  { 3 }   // matches the whole of a string

   static types { ["within", "start", "end", "whole"] }

   // Constants to help construct user-defined patterns.
   static lower  { "abcdefghijklmnopqrstuvwxyz" }
   static upper  { "ABCDEFGHIJKLMNOPQRSTUVWXYZ" }
   static letter { lower + upper }
   static digit  { "0123456789" }
   static alpha  { letter + digit }

   // Private method to initialize function tables and back-reference symbols.
   static init_() {
       // character classes
       __fns = [
           Fn.new { |c| (c >= 65 && c <= 90) || (c >= 97 && c <= 122) },                         // a
           Fn.new { |c|  c == 48 || c == 49 },                                                   // b
           Fn.new { |c|  c <  32 || c == 127 },                                                  // c
           Fn.new { |c|  c >= 48 && c <= 57 },                                                   // d
           Fn.new { |c| (c >= 48 && c <= 57) || ".+-Ee".codePoints.contains(c) },                // e
           Fn.new { |c| (c >= 48 && c <= 57) || c == 46 },                                       // f
           Fn.new { |c|  c >= 33 && c < 127 },                                                   // g
           Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 70) || (c >= 97 && c <= 102) }, // h
           Fn.new { |c, p|  p.i.codePoints.contains(c) },                                        // i
           Fn.new { |c, p|  p.j.codePoints.contains(c) },                                        // j
           Fn.new { |c, p|  p.k.codePoints.contains(c) },                                        // k
           Fn.new { |c|  c >= 97 && c <= 122 },                                                  // l
           Fn.new { |c| (c >= 97 && c <= 122) || (c >= 48 && c <= 57) },                         // m
           Fn.new { |c|  c == 43 || c == 45 },                                                   // n
           Fn.new { |c|  c >= 48 && c <= 55 },                                                   // o
           Fn.new { |c| (c >= 33 && c < 127) && !__fns[22].call(c) },                            // p
           Fn.new { |c|  c == 34 || c == 39 || c == 96 },                                        // q
           Fn.new { |c|  c < 128 },                                                              // r
           Fn.new { |c|  c == 32 || (c >= 9 && c <= 13) },                                       // s
           Fn.new { |c| ((c >= 9 && c <= 13) || (c >= 32 && c < 127)) && !__fns[22].call(c) },   // t
           Fn.new { |c|  c >= 65 && c <= 90 },                                                   // u
           Fn.new { |c| (c >= 65 && c <= 90) || (c >= 48 && c <= 57) },                          // v
           Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 90) || (c >= 97 && c <= 122) }, // w
           Fn.new { |c|  __fns[22].call(c) || c == 95 },                                         // x
           Fn.new { |c|  __fns[22].call(c) || c == 95 || c == 39 || c == 45 },                   // y
           Fn.new { |c|  true }                                                                  // z
       ]
                                                         
       // extended classes
       __fns2 = [
           Fn.new { |c|  __fns2[11].call(c) || __fns2[20].call(c) },                               // a
           Fn.new { |c|  c == 48 || c == 49 },                                                     // b
           Fn.new { |c|  c <  32 || (c >=  127 && c < 160) },                                      // c
           Fn.new { |c|  c >= 48 && c <= 57 },                                                     // d
           Fn.new { |c| (c >= 48 && c <= 57) || ".+-Ee".codePoints.contains(c) },                  // e
           Fn.new { |c| (c >= 48 && c <= 57) || c == 46 },                                         // f
           Fn.new { |c| (c >= 33 && c < 127) || (c >= 161 && c <= 255) },                          // g
           Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 70) || (c >= 97 && c <= 102) },   // h
           Fn.new { |c, p|  p.i.codePoints.contains(c) },                                          // i
           Fn.new { |c, p|  p.j.codePoints.contains(c) },                                          // j
           Fn.new { |c, p|  p.k.codePoints.contains(c) },                                          // k
           Fn.new { |c| (c >= 97 && c <= 122) || c == 181 || (c >= 223 && c <= 255 && c != 247) }, // l  
           Fn.new { |c|  __fns2[11].call(c) || (c >= 48 && c <= 57) },                             // m
           Fn.new { |c|  c == 43 || c == 45 || c == 177 },                                         // n
           Fn.new { |c|  c >= 48 && c <= 55 },                                                     // o
           Fn.new { |c|  __fns2[6].call(c) && !__fns2[22].call(c) },                               // p
           Fn.new { |c|  c == 34 || c == 39 || c == 96 || c == 171 || c == 187 },                  // q
           Fn.new { |c|  c < 256 },                                                                // r
           Fn.new { |c|  c == 32 || (c >= 9 && c <= 13) || c == 160 },                             // s
           Fn.new { |c|  __fns2[15].call(c) || __fns2[18].call(c) },                               // t
           Fn.new { |c| (c >= 65 && c <= 90) || (c >= 192 && c <= 222 && c != 215) },              // u
           Fn.new { |c|  __fns2[20].call(c) || (c >= 48 && c <= 57) },                             // v
           Fn.new { |c| (c >= 48 && c <= 57) || __fns2[0].call(c) },                               // w
           Fn.new { |c|  __fns2[22].call(c) || c == 95 },                                          // x
           Fn.new { |c|  __fns2[22].call(c) || c == 95 || c == 39 || c == 45 || c == 173 },        // y
           Fn.new { |c|  true }                                                                    // z
       ]

       // back reference symbols
       __backRefs = ["$1", "$2", "$3", "$4", "$5", "$6", "$7", "$8", "$9"]
   }

   // Returns a list of the text properties of each match in matches.
   static matchesText(matches) { matches.map { |m| m.text }.toList }

   // Returns whether a pattern string is valid or not.
   static validate(pattern) {
       if (!((pattern is String) && pattern != "")) return false
       return !Fiber.new {
           validate_(pattern)
       }.try()
   }

   // Private worker method to validate and tokenize a pattern and get its minimum matching length.
   static validate_(pattern) {
       var min = 0                        // minimum length
       var pc = pattern.codePoints.toList // pattern codepoints
       var lpc = pc.count                 // pattern length
       var i = 0                          // codepoint index
       var cap = false                    // whether within a capture
       var captures = []                  // stores min length for each capture
       var curMin = 0                     // minimum length of current mini-pattern
       var capMin = 0                     // minimum length of current capture
       var c = 0                          // current codepoint
       var tokens = []                    // tokenize pattern to make subsequent matching easier

       // Increments min or curMin.
       var increment = Fn.new { |imin|
           if (!cap) {
               min = min + imin
           } else {
               curMin = curMin + imin
           }
       }

       // Handles the slash or ampersand metacharacters.
       var slashOrAmp = Fn.new { |reps|
           i = i + 1
           if (i == lpc) Fiber.abort("Invalid pattern - missing character at index %(i).")
           var d = pc[i]
           if (reps > 0) increment.call(reps)
           if (d >= 97 && d <= 122) {
               tokens.add(-c)
               tokens.add(d - 97)
           } else if (d >= 65 && d <= 89) {
               tokens.add(-c)
               tokens.add(d - 39)
           } else {
               tokens.add(d)
           }
       }

       // Handles the caret or at sign metacharacters.
       var caretOrAt = Fn.new { |reps|
           i = i + 1
           if (i == lpc) Fiber.abort("Invalid pattern - missing character at index %(i).")
           if (reps > 0) increment.call(reps)
           var d = pc[i]
           if (c == 94) {
               tokens.add(-c)
               tokens.add(d)
           } else {
               if (__fns2[20].call(d)) {
                   tokens.add(-c)
                   tokens.add(d)
                   tokens.add(d + 32)
               } else if (__fns2[11].call(d) && d != 181 && d != 223 && d != 255) {
                   tokens.add(-c)
                   tokens.add(d)
                   tokens.add(d - 32)
               } else {
                   tokens.add(d)
               }
           }
       }

       while (i < lpc) {
           c = pc[i]                          // current codepoint
           if (c == 47 || c == 38) {          // slash = character class, ampersand = extended class
               slashOrAmp.call(1)
           } else if (c == 94 || c == 64 ) {  // caret = complement, at sign = either-case
               caretOrAt.call(1)
           } else if (c == 61 || c == 43 || c == 35) { // multiple, minimum or range
               i = i + 1
               if (i == lpc) Fiber.abort("Invalid pattern - missing digit at index %(i).")
               var d = pc[i]                  // get the next codepoint
               if (d < 48 || d > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
               tokens.add(-c)
               var reps = d - 48
               tokens.add(reps)
               if (c == 61) {        // equals sign = multiple
                   if (reps < 2) {
                       reps = reps + 10
                       tokens[-1] = reps
                   }
               } else if (c == 35) { // hash = range
                   if (reps == 9) {
                       Fiber.abort("Invalid pattern - first digit cannot exceed eight at index %(i).")
                   }
                   i = i + 1
                   if (i == lpc) Fiber.abort("Invalid pattern - missing second digit at index %(i).")
                   var e = pc[i]
                   if (e < 48 || e > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
                   if (e <= d) {
                       Fiber.abort("Invalid pattern - second digit must be greater than first at index %(i).")
                   }
                   tokens.add(e - 48)
               }
               i = i + 1
               if (i == pc.count) {
                   Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
               }
               c = pc[i]   // get the next codepoint
               if (c == 47 || c == 38) {
                   slashOrAmp.call(reps)
               } else if (c == 94 || c == 64) {
                   caretOrAt.call(reps)
               } else if ("=+#~[]|$".codePoints.contains(c)) {
                   Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
               } else {
                   increment.call(reps)
                   tokens.add(c)
               }
           } else if (c == 126) { // tilde == optional
               i = i + 1
               if (i == lpc) Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
               tokens.add(-35) // use range for tokenization purposes
               tokens.add(0)
               tokens.add(1)
               c = pc[i]  // get the next codepoint
               if (c == 47 || c == 38) {
                   slashOrAmp.call(0)
               } else if (c == 94 || c == 64) {
                   caretOrAt.call(0)
               } else if ("=+#~[]|$".codePoints.contains(c)) {
                   Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
               } else {
                   tokens.add(c)
               }
           } else if (c == 91) { // left square bracket = capture opening
               if (cap) Fiber.abort("Invalid pattern - orphan [ found.")
               cap = true
               curMin = 0
               capMin = Num.largest
               tokens.add(-91)
           } else if (c == 124) { // vertical bar = end of mini-pattern
               if (!cap) Fiber.abort("Invalid pattern - orphan | found.")
               if (curMin < capMin) capMin = curMin
               curMin = 0
               tokens.add(-124)
           } else if (c == 93) { // right square bracket = capture end
               if (!cap) Fiber.abort("Invalid pattern - orphan ] found.")
               if (curMin < capMin) capMin = curMin
               cap = false
               increment.call(capMin)
               captures.add(capMin)
               tokens.add(-93)
           } else if (c == 36) { // dollar sign = back-reference
               i = i + 1
               if (i == lpc) Fiber.abort("Invalid pattern - missing digit at index %(i).")
               c = pc[i]       // get the next codepoint
               if (c < 48 || c > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
               c = c - 48
               if (c == 0) {
                   increment.call(min)
               } else if (c > captures.count) {
                   Fiber.abort("Invalid pattern - back-reference exceeds capture count at %(i).")
               } else {
                   increment.call(captures[c-1])
               }
               tokens.add(-36)
               tokens.add(c)
           } else { // normal character
               increment.call(1)
               tokens.add(c)
           }
           i = i + 1
       }
       if (cap) Fiber.abort("Invalid pattern - capture unfinished at %(i).")
       return [min, tokens]
   }

   // Private worker method.
   // Looks for a pattern match for the string 's' starting from codepoint index 'start'.
   // Returns a Match object if a match is found or null otherwise.
   match_(s, start) {
       var tokens = _tokens.toList     // use a copy as we might change it
       var tc = tokens.count           // tokens length
       var ti = 0                      // tokens index
       var t = tokens[ti]              // current token
       var codes = s.codePoints.toList // string codepoints
       var sc = s.count                // string codepoints count
       var si = start                  // string codepoints index        
       var c = -1                      // string current codepoint
       var consumed = false            // whether current codepoint has been consumed
       var cap = false                 // whether within a capture
       var captures = []               // stores captures
       var wm = ""                     // matched so far in string as a whole
       var cm = ""                     // matched so far in current capture
       var ci = 0                      // string index at which capture started

       if (si < sc) c = codes[si]

       // Consume current character.
       var consume = Fn.new {
           if (!cap) {
               wm = wm + String.fromCodePoint(c)
           } else {
               cm = cm + String.fromCodePoint(c)
           }
           consumed = true
       }

       // Moves token index, where necessary, to next metacharacter
       var moveTokenIndex = Fn.new { |z|
            if (z == -47 || z == -38 || z == -94) {
               ti = ti + 1
            } else if (z == -64) {
               ti = ti + 2
            }
       }

       // Checks if there's another mini-pattern in the current capture and if so prepares to match it.
       var nextMiniPattern = Fn.new {
           while (true) {
               ti = ti + 1
               t = tokens[ti]
               if (t == -93) return false // end of capture
               if (t == -124) {
                   cm = ""
                   si = ci
                   c = codes[si]
                   break
               }
           }
           return true
       }

       // Checks that there are no more options to consider before declaring a non-match.
       var noMore = Fn.new { !cap || !nextMiniPattern.call() }

       // Checks if character class matches and if so consumes character.
       var slash = Fn.new { |inc|
           if (inc) ti = ti + 1
           var u = tokens[ti]
           if (u >= 8 && u <= 10) {
               if (!__fns[u].call(c, this)) return false
           } else if (u < 26) {
               if (!__fns[u].call(c)) return false
           } else if (u >= 34 && u <= 36) {
               if (__fns[u-26].call(c, this)) return false
           } else {
               if (__fns[u-26].call(c)) return false
           }
           consume.call()
           return true
       }

       // Checks if extended class matches and if so consumes character.
       var ampersand = Fn.new { |inc|
           if (inc) ti = ti + 1
           var u = tokens[ti]
           if (u >= 8 && u <= 10) {
               if (!__fns2[u].call(c, this)) return false
           } else if (u < 26) {
               if (!__fns2[u].call(c)) return false
           } else if (u >= 34 && u <= 36) {
               if (__fns2[u-26].call(c, this)) return false
           } else {
               if (__fns2[u-26].call(c)) return false
           }
           consume.call()
           return true
       }

       // Checks if complement matches and if so consumes character.
       var caret = Fn.new { |inc|
           if (inc) ti = ti + 1
           var u = tokens[ti]
           if (c == u) return false
           consume.call()
           return true
       }

       // Checks if either-case matches and if so consumes character.
       var at = Fn.new { |inc|
           var u
           var v
           if (inc) {
               u = tokens[ti + 1]
               v = tokens[ti + 2]
               ti = ti + 2
           } else {
               u = tokens[ti-1]
               v = tokens[ti]
           }
           if (u > v) { // lower case first
               if (c != u && c != v) return false
           } else {
               if (c == u || c == v) return false
           }
           consume.call()
           return true
       }

       // Checks if ordinary character matches and if so consumes character.
       var character = Fn.new { |z|
           if (c != z) return false
           consume.call()
           return true
       }
        
       while (true) {
           for (i in 1..1) { // dummy loop so break can emulate goto
               if (t == -47) {  // slash = character class
                   if (si == sc) if (noMore.call()) return null else break
                   if (!slash.call(true) && noMore.call()) return null
               } else if (t == -38) { // ampersand = extended class
                   if (si == sc) if (noMore.call()) return null else break
                   if (!ampersand.call(true) && noMore.call()) return null
               } else if (t == -94) { // caret = complement
                   if (si == sc) if (noMore.call()) return null else break
                   if (!caret.call(true) && noMore.call()) return null
               } else if (t == -64) { // at sign = either-case
                   if (si == sc) if (noMore.call()) return null else break
                   if (!at.call(true) && noMore.call()) return null
               } else if (t == -61 || t == -43 || t == -35) { // quantifier
                   ti = ti + 1
                   var required = tokens[ti]
                   var reps
                   if (t == -61) {        // equals sign = multiple
                       reps = required
                   } else if (t == -43) { // plus sign = minimum
                       reps = Num.largest
                   } else if (t == -35) { // hash sign = range
                       ti = ti + 1
                       reps = tokens[ti]
                   }
                   ti = ti + 1
                   var z = tokens[ti]
                   if (si == sc) {
                       if (required > 0) {
                           if (noMore.call()) return null else break
                       } else {
                           moveTokenIndex.call(z)
                           break
                       }
                   }                  
                   if (z >= 0) { // ordinary character
                       for (i in 1..reps) {
                           if (!character.call(z)) {
                               if (i <= required && noMore.call()) return null
                               break
                           }
                           if (i == reps) break
                           si = si + 1
                           consumed = false
                           if (si == sc) {
                               if (i < required && noMore.call()) return null          
                               break
                           }
                           c = codes[si]
                       }
                   } else if (z == -47) { // character class
                       for (i in 1..reps) {
                           if (!slash.call(i == 1)) {
                               if (i <= required && noMore.call()) return null
                               break
                           }
                           if (i == reps) break
                           si = si + 1
                           consumed = false
                           if (si == sc) {
                               if (i < required && noMore.call()) return null         
                               break
                           }
                           c = codes[si]
                       }
                   } else if (z == -38) { // extended class
                       for (i in 1..reps) {
                           if (!ampersand.call(i == 1)) {
                               if (i <= required && noMore.call()) return null
                               break
                           }
                           if (i == reps) break
                           si = si + 1
                           consumed = false
                           if (si == sc) {
                               if (i < required && noMore.call()) return null
                               break
                           }
                           c = codes[si]
                       }
                   } else if (z == -94) { // complement
                       for (i in 1..reps) {
                           if (!caret.call(i == 1)) {
                               if (i <= required && noMore.call()) return null
                               break
                           }
                           if (i == reps) break
                           si = si + 1
                           consumed = false
                           if (si == sc) {
                               if (i < required && noMore.call()) return null           
                               break
                           }
                           c = codes[si]
                       }
                   } else if (z == -64) { // either-case
                       for (i in 1..reps) {
                           if (!at.call(i == 1)) {
                               if (i <= required && noMore.call()) return null
                               break
                           }
                           if (i == reps) break
                           si = si + 1
                           consumed = false
                           if (si == sc) {
                               if (i < required && noMore.call()) return null
                               break
                           }
                           c = codes[si]
                       }
                   }               
               } else if (t == -91) { // capture opening
                   cap = true
                   cm = ""
                   ci = si
               } else if (t == -124) { // end of mini-pattern
                   captures.add(Capture.new_(cm, ci))
                   wm = wm + cm
                   cap = false
                   while (true) { // find capture end
                       ti = ti + 1
                       t = tokens[ti]
                       if (t == -93) break
                   }
               } else if (t == -93) { // capture end
                   captures.add(Capture.new_(cm, ci))
                   wm = wm + cm
                   cap = false
               } else if (t == -36) { // back-reference
                   ti = ti + 1
                   var cn = tokens[ti]
                   var text = (cn > 0) ? captures[cn-1].text : wm
                   if (si == sc && text.count > 0 && noMore.call()) return null
                   var tokens1 = tokens[0..ti]
                   var tokens2 = tokens[ti+1..-1]
                   tokens = tokens1 + text.codePoints.toList + tokens2
                   tc = tokens.count
               } else { // ordinary character
                   if (si == sc) if (noMore.call()) return null else break
                   if (!character.call(t) && noMore.call()) return null
               }
           } // end for loop
           ti = ti + 1
           if (ti == tc) break
           t = tokens[ti]
           if (consumed) {
               si = si + 1
               consumed = false
               if (si < sc) {
                   c = codes[si]
               }
           }
       }

       return Match.new_(wm, start, captures)
   }

   // Constructs a Pattern object from a pattern, its type and its user defined character
   // classes. If an empty string is passed for the latter, they use their defaults.      
   construct new(pattern, type, i, j, k) {
       if (!((pattern is String) && pattern != "")) {
           Fiber.abort("Pattern must be a non-empty string.")
       }
       var mt = Pattern.validate_(pattern)
       _minLen = mt[0]
       _tokens = mt[1]
       _pattern = pattern
       if (!((type is Num) && type.isInteger && type >= 0 && type <= 3)) {
           Fiber.abort("Pattern type must be an integer between 0 and 3 inclusive.")
       }
       _type = type
       if (!((i is String) && (j is String) && (k is String))) {
           Fiber.abort("User defined class must be a string.")
       }
       _i = (i != "") ? i : "012"
       _j = (j != "") ? j : "0123"
       _k = (k != "") ? k : "01234"
   }

   // Convenience methods which call the constructor with default values for some arguments.
   static new(pattern, type, i, j) { new(pattern, type,  i,  j, "") }
   static new(pattern, type, i)    { new(pattern, type,  i, "", "") }
   static new(pattern, type)       { new(pattern, type, "", "", "") }
   static new(pattern)             { new(pattern,    0, "", "", "") }

   // Properties.
   pattern  { _pattern }   // the pattern string
   type     { _type    }   // its type
   minLen   { _minLen  }   // its minimum matching length (possibly zero)
   i        { _i }         // the user defined character class represented by /i
   j        { _j }         // the user defined character class represented by /j
   k        { _k }         // the user defined character class represented by /k

   // Checks whether the pattern matches a string or not.
   isMatch(s) { find(s) != null }

   // Finds and returns the first match (as a Match object) or null if there are no matches.
   find(s) {
       if (!(s is String)) Fiber.abort("Argument must be a string.")
       var sc = s.count
       if (sc < _minLen) return null
       if (_type == Pattern.within) {
           var maxStart = sc - _minLen
           for (start in 0..maxStart) {
               var m = match_(s, start)
               if (m) return m
           }
           return null
       }
       if (_type == Pattern.start) return match_(s, 0)
       if (_type == Pattern.end) {
           var maxStart = sc - _minLen
           for (start in 0..maxStart) {
               var m = match_(s, start)
               if (m && ((start + m.length) == sc)) return m
           }
           return null
       }
       if (_type == Pattern.whole) {
           var m = match_(s, 0)
           if (!m || m.length < sc) return null
           return m
       }
   }

   // Finds and returns all successive non-overlapping matches, if there are any,
   // as a list of Match objects. The list will be empty if there are no matches.
   // To prevent an infinite loop, it stops at (but includes) the first empty match.
   // Note that apart from Pattern.within there can never be more than one match.
   findAll(s) {
       var m = find(s)
       if (_type != Pattern.within) {
           return (m) ? [m] : []
       }
       if (!m) return []
       var sc = s.count
       var matches = [m]
       if (m.length == 0) return matches
       var start = m.index + m.length
       while (start + _minLen <= sc) {
           m = match_(s, start)
           if (m) {
               matches.add(m)
               if (m.length == 0) break
               start = start + m.length
           } else {
               start = start + 1
           }
       }
       return matches
   }

   // Replaces up to 'n' successive matches in 's', optionally skipping some of those 'n', by the
   // replacement string 'repl'. If there are no (or not enough) matches, returns 's' itself.
   // If n < 1, replaces all the matches.
   replace(s, repl, n, skip) {
       if (!(s is String) || !(repl is String)) Fiber.abort("First two arguments must be strings.")
       if (!((n is Num) && n.isInteger)) Fiber.abort("Third argument must be an integer.")
       if (!((skip is Num) && skip.isInteger && skip >= 0)) {
           Fiber.abort("Fourth argument must be a non-negative integer.")
       }
       var matches = findAll(s)
       var c = matches.count
       if (c == 0) return s
       if (n < 1 || n > c) n = c
       if (n <= skip) return s
       if (skip > 0 || n < c) matches = matches[skip...n]
       var cps = s.codePoints.toList
       var addIndex = 0
       for (m in matches) {
           var caps = m.captures
           var r = repl // fresh copy so each match substitutes its own captures
           var count = 0
           for (br in __backRefs[0...caps.count]) {
               r = r.replace(br, caps[count].text)
               count = count + 1
           }
           r = r.replace("$0", m.text)
           r = r.replace("$$", "$")
           var s1 = cps[0...addIndex + m.index]
           var s2 = r.codePoints.toList
           var s3 = cps[addIndex + m.index + m.length..-1]
           cps = s1 + s2 + s3
           addIndex = addIndex + s2.count - m.length
       }
       return cps.map { |cp| String.fromCodePoint(cp) }.join()
   }

   // Convenience version of the above method which replaces all matches.
   replaceAll(s, repl) { replace(s, repl, 0, 0) }

   // Splits the string into a list of up to 'n+1' substrings using pattern matches as the separators
   // optionally skipping some of those 'n' separators.
   // If there are no matches returns a list with a single element, 's' itself.
   // If n < 1, uses all the matches as separators.
   split(s, n, skip) {
       if (!(s is String)) Fiber.abort("First argument must be a string.")
       if (!((n is Num) && n.isInteger)) Fiber.abort("Second argument must be an integer.")
       if (!((skip is Num) && skip.isInteger && skip >= 0)) {
           Fiber.abort("Third argument must be a non-negative integer.")
       }
       var matches = findAll(s)
       var c = matches.count
       if (c == 0) return [s]
       if (n < 1 || n > c) n = c
       if (n <= skip) return [s]
       if (skip > 0 || n < c) matches = matches[skip...n]
       var cps = s.codePoints.toList
       var splits = []
       var prev = 0
       for (m in matches) {
           var next = m.index
           var item = cps[prev...next]
           splits.add(item.map { |cp| String.fromCodePoint(cp) }.join())
           prev = next + m.length
       }
       splits.add(cps[prev..-1].map { |cp| String.fromCodePoint(cp) }.join())
       return splits
   }

   // Convenience version of the above method which uses all the matches as separators.
   splitAll(s) { split(s, 0, 0) }

   // String representation (excluding user defined character classes).
   toString { "{ pattern = %(_pattern), type = %(Pattern.types[_type]), min length = %(_minLen) }" }

}

// Type aliases for classes in case of any name clashes with other modules.
var Pattern_Match = Match
var Pattern_Capture = Capture
var Pattern_Pattern = Pattern

Pattern.init_()</lang>