Parse EBNF: Difference between revisions

Content added Content deleted

Inline

Revision as of 22:44, 12 May 2011

Create a simple parser for EBNF grammars. Here is an ebnf grammar in itself and a parser for it in php.

You can use regular expressions for lexing.
Generate the calculator in Arithmetic evaluation using an EBNF description of the calculator.

Here are simple parser rules for a calculator taken from the antlr tutorial

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;

PicoLisp

<lang PicoLisp>(def 'expr 'ebnf '(term ((PLUS | MINUS) term) *)) (def 'term 'ebnf '(factor ((MULT | DIV) factor) *)) (def 'factor 'ebnf '(NUMBER))</lang> <lang PicoLisp>(de matchEbnf (Pat)

  (cond
     ((asoq Pat '((PLUS . +) (MINUS . -) (MULT . *) (DIV . /)))
        (let Op (cdr @)
           (when (= Op (car *Lst))
              (pop '*Lst)
              Op ) ) )
     ((== 'NUMBER Pat)
        (cond
           ((num? (car *Lst))
              (pop '*Lst)
              @ )
           ((and (= "-" (car *Lst)) (num? (cadr *Lst)))
              (setq *Lst (cddr *Lst))
              (- @) ) ) )
     ((get Pat 'ebnf) (parseLst @))
     ((atom Pat))
     (T
        (loop
           (T (matchEbnf (pop 'Pat)) @)
           (NIL Pat)
           (NIL (== '| (pop 'Pat)))
           (NIL Pat) ) ) ) )

(de parseLst (Pat)

  (let (P (pop 'Pat)  X (matchEbnf P))
     (loop
        (NIL Pat)
        (if (n== '* (cadr Pat))
           (if (matchEbnf (pop 'Pat))
              (setq X (list @ X))
              (throw) )
           (loop
              (NIL *Lst)
              (NIL (matchEbnf (car Pat)))
              (setq X (list @ X (or (matchEbnf P) (throw)))) )
           (setq Pat (cddr Pat)) ) )
     X ) )

(de parseEbnf (Str)

  (let *Lst (str Str "")
     (catch NIL
        (parseLst (get 'expr 'ebnf)) ) ) )</lang>

Output:

: (parseEbnf "1 + 2 * -3 / 7 - 3 * 4")
-> (- (+ 1 (/ (* 2 -3) 7)) (* 3 4))

Tcl

This example is in need of improvement:

This is not an EBNF parser. It never uses EBNF. It is a calculator parser, but there is already a calculator parser at Arithmetic evaluation#Tcl. One should adjust this solution to parse the EBNF language, not the calculator language.

Demonstration lexer and parser. Note that this parser supports parenthesized expressions, making the grammar recursive. <lang tcl>package require Tcl 8.6

Utilities to make the coroutine easier to use

proc provide args {while {![yield $args]} {yield}} proc next lexer {$lexer 1} proc pushback lexer {$lexer 0}

Lexical analyzer coroutine core

proc lexer {str} {

   yield [info coroutine]
   set symbols {+ PLUS - MINUS * MULT / DIV ( LPAR ) RPAR}
   set idx 0
   while 1 {

switch -regexp -matchvar m -- $str { {^\s+} { # No special action for whitespace } {^([-+*/()])} { provide [dict get $symbols [lindex $m 1]] [lindex $m 1] $idx } {^(\d+)} { provide NUMBER [lindex $m 1] $idx } {^$} { provide EOT "EOT" $idx return } . { provide PARSE_ERROR [lindex $m 0] $idx } } # Trim the matched string set str [string range $str [string length [lindex $m 0]] end] incr idx [string length [lindex $m 0]]

}

Utility functions to help with making an LL(1) parser; ParseLoop handles
EBNF looping constructs, ParseSeq handles sequence constructs.

proc ParseLoop {lexer def} {

   upvar 1 token token payload payload index index
   foreach {a b} $def {

if {$b ne "-"} {set b [list set c $b]} lappend m $a $b

   }
   lappend m default {pushback $lexer; break}
   while 1 {

lassign [next $lexer] token payload index switch -- $token {*}$m if {[set c [catch {uplevel 1 $c} res opt]]} { dict set opt -level [expr {[dict get $opt -level]+1}] return -options $opt $res }

} proc ParseSeq {lexer def} {

   upvar 1 token token payload payload index index
   foreach {t s} $def {

lassign [next $lexer] token payload index switch -- $token $t { if {[set c [catch {uplevel 1 $s} res opt]]} { dict set opt -level [expr {[dict get $opt -level]+1}] return -options $opt $res } } EOT { throw SYNTAX "end of text at position $index" } default { throw SYNTAX "\"$payload\" at position $index" }

}

Main parser driver; contains "master" grammar that ensures that the whole
text is matched and not just a prefix substring. Note also that the parser
runs the lexer as a coroutine (with a fixed name in this basic demonstration
code).

proc parse {str} {

   set lexer [coroutine l lexer $str]
   try {

set parsed [parse.expr $lexer] ParseLoop $lexer { EOT { return $parsed } } throw SYNTAX "\"$payload\" at position $index"

   } trap SYNTAX msg {

return -code error "syntax error: $msg"

   } finally {

catch {rename $lexer ""}

}

Now the descriptions of how to match each production in the grammar...

proc parse.expr {lexer} {

   set expr [parse.term $lexer]
   ParseLoop $lexer {

PLUS - MINUS { set expr [list $token $expr [parse.term $lexer]] }

   }
   return $expr

} proc parse.term {lexer} {

   set term [parse.factor $lexer]
   ParseLoop $lexer {

MULT - DIV { set term [list $token $term [parse.factor $lexer]] }

   }
   return $term

} proc parse.factor {lexer} {

   ParseLoop $lexer {

NUMBER { return $payload } MINUS { ParseSeq $lexer { NUMBER {return -$payload} } } LPAR { set result [parse.expr $lexer] ParseSeq $lexer { RPAR {return $result} } break } EOT { throw SYNTAX "end of text at position $index" }

   }
   throw SYNTAX "\"$payload\" at position $index"

}</lang>

<lang tcl># Demonstration code puts [parse "1 - 2 - -3 * 4 + 5"] puts [parse "1 - 2 - -3 * (4 + 5)"]</lang> Output:

PLUS {MINUS {MINUS 1 2} {MULT -3 4}} 5
MINUS {MINUS 1 2} {MULT -3 {PLUS 4 5}}

@@ Line 64: / Line 64: @@
 =={{header|Ruby}}==
 {{in progress|lang=Ruby|day=12|month=May|year=2011}}
-{{incomplete|Ruby|This code divides the input into lexical tokens, but does not parse the tokens by grammar.}}
+{{incomplete|Ruby|The tokenizer is here, but the parser is very incomplete.}}
-<lang ruby>require 'strscan'
+<lang ruby>#--
+# The tokenizer splits the input into Tokens like "identifier",
+# ":", ")*" and so on. This design uses a StringScanner on each line of
+# input, therefore a Token can never span more than one line.
+#
+# Each Token knows its original line and position, so an error message
+# can locate a bad token.
+#++
+require 'strscan'
+# A line of input.
+# where::  A location like "file.txt:3"
+# str::    String of this line
 Line = Struct.new :where, :str
+# A token.
+# cat::   A category like :colon, :ident or so on
+# str::   String of this token
+# line::  Line containing this token
+# pos::   Position of this token within this line
 Token = Struct.new :cat, :str, :line, :pos
 # Reads and returns the next Token. At end of file, returns nil.
+#--
+# Needs @filename and @in.
+#++
 def next_token
   # Loop until we reach a Token.
   loop do
-    # If before first line, or if at end of line,
+    # If at end of line, then get next line, or else declare end of
-    # then get next line, or else declare end of file.
+    # file.
-    if (not @scanner) or @scanner.eos?
+    if @scanner.eos?
       if s = @in.gets
         # Each line needs a new Line object. Tokens can hold references
         # to old Line objects.
         @line = Line.new("#{@filename}:#{@in.lineno}", s)
-        @scanner = StringScanner.new(s)
+        @scanner.string = s
       else
         return nil  # End of file
@@ Line 91: / Line 112: @@
   end
-  # Find token by regexp.
+  # Read token by regular expression.
-  # The :unknown regexp must not swallow any good tokens.
   if s = @scanner.scan(/:/)
     c = :colon
@@ Line 98: / Line 118: @@
     c = :semicolon
   elsif s = @scanner.scan(/\(/)
-    c = :group
+    c = :paren
   elsif s = @scanner.scan(/\)\?/)
     c = :option
@@ Line 104: / Line 124: @@
     c = :repeat
   elsif s = @scanner.scan(/\)/)
-    c = :one
+    c = :group
   elsif s = @scanner.scan(/\|/)
     c = :bar
-  elsif s = @scanner.scan(/'[^']*'|"[^"]*"/)
-    c = :string
   elsif s = @scanner.scan(/[[:alpha:]][[:alnum:]]*/)
     c = :ident
-  elsif s = @scanner.scan(/[^:;()'"[:alpha:][:space:]]*/)
+  elsif s = @scanner.scan(/'[^']*'|"[^"]*"/)
+    # Fix syntax highlighting for Rosetta Code. => '
+    c = :string
+  elsif s = @scanner.scan(/'[^']*|"[^"]*/)
+    c = :bad_string
+  elsif s = @scanner.scan(/.*/)
     c = :unknown
   end
@@ Line 118: / Line 141: @@
 end
+# Prints a _message_ to standard error, along with location of _token_.
-def read_it
+def error(token, message)
+  line = token.line
+  # We print a caret ^ pointing at the bad token. We make a very crude
+  # attempt to align the caret ^ in the correct column. If the input
+  # line has a non-[:print:] character, like a tab, then we print it as
+  # a space.
+  STDERR.puts <<EOF
+#{line.where}: #{message}
+#{line.str.gsub(/[^[:print:]]/, " ")}
+#{" " * token.pos}^
+EOF
+end
+#--
+# The parser converts Tokens to a Grammar object. The parser also
+# detects syntax errors.
+#++
+# A parsed EBNF grammar. It is an Array of Productions.
+class Grammar < Array; end
+# A production.
+# ident::  The identifier
+# alts::   An Array of Alternatives
+Production = Struct.new :ident, :alts
+# An array of Alternatives, as from "(a | b)".
+class Group < Array; end
+# An optional group, as from "(a | b)?".
+class OptionalGroup < Group; end
+# A repeated group, as from "(a | b)*".
+class RepeatedGroup < Group; end
+# An array of identifiers and string literals.
+class Alternative < Array; end
+#--
+# Needs @filename and @in.
+#++
+def parse
   # TODO: this only dumps the tokens.
   while t = next_token
-    p t
+    error(t, "#{t.cat}")
   end
 end
+# Set @filename and @in. Parse input.
-# Read files from ARGV, or else read STDIN.
 case ARGV.length
 when 0 then @filename = "-"
@@ Line 131: / Line 198: @@
 else fail "Too many arguments"
 end
-open(@filename) { |f| @in = f; read_it }</lang>
+open(@filename) do |f|
+  @in = f
+  @scanner = StringScanner.new("")
+  parse
+end
+</lang>
 =={{header|Tcl}}==