Compiler/lexical analyzer: Difference between revisions

From Rosetta Code
Content added Content deleted
(Re-order and rename columns of the token tables, and add more details. (This doesn't affect the task content.))
(some more rewording in the task description)
Line 84: Line 84:
;Other entities
;Other entities


These differ from the other tokens, in that each occurrence of them has a value associated with it.
These differ from the the previous tokens, in that each occurrence of them has a value associated with it.


{| class="wikitable"
{| class="wikitable"
Line 119: Line 119:
|}
|}


* For char and string literals, <code>\n</code> is supported as a new line character.
* For char and string literals, the <code>\n</code> escape sequence is supported to represent a new-line character.
* For char literals, to represent a backslash, use <code>\\</code>.
* For char literals, to represent a backslash, use <code>\\</code>.
* For char literals, an embedded single quote character is not supported.
* For char literals, an embedded single quote character is not supported.
Line 128: Line 128:


* Zero or more whitespace characters or comments are allowed between any two tokens, with the exceptions noted below.
* Zero or more whitespace characters or comments are allowed between any two tokens, with the exceptions noted below.
* "longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
* "Longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
* Whitespace is only required in the following situations:
* Whitespace is ''required'' between two tokens that have an alphanumeric character or underscore at the edge.
** This means: keywords, identifiers, and integer literals.
** To distinguish between keywords:
*** ifprint - is recognized as an identifier, instead of the keywords '''if''' and '''print'''.
** e.g. <code>ifprint</code> is recognized as an identifier, instead of the keywords <tt>if</tt> and <tt>print</tt>.
** e.g. <code>42fred</code> is invalid, and neither recognized as a number nor an identifier.
** To distinguish between keywords and integers:
* Whitespace is ''not allowed'' inside of tokens (except for chars and strings where they are part of the value).
*** 42fred - is an invalid number or invalid identifier.
** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator.
* Whitespace is not allowed between:
** Multi-character operators: These cannot be recognized unless they occur without embedded whitespace: '''&&''' '''<='''.


The following programs are equivalent:
The following programs are equivalent:

Revision as of 17:43, 17 August 2016

Compiler/lexical analyzer is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
This task has been flagged for clarification. Code on this page in its current state may be flagged incorrect once this task has been clarified. See this page's Talk page for discussion.


Definition from Wikipedia:

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
The Task

Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.

Input Specification

The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:

Operators
Name Common name Character sequence
OP_MULTIPLY multiply *
OP_DIVIDE divide /
OP_ADD plus +
OP_SUBTRACT minus -
OP_NEGATE unary minus -
OP_LESS less than <
OP_LESSEQUAL less than or equal <=
OP_GREATER greater than >
OP_NOTEQUAL not equal !=
OP_ASSIGN assignment =
OP_AND logical and &&
  • The - token should always be interpreted as OP_SUBTRACT by the lexer. Turning some OP_SUBTRACT into OP_NEGATE will be the job of the syntax analyzer, which is not part of this task.
Symbols
Name Common name Character
LEFTPAREN left parenthesis (
RIGHTPAREN right parenthesis )
LEFTBRACE left brace {
RIGHTBRACE right brace }
SEMICOLON semi-colon ;
COMMA comma ,
Keywords
Name Character sequence
KEYWORD_IF if
KEYWORD_WHILE while
KEYWORD_PRINT print
KEYWORD_PUTC putc
Other entities

These differ from the the previous tokens, in that each occurrence of them has a value associated with it.

Name Common name Format description Format regex Value
IDENTIFIER identifier one or more letter/number/underscore characters, but not starting with a number [_a-zA-Z][_a-zA-Z0-9]* as is
INTEGER integer literal one or more digits [0-9]+ as is, interpreted as a number
INTEGER char literal exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes '([^'\n]|\\n|\\\\)' the ASCII code point number of the character
STRING string literal zero or more characters (anything except newline or double quote), enclosed by double quotes "[^"\n]*" the characters without the double quotes and with escape sequences converted
  • For char and string literals, the \n escape sequence is supported to represent a new-line character.
  • For char literals, to represent a backslash, use \\.
  • For char literals, an embedded single quote character is not supported.
  • For string literals, embedded double quote characters are not supported.
  • No other special sequences are supported.
White space
  • Zero or more whitespace characters or comments are allowed between any two tokens, with the exceptions noted below.
  • "Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
  • Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
    • This means: keywords, identifiers, and integer literals.
    • e.g. ifprint is recognized as an identifier, instead of the keywords if and print.
    • e.g. 42fred is invalid, and neither recognized as a number nor an identifier.
  • Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
    • e.g. & & is invalid, and not interpreted as the && operator.

The following programs are equivalent:

<lang c> count = 1 ; n = 1 ; limit = 100 ; while ( n < limit ) {

   k = 3 ;
   p = 1 ;
   n = n + 2 ;
   while ( ( k * k <= n ) && ( p ) ) {
       p = n / k * k != n ;
       k = k + 2 ;
   }
   if ( p ) {
       print ( n , " " ) ;
       count = count + 1 ;
   }

} print ( count , "\n" ) ; </lang>

<lang c> count=1;n=1;limit=100;while(n<limit){k=3;p=1;n=n+2;while((k*k<=n)&&(p)){p=n/k*k!=n;k=k+2;}if(p){print(n," ");count=count+1;}}print(count,"\n"); </lang>

They should produce the same token stream, except for the line and column positions.

Comments enclosed in /* ... */ are also treated as whitespace outside of strings.

Complete list of token names
END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE
OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND
KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN
LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER
STRING
Output Format

The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:

  1. the line number where the token starts
  2. the column number where the token starts
  3. the token name
  4. the token value, in case of Integer, String, or Ident
Diagnostics

The following error conditions should be caught:

Error Example
Empty character constant ''
Unknown escape sequence. \r
Multi-character constant. 'xx'
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character.
Test Cases
Input Output

<lang c> /*

 Hello world
*/

print("Hello, World!\n"); </lang>

    4      1 KEYWORD_PRINT
    4      6 LEFTPAREN
    4      7 STRING         "Hello, World!\n"
    4     24 RIGHTPAREN
    4     25 SEMICOLON
    5      1 END_OF_INPUT

<lang c> /*

 Show Ident and Integers
*/

phoenix_number = 142857; print(phoenix_number, "\n"); </lang>

    4      1 IDENTIFIER     phoenix_number
    4     16 OP_ASSIGN
    4     18 INTEGER         142857
    4     24 SEMICOLON
    5      1 KEYWORD_PRINT
    5      6 LEFTPAREN
    5      7 IDENTIFIER     phoenix_number
    5     21 COMMA
    5     23 STRING         "\n"
    5     27 RIGHTPAREN
    5     28 SEMICOLON
    6      1 END_OF_INPUT

<lang c> /*

 All lexical tokens - not syntactically correct, but that will
 have to wait until syntax analysis
*/

/* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* While */ while /* Leq */ <= /* Lbrace */ { /* Neq */  != /* Rbrace */ } /* And */ && /* Lparen */ ( /* Semi */  ; /* Rparen */ ) /* Comma */ , /* Uminus */ - /* Assign */ = /* Mul */ * /* Integer */ 42 /* Div */ / /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' ' </lang>

    5     15 KEYWORD_PRINT
    5     41 OP_SUBTRACT
    6     15 KEYWORD_PUTC
    6     41 OP_LESS
    7     15 KEYWORD_IF
    7     41 OP_GREATER
    8     15 KEYWORD_WHILE
    8     41 OP_LESSEQUAL
    9     15 LEFTBRACE
    9     41 OP_NOTEQUAL
   10     15 RIGHTBRACE
   10     41 OP_AND
   11     15 LEFTPAREN
   11     41 SEMICOLON
   12     15 RIGHTPAREN
   12     41 COMMA
   13     15 OP_SUBTRACT
   13     41 OP_ASSIGN
   14     15 OP_MULTIPLY
   14     41 INTEGER           42
   15     15 OP_DIVIDE
   15     41 STRING         "String literal"
   16     15 OP_ADD
   16     41 IDENTIFIER     variable_name
   17     26 INTEGER           10
   18     26 INTEGER           92
   19     26 INTEGER           32
   20      1 END_OF_INPUT
Reference

The Flex, C, Python and Euphoria versions can be considered reference implementations.


C

Tested with gcc 4.81 and later, compiles warning free with -Wall -Wextra <lang C>

  1. include <stdlib.h>
  2. include <stdio.h>
  3. include <stdarg.h>
  4. include <ctype.h>
  5. include <string.h>
  6. include <errno.h>
  7. include <stdbool.h>
  8. include <limits.h>
  1. define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
  1. define da_dim(name, type) type *name = NULL; \
                           int _qy_ ## name ## _p = 0;  \
                           int _qy_ ## name ## _max = 0
  1. define da_rewind(name) _qy_ ## name ## _p = 0
  2. define da_redim(name) if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \
                               name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]))
  1. define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
  2. define da_len(name) _qy_ ## name ## _p

typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,

   tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

} TokenType;

typedef struct {

   int tok;
   int err_ln, err_col;
   union {
       int n;                  /* value for constants */
       char *text;             /* text for idents */
   };

} tok_s;

static FILE *source_fp, *dest_fp; static int line = 1, col = 0, the_ch = ' '; da_dim(text, char);

tok_s gettok();

static void error(int err_line, int err_col, const char *fmt, ... ) {

   char buf[1000];
   va_list ap;
   va_start(ap, fmt);
   vsprintf(buf, fmt, ap);
   va_end(ap);
   printf("(%d,%d) error: %s\n", err_line, err_col, buf);
   exit(1);

}

static int next_ch() { /* get next char from input */

   the_ch = getc(source_fp);
   ++col;
   if (the_ch == '\n') {
       ++line;
       col = 0;
   }
   return the_ch;

}

static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */

   if (the_ch == '\)
       error(err_line, err_col, "gettok: empty character constant");
   if (the_ch == '\\') {
       next_ch();
       if (the_ch == 'n')
           n = 10;
       else if (the_ch == '\\')
           n = '\\';
       else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch);
   }
   if (next_ch() != '\)
       error(err_line, err_col, "multi-character constant");
   next_ch();
   return (tok_s){tk_Integer, err_line, err_col, {n}};

}

static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */

   if (the_ch != '*')
       return (tok_s){tk_Div, err_line, err_col, {0}};
   /* comment found */
   for (;;) {
       if (next_ch() == '*' && next_ch() == '/') {
           next_ch();
           return gettok();
       } else if (the_ch == EOF)
           error(err_line, err_col, "EOF in comment");
   }

}

static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */

   da_rewind(text);
   while (next_ch() != start) {
       if (the_ch == '\n') error(err_line, err_col, "EOL in string");
       if (the_ch == EOF)  error(err_line, err_col, "EOF in string");
       da_append(text, (char)the_ch);
   }
   da_append(text, '\0');
   next_ch();
   return (tok_s){tk_String, err_line, err_col, {.text=text}};

}

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;
   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

static tok_s ident_or_int(int err_line, int err_col) {

   int n, is_number = true;
   da_rewind(text);
   while (isalnum(the_ch) || the_ch == '_') {
       da_append(text, (char)the_ch);
       if (!isdigit(the_ch))
           is_number = false;
       next_ch();
   }
   if (da_len(text) == 0)
       error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch);
   da_append(text, '\0');
   if (isdigit(text[0])) {
       if (!is_number)
           error(err_line, err_col, "invalid number: %s\n", text);
       n = strtol(text, NULL, 0);
       if (n == LONG_MAX && errno == ERANGE)
           error(err_line, err_col, "Number exceeds maximum value");
       return (tok_s){tk_Integer, err_line, err_col, {n}};
   }
   return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};

}

static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */

   if (the_ch == expect) {
       next_ch();
       return (tok_s){ifyes, err_line, err_col, {0}};
   }
   if (ifno == tk_EOI)
       error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch);
   return (tok_s){ifno, err_line, err_col, {0}};

}

tok_s gettok() { /* return the token type */

   /* skip white space */
   while (isspace(the_ch))
       next_ch();
   int err_line = line;
   int err_col  = col;
   switch (the_ch) {
       case '{':  next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}};
       case '}':  next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}};
       case '(':  next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}};
       case ')':  next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}};
       case '+':  next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}};
       case '-':  next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}};
       case '*':  next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}};
       case ';':  next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}};
       case ',':  next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}};
       case '>':  next_ch(); return (tok_s){tk_Gtr, err_line, err_col, {0}};
       case '=':  next_ch(); return (tok_s){tk_Assign, err_line, err_col, {0}};
       case '/':  next_ch(); return div_or_cmt(err_line, err_col);
       case '\: next_ch(); return char_lit(the_ch, err_line, err_col);
       case '<':  next_ch(); return follow('=', tk_Leq, tk_Lss, err_line, err_col);
       case '!':  next_ch(); return follow('=', tk_Neq, tk_EOI, err_line, err_col);
       case '&':  next_ch(); return follow('&', tk_And, tk_EOI, err_line, err_col);
       case '"' : return string_lit(the_ch, err_line, err_col);
       default:   return ident_or_int(err_line, err_col);
       case EOF:  return (tok_s){tk_EOI, err_line, err_col, {0}};
   }

}

void run() { /* tokenize the given input */

   tok_s tok;
   do {
       tok = gettok();
       fprintf(dest_fp, "%5d  %5d %.14s",
           tok.err_ln, tok.err_col,
           &"END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE     "
            "OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND        "
            "KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN    "
            "LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER       "
            "STRING        "[tok.tok * 14]);
       if (tok.tok == tk_Integer)     fprintf(dest_fp, "  %4d",   tok.n);
       else if (tok.tok == tk_Ident)  fprintf(dest_fp, " %s",     tok.text);
       else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text);
       fprintf(dest_fp, "\n");
   } while (tok.tok != tk_EOI);
   if (dest_fp != stdout)
       fclose(dest_fp);

}

void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {

   if (fn[0] == '\0')
       *fp = std;
   else if ((*fp = fopen(fn, mode)) == NULL)
       error(0, 0, "Can't open %s\n", fn);

}

int main(int argc, char *argv[]) {

   init_io(&source_fp, stdin,  "r",  argc > 1 ? argv[1] : "");
   init_io(&dest_fp,   stdout, "wb", argc > 2 ? argv[2] : "");
   run();
   return 0;

} </lang>

Output from test case 3:

    5     15 KEYWORD_PRINT
    5     41 OP_SUBTRACT
    6     15 KEYWORD_PUTC
    6     41 OP_LESS
    7     15 KEYWORD_IF
    7     41 OP_GREATER
    8     15 KEYWORD_WHILE
    8     41 OP_LESSEQUAL
    9     15 LEFTBRACE
    9     41 OP_NOTEQUAL
   10     15 RIGHTBRACE
   10     41 OP_AND
   11     15 LEFTPAREN
   11     41 SEMICOLON
   12     15 RIGHTPAREN
   12     41 COMMA
   13     15 OP_SUBTRACT
   13     41 OP_ASSIGN
   14     15 OP_MULTIPLY
   14     41 INTEGER           42
   15     15 OP_DIVIDE
   15     41 STRING         "String literal"
   16     15 OP_ADD
   16     41 IDENTIFIER     variable_name
   17     26 INTEGER           10
   18     26 INTEGER           92
   19     26 INTEGER           32
   20      1 END_OF_INPUT

Euphoria

Tested with Euphoria 4.05. <lang euphoria> include std/io.e include std/map.e include std/types.e include std/convert.e

constant true = 1, false = 0, EOF = -1

enum tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq,

   tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

constant all_syms = {"END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT",

   "OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN",
   "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN",
   "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER",
   "INTEGER", "STRING"}

integer input_file, the_ch = ' ', the_col = 0, the_line = 1 sequence symbols map key_words = new()

procedure error(sequence format, sequence data)

   printf(STDOUT, format, data)
   abort(1)

end procedure

-- get the next character from the input function next_ch()

   the_ch = getc(input_file)
   the_col += 1
   if the_ch = '\n' then
       the_line += 1
       the_col = 0
   end if
   return the_ch

end function

-- 'x' - character constants function char_lit(integer err_line, integer err_col)

   integer n = next_ch()              -- skip opening quote
   if the_ch = '\ then
       error("%d %d empty character constant", {err_line, err_col})
   elsif the_ch = '\\' then
       next_ch()
       if the_ch = 'n' then
           n = 10
       elsif the_ch = '\\' then
           n = '\\'
       else
           error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch})
       end if
   end if
   if next_ch() != '\ then
       error("%d %d multi-character constant", {err_line, err_col})
   end if
   next_ch()
   return {tk_Integer, err_line, err_col, n}

end function

-- process divide or comments function div_or_cmt(integer err_line, integer err_col)

   if next_ch() != '*' then
       return {tk_Div, err_line, err_col}
   end if
   -- comment found
   while true do
       if next_ch() = '*' and next_ch() = '/' then
           next_ch()
           return get_tok()
       elsif the_ch = EOF then
           error("%d %d EOF in comment", {err_line, err_col})
       end if
   end while

end function

-- "string" function string_lit(integer start, integer err_line, integer err_col)

   string text = ""
   while next_ch() != start do
       if the_ch = EOF then
           error("%d %d EOF while scanning string literal", {err_line, err_col})
       end if
       if the_ch = '\n' then
           error("%d %d EOL while scanning string literal", {err_line, err_col})
       end if
       text &= the_ch
   end while
   next_ch()
   return {tk_String, err_line, err_col, text}

end function

-- handle identifiers and integers function ident_or_int(integer err_line, integer err_col)

   integer n, is_number = true
   string text = ""
   while t_alnum(the_ch) or the_ch = '_' do
       text &= the_ch
       if not t_digit(the_ch) then
           is_number = false
       end if
       next_ch()
   end while
   if length(text) = 0 then
       error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch})
   end if
   if t_digit(text[1]) then
       if not is_number then
           error("%d %d invalid number: %s", {err_line, err_col, text})
       end if
       n = to_integer(text)
       return {tk_Integer, err_line, err_col, n}
   end if
   if has(key_words, text) then
       return {get(key_words, text), err_line, err_col}
   end if
   return {tk_Ident, err_line, err_col, text}

end function

-- look ahead for '>=', etc. function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)

   if next_ch() = expect then
       next_ch()
       return {ifyes, err_line, err_col}
   end if
   if ifno = tk_EOI then
       error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch})
   end if
   return {ifno, err_line, err_col}

end function

-- return the next token type function get_tok()

   while t_space(the_ch) do
       next_ch()
   end while
   integer err_line = the_line
   integer err_col  = the_col
   switch the_ch do
       case EOF  then return {tk_EOI, err_line, err_col}
       case '/'  then return div_or_cmt(err_line, err_col)
       case '\ then return char_lit(err_line, err_col)
       case '<'  then return follow('=', tk_Leq, tk_Lss, err_line, err_col)
       case '!'  then return follow('=', tk_Neq, tk_EOI, err_line, err_col)
       case '&'  then return follow('&', tk_And, tk_EOI, err_line, err_col)
       case '"'  then return string_lit(the_ch, err_line, err_col)
       case else
           integer sym = symbols[the_ch]
           if sym  != tk_EOI then
               next_ch()
               return {sym, err_line, err_col}
           end if
           return ident_or_int(err_line, err_col)
   end switch

end function

procedure init()

   put(key_words, "if",      tk_If)
   put(key_words, "print",   tk_Print)
   put(key_words, "putc",    tk_Putc)
   put(key_words, "while",   tk_While)
   symbols = repeat(tk_EOI, 256)
   symbols['{'] = tk_Lbrace
   symbols['}'] = tk_Rbrace
   symbols['('] = tk_Lparen
   symbols[')'] = tk_Rparen
   symbols['+'] = tk_Add
   symbols['-'] = tk_Sub
   symbols['*'] = tk_Mul
   symbols[';'] = tk_Semi
   symbols[','] = tk_Comma
   symbols['>'] = tk_Gtr
   symbols['='] = tk_Assign

end procedure

procedure main(sequence cl)

   sequence file_name
   input_file = STDIN
   if length(cl) > 2 then
       file_name = cl[3]
       input_file = open(file_name, "r")
       if input_file = -1 then
           error("Could not open %s", {file_name})
       end if
   end if
   init()
   sequence t
   loop do
       t = get_tok()
       printf(STDOUT, "%5d  %5d %-8s", {t[2], t[3], all_syms[t[1]]})
       switch t[1] do
           case tk_Integer then    printf(STDOUT, "  %5d\n",   {t[4]})
           case tk_Ident then printf(STDOUT, " %s\n",     {t[4]})
           case tk_String  then    printf(STDOUT, " \"%s\"\n", {t[4]})
           case else            printf(STDOUT, "\n")
       end switch
       until t[1] = tk_EOI
   end loop

end procedure

main(command_line()) </lang>

Output from test case 3:

    5     15 KEYWORD_PRINT
    5     41 OP_SUBTRACT
    6     15 KEYWORD_PUTC
    6     41 OP_LESS
    7     15 KEYWORD_IF
    7     41 OP_GREATER
    8     15 KEYWORD_WHILE
    8     41 OP_LESSEQUAL
    9     15 LEFTBRACE
    9     41 OP_NOTEQUAL
   10     15 RIGHTBRACE
   10     41 OP_AND
   11     15 LEFTPAREN
   11     41 SEMICOLON
   12     15 RIGHTPAREN
   12     41 COMMA
   13     15 OP_SUBTRACT
   13     41 OP_ASSIGN
   14     15 OP_MULTIPLY
   14     41 INTEGER      42
   15     15 OP_DIVIDE
   15     41 STRING   "String literal"
   16     15 OP_ADD
   16     41 IDENTIFIER variable_name
   17     26 INTEGER      10
   18     26 INTEGER      92
   19     26 INTEGER      32
   20      1 END_OF_INPUT

Flex

Tested with Flex 2.5.4. <lang C> %{

  1. include <stdio.h>
  2. include <stdlib.h>
  3. include <string.h>
  4. include <errno.h>
  5. include <limits.h>
  1. define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,

   tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

} TokenType;

void yyerror(char msg[]) {

   printf(msg);
   exit(1);

}

static int yynval;

struct yylloc {

   int first_line, first_col;
   int last_line, last_col;

} yylloc;

static void update_loc() {

 static int curr_line = 1;
 static int curr_col  = 1;
 yylloc.first_line = curr_line;
 yylloc.first_col  = curr_col;
 {char *s; for (s = yytext; *s != '\0'; s++) {
   if (*s == '\n') {
     curr_line++;
     curr_col = 1;
   } else {
     curr_col++;
   }
 }}
 yylloc.last_line = curr_line;
 yylloc.last_col  = curr_col-1;

}

  1. define YY_USER_ACTION update_loc();

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;
   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

%}

%start COMMENT2

%option noyywrap

digit [0-9] ident [a-zA-Z_][a-zA-Z_0-9]*

number {digit}+ string \"[^"\n]*\" char_const \'([^'\n]|\\n|\\\\)\'

%%

<COMMENT2>[^*]+  ; <COMMENT2>\*[^/] ; <COMMENT2>\*\/ BEGIN 0; /* end comment */ "/*" BEGIN COMMENT2;

"{" {return tk_Lbrace;} "}" {return tk_Rbrace;} "(" {return tk_Lparen;} ")" {return tk_Rparen;} "*" {return tk_Mul;} "/" {return tk_Div;} "+" {return tk_Add;} "-" {return tk_Sub;} "<" {return tk_Lss;} ">" {return tk_Gtr;} "<=" {return tk_Leq;} "!=" {return tk_Neq;} "&&" {return tk_And;} ";" {return tk_Semi;} "," {return tk_Comma;} "=" {return tk_Assign;} {ident} {return get_ident_type(yytext);} {string} {return tk_String;}

[ \t\n]+ ; /* ignore whitespace */

{number} {

               yynval = strtol(yytext, NULL, 0);
               if (yynval == LONG_MAX && errno == ERANGE)
                   yyerror("Number exceeds maximum value");
               return tk_Integer;
            }

{char_const} {

               int n = yytext[1];
               char *p = yytext;
               if (yyleng < 3)
                   yyerror("empty character constant");
               ++p;
               if (p[0] == '\\') {
                   ++p;
                   if (p[0] == 'n')
                       n = 10;
                   else if (p[0] == '\\')
                       n = '\\';
                   else
                       yyerror("unknown escape sequence");
               }
               yynval = n;
               return tk_Integer;
            }

. yyerror("Unknown character\n");

%%

int main(int argc, char *argv[]) {

   int tok;
   ++argv, --argc;  /* skip over program name */
   yyin = stdin;
   if (argc > 0)
       yyin = fopen(argv[0], "r");
   do {
       tok = yylex();
       printf("%5d  %5d %.14s", yylloc.first_line, yylloc.first_col,
           &"END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE     "
            "OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND        "
            "KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN    "
            "LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER       "
            "STRING        "[tok * 14]);
       if (tok == tk_Integer)     printf("   %5d", yynval);
       else if (tok == tk_Ident)  printf("  %s",   yytext);
       else if (tok == tk_String) printf("  %s",   yytext);
       printf("\n");
   } while (tok != tk_EOI);
   return 0;

} </lang>

Output from test case 3:

    5     15 KEYWORD_PRINT
    5     41 OP_SUBTRACT
    6     15 KEYWORD_PUTC
    6     41 OP_LESS
    7     15 KEYWORD_IF
    7     41 OP_GREATER
    8     15 KEYWORD_WHILE
    8     41 OP_LESSEQUAL
    9     15 LEFTBRACE
    9     41 OP_NOTEQUAL
   10     15 RIGHTBRACE
   10     41 OP_AND
   11     15 LEFTPAREN
   11     41 SEMICOLON
   12     15 RIGHTPAREN
   12     41 COMMA
   13     15 OP_SUBTRACT
   13     41 OP_ASSIGN
   14     15 OP_MULTIPLY
   14     41 INTEGER             42
   15     15 OP_DIVIDE
   15     41 STRING          "String literal"
   16     15 OP_ADD
   16     41 IDENTIFIER      variable_name
   17     26 INTEGER             10
   18     26 INTEGER             92
   19     26 INTEGER             32
   19     29 END_OF_INPUT

FreeBASIC

Tested with FreeBASIC 1.05 <lang FreeBASIC> enum Token_type

   tk_EOI
   tk_Mul
   tk_Div
   tk_Add
   tk_Sub
   tk_Uminus
   tk_Lss
   tk_Leq
   tk_Gtr
   tk_Neq
   tk_Assign
   tk_And
   tk_If
   tk_While
   tk_Print
   tk_Putc
   tk_Lparen
   tk_Rparen
   tk_Lbrace
   tk_Rbrace
   tk_Semi
   tk_Comma
   tk_Ident
   tk_Integer
   tk_String

end enum

const NewLine = chr(10) const DoubleQuote = chr(34)

' where we store keywords and variables type Symbol

   s_name as string
   tok as Token_type

end type

dim shared symtab() as Symbol

dim shared cur_line as string dim shared cur_ch as string dim shared line_num as integer dim shared col_num as integer

function is_digit(byval ch as string) as long

   is_digit = (ch <> "") and ch >= "0" and ch <= "9"

end function

function is_alnum(byval ch as string) as long

   is_alnum = (ch <> "") and ((UCase(ch) >= "A" and UCase(ch) <= "Z") or (is_digit(ch)))

end function

sub error_msg(byval eline as integer, byval ecol as integer, byval msg as string)

   print "("; eline; ":"; ecol; ")"; " "; msg
   system

end sub

' add an identifier to the symbol table function install(byval s_name as string, byval tok as Token_type) as integer

   dim n as integer
   n = ubound(symtab)
   redim preserve symtab(n + 1)
   n = ubound(symtab)
   symtab(n).s_name = s_name
   symtab(n).tok    = tok
   return n

end function

' search for an identifier in the symbol table function lookup(byval s_name as string) as integer

   dim i as integer
   for i = lbound(symtab) to ubound(symtab)
       if symtab(i).s_name = s_name then return i
   next
   return -1

end function

sub next_line() ' read the next line of input from the source file

   cur_line = ""
   cur_ch  = ""        ' empty cur_ch means end-of-file
   if eof(1) then exit sub
   line input #1, cur_line
   cur_line = cur_line + NewLine
   line_num += + 1
   col_num = 1

end sub

sub next_char() ' get the next char

   cur_ch = ""
   col_num += 1
   if col_num > len(cur_line) then next_line()
   if col_num <= len(cur_line) then cur_ch = mid(cur_line, col_num, 1)

end sub

function follow(byval err_line as integer, byval err_col as integer, byval expect as string, byval ifyes as Token_type, byval ifno as Token_type) as Token_type

   if cur_ch = expect then
       next_char()
       return ifyes
   end if
   if ifno = tk_eoi then error_msg(err_line, err_col, "follow unrecognized character: " + cur_ch)
   return ifno

end function

sub gettok(byref err_line as integer, byref err_col as integer, byref tok as Token_type, byref v as string)

   ' skip whitespace
   do while (cur_ch = " " or cur_ch = chr(9) or cur_ch = NewLine) and (cur_ch <> "")
       next_char()
   loop
   err_line = line_num
   err_col  = col_num
   select case cur_ch
       case "":  tok = tk_eoi: exit sub
       case "{": tok = tk_lbrace: next_char(): exit sub
       case "}": tok = tk_rbrace: next_char(): exit sub
       case "(": tok = tk_lparen: next_char(): exit sub
       case ")": tok = tk_rparen: next_char(): exit sub
       case "+": tok = tk_add:    next_char(): exit sub
       case "-": tok = tk_sub:    next_char(): exit sub
       case "*": tok = tk_mul:    next_char(): exit sub
       case ";": tok = tk_semi:   next_char(): exit sub
       case ",": tok = tk_comma:  next_char(): exit sub
       case ">": tok = tk_gtr:    next_char(): exit sub
       case "=": tok = tk_assign: next_char(): exit sub
       case "/": ' div or comment
           next_char()
           if cur_ch <> "*" then
               tok = tk_div
               exit sub
           end if
           ' skip comments
           do
               next_char()
               if cur_ch = "*" or cur_ch = "" then
                   next_char()
                   if cur_ch = "/" or cur_ch = "" then
                       next_char()
                       gettok(err_line, err_col, tok, v)
                       exit sub
                   end if
               end if
           loop
       case "'":   ' single char literals
           next_char()
           v = str(Asc(cur_ch))
           if cur_ch = "'" then error_msg(err_line, err_col, "empty character constant")
           if cur_ch = "\" then
               next_char()
               if cur_ch = "n" then
                   v = "10"
               elseif cur_ch = "\" then
                   v = Str(Asc("\"))
               else error_msg(err_line, err_col, "unknown escape sequence: " + cur_ch)
               end if
           end if
           next_char()
           if cur_ch <> "'" then error_msg(err_line, err_col, "multi-character constant")
           next_char()
           tok = tk_integer
           exit sub
       case "<": next_char(): tok = follow(err_line, err_col, "=", tk_Leq, tk_Lss): exit sub
       case "!": next_char(): tok = follow(err_line, err_col, "=", tk_Neq, tk_EOI): exit sub
       case "&": next_char(): tok = follow(err_line, err_col, "&", tk_And, tk_EOI): exit sub
       case DoubleQuote: ' string
           v = cur_ch
           next_char()
           do while cur_ch <> DoubleQuote
               if cur_ch = NewLine then error_msg(err_line, err_col, "EOL in string")
               if cur_ch = "" then error_msg(err_line, err_col, "EOF in string")
               v += cur_ch
               next_char()
           loop
           v += cur_ch
           next_char()
           tok = tk_string
           exit sub
       case else   ' integers or identifiers
           dim is_number as boolean = is_digit(cur_ch)
           v = ""
           do while is_alnum(cur_ch) orelse cur_ch = "_"
               if not is_digit(cur_ch) then is_number = false
               v += cur_ch
               next_char()
           loop
           if len(v) = 0 then error_msg(err_line, err_col, "unknown character: " + cur_ch)
           if is_digit(mid(v, 1, 1)) then
               if not is_number then error_msg(err_line, err_col, "invalid number: " + v)
               tok = tk_integer
               exit sub
           end if
           dim as integer index = lookup(v)
           if index = -1 then
               tok = tk_ident
           else
               tok = symtab(index).tok
           end if
           exit sub
   end select

end sub

sub init_lex(byval filein as string)

   install("if",    tk_if)
   install("print", tk_print)
   install("putc",  tk_putc)
   install("while", tk_while)
   open filein for input as #1
   cur_line = ""
   line_num = 0
   col_num = 0
   next_char()

end sub

sub scanner()

   dim err_line as integer
   dim err_col as integer
   dim tok as Token_type
   dim v as string
   dim tok_list(tk_eoi to tk_string) as string
   tok_list(tk_EOI    ) = "END_OF_INPUT"
   tok_list(tk_Mul    ) = "OP_MULTIPLY"
   tok_list(tk_Div    ) = "OP_DIVIDE"
   tok_list(tk_Add    ) = "OP_ADD"
   tok_list(tk_Sub    ) = "OP_SUBTRACT"
   tok_list(tk_Uminus ) = "OP_NEGATE"
   tok_list(tk_Lss    ) = "OP_LESS"
   tok_list(tk_Leq    ) = "OP_LESSEQUAL"
   tok_list(tk_Gtr    ) = "OP_GREATER"
   tok_list(tk_Neq    ) = "OP_NOTEQUAL"
   tok_list(tk_Assign ) = "OP_ASSIGN"
   tok_list(tk_And    ) = "OP_AND"
   tok_list(tk_If     ) = "KEYWORD_IF"
   tok_list(tk_While  ) = "KEYWORD_WHILE"
   tok_list(tk_Print  ) = "KEYWORD_PRINT"
   tok_list(tk_Putc   ) = "KEYWORD_PUTC"
   tok_list(tk_Lparen ) = "LEFTPAREN"
   tok_list(tk_Rparen ) = "RIGHTPAREN"
   tok_list(tk_Lbrace ) = "LEFTBRACE"
   tok_list(tk_Rbrace ) = "RIGHTBRACE"
   tok_list(tk_Semi   ) = "SEMICOLON"
   tok_list(tk_Comma  ) = "COMMA"
   tok_list(tk_Ident  ) = "IDENTIFIER"
   tok_list(tk_Integer) = "INTEGER"
   tok_list(tk_String ) = "STRING"
   do
       gettok(err_line, err_col, tok, v)
       print using "#####  ##### \              \"; err_line; err_col; tok_list(tok);
       if tok = tk_integer orelse tok = tk_ident orelse tok = tk_string then print " " + v;
       print
   loop until tok = tk_eoi

end sub

sub main()

   if command(1) = "" then print "filename required" : system
   init_lex(command(1))
   scanner()

end sub

main() system </lang>

Output from test case 3:

    5     15 KEYWORD_PRINT
    5     41 OP_SUBTRACT
    6     15 KEYWORD_PUTC
    6     41 OP_LESS
    7     15 KEYWORD_IF
    7     41 OP_GREATER
    8     15 KEYWORD_WHILE
    8     41 OP_LESSEQUAL
    9     15 LEFTBRACE
    9     41 OP_NOTEQUAL
   10     15 RIGHTBRACE
   10     41 OP_AND
   11     15 LEFTPAREN
   11     41 SEMICOLON
   12     15 RIGHTPAREN
   12     41 COMMA
   13     15 OP_SUBTRACT
   13     41 OP_ASSIGN
   14     15 OP_MULTIPLY
   14     41 INTEGER          42
   15     15 OP_DIVIDE
   15     41 STRING           "String literal"
   16     15 OP_ADD
   16     41 IDENTIFIER       variable_name
   17     26 INTEGER          10
   18     26 INTEGER          92
   19     26 INTEGER          32
   19     30 END_OF_INPUT

Perl

<lang perl>#!/usr/bin/env perl

use strict; use warnings; no warnings 'once';


  1. ----- Definition of the language to be lexed -----#

my @tokens = (

   # Name          | Format               | Value       #
   # --------------|----------------------|-------------#
   ['OP_MULTIPLY'  , '*'                  ,             ],
   ['OP_DIVIDE'    , '/'                  ,             ],
   ['OP_ADD'       , '+'                  ,             ],
   ['OP_SUBTRACT'  , '-'                  ,             ],
   ['OP_LESSEQUAL' , '<='                 ,             ],
   ['OP_LESS'      , '<'                  ,             ],
   ['OP_GREATER'   , '>'                  ,             ],
   ['OP_NOTEQUAL'  , '!='                 ,             ],
   ['OP_ASSIGN'    , '='                  ,             ],
   ['OP_AND'       , '&&'                 ,             ],
   ['KEYWORD_IF'   , qr/if\b/             ,             ],
   ['KEYWORD_WHILE', qr/while\b/          ,             ],
   ['KEYWORD_PRINT', qr/print\b/          ,             ],
   ['KEYWORD_PUTC' , qr/putc\b/           ,             ],
   ['LEFTPAREN'    , '('                  ,             ],
   ['RIGHTPAREN'   , ')'                  ,             ],
   ['LEFTBRACE'    , '{'                  ,             ],
   ['RIGHTBRACE'   , '}'                  ,             ],
   ['SEMICOLON'    , ';'                  ,             ],
   ['COMMA'        , ','                  ,             ],
   ['IDENTIFIER'   , qr/[_a-z][_a-z0-9]*/i, \&raw       ],
   ['INTEGER'      , qr/[0-9]+\b/         , \&raw       ],
   ['INTEGER'      , qr/'([^']*)(')?/     , \&char_val  ],
   ['STRING'       , qr/"([^"]*)(")?/     , \&string_raw],
   
   ['END_OF_INPUT' , qr/$/                ,             ],

);

my $comment = qr/\/\* .+? (?: \*\/ | $ (?{die "End-of-file in comment\n"}) )/xs; my $whitespace = qr/(?: \s | $comment)*/x; my $unrecognized = qr/\w+? \b | ./x;

  1. | Returns the value of a matched char literal, or dies if it is invalid

sub char_val {

   my $str = string_val();
   die "Multiple characters\n" if length $str > 1;
   die "No character\n"        if length $str == 0;
   ord $str;

}

  1. | Returns the value of a matched string literal, or dies if it is invalid

sub string_val {

   my ($str, $end) = ($1, $2);
   die "End-of-file\n" if not defined $end;
   die "End-of-line\n" if $str =~ /\n/;
   $str =~ s/\\(.)/
         $1 eq 'n'  ? "\n"
       : $1 eq '\\' ? $1
       : $1 eq $end ? $1
       : die "Unknown escape sequence \\$1\n"
   /rge;

}

  1. | Returns the source string of a matched literal

sub raw { $& }

  1. | Returns the source string of a matched string literal, or dies if invalid

sub string_raw {

   string_val(); # Just for the error handling side-effects
   $&;

}


  1. ----- Lexer "engine" -----#
  1. Construct the scanner regex:

my $tokens =

   join "|",
   map {
       my $format = $tokens[$_][1];
       "\n".(ref $format ? $format : quotemeta $format)." (*MARK:$_) ";
   } 0..$#tokens;

my $regex = qr/

   \G (?| $whitespace  \K (?| $tokens )
        | $whitespace? \K ($unrecognized) (*MARK:!) )

/x;


  1. Run the lexer:

my $input = do { local $/ = undef; <STDIN> }; my $pos = 0; my $linecol = linecol_accumulator();

while ($input =~ /$regex/g) {

   # Get the line and column number
   my ($line, $col) = $linecol->(substr $input, $pos, $-[0] - $pos);
   $pos = $-[0];
   
   # Get the token type that was identified by the scanner regex
   my $type = $main::REGMARK;
   die "Unrecognized token $1 at line $line, col $col\n" if $type eq '!';
   my ($name, $evaluator) = @{$tokens[$type]}[0, 2];
   
   # Get the token value
   my $value;
   if ($evaluator) {
       eval { $value = $evaluator->() };
       if ($@) { chomp $@; die "$@ in $name at line $line, col $col\n" }
   }
   
   # Print the output line
   print "$line\t$col\t$name".($value ? "\t$value" : )."\n";

}

  1. | Returns a closure, which can be fed a string one piece at a time and gives
  2. | back the cumulative line and column number each time

sub linecol_accumulator {

   my ($line, $col) = (1, 1);
   sub {
       my $str = shift;
       my @lines = split "\n", $str, -1;
       my ($l, $c) = (@lines - 1, length $lines[-1]);
       if ($l) { $line += $l;  $col = 1 + $c }
       else    { $col += $c }
       ($line, $col)
   }

}</lang>

Output:

(for test-case 3)

5	15	KEYWORD_PRINT
5	41	OP_SUBTRACT
6	15	KEYWORD_PUTC
6	41	OP_LESS
7	15	KEYWORD_IF
7	41	OP_GREATER
8	15	KEYWORD_WHILE
8	41	OP_LESSEQUAL
9	15	LEFTBRACE
9	41	OP_NOTEQUAL
10	15	RIGHTBRACE
10	41	OP_AND
11	15	LEFTPAREN
11	41	SEMICOLON
12	15	RIGHTPAREN
12	41	COMMA
13	15	OP_SUBTRACT
13	41	OP_ASSIGN
14	15	OP_MULTIPLY
14	41	INTEGER	42
15	15	OP_DIVIDE
15	41	STRING	"String literal"
16	15	OP_ADD
16	41	IDENTIFIER	variable_name
17	26	INTEGER	10
18	26	INTEGER	92
19	26	INTEGER	32
20	1	END_OF_INPUT

Python

Tested with Python 2.7 and 3.x <lang Python> from __future__ import print_function import sys

  1. following two must remain in the same order

tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq, \ tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, \ tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String = range(25)

all_syms = ["END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT",

   "OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN",
   "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN",
   "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER",
   "INTEGER", "STRING"]
  1. single character only symbols

symbols = { '{': tk_Lbrace, '}': tk_Rbrace, '(': tk_Lparen, ')': tk_Rparen, '+': tk_Add, '-': tk_Sub,

   '*': tk_Mul, ';': tk_Semi, ',': tk_Comma, '>': tk_Gtr, '=': tk_Assign }

key_words = {'if': tk_If, 'print': tk_Print, 'putc': tk_Putc, 'while': tk_While}

the_ch = " " # dummy first char - but it must be a space the_col = 0 the_line = 1 input_file = None

        • show error and exit

def error(line, col, msg):

   print(line, col, msg)
   exit(1)
        • get the next character from the input

def next_ch():

   global the_ch, the_col, the_line
   the_ch = input_file.read(1)
   the_col += 1
   if the_ch == '\n':
       the_line += 1
       the_col = 0
   return the_ch
        • 'x' - character constants

def char_lit(err_line, err_col):

   n = ord(next_ch())              # skip opening quote
   if the_ch == '\:
       error(err_line, err_col, "empty character constant")
   elif the_ch == '\\':
       next_ch()
       if the_ch == 'n':
           n = 10
       elif the_ch == '\\':
           n = ord('\\')
       else:
           error(err_line, err_col, "unknown escape sequence \\%c" % (the_ch))
   if next_ch() != '\:
       error(err_line, err_col, "multi-character constant")
   next_ch()
   return tk_Integer, err_line, err_col, n
        • process divide or comments

def div_or_cmt(err_line, err_col):

   if next_ch() != '*':
       return tk_Div, err_line, err_col
   # comment found
   while True:
       if next_ch() == '*' and next_ch() == '/':
           next_ch()
           return gettok()
       elif len(the_ch) == 0:
           error(err_line, err_col, "EOF in comment")
        • "string"

def string_lit(start, err_line, err_col):

   text = ""
   while next_ch() != start:
       if len(the_ch) == 0:
           error(err_line, err_col, "EOF while scanning string literal")
       if the_ch == '\n':
           error(err_line, err_col, "EOL while scanning string literal")
       text += the_ch
   next_ch()
   return tk_String, err_line, err_col, text
        • handle identifiers and integers

def ident_or_int(err_line, err_col):

   is_number = True
   text = ""
   while the_ch.isalnum() or the_ch == '_':
       text += the_ch
       if not the_ch.isdigit():
           is_number = False
       next_ch()
   if len(text) == 0:
       error(err_line, err_col, "ident_or_int: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
   if text[0].isdigit():
       if not is_number:
           error(err_line, err_col, "invalid number: %s" % (text))
       n = int(text)
       return tk_Integer, err_line, err_col, n
   if text in key_words:
       return key_words[text], err_line, err_col
   return tk_Ident, err_line, err_col, text
        • look ahead for '>=', etc.

def follow(expect, ifyes, ifno, err_line, err_col):

   if next_ch() == expect:
       next_ch()
       return ifyes, err_line, err_col
   if ifno == tk_EOI:
       error(err_line, err_col, "follow: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
   return ifno, err_line, err_col
        • return the next token type

def gettok():

   while the_ch.isspace():
       next_ch()
   err_line = the_line
   err_col  = the_col
   if len(the_ch) == 0:    return tk_EOI, err_line, err_col
   elif the_ch == '/':     return div_or_cmt(err_line, err_col)
   elif the_ch == '\:    return char_lit(err_line, err_col)
   elif the_ch == '<':     return follow('=', tk_Leq, tk_Lss, err_line, err_col)
   elif the_ch == '!':     return follow('=', tk_Neq, tk_EOI, err_line, err_col)
   elif the_ch == '&':     return follow('&', tk_And, tk_EOI, err_line, err_col)
   elif the_ch == '"':     return string_lit(the_ch, err_line, err_col)
   elif the_ch in symbols:
       sym = symbols[the_ch]
       next_ch()
       return sym, err_line, err_col
   else: return ident_or_int(err_line, err_col)
        • main driver

input_file = sys.stdin if len(sys.argv) > 1:

   try:
       input_file = open(sys.argv[1], "r", 4096)
   except IOError as e:
       error(0, 0, "Can't open %s" % sys.argv[1])

while True:

   t = gettok()
   tok  = t[0]
   line = t[1]
   col  = t[2]
   print("%5d  %5d   %-14s" % (line, col, all_syms[tok]), end=)
   if tok == tk_Integer:  print("   %5d" % (t[3]))
   elif tok == tk_Ident:  print("  %s" %   (t[3]))
   elif tok == tk_String: print('  "%s"' % (t[3]))
   else:                  print("")
   if tok == tk_EOI:
       break

</lang>

Output from test case 3:

    5     15   KEYWORD_PRINT
    5     41   OP_SUBTRACT
    6     15   KEYWORD_PUTC
    6     41   OP_LESS
    7     15   KEYWORD_IF
    7     41   OP_GREATER
    8     15   KEYWORD_WHILE
    8     41   OP_LESSEQUAL
    9     15   LEFTBRACE
    9     41   OP_NOTEQUAL
   10     15   RIGHTBRACE
   10     41   OP_AND
   11     15   LEFTPAREN
   11     41   SEMICOLON
   12     15   RIGHTPAREN
   12     41   COMMA
   13     15   OP_SUBTRACT
   13     41   OP_ASSIGN
   14     15   OP_MULTIPLY
   14     41   INTEGER             42
   15     15   OP_DIVIDE
   15     41   STRING          "String literal"
   16     15   OP_ADD
   16     41   IDENTIFIER      variable_name
   17     26   INTEGER             10
   18     26   INTEGER             92
   19     26   INTEGER             32
   20      1   END_OF_INPUT