Compiler/lexical analyzer: Difference between revisions

Content added Content deleted

Inline

Revision as of 17:43, 17 August 2016

This task has been flagged for clarification. Code on this page in its current state may be flagged incorrect once this task has been clarified. See this page's Talk page for discussion.

Definition from Wikipedia:

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).

The Task

Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.

Input Specification

The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:

Operators

Name	Common name	Character sequence
`OP_MULTIPLY`	multiply	`*`
`OP_DIVIDE`	divide	`/`
`OP_ADD`	plus	`+`
`OP_SUBTRACT`	minus	`-`
`OP_NEGATE`	unary minus	`-`
`OP_LESS`	less than	`<`
`OP_LESSEQUAL`	less than or equal	`<=`
`OP_GREATER`	greater than	`>`
`OP_NOTEQUAL`	not equal	`!=`
`OP_ASSIGN`	assignment	`=`
`OP_AND`	logical and	`&&`

The - token should always be interpreted as OP_SUBTRACT by the lexer. Turning some OP_SUBTRACT into OP_NEGATE will be the job of the syntax analyzer, which is not part of this task.

Symbols

Name	Common name	Character
`LEFTPAREN`	left parenthesis	`(`
`RIGHTPAREN`	right parenthesis	`)`
`LEFTBRACE`	left brace	`{`
`RIGHTBRACE`	right brace	`}`
`SEMICOLON`	semi-colon	`;`
`COMMA`	comma	`,`

Keywords

Name	Character sequence
`KEYWORD_IF`	`if`
`KEYWORD_WHILE`	`while`
`KEYWORD_PRINT`	`print`
`KEYWORD_PUTC`	`putc`

Other entities

These differ from the the previous tokens, in that each occurrence of them has a value associated with it.

Name	Common name	Format description	Format regex	Value
`IDENTIFIER`	identifier	one or more letter/number/underscore characters, but not starting with a number	`[_a-zA-Z][_a-zA-Z0-9]*`	as is
`INTEGER`	integer literal	one or more digits	`[0-9]+`	as is, interpreted as a number
`INTEGER`	char literal	exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes	`'([^'\n]\|\\n\|\\\\)'`	the ASCII code point number of the character
`STRING`	string literal	zero or more characters (anything except newline or double quote), enclosed by double quotes	`"[^"\n]*"`	the characters without the double quotes and with escape sequences converted

For char and string literals, the \n escape sequence is supported to represent a new-line character.
For char literals, to represent a backslash, use \\.
For char literals, an embedded single quote character is not supported.
For string literals, embedded double quote characters are not supported.
No other special sequences are supported.

White space

Zero or more whitespace characters or comments are allowed between any two tokens, with the exceptions noted below.
"Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
- This means: keywords, identifiers, and integer literals.
- e.g. ifprint is recognized as an identifier, instead of the keywords if and print.
- e.g. 42fred is invalid, and neither recognized as a number nor an identifier.
Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
- e.g. & & is invalid, and not interpreted as the && operator.

The following programs are equivalent:

<lang c> count = 1 ; n = 1 ; limit = 100 ; while ( n < limit ) {

   k = 3 ;
   p = 1 ;
   n = n + 2 ;
   while ( ( k * k <= n ) && ( p ) ) {
       p = n / k * k != n ;
       k = k + 2 ;
   }
   if ( p ) {
       print ( n , " " ) ;
       count = count + 1 ;
   }

} print ( count , "\n" ) ; </lang>

<lang c> count=1;n=1;limit=100;while(n<limit){k=3;p=1;n=n+2;while((k*k<=n)&&(p)){p=n/k*k!=n;k=k+2;}if(p){print(n," ");count=count+1;}}print(count,"\n"); </lang>

They should produce the same token stream, except for the line and column positions.

Comments enclosed in /* ... */ are also treated as whitespace outside of strings.

Complete list of token names

END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE
OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND
KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN
LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER
STRING

Output Format

The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:

the line number where the token starts
the column number where the token starts
the token name
the token value, in case of Integer, String, or Ident

Diagnostics

The following error conditions should be caught:

Error	Example
Empty character constant	`''`
Unknown escape sequence.	`\r`
Multi-character constant.	`'xx'`
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character.

Test Cases

Input

Output

<lang c> /*

 Hello world
*/

print("Hello, World!\n"); </lang>

4 1 KEYWORD_PRINT 4 6 LEFTPAREN 4 7 STRING "Hello, World!\n" 4 24 RIGHTPAREN 4 25 SEMICOLON 5 1 END_OF_INPUT

<lang c> /*

 Show Ident and Integers
*/

phoenix_number = 142857; print(phoenix_number, "\n"); </lang>

4 1 IDENTIFIER phoenix_number 4 16 OP_ASSIGN 4 18 INTEGER 142857 4 24 SEMICOLON 5 1 KEYWORD_PRINT 5 6 LEFTPAREN 5 7 IDENTIFIER phoenix_number 5 21 COMMA 5 23 STRING "\n" 5 27 RIGHTPAREN 5 28 SEMICOLON 6 1 END_OF_INPUT

<lang c> /*

 All lexical tokens - not syntactically correct, but that will
 have to wait until syntax analysis
*/

/* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* While */ while /* Leq */ <= /* Lbrace */ { /* Neq */ != /* Rbrace */ } /* And */ && /* Lparen */ ( /* Semi */ ; /* Rparen */ ) /* Comma */ , /* Uminus */ - /* Assign */ = /* Mul */ * /* Integer */ 42 /* Div */ / /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' ' </lang>

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT

Reference

The Flex, C, Python and Euphoria versions can be considered reference implementations.

C

Tested with gcc 4.81 and later, compiles warning free with -Wall -Wextra <lang C>

include <stdlib.h>
include <stdio.h>
include <stdarg.h>
include <ctype.h>
include <string.h>
include <errno.h>
include <stdbool.h>
include <limits.h>

define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

define da_dim(name, type) type *name = NULL; \

                           int _qy_ ## name ## _p = 0;  \
                           int _qy_ ## name ## _max = 0

define da_rewind(name) _qy_ ## name ## _p = 0
define da_redim(name) if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \

                               name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]))

define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
define da_len(name) _qy_ ## name ## _p

typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,

   tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

} TokenType;

typedef struct {

   int tok;
   int err_ln, err_col;
   union {
       int n;                  /* value for constants */
       char *text;             /* text for idents */
   };

} tok_s;

static FILE *source_fp, *dest_fp; static int line = 1, col = 0, the_ch = ' '; da_dim(text, char);

tok_s gettok();

static void error(int err_line, int err_col, const char *fmt, ... ) {

   char buf[1000];
   va_list ap;

   va_start(ap, fmt);
   vsprintf(buf, fmt, ap);
   va_end(ap);
   printf("(%d,%d) error: %s\n", err_line, err_col, buf);
   exit(1);

}

static int next_ch() { /* get next char from input */

   the_ch = getc(source_fp);
   ++col;
   if (the_ch == '\n') {
       ++line;
       col = 0;
   }
   return the_ch;

}

static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */

   if (the_ch == '\)
       error(err_line, err_col, "gettok: empty character constant");
   if (the_ch == '\\') {
       next_ch();
       if (the_ch == 'n')
           n = 10;
       else if (the_ch == '\\')
           n = '\\';
       else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch);
   }
   if (next_ch() != '\)
       error(err_line, err_col, "multi-character constant");
   next_ch();
   return (tok_s){tk_Integer, err_line, err_col, {n}};

}

static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */

   if (the_ch != '*')
       return (tok_s){tk_Div, err_line, err_col, {0}};

   /* comment found */
   for (;;) {
       if (next_ch() == '*' && next_ch() == '/') {
           next_ch();
           return gettok();
       } else if (the_ch == EOF)
           error(err_line, err_col, "EOF in comment");
   }

}

static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */

   da_rewind(text);

   while (next_ch() != start) {
       if (the_ch == '\n') error(err_line, err_col, "EOL in string");
       if (the_ch == EOF)  error(err_line, err_col, "EOF in string");
       da_append(text, (char)the_ch);
   }
   da_append(text, '\0');

   next_ch();
   return (tok_s){tk_String, err_line, err_col, {.text=text}};

}

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;

   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

static tok_s ident_or_int(int err_line, int err_col) {

   int n, is_number = true;

   da_rewind(text);
   while (isalnum(the_ch) || the_ch == '_') {
       da_append(text, (char)the_ch);
       if (!isdigit(the_ch))
           is_number = false;
       next_ch();
   }
   if (da_len(text) == 0)
       error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch);
   da_append(text, '\0');
   if (isdigit(text[0])) {
       if (!is_number)
           error(err_line, err_col, "invalid number: %s\n", text);
       n = strtol(text, NULL, 0);
       if (n == LONG_MAX && errno == ERANGE)
           error(err_line, err_col, "Number exceeds maximum value");
       return (tok_s){tk_Integer, err_line, err_col, {n}};
   }
   return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};

}

static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */

   if (the_ch == expect) {
       next_ch();
       return (tok_s){ifyes, err_line, err_col, {0}};
   }
   if (ifno == tk_EOI)
       error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch);
   return (tok_s){ifno, err_line, err_col, {0}};

}

tok_s gettok() { /* return the token type */

   /* skip white space */
   while (isspace(the_ch))
       next_ch();
   int err_line = line;
   int err_col  = col;
   switch (the_ch) {
       case '{':  next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}};
       case '}':  next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}};
       case '(':  next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}};
       case ')':  next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}};
       case '+':  next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}};
       case '-':  next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}};
       case '*':  next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}};
       case ';':  next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}};
       case ',':  next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}};
       case '>':  next_ch(); return (tok_s){tk_Gtr, err_line, err_col, {0}};
       case '=':  next_ch(); return (tok_s){tk_Assign, err_line, err_col, {0}};
       case '/':  next_ch(); return div_or_cmt(err_line, err_col);
       case '\: next_ch(); return char_lit(the_ch, err_line, err_col);
       case '<':  next_ch(); return follow('=', tk_Leq, tk_Lss, err_line, err_col);
       case '!':  next_ch(); return follow('=', tk_Neq, tk_EOI, err_line, err_col);
       case '&':  next_ch(); return follow('&', tk_And, tk_EOI, err_line, err_col);
       case '"' : return string_lit(the_ch, err_line, err_col);
       default:   return ident_or_int(err_line, err_col);
       case EOF:  return (tok_s){tk_EOI, err_line, err_col, {0}};
   }

}

void run() { /* tokenize the given input */

   tok_s tok;
   do {
       tok = gettok();
       fprintf(dest_fp, "%5d  %5d %.14s",
           tok.err_ln, tok.err_col,
           &"END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE     "
            "OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND        "
            "KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN    "
            "LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER       "
            "STRING        "[tok.tok * 14]);

       if (tok.tok == tk_Integer)     fprintf(dest_fp, "  %4d",   tok.n);
       else if (tok.tok == tk_Ident)  fprintf(dest_fp, " %s",     tok.text);
       else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text);
       fprintf(dest_fp, "\n");
   } while (tok.tok != tk_EOI);
   if (dest_fp != stdout)
       fclose(dest_fp);

}

void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {

   if (fn[0] == '\0')
       *fp = std;
   else if ((*fp = fopen(fn, mode)) == NULL)
       error(0, 0, "Can't open %s\n", fn);

}

int main(int argc, char *argv[]) {

   init_io(&source_fp, stdin,  "r",  argc > 1 ? argv[1] : "");
   init_io(&dest_fp,   stdout, "wb", argc > 2 ? argv[2] : "");
   run();
   return 0;

} </lang>

Output from test case 3:

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT

Euphoria

Tested with Euphoria 4.05. <lang euphoria> include std/io.e include std/map.e include std/types.e include std/convert.e

constant true = 1, false = 0, EOF = -1

enum tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq,

   tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

constant all_syms = {"END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT",

   "OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN",
   "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN",
   "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER",
   "INTEGER", "STRING"}

integer input_file, the_ch = ' ', the_col = 0, the_line = 1 sequence symbols map key_words = new()

procedure error(sequence format, sequence data)

   printf(STDOUT, format, data)
   abort(1)

end procedure

-- get the next character from the input function next_ch()

   the_ch = getc(input_file)
   the_col += 1
   if the_ch = '\n' then
       the_line += 1
       the_col = 0
   end if
   return the_ch

end function

-- 'x' - character constants function char_lit(integer err_line, integer err_col)

   integer n = next_ch()              -- skip opening quote
   if the_ch = '\ then
       error("%d %d empty character constant", {err_line, err_col})
   elsif the_ch = '\\' then
       next_ch()
       if the_ch = 'n' then
           n = 10
       elsif the_ch = '\\' then
           n = '\\'
       else
           error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch})
       end if
   end if
   if next_ch() != '\ then
       error("%d %d multi-character constant", {err_line, err_col})
   end if
   next_ch()
   return {tk_Integer, err_line, err_col, n}

end function

-- process divide or comments function div_or_cmt(integer err_line, integer err_col)

   if next_ch() != '*' then
       return {tk_Div, err_line, err_col}
   end if

   -- comment found
   while true do
       if next_ch() = '*' and next_ch() = '/' then
           next_ch()
           return get_tok()
       elsif the_ch = EOF then
           error("%d %d EOF in comment", {err_line, err_col})
       end if
   end while

end function

-- "string" function string_lit(integer start, integer err_line, integer err_col)

   string text = ""

   while next_ch() != start do
       if the_ch = EOF then
           error("%d %d EOF while scanning string literal", {err_line, err_col})
       end if
       if the_ch = '\n' then
           error("%d %d EOL while scanning string literal", {err_line, err_col})
       end if
       text &= the_ch
   end while

   next_ch()
   return {tk_String, err_line, err_col, text}

end function

-- handle identifiers and integers function ident_or_int(integer err_line, integer err_col)

   integer n, is_number = true
   string text = ""

   while t_alnum(the_ch) or the_ch = '_' do
       text &= the_ch
       if not t_digit(the_ch) then
           is_number = false
       end if
       next_ch()
   end while

   if length(text) = 0 then
       error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch})
   end if

   if t_digit(text[1]) then
       if not is_number then
           error("%d %d invalid number: %s", {err_line, err_col, text})
       end if
       n = to_integer(text)
       return {tk_Integer, err_line, err_col, n}
   end if

   if has(key_words, text) then
       return {get(key_words, text), err_line, err_col}
   end if

   return {tk_Ident, err_line, err_col, text}

end function

-- look ahead for '>=', etc. function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)

   if next_ch() = expect then
       next_ch()
       return {ifyes, err_line, err_col}
   end if

   if ifno = tk_EOI then
       error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch})
   end if

   return {ifno, err_line, err_col}

end function

-- return the next token type function get_tok()

   while t_space(the_ch) do
       next_ch()
   end while

   integer err_line = the_line
   integer err_col  = the_col

   switch the_ch do
       case EOF  then return {tk_EOI, err_line, err_col}
       case '/'  then return div_or_cmt(err_line, err_col)
       case '\ then return char_lit(err_line, err_col)
       case '<'  then return follow('=', tk_Leq, tk_Lss, err_line, err_col)
       case '!'  then return follow('=', tk_Neq, tk_EOI, err_line, err_col)
       case '&'  then return follow('&', tk_And, tk_EOI, err_line, err_col)
       case '"'  then return string_lit(the_ch, err_line, err_col)
       case else
           integer sym = symbols[the_ch]
           if sym  != tk_EOI then
               next_ch()
               return {sym, err_line, err_col}
           end if
           return ident_or_int(err_line, err_col)
   end switch

end function

procedure init()

   put(key_words, "if",      tk_If)
   put(key_words, "print",   tk_Print)
   put(key_words, "putc",    tk_Putc)
   put(key_words, "while",   tk_While)

   symbols = repeat(tk_EOI, 256)
   symbols['{'] = tk_Lbrace
   symbols['}'] = tk_Rbrace
   symbols['('] = tk_Lparen
   symbols[')'] = tk_Rparen
   symbols['+'] = tk_Add
   symbols['-'] = tk_Sub
   symbols['*'] = tk_Mul
   symbols[';'] = tk_Semi
   symbols[','] = tk_Comma
   symbols['>'] = tk_Gtr
   symbols['='] = tk_Assign

end procedure

procedure main(sequence cl)

   sequence file_name

   input_file = STDIN
   if length(cl) > 2 then
       file_name = cl[3]
       input_file = open(file_name, "r")
       if input_file = -1 then
           error("Could not open %s", {file_name})
       end if
   end if
   init()
   sequence t
   loop do
       t = get_tok()
       printf(STDOUT, "%5d  %5d %-8s", {t[2], t[3], all_syms[t[1]]})
       switch t[1] do
           case tk_Integer then    printf(STDOUT, "  %5d\n",   {t[4]})
           case tk_Ident then printf(STDOUT, " %s\n",     {t[4]})
           case tk_String  then    printf(STDOUT, " \"%s\"\n", {t[4]})
           case else            printf(STDOUT, "\n")
       end switch
       until t[1] = tk_EOI
   end loop

end procedure

main(command_line()) </lang>

Output from test case 3:

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT

Flex

Tested with Flex 2.5.4. <lang C> %{

include <stdio.h>
include <stdlib.h>
include <string.h>
include <errno.h>
include <limits.h>

define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,

   tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen,
   tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String

} TokenType;

void yyerror(char msg[]) {

   printf(msg);
   exit(1);

}

static int yynval;

struct yylloc {

   int first_line, first_col;
   int last_line, last_col;

} yylloc;

static void update_loc() {

 static int curr_line = 1;
 static int curr_col  = 1;

 yylloc.first_line = curr_line;
 yylloc.first_col  = curr_col;

 {char *s; for (s = yytext; *s != '\0'; s++) {
   if (*s == '\n') {
     curr_line++;
     curr_col = 1;
   } else {
     curr_col++;
   }
 }}

 yylloc.last_line = curr_line;
 yylloc.last_col  = curr_col-1;

}

define YY_USER_ACTION update_loc();

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;

   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

%}

%start COMMENT2

%option noyywrap

digit [0-9] ident [a-zA-Z_][a-zA-Z_0-9]*

number {digit}+ string \"[^"\n]*\" char_const \'([^'\n]|\\n|\\\\)\'

%%

<COMMENT2>[^*]+ ; <COMMENT2>\*[^/] ; <COMMENT2>\*\/ BEGIN 0; /* end comment */ "/*" BEGIN COMMENT2;

"{" {return tk_Lbrace;} "}" {return tk_Rbrace;} "(" {return tk_Lparen;} ")" {return tk_Rparen;} "*" {return tk_Mul;} "/" {return tk_Div;} "+" {return tk_Add;} "-" {return tk_Sub;} "<" {return tk_Lss;} ">" {return tk_Gtr;} "<=" {return tk_Leq;} "!=" {return tk_Neq;} "&&" {return tk_And;} ";" {return tk_Semi;} "," {return tk_Comma;} "=" {return tk_Assign;} {ident} {return get_ident_type(yytext);} {string} {return tk_String;}

[ \t\n]+ ; /* ignore whitespace */

{number} {

               yynval = strtol(yytext, NULL, 0);
               if (yynval == LONG_MAX && errno == ERANGE)
                   yyerror("Number exceeds maximum value");

               return tk_Integer;
            }

{char_const} {

               int n = yytext[1];
               char *p = yytext;

               if (yyleng < 3)
                   yyerror("empty character constant");
               ++p;
               if (p[0] == '\\') {
                   ++p;
                   if (p[0] == 'n')
                       n = 10;
                   else if (p[0] == '\\')
                       n = '\\';
                   else
                       yyerror("unknown escape sequence");
               }
               yynval = n;
               return tk_Integer;
            }

. yyerror("Unknown character\n");

%%

int main(int argc, char *argv[]) {

   int tok;

   ++argv, --argc;  /* skip over program name */
   yyin = stdin;
   if (argc > 0)
       yyin = fopen(argv[0], "r");

   do {
       tok = yylex();
       printf("%5d  %5d %.14s", yylloc.first_line, yylloc.first_col,
           &"END_OF_INPUT  OP_MULTIPLY   OP_DIVIDE     OP_ADD        OP_SUBTRACT   OP_NEGATE     "
            "OP_LESS       OP_LESSEQUAL  OP_GREATER    OP_NOTEQUAL   OP_ASSIGN     OP_AND        "
            "KEYWORD_IF    KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC  LEFTPAREN     RIGHTPAREN    "
            "LEFTBRACE     RIGHTBRACE    SEMICOLON     COMMA         IDENTIFIER    INTEGER       "
            "STRING        "[tok * 14]);

       if (tok == tk_Integer)     printf("   %5d", yynval);
       else if (tok == tk_Ident)  printf("  %s",   yytext);
       else if (tok == tk_String) printf("  %s",   yytext);
       printf("\n");
   } while (tok != tk_EOI);
   return 0;

} </lang>

Output from test case 3:

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 19 29 END_OF_INPUT

FreeBASIC

Tested with FreeBASIC 1.05 <lang FreeBASIC> enum Token_type

   tk_EOI
   tk_Mul
   tk_Div
   tk_Add
   tk_Sub
   tk_Uminus
   tk_Lss
   tk_Leq
   tk_Gtr
   tk_Neq
   tk_Assign
   tk_And
   tk_If
   tk_While
   tk_Print
   tk_Putc
   tk_Lparen
   tk_Rparen
   tk_Lbrace
   tk_Rbrace
   tk_Semi
   tk_Comma
   tk_Ident
   tk_Integer
   tk_String

end enum

const NewLine = chr(10) const DoubleQuote = chr(34)

' where we store keywords and variables type Symbol

   s_name as string
   tok as Token_type

end type

dim shared symtab() as Symbol

dim shared cur_line as string dim shared cur_ch as string dim shared line_num as integer dim shared col_num as integer

function is_digit(byval ch as string) as long

   is_digit = (ch <> "") and ch >= "0" and ch <= "9"

end function

function is_alnum(byval ch as string) as long

   is_alnum = (ch <> "") and ((UCase(ch) >= "A" and UCase(ch) <= "Z") or (is_digit(ch)))

end function

sub error_msg(byval eline as integer, byval ecol as integer, byval msg as string)

   print "("; eline; ":"; ecol; ")"; " "; msg
   system

end sub

' add an identifier to the symbol table function install(byval s_name as string, byval tok as Token_type) as integer

   dim n as integer

   n = ubound(symtab)
   redim preserve symtab(n + 1)
   n = ubound(symtab)

   symtab(n).s_name = s_name
   symtab(n).tok    = tok
   return n

end function

' search for an identifier in the symbol table function lookup(byval s_name as string) as integer

   dim i as integer

   for i = lbound(symtab) to ubound(symtab)
       if symtab(i).s_name = s_name then return i
   next
   return -1

end function

sub next_line() ' read the next line of input from the source file

   cur_line = ""
   cur_ch  = ""        ' empty cur_ch means end-of-file
   if eof(1) then exit sub
   line input #1, cur_line
   cur_line = cur_line + NewLine
   line_num += + 1
   col_num = 1

end sub

sub next_char() ' get the next char

   cur_ch = ""
   col_num += 1
   if col_num > len(cur_line) then next_line()
   if col_num <= len(cur_line) then cur_ch = mid(cur_line, col_num, 1)

end sub

function follow(byval err_line as integer, byval err_col as integer, byval expect as string, byval ifyes as Token_type, byval ifno as Token_type) as Token_type

   if cur_ch = expect then
       next_char()
       return ifyes
   end if
   if ifno = tk_eoi then error_msg(err_line, err_col, "follow unrecognized character: " + cur_ch)
   return ifno

end function

sub gettok(byref err_line as integer, byref err_col as integer, byref tok as Token_type, byref v as string)

   ' skip whitespace
   do while (cur_ch = " " or cur_ch = chr(9) or cur_ch = NewLine) and (cur_ch <> "")
       next_char()
   loop

   err_line = line_num
   err_col  = col_num

   select case cur_ch
       case "":  tok = tk_eoi: exit sub
       case "{": tok = tk_lbrace: next_char(): exit sub
       case "}": tok = tk_rbrace: next_char(): exit sub
       case "(": tok = tk_lparen: next_char(): exit sub
       case ")": tok = tk_rparen: next_char(): exit sub
       case "+": tok = tk_add:    next_char(): exit sub
       case "-": tok = tk_sub:    next_char(): exit sub
       case "*": tok = tk_mul:    next_char(): exit sub
       case ";": tok = tk_semi:   next_char(): exit sub
       case ",": tok = tk_comma:  next_char(): exit sub
       case ">": tok = tk_gtr:    next_char(): exit sub
       case "=": tok = tk_assign: next_char(): exit sub
       case "/": ' div or comment
           next_char()
           if cur_ch <> "*" then
               tok = tk_div
               exit sub
           end if
           ' skip comments
           do
               next_char()
               if cur_ch = "*" or cur_ch = "" then
                   next_char()
                   if cur_ch = "/" or cur_ch = "" then
                       next_char()
                       gettok(err_line, err_col, tok, v)
                       exit sub
                   end if
               end if
           loop
       case "'":   ' single char literals
           next_char()
           v = str(Asc(cur_ch))
           if cur_ch = "'" then error_msg(err_line, err_col, "empty character constant")
           if cur_ch = "\" then
               next_char()
               if cur_ch = "n" then
                   v = "10"
               elseif cur_ch = "\" then
                   v = Str(Asc("\"))
               else error_msg(err_line, err_col, "unknown escape sequence: " + cur_ch)
               end if
           end if
           next_char()
           if cur_ch <> "'" then error_msg(err_line, err_col, "multi-character constant")
           next_char()
           tok = tk_integer
           exit sub
       case "<": next_char(): tok = follow(err_line, err_col, "=", tk_Leq, tk_Lss): exit sub
       case "!": next_char(): tok = follow(err_line, err_col, "=", tk_Neq, tk_EOI): exit sub
       case "&": next_char(): tok = follow(err_line, err_col, "&", tk_And, tk_EOI): exit sub
       case DoubleQuote: ' string
           v = cur_ch
           next_char()
           do while cur_ch <> DoubleQuote
               if cur_ch = NewLine then error_msg(err_line, err_col, "EOL in string")
               if cur_ch = "" then error_msg(err_line, err_col, "EOF in string")
               v += cur_ch
               next_char()
           loop
           v += cur_ch
           next_char()
           tok = tk_string
           exit sub
       case else   ' integers or identifiers
           dim is_number as boolean = is_digit(cur_ch)
           v = ""
           do while is_alnum(cur_ch) orelse cur_ch = "_"
               if not is_digit(cur_ch) then is_number = false
               v += cur_ch
               next_char()
           loop
           if len(v) = 0 then error_msg(err_line, err_col, "unknown character: " + cur_ch)
           if is_digit(mid(v, 1, 1)) then
               if not is_number then error_msg(err_line, err_col, "invalid number: " + v)
               tok = tk_integer
               exit sub
           end if
           dim as integer index = lookup(v)
           if index = -1 then
               tok = tk_ident
           else
               tok = symtab(index).tok
           end if
           exit sub
   end select

end sub

sub init_lex(byval filein as string)

   install("if",    tk_if)
   install("print", tk_print)
   install("putc",  tk_putc)
   install("while", tk_while)

   open filein for input as #1

   cur_line = ""
   line_num = 0
   col_num = 0
   next_char()

end sub

sub scanner()

   dim err_line as integer
   dim err_col as integer
   dim tok as Token_type
   dim v as string
   dim tok_list(tk_eoi to tk_string) as string

   tok_list(tk_EOI    ) = "END_OF_INPUT"
   tok_list(tk_Mul    ) = "OP_MULTIPLY"
   tok_list(tk_Div    ) = "OP_DIVIDE"
   tok_list(tk_Add    ) = "OP_ADD"
   tok_list(tk_Sub    ) = "OP_SUBTRACT"
   tok_list(tk_Uminus ) = "OP_NEGATE"
   tok_list(tk_Lss    ) = "OP_LESS"
   tok_list(tk_Leq    ) = "OP_LESSEQUAL"
   tok_list(tk_Gtr    ) = "OP_GREATER"
   tok_list(tk_Neq    ) = "OP_NOTEQUAL"
   tok_list(tk_Assign ) = "OP_ASSIGN"
   tok_list(tk_And    ) = "OP_AND"
   tok_list(tk_If     ) = "KEYWORD_IF"
   tok_list(tk_While  ) = "KEYWORD_WHILE"
   tok_list(tk_Print  ) = "KEYWORD_PRINT"
   tok_list(tk_Putc   ) = "KEYWORD_PUTC"
   tok_list(tk_Lparen ) = "LEFTPAREN"
   tok_list(tk_Rparen ) = "RIGHTPAREN"
   tok_list(tk_Lbrace ) = "LEFTBRACE"
   tok_list(tk_Rbrace ) = "RIGHTBRACE"
   tok_list(tk_Semi   ) = "SEMICOLON"
   tok_list(tk_Comma  ) = "COMMA"
   tok_list(tk_Ident  ) = "IDENTIFIER"
   tok_list(tk_Integer) = "INTEGER"
   tok_list(tk_String ) = "STRING"

   do
       gettok(err_line, err_col, tok, v)
       print using "#####  ##### \              \"; err_line; err_col; tok_list(tok);
       if tok = tk_integer orelse tok = tk_ident orelse tok = tk_string then print " " + v;
       print
   loop until tok = tk_eoi

end sub

sub main()

   if command(1) = "" then print "filename required" : system
   init_lex(command(1))
   scanner()

end sub

main() system </lang>

Output from test case 3:

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 19 30 END_OF_INPUT

Perl

<lang perl>#!/usr/bin/env perl

use strict; use warnings; no warnings 'once';

----- Definition of the language to be lexed -----#

my @tokens = (

   # Name          | Format               | Value       #
   # --------------|----------------------|-------------#
   ['OP_MULTIPLY'  , '*'                  ,             ],
   ['OP_DIVIDE'    , '/'                  ,             ],
   ['OP_ADD'       , '+'                  ,             ],
   ['OP_SUBTRACT'  , '-'                  ,             ],
   ['OP_LESSEQUAL' , '<='                 ,             ],
   ['OP_LESS'      , '<'                  ,             ],
   ['OP_GREATER'   , '>'                  ,             ],
   ['OP_NOTEQUAL'  , '!='                 ,             ],
   ['OP_ASSIGN'    , '='                  ,             ],
   ['OP_AND'       , '&&'                 ,             ],
   ['KEYWORD_IF'   , qr/if\b/             ,             ],
   ['KEYWORD_WHILE', qr/while\b/          ,             ],
   ['KEYWORD_PRINT', qr/print\b/          ,             ],
   ['KEYWORD_PUTC' , qr/putc\b/           ,             ],

   ['LEFTPAREN'    , '('                  ,             ],
   ['RIGHTPAREN'   , ')'                  ,             ],
   ['LEFTBRACE'    , '{'                  ,             ],
   ['RIGHTBRACE'   , '}'                  ,             ],
   ['SEMICOLON'    , ';'                  ,             ],
   ['COMMA'        , ','                  ,             ],

   ['IDENTIFIER'   , qr/[_a-z][_a-z0-9]*/i, \&raw       ],
   ['INTEGER'      , qr/[0-9]+\b/         , \&raw       ],
   ['INTEGER'      , qr/'([^']*)(')?/     , \&char_val  ],
   ['STRING'       , qr/"([^"]*)(")?/     , \&string_raw],
   
   ['END_OF_INPUT' , qr/$/                ,             ],

);

my $comment = qr/\/\* .+? (?: \*\/ | $ (?{die "End-of-file in comment\n"}) )/xs; my $whitespace = qr/(?: \s | $comment)*/x; my $unrecognized = qr/\w+? \b | ./x;

| Returns the value of a matched char literal, or dies if it is invalid

sub char_val {

   my $str = string_val();
   die "Multiple characters\n" if length $str > 1;
   die "No character\n"        if length $str == 0;
   ord $str;

}

| Returns the value of a matched string literal, or dies if it is invalid

sub string_val {

   my ($str, $end) = ($1, $2);
   die "End-of-file\n" if not defined $end;
   die "End-of-line\n" if $str =~ /\n/;
   $str =~ s/\\(.)/
         $1 eq 'n'  ? "\n"
       : $1 eq '\\' ? $1
       : $1 eq $end ? $1
       : die "Unknown escape sequence \\$1\n"
   /rge;

}

| Returns the source string of a matched literal

sub raw { $& }

| Returns the source string of a matched string literal, or dies if invalid

sub string_raw {

   string_val(); # Just for the error handling side-effects
   $&;

}

----- Lexer "engine" -----#

Construct the scanner regex:

my $tokens =

   join "|",
   map {
       my $format = $tokens[$_][1];
       "\n".(ref $format ? $format : quotemeta $format)." (*MARK:$_) ";
   } 0..$#tokens;

my $regex = qr/

   \G (?| $whitespace  \K (?| $tokens )
        | $whitespace? \K ($unrecognized) (*MARK:!) )

/x;

Run the lexer:

my $input = do { local $/ = undef; <STDIN> }; my $pos = 0; my $linecol = linecol_accumulator();

while ($input =~ /$regex/g) {

   # Get the line and column number
   my ($line, $col) = $linecol->(substr $input, $pos, $-[0] - $pos);
   $pos = $-[0];
   
   # Get the token type that was identified by the scanner regex
   my $type = $main::REGMARK;
   die "Unrecognized token $1 at line $line, col $col\n" if $type eq '!';
   my ($name, $evaluator) = @{$tokens[$type]}[0, 2];
   
   # Get the token value
   my $value;
   if ($evaluator) {
       eval { $value = $evaluator->() };
       if ($@) { chomp $@; die "$@ in $name at line $line, col $col\n" }
   }
   
   # Print the output line
   print "$line\t$col\t$name".($value ? "\t$value" : )."\n";

}

| Returns a closure, which can be fed a string one piece at a time and gives
| back the cumulative line and column number each time

sub linecol_accumulator {

   my ($line, $col) = (1, 1);
   sub {
       my $str = shift;
       my @lines = split "\n", $str, -1;
       my ($l, $c) = (@lines - 1, length $lines[-1]);
       if ($l) { $line += $l;  $col = 1 + $c }
       else    { $col += $c }
       ($line, $col)
   }

}</lang>

Output:

(for test-case 3)

5	15	KEYWORD_PRINT
5	41	OP_SUBTRACT
6	15	KEYWORD_PUTC
6	41	OP_LESS
7	15	KEYWORD_IF
7	41	OP_GREATER
8	15	KEYWORD_WHILE
8	41	OP_LESSEQUAL
9	15	LEFTBRACE
9	41	OP_NOTEQUAL
10	15	RIGHTBRACE
10	41	OP_AND
11	15	LEFTPAREN
11	41	SEMICOLON
12	15	RIGHTPAREN
12	41	COMMA
13	15	OP_SUBTRACT
13	41	OP_ASSIGN
14	15	OP_MULTIPLY
14	41	INTEGER	42
15	15	OP_DIVIDE
15	41	STRING	"String literal"
16	15	OP_ADD
16	41	IDENTIFIER	variable_name
17	26	INTEGER	10
18	26	INTEGER	92
19	26	INTEGER	32
20	1	END_OF_INPUT

Python

Tested with Python 2.7 and 3.x <lang Python> from __future__ import print_function import sys

following two must remain in the same order

tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq, \ tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, \ tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String = range(25)

all_syms = ["END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT",

   "OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN",
   "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN",
   "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER",
   "INTEGER", "STRING"]

single character only symbols

symbols = { '{': tk_Lbrace, '}': tk_Rbrace, '(': tk_Lparen, ')': tk_Rparen, '+': tk_Add, '-': tk_Sub,

   '*': tk_Mul, ';': tk_Semi, ',': tk_Comma, '>': tk_Gtr, '=': tk_Assign }

key_words = {'if': tk_If, 'print': tk_Print, 'putc': tk_Putc, 'while': tk_While}

the_ch = " " # dummy first char - but it must be a space the_col = 0 the_line = 1 input_file = None

- - - show error and exit

def error(line, col, msg):

   print(line, col, msg)
   exit(1)

- - - get the next character from the input

def next_ch():

   global the_ch, the_col, the_line

   the_ch = input_file.read(1)
   the_col += 1
   if the_ch == '\n':
       the_line += 1
       the_col = 0
   return the_ch

- - - 'x' - character constants

def char_lit(err_line, err_col):

   n = ord(next_ch())              # skip opening quote
   if the_ch == '\:
       error(err_line, err_col, "empty character constant")
   elif the_ch == '\\':
       next_ch()
       if the_ch == 'n':
           n = 10
       elif the_ch == '\\':
           n = ord('\\')
       else:
           error(err_line, err_col, "unknown escape sequence \\%c" % (the_ch))
   if next_ch() != '\:
       error(err_line, err_col, "multi-character constant")
   next_ch()
   return tk_Integer, err_line, err_col, n

- - - process divide or comments

def div_or_cmt(err_line, err_col):

   if next_ch() != '*':
       return tk_Div, err_line, err_col

   # comment found
   while True:
       if next_ch() == '*' and next_ch() == '/':
           next_ch()
           return gettok()
       elif len(the_ch) == 0:
           error(err_line, err_col, "EOF in comment")

- - - "string"

def string_lit(start, err_line, err_col):

   text = ""

   while next_ch() != start:
       if len(the_ch) == 0:
           error(err_line, err_col, "EOF while scanning string literal")
       if the_ch == '\n':
           error(err_line, err_col, "EOL while scanning string literal")
       text += the_ch

   next_ch()
   return tk_String, err_line, err_col, text

- - - handle identifiers and integers

def ident_or_int(err_line, err_col):

   is_number = True
   text = ""

   while the_ch.isalnum() or the_ch == '_':
       text += the_ch
       if not the_ch.isdigit():
           is_number = False
       next_ch()

   if len(text) == 0:
       error(err_line, err_col, "ident_or_int: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))

   if text[0].isdigit():
       if not is_number:
           error(err_line, err_col, "invalid number: %s" % (text))
       n = int(text)
       return tk_Integer, err_line, err_col, n

   if text in key_words:
       return key_words[text], err_line, err_col

   return tk_Ident, err_line, err_col, text

- - - look ahead for '>=', etc.

def follow(expect, ifyes, ifno, err_line, err_col):

   if next_ch() == expect:
       next_ch()
       return ifyes, err_line, err_col

   if ifno == tk_EOI:
       error(err_line, err_col, "follow: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))

   return ifno, err_line, err_col

- - - return the next token type

def gettok():

   while the_ch.isspace():
       next_ch()

   err_line = the_line
   err_col  = the_col

   if len(the_ch) == 0:    return tk_EOI, err_line, err_col
   elif the_ch == '/':     return div_or_cmt(err_line, err_col)
   elif the_ch == '\:    return char_lit(err_line, err_col)
   elif the_ch == '<':     return follow('=', tk_Leq, tk_Lss, err_line, err_col)
   elif the_ch == '!':     return follow('=', tk_Neq, tk_EOI, err_line, err_col)
   elif the_ch == '&':     return follow('&', tk_And, tk_EOI, err_line, err_col)
   elif the_ch == '"':     return string_lit(the_ch, err_line, err_col)
   elif the_ch in symbols:
       sym = symbols[the_ch]
       next_ch()
       return sym, err_line, err_col
   else: return ident_or_int(err_line, err_col)

- - - main driver

input_file = sys.stdin if len(sys.argv) > 1:

   try:
       input_file = open(sys.argv[1], "r", 4096)
   except IOError as e:
       error(0, 0, "Can't open %s" % sys.argv[1])

while True:

   t = gettok()
   tok  = t[0]
   line = t[1]
   col  = t[2]

   print("%5d  %5d   %-14s" % (line, col, all_syms[tok]), end=)

   if tok == tk_Integer:  print("   %5d" % (t[3]))
   elif tok == tk_Ident:  print("  %s" %   (t[3]))
   elif tok == tk_String: print('  "%s"' % (t[3]))
   else:                  print("")

   if tok == tk_EOI:
       break

</lang>

Output from test case 3:

5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT

@@ Line 84: / Line 84: @@
 ;Other entities
-These differ from the other tokens, in that each occurrence of them has a value associated with it.
+These differ from the the previous tokens, in that each occurrence of them has a value associated with it.
 {| class="wikitable"
@@ Line 119: / Line 119: @@
 |}
-* For char and string literals, <code>\n</code> is supported as a new line character.
+* For char and string literals, the <code>\n</code> escape sequence is supported to represent a new-line character.
 * For char literals, to represent a backslash, use <code>\\</code>.
 * For char literals, an embedded single quote character is not supported.
@@ Line 128: / Line 128: @@
 * Zero or more whitespace characters or comments are allowed between any two tokens, with the exceptions noted below.
-* "longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
+* "Longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
-* Whitespace is only required in the following situations:
+* Whitespace is ''required'' between two tokens that have an alphanumeric character or underscore at the edge.
+** This means: keywords, identifiers, and integer literals.
-** To distinguish between keywords:
-*** ifprint - is recognized as an identifier, instead of the keywords '''if''' and '''print'''.
+** e.g. <code>ifprint</code> is recognized as an identifier, instead of the keywords <tt>if</tt> and <tt>print</tt>.
+** e.g. <code>42fred</code> is invalid, and neither recognized as a number nor an identifier.
-** To distinguish between keywords and integers:
+* Whitespace is ''not allowed'' inside of tokens (except for chars and strings where they are part of the value).
-*** 42fred - is an invalid number or invalid identifier.
+** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator.
-* Whitespace is not allowed between:
-** Multi-character operators:  These cannot be recognized unless they occur without embedded whitespace: '''&&''' '''<='''.
 The following programs are equivalent: