Compiler/lexical analyzer: Difference between revisions

Content added Content deleted

Inline

Revision as of 15:54, 22 October 2016

Lexical Analyzer

Definition from Wikipedia:

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).

Task

Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.

Input Specification

The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:

Operators

Name	Common name	Character sequence
`Op_multiply`	multiply	`*`
`Op_divide`	divide	`/`
`Op_mod`	mod	`%`
`Op_add`	plus	`+`
`Op_subtract`	minus	`-`
`Op_negate`	unary minus	`-`
`Op_less`	less than	`<`
`Op_lessequal`	less than or equal	`<=`
`Op_greater`	greater than	`>`
`Op_greaterequal`	greater than or equal	`>=`
`Op_equal`	equal	`==`
`Op_notequal`	not equal	`!=`
`Op_not`	unary not	`!`
`Op_assign`	assignment	`=`
`Op_and`	logical and	`&&`
`Op_or`	logical or	`¦¦`

The - token should always be interpreted as Op_subtract by the lexer. Turning some Op_subtract into Op_negate will be the job of the syntax analyzer, which is not part of this task.

Symbols

Name	Common name	Character
`LeftParen`	left parenthesis	`(`
`RightParen`	right parenthesis	`)`
`LeftBrace`	left brace	`{`
`RightBrace`	right brace	`}`
`Semicolon`	semi-colon	`;`
`Comma`	comma	`,`

Keywords

Name	Character sequence
`Keyword_if`	`if`
`Keyword_else`	`else`
`Keyword_while`	`while`
`Keyword_print`	`print`
`Keyword_putc`	`putc`

Identifiers and literals

These differ from the the previous tokens, in that each occurrence of them has a value associated with it.

Name	Common name	Format description	Format regex	Value
`Identifier`	identifier	one or more letter/number/underscore characters, but not starting with a number	`[_a-zA-Z][_a-zA-Z0-9]*`	as is
`Integer`	integer literal	one or more digits	`[0-9]+`	as is, interpreted as a number
`Integer`	char literal	exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes	`'([^'\n]\|\\n\|\\\\)'`	the ASCII code point number of the character, e.g. 65 for `'A'` and 10 for `'\n'`
`String`	string literal	zero or more characters (anything except newline or double quote), enclosed by double quotes	`"[^"\n]*"`	the characters without the double quotes and with escape sequences converted

For char and string literals, the \n escape sequence is supported to represent a new-line character.
For char literals, to represent a backslash, use \\.
No other special sequences are supported. This means that:
- Char literals cannot represent a single quote character (value 39).
- String literals cannot represent strings containing double quote characters.

Zero-width tokens

Name	Location
`End_of_input`	when the end of the input stream is reached

White space

Zero or more whitespace characters, or comments enclosed in /* ... */, are allowed between any two tokens, with the exceptions noted below.
"Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
- This means: keywords, identifiers, and integer literals.
- e.g. ifprint is recognized as an identifier, instead of the keywords if and print.
- e.g. 42fred is invalid, and neither recognized as a number nor an identifier.
Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
- e.g. & & is invalid, and not interpreted as the && operator.

For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:

<lang c>if ( p /* meaning n is prime */ ) {

   print ( n , " " ) ;
   count = count + 1 ; /* number of primes found so far */

}</lang>

<lang c>if(p){print(n," ");count=count+1;}</lang>

Complete list of token names

End_of_input  Op_multiply   Op_divide     Op_mod       Op_add     Op_subtract
Op_negate     Op_not        Op_less       Op_lessequal Op_greater Op_greaterequal
Op_equal      Op_notequal   Op_assign     Op_and       Op_or      Keyword_if
Keyword_else  Keyword_while Keyword_print Keyword_putc LeftParen  RightParen
LeftBrace     RightBrace    Semicolon     Comma        Identifier Integer
String

Output Format

The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:

the line number where the token starts
the column number where the token starts
the token name
the token value (only for Identifier, Integer, and String tokens)
the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.

Diagnostics

The following error conditions should be caught:

Error	Example
Empty character constant	`''`
Unknown escape sequence.	`\r`
Multi-character constant.	`'xx'`
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character.
Invalid number. Starts like a number, but ends in non-numeric characters.	`123abc`

Test Cases

Input

Output

<lang c>/*

 Hello world
*/

print("Hello, World!\n");</lang>

4 1 Keyword_print 4 6 LeftParen 4 7 String "Hello, World!\n" 4 24 RightParen 4 25 Semicolon 5 1 End_of_input

<lang c>/*

 Show Ident and Integers
*/

phoenix_number = 142857; print(phoenix_number, "\n");</lang>

4 1 Identifier phoenix_number 4 16 Op_assign 4 18 Integer 142857 4 24 Semicolon 5 1 Keyword_print 5 6 LeftParen 5 7 Identifier phoenix_number 5 21 Comma 5 23 String "\n" 5 27 RightParen 5 28 Semicolon 6 1 End_of_input

<lang c>/*

 All lexical tokens - not syntactically correct, but that will
 have to wait until syntax analysis
*/

/* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* Else */ else /* Leq */ <= /* While */ while /* Geq */ >= /* Lbrace */ { /* Eq */ == /* Rbrace */ } /* Neq */ != /* Lparen */ ( /* And */ && /* Rparen */ ) /* Or */ || /* Uminus */ - /* Semi */ ; /* Not */ ! /* Comma */ , /* Mul */ * /* Assign */ = /* Div */ / /* Integer */ 42 /* Mod */ % /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' '</lang>

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input

Reference

The C and Python versions can be considered reference implementations.

ALGOL W

<lang algolw>begin

   %lexical analyser %
   % Algol W strings are limited to 256 characters in length so we limit source lines %
   % and tokens to 256 characters %

   integer     lineNumber, columnNumber;
   string(256) line;
   string(256) tkValue;
   integer     tkType, tkLine, tkColumn, tkLength, tkIntegerValue;
   logical     tkTooLong;
   string(1)   currChar;
   string(1)   newlineChar;

   integer     LINE_WIDTH, MAX_TOKEN_LENGTH, MAXINTEGER_OVER_10, MAXINTEGER_MOD_10;
   integer     tOp_multiply   , tOp_divide        , tOp_mod       , tOp_add
         ,     tOp_subtract   , tOp_negate        , tOp_less      , tOp_lessequal
         ,     tOp_greater    , tOp_greaterequal  , tOp_equal     , tOp_notequal
         ,     tOp_not        , tOp_assign        , tOp_and       , tOp_or
         ,     tLeftParen     , tRightParen       , tLeftBrace    , tRightBrace
         ,     tSemicolon     , tComma            , tKeyword_if   , tKeyword_else
         ,     tKeyword_while , tKeyword_print    , tKeyword_putc , tIdentifier
         ,     tInteger       , tString           , tEnd_of_input , tComment
         ;

   string(16)  array tkName ( 1 :: 32 );

   % reports an error %
   procedure lexError( string(80) value message ); begin
       integer errorPos;
       write( i_w := 1, s_w := 0, "**** Error at(", lineNumber, ",", columnNumber, "): " );
       errorPos := 0;
       while errorPos < 80 and message( errorPos // 1 ) not = "." do begin
           writeon( s_w := 0, message( errorPos // 1 ) );
           errorPos := errorPos + 1
       end while_not_at_end_of_message ;
       writeon( s_w := 0, "." )
   end lexError ;

   % gets the next source character %
   procedure nextChar ; begin
       if      columnNumber = LINE_WIDTH then begin
           currChar     := newlineChar;
           columnNumber := columnNumber + 1
           end
       else if columnNumber > LINE_WIDTH then begin
           readcard( line );
           columnNumber := 1;
           if not XCPNOTED(ENDFILE) then lineNumber := lineNumber + 1;
           currChar     := line( 0 // 1 )
           end
       else begin
           currChar     := line( columnNumber // 1 );
           columnNumber := columnNumber + 1
       end
   end nextChar ;

   % gets the next token, returns the token type %
   integer procedure nextToken ; begin

       % returns true if currChar is in the inclusive range lowerValue to upperValue %
       %         false otherwise %
       logical procedure range( string(1) value lowerValue, upperValue ) ; begin
           currChar >= lowerValue and currChar <= upperValue
       end range ;

       % returns true if the current character can start an identifier, false otherwise %
       logical procedure identifierStartChar ; begin
           currChar = "_" or range( "a", "z" ) or range( "A", "Z" )
       end identifierStartChar ;

       % add the current character to the token and get the next %
       procedure addAndNextChar ; begin
           if tkLength >= MAX_TOKEN_LENGTH then tkTooLong := true
           else begin
               tkValue( tkLength // 1 ) := currChar;
               tkLength                 := tkLength + 1
           end if_symbol_not_too_long ;
           nextChar
       end % addAndNextChar % ;

       % handle a single character token %
       procedure singleCharToken( integer value tokenType ) ; begin
           tkType := tokenType;
           nextChar
       end singleCharToken ;

       % handle a doubled character token: && or || %
       procedure doubleCharToken( integer value tokenType ) ; begin
           string(1) firstChar;
           firstChar := currChar;
           tkType    := tokenType;
           nextChar;
           if currChar = firstChar then nextChar
           else % the character wasn't doubled % lexError( "Unrecognised character." );
       end singleCharToken ;

       % handle an operator or operator= token %
       procedure opOrOpEqual( integer value opToken, opEqualToken ) ; begin
           tkType := opToken;
           nextChar;
           if currChar = "=" then begin
               % have operator= %
               tkType := opEqualToken;
               nextChar
           end if_currChar_is_equal ;
       end opOrOpEqual ;

       % handle a / operator or /* comment %
       procedure divideOrComment ; begin
           tkType := tOp_divide;
           nextChar;
           if currChar = "*" then begin
               % have a comment %
               logical moreComment;
               tkType      := tComment;
               moreComment := true;
               while moreComment do begin
                   nextChar;
                   while currChar not = "*" and not XCPNOTED(ENDFILE) do nextChar;
                   while currChar     = "*" and not XCPNOTED(ENDFILE) do nextChar;
                   moreComment := ( currChar not = "/" and not XCPNOTED(ENDFILE) )
               end while_more_comment ;
               if not XCPNOTED(ENDFILE)
               then nextChar
               else lexError( "End-of-file in comment." )
           end if_currChar_is_star ;
       end divideOrComment ;

       % handle an indentifier or keyword %
       procedure identifierOrKeyword ; begin
           tkType := tIdentifier;
           while identifierStartChar or range( "0", "9" ) do addAndNextChar;
           % there are only 5 keywords, so we just test each in turn here %
           if      tkValue = "if"      then tkType  := tKeyword_if
           else if tkValue = "else"    then tkType  := tKeyword_else
           else if tkValue = "while"   then tkType  := tKeyword_while
           else if tkValue = "print"   then tkType  := tKeyword_print
           else if tkValue = "putc"    then tkType  := tKeyword_putc;
           if tkType not = tIdentifier then tkValue := "";
       end identifierOrKeyword ;

       % handle an integer literal %
       procedure integerLiteral ; begin
           logical overflowed;
           integer digit;
           overflowed := false;
           tkType     := tInteger;
           while range( "0", "9" ) do begin
               digit := ( decode( currChar ) - decode( "0" ) );
               if      tkIntegerValue > MAXINTEGER_OVER_10 then overflowed := true
               else if tkIntegerValue = MAXINTEGER_OVER_10
                   and digit          > MAXINTEGER_MOD_10  then overflowed := true
               else begin
                   tkIntegerValue := tkIntegerValue * 10;
                   tkIntegerValue := tkIntegerValue + digit;
               end;
               nextChar
           end while_have_a_digit ;
           if overflowed          then lexError( "Number too large." );
           if identifierStartChar then lexError( "Number followed by letter or underscore." );
       end integerLiteral ;

       % handle a char literal %
       procedure charLiteral ; begin
           nextChar;
           if      currChar = "'" or currChar = newlineChar then lexError( "Invalid character constant." )
           else if currChar = "\" then begin
               % have an escape %
               nextChar;
               if      currChar     = "n" then currChar := newlineChar
               else if currChar not = "\" then lexError( "Unknown escape sequence." )
           end;
           tkType         := tInteger;
           tkIntegerValue := decode( currChar );
           % should have a closing quoute next %
           nextChar;
           if   currChar not = "'"
           then lexError( "Multi-character constant." )
           else nextChar
       end charLiteral ;

       % handle a string literal %
       procedure stringLiteral ; begin
           tkType            := tString;
           tkValue( 0 // 1 ) := currChar;
           tkLength          := 1;
           nextChar;
           while currChar not = """" and currChar not = newlineChar and not XCPNOTED(ENDFILE) do addAndNextChar;
           if      currChar = newlineChar then lexError( "End-of-line while scanning string literal." )
           else if XCPNOTED(ENDFILE)      then lexError( "End-of-file while scanning string literal." )
           else    % currChar must be """" % addAndNextChar
       end stringLiteral ;

       while begin
           % skip white space %
           while ( currChar = " " or currChar = newlineChar ) and not XCPNOTED(ENDFILE) do nextChar;
           % get the token %
           tkLine         := lineNumber;
           tkColumn       := columnNumber;
           tkValue        := "";
           tkLength       := 0;
           tkIntegerValue := 0;
           tkTooLong      := false;
           if      XCPNOTED(ENDFILE)   then tkType := tEnd_of_input
           else if currChar = "*"      then singleCharToken( tOp_multiply )
           else if currChar = "/"      then divideOrComment
           else if currChar = "%"      then singleCharToken( tOp_mod      )
           else if currChar = "+"      then singleCharToken( tOp_add )
           else if currChar = "-"      then singleCharToken( tOp_subtract )
           else if currChar = "<"      then opOrOpEqual( tOp_less,    tOp_lessequal     )
           else if currChar = ">"      then opOrOpEqual( tOp_greater, tOp_greaterequal  )
           else if currChar = "="      then opOrOpEqual( tOp_assign,  tOp_equal         )
           else if currChar = "!"      then opOrOpEqual( tOp_not,     tOp_notequal      )
           else if currChar = "&"      then doubleCharToken( tOp_and     )
           else if currChar = "|"      then doubleCharToken( tOp_or      )
           else if currChar = "("      then singleCharToken( tLeftParen  )
           else if currChar = ")"      then singleCharToken( tRightParen )
           else if currChar = "{"      then singleCharToken( tLeftBrace  )
           else if currChar = "}"      then singleCharToken( tRightBrace )
           else if currChar = ";"      then singleCharToken( tSemicolon  )
           else if currChar = ","      then singleCharToken( tComma      )
           else if identifierStartChar then identifierOrKeyword
           else if range( "0", "9" )   then integerLiteral
           else if currChar = "'"      then charLiteral
           else if currChar = """"     then stringLiteral
           else begin
               lexError( "Unrecognised character." );
               singleCharToken( tComment )
           end ;
           % continue until we get something other than a comment %
           tkType = tComment
       end do begin end;
       if tkTooLong then if   tkType = tString
                         then lexError( "String literal too long." )
                         else lexError( "Identifier too long."     );
       tkType
   end nextToken ;

   % outputs the current token %
   procedure writeToken ; begin
       write( i_w := 5, s_w := 2, tkLine, tkColumn, tkName( tkType ) );
       if tkType = tInteger then writeon( i_w := 11, tkIntegerValue )
       else if tkLength > 0 then begin
           writeon( "  " );
           for tkPos := 0 until tkLength - 1 do writeon( s_w := 0, tkValue( tkPos // 1 ) );
       end
   end writeToken ;

   LINE_WIDTH       := 256; MAXINTEGER_MOD_10  := MAXINTEGER rem 10;
   MAX_TOKEN_LENGTH := 256; MAXINTEGER_OVER_10 := MAXINTEGER div 10;
   newlineChar      := code( 10 );
   tOp_multiply     :=  1; tkName( tOp_multiply     ) := "Op_multiply";
   tOp_divide       :=  2; tkName( tOp_divide       ) := "Op_divide";
   tOp_mod          :=  3; tkName( tOp_mod          ) := "Op_mod";
   tOp_add          :=  4; tkName( tOp_add          ) := "Op_add";
   tOp_subtract     :=  5; tkName( tOp_subtract     ) := "Op_subtract";
   tOp_negate       :=  6; tkName( tOp_negate       ) := "Op_negate";
   tOp_less         :=  7; tkName( tOp_less         ) := "Op_less";
   tOp_lessequal    :=  8; tkName( tOp_lessequal    ) := "Op_lessequal";
   tOp_greater      :=  9; tkName( tOp_greater      ) := "Op_greater";
   tOp_greaterequal := 10; tkName( tOp_greaterequal ) := "Op_greaterequal";
   tOp_equal        := 11; tkName( tOp_equal        ) := "Op_equal";
   tOp_notequal     := 12; tkName( tOp_notequal     ) := "Op_notequal";
   tOp_not          := 13; tkName( tOp_not          ) := "Op_not";
   tOp_assign       := 14; tkName( tOp_assign       ) := "Op_assign";
   tOp_and          := 15; tkName( tOp_and          ) := "Op_and";
   tOp_or           := 16; tkName( tOp_or           ) := "Op_or";
   tLeftParen       := 17; tkName( tLeftParen       ) := "LeftParen";
   tRightParen      := 18; tkName( tRightParen      ) := "RightParen";
   tLeftBrace       := 19; tkName( tLeftBrace       ) := "LeftBrace";
   tRightBrace      := 20; tkName( tRightBrace      ) := "RightBrace";
   tSemicolon       := 21; tkName( tSemicolon       ) := "Semicolon";
   tComma           := 22; tkName( tComma           ) := "Comma";
   tKeyword_if      := 23; tkName( tKeyword_if      ) := "Keyword_if";
   tKeyword_else    := 24; tkName( tKeyword_else    ) := "Keyword_else";
   tKeyword_while   := 25; tkName( tKeyword_while   ) := "Keyword_while";
   tKeyword_print   := 26; tkName( tKeyword_print   ) := "Keyword_print";
   tKeyword_putc    := 27; tkName( tKeyword_putc    ) := "Keyword_putc";
   tIdentifier      := 28; tkName( tIdentifier      ) := "Identifier";
   tInteger         := 29; tkName( tInteger         ) := "Integer";
   tString          := 30; tkName( tString          ) := "String";
   tEnd_of_input    := 31; tkName( tEnd_of_input    ) := "End_of_input";
   tComment         := 32; tkName( tComment         ) := "Comment";

   % allow the program to continue after reaching end-of-file %
   ENDFILE := EXCEPTION( false, 1, 0, false, "EOF" );
   % ensure the first call to nextToken reads the first line %
   lineNumber   := 0;
   columnNumber := LINE_WIDTH + 1;
   currChar     := " ";
   % get and print all tokens from standard input %
   while nextToken not = tEnd_of_input do writeToken;
   writeToken

end.</lang>

Output:

Test case 3

    5     16  Keyword_print            
    5     40  Op_subtract     
    6     16  Keyword_putc            
    6     40  Op_less         
    7     16  Keyword_if            
    7     40  Op_greater      
    8     16  Keyword_else            
    8     40  Op_lessequal    
    9     16  Keyword_while            
    9     40  Op_greaterequal 
   10     16  LeftBrace       
   10     40  Op_equal        
   11     16  RightBrace      
   11     40  Op_notequal     
   12     16  LeftParen       
   12     40  Op_and          
   13     16  RightParen      
   13     40  Op_or           
   14     16  Op_subtract     
   14     40  Semicolon       
   15     16  Op_not          
   15     40  Comma           
   16     16  Op_multiply     
   16     40  Op_assign       
   17     16  Op_divide       
   17     40  Integer                  42  
   18     16  Op_mod          
   18     40  String              "String literal"
   19     16  Op_add          
   19     40  Identifier          variable_name
   20     26  Integer                  10  
   21     26  Integer                  92  
   22     26  Integer                  32  
   23      1  End_of_input

C

Tested with gcc 4.81 and later, compiles warning free with -Wall -Wextra <lang C>#include <stdlib.h>

include <stdio.h>
include <stdarg.h>
include <ctype.h>
include <string.h>
include <errno.h>
include <stdbool.h>
include <limits.h>

define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

define da_dim(name, type) type *name = NULL; \

                           int _qy_ ## name ## _p = 0;  \
                           int _qy_ ## name ## _max = 0

define da_rewind(name) _qy_ ## name ## _p = 0
define da_redim(name) do {if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \

                               name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]));} while (0)

define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
define da_len(name) _qy_ ## name ## _p

typedef enum {

   tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
   tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
   tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
   tk_Ident, tk_Integer, tk_String

} TokenType;

typedef struct {

   TokenType tok;
   int err_ln, err_col;
   union {
       int n;                  /* value for constants */
       char *text;             /* text for idents */
   };

} tok_s;

static FILE *source_fp, *dest_fp; static int line = 1, col = 0, the_ch = ' '; da_dim(text, char);

tok_s gettok();

static void error(int err_line, int err_col, const char *fmt, ... ) {

   char buf[1000];
   va_list ap;

   va_start(ap, fmt);
   vsprintf(buf, fmt, ap);
   va_end(ap);
   printf("(%d,%d) error: %s\n", err_line, err_col, buf);
   exit(1);

}

static int next_ch() { /* get next char from input */

   the_ch = getc(source_fp);
   ++col;
   if (the_ch == '\n') {
       ++line;
       col = 0;
   }
   return the_ch;

}

static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */

   if (the_ch == '\)
       error(err_line, err_col, "gettok: empty character constant");
   if (the_ch == '\\') {
       next_ch();
       if (the_ch == 'n')
           n = 10;
       else if (the_ch == '\\')
           n = '\\';
       else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch);
   }
   if (next_ch() != '\)
       error(err_line, err_col, "multi-character constant");
   next_ch();
   return (tok_s){tk_Integer, err_line, err_col, {n}};

}

static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */

   if (the_ch != '*')
       return (tok_s){tk_Div, err_line, err_col, {0}};

   /* comment found */
   for (;;) {
       if (next_ch() == '*' && next_ch() == '/') {
           next_ch();
           return gettok();
       } else if (the_ch == EOF)
           error(err_line, err_col, "EOF in comment");
   }

}

static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */

   da_rewind(text);

   while (next_ch() != start) {
       if (the_ch == '\n') error(err_line, err_col, "EOL in string");
       if (the_ch == EOF)  error(err_line, err_col, "EOF in string");
       da_append(text, (char)the_ch);
   }
   da_append(text, '\0');

   next_ch();
   return (tok_s){tk_String, err_line, err_col, {.text=text}};

}

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"else",  tk_Else},
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;

   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

static tok_s ident_or_int(int err_line, int err_col) {

   int n, is_number = true;

   da_rewind(text);
   while (isalnum(the_ch) || the_ch == '_') {
       da_append(text, (char)the_ch);
       if (!isdigit(the_ch))
           is_number = false;
       next_ch();
   }
   if (da_len(text) == 0)
       error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch);
   da_append(text, '\0');
   if (isdigit(text[0])) {
       if (!is_number)
           error(err_line, err_col, "invalid number: %s\n", text);
       n = strtol(text, NULL, 0);
       if (n == LONG_MAX && errno == ERANGE)
           error(err_line, err_col, "Number exceeds maximum value");
       return (tok_s){tk_Integer, err_line, err_col, {n}};
   }
   return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};

}

static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */

   if (the_ch == expect) {
       next_ch();
       return (tok_s){ifyes, err_line, err_col, {0}};
   }
   if (ifno == tk_EOI)
       error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch);
   return (tok_s){ifno, err_line, err_col, {0}};

}

tok_s gettok() { /* return the token type */

   /* skip white space */
   while (isspace(the_ch))
       next_ch();
   int err_line = line;
   int err_col  = col;
   switch (the_ch) {
       case '{':  next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}};
       case '}':  next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}};
       case '(':  next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}};
       case ')':  next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}};
       case '+':  next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}};
       case '-':  next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}};
       case '*':  next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}};
       case '%':  next_ch(); return (tok_s){tk_Mod, err_line, err_col, {0}};
       case ';':  next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}};
       case ',':  next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}};
       case '/':  next_ch(); return div_or_cmt(err_line, err_col);
       case '\: next_ch(); return char_lit(the_ch, err_line, err_col);
       case '<':  next_ch(); return follow('=', tk_Leq, tk_Lss,    err_line, err_col);
       case '>':  next_ch(); return follow('=', tk_Geq, tk_Gtr,    err_line, err_col);
       case '=':  next_ch(); return follow('=', tk_Eq,  tk_Assign, err_line, err_col);
       case '!':  next_ch(); return follow('=', tk_Neq, tk_Not,    err_line, err_col);
       case '&':  next_ch(); return follow('&', tk_And, tk_EOI,    err_line, err_col);
       case '|':  next_ch(); return follow('|', tk_Or,  tk_EOI,    err_line, err_col);
       case '"' : return string_lit(the_ch, err_line, err_col);
       default:   return ident_or_int(err_line, err_col);
       case EOF:  return (tok_s){tk_EOI, err_line, err_col, {0}};
   }

}

void run() { /* tokenize the given input */

   tok_s tok;
   do {
       tok = gettok();
       fprintf(dest_fp, "%5d  %5d %.15s",
           tok.err_ln, tok.err_col,
           &"End_of_input    Op_multiply     Op_divide       Op_mod          Op_add          "
            "Op_subtract     Op_negate       Op_not          Op_less         Op_lessequal    "
            "Op_greater      Op_greaterequal Op_equal        Op_notequal     Op_assign       "
            "Op_and          Op_or           Keyword_if      Keyword_else    Keyword_while   "
            "Keyword_print   Keyword_putc    LeftParen       RightParen      LeftBrace       "
            "RightBrace      Semicolon       Comma           Identifier      Integer         "
            "String          "
           [tok.tok * 16]);
       if (tok.tok == tk_Integer)     fprintf(dest_fp, "  %4d",   tok.n);
       else if (tok.tok == tk_Ident)  fprintf(dest_fp, " %s",     tok.text);
       else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text);
       fprintf(dest_fp, "\n");
   } while (tok.tok != tk_EOI);
   if (dest_fp != stdout)
       fclose(dest_fp);

}

void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {

   if (fn[0] == '\0')
       *fp = std;
   else if ((*fp = fopen(fn, mode)) == NULL)
       error(0, 0, "Can't open %s\n", fn);

}

int main(int argc, char *argv[]) {

   init_io(&source_fp, stdin,  "r",  argc > 1 ? argv[1] : "");
   init_io(&dest_fp,   stdout, "wb", argc > 2 ? argv[2] : "");
   run();
   return 0;

}</lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input

C#

Requires C#6.0 because of the use of null coalescing operators. <lang csharp> using System; using System.IO; using System.Linq; using System.Collections.Generic;

namespace Rosetta {

   public enum TokenType {
       End_of_input, Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract,
       Op_negate, Op_not, Op_less, Op_lessequal, Op_greater, Op_greaterequal,
       Op_equal, Op_notequal, Op_assign, Op_and, Op_or, Keyword_if,
       Keyword_else, Keyword_while, Keyword_print, Keyword_putc, LeftParen, RightParen,
       LeftBrace, RightBrace, Semicolon, Comma, Identifier, Integer, String, None
   }

   /// <summary>
   /// Storage class for tokens
   /// </summary>
   public class Token {
       public TokenType Type { get; set; }
       public int Line { get; set; }
       public int Position { get; set; }
       public string Value { get; set; }
       public override string ToString() {
           if (Type == TokenType.Integer || Type == TokenType.Identifier) {
               return String.Format("{0,-5}  {1,-5}   {2,-14}     {3}", Line, Position, Type.ToString(), Value);
           } else if (Type == TokenType.String) {
               return String.Format("{0,-5}  {1,-5}   {2,-14}     \"{3}\"", Line, Position, Type.ToString(), Value.Replace("\n", "\\n"));
           }
           return String.Format("{0,-5}  {1,-5}   {2,-14}", Line, Position, Type.ToString());
       }
   }

   /// <summary>
   /// C# Example of Lexical scanner for Rosetta Compiler
   /// </summary>
   public class LexicalScanner {

       // character classes 
       private const string _letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_";        
       private const string _numbers = "0123456789";
       private const string _identifier = _letters + _numbers + "_";
       private const string _whitespace = " \t\n\r";
       
       // mappings from string keywords to token type 
       private Dictionary<string, TokenType> _keywordTokenTypeMap = new Dictionary<string, TokenType>() {
           { "if", TokenType.Keyword_if },
           { "else", TokenType.Keyword_else },
           { "while", TokenType.Keyword_while },
           { "print", TokenType.Keyword_print },
           { "putc", TokenType.Keyword_putc }
       };

       // mappings from simple operators to token type
       private Dictionary<string, TokenType> _operatorTokenTypeMap = new Dictionary<string, TokenType>() {
           { "+", TokenType.Op_add },
           { "-", TokenType.Op_subtract },
           { "*", TokenType.Op_multiply },
           { "/", TokenType.Op_divide },
           { "%", TokenType.Op_mod },
           { "=", TokenType.Op_assign },
           { "<", TokenType.Op_less },
           { ">", TokenType.Op_greater },
           { "!", TokenType.Op_not },
       };

       private List<string> _keywords;
       private string _operators = "+-*/%=<>!%";

       private string _code;
       private List<Token> tokens = new List<Token>();

       private int _line = 1;
       private int _position = 1;

       public string CurrentCharacter {
           get {
               try {
                   return _code.Substring(0, 1);
               } catch (ArgumentOutOfRangeException) {
                   return "";
               }
           }
       }

       /// <summary>
       /// Lexical scanner initialiser
       /// </summary>
       /// <param name="code">Code to be tokenised</param>
       public LexicalScanner (string code) {
           _code = code;
           _keywords = _keywordTokenTypeMap.Keys.ToList();
       }

       /// <summary>
       /// Advance the cursor forward given number of characters
       /// </summary>
       /// <param name="characters">Number of characters to advance</param>
       private void advance(int characters=1) {
           try {
               // reset position when there is a newline
               if (CurrentCharacter == "\n") {
                   _position = 0;
                   _line++;
               }
                   
               _code = _code.Substring(characters, _code.Length - characters);
               _position += characters;
           } catch (ArgumentOutOfRangeException) {
               _code = "";
           }
       }

       /// <summary>
       /// Outputs error message to the console and exits 
       /// </summary>
       /// <param name="message">Error message to display to user</param>
       /// <param name="line">Line error occurred on</param>
       /// <param name="position">Line column that the error occurred at</param>
       public void error(string message, int line, int position) {
           // output error to the console and exit
           Console.WriteLine(String.Format("{0} @ {1}:{2}", message, line, position));
           Environment.Exit(1);
       }

       /// <summary>
       /// Pattern matching using first & follow matching
       /// </summary>
       /// <param name="recogniseClass">String of characters that identifies the token type
       /// or the exact match the be made if exact:true</param>
       /// <param name="matchClass">String of characters to match against remaining target characters</param>
       /// <param name="tokenType">Type of token the match represents.</param>
       /// <param name="notNextClass">Optional class of characters that cannot follow the match</param>
       /// <param name="maxLen">Optional maximum length of token value</param>
       /// <param name="exact">Denotes whether recogniseClass represents an exact match or class match. 
       /// Default: false</param>
       /// <param name="discard">Denotes whether the token is kept or discarded. Default: false</param>
       /// <param name="offset">Optiona line position offset to account for discarded tokens</param>
       /// <returns>Boolean indicating if a match was made </returns>
       public bool match(string recogniseClass, string matchClass, TokenType tokenType, 
                         string notNextClass=null, int maxLen=Int32.MaxValue, bool exact=false, 
                         bool discard=false, int offset=0) {

           // if we've hit the end of the file, there's no more matching to be done
           if (CurrentCharacter == "")
               return false;

           // store _current_ line and position so that our vectors point at the start
           // of each token
           int line = _line;
           int position = _position;

           // special case exact tokens to avoid needing to worry about backtracking
           if (exact) {
               if (_code.StartsWith(recogniseClass)) {
                   if (!discard) 
                       tokens.Add(new Token() { Type = tokenType, Value = recogniseClass, Line = line, Position = position - offset});   
                   advance(recogniseClass.Length);
                   return true;
               }
               return false;
           }

           // first match - denotes the token type usually
           if (!recogniseClass.Contains(CurrentCharacter))
               return false;

           string tokenValue = CurrentCharacter;
           advance();

           // follow match while we haven't exceeded maxLen and there are still characters
           // in the code stream
           while ((matchClass ?? "").Contains(CurrentCharacter) && tokenValue.Length <= maxLen && CurrentCharacter != "") {
               tokenValue += CurrentCharacter;
               advance();
           }

           // ensure that any incompatible characters are not next to the token
           // eg 42fred is invalid, and neither recognized as a number nor an identifier.
           // _letters would be the notNextClass
           if (notNextClass != null && notNextClass.Contains(CurrentCharacter))
               error("Unrecognised character: " + CurrentCharacter, _line, _position);

           // only add tokens to the stack that aren't marked as discard - dont want
           // things like open and close quotes/comments
           if (!discard) {
               Token token = new Token() { Type = tokenType, Value = tokenValue, Line = line, Position = position - offset };
               tokens.Add(token);
           }

           return true;
       }

       /// <summary>
       /// Tokenise the input code 
       /// </summary>
       /// <returns>List of Tokens</returns>
       public List<Token> scan() {

           while (CurrentCharacter != "") {
               // match whitespace
               match(_whitespace, _whitespace, TokenType.None, discard: true);

               // match integers
               match(_numbers, _numbers, TokenType.Integer, notNextClass:_letters);
               
               // match identifiers and keywords
               if (match(_letters, _identifier, TokenType.Identifier)) {
                   Token match = tokens.Last();
                   if (_keywords.Contains(match.Value))
                       match.Type = _keywordTokenTypeMap[match.Value];
               }

               // match string similarly to comments without allowing newlines
               // this token doesn't get discarded though
               if (match("\"", null, TokenType.String, discard:true)) {
                   string value = "";
                   int position = _position;
                   while (!match("\"", null, TokenType.String, discard:true)) {
                       // not allowed newlines in strings
                       if (CurrentCharacter == "\n")
                           error("End-of-line while scanning string literal. Closing string character not found before end-of-line", _line, _position);
                       // end of file reached before finding end of string
                       if (CurrentCharacter == "")
                           error("End-of-file while scanning string literal. Closing string character not found", _line, _position);

                       value += CurrentCharacter;

                       // deal with escape sequences - we only accept newline (\n)
                       if (value.Length >= 2) {
                           string lastCharacters = value.Substring(value.Length - 2, 2);
                           if (lastCharacters[0] == '\\') {
                               if (lastCharacters[1] != 'n') {
                                   error("Unknown escape sequence. ", _line, position);
                               }
                               value = value.Substring(0, value.Length - 2).ToString() + "\n";
                           }
                       }

                       advance();
                   }
                   tokens.Add(new Token() { Type = TokenType.String, Value = value, Line = _line, Position = position - 1});
               }

               // match string literals
               if (match("'", null, TokenType.Integer, discard:true)) {
                   int value;
                   int position = _position;
                   value = CurrentCharacter.ToCharArray()[0];
                   advance();

                   // deal with empty literals 
                   if (value == '\)
                       error("Empty character literal", _line, _position);

                   // deal with escaped characters, only need to worry about \n and \\
                   // throw werror on any other
                   if (value == '\\') {
                       if (CurrentCharacter == "n") {
                           value = '\n';
                       } else if (CurrentCharacter == "\\") {
                           value = '\\';
                       } else {
                           error("Unknown escape sequence. ", _line, _position - 1);
                       }
                       advance();
                   }

                   // if we haven't hit a closing ' here, there are two many characters
                   // in the literal
                   if (!match("'", null, TokenType.Integer, discard: true))
                       error("Multi-character constant", _line, _position);

                   tokens.Add(new Rosetta.Token() { Type = TokenType.Integer, Value = value.ToString(), Line = _line, Position = position - 1 });
               }

               // match comments by checking for starting token, then advancing 
               // until closing token is matched
               if (match("/*", null, TokenType.None, exact: true, discard: true)) {
                   while (!match("*/", null, TokenType.None, exact: true, discard: true)) {
                       // reached the end of the file without closing comment!
                       if (CurrentCharacter == "")
                           error("End-of-file in comment. Closing comment characters not found.", _line, _position);
                       advance();
                   }
                   continue;
               }

               // match complex operators
               match("<=", null, TokenType.Op_lessequal, exact: true);
               match(">=", null, TokenType.Op_greaterequal, exact: true);
               match("==", null, TokenType.Op_equal, exact: true);
               match("!=", null, TokenType.Op_notequal, exact: true);
               match("&&", null, TokenType.Op_and, exact: true);
               match("||", null, TokenType.Op_or, exact: true);

               // match simple operators
               if (match(_operators, null, TokenType.None, maxLen:1)) {
                   Token match = tokens.Last();
                   match.Type = _operatorTokenTypeMap[match.Value];
               }

               // brackets, braces and separators
               match("(", null, TokenType.LeftParen, exact: true);
               match(")", null, TokenType.RightParen, exact: true);
               match("{", null, TokenType.LeftBrace, exact: true);
               match("}", null, TokenType.RightBrace, exact: true);
               match(";", null, TokenType.Semicolon, exact: true);
               match(",", null, TokenType.Comma, exact: true);

           // end of file token
           tokens.Add(new Rosetta.Token() { Type = TokenType.End_of_input, Line = _line, Position = _position });
           
           return tokens;
       }

       static void Main (string[] args) {
           StreamReader inputFile;

           // if we passed in a filename, read code from that, else
           // read code from stdin
           if (args.Length > 0) {
               string path = args[0];
               try {
                   inputFile = new StreamReader(path);
               } catch (IOException) {
                   inputFile = new StreamReader(Console.OpenStandardInput(8192));
               }
           } else {
               inputFile = new StreamReader(Console.OpenStandardInput(8192));
           }

           string code = inputFile.ReadToEnd();

           // strip windows line endings out
           code = code.Replace("\r", "");

           LexicalScanner scanner = new LexicalScanner(code);
           List<Token> tokens = scanner.scan();

           foreach(Token token in tokens) {
               Console.WriteLine(token.ToString());
           }       
       }
   }

} </lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input

Euphoria

Tested with Euphoria 4.05. <lang euphoria>include std/io.e include std/map.e include std/types.e include std/convert.e

constant true = 1, false = 0, EOF = -1

enum tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,

   tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
   tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
   tk_Ident, tk_Integer, tk_String

constant all_syms = {"End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add",

   "Op_subtract", "Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater",
   "Op_greaterequal", "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or",
   "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
   "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
   "Identifier", "Integer", "String"}

integer input_file, the_ch = ' ', the_col = 0, the_line = 1 sequence symbols map key_words = new()

procedure error(sequence format, sequence data)

   printf(STDOUT, format, data)
   abort(1)

end procedure

-- get the next character from the input function next_ch()

   the_ch = getc(input_file)
   the_col += 1
   if the_ch = '\n' then
       the_line += 1
       the_col = 0
   end if
   return the_ch

end function

-- 'x' - character constants function char_lit(integer err_line, integer err_col)

   integer n = next_ch()              -- skip opening quote
   if the_ch = '\ then
       error("%d %d empty character constant", {err_line, err_col})
   elsif the_ch = '\\' then
       next_ch()
       if the_ch = 'n' then
           n = 10
       elsif the_ch = '\\' then
           n = '\\'
       else
           error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch})
       end if
   end if
   if next_ch() != '\ then
       error("%d %d multi-character constant", {err_line, err_col})
   end if
   next_ch()
   return {tk_Integer, err_line, err_col, n}

end function

-- process divide or comments function div_or_cmt(integer err_line, integer err_col)

   if next_ch() != '*' then
       return {tk_Div, err_line, err_col}
   end if

   -- comment found
   while true do
       if next_ch() = '*' and next_ch() = '/' then
           next_ch()
           return get_tok()
       elsif the_ch = EOF then
           error("%d %d EOF in comment", {err_line, err_col})
       end if
   end while

end function

-- "string" function string_lit(integer start, integer err_line, integer err_col)

   string text = ""

   while next_ch() != start do
       if the_ch = EOF then
           error("%d %d EOF while scanning string literal", {err_line, err_col})
       end if
       if the_ch = '\n' then
           error("%d %d EOL while scanning string literal", {err_line, err_col})
       end if
       text &= the_ch
   end while

   next_ch()
   return {tk_String, err_line, err_col, text}

end function

-- handle identifiers and integers function ident_or_int(integer err_line, integer err_col)

   integer n, is_number = true
   string text = ""

   while t_alnum(the_ch) or the_ch = '_' do
       text &= the_ch
       if not t_digit(the_ch) then
           is_number = false
       end if
       next_ch()
   end while

   if length(text) = 0 then
       error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch})
   end if

   if t_digit(text[1]) then
       if not is_number then
           error("%d %d invalid number: %s", {err_line, err_col, text})
       end if
       n = to_integer(text)
       return {tk_Integer, err_line, err_col, n}
   end if

   if has(key_words, text) then
       return {get(key_words, text), err_line, err_col}
   end if

   return {tk_Ident, err_line, err_col, text}

end function

-- look ahead for '>=', etc. function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)

   if next_ch() = expect then
       next_ch()
       return {ifyes, err_line, err_col}
   end if

   if ifno = tk_EOI then
       error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch})
   end if

   return {ifno, err_line, err_col}

end function

-- return the next token type function get_tok()

   while t_space(the_ch) do
       next_ch()
   end while

   integer err_line = the_line
   integer err_col  = the_col

   switch the_ch do
       case EOF  then return {tk_EOI, err_line, err_col}
       case '/'  then return div_or_cmt(err_line, err_col)
       case '\ then return char_lit(err_line, err_col)

       case '<'  then return follow('=', tk_Leq, tk_Lss,    err_line, err_col)
       case '>'  then return follow('=', tk_Geq, tk_Gtr,    err_line, err_col)
       case '='  then return follow('=', tk_Eq,  tk_Assign, err_line, err_col)
       case '!'  then return follow('=', tk_Neq, tk_Not,    err_line, err_col)
       case '&'  then return follow('&', tk_And, tk_EOI,    err_line, err_col)
       case '|'  then return follow('|', tk_Or,  tk_EOI,    err_line, err_col)

       case '"'  then return string_lit(the_ch, err_line, err_col)
       case else
           integer sym = symbols[the_ch]
           if sym  != tk_EOI then
               next_ch()
               return {sym, err_line, err_col}
           end if
           return ident_or_int(err_line, err_col)
   end switch

end function

procedure init()

   put(key_words, "else",    tk_Else)
   put(key_words, "if",      tk_If)
   put(key_words, "print",   tk_Print)
   put(key_words, "putc",    tk_Putc)
   put(key_words, "while",   tk_While)

   symbols = repeat(tk_EOI, 256)
   symbols['{'] = tk_Lbrace
   symbols['}'] = tk_Rbrace
   symbols['('] = tk_Lparen
   symbols[')'] = tk_Rparen
   symbols['+'] = tk_Add
   symbols['-'] = tk_Sub
   symbols['*'] = tk_Mul
   symbols['%'] = tk_Mod
   symbols[';'] = tk_Semi
   symbols[','] = tk_Comma

end procedure

procedure main(sequence cl)

   sequence file_name

   input_file = STDIN
   if length(cl) > 2 then
       file_name = cl[3]
       input_file = open(file_name, "r")
       if input_file = -1 then
           error("Could not open %s", {file_name})
       end if
   end if
   init()
   sequence t
   loop do
       t = get_tok()
       printf(STDOUT, "%5d  %5d %-8s", {t[2], t[3], all_syms[t[1]]})
       switch t[1] do
           case tk_Integer then printf(STDOUT, "  %5d\n",   {t[4]})
           case tk_Ident   then printf(STDOUT, " %s\n",     {t[4]})
           case tk_String  then printf(STDOUT, " \"%s\"\n", {t[4]})
           case else            printf(STDOUT, "\n")
       end switch
       until t[1] = tk_EOI
   end loop

end procedure

main(command_line())</lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input

Flex

Tested with Flex 2.5.4. <lang C>%{

include <stdio.h>
include <stdlib.h>
include <string.h>
include <errno.h>
include <limits.h>

define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

typedef enum {

   tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
   tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
   tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
   tk_Ident, tk_Integer, tk_String

} TokenType;

void yyerror(char msg[]) {

   printf(msg);
   exit(1);

}

static int yynval;

struct yylloc {

   int first_line, first_col;
   int last_line, last_col;

} yylloc;

static void update_loc() {

 static int curr_line = 1;
 static int curr_col  = 1;

 yylloc.first_line = curr_line;
 yylloc.first_col  = curr_col;

 {char *s; for (s = yytext; *s != '\0'; s++) {
   if (*s == '\n') {
     curr_line++;
     curr_col = 1;
   } else {
     curr_col++;
   }
 }}

 yylloc.last_line = curr_line;
 yylloc.last_col  = curr_col-1;

}

define YY_USER_ACTION update_loc();

static int kwd_cmp(const void *p1, const void *p2) {

   return strcmp(*(char **)p1, *(char **)p2);

}

static TokenType get_ident_type(const char *ident) {

   static struct {
       char *s;
       TokenType sym;
   } kwds[] = {
       {"else",  tk_Else},
       {"if",    tk_If},
       {"print", tk_Print},
       {"putc",  tk_Putc},
       {"while", tk_While},
   }, *kwp;

   return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;

}

%}

%start COMMENT2

%option noyywrap

digit [0-9] ident [a-zA-Z_][a-zA-Z_0-9]*

number {digit}+ string \"[^"\n]*\" char_const \'([^'\n]|\\n|\\\\)\'

%%

<COMMENT2>[^*]+ ; <COMMENT2>\*[^/] ; <COMMENT2>\*\/ BEGIN 0; /* end comment */ "/*" BEGIN COMMENT2;

"{" {return tk_Lbrace;} "}" {return tk_Rbrace;} "(" {return tk_Lparen;} ")" {return tk_Rparen;} "*" {return tk_Mul;} "/" {return tk_Div;} "%" {return tk_Mod;} "+" {return tk_Add;} "-" {return tk_Sub;} "<" {return tk_Lss;} ">" {return tk_Gtr;} "<=" {return tk_Leq;} ">=" {return tk_Geq;} "!=" {return tk_Neq;} "!" {return tk_Not;} "&&" {return tk_And;} "||" {return tk_Or;} ";" {return tk_Semi;} "," {return tk_Comma;} "==" {return tk_Eq;} "=" {return tk_Assign;} {ident} {return get_ident_type(yytext);} {string} {return tk_String;}

[ \t\n]+ ; /* ignore whitespace */

{number} {

               yynval = strtol(yytext, NULL, 0);
               if (yynval == LONG_MAX && errno == ERANGE)
                   yyerror("Number exceeds maximum value");

               return tk_Integer;
            }

{char_const} {

               int n = yytext[1];
               char *p = yytext;

               if (yyleng < 3)
                   yyerror("empty character constant");
               ++p;
               if (p[0] == '\\') {
                   ++p;
                   if (p[0] == 'n')
                       n = 10;
                   else if (p[0] == '\\')
                       n = '\\';
                   else
                       yyerror("unknown escape sequence");
               }
               yynval = n;
               return tk_Integer;
            }

. yyerror("Unknown character\n");

%%

int main(int argc, char *argv[]) {

   int tok;

   ++argv, --argc;  /* skip over program name */
   yyin = stdin;
   if (argc > 0)
       yyin = fopen(argv[0], "r");

   do {
       tok = yylex();
       printf("%5d  %5d %.15s", yylloc.first_line, yylloc.first_col,
           &"End_of_input    Op_multiply     Op_divide       Op_mod          Op_add          "
            "Op_subtract     Op_negate       Op_not          Op_less         Op_lessequal    "
            "Op_greater      Op_greaterequal Op_equal        Op_notequal     Op_assign       "
            "Op_and          Op_or           Keyword_if      Keyword_else    Keyword_while   "
            "Keyword_print   Keyword_putc    LeftParen       RightParen      LeftBrace       "
            "RightBrace      Semicolon       Comma           Identifier      Integer         "
            "String          "
           [tok * 16]);

       if (tok == tk_Integer)     printf("   %5d", yynval);
       else if (tok == tk_Ident)  printf("  %s",   yytext);
       else if (tok == tk_String) printf("  %s",   yytext);
       printf("\n");
   } while (tok != tk_EOI);
   return 0;

}</lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 22 29 End_of_input

FreeBASIC

Tested with FreeBASIC 1.05 <lang FreeBASIC>enum Token_type

   tk_EOI
   tk_Mul
   tk_Div
   tk_Mod
   tk_Add
   tk_Sub
   tk_Negate
   tk_Not
   tk_Lss
   tk_Leq
   tk_Gtr
   tk_Geq
   tk_Eq
   tk_Neq
   tk_Assign
   tk_And
   tk_Or
   tk_If
   tk_Else
   tk_While
   tk_Print
   tk_Putc
   tk_Lparen
   tk_Rparen
   tk_Lbrace
   tk_Rbrace
   tk_Semi
   tk_Comma
   tk_Ident
   tk_Integer
   tk_String

end enum

const NewLine = chr(10) const DoubleQuote = chr(34)

' where we store keywords and variables type Symbol

   s_name as string
   tok as Token_type

end type

dim shared symtab() as Symbol

dim shared cur_line as string dim shared cur_ch as string dim shared line_num as integer dim shared col_num as integer

function is_digit(byval ch as string) as long

   is_digit = (ch <> "") and ch >= "0" and ch <= "9"

end function

function is_alnum(byval ch as string) as long

   is_alnum = (ch <> "") and ((UCase(ch) >= "A" and UCase(ch) <= "Z") or (is_digit(ch)))

end function

sub error_msg(byval eline as integer, byval ecol as integer, byval msg as string)

   print "("; eline; ":"; ecol; ")"; " "; msg
   system

end sub

' add an identifier to the symbol table function install(byval s_name as string, byval tok as Token_type) as integer

   dim n as integer

   n = ubound(symtab)
   redim preserve symtab(n + 1)
   n = ubound(symtab)

   symtab(n).s_name = s_name
   symtab(n).tok    = tok
   return n

end function

' search for an identifier in the symbol table function lookup(byval s_name as string) as integer

   dim i as integer

   for i = lbound(symtab) to ubound(symtab)
       if symtab(i).s_name = s_name then return i
   next
   return -1

end function

sub next_line() ' read the next line of input from the source file

   cur_line = ""
   cur_ch  = ""        ' empty cur_ch means end-of-file
   if eof(1) then exit sub
   line input #1, cur_line
   cur_line = cur_line + NewLine
   line_num += + 1
   col_num = 1

end sub

sub next_char() ' get the next char

   cur_ch = ""
   col_num += 1
   if col_num > len(cur_line) then next_line()
   if col_num <= len(cur_line) then cur_ch = mid(cur_line, col_num, 1)

end sub

function follow(byval err_line as integer, byval err_col as integer, byval expect as string, byval ifyes as Token_type, byval ifno as Token_type) as Token_type

   if cur_ch = expect then
       next_char()
       return ifyes
   end if
   if ifno = tk_eoi then error_msg(err_line, err_col, "follow unrecognized character: " + cur_ch)
   return ifno

end function

sub gettok(byref err_line as integer, byref err_col as integer, byref tok as Token_type, byref v as string)

   ' skip whitespace
   do while (cur_ch = " " or cur_ch = chr(9) or cur_ch = NewLine) and (cur_ch <> "")
       next_char()
   loop

   err_line = line_num
   err_col  = col_num

   select case cur_ch
       case "":  tok = tk_eoi: exit sub
       case "{": tok = tk_lbrace: next_char(): exit sub
       case "}": tok = tk_rbrace: next_char(): exit sub
       case "(": tok = tk_lparen: next_char(): exit sub
       case ")": tok = tk_rparen: next_char(): exit sub
       case "+": tok = tk_add:    next_char(): exit sub
       case "-": tok = tk_sub:    next_char(): exit sub
       case "*": tok = tk_mul:    next_char(): exit sub
       case "%": tok = tk_Mod:    next_char(): exit sub
       case ";": tok = tk_semi:   next_char(): exit sub
       case ",": tok = tk_comma:  next_char(): exit sub
       case "/": ' div or comment
           next_char()
           if cur_ch <> "*" then
               tok = tk_div
               exit sub
           end if
           ' skip comments
           do
               next_char()
               if cur_ch = "*" or cur_ch = "" then
                   next_char()
                   if cur_ch = "/" or cur_ch = "" then
                       next_char()
                       gettok(err_line, err_col, tok, v)
                       exit sub
                   end if
               end if
           loop
       case "'":   ' single char literals
           next_char()
           v = str(Asc(cur_ch))
           if cur_ch = "'" then error_msg(err_line, err_col, "empty character constant")
           if cur_ch = "\" then
               next_char()
               if cur_ch = "n" then
                   v = "10"
               elseif cur_ch = "\" then
                   v = Str(Asc("\"))
               else error_msg(err_line, err_col, "unknown escape sequence: " + cur_ch)
               end if
           end if
           next_char()
           if cur_ch <> "'" then error_msg(err_line, err_col, "multi-character constant")
           next_char()
           tok = tk_integer
           exit sub
       case "<": next_char(): tok = follow(err_line, err_col, "=", tk_Leq, tk_Lss): exit sub
       case ">": next_char(): tok = follow(err_line, err_col, "=", tk_Geq, tk_Gtr): exit sub
       case "!": next_char(): tok = follow(err_line, err_col, "=", tk_Neq, tk_Not): exit sub
       case "=": next_char(): tok = follow(err_line, err_col, "=", tk_Eq,  tk_Assign): exit sub
       case "&": next_char(): tok = follow(err_line, err_col, "&", tk_And, tk_EOI): exit sub
       case "|": next_char(): tok = follow(err_line, err_col, "|", tk_Or,  tk_EOI): exit sub
       case DoubleQuote: ' string
           v = cur_ch
           next_char()
           do while cur_ch <> DoubleQuote
               if cur_ch = NewLine then error_msg(err_line, err_col, "EOL in string")
               if cur_ch = "" then error_msg(err_line, err_col, "EOF in string")
               v += cur_ch
               next_char()
           loop
           v += cur_ch
           next_char()
           tok = tk_string
           exit sub
       case else   ' integers or identifiers
           dim is_number as boolean = is_digit(cur_ch)
           v = ""
           do while is_alnum(cur_ch) orelse cur_ch = "_"
               if not is_digit(cur_ch) then is_number = false
               v += cur_ch
               next_char()
           loop
           if len(v) = 0 then error_msg(err_line, err_col, "unknown character: " + cur_ch)
           if is_digit(mid(v, 1, 1)) then
               if not is_number then error_msg(err_line, err_col, "invalid number: " + v)
               tok = tk_integer
               exit sub
           end if
           dim as integer index = lookup(v)
           if index = -1 then
               tok = tk_ident
           else
               tok = symtab(index).tok
           end if
           exit sub
   end select

end sub

sub init_lex(byval filein as string)

   install("else",  tk_else)
   install("if",    tk_if)
   install("print", tk_print)
   install("putc",  tk_putc)
   install("while", tk_while)

   open filein for input as #1

   cur_line = ""
   line_num = 0
   col_num = 0
   next_char()

end sub

sub scanner()

   dim err_line as integer
   dim err_col as integer
   dim tok as Token_type
   dim v as string
   dim tok_list(tk_eoi to tk_string) as string

   tok_list(tk_EOI    ) = "End_of_input"
   tok_list(tk_Mul    ) = "Op_multiply"
   tok_list(tk_Div    ) = "Op_divide"
   tok_list(tk_Mod    ) = "Op_mod"
   tok_list(tk_Add    ) = "Op_add"
   tok_list(tk_Sub    ) = "Op_subtract"
   tok_list(tk_Negate ) = "Op_negate"
   tok_list(tk_Not    ) = "Op_not"
   tok_list(tk_Lss    ) = "Op_less"
   tok_list(tk_Leq    ) = "Op_lessequal"
   tok_list(tk_Gtr    ) = "Op_greater"
   tok_list(tk_Geq    ) = "Op_greaterequal"
   tok_list(tk_Eq     ) = "Op_equal"
   tok_list(tk_Neq    ) = "Op_notequal"
   tok_list(tk_Assign ) = "Op_assign"
   tok_list(tk_And    ) = "Op_and"
   tok_list(tk_Or     ) = "Op_or"
   tok_list(tk_If     ) = "Keyword_if"
   tok_list(tk_Else   ) = "Keyword_else"
   tok_list(tk_While  ) = "Keyword_while"
   tok_list(tk_Print  ) = "Keyword_print"
   tok_list(tk_Putc   ) = "Keyword_putc"
   tok_list(tk_Lparen ) = "LeftParen"
   tok_list(tk_Rparen ) = "RightParen"
   tok_list(tk_Lbrace ) = "LeftBrace"
   tok_list(tk_Rbrace ) = "RightBrace"
   tok_list(tk_Semi   ) = "Semicolon"
   tok_list(tk_Comma  ) = "Comma"
   tok_list(tk_Ident  ) = "Identifier"
   tok_list(tk_Integer) = "Integer"
   tok_list(tk_String ) = "String"

   do
       gettok(err_line, err_col, tok, v)
       print using "#####  ##### \               \"; err_line; err_col; tok_list(tok);
       if tok = tk_integer orelse tok = tk_ident orelse tok = tk_string then print " " + v;
       print
   loop until tok = tk_eoi

end sub

sub main()

   if command(1) = "" then print "filename required" : system
   init_lex(command(1))
   scanner()

end sub

main() system</lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 22 30 End_of_input

Perl

<lang perl>#!/usr/bin/env perl

use strict; use warnings; no warnings 'once';

----- Definition of the language to be lexed -----#

my @tokens = (

   # Name            | Format               | Value       #
   # --------------  |----------------------|-------------#
   ['Op_multiply'    , '*'                  ,             ],
   ['Op_divide'      , '/'                  ,             ],
   ['Op_mod'         , '%'                  ,             ],
   ['Op_add'         , '+'                  ,             ],
   ['Op_subtract'    , '-'                  ,             ],
   ['Op_lessequal'   , '<='                 ,             ],
   ['Op_less'        , '<'                  ,             ],
   ['Op_greaterequal', '>='                 ,             ],
   ['Op_greater'     , '>'                  ,             ],
   ['Op_equal'       , '=='                 ,             ],
   ['Op_assign'      , '='                  ,             ],
   ['Op_not'         , '!'                  ,             ],
   ['Op_notequal'    , '!='                 ,             ],
   ['Op_and'         , '&&'                 ,             ],
   ['Op_or'          , '||'                 ,             ],
   ['Keyword_else'   , qr/else\b/           ,             ],
   ['Keyword_if'     , qr/if\b/             ,             ],
   ['Keyword_while'  , qr/while\b/          ,             ],
   ['Keyword_print'  , qr/print\b/          ,             ],
   ['Keyword_putc'   , qr/putc\b/           ,             ],

   ['LeftParen'      , '('                  ,             ],
   ['RightParen'     , ')'                  ,             ],
   ['LeftBrace'      , '{'                  ,             ],
   ['RightBrace'     , '}'                  ,             ],
   ['Semicolon'      , ';'                  ,             ],
   ['Comma'          , ','                  ,             ],

   ['Identifier'     , qr/[_a-z][_a-z0-9]*/i, \&raw       ],
   ['Integer'        , qr/[0-9]+\b/         , \&raw       ],
   ['Integer'        , qr/'([^']*)(')?/     , \&char_val  ],
   ['String'         , qr/"([^"]*)(")?/     , \&string_raw],

   ['End_of_input'   , qr/$/                ,             ],

);

my $comment = qr/\/\* .+? (?: \*\/ | $ (?{die "End-of-file in comment\n"}) )/xs; my $whitespace = qr/(?: \s | $comment)*/x; my $unrecognized = qr/\w+ | ./x;

| Returns the value of a matched char literal, or dies if it is invalid

sub char_val {

   my $str = string_val();
   die "Multiple characters\n" if length $str > 1;
   die "No character\n"        if length $str == 0;
   ord $str;

}

| Returns the value of a matched string literal, or dies if it is invalid

sub string_val {

   my ($str, $end) = ($1, $2);
   die "End-of-file\n" if not defined $end;
   die "End-of-line\n" if $str =~ /\n/;
   $str =~ s/\\(.)/
         $1 eq 'n'  ? "\n"
       : $1 eq '\\' ? $1
       : $1 eq $end ? $1
       : die "Unknown escape sequence \\$1\n"
   /rge;

}

| Returns the source string of a matched literal

sub raw { $& }

| Returns the source string of a matched string literal, or dies if invalid

sub string_raw {

   string_val(); # Just for the error handling side-effects
   $&;

}

----- Lexer "engine" -----#

Construct the scanner regex:

my $tokens =

   join "|",
   map {
       my $format = $tokens[$_][1];
       "\n".(ref $format ? $format : quotemeta $format)." (*MARK:$_) ";
   } 0..$#tokens;

my $regex = qr/

   \G (?| $whitespace  \K (?| $tokens )
        | $whitespace? \K ($unrecognized) (*MARK:!) )

/x;

Run the lexer:

my $input = do { local $/ = undef; <STDIN> }; my $pos = 0; my $linecol = linecol_accumulator();

while ($input =~ /$regex/g) {

   # Get the line and column number
   my ($line, $col) = $linecol->(substr $input, $pos, $-[0] - $pos);
   $pos = $-[0];

   # Get the token type that was identified by the scanner regex
   my $type = $main::REGMARK;
   die "Unrecognized token $1 at line $line, col $col\n" if $type eq '!';
   my ($name, $evaluator) = @{$tokens[$type]}[0, 2];

   # Get the token value
   my $value;
   if ($evaluator) {
       eval { $value = $evaluator->() };
       if ($@) { chomp $@; die "$@ in $name at line $line, col $col\n" }
   }

   # Print the output line
   print "$line\t$col\t$name".($value ? "\t$value" : )."\n";

}

| Returns a closure, which can be fed a string one piece at a time and gives
| back the cumulative line and column number each time

sub linecol_accumulator {

   my ($line, $col) = (1, 1);
   sub {
       my $str = shift;
       my @lines = split "\n", $str, -1;
       my ($l, $c) = @lines ? (@lines - 1, length $lines[-1]) : (0, 0);
       if ($l) { $line += $l;  $col = 1 + $c }
       else    { $col += $c }
       ($line, $col)
   }

}</lang>

Output — test case 3:

5       16      Keyword_print
5       40      Op_subtract
6       16      Keyword_putc
6       40      Op_less
7       16      Keyword_if
7       40      Op_greater
8       16      Keyword_else
8       40      Op_lessequal
9       16      Keyword_while
9       40      Op_greaterequal
10      16      LeftBrace
10      40      Op_equal
11      16      RightBrace
11      40      Op_not
11      41      Op_assign
12      16      LeftParen
12      40      Op_and
13      16      RightParen
13      40      Op_or
14      16      Op_subtract
14      40      Semicolon
15      16      Op_not
15      40      Comma
16      16      Op_multiply
16      40      Op_assign
17      16      Op_divide
17      40      Integer 42
18      16      Op_mod
18      40      String  "String literal"
19      16      Op_add
19      40      Identifier      variable_name
20      26      Integer 10
21      26      Integer 92
22      26      Integer 32
23      1       End_of_input

Perl 6

This is more complicated than strictly necessary for this task. It is set up to be easily adapted to do syntax analysis.

(Note: there are several bogus comments added solely to help with syntax highlighting.)

Works with: Rakudo version 2016.08

<lang perl6>grammar tiny_C {

   rule TOP { ^ <.whitespace>? <tokens> + % <.whitespace> <.whitespace> <eoi> }

   rule whitespace { [ <comment> + % <ws> | <ws> ] }

   token comment    { '/*' ~ '*/' .*? }

   token tokens {
       [
       | <operator>   { make $/<operator>.ast   }
       | <keyword>    { make $/<keyword>.ast    }
       | <symbol>     { make $/<symbol>.ast     }
       | <identifier> { make $/<identifier>.ast }
       | <integer>    { make $/<integer>.ast    }
       | <char>       { make $/<char>.ast       }
       | <string>     { make $/<string>.ast     }
       | <error>
       ]
   }

   proto token operator    {*}
   token operator:sym<*>   { '*'               { make 'Op_multiply'    } }
   token operator:sym</>   { '/'<!before '*'>  { make 'Op_divide'      } }
   token operator:sym<%>   { '%'               { make 'Op_mod'         } }
   token operator:sym<+>   { '+'               { make 'Op_add'         } }
   token operator:sym<->   { '-'               { make 'Op_subtract'    } }
   token operator:sym('<='){ '<='              { make 'Op_lessequal'   } }
   token operator:sym('<') { '<'               { make 'Op_less'        } }
   token operator:sym('>='){ '>='              { make 'Op_greaterequal'} }
   token operator:sym('>') { '>'               { make 'Op_greater'     } }
   token operator:sym<==>  { '=='              { make 'Op_equal'       } }
   token operator:sym<!=>  { '!='              { make 'Op_notequal'    } }
   token operator:sym<!>   { '!'               { make 'Op_not'         } }
   token operator:sym<=>   { '='               { make 'Op_assign'      } }
   token operator:sym<&&>  { '&&'              { make 'Op_and'         } }
   token operator:sym<||>  { '||'              { make 'Op_or'          } }

   proto token keyword      {*}
   token keyword:sym<if>    { 'if'    { make 'Keyword_if'    } }
   token keyword:sym<else>  { 'else'  { make 'Keyword_else'  } }
   token keyword:sym<putc>  { 'putc'  { make 'Keyword_putc'  } }
   token keyword:sym<while> { 'while' { make 'Keyword_while' } }
   token keyword:sym<print> { 'print' { make 'Keyword_print' } }

   proto token symbol  {*}
   token symbol:sym<(> { '(' { make 'LeftParen'  } }
   token symbol:sym<)> { ')' { make 'RightParen' } }
   token symbol:sym<{> { '{' { make 'LeftBrace'  } }
   token symbol:sym<}> { '}' { make 'RightBrace' } }
   token symbol:sym<;> { ';' { make 'Semicolon'   } }
   token symbol:sym<,> { ',' { make 'Comma'       } }

   token identifier { <[_A..Za..z]><[_A..Za..z0..9]>* { make 'Identifier ' ~ $/ } }
   token integer    { <[0..9]>+                       { make 'Integer '    ~ $/ } }

   token char {
       '\ [<-[']> | '\n' | '\\\\'] '\
       { make 'Char_Literal ' ~ $/.subst("\\n", "\n").substr(1, *-1).ord }
   }

   token string {
       '"' <-["\n]>* '"' #'
       {
           make 'String ' ~ $/;
           note 'Error: Unknown escape sequence.' and exit if (~$/ ~~ m:r/ <!after <[\\]>>[\\<-[n\\]>]<!before <[\\]>> /);
       }
   }

   token eoi { $ { make 'End_of_input' } }

   token error {
       | '\'\                   { note 'Error: Empty character constant.' and exit }
       | '\ <-[']> ** {2..*} '\ { note 'Error: Multi-character constant.' and exit }
       | '/*' <-[*]>* $             { note 'Error: End-of-file in comment.'   and exit }
       | '"' <-["]>* $              { note 'Error: End-of-file in string.'    and exit }
       | '"' <-["]>*? \n            { note 'Error: End of line in string.'    and exit } #'
   }

}

sub parse_it ( $c_code ) {

   my $l;
   my @pos = gather for $c_code.lines>>.chars.kv -> $line, $v {
       take [ $line + 1, $_ ] for 1 .. ($v+1); # v+1 for newline
       $l = $line+2;
   }
   @pos.push: [ $l, 1 ]; # capture eoi

   for flat $c_code<tokens>.list, $c_code<eoi> -> $m {
       say join "\t", @pos[$m.from].fmt('%3d'), $m.ast;
   }

}

my $tokenizer = tiny_C.parse(@*ARGS[0].IO.slurp); parse_it( $tokenizer );</lang>

Output — test case 3:

  5  16 Keyword_print
  5  40 Op_subtract
  6  16 Keyword_putc
  6  40 Op_less
  7  16 Keyword_if
  7  40 Op_greater
  8  16 Keyword_else
  8  40 Op_lessequal
  9  16 Keyword_while
  9  40 Op_greaterequal
 10  16 LeftBrace
 10  40 Op_equal
 11  16 RightBrace
 11  40 Op_notequal
 12  16 LeftParen
 12  40 Op_and
 13  16 RightParen
 13  40 Op_or
 14  16 Op_subtract
 14  40 Semicolon
 15  16 Op_not
 15  40 Comma
 16  16 Op_multiply
 16  40 Op_assign
 17  16 Op_divide
 17  40 Integer 42
 18  16 Op_mod
 18  40 String "String literal"
 19  16 Op_add
 19  40	Identifier variable_name
 20  26	Char_Literal 10
 21  26	Char_Literal 92
 22  26	Char_Literal 32
 23   1	End_of_input

Python

Tested with Python 2.7 and 3.x <lang Python>from __future__ import print_function import sys

following two must remain in the same order

tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq, tk_Gtr, \ tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While, tk_Print, \ tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, \ tk_Integer, tk_String = range(31)

all_syms = ["End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add", "Op_subtract",

   "Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal",
   "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or", "Keyword_if",
   "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc", "LeftParen",
   "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma", "Identifier",
   "Integer", "String"]

single character only symbols

symbols = { '{': tk_Lbrace, '}': tk_Rbrace, '(': tk_Lparen, ')': tk_Rparen, '+': tk_Add, '-': tk_Sub,

   '*': tk_Mul, '%': tk_Mod, ';': tk_Semi, ',': tk_Comma }

key_words = {'if': tk_If, 'else': tk_Else, 'print': tk_Print, 'putc': tk_Putc, 'while': tk_While}

the_ch = " " # dummy first char - but it must be a space the_col = 0 the_line = 1 input_file = None

- - - show error and exit

def error(line, col, msg):

   print(line, col, msg)
   exit(1)

- - - get the next character from the input

def next_ch():

   global the_ch, the_col, the_line

   the_ch = input_file.read(1)
   the_col += 1
   if the_ch == '\n':
       the_line += 1
       the_col = 0
   return the_ch

- - - 'x' - character constants

def char_lit(err_line, err_col):

   n = ord(next_ch())              # skip opening quote
   if the_ch == '\:
       error(err_line, err_col, "empty character constant")
   elif the_ch == '\\':
       next_ch()
       if the_ch == 'n':
           n = 10
       elif the_ch == '\\':
           n = ord('\\')
       else:
           error(err_line, err_col, "unknown escape sequence \\%c" % (the_ch))
   if next_ch() != '\:
       error(err_line, err_col, "multi-character constant")
   next_ch()
   return tk_Integer, err_line, err_col, n

- - - process divide or comments

def div_or_cmt(err_line, err_col):

   if next_ch() != '*':
       return tk_Div, err_line, err_col

   # comment found
   while True:
       if next_ch() == '*' and next_ch() == '/':
           next_ch()
           return gettok()
       elif len(the_ch) == 0:
           error(err_line, err_col, "EOF in comment")

- - - "string"

def string_lit(start, err_line, err_col):

   text = ""

   while next_ch() != start:
       if len(the_ch) == 0:
           error(err_line, err_col, "EOF while scanning string literal")
       if the_ch == '\n':
           error(err_line, err_col, "EOL while scanning string literal")
       text += the_ch

   next_ch()
   return tk_String, err_line, err_col, text

- - - handle identifiers and integers

def ident_or_int(err_line, err_col):

   is_number = True
   text = ""

   while the_ch.isalnum() or the_ch == '_':
       text += the_ch
       if not the_ch.isdigit():
           is_number = False
       next_ch()

   if len(text) == 0:
       error(err_line, err_col, "ident_or_int: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))

   if text[0].isdigit():
       if not is_number:
           error(err_line, err_col, "invalid number: %s" % (text))
       n = int(text)
       return tk_Integer, err_line, err_col, n

   if text in key_words:
       return key_words[text], err_line, err_col

   return tk_Ident, err_line, err_col, text

- - - look ahead for '>=', etc.

def follow(expect, ifyes, ifno, err_line, err_col):

   if next_ch() == expect:
       next_ch()
       return ifyes, err_line, err_col

   if ifno == tk_EOI:
       error(err_line, err_col, "follow: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))

   return ifno, err_line, err_col

- - - return the next token type

def gettok():

   while the_ch.isspace():
       next_ch()

   err_line = the_line
   err_col  = the_col

   if len(the_ch) == 0:    return tk_EOI, err_line, err_col
   elif the_ch == '/':     return div_or_cmt(err_line, err_col)
   elif the_ch == '\:    return char_lit(err_line, err_col)
   elif the_ch == '<':     return follow('=', tk_Leq, tk_Lss,    err_line, err_col)
   elif the_ch == '>':     return follow('=', tk_Geq, tk_Gtr,    err_line, err_col)
   elif the_ch == '=':     return follow('=', tk_Eq,  tk_Assign, err_line, err_col)
   elif the_ch == '!':     return follow('=', tk_Neq, tk_Not,    err_line, err_col)
   elif the_ch == '&':     return follow('&', tk_And, tk_EOI,    err_line, err_col)
   elif the_ch == '|':     return follow('|', tk_Or,  tk_EOI,    err_line, err_col)
   elif the_ch == '"':     return string_lit(the_ch, err_line, err_col)
   elif the_ch in symbols:
       sym = symbols[the_ch]
       next_ch()
       return sym, err_line, err_col
   else: return ident_or_int(err_line, err_col)

- - - main driver

input_file = sys.stdin if len(sys.argv) > 1:

   try:
       input_file = open(sys.argv[1], "r", 4096)
   except IOError as e:
       error(0, 0, "Can't open %s" % sys.argv[1])

while True:

   t = gettok()
   tok  = t[0]
   line = t[1]
   col  = t[2]

   print("%5d  %5d   %-14s" % (line, col, all_syms[tok]), end=)

   if tok == tk_Integer:  print("   %5d" % (t[3]))
   elif tok == tk_Ident:  print("  %s" %   (t[3]))
   elif tok == tk_String: print('  "%s"' % (t[3]))
   else:                  print("")

   if tok == tk_EOI:
       break</lang>

Output — test case 3:

5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input

@@ Line 211: / Line 211: @@
 | Unrecognized character.
 | <code>&#124;</code>
+|-
+| Invalid number. Starts like a number, but ends in non-numeric characters.
+| <code>123abc</code>
 |}