Compiler/lexical analyzer

From Rosetta Code
Revision as of 15:54, 20 November 2023 by PureFox (talk | contribs) (→‎{{header|Wren}}: Minor tidy)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Task
Compiler/lexical analyzer
You are encouraged to solve this task according to the task description, using any language you may know.
Lexical Analyzer

Definition from Wikipedia:

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).

Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.

Input Specification

The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:

Operators
Name Common name Character sequence
Op_multiply multiply *
Op_divide divide /
Op_mod mod %
Op_add plus +
Op_subtract minus -
Op_negate unary minus -
Op_less less than <
Op_lessequal less than or equal <=
Op_greater greater than >
Op_greaterequal greater than or equal >=
Op_equal equal ==
Op_notequal not equal !=
Op_not unary not !
Op_assign assignment =
Op_and logical and &&
Op_or logical or ¦¦
  • The - token should always be interpreted as Op_subtract by the lexer. Turning some Op_subtract into Op_negate will be the job of the syntax analyzer, which is not part of this task.
Symbols
Name Common name Character
LeftParen left parenthesis (
RightParen right parenthesis )
LeftBrace left brace {
RightBrace right brace }
Semicolon semi-colon ;
Comma comma ,
Keywords
Name Character sequence
Keyword_if if
Keyword_else else
Keyword_while while
Keyword_print print
Keyword_putc putc
Identifiers and literals

These differ from the the previous tokens, in that each occurrence of them has a value associated with it.

Name Common name Format description Format regex Value
Identifier identifier one or more letter/number/underscore characters, but not starting with a number [_a-zA-Z][_a-zA-Z0-9]* as is
Integer integer literal one or more digits [0-9]+ as is, interpreted as a number
Integer char literal exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes '([^'\n]|\\n|\\\\)' the ASCII code point number of the character, e.g. 65 for 'A' and 10 for '\n'
String string literal zero or more characters (anything except newline or double quote), enclosed by double quotes "[^"\n]*" the characters without the double quotes and with escape sequences converted
  • For char and string literals, the \n escape sequence is supported to represent a new-line character.
  • For char and string literals, to represent a backslash, use \\.
  • No other special sequences are supported. This means that:
    • Char literals cannot represent a single quote character (value 39).
    • String literals cannot represent strings containing double quote characters.
Zero-width tokens
Name Location
End_of_input when the end of the input stream is reached
White space
  • Zero or more whitespace characters, or comments enclosed in /* ... */, are allowed between any two tokens, with the exceptions noted below.
  • "Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
  • Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
    • This means: keywords, identifiers, and integer literals.
    • e.g. ifprint is recognized as an identifier, instead of the keywords if and print.
    • e.g. 42fred is invalid, and neither recognized as a number nor an identifier.
  • Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
    • e.g. & & is invalid, and not interpreted as the && operator.

For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:

  • if ( p /* meaning n is prime */ ) {
        print ( n , " " ) ;
        count = count + 1 ; /* number of primes found so far */
    }
    
  • if(p){print(n," ");count=count+1;}
    
Complete list of token names
End_of_input  Op_multiply   Op_divide     Op_mod       Op_add     Op_subtract
Op_negate     Op_not        Op_less       Op_lessequal Op_greater Op_greaterequal
Op_equal      Op_notequal   Op_assign     Op_and       Op_or      Keyword_if
Keyword_else  Keyword_while Keyword_print Keyword_putc LeftParen  RightParen
LeftBrace     RightBrace    Semicolon     Comma        Identifier Integer
String
Output Format

The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:

  1. the line number where the token starts
  2. the column number where the token starts
  3. the token name
  4. the token value (only for Identifier, Integer, and String tokens)
  5. the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.


This task is intended to be used as part of a pipeline, with the other compiler tasks - for example:
lex < hello.t | parse | gen | vm

Or possibly:
lex hello.t lex.out
parse lex.out parse.out
gen parse.out gen.out
vm gen.out


This implies that the output of this task (the lexical analyzer) should be suitable as input to any of the Syntax Analyzer task programs.

Diagnostics

The following error conditions should be caught:

Error Example
Empty character constant ''
Unknown escape sequence. \r
Multi-character constant. 'xx'
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character.
Invalid number. Starts like a number, but ends in non-numeric characters. 123abc
Test Cases
Input Output

Test Case 1:

/*
  Hello world
 */
print("Hello, World!\n");
    4      1 Keyword_print
    4      6 LeftParen
    4      7 String         "Hello, World!\n"
    4     24 RightParen
    4     25 Semicolon
    5      1 End_of_input

Test Case 2:

/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");
    4      1 Identifier     phoenix_number
    4     16 Op_assign
    4     18 Integer         142857
    4     24 Semicolon
    5      1 Keyword_print
    5      6 LeftParen
    5      7 Identifier     phoenix_number
    5     21 Comma
    5     23 String         "\n"
    5     27 RightParen
    5     28 Semicolon
    6      1 End_of_input

Test Case 3:

/*
  All lexical tokens - not syntactically correct, but that will
  have to wait until syntax analysis
 */
/* Print   */  print    /* Sub     */  -
/* Putc    */  putc     /* Lss     */  <
/* If      */  if       /* Gtr     */  >
/* Else    */  else     /* Leq     */  <=
/* While   */  while    /* Geq     */  >=
/* Lbrace  */  {        /* Eq      */  ==
/* Rbrace  */  }        /* Neq     */  !=
/* Lparen  */  (        /* And     */  &&
/* Rparen  */  )        /* Or      */  ||
/* Uminus  */  -        /* Semi    */  ;
/* Not     */  !        /* Comma   */  ,
/* Mul     */  *        /* Assign  */  =
/* Div     */  /        /* Integer */  42
/* Mod     */  %        /* String  */  "String literal"
/* Add     */  +        /* Ident   */  variable_name
/* character literal */  '\n'
/* character literal */  '\\'
/* character literal */  ' '
    5     16   Keyword_print
    5     40   Op_subtract
    6     16   Keyword_putc
    6     40   Op_less
    7     16   Keyword_if
    7     40   Op_greater
    8     16   Keyword_else
    8     40   Op_lessequal
    9     16   Keyword_while
    9     40   Op_greaterequal
   10     16   LeftBrace
   10     40   Op_equal
   11     16   RightBrace
   11     40   Op_notequal
   12     16   LeftParen
   12     40   Op_and
   13     16   RightParen
   13     40   Op_or
   14     16   Op_subtract
   14     40   Semicolon
   15     16   Op_not
   15     40   Comma
   16     16   Op_multiply
   16     40   Op_assign
   17     16   Op_divide
   17     40   Integer             42
   18     16   Op_mod
   18     40   String          "String literal"
   19     16   Op_add
   19     40   Identifier      variable_name
   20     26   Integer             10
   21     26   Integer             92
   22     26   Integer             32
   23      1   End_of_input

Test Case 4:

/*** test printing, embedded \n and comments with lots of '*' ***/
print(42);
print("\nHello World\nGood Bye\nok\n");
print("Print a slash n - \\n.\n");
    2      1 Keyword_print
    2      6 LeftParen
    2      7 Integer            42
    2      9 RightParen
    2     10 Semicolon
    3      1 Keyword_print
    3      6 LeftParen
    3      7 String          "\nHello World\nGood Bye\nok\n"
    3     38 RightParen
    3     39 Semicolon
    4      1 Keyword_print
    4      6 LeftParen
    4      7 String          "Print a slash n - \\n.\n"
    4     33 RightParen
    4     34 Semicolon
    5      1 End_of_input
Additional examples

Your solution should pass all the test cases above and the additional tests found Here.


Reference

The C and Python versions can be considered reference implementations.


Related Tasks



Ada

with Ada.Text_IO, Ada.Streams.Stream_IO, Ada.Strings.Unbounded, Ada.Command_Line,
     Ada.Exceptions;
use Ada.Strings, Ada.Strings.Unbounded, Ada.Streams, Ada.Exceptions;

procedure Main is
   package IO renames Ada.Text_IO;

   package Lexer is
      type Token is (Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract, Op_negate,
                     Op_less, Op_lessequal, Op_greater, Op_greaterequal, Op_equal,
                     Op_notequal, Op_not, Op_assign, Op_and, Op_or,

                     LeftParen, RightParen, LeftBrace, RightBrace, Semicolon, Comma,

                     Keyword_if, Keyword_else, Keyword_while, Keyword_print, Keyword_putc,
                     Identifier, Token_Integer, Token_String, End_of_input,

                     Empty_Char_Error, Invalid_Escape_Error, Multi_Char_Error, EOF_Comment_Error,
                     EOF_String_Error, EOL_String_Error, Invalid_Char_Error, Invalid_Num_Error
                    );

      subtype Operator is Token range Op_multiply .. Op_or;
      subtype Symbol is Token range Token'Succ(Operator'Last) .. Comma;
      subtype Keyword is Token range Token'Succ(Symbol'Last) .. Keyword_putc;
      subtype Error is Token range Empty_Char_Error .. Invalid_Num_Error;
      subtype Operator_or_Error is Token
        with Static_Predicate => Operator_or_Error in Operator | Error;

      subtype Whitespace is Character
        with Static_Predicate => Whitespace in ' ' | ASCII.HT | ASCII.CR | ASCII.LF;

      Lexer_Error : exception;
      Invalid_Escape_Code : constant Character := ASCII.NUL;

      procedure run(input : Stream_IO.File_Type);
   end Lexer;

   package body Lexer is
      use type Stream_IO.Count;

      procedure run(input : Stream_IO.File_Type) is
         type State is (State_Start, State_Identifier, State_Integer, State_Char, State_String,
                        State_Comment);
         curr_state : State := State_Start;
         curr_char : Character;
         curr_col, curr_row, token_col, token_row : Positive := 1;
         token_text : Unbounded_String := Unbounded.Null_Unbounded_String;

         function look_ahead return Character is
            next_char : Character := ASCII.LF;
         begin
            if not Stream_IO.End_Of_File(input) then
               next_char := Character'Input(Stream_IO.Stream(input));
               Stream_IO.Set_Index(input, Stream_IO.Index(input) - 1);
            end if;
            return next_char;
         end look_ahead;

         procedure next_char is
            next : Character := Character'Input(Stream_IO.Stream(input));
         begin
            curr_col := curr_col + 1;
            if curr_char = ASCII.LF then
               curr_row := curr_row + 1;
               curr_col := 1;
            end if;
            curr_char := next;
         end next_char;

         procedure print_token(tok : Token; text : String := "") is
            procedure raise_error(text : String) is
            begin
               raise Lexer_Error with "Error: " & text;
            end;
         begin
            IO.Put(token_row'Image & ASCII.HT & token_col'Image & ASCII.HT);
            case tok is
               when Operator | Symbol | Keyword | End_of_input => IO.Put_Line(tok'Image);
               when Token_Integer => IO.Put_Line("INTEGER" & ASCII.HT & text);
               when Token_String  => IO.Put_Line("STRING" & ASCII.HT & ASCII.Quotation & text & ASCII.Quotation);
               when Identifier    => IO.Put_Line(tok'Image & ASCII.HT & text);
               when Empty_Char_Error => raise_error("empty character constant");
               when Invalid_Escape_Error => raise_error("unknown escape sequence: " & text);
               when Multi_Char_Error => raise_error("multi-character constant: " & text);
               when EOF_Comment_Error => raise_error("EOF in comment");
               when EOF_String_Error => raise_error("EOF in string");
               when EOL_String_Error => raise_error("EOL in string");
               when Invalid_Char_Error => raise_error("invalid character: " & curr_char);
               when Invalid_Num_Error => raise_error("invalid number: " & text);
            end case;
         end print_token;

         procedure lookahead_choose(determiner : Character; a, b : Operator_or_Error) is
         begin
            if look_ahead = determiner then
               print_token(a);
               next_char;
            else
               print_token(b);
            end if;
         end lookahead_choose;

         function to_escape_code(c : Character) return Character is
         begin
            case c is
               when 'n' => return ASCII.LF;
               when '\' => return '\';
               when others =>
                  print_token(Invalid_Escape_Error, ASCII.Back_Slash & c);
                  return Invalid_Escape_Code;
            end case;
         end to_escape_code;
      begin
         curr_char := Character'Input(Stream_IO.Stream(input));
         loop
            case curr_state is
               when State_Start =>
                  token_col := curr_col;
                  token_row := curr_row;
                  case curr_char is
                     when '*' => print_token(Op_multiply);
                     when '/' =>
                        if look_ahead = '*' then
                           next_char;
                           curr_state := State_Comment;
                        else
                           print_token(Op_divide);
                        end if;
                     when '%' => print_token(Op_mod);
                     when '+' => print_token(Op_add);
                     when '-' => print_token(Op_subtract);
                     when '(' => print_token(LeftParen);
                     when ')' => print_token(RightParen);
                     when '{' => print_token(LeftBrace);
                     when '}' => print_token(RightBrace);
                     when ';' => print_token(Semicolon);
                     when ',' => print_token(Comma);
                     when '<' => lookahead_choose('=', Op_lessequal, Op_less);
                     when '>' => lookahead_choose('=', Op_greaterequal, Op_greater);
                     when '!' => lookahead_choose('=', Op_notequal, Op_not);
                     when '=' => lookahead_choose('=', Op_equal, Op_assign);
                     when '&' => lookahead_choose('&', Op_and, Invalid_Char_Error);
                     when '|' => lookahead_choose('|', Op_or, Invalid_Char_Error);
                     when 'a' .. 'z' | 'A' .. 'Z' | '_' =>
                        Unbounded.Append(token_text, curr_char);
                        curr_state := State_Identifier;
                     when '0' .. '9' =>
                        Unbounded.Append(token_text, curr_char);
                        curr_state := State_Integer;
                     when ''' => curr_state := State_Char;
                     when ASCII.Quotation => curr_state := State_String;
                     when Whitespace => null;
                     when others => null;
                  end case;
                  next_char;

               when State_Identifier =>
                  case curr_char is
                     when 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' =>
                        Unbounded.Append(token_text, curr_char);
                        next_char;
                     when others =>
                        if token_text = "if" then
                           print_token(Keyword_if);
                        elsif token_text = "else" then
                           print_token(Keyword_else);
                        elsif token_text = "while" then
                           print_token(Keyword_while);
                        elsif token_text = "print" then
                           print_token(Keyword_print);
                        elsif token_text = "putc" then
                           print_token(Keyword_putc);
                        else
                           print_token(Identifier, To_String(token_text));
                        end if;
                        Unbounded.Set_Unbounded_String(token_text, "");
                        curr_state := State_Start;
                  end case;

               when State_Integer =>
                  case curr_char is
                     when '0' .. '9' =>
                        Unbounded.Append(token_text, curr_char);
                        next_char;
                     when 'a' .. 'z' | 'A' .. 'Z' | '_' =>
                        print_token(Invalid_Num_Error, To_String(token_text));
                     when others =>
                        print_token(Token_Integer, To_String(token_text));
                        Unbounded.Set_Unbounded_String(token_text, "");
                        curr_state := State_Start;
                  end case;

               when State_Char =>
                  case curr_char is
                     when ''' =>
                        if Unbounded.Length(token_text) = 0 then
                           print_token(Empty_Char_Error);
                        elsif Unbounded.Length(token_text) = 1 then
                           print_token(Token_Integer, Character'Pos(Element(token_text, 1))'Image);
                        else
                           print_token(Multi_Char_Error, To_String(token_text));
                        end if;
                        Set_Unbounded_String(token_text, "");
                        curr_state := State_Start;
                     when '\' =>
                        Unbounded.Append(token_text, to_escape_code(look_ahead));
                        next_char;
                     when others => Unbounded.Append(token_text, curr_char);
                  end case;
                  next_char;

               when State_String =>
                  case curr_char is
                     when ASCII.Quotation =>
                        print_token(Token_String, To_String(token_text));
                        Set_Unbounded_String(token_text, "");
                        curr_state := State_Start;
                     when '\' =>
                        if to_escape_code(look_ahead) /= Invalid_Escape_Code then
                           Unbounded.Append(token_text, curr_char);
                        end if;
                     when ASCII.LF | ASCII.CR => print_token(EOL_String_Error);
                     when others => Unbounded.Append(token_text, curr_char);
                  end case;
                  next_char;

               when State_Comment =>
                  case curr_char is
                     when '*' =>
                        if look_ahead = '/' then
                           next_char;
                           curr_state := State_Start;
                        end if;
                     when others => null;
                  end case;
                  next_char;
            end case;
         end loop;
      exception
         when error : Stream_IO.End_Error =>
            if curr_state = State_String then
               print_token(EOF_String_Error);
            else
               print_token(End_of_input);
            end if;
         when error : Lexer.Lexer_Error => IO.Put_Line(Exception_Message(error));
      end run;
   end Lexer;

   source_file : Stream_IO.File_Type;
begin
   if Ada.Command_Line.Argument_Count < 1 then
      IO.Put_Line("usage: lex [filename]");
      return;
   end if;
   Stream_IO.Open(source_file, Stream_IO.In_File, Ada.Command_Line.Argument(1));
   Lexer.run(source_file);
exception
   when error : others => IO.Put_Line("Error: " & Exception_Message(error));
end Main;
Output:
Test case 3
 5	 16	KEYWORD_PRINT
 5	 40	OP_SUBTRACT
 6	 16	KEYWORD_PUTC
 6	 40	OP_LESS
 7	 16	KEYWORD_IF
 7	 40	OP_GREATER
 8	 16	KEYWORD_ELSE
 8	 40	OP_LESSEQUAL
 9	 16	KEYWORD_WHILE
 9	 40	OP_GREATEREQUAL
 10	 16	LEFTBRACE
 10	 40	OP_EQUAL
 11	 16	RIGHTBRACE
 11	 40	OP_NOTEQUAL
 12	 16	LEFTPAREN
 12	 40	OP_AND
 13	 16	RIGHTPAREN
 13	 40	OP_OR
 14	 16	OP_SUBTRACT
 14	 40	SEMICOLON
 15	 16	OP_NOT
 15	 40	COMMA
 16	 16	OP_MULTIPLY
 16	 40	OP_ASSIGN
 17	 16	OP_DIVIDE
 17	 40	INTEGER	42
 18	 16	OP_MOD
 18	 40	STRING	"String literal"
 19	 16	OP_ADD
 19	 40	IDENTIFIER	variable_name
 20	 26	INTEGER	 10
 21	 26	INTEGER	 92
 22	 26	INTEGER	 32
 23	 1	END_OF_INPUT

ALGOL 68

This is a simple token in, line out program. It doesn't keep an internal representation of tokens or anything like that, since that's not needed at all.

As an addition, it emits a diagnostic if integer literals are too big.

BEGIN
  # implement C-like getchar, where EOF and EOLn are "characters" (-1 and 10 resp.). #
  INT eof = -1, eoln = 10;
  BOOL eof flag := FALSE;
  STRING buf := "";
  INT col := 1;
  INT line := 0;
  on logical file end (stand in, (REF FILE f)BOOL: eof flag := TRUE);
  PROC getchar = INT:
    IF   eof flag      THEN eof
    ELIF col = UPB buf THEN col +:= 1; eoln
    ELIF col > UPB buf THEN IF line > 0 THEN read(newline) FI;
                            line +:= 1;
                            read(buf);
                            IF eof flag THEN col := 1; eof
                            ELSE col := 0; getchar
                            FI
    ELSE col +:= 1; ABS buf[col]
    FI;
  PROC nextchar = INT: IF eof flag THEN eof ELIF col >= UPB buf THEN eoln ELSE ABS buf[col+1] FI;

  PROC is blank = (INT ch) BOOL: ch = 0 OR ch = 9 OR ch = 10 OR ch = 13 OR ch = ABS " ";
  PROC is digit = (INT ch) BOOL: ch >= ABS "0" AND ch <= ABS "9";
  PROC is ident start = (INT ch) BOOL: ch >= ABS "A" AND ch <= ABS "Z" OR
                                       ch >= ABS "a" AND ch <= ABS "z" OR
                                       ch = ABS "_";
  PROC is ident = (INT ch) BOOL: is ident start(ch) OR is digit(ch);

  PROC ident or keyword = (INT start char) VOID:
    BEGIN
      STRING w := REPR start char;
      INT start col = col;
      WHILE is ident (next char) DO w +:= REPR getchar OD;
      IF   w = "if"    THEN output2("Keyword_if", start col)
      ELIF w = "else"  THEN output2("Keyword_else", start col)
      ELIF w = "while" THEN output2("Keyword_while", start col)
      ELIF w = "print" THEN output2("Keyword_print", start col)
      ELIF w = "putc"  THEN output2("Keyword_putc", start col)
      ELSE output2("Identifier " + w, start col)
      FI
    END;
  PROC char = VOID:
    BEGIN
      INT start col = col;
      INT ch := getchar;
      IF   ch = ABS "'" THEN error("Empty character constant")
      ELIF ch = ABS "\" THEN ch := getchar;
                             IF   ch = ABS "n" THEN ch := 10
                             ELIF ch = ABS "\" THEN SKIP
                             ELSE error("Unknown escape sequence. \" + REPR ch)
                             FI
      FI;
      IF nextchar /= ABS "'" THEN error("Multi-character constant.") FI;
      getchar;
      output2("Integer " + whole(ch, 0), start col)
    END;
  PROC string = VOID:
    BEGIN
      INT start col = col;
      STRING s := """";
      WHILE INT ch := getchar; ch /= ABS """"
      DO
        IF   ch = eoln     THEN error("End-of-line while scanning string literal. Closing string character not found before end-of-line.")
        ELIF ch = eof      THEN error("End-of-file while scanning string literal. Closing string character not found.")
        ELIF ch = ABS "\"  THEN s +:= REPR ch; ch := getchar;
                                IF ch /= ABS "\" AND ch /= ABS "n" THEN error("Unknown escape sequence. \" + REPR ch) FI;
                                s +:= REPR ch
        ELSE s +:= REPR ch
        FI
      OD;
      output2("String " + s + """", start col)
    END;
  PROC comment = VOID:
    BEGIN
      WHILE INT ch := getchar; NOT (ch = ABS "*" AND nextchar = ABS "/")
      DO IF ch = eof THEN error("End-of-file in comment. Closing comment characters not found.") FI
      OD;
      getchar
    END;
  PROC number = (INT first digit) VOID:
    BEGIN
      INT start col = col;
      INT n := first digit - ABS "0";
      WHILE is digit (nextchar) DO
        INT u := getchar - ABS "0";
        IF LENG n * 10 + LENG u > max int THEN error("Integer too big") FI;
        n := n * 10 + u
      OD;
      IF is ident start (nextchar) THEN error("Invalid number. Starts like a number, but ends in non-numeric characters.") FI;
      output2("Integer " + whole(n, 0), start col)
    END;

  PROC output  = (STRING s) VOID: output2(s, col);
  PROC output2 = (STRING s, INT col) VOID: print((whole(line,-8), whole(col,-8), "  ", s, newline));

  PROC if follows = (CHAR second, STRING longer, shorter) VOID:
    IF nextchar = ABS second
    THEN output(longer); getchar
    ELSE output(shorter)
    FI;
  PROC error = (STRING s)VOID: (put(stand error, ("At ", whole(line,0), ":", whole(col,0), " ", s, new line)); stop);
  PROC unrecognized = (INT char) VOID: error("Unrecognized character " + REPR char);
  PROC double char = (INT first, STRING op) VOID:
    IF nextchar /= first THEN unrecognized(first)
    ELSE output2(op, col-1); getchar
    FI;

  WHILE INT ch := getchar; ch /= eof
  DO
    IF   is blank(ch) THEN SKIP
    ELIF ch = ABS "(" THEN output("LeftParen")
    ELIF ch = ABS ")" THEN output("RightParen")
    ELIF ch = ABS "{" THEN output("LeftBrace")
    ELIF ch = ABS "}" THEN output("RightBrace")
    ELIF ch = ABS ";" THEN output("Semicolon")
    ELIF ch = ABS "," THEN output("Comma")
    ELIF ch = ABS "*" THEN output("Op_multiply")
    ELIF ch = ABS "/" THEN IF next char = ABS "*" THEN comment
                           ELSE output("Op_divide")
                           FI
    ELIF ch = ABS "%" THEN output("Op_mod")
    ELIF ch = ABS "+" THEN output("Op_add")
    ELIF ch = ABS "-" THEN output("Op_subtract")
    ELIF ch = ABS "<" THEN if follows("=", "Op_lessequal", "Op_less")
    ELIF ch = ABS ">" THEN if follows("=", "Op_greaterequal", "Op_greater")
    ELIF ch = ABS "=" THEN if follows("=", "Op_equal", "Op_assign")
    ELIF ch = ABS "!" THEN if follows("=", "Op_notequal", "Op_not")
    ELIF ch = ABS "&" THEN double char(ch, "Op_and")
    ELIF ch = ABS "|" THEN double char(ch, "Op_or")
    ELIF is ident start (ch) THEN ident or keyword (ch)
    ELIF ch = ABS """" THEN string
    ELIF ch = ABS "'" THEN char
    ELIF is digit(ch) THEN number(ch)
    ELSE unrecognized(ch)
    FI
  OD;
  output("End_Of_Input")
END

ALGOL W

begin
    %lexical analyser %
    % Algol W strings are limited to 256 characters in length so we limit source lines %
    % and tokens to 256 characters %

    integer     lineNumber, columnNumber;
    string(256) line;
    string(256) tkValue;
    integer     tkType, tkLine, tkColumn, tkLength, tkIntegerValue;
    logical     tkTooLong;
    string(1)   currChar;
    string(1)   newlineChar;

    integer     LINE_WIDTH, MAX_TOKEN_LENGTH, MAXINTEGER_OVER_10, MAXINTEGER_MOD_10;
    integer     tOp_multiply   , tOp_divide        , tOp_mod       , tOp_add
          ,     tOp_subtract   , tOp_negate        , tOp_less      , tOp_lessequal
          ,     tOp_greater    , tOp_greaterequal  , tOp_equal     , tOp_notequal
          ,     tOp_not        , tOp_assign        , tOp_and       , tOp_or
          ,     tLeftParen     , tRightParen       , tLeftBrace    , tRightBrace
          ,     tSemicolon     , tComma            , tKeyword_if   , tKeyword_else
          ,     tKeyword_while , tKeyword_print    , tKeyword_putc , tIdentifier
          ,     tInteger       , tString           , tEnd_of_input , tComment
          ;

    string(16)  array tkName ( 1 :: 32 );

    % reports an error %
    procedure lexError( string(80) value message ); begin
        integer errorPos;
        write( i_w := 1, s_w := 0, "**** Error at(", lineNumber, ",", columnNumber, "): " );
        errorPos := 0;
        while errorPos < 80 and message( errorPos // 1 ) not = "." do begin
            writeon( s_w := 0, message( errorPos // 1 ) );
            errorPos := errorPos + 1
        end while_not_at_end_of_message ;
        writeon( s_w := 0, "." )
    end lexError ;

    % gets the next source character %
    procedure nextChar ; begin
        if      columnNumber = LINE_WIDTH then begin
            currChar     := newlineChar;
            columnNumber := columnNumber + 1
            end
        else if columnNumber > LINE_WIDTH then begin
            readcard( line );
            columnNumber := 1;
            if not XCPNOTED(ENDFILE) then lineNumber := lineNumber + 1;
            currChar     := line( 0 // 1 )
            end
        else begin
            currChar     := line( columnNumber // 1 );
            columnNumber := columnNumber + 1
        end
    end nextChar ;

    % gets the next token, returns the token type %
    integer procedure nextToken ; begin

        % returns true if currChar is in the inclusive range lowerValue to upperValue %
        %         false otherwise %
        logical procedure range( string(1) value lowerValue, upperValue ) ; begin
            currChar >= lowerValue and currChar <= upperValue
        end range ;

        % returns true if the current character can start an identifier, false otherwise %
        logical procedure identifierStartChar ; begin
            currChar = "_" or range( "a", "z" ) or range( "A", "Z" )
        end identifierStartChar ;

        % add the current character to the token and get the next %
        procedure addAndNextChar ; begin
            if tkLength >= MAX_TOKEN_LENGTH then tkTooLong := true
            else begin
                tkValue( tkLength // 1 ) := currChar;
                tkLength                 := tkLength + 1
            end if_symbol_not_too_long ;
            nextChar
        end % addAndNextChar % ;

        % handle a single character token %
        procedure singleCharToken( integer value tokenType ) ; begin
            tkType := tokenType;
            nextChar
        end singleCharToken ;

        % handle a doubled character token: && or || %
        procedure doubleCharToken( integer value tokenType ) ; begin
            string(1) firstChar;
            firstChar := currChar;
            tkType    := tokenType;
            nextChar;
            if currChar = firstChar then nextChar
            else % the character wasn't doubled % lexError( "Unrecognised character." );
        end singleCharToken ;

        % handle an operator or operator= token %
        procedure opOrOpEqual( integer value opToken, opEqualToken ) ; begin
            tkType := opToken;
            nextChar;
            if currChar = "=" then begin
                % have operator= %
                tkType := opEqualToken;
                nextChar
            end if_currChar_is_equal ;
        end opOrOpEqual ;

        % handle a / operator or /* comment %
        procedure divideOrComment ; begin
            tkType := tOp_divide;
            nextChar;
            if currChar = "*" then begin
                % have a comment %
                logical moreComment;
                tkType      := tComment;
                moreComment := true;
                while moreComment do begin
                    nextChar;
                    while currChar not = "*" and not XCPNOTED(ENDFILE) do nextChar;
                    while currChar     = "*" and not XCPNOTED(ENDFILE) do nextChar;
                    moreComment := ( currChar not = "/" and not XCPNOTED(ENDFILE) )
                end while_more_comment ;
                if not XCPNOTED(ENDFILE)
                then nextChar
                else lexError( "End-of-file in comment." )
            end if_currChar_is_star ;
        end divideOrComment ;

        % handle an indentifier or keyword %
        procedure identifierOrKeyword ; begin
            tkType := tIdentifier;
            while identifierStartChar or range( "0", "9" ) do addAndNextChar;
            % there are only 5 keywords, so we just test each in turn here %
            if      tkValue = "if"      then tkType  := tKeyword_if
            else if tkValue = "else"    then tkType  := tKeyword_else
            else if tkValue = "while"   then tkType  := tKeyword_while
            else if tkValue = "print"   then tkType  := tKeyword_print
            else if tkValue = "putc"    then tkType  := tKeyword_putc;
            if tkType not = tIdentifier then tkValue := "";
        end identifierOrKeyword ;

        % handle an integer literal %
        procedure integerLiteral ; begin
            logical overflowed;
            integer digit;
            overflowed := false;
            tkType     := tInteger;
            while range( "0", "9" ) do begin
                digit := ( decode( currChar ) - decode( "0" ) );
                if      tkIntegerValue > MAXINTEGER_OVER_10 then overflowed := true
                else if tkIntegerValue = MAXINTEGER_OVER_10
                    and digit          > MAXINTEGER_MOD_10  then overflowed := true
                else begin
                    tkIntegerValue := tkIntegerValue * 10;
                    tkIntegerValue := tkIntegerValue + digit;
                end;
                nextChar
            end while_have_a_digit ;
            if overflowed          then lexError( "Number too large." );
            if identifierStartChar then lexError( "Number followed by letter or underscore." );
        end integerLiteral ;

        % handle a char literal %
        procedure charLiteral ; begin
            nextChar;
            if      currChar = "'" or currChar = newlineChar then lexError( "Invalid character constant." )
            else if currChar = "\" then begin
                % have an escape %
                nextChar;
                if      currChar     = "n" then currChar := newlineChar
                else if currChar not = "\" then lexError( "Unknown escape sequence." )
            end;
            tkType         := tInteger;
            tkIntegerValue := decode( currChar );
            % should have a closing quoute next %
            nextChar;
            if   currChar not = "'"
            then lexError( "Multi-character constant." )
            else nextChar
        end charLiteral ;

        % handle a string literal %
        procedure stringLiteral ; begin
            tkType            := tString;
            tkValue( 0 // 1 ) := currChar;
            tkLength          := 1;
            nextChar;
            while currChar not = """" and currChar not = newlineChar and not XCPNOTED(ENDFILE) do addAndNextChar;
            if      currChar = newlineChar then lexError( "End-of-line while scanning string literal." )
            else if XCPNOTED(ENDFILE)      then lexError( "End-of-file while scanning string literal." )
            else    % currChar must be """" % addAndNextChar
        end stringLiteral ;

        while begin
            % skip white space %
            while ( currChar = " " or currChar = newlineChar ) and not XCPNOTED(ENDFILE) do nextChar;
            % get the token %
            tkLine         := lineNumber;
            tkColumn       := columnNumber;
            tkValue        := "";
            tkLength       := 0;
            tkIntegerValue := 0;
            tkTooLong      := false;
            if      XCPNOTED(ENDFILE)   then tkType := tEnd_of_input
            else if currChar = "*"      then singleCharToken( tOp_multiply )
            else if currChar = "/"      then divideOrComment
            else if currChar = "%"      then singleCharToken( tOp_mod      )
            else if currChar = "+"      then singleCharToken( tOp_add )
            else if currChar = "-"      then singleCharToken( tOp_subtract )
            else if currChar = "<"      then opOrOpEqual( tOp_less,    tOp_lessequal     )
            else if currChar = ">"      then opOrOpEqual( tOp_greater, tOp_greaterequal  )
            else if currChar = "="      then opOrOpEqual( tOp_assign,  tOp_equal         )
            else if currChar = "!"      then opOrOpEqual( tOp_not,     tOp_notequal      )
            else if currChar = "&"      then doubleCharToken( tOp_and     )
            else if currChar = "|"      then doubleCharToken( tOp_or      )
            else if currChar = "("      then singleCharToken( tLeftParen  )
            else if currChar = ")"      then singleCharToken( tRightParen )
            else if currChar = "{"      then singleCharToken( tLeftBrace  )
            else if currChar = "}"      then singleCharToken( tRightBrace )
            else if currChar = ";"      then singleCharToken( tSemicolon  )
            else if currChar = ","      then singleCharToken( tComma      )
            else if identifierStartChar then identifierOrKeyword
            else if range( "0", "9" )   then integerLiteral
            else if currChar = "'"      then charLiteral
            else if currChar = """"     then stringLiteral
            else begin
                lexError( "Unrecognised character." );
                singleCharToken( tComment )
            end ;
            % continue until we get something other than a comment %
            tkType = tComment
        end do begin end;
        if tkTooLong then if   tkType = tString
                          then lexError( "String literal too long." )
                          else lexError( "Identifier too long."     );
        tkType
    end nextToken ;

    % outputs the current token %
    procedure writeToken ; begin
        write( i_w := 5, s_w := 2, tkLine, tkColumn, tkName( tkType ) );
        if tkType = tInteger then writeon( i_w := 11, tkIntegerValue )
        else if tkLength > 0 then begin
            writeon( "  " );
            for tkPos := 0 until tkLength - 1 do writeon( s_w := 0, tkValue( tkPos // 1 ) );
        end
    end writeToken ;

    LINE_WIDTH       := 256; MAXINTEGER_MOD_10  := MAXINTEGER rem 10;
    MAX_TOKEN_LENGTH := 256; MAXINTEGER_OVER_10 := MAXINTEGER div 10;
    newlineChar      := code( 10 );
    tOp_multiply     :=  1; tkName( tOp_multiply     ) := "Op_multiply";
    tOp_divide       :=  2; tkName( tOp_divide       ) := "Op_divide";
    tOp_mod          :=  3; tkName( tOp_mod          ) := "Op_mod";
    tOp_add          :=  4; tkName( tOp_add          ) := "Op_add";
    tOp_subtract     :=  5; tkName( tOp_subtract     ) := "Op_subtract";
    tOp_negate       :=  6; tkName( tOp_negate       ) := "Op_negate";
    tOp_less         :=  7; tkName( tOp_less         ) := "Op_less";
    tOp_lessequal    :=  8; tkName( tOp_lessequal    ) := "Op_lessequal";
    tOp_greater      :=  9; tkName( tOp_greater      ) := "Op_greater";
    tOp_greaterequal := 10; tkName( tOp_greaterequal ) := "Op_greaterequal";
    tOp_equal        := 11; tkName( tOp_equal        ) := "Op_equal";
    tOp_notequal     := 12; tkName( tOp_notequal     ) := "Op_notequal";
    tOp_not          := 13; tkName( tOp_not          ) := "Op_not";
    tOp_assign       := 14; tkName( tOp_assign       ) := "Op_assign";
    tOp_and          := 15; tkName( tOp_and          ) := "Op_and";
    tOp_or           := 16; tkName( tOp_or           ) := "Op_or";
    tLeftParen       := 17; tkName( tLeftParen       ) := "LeftParen";
    tRightParen      := 18; tkName( tRightParen      ) := "RightParen";
    tLeftBrace       := 19; tkName( tLeftBrace       ) := "LeftBrace";
    tRightBrace      := 20; tkName( tRightBrace      ) := "RightBrace";
    tSemicolon       := 21; tkName( tSemicolon       ) := "Semicolon";
    tComma           := 22; tkName( tComma           ) := "Comma";
    tKeyword_if      := 23; tkName( tKeyword_if      ) := "Keyword_if";
    tKeyword_else    := 24; tkName( tKeyword_else    ) := "Keyword_else";
    tKeyword_while   := 25; tkName( tKeyword_while   ) := "Keyword_while";
    tKeyword_print   := 26; tkName( tKeyword_print   ) := "Keyword_print";
    tKeyword_putc    := 27; tkName( tKeyword_putc    ) := "Keyword_putc";
    tIdentifier      := 28; tkName( tIdentifier      ) := "Identifier";
    tInteger         := 29; tkName( tInteger         ) := "Integer";
    tString          := 30; tkName( tString          ) := "String";
    tEnd_of_input    := 31; tkName( tEnd_of_input    ) := "End_of_input";
    tComment         := 32; tkName( tComment         ) := "Comment";

    % allow the program to continue after reaching end-of-file %
    ENDFILE := EXCEPTION( false, 1, 0, false, "EOF" );
    % ensure the first call to nextToken reads the first line %
    lineNumber   := 0;
    columnNumber := LINE_WIDTH + 1;
    currChar     := " ";
    % get and print all tokens from standard input %
    while nextToken not = tEnd_of_input do writeToken;
    writeToken
end.
Output:
Test case 3
    5     16  Keyword_print            
    5     40  Op_subtract     
    6     16  Keyword_putc            
    6     40  Op_less         
    7     16  Keyword_if            
    7     40  Op_greater      
    8     16  Keyword_else            
    8     40  Op_lessequal    
    9     16  Keyword_while            
    9     40  Op_greaterequal 
   10     16  LeftBrace       
   10     40  Op_equal        
   11     16  RightBrace      
   11     40  Op_notequal     
   12     16  LeftParen       
   12     40  Op_and          
   13     16  RightParen      
   13     40  Op_or           
   14     16  Op_subtract     
   14     40  Semicolon       
   15     16  Op_not          
   15     40  Comma           
   16     16  Op_multiply     
   16     40  Op_assign       
   17     16  Op_divide       
   17     40  Integer                  42  
   18     16  Op_mod          
   18     40  String              "String literal"
   19     16  Op_add          
   19     40  Identifier          variable_name
   20     26  Integer                  10  
   21     26  Integer                  92  
   22     26  Integer                  32  
   23      1  End_of_input    

ATS

One interesting feature of this implementation is my liberal use of a pushback buffer for input characters. This kept the code modular and easier to write.

(One point of note: the C "EOF" pseudo-character is detected in the following code by looking for a negative number. That EOF has to be negative and the other characters non-negative is implied by the ISO C standard.)

(********************************************************************)
(* Usage: lex [INPUTFILE [OUTPUTFILE]]
   If INPUTFILE or OUTPUTFILE is "-" or missing, then standard input
   or standard output is used, respectively. *)

#define ATS_DYNLOADFLAG 0

#include "share/atspre_staload.hats"
staload UN = "prelude/SATS/unsafe.sats"

#define NIL list_nil ()
#define ::  list_cons

%{^
/* alloca(3) is needed for ATS exceptions. */
#include <alloca.h>
%}

(********************************************************************)

#define NUM_TOKENS 31
#define RESERVED_WORD_HASHTAB_SIZE 9

#define TOKEN_ELSE          0
#define TOKEN_IF            1
#define TOKEN_PRINT         2
#define TOKEN_PUTC          3
#define TOKEN_WHILE         4
#define TOKEN_MULTIPLY      5
#define TOKEN_DIVIDE        6
#define TOKEN_MOD           7
#define TOKEN_ADD           8
#define TOKEN_SUBTRACT      9
#define TOKEN_NEGATE       10
#define TOKEN_LESS         11
#define TOKEN_LESSEQUAL    12
#define TOKEN_GREATER      13
#define TOKEN_GREATEREQUAL 14
#define TOKEN_EQUAL        15
#define TOKEN_NOTEQUAL     16
#define TOKEN_NOT          17
#define TOKEN_ASSIGN       18
#define TOKEN_AND          19
#define TOKEN_OR           20
#define TOKEN_LEFTPAREN    21
#define TOKEN_RIGHTPAREN   22
#define TOKEN_LEFTBRACE    23
#define TOKEN_RIGHTBRACE   24
#define TOKEN_SEMICOLON    25
#define TOKEN_COMMA        26
#define TOKEN_IDENTIFIER   27
#define TOKEN_INTEGER      28
#define TOKEN_STRING       29
#define TOKEN_END_OF_INPUT 30

typedef token_t =
  [i : int | TOKEN_ELSE <= i; i <= TOKEN_END_OF_INPUT] 
  int i
typedef tokentuple_t = (token_t, String, ullint, ullint)
typedef token_names_vt = @[string][NUM_TOKENS]

vtypedef reserved_words_vt =
  @[String][RESERVED_WORD_HASHTAB_SIZE]
vtypedef reserved_word_tokens_vt =
  @[token_t][RESERVED_WORD_HASHTAB_SIZE]

vtypedef lookups_vt =
  [p_toknames : addr]
  [p_wordtab  : addr]
  [p_toktab   : addr]
  @{
    pf_toknames = token_names_vt @ p_toknames,
    pf_wordtab = reserved_words_vt @ p_wordtab,
    pf_toktab = reserved_word_tokens_vt @ p_toktab |
    toknames = ptr p_toknames,
    wordtab = ptr p_wordtab,
    toktab = ptr p_toktab
  }

fn
reserved_word_lookup
          (s         : String,
           lookups   : !lookups_vt,
           line_no   : ullint,
           column_no : ullint) : tokentuple_t =
  if string_length s < 2 then
    (TOKEN_IDENTIFIER, s, line_no, column_no)
  else
    let
      macdef wordtab = !(lookups.wordtab)
      macdef toktab = !(lookups.toktab)
      val hashval =
        g1uint_mod (g1ofg0 (char2ui s[0] + char2ui s[1]),
                    g1i2u RESERVED_WORD_HASHTAB_SIZE)
      val token = toktab[hashval]
    in
      if token = TOKEN_IDENTIFIER || s <> wordtab[hashval] then
        (TOKEN_IDENTIFIER, s, line_no, column_no)
      else
        (token, s, line_no, column_no)
    end

(********************************************************************)
(* Input allows pushback into a buffer. *)

typedef ch_t =
  @{
    ichar = int,
    line_no = ullint,
    column_no = ullint
  }

typedef inp_t (n : int) =
  [0 <= n]
  @{
    file = FILEref,
    pushback = list (ch_t, n),
    line_no = ullint,
    column_no = ullint
  }
typedef inp_t = [n : int] inp_t n

fn
get_ch (inp : inp_t) : (ch_t, inp_t) =
  case+ (inp.pushback) of
  | NIL =>
    let
      val c = fileref_getc (inp.file)
      val ch =
        @{
          ichar = c,
          line_no = inp.line_no,
          column_no = inp.column_no
        }
    in
      if c = char2i '\n' then
        let
          val inp =
            @{
              file = inp.file,
              pushback = inp.pushback,
              line_no = succ (inp.line_no),
              column_no = 1ULL
            }
        in
          (ch, inp)
        end
      else
        let
          val inp =
            @{
              file = inp.file,
              pushback = inp.pushback,
              line_no = inp.line_no,
              column_no = succ (inp.column_no)
            }
        in
          (ch, inp)
        end
    end
  | ch :: pushback =>
    let
      val inp =
        @{
          file = inp.file,
          pushback = pushback,
          line_no = inp.line_no,
          column_no = inp.column_no
        }
    in
      (ch, inp)
    end

fn
push_back_ch (ch  : ch_t,
              inp : inp_t) : [n : pos] inp_t n =
  let
    prval _ = lemma_list_param (inp.pushback)
  in
    @{
      file = inp.file,
      pushback = ch :: (inp.pushback),
      line_no = inp.line_no,
      column_no = inp.column_no
    }
  end    

(********************************************************************)

exception unterminated_comment of (ullint, ullint)
exception unterminated_character_literal of (ullint, ullint)
exception multicharacter_literal of (ullint, ullint)
exception unterminated_string_literal of (ullint, ullint, bool)
exception unsupported_escape of (ullint, ullint, int)
exception invalid_integer_literal of (ullint, ullint, String)
exception unexpected_character of (ullint, ullint, int)

fn
scan_comment (inp       : inp_t,
              line_no   : ullint,
              column_no : ullint) : inp_t =
  let
    fun
    loop (inp : inp_t) : inp_t =
      let
        val (ch, inp) = get_ch inp
      in
        if (ch.ichar) < 0 then
          $raise unterminated_comment (line_no, column_no)
        else if (ch.ichar) = char2i '*' then
          let
            val (ch1, inp) = get_ch inp
          in
            if (ch.ichar) < 0 then
              $raise unterminated_comment (line_no, column_no)
            else if (ch1.ichar) = char2i '/' then
              inp
            else
              loop inp
          end
        else
          loop inp
      end
  in
    loop inp
  end

fn
skip_spaces_and_comments (inp : inp_t) : [n : pos] inp_t n =
  let
    fun
    loop (inp : inp_t) : [n : pos] inp_t n =
      let
        val (ch, inp) = get_ch inp
      in
        if isspace (ch.ichar) then
          loop inp
        else if (ch.ichar) = char2i '/' then
          let
            val (ch1, inp) = get_ch inp
          in
            if (ch1.ichar) = char2i '*' then
              loop (scan_comment (inp, ch.line_no, ch.column_no))
            else
              let
                val inp = push_back_ch (ch1, inp)
                val inp = push_back_ch (ch, inp)
              in
                inp
              end
          end
        else
          push_back_ch (ch, inp)
      end
  in
    loop inp
  end

fn
reverse_list_to_string
          {m   : int}
          (lst : list (char, m)) : string m =
  let
    fun
    fill_array {n : nat | n <= m} .<n>.
               (arr : &(@[char][m + 1]),
                lst : list (char, n),
                n   : size_t n) : void =
      case+ lst of
      | NIL => ()
      | c :: tail =>
        begin
          arr[pred n] := c;
          fill_array (arr, tail, pred n)
        end

    prval _ = lemma_list_param lst
    val m : size_t m = i2sz (list_length lst)
    val (pf, pfgc | p) = array_ptr_alloc<char> (succ m)
    val _ = array_initize_elt<char> (!p, succ m, '\0')
    val _ = fill_array (!p, lst, m)
  in
    $UN.castvwtp0 @(pf, pfgc | p)
  end

extern fun {}
simple_scan$pred : int -> bool
fun {}
simple_scan {u : nat}
            (lst : list (char, u),
             inp : inp_t) :
    [m : nat]
    [n : pos]
    (list (char, m), inp_t n) =
  let
    val (ch, inp) = get_ch inp
  in
    if simple_scan$pred (ch.ichar) then
      simple_scan<> (int2char0 (ch.ichar) :: lst, inp)
    else
      let
        val inp = push_back_ch (ch, inp)
      in
        (lst, inp)
      end
  end

fn
is_ident_start (c : int) :<> bool =
  isalpha (c) || c = char2i '_'

fn
is_ident_continuation (c : int) :<> bool =
  isalnum (c) || c = char2i '_'

fn
scan_identifier_or_reserved_word
          (inp     : inp_t,
           lookups : !lookups_vt) :
    (tokentuple_t, [n : pos] inp_t n) =
  let
    val (ch, inp) = get_ch inp
    val _ = assertloc (is_ident_start (ch.ichar))

    implement simple_scan$pred<> c = is_ident_continuation c
    val (lst, inp) = simple_scan (int2char0 (ch.ichar) :: NIL, inp)

    val s = reverse_list_to_string lst
    val toktup =
      reserved_word_lookup (s, lookups, ch.line_no, ch.column_no)
  in
    (toktup, inp)
  end

fn
scan_integer_literal
          (inp     : inp_t,
           lookups : !lookups_vt) :
    (tokentuple_t, [n : pos] inp_t n) =
  let
    val (ch, inp) = get_ch inp
    val _ = assertloc (isdigit (ch.ichar))

    implement simple_scan$pred<> c = is_ident_continuation c
    val (lst, inp) = simple_scan (int2char0 (ch.ichar) :: NIL, inp)

    val s = reverse_list_to_string lst

    fun
    check_they_are_all_digits
              {n : nat} .<n>.
              (lst : list (char, n)) : void =
      case+ lst of
      | NIL => ()
      | c :: tail =>
        if isdigit c then
          check_they_are_all_digits tail
        else
          $raise invalid_integer_literal (ch.line_no, ch.column_no, s)

    val _ = check_they_are_all_digits lst
  in
    ((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
  end

fn
ichar2integer_literal (c : int) : String0 =
  let
    var buf = @[char][100] ('\0')
    val _ = $extfcall (int, "snprintf", addr@ buf, i2sz 99, "%d", c)
    val s = string1_copy ($UN.castvwtp0{String0} buf)
  in
    strnptr2string s
  end

fn
scan_character_literal_without_checking_end (inp : inp_t) :
    (tokentuple_t, inp_t) =
  let
    val (ch, inp) = get_ch inp
    val _ = assertloc ((ch.ichar) = '\'')

    val (ch1, inp) = get_ch inp
  in
    if (ch1.ichar) < 0 then
      $raise unterminated_character_literal (ch.line_no, ch.column_no)
    else if (ch1.ichar) = char2i '\\' then
      let
        val (ch2, inp) = get_ch inp
      in
        if (ch2.ichar) < 0 then
          $raise unterminated_character_literal (ch.line_no,
                                                 ch.column_no)
        else if (ch2.ichar) = char2i 'n' then
          let
            val s = ichar2integer_literal (char2i '\n')
          in
            ((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
          end
        else if (ch2.ichar) = char2i '\\' then
          let
            val s = ichar2integer_literal (char2i '\\')
          in
            ((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
          end
        else
          $raise unsupported_escape (ch1.line_no, ch1.column_no,
                                     ch2.ichar)
      end
    else
      let
        val s = ichar2integer_literal (ch1.ichar)
      in
        ((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
      end
  end

fn
scan_character_literal (inp : inp_t) : (tokentuple_t, inp_t) =
  let
    val (tok, inp) =
      scan_character_literal_without_checking_end inp
    val line_no = (tok.2)
    val column_no = (tok.3)

    fun
    check_end (inp : inp_t) : inp_t =
      let
        val (ch, inp) = get_ch inp
      in
        if (ch.ichar) = char2i '\'' then
          inp
        else
          let
            fun
            loop_to_end (ch1 : ch_t,
                         inp : inp_t) : inp_t =
              if (ch1.ichar) < 0 then
                $raise unterminated_character_literal (line_no,
                                                       column_no)
              else if (ch1.ichar) = char2i '\'' then
                $raise multicharacter_literal (line_no, column_no)
              else
                let
                  val (ch1, inp) = get_ch inp
                in
                  loop_to_end (ch1, inp)
                end

            val inp = loop_to_end (ch, inp)
          in
            inp
          end
      end

    val inp = check_end inp
  in
    (tok, inp)
  end

fn
scan_string_literal (inp : inp_t) : (tokentuple_t, inp_t) =
  let
    val (ch, inp) = get_ch inp
    val _ = assertloc ((ch.ichar) = '"')

    fun
    scan {u : pos}
         (lst : list (char, u),
          inp : inp_t) :
        [m : pos] (list (char, m), inp_t) =
      let
        val (ch1, inp) = get_ch inp
      in
        if (ch1.ichar) < 0 then
          $raise unterminated_string_literal (ch.line_no,
                                              ch.column_no, false)
        else if (ch1.ichar) = char2i '\n' then
          $raise unterminated_string_literal (ch.line_no,
                                              ch.column_no, true)
        else if (ch1.ichar) = char2i '"' then
          (lst, inp)
        else if (ch1.ichar) <> char2i '\\' then
          scan (int2char0 (ch1.ichar) :: lst, inp)
        else
          let
            val (ch2, inp) = get_ch inp
          in
            if (ch2.ichar) = char2i 'n' then
              scan ('n' :: '\\' :: lst, inp)
            else if (ch2.ichar) = char2i '\\' then
              scan ('\\' :: '\\' :: lst, inp)
            else
              $raise unsupported_escape (ch1.line_no, ch1.column_no,
                                         ch2.ichar)
          end
      end

    val lst = '"' :: NIL
    val (lst, inp) = scan (lst, inp)
    val lst = '"' :: lst
    val s = reverse_list_to_string lst
  in
    ((TOKEN_STRING, s, ch.line_no, ch.column_no), inp)
  end

fn
get_next_token (inp     : inp_t,
                lookups : !lookups_vt) : (tokentuple_t, inp_t) =
  let
    val inp = skip_spaces_and_comments inp
    val (ch, inp) = get_ch inp
    val ln = ch.line_no
    val cn = ch.column_no
  in
    if ch.ichar < 0 then
      ((TOKEN_END_OF_INPUT, "", ln, cn), inp)
    else
      case+ int2char0 (ch.ichar) of
      | ',' => ((TOKEN_COMMA, ",", ln, cn), inp)
      | ';' => ((TOKEN_SEMICOLON, ";", ln, cn), inp)
      | '\(' => ((TOKEN_LEFTPAREN, "(", ln, cn), inp)
      | ')' => ((TOKEN_RIGHTPAREN, ")", ln, cn), inp)
      | '\{' => ((TOKEN_LEFTBRACE, "{", ln, cn), inp)
      | '}' => ((TOKEN_RIGHTBRACE, "}", ln, cn), inp)
      | '*' => ((TOKEN_MULTIPLY, "*", ln, cn), inp)
      | '/' => ((TOKEN_DIVIDE, "/", ln, cn), inp)
      | '%' => ((TOKEN_MOD, "%", ln, cn), inp)
      | '+' => ((TOKEN_ADD, "+", ln, cn), inp)
      | '-' => ((TOKEN_SUBTRACT, "-", ln, cn), inp)
      | '<' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '=' then
            ((TOKEN_LESSEQUAL, "<=", ln, cn), inp)
          else
            let
              val inp = push_back_ch (ch1, inp)
            in
              ((TOKEN_LESS, "<", ln, cn), inp)
            end
        end
      | '>' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '=' then
            ((TOKEN_GREATEREQUAL, ">=", ln, cn), inp)
          else
            let
              val inp = push_back_ch (ch1, inp)
            in
              ((TOKEN_GREATER, ">", ln, cn), inp)
            end
        end
      | '=' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '=' then
            ((TOKEN_EQUAL, "==", ln, cn), inp)
          else
            let
              val inp = push_back_ch (ch1, inp)
            in
              ((TOKEN_ASSIGN, "=", ln, cn), inp)
            end
        end
      | '!' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '=' then
            ((TOKEN_NOTEQUAL, "!=", ln, cn), inp)
          else
            let
              val inp = push_back_ch (ch1, inp)
            in
              ((TOKEN_NOT, "!", ln, cn), inp)
            end
        end
      | '&' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '&' then
            ((TOKEN_AND, "&&", ln, cn), inp)
          else
            $raise unexpected_character (ch.line_no, ch.column_no,
                                         ch.ichar)
        end
      | '|' =>
        let
          val (ch1, inp) = get_ch inp
        in
          if (ch1.ichar) = char2i '|' then
            ((TOKEN_OR, "||", ln, cn), inp)
          else
            $raise unexpected_character (ch.line_no, ch.column_no,
                                         ch.ichar)
        end
      | '"' =>
        let
          val inp = push_back_ch (ch, inp)
        in
          scan_string_literal inp
        end
      | '\'' =>
        let
          val inp = push_back_ch (ch, inp)
        in
          scan_character_literal inp
        end
      | _ when isdigit (ch.ichar) =>
        let
          val inp = push_back_ch (ch, inp)
        in
          scan_integer_literal (inp, lookups)
        end
      | _ when is_ident_start (ch.ichar) =>
        let
          val inp = push_back_ch (ch, inp)
        in
          scan_identifier_or_reserved_word (inp, lookups)
        end
      | _ => $raise unexpected_character (ch.line_no, ch.column_no,
                                          ch.ichar)
  end

fn
fprint_ullint_rightjust (outf : FILEref,
                         num  : ullint) : void =
  if num < 10ULL then
    fprint! (outf, "    ", num)
  else if num < 100ULL then
    fprint! (outf, "   ", num)
  else if num < 1000ULL then
    fprint! (outf, "  ", num)
  else if num < 10000ULL then
    fprint! (outf, " ", num)
  else
    fprint! (outf, num)

fn
print_token (outf    : FILEref,
             toktup  : tokentuple_t,
             lookups : !lookups_vt) : void =
  let
    macdef toknames = !(lookups.toknames)
    val name = toknames[toktup.0]
    val str = (toktup.1)
    val line_no = (toktup.2)
    val column_no = (toktup.3)

    val _ = fprint_ullint_rightjust (outf, line_no)
    val _ = fileref_puts (outf, " ")
    val _ = fprint_ullint_rightjust (outf, column_no)
    val _ = fileref_puts (outf, "  ")
    val _ = fileref_puts (outf, name)
  in
    begin
      case+ toktup.0 of
      | TOKEN_IDENTIFIER => fprint! (outf, "     ", str)
      | TOKEN_INTEGER => fprint! (outf, "        ", str)
      | TOKEN_STRING => fprint! (outf, "         ", str)
      | _ => ()
    end;

    fileref_putc (outf, '\n')
  end

fn
scan_text (outf    : FILEref,
           inp     : inp_t,
           lookups : !lookups_vt) : void =
  let
    fun
    loop (inp     : inp_t,
          lookups : !lookups_vt) : void =
      let
        val (toktup, inp) = get_next_token (inp, lookups)
      in
        print_token (outf, toktup, lookups);
        if toktup.0 <> TOKEN_END_OF_INPUT then
          loop (inp, lookups)
      end
  in
    loop (inp, lookups)
  end

(********************************************************************)

fn
main_program (inpf : FILEref,
              outf : FILEref) : int =
  let
    (* Using a simple Scheme program, I found the following perfect
       hash for the reserved words, using the sum of the first two
       characters as the hash value. *)
    var reserved_words =
      @[String][RESERVED_WORD_HASHTAB_SIZE]
        ("if", "print", "else", "", "putc", "", "", "while", "")
    var reserved_word_tokens =
      @[token_t][RESERVED_WORD_HASHTAB_SIZE]
        (TOKEN_IF, TOKEN_PRINT, TOKEN_ELSE, TOKEN_IDENTIFIER,
         TOKEN_PUTC, TOKEN_IDENTIFIER, TOKEN_IDENTIFIER, TOKEN_WHILE,
         TOKEN_IDENTIFIER)

    var token_names =
      @[string][NUM_TOKENS]
        ("Keyword_else",
         "Keyword_if",
         "Keyword_print",
         "Keyword_putc",
         "Keyword_while",
         "Op_multiply",
         "Op_divide",
         "Op_mod",
         "Op_add",
         "Op_subtract",
         "Op_negate",
         "Op_less",
         "Op_lessequal",
         "Op_greater",
         "Op_greaterequal",
         "Op_equal",
         "Op_notequal",
         "Op_not",
         "Op_assign",
         "Op_and",
         "Op_or",
         "LeftParen",
         "RightParen",
         "LeftBrace",
         "RightBrace",
         "Semicolon",
         "Comma",
         "Identifier",
         "Integer",
         "String",
         "End_of_input")

    var lookups : lookups_vt =
      @{
        pf_toknames = view@ token_names,
        pf_wordtab = view@ reserved_words,
        pf_toktab = view@ reserved_word_tokens |
        toknames = addr@ token_names,
        wordtab = addr@ reserved_words,
        toktab = addr@ reserved_word_tokens
      }

    val inp =
      @{
        file = inpf,
        pushback = NIL,
        line_no = 1ULL,
        column_no = 1ULL
      }

    val _ = scan_text (outf, inp, lookups)

    val @{
          pf_toknames = pf_toknames,
          pf_wordtab = pf_wordtab,
          pf_toktab = pf_toktab |
          toknames = toknames,
          wordtab = wordtab,
          toktab = toktab
        } = lookups
    prval _ = view@ token_names := pf_toknames
    prval _ = view@ reserved_words := pf_wordtab
    prval _ = view@ reserved_word_tokens := pf_toktab
  in
    0
  end

macdef lex_error = "Lexical error: "

implement
main (argc, argv) =
  let
    val inpfname =
      if 2 <= argc then
        $UN.cast{string} argv[1]
      else
        "-"
    val outfname =
      if 3 <= argc then
        $UN.cast{string} argv[2]
      else
        "-"
  in
    try
      let
        val inpf =
          if (inpfname : string) = "-" then
            stdin_ref
          else
            fileref_open_exn (inpfname, file_mode_r)

        val outf =
          if (outfname : string) = "-" then
            stdout_ref
          else
            fileref_open_exn (outfname, file_mode_w)
      in
        main_program (inpf, outf)
      end
    with
    | ~ unterminated_comment (line_no, column_no) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "unterminated comment starting at ",
                   line_no, ":", column_no);
        1
      end
    | ~ unterminated_character_literal (line_no, column_no) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "unterminated character literal starting at ",
                   line_no, ":", column_no);
        1
      end
    | ~ multicharacter_literal (line_no, column_no) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "unsupported multicharacter literal starting at ",
                   line_no, ":", column_no);
        1
      end
    | ~ unterminated_string_literal (line_no, column_no,
                                     end_of_line) =>
      let
        val s =
          begin
            if end_of_line then
              "end of line"
            else
              "end of input"
          end : String
      in
        fprintln! (stderr_ref, lex_error,
                   "unterminated string literal (", s,
                   ") starting at ", line_no, ":", column_no);
        1
      end
    | ~ unsupported_escape (line_no, column_no, c) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "unsupported escape \\",
                   int2char0 c, " starting at ",
                   line_no, ":", column_no);
        1
      end
    | ~ invalid_integer_literal (line_no, column_no, s) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "invalid integer literal ", s,
                   " starting at ", line_no, ":", column_no);
        1
      end
    | ~ unexpected_character (line_no, column_no, c) =>
      begin
        fprintln! (stderr_ref, lex_error,
                   "unexpected character '", int2char0 c,
                   "' at ", line_no, ":", column_no);
        1
      end
end

(********************************************************************)
Output:
$ patscc -O2 -DATS_MEMALLOC_GCBDW -o lex lex-in-ATS.dats -lgc && ./lex compiler-tests/testcase3.t
    5    16  Keyword_print
    5    40  Op_subtract
    6    16  Keyword_putc
    6    40  Op_less
    7    16  Keyword_if
    7    40  Op_greater
    8    16  Keyword_else
    8    40  Op_lessequal
    9    16  Keyword_while
    9    40  Op_greaterequal
   10    16  LeftBrace
   10    40  Op_equal
   11    16  RightBrace
   11    40  Op_notequal
   12    16  LeftParen
   12    40  Op_and
   13    16  RightParen
   13    40  Op_or
   14    16  Op_subtract
   14    40  Semicolon
   15    16  Op_not
   15    40  Comma
   16    16  Op_multiply
   16    40  Op_assign
   17    16  Op_divide
   17    40  Integer        42
   18    16  Op_mod
   18    40  String         "String literal"
   19    16  Op_add
   19    40  Identifier     variable_name
   20    26  Integer        10
   21    26  Integer        92
   22    26  Integer        32
   23     1  End_of_input

AWK

Tested with gawk 4.1.1 and mawk 1.3.4.

BEGIN {
  all_syms["tk_EOI"    ] = "End_of_input"
  all_syms["tk_Mul"    ] = "Op_multiply"
  all_syms["tk_Div"    ] = "Op_divide"
  all_syms["tk_Mod"    ] = "Op_mod"
  all_syms["tk_Add"    ] = "Op_add"
  all_syms["tk_Sub"    ] = "Op_subtract"
  all_syms["tk_Negate" ] = "Op_negate"
  all_syms["tk_Not"    ] = "Op_not"
  all_syms["tk_Lss"    ] = "Op_less"
  all_syms["tk_Leq"    ] = "Op_lessequal"
  all_syms["tk_Gtr"    ] = "Op_greater"
  all_syms["tk_Geq"    ] = "Op_greaterequal"
  all_syms["tk_Eq"     ] = "Op_equal"
  all_syms["tk_Neq"    ] = "Op_notequal"
  all_syms["tk_Assign" ] = "Op_assign"
  all_syms["tk_And"    ] = "Op_and"
  all_syms["tk_Or"     ] = "Op_or"
  all_syms["tk_If"     ] = "Keyword_if"
  all_syms["tk_Else"   ] = "Keyword_else"
  all_syms["tk_While"  ] = "Keyword_while"
  all_syms["tk_Print"  ] = "Keyword_print"
  all_syms["tk_Putc"   ] = "Keyword_putc"
  all_syms["tk_Lparen" ] = "LeftParen"
  all_syms["tk_Rparen" ] = "RightParen"
  all_syms["tk_Lbrace" ] = "LeftBrace"
  all_syms["tk_Rbrace" ] = "RightBrace"
  all_syms["tk_Semi"   ] = "Semicolon"
  all_syms["tk_Comma"  ] = "Comma"
  all_syms["tk_Ident"  ] = "Identifier"
  all_syms["tk_Integer"] = "Integer"
  all_syms["tk_String" ] = "String"

  ## single character only symbols
  symbols["{"   ] = "tk_Lbrace"
  symbols["}"   ] = "tk_Rbrace"
  symbols["("   ] = "tk_Lparen"
  symbols[")"   ] = "tk_Rparen"
  symbols["+"   ] = "tk_Add"
  symbols["-"   ] = "tk_Sub"
  symbols["*"   ] = "tk_Mul"
  symbols["%"   ] = "tk_Mod"
  symbols[";"   ] = "tk_Semi"
  symbols[","   ] = "tk_Comma"

  key_words["if"   ] = "tk_If"
  key_words["else" ] = "tk_Else"
  key_words["print"] = "tk_Print"
  key_words["putc" ] = "tk_Putc"
  key_words["while"] = "tk_While"

  # Set up an array that emulates the ord() function.
  for(n=0;n<256;n++)
    ord[sprintf("%c",n)]=n

  input_file = "-"
  if (ARGC > 1)
    input_file = ARGV[1]
  RS=FS=""   # read complete file into one line $0
  getline < input_file
  the_ch = " " # dummy first char - but it must be a space
  the_col  = 0 # always points to the current character
  the_line = 1
  for (the_nf=1; ; ) {
    split(gettok(), t, SUBSEP)
    printf("%5s  %5s %-14s", t[2], t[3], all_syms[t[1]])
    if      (t[1] == "tk_Integer") printf("   %5s\n", t[4])
    else if (t[1] == "tk_Ident"  ) printf("  %s\n",   t[4])
    else if (t[1] == "tk_String" ) printf("  \"%s\"\n", t[4])
    else                           print("")
    if (t[1] == "tk_EOI")
      break
  }
}

#*** show error and exit
function error(line, col, msg) {
  print(line, col, msg)
  exit(1)
}

# get the next character from the input
function next_ch() {
  the_ch = $the_nf
  the_nf  ++
  the_col ++
  if (the_ch == "\n") {
    the_line ++
    the_col = 0
  }
  return the_ch
}

#*** 'x' - character constants
function char_lit(err_line, err_col) {
  n = ord[next_ch()]              # skip opening quote
  if (the_ch == "'") {
    error(err_line, err_col, "empty character constant")
  } else if (the_ch == "\\") {
    next_ch()
    if (the_ch == "n")
      n = 10
    else if (the_ch == "\\")
      n = ord["\\"]
    else
      error(err_line, err_col, "unknown escape sequence " the_ch)
  }
  if (next_ch() != "'")
    error(err_line, err_col, "multi-character constant")
  next_ch()
  return "tk_Integer" SUBSEP err_line SUBSEP err_col SUBSEP n
}

#*** process divide or comments
function div_or_cmt(err_line, err_col) {
  if (next_ch() != "*")
    return "tk_Div" SUBSEP err_line SUBSEP err_col
  # comment found
  next_ch()
  while (1) {
    if (the_ch == "*") {
      if (next_ch() == "/") {
        next_ch()
        return gettok()
      } else if (the_ch == "") {
        error(err_line, err_col, "EOF in comment")
      }
    } else {
      next_ch()
    }
  }
}

#*** "string"
function string_lit(start, err_line, err_col) {
  text = ""
  while (next_ch() != start) {
    if (the_ch == "")
      error(err_line, err_col, "EOF while scanning string literal")
    if (the_ch == "\n")
      error(err_line, err_col, "EOL while scanning string literal")
    text = text the_ch
  }
  next_ch()
  return "tk_String" SUBSEP err_line SUBSEP err_col SUBSEP text
}

#*** handle identifiers and integers
function ident_or_int(err_line, err_col) {
  is_number = 1
  text = ""
  while ((the_ch ~ /^[0-9a-zA-Z]+$/)  || (the_ch == "_")) {
    text = text the_ch
    if (! (the_ch ~ /^[0-9]+$/))
      is_number = 0
    next_ch()
  }
  if (text == "")
    error(err_line, err_col, "ident_or_int: unrecognized character: " the_ch)
  if (text ~ /^[0-9]/) {
    if (! is_number)
      error(err_line, err_col, "invalid number: " text)
    n = text + 0
    return "tk_Integer" SUBSEP err_line SUBSEP err_col SUBSEP n
  }
  if (text in key_words)
    return key_words[text] SUBSEP err_line SUBSEP err_col
  return "tk_Ident" SUBSEP err_line SUBSEP err_col SUBSEP text
}

#*** look ahead for '>=', etc.
function follow(expect, ifyes, ifno, err_line, err_col) {
  if (next_ch() == expect) {
    next_ch()
    return ifyes SUBSEP err_line SUBSEP err_col
  }
  if (ifno == tk_EOI)
    error(err_line, err_col, "follow: unrecognized character: " the_ch)
  return ifno SUBSEP err_line SUBSEP err_col
}

#*** return the next token type
function gettok() {
  while (the_ch == " " || the_ch == "\n" || the_ch == "\r")
    next_ch()
  err_line = the_line
  err_col  = the_col
  if      (the_ch == "" )    return "tk_EOI" SUBSEP err_line SUBSEP err_col
  else if (the_ch == "/")    return div_or_cmt(err_line, err_col)
  else if (the_ch == "'")    return char_lit(err_line, err_col)
  else if (the_ch == "<")    return follow("=", "tk_Leq", "tk_Lss",    err_line, err_col)
  else if (the_ch == ">")    return follow("=", "tk_Geq", "tk_Gtr",    err_line, err_col)
  else if (the_ch == "=")    return follow("=", "tk_Eq",  "tk_Assign", err_line, err_col)
  else if (the_ch == "!")    return follow("=", "tk_Neq", "tk_Not",    err_line, err_col)
  else if (the_ch == "&")    return follow("&", "tk_And", "tk_EOI",    err_line, err_col)
  else if (the_ch == "|")    return follow("|", "tk_Or",  "tk_EOI",    err_line, err_col)
  else if (the_ch =="\"")    return string_lit(the_ch, err_line, err_col)
  else if (the_ch in symbols) {
    sym = symbols[the_ch]
    next_ch()
    return sym SUBSEP err_line SUBSEP err_col
  } else {
    return ident_or_int(err_line, err_col)
  }
}
Output  —  count:

    1      1 Identifier      count
    1      7 Op_assign     
    1      9 Integer              1
    1     10 Semicolon     
    2      1 Keyword_while 
    2      7 LeftParen     
    2      8 Identifier      count
    2     14 Op_less       
    2     16 Integer             10
    2     18 RightParen    
    2     20 LeftBrace     
    3      5 Keyword_print 
    3     10 LeftParen     
    3     11 String          "count is: "
    3     23 Comma         
    3     25 Identifier      count
    3     30 Comma         
    3     32 String          "\n"
    3     36 RightParen    
    3     37 Semicolon     
    4      5 Identifier      count
    4     11 Op_assign     
    4     13 Identifier      count
    4     19 Op_add        
    4     21 Integer              1
    4     22 Semicolon     
    5      1 RightBrace    
    5      3 End_of_input 

C

Tested with gcc 4.81 and later, compiles warning free with -Wpedantic -pedantic -Wall -Wextra

#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <limits.h>

#define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

#define da_dim(name, type)  type *name = NULL;          \
                            int _qy_ ## name ## _p = 0;  \
                            int _qy_ ## name ## _max = 0
#define da_rewind(name)     _qy_ ## name ## _p = 0
#define da_redim(name)      do {if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \
                                name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]));} while (0)
#define da_append(name, x)  do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
#define da_len(name)        _qy_ ## name ## _p

typedef enum {
    tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
    tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
    tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
    tk_Ident, tk_Integer, tk_String
} TokenType;

typedef struct {
    TokenType tok;
    int err_ln, err_col;
    union {
        int n;                  /* value for constants */
        char *text;             /* text for idents */
    };
} tok_s;

static FILE *source_fp, *dest_fp;
static int line = 1, col = 0, the_ch = ' ';
da_dim(text, char);

tok_s gettok(void);

static void error(int err_line, int err_col, const char *fmt, ... ) {
    char buf[1000];
    va_list ap;

    va_start(ap, fmt);
    vsprintf(buf, fmt, ap);
    va_end(ap);
    printf("(%d,%d) error: %s\n", err_line, err_col, buf);
    exit(1);
}

static int next_ch(void) {     /* get next char from input */
    the_ch = getc(source_fp);
    ++col;
    if (the_ch == '\n') {
        ++line;
        col = 0;
    }
    return the_ch;
}

static tok_s char_lit(int n, int err_line, int err_col) {   /* 'x' */
    if (the_ch == '\'')
        error(err_line, err_col, "gettok: empty character constant");
    if (the_ch == '\\') {
        next_ch();
        if (the_ch == 'n')
            n = 10;
        else if (the_ch == '\\')
            n = '\\';
        else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch);
    }
    if (next_ch() != '\'')
        error(err_line, err_col, "multi-character constant");
    next_ch();
    return (tok_s){tk_Integer, err_line, err_col, {n}};
}

static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */
    if (the_ch != '*')
        return (tok_s){tk_Div, err_line, err_col, {0}};

    /* comment found */
    next_ch();
    for (;;) {
        if (the_ch == '*') {
            if (next_ch() == '/') {
                next_ch();
                return gettok();
            }
        } else if (the_ch == EOF)
            error(err_line, err_col, "EOF in comment");
        else
            next_ch();
    }
}

static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */
    da_rewind(text);

    while (next_ch() != start) {
        if (the_ch == '\n') error(err_line, err_col, "EOL in string");
        if (the_ch == EOF)  error(err_line, err_col, "EOF in string");
        da_append(text, (char)the_ch);
    }
    da_append(text, '\0');

    next_ch();
    return (tok_s){tk_String, err_line, err_col, {.text=text}};
}

static int kwd_cmp(const void *p1, const void *p2) {
    return strcmp(*(char **)p1, *(char **)p2);
}

static TokenType get_ident_type(const char *ident) {
    static struct {
        const char *s;
        TokenType sym;
    } kwds[] = {
        {"else",  tk_Else},
        {"if",    tk_If},
        {"print", tk_Print},
        {"putc",  tk_Putc},
        {"while", tk_While},
    }, *kwp;

    return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}

static tok_s ident_or_int(int err_line, int err_col) {
    int n, is_number = true;

    da_rewind(text);
    while (isalnum(the_ch) || the_ch == '_') {
        da_append(text, (char)the_ch);
        if (!isdigit(the_ch))
            is_number = false;
        next_ch();
    }
    if (da_len(text) == 0)
        error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch);
    da_append(text, '\0');
    if (isdigit(text[0])) {
        if (!is_number)
            error(err_line, err_col, "invalid number: %s\n", text);
        n = strtol(text, NULL, 0);
        if (n == LONG_MAX && errno == ERANGE)
            error(err_line, err_col, "Number exceeds maximum value");
        return (tok_s){tk_Integer, err_line, err_col, {n}};
    }
    return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};
}

static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) {   /* look ahead for '>=', etc. */
    if (the_ch == expect) {
        next_ch();
        return (tok_s){ifyes, err_line, err_col, {0}};
    }
    if (ifno == tk_EOI)
        error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch);
    return (tok_s){ifno, err_line, err_col, {0}};
}

tok_s gettok(void) {            /* return the token type */
    /* skip white space */
    while (isspace(the_ch))
        next_ch();
    int err_line = line;
    int err_col  = col;
    switch (the_ch) {
        case '{':  next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}};
        case '}':  next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}};
        case '(':  next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}};
        case ')':  next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}};
        case '+':  next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}};
        case '-':  next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}};
        case '*':  next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}};
        case '%':  next_ch(); return (tok_s){tk_Mod, err_line, err_col, {0}};
        case ';':  next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}};
        case ',':  next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}};
        case '/':  next_ch(); return div_or_cmt(err_line, err_col);
        case '\'': next_ch(); return char_lit(the_ch, err_line, err_col);
        case '<':  next_ch(); return follow('=', tk_Leq, tk_Lss,    err_line, err_col);
        case '>':  next_ch(); return follow('=', tk_Geq, tk_Gtr,    err_line, err_col);
        case '=':  next_ch(); return follow('=', tk_Eq,  tk_Assign, err_line, err_col);
        case '!':  next_ch(); return follow('=', tk_Neq, tk_Not,    err_line, err_col);
        case '&':  next_ch(); return follow('&', tk_And, tk_EOI,    err_line, err_col);
        case '|':  next_ch(); return follow('|', tk_Or,  tk_EOI,    err_line, err_col);
        case '"' : return string_lit(the_ch, err_line, err_col);
        default:   return ident_or_int(err_line, err_col);
        case EOF:  return (tok_s){tk_EOI, err_line, err_col, {0}};
    }
}

void run(void) {    /* tokenize the given input */
    tok_s tok;
    do {
        tok = gettok();
        fprintf(dest_fp, "%5d  %5d %.15s",
            tok.err_ln, tok.err_col,
            &"End_of_input    Op_multiply     Op_divide       Op_mod          Op_add          "
             "Op_subtract     Op_negate       Op_not          Op_less         Op_lessequal    "
             "Op_greater      Op_greaterequal Op_equal        Op_notequal     Op_assign       "
             "Op_and          Op_or           Keyword_if      Keyword_else    Keyword_while   "
             "Keyword_print   Keyword_putc    LeftParen       RightParen      LeftBrace       "
             "RightBrace      Semicolon       Comma           Identifier      Integer         "
             "String          "
            [tok.tok * 16]);
        if (tok.tok == tk_Integer)     fprintf(dest_fp, "  %4d",   tok.n);
        else if (tok.tok == tk_Ident)  fprintf(dest_fp, " %s",     tok.text);
        else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text);
        fprintf(dest_fp, "\n");
    } while (tok.tok != tk_EOI);
    if (dest_fp != stdout)
        fclose(dest_fp);
}

void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {
    if (fn[0] == '\0')
        *fp = std;
    else if ((*fp = fopen(fn, mode)) == NULL)
        error(0, 0, "Can't open %s\n", fn);
}

int main(int argc, char *argv[]) {
    init_io(&source_fp, stdin,  "r",  argc > 1 ? argv[1] : "");
    init_io(&dest_fp,   stdout, "wb", argc > 2 ? argv[2] : "");
    run();
    return 0;
}
Output  —  test case 3:

    5     16 Keyword_print
    5     40 Op_subtract
    6     16 Keyword_putc
    6     40 Op_less
    7     16 Keyword_if
    7     40 Op_greater
    8     16 Keyword_else
    8     40 Op_lessequal
    9     16 Keyword_while
    9     40 Op_greaterequal
   10     16 LeftBrace
   10     40 Op_equal
   11     16 RightBrace
   11     40 Op_notequal
   12     16 LeftParen
   12     40 Op_and
   13     16 RightParen
   13     40 Op_or
   14     16 Op_subtract
   14     40 Semicolon
   15     16 Op_not
   15     40 Comma
   16     16 Op_multiply
   16     40 Op_assign
   17     16 Op_divide
   17     40 Integer            42
   18     16 Op_mod
   18     40 String          "String literal"
   19     16 Op_add
   19     40 Identifier      variable_name
   20     26 Integer            10
   21     26 Integer            92
   22     26 Integer            32
   23      1 End_of_input

C#

Requires C#6.0 because of the use of null coalescing operators.

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;


namespace Rosetta {

    public enum TokenType {
        End_of_input, Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract,
        Op_negate, Op_not, Op_less, Op_lessequal, Op_greater, Op_greaterequal,
        Op_equal, Op_notequal, Op_assign, Op_and, Op_or, Keyword_if,
        Keyword_else, Keyword_while, Keyword_print, Keyword_putc, LeftParen, RightParen,
        LeftBrace, RightBrace, Semicolon, Comma, Identifier, Integer, String, None
    }

    /// <summary>
    /// Storage class for tokens
    /// </summary>
    public class Token {
        public TokenType Type { get; set; }
        public int Line { get; set; }
        public int Position { get; set; }
        public string Value { get; set; }
        public override string ToString() {
            if (Type == TokenType.Integer || Type == TokenType.Identifier) {
                return String.Format("{0,-5}  {1,-5}   {2,-14}     {3}", Line, Position, Type.ToString(), Value);
            } else if (Type == TokenType.String) {
                return String.Format("{0,-5}  {1,-5}   {2,-14}     \"{3}\"", Line, Position, Type.ToString(), Value.Replace("\n", "\\n"));
            }
            return String.Format("{0,-5}  {1,-5}   {2,-14}", Line, Position, Type.ToString());
        }
    }

    /// <summary>
    /// C# Example of Lexical scanner for Rosetta Compiler
    /// </summary>
    public class LexicalScanner {

        // character classes 
        private const string _letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_";        
        private const string _numbers = "0123456789";
        private const string _identifier = _letters + _numbers + "_";
        private const string _whitespace = " \t\n\r";
        
        // mappings from string keywords to token type 
        private Dictionary<string, TokenType> _keywordTokenTypeMap = new Dictionary<string, TokenType>() {
            { "if", TokenType.Keyword_if },
            { "else", TokenType.Keyword_else },
            { "while", TokenType.Keyword_while },
            { "print", TokenType.Keyword_print },
            { "putc", TokenType.Keyword_putc }
        };

        // mappings from simple operators to token type
        private Dictionary<string, TokenType> _operatorTokenTypeMap = new Dictionary<string, TokenType>() {
            { "+", TokenType.Op_add },
            { "-", TokenType.Op_subtract },
            { "*", TokenType.Op_multiply },
            { "/", TokenType.Op_divide },
            { "%", TokenType.Op_mod },
            { "=", TokenType.Op_assign },
            { "<", TokenType.Op_less },
            { ">", TokenType.Op_greater },
            { "!", TokenType.Op_not },
        };

        private List<string> _keywords;
        private string _operators = "+-*/%=<>!%";

        private string _code;
        private List<Token> tokens = new List<Token>();

        private int _line = 1;
        private int _position = 1;

        public string CurrentCharacter {
            get {
                try {
                    return _code.Substring(0, 1);
                } catch (ArgumentOutOfRangeException) {
                    return "";
                }
            }
        }

        /// <summary>
        /// Lexical scanner initialiser
        /// </summary>
        /// <param name="code">Code to be tokenised</param>
        public LexicalScanner (string code) {
            _code = code;
            _keywords = _keywordTokenTypeMap.Keys.ToList();
        }

        /// <summary>
        /// Advance the cursor forward given number of characters
        /// </summary>
        /// <param name="characters">Number of characters to advance</param>
        private void advance(int characters=1) {
            try {
                // reset position when there is a newline
                if (CurrentCharacter == "\n") {
                    _position = 0;
                    _line++;
                }
                    
                _code = _code.Substring(characters, _code.Length - characters);
                _position += characters;
            } catch (ArgumentOutOfRangeException) {
                _code = "";
            }
        }

        /// <summary>
        /// Outputs error message to the console and exits 
        /// </summary>
        /// <param name="message">Error message to display to user</param>
        /// <param name="line">Line error occurred on</param>
        /// <param name="position">Line column that the error occurred at</param>
        public void error(string message, int line, int position) {
            // output error to the console and exit
            Console.WriteLine(String.Format("{0} @ {1}:{2}", message, line, position));
            Environment.Exit(1);
        }

        /// <summary>
        /// Pattern matching using first & follow matching
        /// </summary>
        /// <param name="recogniseClass">String of characters that identifies the token type
        /// or the exact match the be made if exact:true</param>
        /// <param name="matchClass">String of characters to match against remaining target characters</param>
        /// <param name="tokenType">Type of token the match represents.</param>
        /// <param name="notNextClass">Optional class of characters that cannot follow the match</param>
        /// <param name="maxLen">Optional maximum length of token value</param>
        /// <param name="exact">Denotes whether recogniseClass represents an exact match or class match. 
        /// Default: false</param>
        /// <param name="discard">Denotes whether the token is kept or discarded. Default: false</param>
        /// <param name="offset">Optiona line position offset to account for discarded tokens</param>
        /// <returns>Boolean indicating if a match was made </returns>
        public bool match(string recogniseClass, string matchClass, TokenType tokenType, 
                          string notNextClass=null, int maxLen=Int32.MaxValue, bool exact=false, 
                          bool discard=false, int offset=0) {

            // if we've hit the end of the file, there's no more matching to be done
            if (CurrentCharacter == "")
                return false;

            // store _current_ line and position so that our vectors point at the start
            // of each token
            int line = _line;
            int position = _position;

            // special case exact tokens to avoid needing to worry about backtracking
            if (exact) {
                if (_code.StartsWith(recogniseClass)) {
                    if (!discard) 
                        tokens.Add(new Token() { Type = tokenType, Value = recogniseClass, Line = line, Position = position - offset});   
                    advance(recogniseClass.Length);
                    return true;
                }
                return false;
            }

            // first match - denotes the token type usually
            if (!recogniseClass.Contains(CurrentCharacter))
                return false;

            string tokenValue = CurrentCharacter;
            advance();

            // follow match while we haven't exceeded maxLen and there are still characters
            // in the code stream
            while ((matchClass ?? "").Contains(CurrentCharacter) && tokenValue.Length <= maxLen && CurrentCharacter != "") {
                tokenValue += CurrentCharacter;
                advance();
            }

            // ensure that any incompatible characters are not next to the token
            // eg 42fred is invalid, and neither recognized as a number nor an identifier.
            // _letters would be the notNextClass
            if (notNextClass != null && notNextClass.Contains(CurrentCharacter))
                error("Unrecognised character: " + CurrentCharacter, _line, _position);

            // only add tokens to the stack that aren't marked as discard - dont want
            // things like open and close quotes/comments
            if (!discard) {
                Token token = new Token() { Type = tokenType, Value = tokenValue, Line = line, Position = position - offset };
                tokens.Add(token);
            }

            return true;
        }

        /// <summary>
        /// Tokenise the input code 
        /// </summary>
        /// <returns>List of Tokens</returns>
        public List<Token> scan() {

            while (CurrentCharacter != "") {
                // match whitespace
                match(_whitespace, _whitespace, TokenType.None, discard: true);

                // match integers
                match(_numbers, _numbers, TokenType.Integer, notNextClass:_letters);
                
                // match identifiers and keywords
                if (match(_letters, _identifier, TokenType.Identifier)) {
                    Token match = tokens.Last();
                    if (_keywords.Contains(match.Value))
                        match.Type = _keywordTokenTypeMap[match.Value];
                }

                // match string similarly to comments without allowing newlines
                // this token doesn't get discarded though
                if (match("\"", null, TokenType.String, discard:true)) {
                    string value = "";
                    int position = _position;
                    while (!match("\"", null, TokenType.String, discard:true)) {
                        // not allowed newlines in strings
                        if (CurrentCharacter == "\n")
                            error("End-of-line while scanning string literal. Closing string character not found before end-of-line", _line, _position);
                        // end of file reached before finding end of string
                        if (CurrentCharacter == "")
                            error("End-of-file while scanning string literal. Closing string character not found", _line, _position);

                        value += CurrentCharacter;

                        // deal with escape sequences - we only accept newline (\n)
                        if (value.Length >= 2) {
                            string lastCharacters = value.Substring(value.Length - 2, 2);
                            if (lastCharacters[0] == '\\') {
                                if (lastCharacters[1] != 'n') {
                                    error("Unknown escape sequence. ", _line, position);
                                }
                                value = value.Substring(0, value.Length - 2).ToString() + "\n";
                            }
                        }

                        advance();
                    }
                    tokens.Add(new Token() { Type = TokenType.String, Value = value, Line = _line, Position = position - 1});
                }

                // match string literals
                if (match("'", null, TokenType.Integer, discard:true)) {
                    int value;
                    int position = _position;
                    value = CurrentCharacter.ToCharArray()[0];
                    advance();

                    // deal with empty literals ''
                    if (value == '\'')
                        error("Empty character literal", _line, _position);

                    // deal with escaped characters, only need to worry about \n and \\
                    // throw werror on any other
                    if (value == '\\') {
                        if (CurrentCharacter == "n") {
                            value = '\n';
                        } else if (CurrentCharacter == "\\") {
                            value = '\\';
                        } else {
                            error("Unknown escape sequence. ", _line, _position - 1);
                        }
                        advance();
                    }

                    // if we haven't hit a closing ' here, there are two many characters
                    // in the literal
                    if (!match("'", null, TokenType.Integer, discard: true))
                        error("Multi-character constant", _line, _position);

                    tokens.Add(new Rosetta.Token() { Type = TokenType.Integer, Value = value.ToString(), Line = _line, Position = position - 1 });
                }

                // match comments by checking for starting token, then advancing 
                // until closing token is matched
                if (match("/*", null, TokenType.None, exact: true, discard: true)) {
                    while (!match("*/", null, TokenType.None, exact: true, discard: true)) {
                        // reached the end of the file without closing comment!
                        if (CurrentCharacter == "")
                            error("End-of-file in comment. Closing comment characters not found.", _line, _position);
                        advance();
                    }
                    continue;
                }

                // match complex operators
                match("<=", null, TokenType.Op_lessequal, exact: true);
                match(">=", null, TokenType.Op_greaterequal, exact: true);
                match("==", null, TokenType.Op_equal, exact: true);
                match("!=", null, TokenType.Op_notequal, exact: true);
                match("&&", null, TokenType.Op_and, exact: true);
                match("||", null, TokenType.Op_or, exact: true);

                // match simple operators
                if (match(_operators, null, TokenType.None, maxLen:1)) {
                    Token match = tokens.Last();
                    match.Type = _operatorTokenTypeMap[match.Value];
                }

                // brackets, braces and separators
                match("(", null, TokenType.LeftParen, exact: true);
                match(")", null, TokenType.RightParen, exact: true);
                match("{", null, TokenType.LeftBrace, exact: true);
                match("}", null, TokenType.RightBrace, exact: true);
                match(";", null, TokenType.Semicolon, exact: true);
                match(",", null, TokenType.Comma, exact: true);

            }

            // end of file token
            tokens.Add(new Rosetta.Token() { Type = TokenType.End_of_input, Line = _line, Position = _position });
            
            return tokens;
        }

        static void Main (string[] args) {
            StreamReader inputFile;

            // if we passed in a filename, read code from that, else
            // read code from stdin
            if (args.Length > 0) {
                string path = args[0];
                try {
                    inputFile = new StreamReader(path);
                } catch (IOException) {
                    inputFile = new StreamReader(Console.OpenStandardInput(8192));
                }
            } else {
                inputFile = new StreamReader(Console.OpenStandardInput(8192));
            }

            string code = inputFile.ReadToEnd();

            // strip windows line endings out
            code = code.Replace("\r", "");

            LexicalScanner scanner = new LexicalScanner(code);
            List<Token> tokens = scanner.scan();

            foreach(Token token in tokens) {
                Console.WriteLine(token.ToString());
            }       
        }
    }
}
Output  —  test case 3:

5      16      Keyword_print
5      40      Op_subtract
6      16      Keyword_putc
6      40      Op_less
7      16      Keyword_if
7      40      Op_greater
8      16      Keyword_else
8      40      Op_lessequal
9      16      Keyword_while
9      40      Op_greaterequal
10     16      LeftBrace
10     40      Op_equal
11     16      RightBrace
11     40      Op_notequal
12     16      LeftParen
12     40      Op_and
13     16      RightParen
13     40      Op_or
14     16      Op_subtract
14     40      Semicolon
15     16      Op_not
15     40      Comma
16     16      Op_multiply
16     40      Op_assign
17     16      Op_divide
17     40      Integer            42
18     16      Op_mod
18     40      String             "String literal"
19     16      Op_add
19     40      Identifier         variable_name
20     26      Integer            10
21     26      Integer            92
22     26      Integer            32
23     1       End_of_input

C++

Tested with GCC 9.3.0 (g++ -std=c++17)

#include <charconv>      // std::from_chars
#include <fstream>       // file_to_string, string_to_file
#include <functional>    // std::invoke
#include <iomanip>       // std::setw
#include <ios>           // std::left
#include <iostream>
#include <map>           // keywords
#include <sstream>
#include <string>
#include <utility>       // std::forward
#include <variant>       // TokenVal

using namespace std;

// =====================================================================================================================
// Machinery
// =====================================================================================================================
string file_to_string (const string& path)
{
    // Open file
    ifstream file {path, ios::in | ios::binary | ios::ate};
    if (!file)   throw (errno);

    // Allocate string memory
    string contents;
    contents.resize(file.tellg());

    // Read file contents into string
    file.seekg(0);
    file.read(contents.data(), contents.size());

    return contents;
}

void string_to_file (const string& path, string contents)
{
    ofstream file {path, ios::out | ios::binary};
    if (!file)    throw (errno);

    file.write(contents.data(), contents.size());
}

template <class F>
void with_IO (string source, string destination, F&& f)
{
    string input;

    if (source == "stdin")    getline(cin, input);
    else                      input = file_to_string(source);

    string output = invoke(forward<F>(f), input);

    if (destination == "stdout")    cout << output;
    else                            string_to_file(destination, output);
}

// Add escaped newlines and backslashes back in for printing
string sanitize (string s)
{
    for (auto i = 0u; i < s.size(); ++i)
    {
        if      (s[i] == '\n')    s.replace(i++, 1, "\\n");
        else if (s[i] == '\\')    s.replace(i++, 1, "\\\\");
    }

    return s;
}

class Scanner
{
public:
    const char* pos;
    int         line   = 1;
    int         column = 1;

    Scanner (const char* source) : pos {source} {}

    inline char peek ()    { return *pos; }

    void advance ()
    {
        if (*pos == '\n')    { ++line; column = 1; }
        else                 ++column;

        ++pos;
    }

    char next ()
    {
        advance();
        return peek();
    }

    void skip_whitespace ()
    {
        while (isspace(static_cast<unsigned char>(peek())))
            advance();
    }
}; // class Scanner


// =====================================================================================================================
// Tokens
// =====================================================================================================================
enum class TokenName
{
    OP_MULTIPLY, OP_DIVIDE, OP_MOD, OP_ADD, OP_SUBTRACT, OP_NEGATE,
    OP_LESS, OP_LESSEQUAL, OP_GREATER, OP_GREATEREQUAL, OP_EQUAL, OP_NOTEQUAL,
    OP_NOT, OP_ASSIGN, OP_AND, OP_OR,
    LEFTPAREN, RIGHTPAREN, LEFTBRACE, RIGHTBRACE, SEMICOLON, COMMA,
    KEYWORD_IF, KEYWORD_ELSE, KEYWORD_WHILE, KEYWORD_PRINT, KEYWORD_PUTC,
    IDENTIFIER, INTEGER, STRING,
    END_OF_INPUT, ERROR
};

using TokenVal = variant<int, string>;

struct Token
{
    TokenName name;
    TokenVal  value;
    int       line;
    int       column;
};


const char* to_cstring (TokenName name)
{
    static const char* s[] =
    {
        "Op_multiply", "Op_divide", "Op_mod", "Op_add", "Op_subtract", "Op_negate",
        "Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal", "Op_equal", "Op_notequal",
        "Op_not", "Op_assign", "Op_and", "Op_or",
        "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
        "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
        "Identifier", "Integer", "String",
        "End_of_input", "Error"
    };

    return s[static_cast<int>(name)];
}


string to_string (Token t)
{
    ostringstream out;
    out << setw(2) << t.line << "   " << setw(2) << t.column  << "   ";

    switch (t.name)
    {
        case (TokenName::IDENTIFIER)   : out << "Identifier        "   << get<string>(t.value);                  break;
        case (TokenName::INTEGER)      : out << "Integer           "   << left << get<int>(t.value);             break;
        case (TokenName::STRING)       : out << "String            \"" << sanitize(get<string>(t.value)) << '"'; break;
        case (TokenName::END_OF_INPUT) : out << "End_of_input";                                                  break;
        case (TokenName::ERROR)        : out << "Error             "   << get<string>(t.value);                  break;
        default                        : out << to_cstring(t.name);
    }

    out << '\n';

    return out.str();
}


// =====================================================================================================================
// Lexer
// =====================================================================================================================
class Lexer
{
public:
    Lexer (const char* source) : s {source}, pre_state {s} {}

    bool has_more ()    { return s.peek() != '\0'; }

    Token next_token ()
    {
        s.skip_whitespace();

        pre_state = s;

        switch (s.peek())
        {
            case '*'  :    return simply(TokenName::OP_MULTIPLY);
            case '%'  :    return simply(TokenName::OP_MOD);
            case '+'  :    return simply(TokenName::OP_ADD);
            case '-'  :    return simply(TokenName::OP_SUBTRACT);
            case '{'  :    return simply(TokenName::LEFTBRACE);
            case '}'  :    return simply(TokenName::RIGHTBRACE);
            case '('  :    return simply(TokenName::LEFTPAREN);
            case ')'  :    return simply(TokenName::RIGHTPAREN);
            case ';'  :    return simply(TokenName::SEMICOLON);
            case ','  :    return simply(TokenName::COMMA);
            case '&'  :    return expect('&', TokenName::OP_AND);
            case '|'  :    return expect('|', TokenName::OP_OR);
            case '<'  :    return follow('=', TokenName::OP_LESSEQUAL,    TokenName::OP_LESS);
            case '>'  :    return follow('=', TokenName::OP_GREATEREQUAL, TokenName::OP_GREATER);
            case '='  :    return follow('=', TokenName::OP_EQUAL,        TokenName::OP_ASSIGN);
            case '!'  :    return follow('=', TokenName::OP_NOTEQUAL,     TokenName::OP_NOT);
            case '/'  :    return divide_or_comment();
            case '\'' :    return char_lit();
            case '"'  :    return string_lit();

            default   :    if (is_id_start(s.peek()))    return identifier();
                           if (is_digit(s.peek()))       return integer_lit();
                           return error("Unrecognized character '", s.peek(), "'");

            case '\0' :    return make_token(TokenName::END_OF_INPUT);
        }
    }


private:
    Scanner s;
    Scanner pre_state;
    static const map<string, TokenName> keywords;


    template <class... Args>
    Token error (Args&&... ostream_args)
    {
        string code {pre_state.pos, (string::size_type) s.column - pre_state.column};

        ostringstream msg;
        (msg << ... << forward<Args>(ostream_args)) << '\n'
            << string(28, ' ') << "(" << s.line << ", " << s.column << "): " << code;

        if (s.peek() != '\0')    s.advance();

        return make_token(TokenName::ERROR, msg.str());
    }


    inline Token make_token (TokenName name, TokenVal value = 0)
    {
        return {name, value, pre_state.line, pre_state.column};
    }


    Token simply (TokenName name)
    {
        s.advance();
        return make_token(name);
    }


    Token expect (char expected, TokenName name)
    {
        if (s.next() == expected)    return simply(name);
        else                         return error("Unrecognized character '", s.peek(), "'");
    }


    Token follow (char expected, TokenName ifyes, TokenName ifno)
    {
        if (s.next() == expected)    return simply(ifyes);
        else                         return make_token(ifno);
    }


    Token divide_or_comment ()
    {
        if (s.next() != '*')    return make_token(TokenName::OP_DIVIDE);

        while (s.next() != '\0')
        {
            if (s.peek() == '*' && s.next() == '/')
            {
                s.advance();
                return next_token();
            }
        }

        return error("End-of-file in comment. Closing comment characters not found.");
    }


    Token char_lit ()
    {
        int n = s.next();

        if (n == '\'')    return error("Empty character constant");

        if (n == '\\')    switch (s.next())
                          {
                              case 'n'  :    n = '\n'; break;
                              case '\\' :    n = '\\'; break;
                              default   :    return error("Unknown escape sequence \\", s.peek());
                          }

        if (s.next() != '\'')    return error("Multi-character constant");

        s.advance();
        return make_token(TokenName::INTEGER, n);
    }


    Token string_lit ()
    {
        string text = "";

        while (s.next() != '"')
            switch (s.peek())
            {
                case '\\' :    switch (s.next())
                               {
                                   case 'n'  :    text += '\n'; continue;
                                   case '\\' :    text += '\\'; continue;
                                   default   :    return error("Unknown escape sequence \\", s.peek());
                               }

                case '\n' :    return error("End-of-line while scanning string literal."
                                            " Closing string character not found before end-of-line.");

                case '\0' :    return error("End-of-file while scanning string literal."
                                            " Closing string character not found.");

                default   :    text += s.peek();
            }

        s.advance();
        return make_token(TokenName::STRING, text);
    }


    static inline bool is_id_start (char c)    { return isalpha(static_cast<unsigned char>(c)) || c == '_'; }
    static inline bool is_id_end   (char c)    { return isalnum(static_cast<unsigned char>(c)) || c == '_'; }
    static inline bool is_digit    (char c)    { return isdigit(static_cast<unsigned char>(c));             }


    Token identifier ()
    {
        string text (1, s.peek());

        while (is_id_end(s.next()))    text += s.peek();

        auto i = keywords.find(text);
        if (i != keywords.end())    return make_token(i->second);

        return make_token(TokenName::IDENTIFIER, text);
    }


    Token integer_lit ()
    {
        while (is_digit(s.next()));

        if (is_id_start(s.peek()))
            return error("Invalid number. Starts like a number, but ends in non-numeric characters.");

        int n;

        auto r = from_chars(pre_state.pos, s.pos, n);
        if (r.ec == errc::result_out_of_range)    return error("Number exceeds maximum value");

        return make_token(TokenName::INTEGER, n);
    }
}; // class Lexer


const map<string, TokenName> Lexer::keywords =
{
    {"else",  TokenName::KEYWORD_ELSE},
    {"if",    TokenName::KEYWORD_IF},
    {"print", TokenName::KEYWORD_PRINT},
    {"putc",  TokenName::KEYWORD_PUTC},
    {"while", TokenName::KEYWORD_WHILE}
};


int main (int argc, char* argv[])
{
    string in  = (argc > 1) ? argv[1] : "stdin";
    string out = (argc > 2) ? argv[2] : "stdout";

    with_IO(in, out, [](string input)
    {
        Lexer lexer {input.data()};

        string s = "Location  Token name        Value\n"
                   "--------------------------------------\n";

        while (lexer.has_more())    s += to_string(lexer.next_token());
        return s;
    });
}
Output  —  test case 3:
Location  Token name        Value
--------------------------------------
 5   16   Keyword_print
 5   40   Op_subtract
 6   16   Keyword_putc
 6   40   Op_less
 7   16   Keyword_if
 7   40   Op_greater
 8   16   Keyword_else
 8   40   Op_lessequal
 9   16   Keyword_while
 9   40   Op_greaterequal
10   16   LeftBrace
10   40   Op_equal
11   16   RightBrace
11   40   Op_notequal
12   16   LeftParen
12   40   Op_and
13   16   RightParen
13   40   Op_or
14   16   Op_subtract
14   40   Semicolon
15   16   Op_not
15   40   Comma
16   16   Op_multiply
16   40   Op_assign
17   16   Op_divide
17   40   Integer           42
18   16   Op_mod
18   40   String            "String literal"
19   16   Op_add
19   40   Identifier        variable_name
20   26   Integer           10
21   26   Integer           92
22   26   Integer           32
23    1   End_of_input

COBOL

Using GnuCOBOL 2. By Steve Williams (with one change to get around a Rosetta Code code highlighter problem).

        >>SOURCE FORMAT IS FREE
*> this code is dedicated to the public domain
*> (GnuCOBOL) 2.3-dev.0
identification division.
program-id. lexer.
environment division.
configuration section.
repository. function all intrinsic.
input-output section.
file-control.
    select input-file assign using input-name
        status input-status
        organization line sequential.
data division.

file section.
fd  input-file.
01  input-record pic x(98).

working-storage section.
01  input-name pic x(32).
01  input-status pic xx.
01  input-length pic 99.

01  output-name pic x(32) value spaces.
01  output-status pic xx.
01  output-record pic x(64).

01  line-no pic 999 value 0.
01  col-no pic 99.
01  col-no-max pic 99.
01  col-increment pic 9 value 1.
01  start-col pic 99.
01  outx pic 99.
01  out-lim pic 99 value 48.

01  output-line value spaces.
    03  out-line pic zzzz9.
    03  out-column pic zzzzzz9.
    03  message-area.
        05  filler pic xxx.
        05  token pic x(16).
        05  out-value pic x(48).
        05  out-integer redefines out-value pic zzzzz9.
        05  out-integer1 redefines out-value pic zzzzzz9. *> to match the python lexer

01  error-record.
    03  error-line pic zzzz9 value 0.
    03  error-col pic zzzzzz9 value 0.
    03  error-message pic x(68) value spaces.

01  scan-state pic x(16) value spaces.
01  current-character pic x.
01  previous-character pic x.

procedure division chaining input-name.
start-lexer.
    if input-name <> spaces
        open input input-file
        if input-status = '35'
            string 'in lexer ' trim(input-name) ' not found' into error-message
            perform report-error
        end-if
    end-if
    perform read-input-file
    perform until input-status <> '00'
        add 1 to line-no
        move line-no to out-line
        move length(trim(input-record,trailing)) to col-no-max
        move 1 to col-no
        move space to previous-character
        perform until col-no > col-no-max
            move col-no to out-column
            move input-record(col-no:1) to current-character
            evaluate scan-state

            when 'identifier'
                if current-character >= 'A' and <= 'Z'
                or (current-character >= 'a' and <= 'z')
                or (current-character >= '0' and <= '9')
                or current-character = '_'
                    perform increment-outx
                    move current-character to out-value(outx:1)
                    if col-no = col-no-max
                        perform process-identifier
                    end-if
                else
                    perform process-identifier
                    if current-character <> space
                        move 0 to col-increment
                    end-if
                end-if

            when 'integer'
                evaluate true
                when current-character >= '0' and <= '9'
                    perform increment-outx
                    move current-character to out-value(outx:1)
                    if col-no = col-no-max
                        move numval(out-value) to out-integer
                        move 'Integer' to token
                    end-if
                when current-character >= 'A' and <= 'Z'
                when current-character >= 'a' and <= 'z'
                    move 'in lexer invalid integer' to error-message
                    perform report-error
                when other
                    if outx > 5
                        move numval(out-value) to out-integer1 *> to match the python lexer
                    else
                        move numval(out-value) to out-integer
                    end-if
                    move 'Integer' to token
                    if current-character <> space
                        move 0 to col-increment
                    end-if
                end-evaluate
                
            when 'comment'
                if previous-character = '*' and current-character = '/' 
                    move 'comment' to token
                end-if

            when 'quote'
                evaluate current-character also outx
                when '"' also 0
                    string 'in lexer empty string' into error-message
                    perform report-error
                when '"' also any
                    perform increment-outx
                    move current-character to out-value(outx:1)
                    move 'String' to token
                when other
                    if col-no = col-no-max
                        string 'in lexer missing close quote' into error-message
                        perform report-error
                    else
                        perform increment-outx
                        move current-character to out-value(outx:1)
                    end-if
                end-evaluate

            when 'character'
                evaluate current-character also outx
                when "'" also 0
                    string 'in lexer empty character constant' into error-message
                    perform report-error
                when "'" also 1
                    subtract 1 from ord(out-value(1:1)) giving out-integer
                    move 'Integer' to token
                when "'" also 2
                    evaluate true 
                    when out-value(1:2) = '\n'
                        move 10 to out-integer
                    when out-value(1:2) = '\\'
                        subtract 1 from ord('\') giving out-integer      *> ' (workaround a Rosetta Code highlighter problem)
                    when other
                        string 'in lexer unknown escape sequence ' out-value(1:2)
                            into error-message
                        perform report-error
                    end-evaluate
                    move 'Integer' to token
                when "'" also any
                    string 'in lexer multicharacter constant' into error-message
                    perform report-error
                when other
                    if col-no = col-no-max
                        string 'in lexer missing close quote' into error-message
                        perform report-error
                    end-if
                    perform increment-outx
                    move current-character to out-value(outx:1)
                end-evaluate

            when 'and'
                evaluate previous-character also current-character
                when '&' also '&'
                    move 'Op_and' to token
                when other
                    string 'in lexer AND error' into error-message
                    perform report-error
                end-evaluate

            when 'or'
                evaluate previous-character also current-character
                when '|' also '|'
                    move 'Op_or' to token
                when other
                    string 'in lexer OR error' into error-message
                    perform report-error
                end-evaluate

            when 'ambiguous'
                evaluate previous-character also current-character
                when '/' also '*'
                    move 'comment' to scan-state
                    subtract 1 from col-no giving start-col
                when '/' also any
                    move 'Op_divide' to token
                    move 0 to col-increment

                when '=' also '='
                    move 'Op_equal' to token
                when '=' also any
                    move 'Op_assign' to token
                    move 0 to col-increment

                when '<' also '='
                    move 'Op_lessequal' to token
                when '<' also any
                    move 'Op_less' to token 
                    move 0 to col-increment

                when '>' also '='
                    move 'Op_greaterequal' to token
                when '>'also any
                    move 'Op_greater' to token
                    move 0 to col-increment

                when '!' also '='
                    move 'Op_notequal' to token
                when '!' also any
                    move 'Op_not' to token
                    move 0 to col-increment

                when other
                    display input-record
                    string 'in lexer ' trim(scan-state)
                        ' unknown character "' current-character '"'
                        ' with previous character "' previous-character '"'
                        into error-message
                    perform report-error
                end-evaluate

            when other
                move col-no to start-col
                evaluate current-character
                when space
                    continue
                when >= 'A' and <= 'Z'
                when >= 'a' and <= 'z'
                    move 'identifier' to scan-state
                    move 1 to outx
                    move current-character to out-value
                when >= '0' and <= '9'
                    move 'integer' to scan-state
                    move 1 to outx
                    move current-character to out-value
                when '&'
                    move 'and' to scan-state
                when '|'
                    move 'or' to scan-state
                when '"'
                    move 'quote' to scan-state
                    move 1 to outx
                    move current-character to out-value
                when "'"
                    move 'character' to scan-state
                    move 0 to outx
                when '{'
                    move 'LeftBrace' to token
                when '}'
                    move 'RightBrace' to token
                when '('
                    move 'LeftParen' to token
                when ')'
                    move 'RightParen' to token
                when '+'
                    move 'Op_add' to token
                when '-'
                    move 'Op_subtract' to token
                when '*'
                    move 'Op_multiply' to token
                when '%'
                    move 'Op_mod' to token
                when ';'
                    move 'Semicolon' to token
                when ','
                    move 'Comma' to token
                when '/'
                when '<'
                when '>'
                when '='
                when '='
                when '<'
                when '>'
                when '!'
                    move 'ambiguous' to scan-state
                when other
                    string 'in lexer unknown character "' current-character '"'
                        into error-message
                    perform report-error
                end-evaluate
            end-evaluate

            if token <> spaces
                perform process-token
            end-if

            move current-character to previous-character
            add col-increment to col-no
            move 1 to col-increment
        end-perform
        if scan-state = 'ambiguous'
            evaluate previous-character
            when '/'
                move 'Op_divide' to token
                perform process-token

            when '='
                move 'Op_assign' to token
                perform process-token

            when '<'
                move 'Op_less' to token 
                perform process-token

            when '>'
                move 'Op_greater' to token
                perform process-token

            when '!'
                move 'Op_not' to token
                perform process-token

            when other
                string 'in lexer unresolved ambiguous
                    "' previous-character '" at end of line'
                into error-message
                perform report-error
            end-evaluate
        end-if
        perform read-input-file
    end-perform

    evaluate true
    when input-status <> '10'
        string 'in lexer ' trim(input-name) ' invalid input status ' input-status
            into error-message
        perform report-error
    when scan-state = 'comment'
        string 'in lexer unclosed comment at end of input' into error-message
        perform report-error
     end-evaluate
    
    move 'End_of_input' to token
    move 1 to out-column
    move 1 to start-col
    add 1 to line-no
    perform process-token

    close input-file
    stop run
    .
process-identifier.
    evaluate true
    when out-value = 'print'
        move 'Keyword_print' to token
        move spaces to out-value
    when out-value = 'while'
        move 'Keyword_while' to token
        move spaces to out-value
    when out-value = 'if'
        move 'Keyword_if' to token
        move spaces to out-value
    when out-value = 'else'
        move 'Keyword_else' to token
        move spaces to out-value
    when out-value = 'putc'
        move 'Keyword_putc' to token
        move spaces to out-value
    when other
        move 'Identifier' to token
    end-evaluate
    .
increment-outx.
    if outx >= out-lim
        string 'in lexer token value length exceeds ' out-lim into error-message
        perform report-error
    end-if
    add 1 to outx
    .
process-token.
    if token <> 'comment'
        move start-col to out-column
        move line-no to out-line
        display output-line
    end-if
    move 0 to start-col
    move spaces to scan-state message-area
    .
report-error.
    move line-no to error-line
    move start-col to error-col
    display error-record
    close input-file
    stop run with error status -1
    .
read-input-file.
    if input-name = spaces
        move '00' to input-status
        accept input-record on exception move '10' to input-status end-accept
    else
        read input-file
    end-if
    .
end program lexer.
Output  —  test case 3:
prompt$ ./lexer <testcase3
    5     16   Keyword_print
    5     40   Op_subtract
    6     16   Keyword_putc
    6     40   Op_less
    7     16   Keyword_if
    7     40   Op_greater
    8     16   Keyword_else
    8     40   Op_lessequal
    9     16   Keyword_while
    9     40   Op_greaterequal
   10     16   LeftBrace
   10     40   Op_equal
   11     16   RightBrace
   11     40   Op_notequal
   12     16   LeftParen
   12     40   Op_and
   13     16   RightParen
   13     40   Op_or
   14     16   Op_subtract
   14     40   Semicolon
   15     16   Op_not
   15     40   Comma
   16     16   Op_multiply
   16     40   Op_assign
   17     16   Op_divide
   17     40   Integer             42
   18     16   Op_mod
   18     40   String          "String literal"
   19     16   Op_add
   19     40   Identifier      variable_name
   20     26   Integer             10
   21     26   Integer             92
   22     26   Integer             32
   23      1   End_of_input

Common Lisp

Lisp has a built-in reader and you can customize the reader by modifying its readtable. I'm also using the Gray stream, which is an almost standard feature of Common Lisp, for counting lines and columns.

(defpackage #:lexical-analyzer
  (:use #:cl #:sb-gray)
  (:export #:main))

(in-package #:lexical-analyzer)

(defconstant +lex-symbols-package+ (or (find-package :lex-symbols)
                                       (make-package :lex-symbols)))

(defclass counting-character-input-stream (fundamental-character-input-stream)
  ((stream :type stream :initarg :stream :reader stream-of)
   (line :type fixnum :initform 1 :accessor line-of)
   (column :type fixnum :initform 0 :accessor column-of)
   (prev-column :type (or null fixnum) :initform nil :accessor prev-column-of))
  (:documentation "Character input stream that counts lines and columns."))

(defmethod stream-read-char ((stream counting-character-input-stream))
  (let ((ch (read-char (stream-of stream) nil :eof)))
    (case ch
      (#\Newline
       (incf (line-of stream))
       (setf (prev-column-of stream) (column-of stream)
             (column-of stream) 0))
      (t
       (incf (column-of stream))))
    ch))

(defmethod stream-unread-char ((stream counting-character-input-stream) char)
  (unread-char char (stream-of stream))
  (case char
      (#\Newline
       (decf (line-of stream))
       (setf (column-of stream) (prev-column-of stream)))
      (t
       (decf (column-of stream)))))

(defstruct token
  (name nil :type symbol)
  (value nil :type t)
  (line nil :type fixnum)
  (column nil :type fixnum))

(defun lexer-error (format-control &rest args)
  (apply #'error format-control args))

(defun handle-divide-or-comment (stream char)
  (declare (ignore char))
  (case (peek-char nil stream t nil t)
    (#\* (loop with may-end = nil
                 initially (read-char stream t nil t)
               for ch = (read-char stream t nil t)
               until (and may-end (char= ch #\/))
               do (setf may-end (char= ch #\*))
               finally (return (read stream t nil t))))
    (t (make-token :name :op-divide :line (line-of stream) :column (column-of stream)))))

(defun make-constant-handler (token-name)
  (lambda (stream char)
    (declare (ignore char))
    (make-token :name token-name :line (line-of stream) :column (column-of stream))))

(defun make-this-or-that-handler (expect then &optional else)
  (lambda (stream char)
    (declare (ignore char))
    (let ((line (line-of stream))
          (column (column-of stream))
          (next (peek-char nil stream nil nil t)))
      (cond ((and expect (char= next expect))
             (read-char stream nil nil t)
             (make-token :name then :line line :column column))
            (else
             (make-token :name else :line line :column column))
            (t
             (lexer-error "Unrecognized character '~A'" next))))))

(defun identifier? (symbol)
  (and (symbolp symbol)
       (not (keywordp symbol))
       (let ((name (symbol-name symbol)))
         (and (find (char name 0) "_abcdefghijklmnopqrstuvwxyz" :test #'char-equal)
              (or (< (length name) 2)
                  (not (find-if-not (lambda (ch)
                                      (find ch "_abcdefghijklmnopqrstuvwxyz0123456789"
                                            :test #'char-equal))
                                    name :start 1)))))))

(defun id->keyword (id line column)
  (case id
    (lex-symbols::|if|    (make-token :name :keyword-if :line line :column column))
    (lex-symbols::|else|  (make-token :name :keyword-else :line line :column column))
    (lex-symbols::|while| (make-token :name :keyword-while :line line :column column))
    (lex-symbols::|print| (make-token :name :keyword-print :line line :column column))
    (lex-symbols::|putc|  (make-token :name :keyword-putc :line line :column column))
    (t nil)))

(defun handle-identifier (stream char)
  (let ((*readtable* (copy-readtable)))
    (set-syntax-from-char char #\z)
    (let ((line (line-of stream))
          (column (column-of stream)))
      (unread-char char stream)
      (let ((obj (read stream t nil t)))
        (if (identifier? obj)
            (or (id->keyword obj line column)
                (make-token :name :identifier :value obj :line line :column column))
            (lexer-error "Invalid identifier name: ~A" obj))))))

(defun handle-integer (stream char)
  (let ((*readtable* (copy-readtable)))
    (set-syntax-from-char char #\z)
    (let ((line (line-of stream))
          (column (column-of stream)))
      (unread-char char stream)
      (let ((obj (read stream t nil t)))
        (if (integerp obj)
            (make-token :name :integer :value obj :line line :column column)
            (lexer-error "Invalid integer: ~A" obj))))))

(defun handle-char-literal (stream char)
  (declare (ignore char))
  (let* ((line (line-of stream))
         (column (column-of stream))
         (ch (read-char stream t nil t))
         (parsed (case ch
                   (#\' (lexer-error "Empty character constant"))
                   (#\Newline (lexer-error "New line in character literal"))
                   (#\\ (let ((next-ch (read-char stream t nil t)))
                          (case next-ch
                            (#\n #\Newline)
                            (#\\ #\\)
                            (t (lexer-error "Unknown escape sequence: \\~A" next-ch)))))
                   (t ch))))
    (if (char= #\' (read-char stream t nil t))
        (make-token :name :integer :value (char-code parsed) :line line :column column)
        (lexer-error "Only one character is allowed in character literal"))))

(defun handle-string (stream char)
  (declare (ignore char))
  (loop with result = (make-array 0 :element-type 'character :adjustable t :fill-pointer t)
        with line = (line-of stream)
        with column = (column-of stream)
        for ch = (read-char stream t nil t)
        until (char= ch #\")
        do (setf ch (case ch
                      (#\Newline (lexer-error "New line in string"))
                      (#\\ (let ((next-ch (read-char stream t nil t)))
                             (case next-ch
                               (#\n #\Newline)
                               (#\\ #\\)
                               (t (lexer-error "Unknown escape sequence: \\~A" next-ch)))))
                      (t ch)))
           (vector-push-extend ch result)
        finally (return (make-token :name :string :value result :line line :column column))))

(defun make-lexer-readtable ()
  (let ((*readtable* (copy-readtable nil)))
    (setf (readtable-case *readtable*) :preserve)
    (set-syntax-from-char #\\ #\z)
    (set-syntax-from-char #\# #\z)
    (set-syntax-from-char #\` #\z)

    ;; operators
    (set-macro-character #\* (make-constant-handler :op-multiply))
    (set-macro-character #\/ #'handle-divide-or-comment)
    (set-macro-character #\% (make-constant-handler :op-mod))
    (set-macro-character #\+ (make-constant-handler :op-add))
    (set-macro-character #\- (make-constant-handler :op-subtract))
    (set-macro-character #\< (make-this-or-that-handler #\= :op-lessequal :op-less))
    (set-macro-character #\> (make-this-or-that-handler #\= :op-greaterequal :op-greater))
    (set-macro-character #\= (make-this-or-that-handler #\= :op-equal :op-assign))
    (set-macro-character #\! (make-this-or-that-handler #\= :op-notequal :op-not))
    (set-macro-character #\& (make-this-or-that-handler #\& :op-and))
    (set-macro-character #\| (make-this-or-that-handler #\| :op-or))

    ;; symbols
    (set-macro-character #\( (make-constant-handler :leftparen))
    (set-macro-character #\) (make-constant-handler :rightparen))
    (set-macro-character #\{ (make-constant-handler :leftbrace))
    (set-macro-character #\} (make-constant-handler :rightbrace))
    (set-macro-character #\; (make-constant-handler :semicolon))
    (set-macro-character #\, (make-constant-handler :comma))

    ;; identifiers & keywords
    (set-macro-character #\_ #'handle-identifier t)
    (loop for ch across "abcdefghijklmnopqrstuvwxyz"
          do (set-macro-character ch #'handle-identifier t))
    (loop for ch across "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          do (set-macro-character ch #'handle-identifier t))

    ;; integers
    (loop for ch across "0123456789"
          do (set-macro-character ch #'handle-integer t))
    (set-macro-character #\' #'handle-char-literal)

    ;; strings
    (set-macro-character #\" #'handle-string)

    *readtable*))

(defun lex (stream)
  (loop with *readtable* = (make-lexer-readtable)
        with *package* = +lex-symbols-package+
        with eof = (gensym)
        with counting-stream = (make-instance 'counting-character-input-stream :stream stream)
        for token = (read counting-stream nil eof)
        until (eq token eof)
        do (format t "~5D ~5D ~15A~@[ ~S~]~%"
                   (token-line token) (token-column token) (token-name token) (token-value token))
        finally (format t "~5D ~5D ~15A~%"
                        (line-of counting-stream) (column-of counting-stream) :end-of-input)
                (close counting-stream)))

(defun main ()
  (lex *standard-input*))
Output  —  test case 3:
    5    16 KEYWORD-PRINT  
    5    40 OP-SUBTRACT    
    6    16 KEYWORD-PUTC   
    6    40 OP-LESS        
    7    16 KEYWORD-IF     
    7    40 OP-GREATER     
    8    16 KEYWORD-ELSE   
    8    40 OP-LESSEQUAL   
    9    16 KEYWORD-WHILE  
    9    40 OP-GREATEREQUAL
   10    16 LEFTBRACE      
   10    40 OP-EQUAL       
   11    16 RIGHTBRACE     
   11    40 OP-NOTEQUAL    
   12    16 LEFTPAREN      
   12    40 OP-AND         
   13    16 RIGHTPAREN     
   13    40 OP-OR          
   14    16 OP-SUBTRACT    
   14    40 SEMICOLON      
   15    16 OP-NOT         
   15    40 COMMA          
   16    16 OP-MULTIPLY    
   16    40 OP-ASSIGN      
   17    16 OP-DIVIDE      
   17    40 INTEGER         42
   18    16 OP-MOD         
   18    40 STRING          "String literal"
   19    16 OP-ADD         
   19    40 IDENTIFIER      variable_name
   20    26 INTEGER         10
   21    26 INTEGER         92
   22    26 INTEGER         32
   23     1 END-OF-INPUT   

Elixir

Works with: Elixir version 1.13.3
Translation of: ATS
#!/bin/env elixir
# -*- elixir -*-

defmodule Lex do

  def main args do
    {inpf_name, outf_name, exit_status} =
      case args do
        [] -> {"-", "-", 0}
        [name] -> {name, "-", 0}
        [name1, name2] -> {name1, name2, 0}
        [name1, name2 | _] -> {name1, name2, usage_error()}
      end

    {inpf, outf, exit_status} =
      case {inpf_name, outf_name, exit_status} do
        {"-", "-", 0} -> {:stdio, :stdio, 0}
        {name1, "-", 0} ->
          {inpf, exit_status} = open_file(name1, [:read])
          {inpf, :stdio, exit_status}
        {"-", name2, 0} ->
          {outf, exit_status} = open_file(name2, [:write])
          {:stdio, outf, exit_status}
        {name1, name2, 0} ->
          {inpf, exit_status} = open_file(name1, [:read])
          if exit_status != 0 do
            {inpf, name2, exit_status}
          else
            {outf, exit_status} = open_file(name2, [:write])
            {inpf, outf, exit_status}
          end
        _ -> {inpf_name, outf_name, exit_status}
      end

    exit_status =
      case exit_status do
        0 -> main_program inpf, outf
        _ -> exit_status
    end

    # Choose one.
    System.halt exit_status     # Fast exit.
    #System.stop exit_status    # Laborious cleanup.
  end

  def main_program inpf, outf do
    inp = make_inp inpf
    scan_text outf, inp
    exit_status = 0
    exit_status
  end

  def open_file name, rw do
    case File.open name, rw do
      {:ok, f} -> {f, 0}
      _ ->
        IO.write :stderr, "Cannot open "
        IO.write :stderr, name
        case rw do
          [:read] -> IO.puts " for input"
          [:write] -> IO.puts " for output"
        end
        {name, 1}
    end
  end

  def scan_text outf, inp do
    {toktup, inp} = get_next_token inp
    print_token outf, toktup
    case toktup do
      {"End_of_input", _, _, _} -> :ok
      _ -> scan_text outf, inp
    end
  end

  def print_token outf, {tok, arg, line_no, column_no} do
    IO.write outf, (String.pad_leading "#{line_no}", 5)
    IO.write outf, " "
    IO.write outf, (String.pad_leading "#{column_no}", 5)
    IO.write outf, "  "
    IO.write outf, tok
    case tok do
      "Identifier" ->
        IO.write outf, "     "
        IO.write outf, arg
      "Integer" ->
        IO.write outf, "        "
        IO.write outf, arg
      "String" ->
        IO.write outf, "         "
        IO.write outf, arg
      _ -> :ok
    end
    IO.puts outf, ""
  end

###-------------------------------------------------------------------
###
### The token dispatcher.
###

  def get_next_token inp do
    inp = skip_spaces_and_comments inp
    {ch, inp} = get_ch inp
    {chr, line_no, column_no} = ch
    ln = line_no
    cn = column_no
    case chr do
      :eof -> {{"End_of_input", "", ln, cn}, inp}
      "," -> {{"Comma", ",", ln, cn}, inp}
      ";" -> {{"Semicolon", ";", ln, cn}, inp}
      "(" -> {{"LeftParen", "(", ln, cn}, inp}
      ")" -> {{"RightParen", ")", ln, cn}, inp}
      "{" -> {{"LeftBrace", "{", ln, cn}, inp}
      "}" -> {{"RightBrace", "}", ln, cn}, inp}
      "*" -> {{"Op_multiply", "*", ln, cn}, inp}
      "/" -> {{"Op_divide", "/", ln, cn}, inp}
      "%" -> {{"Op_mod", "%", ln, cn}, inp}
      "+" -> {{"Op_add", "+", ln, cn}, inp}
      "-" -> {{"Op_subtract", "-", ln, cn}, inp}
      "<" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "=" -> {{"Op_lessequal", "<=", ln, cn}, inp}
          _ -> {{"Op_less", "<", ln, cn}, (push_back ch1, inp)}
        end
      ">" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "=" -> {{"Op_greaterequal", ">=", ln, cn}, inp}
          _ -> {{"Op_greater", ">", ln, cn}, (push_back ch1, inp)}
        end
      "=" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "=" -> {{"Op_equal", "==", ln, cn}, inp}
          _ -> {{"Op_assign", "=", ln, cn}, (push_back ch1, inp)}
        end
      "!" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "=" -> {{"Op_notequal", "!=", ln, cn}, inp}
          _ -> {{"Op_not", "!", ln, cn}, (push_back ch1, inp)}
        end
      "&" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "&" -> {{"Op_and", "&&", ln, cn}, inp}
          _ -> unexpected_character ln, cn, chr
        end
      "|" ->
        {ch1, inp} = get_ch inp
        {chr1, _, _} = ch1
        case chr1 do
          "|" -> {{"Op_or", "||", ln, cn}, inp}
          _ -> unexpected_character ln, cn, chr
        end
      "\"" ->
        inp = push_back ch, inp
        scan_string_literal inp
      "'" ->
        inp = push_back ch, inp
        scan_character_literal inp
      _ ->
        cond do
          String.match? chr, ~r/^[[:digit:]]$/u ->
            inp = push_back ch, inp
            scan_integer_literal inp
          String.match? chr, ~r/^[[:alpha:]_]$/u ->
            inp = push_back ch, inp
            scan_identifier_or_reserved_word inp
          true -> unexpected_character ln, cn, chr
        end
    end
  end

###-------------------------------------------------------------------
###
### Skipping past spaces and /* ... */ comments.
###
### Comments are treated exactly like a bit of whitespace. They never
### make it to the dispatcher.
###

  def skip_spaces_and_comments inp do
    {ch, inp} = get_ch inp
    {chr, line_no, column_no} = ch
    cond do
      chr == :eof -> push_back ch, inp
      String.match? chr, ~r/^[[:space:]]$/u ->
        skip_spaces_and_comments inp
      chr == "/" ->
        {ch1, inp} = get_ch inp
        case ch1 do
          {"*", _, _} ->
            inp = scan_comment inp, line_no, column_no
            skip_spaces_and_comments inp
          _ -> push_back ch, (push_back ch1, inp)
        end
      true -> push_back ch, inp
    end
  end

  def scan_comment inp, line_no, column_no do
    {ch, inp} = get_ch inp
    case ch do
      {:eof, _, _} -> unterminated_comment line_no, column_no
      {"*", _, _} ->
        {ch1, inp} = get_ch inp
        case ch1 do
          {:eof, _, _} -> unterminated_comment line_no, column_no
          {"/", _, _} -> inp
          _ -> scan_comment inp, line_no, column_no
        end
      _ -> scan_comment inp, line_no, column_no
    end
  end

###-------------------------------------------------------------------
###
### Scanning of integer literals, identifiers, and reserved words.
###
### These three types of token are very similar to each other.
###

  def scan_integer_literal inp do
    # Scan an entire word, not just digits. This way we detect
    # erroneous text such as "23skidoo".
    {line_no, column_no, inp} = get_position inp
    {word, inp} = scan_word inp
    if String.match? word, (~r/^[[:digit:]]+$/u) do
      {{"Integer", word, line_no, column_no}, inp}
    else
      invalid_integer_literal line_no, column_no, word
    end
  end

  def scan_identifier_or_reserved_word inp do
    # It is assumed that the first character is of the correct type,
    # thanks to the dispatcher.
    {line_no, column_no, inp} = get_position inp
    {word, inp} = scan_word inp
    tok =
      case word do
        "if" -> "Keyword_if"
        "else" -> "Keyword_else"
        "while" -> "Keyword_while"
        "print" -> "Keyword_print"
        "putc" -> "Keyword_putc"
        _ -> "Identifier"
      end
    {{tok, word, line_no, column_no}, inp}
  end

  def scan_word inp, word\\"" do
    {ch, inp} = get_ch inp
    {chr, _, _} = ch
    if String.match? chr, (~r/^[[:alnum:]_]$/u) do
      scan_word inp, (word <> chr)
    else
      {word, (push_back ch, inp)}
    end
  end

  def get_position inp do
    {ch, inp} = get_ch inp
    {_, line_no, column_no} = ch
    inp = push_back ch, inp
    {line_no, column_no, inp}
  end

###-------------------------------------------------------------------
###
### Scanning of string literals.
###
### It is assumed that the first character is the opening quote, and
### that the closing quote is the same character.
###

  def scan_string_literal inp do
    {ch, inp} = get_ch inp
    {quote_mark, line_no, column_no} = ch
    {contents, inp} = scan_str_lit inp, ch
    {{"String", quote_mark <> contents <> quote_mark,
      line_no, column_no},
     inp}
  end

  def scan_str_lit inp, ch, contents\\"" do
    {quote_mark, line_no, column_no} = ch    
    {ch1, inp} = get_ch inp
    {chr1, line_no1, column_no1} = ch1
    if chr1 == quote_mark do 
      {contents, inp}
    else
      case chr1 do
        :eof -> eoi_in_string_literal line_no, column_no
        "\n" -> eoln_in_string_literal line_no, column_no
        "\\" ->
          {ch2, inp} = get_ch inp
          {chr2, _, _} = ch2
          case chr2 do
            "n" -> scan_str_lit inp, ch, (contents <> "\\n")
            "\\" -> scan_str_lit inp, ch, (contents <> "\\\\")
            _ -> unsupported_escape line_no1, column_no1, chr2
          end
        _ -> scan_str_lit inp, ch, (contents <> chr1)
      end
    end
  end

###-------------------------------------------------------------------
###
### Scanning of character literals.
###
### It is assumed that the first character is the opening quote, and
### that the closing quote is the same character.
###
### The tedious part of scanning a character literal is distinguishing
### between the kinds of lexical error. (One might wish to modify the
### code to detect, as a distinct kind of error, end of line within a
### character literal.)
###

  def scan_character_literal inp do
    {ch, inp} = get_ch inp
    {_, line_no, column_no} = ch
    {ch1, inp} = get_ch inp
    {chr1, line_no1, column_no1} = ch1
    {intval, inp} =
      case chr1 do
        :eof -> unterminated_character_literal line_no, column_no
        "\\" ->
          {ch2, inp} = get_ch inp
          {chr2, _, _} = ch2
          case chr2 do
            :eof -> unterminated_character_literal line_no, column_no
            "n" -> {(:binary.first "\n"), inp}
            "\\" -> {(:binary.first "\\"), inp}
            _ -> unsupported_escape line_no1, column_no1, chr2
          end
        _ -> {(:binary.first chr1), inp}
      end
    inp = check_character_literal_end inp, ch
    {{"Integer", "#{intval}", line_no, column_no}, inp}
  end

  def check_character_literal_end inp, ch do
    {chr, _, _} = ch
    {{chr1, _, _}, inp} = get_ch inp
    if chr1 == chr do
      inp
    else
      # Lexical error.
      find_char_lit_end inp, ch
    end
  end

  def find_char_lit_end inp, ch do
    {chr, line_no, column_no} = ch
    {{chr1, _, _}, inp} = get_ch inp
    if chr1 == chr do
      multicharacter_literal line_no, column_no
    else
      case chr1 do
        :eof -> unterminated_character_literal line_no, column_no
        _ -> find_char_lit_end inp, ch
      end
    end
  end

###-------------------------------------------------------------------
###
### Character-at-a-time input, with unrestricted pushback, and with
### line and column numbering.
###

  def make_inp inpf do
    {inpf, [], 1, 1}
  end

  def get_ch {inpf, pushback, line_no, column_no} do
    case pushback do
      [head | tail] ->
        {head, {inpf, tail, line_no, column_no}}
      [] ->
        case IO.read(inpf, 1) do
          :eof ->
            {{:eof, line_no, column_no},
             {inpf, pushback, line_no, column_no}}
          {:error, _} ->
            {{:eof, line_no, column_no},
             {inpf, pushback, line_no, column_no}}
          chr ->
            case chr do
              "\n" ->
                {{chr, line_no, column_no},
                 {inpf, pushback, line_no + 1, 1}}
              _ ->
                {{chr, line_no, column_no},
                 {inpf, pushback, line_no, column_no + 1}}
            end
        end
    end
  end

  def push_back ch, {inpf, pushback, line_no, column_no} do
    {inpf, [ch | pushback], line_no, column_no}
  end

###-------------------------------------------------------------------
###
### Lexical and usage errors.
###

  def unterminated_comment line_no, column_no do
    raise "#{scriptname()}: unterminated comment at #{line_no}:#{column_no}"
  end

  def invalid_integer_literal line_no, column_no, word do
    raise "#{scriptname()}: invalid integer literal #{word} at #{line_no}:#{column_no}"
  end

  def unsupported_escape line_no, column_no, chr do
    raise "#{scriptname()}: unsupported escape \\#{chr} at #{line_no}:#{column_no}"
  end

  def eoi_in_string_literal line_no, column_no do
    raise "#{scriptname()}: end of input in string literal starting at #{line_no}:#{column_no}"
  end

  def eoln_in_string_literal line_no, column_no do
    raise "#{scriptname()}: end of line in string literal starting at #{line_no}:#{column_no}"
  end

  def multicharacter_literal line_no, column_no do
    raise "#{scriptname()}: unsupported multicharacter literal at #{line_no}:#{column_no}"
  end

  def unterminated_character_literal line_no, column_no do
    raise "#{scriptname()}: unterminated character literal starting at #{line_no}:#{column_no}"
  end

  def unexpected_character line_no, column_no, chr do
    raise "#{scriptname()}: unexpected character '#{chr}' at #{line_no}:#{column_no}"
  end    

  def usage_error() do
    IO.puts "Usage: #{scriptname()} [INPUTFILE [OUTPUTFILE]]"
    IO.puts "If either of INPUTFILE or OUTPUTFILE is not present or is \"-\","
    IO.puts "standard input or standard output is used, respectively."
    exit_status = 2
    exit_status
  end

  def scriptname() do
    Path.basename(__ENV__.file)
  end

#---------------------------------------------------------------------

end ## module Lex

Lex.main(System.argv)
Output:
$ ./lex testcase3.t
    5    16  Keyword_print
    5    40  Op_subtract
    6    16  Keyword_putc
    6    40  Op_less
    7    16  Keyword_if
    7    40  Op_greater
    8    16  Keyword_else
    8    40  Op_lessequal
    9    16  Keyword_while
    9    40  Op_greaterequal
   10    16  LeftBrace
   10    40  Op_equal
   11    16  RightBrace
   11    40  Op_notequal
   12    16  LeftParen
   12    40  Op_and
   13    16  RightParen
   13    40  Op_or
   14    16  Op_subtract
   14    40  Semicolon
   15    16  Op_not
   15    40  Comma
   16    16  Op_multiply
   16    40  Op_assign
   17    16  Op_divide
   17    40  Integer        42
   18    16  Op_mod
   18    40  String         "String literal"
   19    16  Op_add
   19    40  Identifier     variable_name
   20    26  Integer        10
   21    26  Integer        92
   22    26  Integer        32
   23     1  End_of_input


Emacs Lisp

Works with: Emacs version GNU 27.2
Translation of: ATS


#!/usr/bin/emacs --script
;;
;; The Rosetta Code lexical analyzer in GNU Emacs Lisp.
;;
;; Migrated from the ATS. However, Emacs Lisp is not friendly to the
;; functional style of the ATS implementation; therefore the
;; differences are vast.
;;
;; (A Scheme migration could easily, on the other hand, have been
;; almost exact. It is interesting to contrast Lisp dialects and see
;; how huge the differences are.)
;;
;; The script currently takes input only from standard input and
;; writes the token stream only to standard output.
;;

(require 'cl-lib)

;;; The type of a character, consisting of its code point and where it
;;; occurred in the text.
(cl-defstruct (ch_t (:constructor make-ch (ichar line-no column-no)))
  ichar line-no column-no)

(defun ch-ichar (ch)
  (ch_t-ichar ch))

(defun ch-line-no (ch)
  (ch_t-line-no ch))

(defun ch-column-no (ch)
  (ch_t-column-no ch))

;;; The type of an "inputter", consisting of an open file for the
;;; text, a pushback buffer (which is an indefinitely deep stack of
;;; ch_t), an input buffer for the current line, and a position in the
;;; text.
(cl-defstruct inp_t file pushback line line-no column-no)

(defun make-inp (file)
  "Initialize a new inp_t."
  (make-inp_t :file file
              :pushback '()
              :line ""
              :line-no 0
              :column-no 0))

(defvar inp (make-inp t)
  "A global inp_t.")

(defun get-ch ()
  "Get a ch_t, either from the pushback buffer or from the input."
  (pcase (inp_t-pushback inp)
    (`(,ch . ,tail)
     ;; Emacs Lisp has only single value return, so the results come
     ;; back as a list rather than multiple values.
     (setq inp (make-inp_t :file (inp_t-file inp)
                           :pushback tail
                           :line (inp_t-line inp)
                           :line-no (inp_t-line-no inp)
                           :column-no (inp_t-column-no inp)))
     ch)
    ('()
     (let ((line (inp_t-line inp))
           (line-no (inp_t-line-no inp))
           (column-no (inp_t-column-no inp)))
       (when (string= line "")
         ;; Refill the buffer.
         (let ((text
                (condition-case nil (read-string "")
                  nil (error 'eoi))))
           (if (eq text 'eoi)
               (setq line 'eoi)
             (setq line (format "%s%c" text ?\n)))
           (setq line-no (1+ line-no))
           (setq column-no 1)))
       (if (eq line 'eoi)
           (progn
             (setq inp (make-inp_t :file (inp_t-file inp)
                                   :pushback (inp_t-pushback inp)
                                   :line line
                                   :line-no line-no
                                   :column-no column-no))
             (make-ch 'eoi line-no column-no))
         (let ((c (elt line 0))
               (line (substring line 1)))
           (setq inp (make-inp_t :file (inp_t-file inp)
                                 :pushback (inp_t-pushback inp)
                                 :line line
                                 :line-no line-no
                                 :column-no (1+ column-no)))
           (make-ch c line-no column-no)))))))

(defun get-new-line (file)
  ;; Currently "file" is ignored and the input must be from stdin.
  (read-from-minibuffer "" :default 'eoi))

(defun push-back (ch)
  "Push back a ch_t."
  (setq inp (make-inp_t :file (inp_t-file inp)
                        :pushback (cons ch (inp_t-pushback inp))
                        :line (inp_t-line inp)
                        :line-no (inp_t-line-no inp)
                        :column-no (inp_t-column-no inp))))

(defun get-position ()
  "Return the line-no and column-no of the next ch_t to be
returned by get-ch, assuming there are no more pushbacks
beforehand."
  (let* ((ch (get-ch))
         (line-no (ch-line-no ch))
         (column-no (ch-column-no ch)))
    (push-back ch)
    (list line-no column-no)))

(defun scan-text (outf)
  "The main loop."
  (cl-loop for toktup = (get-next-token)
           do (print-token outf toktup)
           until (string= (elt toktup 0) "End_of_input")))

(defun print-token (outf toktup)
  "Print a token, along with its position and possibly an
argument."
  ;; Currently outf is ignored, and the output goes to stdout.
  (pcase toktup
    (`(,tok ,arg ,line-no ,column-no)
     (princ (format "%5d %5d  %s" line-no column-no tok))
     (pcase tok
       ("Identifier" (princ (format "     %s\n" arg)))
       ("Integer" (princ (format "        %s\n" arg)))
       ("String" (princ (format "         %s\n" arg)))
       (_ (princ "\n"))))))

(defun get-next-token ()
   "The token dispatcher. Returns the next token, as a list along
with its argument and text position."
   (skip-spaces-and-comments)
   (let* ((ch (get-ch))
          (ln (ch-line-no ch))
          (cn (ch-column-no ch)))
     (pcase (ch-ichar ch)
       ('eoi (list "End_of_input" "" ln cn))
       (?, (list "Comma" "," ln cn))
       (?\N{SEMICOLON} (list "Semicolon" ";" ln cn))
       (?\N{LEFT PARENTHESIS} (list "LeftParen" "(" ln cn))
       (?\N{RIGHT PARENTHESIS} (list "RightParen" ")" ln cn))
       (?{ (list "LeftBrace" "{" ln cn))
       (?} (list "RightBrace" "}" ln cn))
       (?* (list "Op_multiply" "*" ln cn))
       (?/ (list "Op_divide" "/" ln cn))
       (?% (list "Op_mod" "%" ln cn))
       (?+ (list "Op_add" "+" ln cn))
       (?- (list "Op_subtract" "-" ln cn))
       (?< (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?= (list "Op_lessequal" "<=" ln cn))
               (_ (push-back ch1)
                  (list "Op_less" "<" ln cn)))))
       (?> (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?= (list "Op_greaterequal" ">=" ln cn))
               (_ (push-back ch1)
                  (list "Op_greater" ">" ln cn)))))
       (?= (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?= (list "Op_equal" "==" ln cn))
               (_ (push-back ch1)
                  (list "Op_assign" "=" ln cn)))))
       (?! (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?= (list "Op_notequal" "!=" ln cn))
               (_ (push-back ch1)
                  (list "Op_not" "!" ln cn)))))
       (?& (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?& (list "Op_and" "&&" ln cn))
               (_ (unexpected-character ln cn (get-ichar ch))))))
       (?| (let ((ch1 (get-ch)))
             (pcase (ch-ichar ch1)
               (?| (list "Op_or" "||" ln cn))
               (_ (unexpected-character ln cn (get-ichar ch))))))
       (?\N{QUOTATION MARK} (push-back ch) (scan-string-literal))
       (?\N{APOSTROPHE} (push-back ch) (scan-character-literal))
       ((pred digitp) (push-back ch) (scan-integer-literal))
       ((pred identifier-start-p)
        (progn
          (push-back ch)
          (scan-identifier-or-reserved-word)))
       (c (unexpected-character ln cn c)))))

(defun skip-spaces-and-comments ()
  "Skip spaces and comments. A comment is treated as equivalent
to a run of spaces."
  (cl-loop for ch = (let ((ch1 (get-ch)))
                      (pcase (ch-ichar ch1)
                        (?/ (let* ((ch2 (get-ch))
                                   (line-no (ch-line-no ch1))
                                   (column-no (ch-column-no ch1))
                                   (position `(,line-no ,column-no)))
                              (pcase (ch-ichar ch2)
                                (?* (scan-comment position)
                                    (get-ch))
                                (_ (push-back ch2)
                                   ch1))))
                        (_ ch1)))
           while (spacep (ch-ichar ch))
           finally do (push-back ch)))

(defun scan-comment (position)
  (cl-loop for ch = (get-ch)
           for done = (comment-done-p ch position)
           until done))

(defun comment-done-p (ch position)
  (pcase (ch-ichar ch)
    ('eoi (apply 'unterminated-comment position))
    (?* (let ((ch1 (get-ch)))
          (pcase (ch-ichar ch1)
            ('eoi (apply 'unterminated-comment position))
            (?/ t)
            (_ nil))))
    (_ nil)))

(defun scan-integer-literal ()
  "Scan an integer literal, on the assumption that a digit has
been seen and pushed back."
  (let* ((position (get-position))
         (lst (scan-word))
         (s (list-to-string lst)))
    (if (all-digits-p lst)
        `("Integer" ,s . ,position)
      (apply 'illegal-integer-literal `(,@position , s)))))

(defun scan-identifier-or-reserved-word ()
   "Scan an identifier or reserved word, on the assumption that a
legal first character (for an identifier) has been seen and
pushed back."
   (let* ((position (get-position))
          (lst (scan-word))
          (s (list-to-string lst))
          (tok (pcase s
                 ("else" "Keyword_else")
                 ("if" "Keyword_if")
                 ("while" "Keyword_while")
                 ("print" "Keyword_print")
                 ("putc" "Keyword_putc")
                 (_ "Identifier"))))
     `(,tok ,s . ,position)))

(defun scan-word ()
  (cl-loop for ch = (get-ch)
           while (identifier-continuation-p (ch-ichar ch))
           collect (ch-ichar ch)
           finally do (push-back ch)))

(defun scan-string-literal ()
    "Scan a string literal, on the assumption that a double quote
has been seen and pushed back."
  (let* ((ch (get-ch))
         (_ (cl-assert (= (ch-ichar ch) ?\N{QUOTATION MARK})))
         (line-no (ch-line-no ch))
         (column-no (ch-column-no ch))
         (position `(,line-no ,column-no))
         (lst (scan-str-lit position))
         (lst `(?\N{QUOTATION MARK} ,@lst ?\N{QUOTATION MARK})))
    `("String" ,(list-to-string lst) . ,position)))

(defun scan-str-lit (position)
  (flatten
   (cl-loop for ch = (get-ch)
            until (= (ch-ichar ch) ?\N{QUOTATION MARK})
            collect (process-str-lit-character
                     (ch-ichar ch) position))))

(defun process-str-lit-character (c position)
  ;; NOTE: This script might insert a newline before any eoi, so that
  ;; "end-of-input-in-string-literal" never actually occurs. It is a
  ;; peculiarity of the script's input mechanism.
  (pcase c
    ('eoi (apply 'end-of-input-in-string-literal position))
    (?\n (apply 'end-of-line-in-string-literal position))
    (?\\ (let ((ch1 (get-ch)))
           (pcase (ch-ichar ch1)
             (?n '(?\\ ?n))
             (?\\ '(?\\ ?\\))
             (c (unsupported-escape (ch-line-no ch1)
                                    (ch-column-no ch1)
                                    c)))))
    (c c)))

(defun scan-character-literal ()
  "Scan a character literal, on the assumption that an ASCII
single quote (that is, a Unicode APOSTROPHE) has been seen and
pushed back."
  (let* ((toktup (scan-character-literal-without-checking-end))
         (line-no (elt toktup 2))
         (column-no (elt toktup 3))
         (position (list line-no column-no)))
    (check-char-lit-end position)
    toktup))

(defun check-char-lit-end (position)
  (let ((ch (get-ch)))
    (unless (and (integerp (ch-ichar ch))
                 (= (ch-ichar ch) ?\N{APOSTROPHE}))
      (push-back ch)
      (loop-to-char-lit-end position))))

(defun loop-to-char-lit-end (position)
  (cl-loop for ch = (get-ch)
           until (or (eq (ch-ichar ch) 'eoi)
                     (= (ch-ichar ch) ?\N{APOSTROPHE}))
           finally do (if (eq (ch-ichar ch) 'eoi)
                          (apply 'unterminated-character-literal
                                 position)
                        (apply 'multicharacter-literal position))))

(defun scan-character-literal-without-checking-end ()
  (let* ((ch (get-ch))
         (_ (cl-assert (= (ch-ichar ch) ?\N{APOSTROPHE})))
         (line-no (ch-line-no ch))
         (column-no (ch-column-no ch))
         (position (list line-no column-no))
         (ch1 (get-ch)))
    (pcase (ch-ichar ch1)
      ('eoi (apply 'unterminated-character-literal position))
      (?\\ (let ((ch2 (get-ch)))
             (pcase (ch-ichar ch2)
               ('eoi (apply 'unterminated-character-literal position))
               (?n `("Integer" ,(format "%d" ?\n) . ,position))
               (?\\ `("Integer" ,(format "%d" ?\\) . ,position))
               (c (unsupported-escape (ch-line-no ch1)
                                      (ch-column-no ch1)
                                      c)))))
      (c `("Integer" ,(format "%d" c) . ,position)))))

(defun spacep (c)
  (and (integerp c) (or (= c ?\N{SPACE})
                        (and (<= 9 c) (<= c 13)))))

(defun digitp (c)
  (and (integerp c) (<= ?0 c) (<= c ?9)))

(defun lowerp (c)
  ;; Warning: in EBCDIC, this kind of test for "alphabetic" is no
  ;; good. The letters are not contiguous.
  (and (integerp c) (<= ?a c) (<= c ?z)))

(defun upperp (c)
  ;; Warning: in EBCDIC, this kind of test for "alphabetic" is no
  ;; good. The letters are not contiguous.
  (and (integerp c) (<= ?A c) (<= c ?Z)))

(defun alphap (c)
  (or (lowerp c) (upperp c)))

(defun identifier-start-p (c)
  (and (integerp c) (or (alphap c) (= c ?_))))

(defun identifier-continuation-p (c)
  (and (integerp c) (or (alphap c) (= c ?_) (digitp c))))

(defun all-digits-p (thing)
  (cl-loop for c in thing
           if (not (digitp c)) return nil
           finally return t))

(defun list-to-string (lst)
  "Convert a list of characters to a string."
  (apply 'string lst))

(defun flatten (lst)
  "Flatten nested lists. (The implementation is recursive and not
for very long lists.)"
  (pcase lst
    ('() '())
    (`(,head . ,tail)
     (if (listp head)
         (append (flatten head) (flatten tail))
       (cons head (flatten tail))))))

(defun unexpected-character (line-no column-no c)
  (error (format "unexpected character '%c' at %d:%d"
                 c line-no column-no)))

(defun unsupported-escape (line-no column-no c)
  (error (format "unsupported escape \\%c at %d:%d"
                 c line-no column-no)))

(defun illegal-integer-literal (line-no column-no s)
  (error (format "illegal integer literal \"%s\" at %d:%d"
                 s line-no column-no)))

(defun unterminated-character-literal (line-no column-no)
  (error (format "unterminated character literal starting at %d:%d"
                 line-no column-no)))

(defun multicharacter-literal (line-no column-no)
  (error (format
          "unsupported multicharacter literal starting at %d:%d"
          line-no column-no)))

(defun end-of-input-in-string-literal (line-no column-no)
  (error (format "end of input in string literal starting at %d:%d"
                 line-no column-no)))

(defun end-of-line-in-string-literal (line-no column-no)
  (error (format "end of line in string literal starting at %d:%d"
                 line-no column-no)))

(defun unterminated-comment (line-no column-no)
  (error (format "unterminated comment starting at %d:%d"
                 line-no column-no)))

(defun main ()
  (setq inp (make-inp t))
  (scan-text t))

(main)


Output:
$ ./lex-in-el < compiler-tests/testcase3.t
    5    16  Keyword_print
    5    40  Op_subtract
    6    16  Keyword_putc
    6    40  Op_less
    7    16  Keyword_if
    7    40  Op_greater
    8    16  Keyword_else
    8    40  Op_lessequal
    9    16  Keyword_while
    9    40  Op_greaterequal
   10    16  LeftBrace
   10    40  Op_equal
   11    16  RightBrace
   11    40  Op_notequal
   12    16  LeftParen
   12    40  Op_and
   13    16  RightParen
   13    40  Op_or
   14    16  Op_subtract
   14    40  Semicolon
   15    16  Op_not
   15    40  Comma
   16    16  Op_multiply
   16    40  Op_assign
   17    16  Op_divide
   17    40  Integer        42
   18    16  Op_mod
   18    40  String         "String literal"
   19    16  Op_add
   19    40  Identifier     variable_name
   20    26  Integer        10
   21    26  Integer        92
   22    26  Integer        32
   23     1  End_of_input

Erlang

Works with: Erlang version 24.3.3
Translation of: ATS
Translation of: Elixir


#!/bin/env escript
%%%-------------------------------------------------------------------

-record (inp_t, {inpf, pushback, line_no, column_no}).

main (Args) ->
   main_program (Args).

main_program ([]) ->
   scan_from_inpf_to_outf ("-", "-"),
   halt (0);
main_program ([Inpf_filename]) ->
   scan_from_inpf_to_outf (Inpf_filename, "-"),
   halt (0);
main_program ([Inpf_filename, Outf_filename]) ->
   scan_from_inpf_to_outf (Inpf_filename, Outf_filename),
   halt (0);
main_program ([_, _ | _]) ->
   ProgName = escript:script_name (),
   io:put_chars (standard_error, "Usage: "),
   io:put_chars (standard_error, ProgName),
   io:put_chars (standard_error, " [INPUTFILE [OUTPUTFILE]]\n"),
   halt (1).

scan_from_inpf_to_outf ("-", "-") ->
   scan_input (standard_io, standard_io);
scan_from_inpf_to_outf (Inpf_filename, "-") ->
   case file:open (Inpf_filename, [read]) of
      {ok, Inpf} -> scan_input (Inpf, standard_io);
      _ -> open_failure (Inpf_filename, "input")
   end;
scan_from_inpf_to_outf ("-", Outf_filename) ->
   case file:open (Outf_filename, [write]) of
      {ok, Outf} -> scan_input (standard_io, Outf);
      _ -> open_failure (Outf_filename, "output")
   end;
scan_from_inpf_to_outf (Inpf_filename, Outf_filename) ->
   case file:open(Inpf_filename, [read]) of
      {ok, Inpf} ->
         case file:open (Outf_filename, [write]) of
            {ok, Outf} -> scan_input (Inpf, Outf);
            _ -> open_failure (Outf_filename, "output")
         end;
      _ -> open_failure (Inpf_filename, "input")
   end.

open_failure (Filename, ForWhat) ->
   ProgName = escript:script_name (),
   io:put_chars (standard_error, ProgName),
   io:put_chars (standard_error, ": failed to open \""),
   io:put_chars (standard_error, Filename),
   io:put_chars (standard_error, "\" for "),
   io:put_chars (standard_error, ForWhat),
   io:put_chars (standard_error, "\n"),
   halt (1).

scan_input (Inpf, Outf) ->
   scan_text (Outf, make_inp (Inpf)).

scan_text (Outf, Inp) ->
   {TokTup, Inp1} = get_next_token (Inp),
   print_token (Outf, TokTup),
   case TokTup of
      {"End_of_input", _, _, _} -> ok;
      _ -> scan_text (Outf, Inp1)
   end.

print_token (Outf, {Tok, Arg, Line_no, Column_no}) ->
   S_line_no = erlang:integer_to_list (Line_no),
   S_column_no = erlang:integer_to_list (Column_no),
   io:put_chars (Outf, string:pad (S_line_no, 5, leading)),
   io:put_chars (Outf, " "),
   io:put_chars (Outf, string:pad (S_column_no, 5, leading)),
   io:put_chars (Outf, "  "),
   io:put_chars (Outf, Tok),
   {Padding, Arg1} =
      case Tok of
         "Identifier" -> {"     ", Arg};
         "Integer" -> {"        ", Arg};
         "String" -> {"         ", Arg};
         _ -> {"", ""}
      end,
   io:put_chars (Outf, Padding),
   io:put_chars (Outf, Arg1),
   io:put_chars ("\n").

%%%-------------------------------------------------------------------
%%%
%%% The token dispatcher.
%%%

get_next_token (Inp) ->
   Inp00 = skip_spaces_and_comments (Inp),
   {Ch, Inp0} = get_ch (Inp00),
   {Char, Line_no, Column_no} = Ch,
   Ln = Line_no,
   Cn = Column_no,
   case Char of
      eof -> {{"End_of_input", "", Ln, Cn}, Inp0};
      "," -> {{"Comma", ",", Ln, Cn}, Inp0};
      ";" -> {{"Semicolon", ";", Ln, Cn}, Inp0};
      "(" -> {{"LeftParen", "(", Ln, Cn}, Inp0};
      ")" -> {{"RightParen", ")", Ln, Cn}, Inp0};
      "{" -> {{"LeftBrace", "{", Ln, Cn}, Inp0};
      "}" -> {{"RightBrace", "}", Ln, Cn}, Inp0};
      "*" -> {{"Op_multiply", "*", Ln, Cn}, Inp0};
      "/" -> {{"Op_divide", "/", Ln, Cn}, Inp0};
      "%" -> {{"Op_mod", "%", Ln, Cn}, Inp0};
      "+" -> {{"Op_add", "+", Ln, Cn}, Inp0};
      "-" -> {{"Op_subtract", "-", Ln, Cn}, Inp0};
      "<" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "=" -> {{"Op_lessequal", "<=", Ln, Cn}, Inp1};
            _ -> {{"Op_less", "<", Ln, Cn}, push_back (Ch1, Inp1)}
         end;
      ">" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "=" -> {{"Op_greaterequal", ">=", Ln, Cn}, Inp1};
            _ -> {{"Op_greater", ">", Ln, Cn}, push_back (Ch1, Inp1)}
         end;
      "=" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "=" -> {{"Op_equal", "==", Ln, Cn}, Inp1};
            _ -> {{"Op_assign", "=", Ln, Cn}, push_back (Ch1, Inp1)}
         end;
      "!" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "=" -> {{"Op_notequal", "!=", Ln, Cn}, Inp1};
            _ -> {{"Op_not", "!", Ln, Cn}, push_back (Ch1, Inp1)}
         end;
      "&" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "&" -> {{"Op_and", "&&", Ln, Cn}, Inp1};
            _ -> unexpected_character (Ln, Cn, Char)
         end;
      "|" ->
         {Ch1, Inp1} = get_ch (Inp0),
         {Char1, _, _} = Ch1,
         case Char1 of
            "|" -> {{"Op_or", "||", Ln, Cn}, Inp1};
            _ -> unexpected_character (Ln, Cn, Char)
         end;
      "\"" ->
         Inp1 = push_back (Ch, Inp0),
         scan_string_literal (Inp1);
      "'" ->
         Inp1 = push_back (Ch, Inp0),
         scan_character_literal (Inp1);
      _ ->
         case is_digit (Char) of
            true ->
               Inp1 = push_back (Ch, Inp0),
               scan_integer_literal (Inp1);
            false ->
               case is_alpha_or_underscore (Char) of
                  true ->
                     Inp1 = push_back (Ch, Inp0),
                     scan_identifier_or_reserved_word (Inp1);
                  false ->
                     unexpected_character (Ln, Cn, Char)
               end
         end
   end.

%%%-------------------------------------------------------------------
%%%
%%% Skipping past spaces and /* ... */ comments.
%%%
%%% Comments are treated exactly like a bit of whitespace. They never
%%% make it to the dispatcher.
%%%

skip_spaces_and_comments (Inp) ->
   {Ch, Inp0} = get_ch (Inp),
   {Char, Line_no, Column_no} = Ch,
   case classify_char (Char) of
      eof -> push_back (Ch, Inp0);
      space -> skip_spaces_and_comments (Inp0);
      slash ->
         {Ch1, Inp1} = get_ch (Inp0),
         case Ch1 of
            {"*", _, _} ->
               Inp2 = scan_comment (Inp1, Line_no, Column_no),
               skip_spaces_and_comments (Inp2);
            _ -> push_back (Ch, (push_back (Ch1, Inp1)))
         end;
      other -> push_back (Ch, Inp0)
   end.

classify_char (Char) ->
   case Char of
      eof -> eof;
      "/" -> slash;
      _ -> case is_space (Char) of
              true -> space;
              false -> other
           end
   end.

scan_comment (Inp, Line_no, Column_no) ->
   {Ch0, Inp0} = get_ch (Inp),
   case Ch0 of
      {eof, _, _} -> unterminated_comment (Line_no, Column_no);
      {"*", _, _} ->
         {Ch1, Inp1} = get_ch (Inp0),
         case Ch1 of
            {eof, _, _} ->
               unterminated_comment (Line_no, Column_no);
            {"/", _, _} -> Inp1;
            _ -> scan_comment (Inp1, Line_no, Column_no)
         end;
      _ -> scan_comment (Inp0, Line_no, Column_no)
   end.

is_space (S) ->
   case re:run (S, "^[[:space:]]+$") of
      {match, _} -> true;
      _ -> false
   end.

%%%-------------------------------------------------------------------
%%%
%%% Scanning of integer literals, identifiers, and reserved words.
%%%
%%% These three types of token are very similar to each other.
%%%

scan_integer_literal (Inp) ->
   %% Scan an entire word, not just digits. This way we detect
   %% erroneous text such as "23skidoo".
   {Line_no, Column_no, Inp1} = get_position (Inp),
   {Word, Inp2} = scan_word (Inp1),
   case is_digit (Word) of
      true -> {{"Integer", Word, Line_no, Column_no}, Inp2};
      false -> invalid_integer_literal (Line_no, Column_no, Word)
   end.

scan_identifier_or_reserved_word (Inp) ->
   %% It is assumed that the first character is of the correct type,
   %% thanks to the dispatcher.
   {Line_no, Column_no, Inp1} = get_position (Inp),
   {Word, Inp2} = scan_word (Inp1),
   Tok =
      case Word of
         "if" -> "Keyword_if";
         "else" -> "Keyword_else";
         "while" -> "Keyword_while";
         "print" -> "Keyword_print";
         "putc" -> "Keyword_putc";
         _ -> "Identifier"
      end,
   {{Tok, Word, Line_no, Column_no}, Inp2}.

scan_word (Inp) ->
   scan_word_loop (Inp, "").

scan_word_loop (Inp, Word0) ->
   {Ch1, Inp1} = get_ch (Inp),
   {Char1, _, _} = Ch1,
   case is_alnum_or_underscore (Char1) of
      true -> scan_word_loop (Inp1, Word0 ++ Char1);
      false -> {Word0, push_back (Ch1, Inp1)}
   end.

get_position (Inp) ->
   {Ch1, Inp1} = get_ch (Inp),
   {_, Line_no, Column_no} = Ch1,
   Inp2 = push_back (Ch1, Inp1),
   {Line_no, Column_no, Inp2}.

is_digit (S) ->
   case re:run (S, "^[[:digit:]]+$") of
      {match, _} -> true;
      _ -> false
   end.

is_alpha_or_underscore (S) ->
   case re:run (S, "^[[:alpha:]_]+$") of
      {match, _} -> true;
      _ -> false
   end.

is_alnum_or_underscore (S) ->
   case re:run (S, "^[[:alnum:]_]+$") of
      {match, _} -> true;
      _ -> false
   end.

%%%-------------------------------------------------------------------
%%%
%%% Scanning of string literals.
%%%
%%% It is assumed that the first character is the opening quote, and
%%% that the closing quote is the same character.
%%%


scan_string_literal (Inp) ->
   {Ch1, Inp1} = get_ch (Inp),
   {Quote_mark, Line_no, Column_no} = Ch1,
   {Contents, Inp2} = scan_str_lit (Inp1, Ch1),
   Toktup = {"String", Quote_mark ++ Contents ++ Quote_mark,
             Line_no, Column_no},
   {Toktup, Inp2}.

scan_str_lit (Inp, Ch) -> scan_str_lit_loop (Inp, Ch, "").

scan_str_lit_loop (Inp, Ch, Contents) ->
   {Quote_mark, Line_no, Column_no} = Ch,
   {Ch1, Inp1} = get_ch (Inp),
   {Char1, Line_no1, Column_no1} = Ch1,
   case Char1 of
      Quote_mark -> {Contents, Inp1};
      eof -> eoi_in_string_literal (Line_no, Column_no);
      "\n" -> eoln_in_string_literal (Line_no, Column_no);
      "\\" ->
         {Ch2, Inp2} = get_ch (Inp1),
         {Char2, _, _} = Ch2,
         case Char2 of
            "n" ->
               scan_str_lit_loop (Inp2, Ch, Contents ++ "\\n");
            "\\" ->
               scan_str_lit_loop (Inp2, Ch, Contents ++ "\\\\");
            _ ->
               unsupported_escape (Line_no1, Column_no1, Char2)
         end;
      _ -> scan_str_lit_loop (Inp1, Ch, Contents ++ Char1)
   end.

%%%-------------------------------------------------------------------
%%%
%%% Scanning of character literals.
%%%
%%% It is assumed that the first character is the opening quote, and
%%% that the closing quote is the same character.
%%%
%%% The tedious part of scanning a character literal is distinguishing
%%% between the kinds of lexical error. (One might wish to modify the
%%% code to detect, as a distinct kind of error, end of line within a
%%% character literal.)
%%%

scan_character_literal (Inp) ->
   {Ch, Inp0} = get_ch (Inp),
   {_, Line_no, Column_no} = Ch,
   {Ch1, Inp1} = get_ch (Inp0),
   {Char1, Line_no1, Column_no1} = Ch1,
   {Intval, Inp3} =
      case Char1 of
         eof -> unterminated_character_literal (Line_no, Column_no);
         "\\" ->
            {Ch2, Inp2} = get_ch (Inp1),
            {Char2, _, _} = Ch2,
            case Char2 of
               eof -> unterminated_character_literal (Line_no,
                                                      Column_no);
               "n" -> {char_to_code ("\n"), Inp2};
               "\\" -> {char_to_code ("\\"), Inp2};
               _ -> unsupported_escape (Line_no1, Column_no1,
                                        Char2)
            end;
         _ -> {char_to_code (Char1), Inp1}
      end,
   Inp4 = check_character_literal_end (Inp3, Ch),
   {{"Integer", Intval, Line_no, Column_no}, Inp4}.

char_to_code (Char) ->
   %% Hat tip to https://archive.ph/BxZRS
   lists:flatmap (fun erlang:integer_to_list/1, Char).

check_character_literal_end (Inp, Ch) ->
   {Char, _, _} = Ch,
   {{Char1, _, _}, Inp1} = get_ch (Inp),
   case Char1 of
      Char -> Inp1;
      _ -> find_char_lit_end (Inp1, Ch)    % Handle a lexical error.
   end.

find_char_lit_end (Inp, Ch) ->
   %% There is a lexical error. Determine which kind it fits into.
   {Char, Line_no, Column_no} = Ch,
   {{Char1, _, _}, Inp1} = get_ch (Inp),
   case Char1 of
      Char -> multicharacter_literal (Line_no, Column_no);
      eof -> unterminated_character_literal (Line_no, Column_no);
      _ -> find_char_lit_end (Inp1, Ch)
   end.

%%%-------------------------------------------------------------------
%%%
%%% Character-at-a-time input, with unrestricted pushback, and with
%%% line and column numbering.
%%%

make_inp (Inpf) ->
   #inp_t{inpf = Inpf,
          pushback = [],
          line_no = 1,
          column_no = 1}.

get_ch (Inp) ->
   #inp_t{inpf = Inpf,
          pushback = Pushback,
          line_no = Line_no,
          column_no = Column_no} = Inp,
   case Pushback of
      [Ch | Tail] ->
         Inp1 = Inp#inp_t{pushback = Tail},
         {Ch, Inp1};
      [] ->
         case io:get_chars (Inpf, "", 1) of
            eof ->
               Ch = {eof, Line_no, Column_no},
               {Ch, Inp};
            {error, _} ->
               Ch = {eof, Line_no, Column_no},
               {Ch, Inp};
            Char ->
               case Char of
                  "\n" ->
                     Ch = {Char, Line_no, Column_no},
                     Inp1 = Inp#inp_t{line_no = Line_no + 1,
                                      column_no = 1},
                     {Ch, Inp1};
                  _ -> 
                     Ch = {Char, Line_no, Column_no},
                     Inp1 =
                        Inp#inp_t{column_no = Column_no + 1},
                     {Ch, Inp1}
               end
         end
   end.

push_back (Ch, Inp) ->
   Inp#inp_t{pushback = [Ch | Inp#inp_t.pushback]}.

%%%-------------------------------------------------------------------

invalid_integer_literal (Line_no, Column_no, Word) ->
   error_abort ("invalid integer literal \"" ++
                   Word ++ "\" at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

unsupported_escape (Line_no, Column_no, Char) ->
   error_abort ("unsupported escape \\" ++
                   Char ++ " at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

unexpected_character (Line_no, Column_no, Char) ->
   error_abort ("unexpected character '" ++
                   Char ++ "' at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

eoi_in_string_literal (Line_no, Column_no) ->
   error_abort ("end of input in string literal starting at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

eoln_in_string_literal (Line_no, Column_no) ->
   error_abort ("end of line in string literal starting at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

unterminated_character_literal (Line_no, Column_no) ->
   error_abort ("unterminated character literal starting at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

multicharacter_literal (Line_no, Column_no) ->
   error_abort ("unsupported multicharacter literal starting at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

unterminated_comment (Line_no, Column_no) ->
   error_abort ("unterminated comment starting at " ++
                   integer_to_list (Line_no) ++ ":" ++
                   integer_to_list (Column_no)).

error_abort (Message) ->
   ProgName = escript:script_name (),
   io:put_chars (standard_error, ProgName),
   io:put_chars (standard_error, ": "),
   io:put_chars (standard_error, Message),
   io:put_chars (standard_error, "\n"),
   halt (1).

%%%-------------------------------------------------------------------
%%% Instructions to GNU Emacs --
%%% local variables:
%%% mode: erlang
%%% erlang-indent-level: 3
%%% end:
%%%-------------------------------------------------------------------


Output:
$ ./lex-in-Erlang compiler-tests/testcase3.t
    5    16  Keyword_print
    5    40  Op_subtract
    6    16  Keyword_putc
    6    40  Op_less
    7    16  Keyword_if
    7    40  Op_greater
    8    16  Keyword_else
    8    40  Op_lessequal
    9    16  Keyword_while
    9    40  Op_greaterequal
   10    16  LeftBrace
   10    40  Op_equal
   11    16  RightBrace
   11    40  Op_notequal
   12    16  LeftParen
   12    40  Op_and
   13    16  RightParen
   13    40  Op_or
   14    16  Op_subtract
   14    40  Semicolon
   15    16  Op_not
   15    40  Comma
   16    16  Op_multiply
   16    40  Op_assign
   17    16  Op_divide
   17    40  Integer        42
   18    16  Op_mod
   18    40  String         "String literal"
   19    16  Op_add
   19    40  Identifier     variable_name
   20    26  Integer        10
   21    26  Integer        92
   22    26  Integer        32
   23     1  End_of_input

Euphoria

Tested with Euphoria 4.05.

include std/io.e
include std/map.e
include std/types.e
include std/convert.e

constant true = 1, false = 0, EOF = -1

enum tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
    tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
    tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
    tk_Ident, tk_Integer, tk_String

constant all_syms = {"End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add",
    "Op_subtract", "Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater",
    "Op_greaterequal", "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or",
    "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
    "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
    "Identifier", "Integer", "String"}

integer input_file, the_ch = ' ', the_col = 0, the_line = 1
sequence symbols
map key_words = new()

procedure error(sequence format, sequence data)
    printf(STDOUT, format, data)
    abort(1)
end procedure

-- get the next character from the input
function next_ch()
    the_ch = getc(input_file)
    the_col += 1
    if the_ch = '\n' then
        the_line += 1
        the_col = 0
    end if
    return the_ch
end function

-- 'x' - character constants
function char_lit(integer err_line, integer err_col)
    integer n = next_ch()              -- skip opening quote
    if the_ch = '\'' then
        error("%d %d empty character constant", {err_line, err_col})
    elsif the_ch = '\\' then
        next_ch()
        if the_ch = 'n' then
            n = 10
        elsif the_ch = '\\' then
            n = '\\'
        else
            error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch})
        end if
    end if
    if next_ch() != '\'' then
        error("%d %d multi-character constant", {err_line, err_col})
    end if
    next_ch()
    return {tk_Integer, err_line, err_col, n}
end function

-- process divide or comments
function div_or_cmt(integer err_line, integer err_col)
    if next_ch() != '*' then
        return {tk_Div, err_line, err_col}
    end if

    -- comment found
    next_ch()
    while true do
        if the_ch = '*' then
            if next_ch() = '/' then
                next_ch()
                return get_tok()
            end if
        elsif the_ch = EOF then
            error("%d %d EOF in comment", {err_line, err_col})
        else
            next_ch()
        end if
    end while
end function

-- "string"
function string_lit(integer start, integer err_line, integer err_col)
    string text = ""

    while next_ch() != start do
        if the_ch = EOF then
            error("%d %d EOF while scanning string literal", {err_line, err_col})
        end if
        if the_ch = '\n' then
            error("%d %d EOL while scanning string literal", {err_line, err_col})
        end if
        text &= the_ch
    end while

    next_ch()
    return {tk_String, err_line, err_col, text}
end function

-- handle identifiers and integers
function ident_or_int(integer err_line, integer err_col)
    integer n, is_number = true
    string text = ""

    while t_alnum(the_ch) or the_ch = '_' do
        text &= the_ch
        if not t_digit(the_ch) then
            is_number = false
        end if
        next_ch()
    end while

    if length(text) = 0 then
        error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch})
    end if

    if t_digit(text[1]) then
        if not is_number then
            error("%d %d invalid number: %s", {err_line, err_col, text})
        end if
        n = to_integer(text)
        return {tk_Integer, err_line, err_col, n}
    end if

    if has(key_words, text) then
        return {get(key_words, text), err_line, err_col}
    end if

    return {tk_Ident, err_line, err_col, text}
end function

-- look ahead for '>=', etc.
function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)
    if next_ch() = expect then
        next_ch()
        return {ifyes, err_line, err_col}
    end if

    if ifno = tk_EOI then
        error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch})
    end if

    return {ifno, err_line, err_col}
end function

-- return the next token type
function get_tok()
    while t_space(the_ch) do
        next_ch()
    end while

    integer err_line = the_line
    integer err_col  = the_col

    switch the_ch do
        case EOF  then return {tk_EOI, err_line, err_col}
        case '/'  then return div_or_cmt(err_line, err_col)
        case '\'' then return char_lit(err_line, err_col)

        case '<'  then return follow('=', tk_Leq, tk_Lss,    err_line, err_col)
        case '>'  then return follow('=', tk_Geq, tk_Gtr,    err_line, err_col)
        case '='  then return follow('=', tk_Eq,  tk_Assign, err_line, err_col)
        case '!'  then return follow('=', tk_Neq, tk_Not,    err_line, err_col)
        case '&'  then return follow('&', tk_And, tk_EOI,    err_line, err_col)
        case '|'  then return follow('|', tk_Or,  tk_EOI,    err_line, err_col)

        case '"'  then return string_lit(the_ch, err_line, err_col)
        case else
            integer sym = symbols[the_ch]
            if sym  != tk_EOI then
                next_ch()
                return {sym, err_line, err_col}
            end if
            return ident_or_int(err_line, err_col)
    end switch
end function

procedure init()
    put(key_words, "else",    tk_Else)
    put(key_words, "if",      tk_If)
    put(key_words, "print",   tk_Print)
    put(key_words, "putc",    tk_Putc)
    put(key_words, "while",   tk_While)

    symbols = repeat(tk_EOI, 256)
    symbols['{'] = tk_Lbrace
    symbols['}'] = tk_Rbrace
    symbols['('] = tk_Lparen
    symbols[')'] = tk_Rparen
    symbols['+'] = tk_Add
    symbols['-'] = tk_Sub
    symbols['*'] = tk_Mul
    symbols['%'] = tk_Mod
    symbols[';'] = tk_Semi
    symbols[','] = tk_Comma
end procedure

procedure main(sequence cl)
    sequence file_name

    input_file = STDIN
    if length(cl) > 2 then
        file_name = cl[3]
        input_file = open(file_name, "r")
        if input_file = -1 then
            error("Could not open %s", {file_name})
        end if
    end if
    init()
    sequence t
    loop do
        t = get_tok()
        printf(STDOUT, "%5d  %5d %-8s", {t[2], t[3], all_syms[t[1]]})
        switch t[1] do
            case tk_Integer then printf(STDOUT, "  %5d\n",   {t[4]})
            case tk_Ident   then printf(STDOUT, " %s\n",     {t[4]})
            case tk_String  then printf(STDOUT, " \"%s\"\n", {t[4]})
            case else            printf(STDOUT, "\n")
        end switch
        until t[1] = tk_EOI
    end loop
end procedure

main(command_line())
Output  —  test case 3:

    5     16 Keyword_print
    5     40 Op_subtract
    6     16 Keyword_putc
    6     40 Op_less
    7     16 Keyword_if
    7     40 Op_greater
    8     16 Keyword_else
    8     40 Op_lessequal
    9     16 Keyword_while
    9     40 Op_greaterequal
   10     16 LeftBrace
   10     40 Op_equal
   11     16 RightBrace
   11     40 Op_notequal
   12     16 LeftParen
   12     40 Op_and
   13     16 RightParen
   13     40 Op_or
   14     16 Op_subtract
   14     40 Semicolon
   15     16 Op_not
   15     40 Comma
   16     16 Op_multiply
   16     40 Op_assign
   17     16 Op_divide
   17     40 Integer      42
   18     16 Op_mod
   18     40 String   "String literal"
   19     16 Op_add
   19     40 Identifier variable_name
   20     26 Integer      10
   21     26 Integer      92
   22     26 Integer      32
   23      1 End_of_input

Flex

Tested with Flex 2.5.4.

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>

#define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))

typedef enum {
    tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
    tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
    tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
    tk_Ident, tk_Integer, tk_String
} TokenType;

void yyerror(char msg[]) {
    printf(msg);
    exit(1);
}

static int yynval;

struct yylloc {
    int first_line, first_col;
    int last_line, last_col;
} yylloc;

static void update_loc() {
  static int curr_line = 1;
  static int curr_col  = 1;

  yylloc.first_line = curr_line;
  yylloc.first_col  = curr_col;

  {char *s; for (s = yytext; *s != '\0'; s++) {
    if (*s == '\n') {
      curr_line++;
      curr_col = 1;
    } else {
      curr_col++;
    }
  }}

  yylloc.last_line = curr_line;
  yylloc.last_col  = curr_col-1;
}

#define YY_USER_ACTION update_loc();

static int kwd_cmp(const void *p1, const void *p2) {
    return strcmp(*(char **)p1, *(char **)p2);
}

static TokenType get_ident_type(const char *ident) {
    static struct {
        char *s;
        TokenType sym;
    } kwds[] = {
        {"else",  tk_Else},
        {"if",    tk_If},
        {"print", tk_Print},
        {"putc",  tk_Putc},
        {"while", tk_While},
    }, *kwp;

    return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}

%}

%start COMMENT2

%option noyywrap

digit       [0-9]
ident       [a-zA-Z_][a-zA-Z_0-9]*

number       {digit}+
string       \"[^"\n]*\"
char_const   \'([^'\n]|\\n|\\\\)\'

%%

<COMMENT2>[^*]+  ;
<COMMENT2>\*[^/] ;
<COMMENT2>\*\/	 BEGIN 0;		/* end comment */
"/*"		     BEGIN COMMENT2;

"{"      {return tk_Lbrace;}
"}"      {return tk_Rbrace;}
"("      {return tk_Lparen;}
")"      {return tk_Rparen;}
"*"      {return tk_Mul;}
"/"      {return tk_Div;}
"%"      {return tk_Mod;}
"+"      {return tk_Add;}
"-"      {return tk_Sub;}
"<"      {return tk_Lss;}
">"      {return tk_Gtr;}
"<="     {return tk_Leq;}
">="     {return tk_Geq;}
"!="     {return tk_Neq;}
"!"      {return tk_Not;}
"&&"     {return tk_And;}
"||"     {return tk_Or;}
";"      {return tk_Semi;}
","      {return tk_Comma;}
"=="     {return tk_Eq;}
"="      {return tk_Assign;}
{ident}  {return get_ident_type(yytext);}
{string} {return tk_String;}

[ \t\n]+ ; /* ignore whitespace */

{number}     {
                yynval = strtol(yytext, NULL, 0);
                if (yynval == LONG_MAX && errno == ERANGE)
                    yyerror("Number exceeds maximum value");

                return tk_Integer;
             }

{char_const} {
                int n = yytext[1];
                char *p = yytext;

                if (yyleng < 3)
                    yyerror("empty character constant");
                ++p;
                if (p[0] == '\\') {
                    ++p;
                    if (p[0] == 'n')
                        n = 10;
                    else if (p[0] == '\\')
                        n = '\\';
                    else
                        yyerror("unknown escape sequence");
                }
                yynval = n;
                return tk_Integer;
             }

.            yyerror("Unknown character\n");

%%

int main(int argc, char *argv[]) {
    int tok;

    ++argv, --argc;  /* skip over program name */
    yyin = stdin;
    if (argc > 0)
        yyin = fopen(argv[0], "r");

    do {
        tok = yylex();
        printf("%5d  %5d %.15s", yylloc.first_line, yylloc.first_col,
            &"End_of_input    Op_multiply     Op_divide       Op_mod          Op_add          "
             "Op_subtract     Op_negate       Op_not          Op_less         Op_lessequal    "
             "Op_greater      Op_greaterequal Op_equal        Op_notequal     Op_assign       "
             "Op_and          Op_or           Keyword_if      Keyword_else    Keyword_while   "
             "Keyword_print   Keyword_putc    LeftParen       RightParen      LeftBrace       "
             "RightBrace      Semicolon       Comma           Identifier      Integer         "
             "String          "
            [tok * 16]);

        if (tok == tk_Integer)     printf("   %5d", yynval);
        else if (tok == tk_Ident)  printf("  %s",   yytext);
        else if (tok == tk_String) printf("  %s",   yytext);
        printf("\n");
    } while (tok != tk_EOI);
    return 0;
}
Output  —  test case 3:

    5     16 Keyword_print
    5     40 Op_subtract
    6     16 Keyword_putc
    6     40 Op_less
    7     16 Keyword_if
    7     40 Op_greater
    8     16 Keyword_else
    8     40 Op_lessequal
    9     16 Keyword_while
    9     40 Op_greaterequal
   10     16 LeftBrace
   10     40 Op_equal
   11     16 RightBrace
   11     40 Op_notequal
   12     16 LeftParen
   12     40 Op_and
   13     16 RightParen
   13     40 Op_or
   14     16 Op_subtract
   14     40 Semicolon
   15     16 Op_not
   15     40 Comma
   16     16 Op_multiply
   16     40 Op_assign
   17     16 Op_divide
   17     40 Integer              42
   18     16 Op_mod
   18     40 String           "String literal"
   19     16 Op_add
   19     40 Identifier       variable_name
   20     26 Integer              10
   21     26 Integer              92
   22     26 Integer              32
   22     29 End_of_input

Forth

Tested with Gforth 0.7.3.

CREATE BUF 0 ,              \ single-character look-ahead buffer
CREATE COLUMN# 0 ,
CREATE LINE# 1 ,

: NEWLINE? ( c -- t|f)  DUP 10 = SWAP  13 =  OR ;
: +IN ( c --)
   1 SWAP  NEWLINE?
   IF 0 COLUMN# ! LINE# ELSE COLUMN# THEN
   +!  0 BUF ! ;
: PEEK   BUF @ 0= IF STDIN KEY-FILE BUF ! THEN BUF @ ;
: GETC   PEEK  DUP +IN ;
: SKIP   GETC DROP ;
: .LOCATION   7 .R  4 .R SPACE ;
: WHERE   COLUMN# @ LINE# @ ;
: .WHERE    WHERE .LOCATION ;
: .WHERE+   WHERE  SWAP 1+ SWAP .LOCATION ;

: EXPECT   GETC  OVER OVER =
   IF 2DROP
   ELSE CR ." stdin:" COLUMN# @ 0 LINE# @ 0
      <# #s #> TYPE ." :" <# #s #> TYPE ." : "
      ." unexpected `" EMIT ." ', expecting `" EMIT ." '" CR
      BYE
   THEN ;
: EQ   PEEK [CHAR] = = IF SKIP 2SWAP THEN
       ." Op_" TYPE CR  2DROP ;

CREATE ESC  4 C, CHAR $ C, CHAR $ C, CHAR \ C, 0 C,
: ?ESC?   CR ." Unknown escape sequence `\" EMIT ." '" CR BYE ;
: >ESC   ESC 4 + C!  ESC ;
: $$\n   10 ;
: $$\\   [CHAR] \ ;
: ESCAPE   DUP >ESC FIND IF NIP EXECUTE ELSE DROP ?ESC? THEN ;
: ?ESCAPE   DUP [CHAR] \ = IF DROP GETC ESCAPE THEN ;
: ?EOF   DUP 4 = IF CR ." End-of-file in string" CR BYE THEN ;
: ?EOL   DUP NEWLINE? 
         IF CR ." End-of-line in string" CR BYE THEN ;
: STRING   PAD
   BEGIN  GETC ?EOF ?EOL DUP  [CHAR] " <>
   WHILE  OVER C! CHAR+
   REPEAT DROP  PAD TUCK - ;
: "TYPE"   [CHAR] " EMIT  TYPE  [CHAR] " EMIT ;

CREATE TOKEN  4 C, CHAR $ C, CHAR $ C, 0 C, 0 C,
: >HEX   DUP 9 > IF 7 + THEN [CHAR] 0 + ;
: HI!   $F0 AND  2/ 2/ 2/ 2/ >HEX  TOKEN 3 + C! ;
: LO!   $0F AND  >HEX TOKEN 4 + C! ;
: >TOKEN   DUP HI! LO!  TOKEN ;

: ?EOF   DUP 4 = IF CR ." End-of-file in comment" CR BYE THEN ;
: $$2F   PEEK [CHAR] * =
   IF SKIP
       BEGIN 
   	GETC ?EOF  [CHAR] * =
   	PEEK [CHAR] / =  AND
       UNTIL  SKIP
   ELSE  .WHERE ." Op_divide" CR THEN ;
: $$22   .WHERE ." String " STRING "TYPE" CR ;
: $$27   .WHERE GETC ?ESCAPE ." Integer " . [CHAR] ' EXPECT CR ;
: $$04   .WHERE ." End_of_input" CR BYE ;
: $$2D   .WHERE ." Op_subtract" CR ;
: $$2B   .WHERE ." Op_add" CR ;
: $$25   .WHERE ." Op_mod" CR ;
: $$2A   .WHERE ." Op_multiply" CR ;
: $$7B   .WHERE ." LeftBrace" CR ;
: $$7D   .WHERE ." RightBrace" CR ;
: $$2C   .WHERE ." Comma" CR ;
: $$29   .WHERE ." RightParen" CR ;
: $$28   .WHERE ." LeftParen" CR ;
: $$3B   .WHERE ." Semicolon" CR ;
: $$3D   .WHERE s" equal" s" assign" EQ ;
: $$21   .WHERE s" notequal" s" not" EQ ;
: $$3C   .WHERE s" lessequal" s" less" EQ ;
: $$3E   .WHERE s" greaterequal" s" greater" EQ ;
: $$26   .WHERE [CHAR] & EXPECT  ." Op_and" CR ;
: $$7C   .WHERE [CHAR] | EXPECT  ." Op_or" CR ;
: $$20   ;   \ space

CREATE KEYWORD  0 C, CHAR $ C, CHAR $ C, 5 CHARS ALLOT
: >KEYWORD   DUP  2 + KEYWORD C!
             KEYWORD 3 + SWAP CMOVE  KEYWORD ;
: FIND-KW   DUP 5 <=
   IF 2DUP >KEYWORD FIND
      IF TRUE 2SWAP 2DROP ELSE DROP FALSE THEN
   ELSE FALSE THEN ;

: $$if   ." Keyword_if" ;
: $$else   ." Keyword_else" ;
: $$while   ." Keyword_while" ;
: $$print   ." Keyword_print" ;
: $$putc   ." Keyword_putc" ;

: DIGIT?   48 58 WITHIN ;
: ALPHA?   DUP  95 = SWAP		  \ underscore?
           DUP 97 123 WITHIN SWAP	  \ lower?
           65 91 WITHIN  OR OR ;	  \ upper?
: ALNUM?   DUP DIGIT? SWAP  ALPHA? OR ;
: INTEGER   0
   BEGIN  PEEK DIGIT?
   WHILE  GETC [CHAR] 0 -  SWAP 10 * +
   REPEAT ;
: ?INTEGER?   CR ." Invalid number" CR BYE ;
: ?INTEGER   PEEK ALPHA? IF ?INTEGER? THEN ;
: DIGIT   .WHERE+ ." Integer " INTEGER ?INTEGER . CR ;
: NAME   PAD
         BEGIN  PEEK ALNUM?
	 WHILE GETC OVER C! CHAR+
	 REPEAT  PAD TUCK - ;
: IDENT   ." Identifier " TYPE ;
: ALPHA   .WHERE+ NAME FIND-KW
          IF EXECUTE ELSE IDENT THEN CR ;
: ?CHAR?   CR ." Character '" EMIT ." ' not recognized" CR BYE ;
: SPACE?   DUP BL = SWAP  9 14 WITHIN  OR ;
: SKIP-SPACE   BEGIN PEEK SPACE? WHILE SKIP REPEAT ;
: CONSUME
   SKIP-SPACE
   PEEK DIGIT? IF DIGIT ELSE
    PEEK ALPHA? IF ALPHA ELSE
     PEEK >TOKEN FIND
     IF SKIP EXECUTE ELSE GETC ?CHAR? BYE THEN
   THEN THEN ;
: TOKENIZE   BEGIN CONSUME AGAIN ;
TOKENIZE
Output:

Tested against all programs in Compiler/Sample programs.

Fortran

Works with: gfortran version 11.2.1

You should call the source file ‘lex.F90’, so gfortran will know to use the C preprocessor. I use the preprocessor to select between different ways to read stream input from the standard input.

(Despite the ‘.F90’ extension that I recommend, this is Fortran 2008/2018 code.)

There is ‘framework’ for supporting Unicode, but no actual Unicode support. To support Unicode reliably I would probably use the C interface and GNU libunistring.

The author has placed this Fortran code in the public domain.

!!!
!!! An implementation of the Rosetta Code lexical analyzer task:
!!! https://rosettacode.org/wiki/Compiler/lexical_analyzer
!!!
!!! The C implementation was used as a reference on behavior, but was
!!! not adhered to for the implementation.
!!!

module string_buffers
  use, intrinsic :: iso_fortran_env, only: error_unit
  use, intrinsic :: iso_fortran_env, only: int64

  implicit none
  private

  public :: strbuf_t
  public :: strbuf_t_length_kind
  public :: strbuf_t_character_kind

  integer, parameter :: strbuf_t_length_kind = int64

  ! String buffers can handle Unicode.
  integer, parameter :: strbuf_t_character_kind = selected_char_kind ('ISO_10646')

  ! Private abbreviations.
  integer, parameter :: nk = strbuf_t_length_kind
  integer, parameter :: ck = strbuf_t_character_kind

  type :: strbuf_t
     integer(kind = nk), private :: len = 0
     !
     ! ‘chars’ is made public for efficient access to the individual
     ! characters.
     !
     character(1, kind = ck), allocatable, public :: chars(:)
   contains
     procedure, pass, private :: ensure_storage => strbuf_t_ensure_storage
     procedure, pass :: to_unicode => strbuf_t_to_unicode
     procedure, pass :: length => strbuf_t_length
     procedure, pass :: set => strbuf_t_set
     procedure, pass :: append => strbuf_t_append
     generic :: assignment(=) => set
  end type strbuf_t

contains

  function strbuf_t_to_unicode (strbuf) result (s)
    class(strbuf_t), intent(in) :: strbuf
    character(:, kind = ck), allocatable :: s

    !
    ! This does not actually ensure that the string is valid Unicode;
    ! any 31-bit ‘character’ is supported.
    !

    integer(kind = nk) :: i

    allocate (character(len = strbuf%len, kind = ck) :: s)
    do i = 1, strbuf%len
       s(i:i) = strbuf%chars(i)
    end do
  end function strbuf_t_to_unicode

  elemental function strbuf_t_length (strbuf) result (n)
    class(strbuf_t), intent(in) :: strbuf
    integer(kind = nk) :: n

    n = strbuf%len
  end function strbuf_t_length

  elemental function next_power_of_two (x) result (y)
    integer(kind = nk), intent(in) :: x
    integer(kind = nk) :: y

    !
    ! It is assumed that no more than 64 bits are used.
    !
    ! The branch-free algorithm is that of
    ! https://archive.is/nKxAc#RoundUpPowerOf2
    !
    ! Fill in bits until one less than the desired power of two is
    ! reached, and then add one.
    !

    y = x - 1
    y = ior (y, ishft (y, -1))
    y = ior (y, ishft (y, -2))
    y = ior (y, ishft (y, -4))
    y = ior (y, ishft (y, -8))
    y = ior (y, ishft (y, -16))
    y = ior (y, ishft (y, -32))
    y = y + 1
  end function next_power_of_two

  elemental function new_storage_size (length_needed) result (size)
    integer(kind = nk), intent(in) :: length_needed
    integer(kind = nk) :: size

    ! Increase storage by orders of magnitude.

    if (2_nk**32 < length_needed) then
       size = huge (1_nk)
    else
       size = next_power_of_two (length_needed)
    end if
  end function new_storage_size

  subroutine strbuf_t_ensure_storage (strbuf, length_needed)
    class(strbuf_t), intent(inout) :: strbuf
    integer(kind = nk), intent(in) :: length_needed

    integer(kind = nk) :: new_size
    type(strbuf_t) :: new_strbuf

    if (.not. allocated (strbuf%chars)) then
       ! Initialize a new strbuf%chars array.
       new_size = new_storage_size (length_needed)
       allocate (strbuf%chars(1:new_size))
    else if (ubound (strbuf%chars, 1) < length_needed) then
       ! Allocate a new strbuf%chars array, larger than the current
       ! one, but containing the same characters.
       new_size = new_storage_size (length_needed)
       allocate (new_strbuf%chars(1:new_size))
       new_strbuf%chars(1:strbuf%len) = strbuf%chars(1:strbuf%len)
       call move_alloc (new_strbuf%chars, strbuf%chars)
    end if
  end subroutine strbuf_t_ensure_storage

  subroutine strbuf_t_set (dst, src)
    class(strbuf_t), intent(inout) :: dst
    class(*), intent(in) :: src

    integer(kind = nk) :: n
    integer(kind = nk) :: i

    select type (src)
    type is (character(*, kind = ck))
       n = len (src, kind = nk)
       call dst%ensure_storage(n)
       do i = 1, n
          dst%chars(i) = src(i:i)
       end do
       dst%len = n
    type is (character(*))
       n = len (src, kind = nk)
       call dst%ensure_storage(n)
       do i = 1, n
          dst%chars(i) = src(i:i)
       end do
       dst%len = n
    class is (strbuf_t)
       n = src%len
       call dst%ensure_storage(n)
       dst%chars(1:n) = src%chars(1:n)
       dst%len = n
    class default
       error stop
    end select
  end subroutine strbuf_t_set

  subroutine strbuf_t_append (dst, src)
    class(strbuf_t), intent(inout) :: dst
    class(*), intent(in) :: src

    integer(kind = nk) :: n_dst, n_src, n
    integer(kind = nk) :: i

    select type (src)
    type is (character(*, kind = ck))
       n_dst = dst%len
       n_src = len (src, kind = nk)
       n = n_dst + n_src
       call dst%ensure_storage(n)
       do i = 1, n_src
          dst%chars(n_dst + i) = src(i:i)
       end do
       dst%len = n
    type is (character(*))
       n_dst = dst%len
       n_src = len (src, kind = nk)
       n = n_dst + n_src
       call dst%ensure_storage(n)
       do i = 1, n_src
          dst%chars(n_dst + i) = src(i:i)
       end do
       dst%len = n
    class is (strbuf_t)
       n_dst = dst%len
       n_src = src%len
       n = n_dst + n_src
       call dst%ensure_storage(n)
       dst%chars((n_dst + 1):n) = src%chars(1:n_src)
       dst%len = n
    class default
       error stop
    end select
  end subroutine strbuf_t_append

end module string_buffers

module lexical_analysis
  use, intrinsic :: iso_fortran_env, only: input_unit
  use, intrinsic :: iso_fortran_env, only: output_unit
  use, intrinsic :: iso_fortran_env, only: error_unit
  use, intrinsic :: iso_fortran_env, only: int32
  use, non_intrinsic :: string_buffers

  implicit none
  private

  public :: lexer_input_t
  public :: lexer_output_t
  public :: run_lexer

  integer, parameter :: input_file_unit_no = 100
  integer, parameter :: output_file_unit_no = 101

  ! Private abbreviations.
  integer, parameter :: nk = strbuf_t_length_kind
  integer, parameter :: ck = strbuf_t_character_kind

  ! Integers large enough for a Unicode code point. Unicode code
  ! points (and UCS-4) have never been allowed to go higher than
  ! 7FFFFFFF, and are even further restricted now.
  integer, parameter :: ichar_kind = int32

  character(1, kind = ck), parameter :: horizontal_tab_char = char (9, kind = ck)
  character(1, kind = ck), parameter :: linefeed_char = char (10, kind = ck)
  character(1, kind = ck), parameter :: vertical_tab_char = char (11, kind = ck)
  character(1, kind = ck), parameter :: formfeed_char = char (12, kind = ck)
  character(1, kind = ck), parameter :: carriage_return_char = char (13, kind = ck)
  character(1, kind = ck), parameter :: space_char = ck_' '

  ! The following is correct for Unix and its relatives.
  character(1, kind = ck), parameter :: newline_char = linefeed_char

  character(1, kind = ck), parameter :: backslash_char = char (92, kind = ck)

  character(*, kind = ck), parameter :: newline_intstring = ck_'10'
  character(*, kind = ck), parameter :: backslash_intstring = ck_'92'

  integer, parameter :: tk_EOI = 0
  integer, parameter :: tk_Mul = 1
  integer, parameter :: tk_Div = 2
  integer, parameter :: tk_Mod = 3
  integer, parameter :: tk_Add = 4
  integer, parameter :: tk_Sub = 5
  integer, parameter :: tk_Negate = 6
  integer, parameter :: tk_Not = 7
  integer, parameter :: tk_Lss = 8
  integer, parameter :: tk_Leq = 9
  integer, parameter :: tk_Gtr = 10
  integer, parameter :: tk_Geq = 11
  integer, parameter :: tk_Eq = 12
  integer, parameter :: tk_Neq = 13
  integer, parameter :: tk_Assign = 14
  integer, parameter :: tk_And = 15
  integer, parameter :: tk_Or = 16
  integer, parameter :: tk_If = 17
  integer, parameter :: tk_Else = 18
  integer, parameter :: tk_While = 19
  integer, parameter :: tk_Print = 20
  integer, parameter :: tk_Putc = 21
  integer, parameter :: tk_Lparen = 22
  integer, parameter :: tk_Rparen = 23
  integer, parameter :: tk_Lbrace = 24
  integer, parameter :: tk_Rbrace = 25
  integer, parameter :: tk_Semi = 26
  integer, parameter :: tk_Comma = 27
  integer, parameter :: tk_Ident = 28
  integer, parameter :: tk_Integer = 29
  integer, parameter :: tk_String = 30

  character(len = 16), parameter :: token_names(0:30) = &
       & (/ "End_of_input    ", "Op_multiply     ", "Op_divide       ", "Op_mod          ", "Op_add          ", &
       &    "Op_subtract     ", "Op_negate       ", "Op_not          ", "Op_less         ", "Op_lessequal    ", &
       &    "Op_greater      ", "Op_greaterequal ", "Op_equal        ", "Op_notequal     ", "Op_assign       ", &
       &    "Op_and          ", "Op_or           ", "Keyword_if      ", "Keyword_else    ", "Keyword_while   ", &
       &    "Keyword_print   ", "Keyword_putc    ", "LeftParen       ", "RightParen      ", "LeftBrace       ", &
       &    "RightBrace      ", "Semicolon       ", "Comma           ", "Identifier      ", "Integer         ", &
       &    "String          " /)

  type :: token_t
     integer :: token_no

     ! Our implementation stores the value of a tk_Integer as a
     ! string. The C reference implementation stores it as an int.
     character(:, kind = ck), allocatable :: val

     integer(nk) :: line_no
     integer(nk) :: column_no
  end type token_t

  type :: lexer_input_t
     logical, private :: using_input_unit = .true.
     integer, private :: unit_no = -(huge (1))
     integer(kind = nk) :: line_no = 1
     integer(kind = nk) :: column_no = 0
     integer, private :: unget_count = 0

     ! The maximum lookahead is 2, although I believe we are using
     ! only 1. In principle, the lookahead could be any finite number.
     character(1, kind = ck), private :: unget_buffer(1:2)
     logical, private :: unget_eof_buffer(1:2)

     ! Using the same strbuf_t multiple times reduces the need for
     ! reallocations. Putting that strbuf_t in the lexer_input_t is
     ! simply for convenience.
     type(strbuf_t), private :: strbuf

   contains
     !
     ! Note: There is currently no facility for closing one input and
     !       switching to another.
     !
     ! Note: There is currently no facility to decode inputs into
     !       Unicode codepoints. Instead, what happens is raw bytes of
     !       input get stored as strbuf_t_character_kind values. This
     !       behavior is adequate for ASCII inputs.
     !
     procedure, pass :: use_file => lexer_input_t_use_file
     procedure, pass :: get_next_ch => lexer_input_t_get_next_ch
     procedure, pass :: unget_ch => lexer_input_t_unget_ch
     procedure, pass :: unget_eof => lexer_input_t_unget_eof
  end type lexer_input_t

  type :: lexer_output_t
     integer, private :: unit_no = output_unit
   contains
     procedure, pass :: use_file => lexer_output_t_use_file
     procedure, pass :: output_token => lexer_output_t_output_token
  end type lexer_output_t

contains

  subroutine lexer_input_t_use_file (inputter, filename)
    class(lexer_input_t), intent(inout) :: inputter
    character(*), intent(in) :: filename

    integer :: stat

    inputter%using_input_unit = .false.
    inputter%unit_no = input_file_unit_no
    inputter%line_no = 1
    inputter%column_no = 0

    open (unit = input_file_unit_no, file = filename, status = 'old', &
         & action = 'read', access = 'stream', form = 'unformatted',  &
         & iostat = stat)
    if (stat /= 0) then
       write (error_unit, '("Error: failed to open ", A, " for input")') filename
       stop 1
    end if
  end subroutine lexer_input_t_use_file

!!!
!!! If you tell gfortran you want -std=f2008 or -std=f2018, you likely
!!! will need to add also -fall-intrinsics or -U__GFORTRAN__
!!!
!!! The first way, you get the FGETC intrinsic. The latter way, you
!!! get the C interface code that uses getchar(3).
!!!
#ifdef __GFORTRAN__

  subroutine get_input_unit_char (c, stat)
    !
    ! The following works if you are using gfortran.
    !
    ! (FGETC is considered a feature for backwards compatibility with
    ! g77. However, I know of no way to reconfigure input_unit as a
    ! Fortran 2003 stream, for use with ordinary ‘read’.)
    !
    character, intent(inout) :: c
    integer, intent(out) :: stat

    call fgetc (input_unit, c, stat)
  end subroutine get_input_unit_char

#else

  subroutine get_input_unit_char (c, stat)
    !