Compiler/lexical analyzer

You are encouraged to solve this task according to the task description, using any language you may know.
Definition from Wikipedia:
- Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.
The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:
- Operators
Name             Common name             Character sequence
Op_multiply      multiply                *
Op_divide        divide                  /
Op_mod           mod                     %
Op_add           plus                    +
Op_subtract      minus                   -
Op_negate        unary minus             -
Op_less          less than               <
Op_lessequal     less than or equal      <=
Op_greater       greater than            >
Op_greaterequal  greater than or equal   >=
Op_equal         equal                   ==
Op_notequal      not equal               !=
Op_not           unary not               !
Op_assign        assignment              =
Op_and           logical and             &&
Op_or            logical or              ||
- The "-" token should always be interpreted as Op_subtract by the lexer. Turning some Op_subtract tokens into Op_negate is the job of the syntax analyzer, which is not part of this task.
- Symbols
Name        Common name         Character
LeftParen   left parenthesis    (
RightParen  right parenthesis   )
LeftBrace   left brace          {
RightBrace  right brace         }
Semicolon   semi-colon          ;
Comma       comma               ,
- Keywords
Name           Character sequence
Keyword_if     if
Keyword_else   else
Keyword_while  while
Keyword_print  print
Keyword_putc   putc
- Identifiers and literals
These differ from the previous tokens in that each occurrence of them has a value associated with it.
- Identifier (identifier): one or more letter/number/underscore characters, but not starting with a number. Regex: [_a-zA-Z][_a-zA-Z0-9]*. Value: as is.
- Integer (integer literal): one or more digits. Regex: [0-9]+. Value: as is, interpreted as a number.
- Integer (char literal): exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes. Regex: '([^'\n]|\\n|\\\\)'. Value: the ASCII code point number of the character, e.g. 65 for 'A' and 10 for '\n'.
- String (string literal): zero or more characters (anything except newline or double quote), enclosed by double quotes. Regex: "[^"\n]*". Value: the characters without the double quotes and with escape sequences converted.
- For char and string literals, the \n escape sequence is supported to represent a new-line character.
- For char and string literals, to represent a backslash, use \\.
- No other special sequences are supported. This means that:
  - Char literals cannot represent a single quote character (value 39).
  - String literals cannot represent strings containing double quote characters.
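To make the escape rules concrete, here is a minimal, hypothetical C helper (illustrative only, not taken from any of the solutions below) that maps the character following a backslash to its value:
/* Hypothetical helper: decode the character that follows a backslash.
   Only \n and \\ are supported; return -1 for anything else so the
   caller can report "Unknown escape sequence". */
static int unescape(int c)
{
    switch (c) {
    case 'n':  return 10;   /* newline   */
    case '\\': return 92;   /* backslash */
    default:   return -1;   /* unsupported escape */
    }
}
With this, a char literal such as '\n' would be reported as Integer 10, as required above.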
- Zero-width tokens
Name          Location
End_of_input  when the end of the input stream is reached
- White space
- Zero or more whitespace characters, or comments enclosed in /* ... */, are allowed between any two tokens, with the exceptions noted below.
- "Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =); a lookahead sketch follows the example fragments below.
- Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
  - This means: keywords, identifiers, and integer literals.
  - e.g. ifprint is recognized as an identifier, instead of the keywords if and print.
  - e.g. 42fred is invalid, and is recognized as neither a number nor an identifier.
- Whitespace is not allowed inside of tokens (except for chars and strings, where it is part of the value).
  - e.g. & & is invalid, and not interpreted as the && operator.
For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:
if ( p /* meaning n is prime */ ) { print ( n , " " ) ; count = count + 1 ; /* number of primes found so far */ }
if(p){print(n," ");count=count+1;}
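The "longest token matching" rule above is typically handled with one character of lookahead. The following minimal C sketch illustrates the idea only; peek and less_or_lessequal are hypothetical helpers, not part of any solution below:
#include <stdio.h>
/* Peek at the next input character without consuming it. */
static int peek(FILE *f)
{
    int c = getc(f);
    if (c != EOF)
        ungetc(c, f);
    return c;
}
/* Called after a '<' has been read: prefer the longer token "<=". */
static const char *less_or_lessequal(FILE *f)
{
    if (peek(f) == '=') {
        getc(f);                 /* consume the '=' */
        return "Op_lessequal";
    }
    return "Op_less";
}
The same pattern covers >=, ==, != and the doubled && and || operators.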
- Complete list of token names
End_of_input Op_multiply Op_divide Op_mod Op_add Op_subtract Op_negate Op_not Op_less Op_lessequal Op_greater Op_greaterequal Op_equal Op_notequal Op_assign Op_and Op_or Keyword_if Keyword_else Keyword_while Keyword_print Keyword_putc LeftParen RightParen LeftBrace RightBrace Semicolon Comma Identifier Integer String
The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
- the line number where the token starts
- the column number where the token starts
- the token name
- the token value (only for Identifier, Integer, and String tokens)
- the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.
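As a sketch of this output format (illustrative only; print_token and its field widths are assumptions, not part of any solution below):
#include <stdio.h>
/* Write one output record: line, column, token name, and an optional value
   (only Identifier, Integer and String tokens carry a value). */
static void print_token(int line, int col, const char *name, const char *value)
{
    printf("%5d %5d %-16s", line, col, name);
    if (value != NULL)
        printf(" %s", value);
    putchar('\n');
}
A call such as print_token(4, 1, "Keyword_print", NULL) produces the first record of Test Case 1 below.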
This task is intended to be used as part of a pipeline, with the other compiler tasks - for example:
lex < hello.t | parse | gen | vm
Or possibly:
lex hello.t lex.out
parse lex.out parse.out
gen parse.out gen.out
vm gen.out
This implies that the output of this task (the lexical analyzer) should be suitable as input to any of the Syntax Analyzer task programs.
The following error conditions should be caught:
- Empty character constant, e.g. ''
- Unknown escape sequence, e.g. \r
- Multi-character constant, e.g. 'xx'
- End-of-file in comment: closing comment characters not found.
- End-of-file while scanning string literal: closing string character not found.
- End-of-line while scanning string literal: closing string character not found before end-of-line.
- Unrecognized character.
- Invalid number (starts like a number, but ends in non-numeric characters), e.g. 123abc
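How the errors are reported is up to the implementation; the solutions below generally print a message with the position of the offending token and stop. A minimal hypothetical C version of that idea:
#include <stdio.h>
#include <stdlib.h>
/* Report a fatal lexing error at the position where the token started. */
static void lex_error(int line, int col, const char *msg)
{
    fprintf(stderr, "(%d, %d) error: %s\n", line, col, msg);
    exit(EXIT_FAILURE);
}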
Test Case 1:
Input:
/* Hello world */ print("Hello, World!\n");
Output:
4 1 Keyword_print
4 6 LeftParen
4 7 String "Hello, World!\n"
4 24 RightParen
4 25 Semicolon
5 1 End_of_input
Test Case 2:
Input:
/* Show Ident and Integers */ phoenix_number = 142857; print(phoenix_number, "\n");
Output:
4 1 Identifier phoenix_number
4 16 Op_assign
4 18 Integer 142857
4 24 Semicolon
5 1 Keyword_print
5 6 LeftParen
5 7 Identifier phoenix_number
5 21 Comma
5 23 String "\n"
5 27 RightParen
5 28 Semicolon
6 1 End_of_input
Test Case 3:
Input:
/* All lexical tokens - not syntactically correct, but that will have to wait until syntax analysis */ /* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* Else */ else /* Leq */ <= /* While */ while /* Geq */ >= /* Lbrace */ { /* Eq */ == /* Rbrace */ } /* Neq */ != /* Lparen */ ( /* And */ && /* Rparen */ ) /* Or */ || /* Uminus */ - /* Semi */ ; /* Not */ ! /* Comma */ , /* Mul */ * /* Assign */ = /* Div */ / /* Integer */ 42 /* Mod */ % /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' '
Output:
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
Test Case 4:
Input:
/*** test printing, embedded \n and comments with lots of '*' ***/ print(42); print("\nHello World\nGood Bye\nok\n"); print("Print a slash n - \\n.\n");
Output:
2 1 Keyword_print
2 6 LeftParen
2 7 Integer 42
2 9 RightParen
2 10 Semicolon
3 1 Keyword_print
3 6 LeftParen
3 7 String "\nHello World\nGood Bye\nok\n"
3 38 RightParen
3 39 Semicolon
4 1 Keyword_print
4 6 LeftParen
4 7 String "Print a slash n - \\n.\n"
4 33 RightParen
4 34 Semicolon
5 1 End_of_input
- Additional examples
Your solution should pass all the test cases above and the additional tests found Here.
The C and Python versions can be considered reference implementations.
Ada
with Ada.Text_IO, Ada.Streams.Stream_IO, Ada.Strings.Unbounded, Ada.Command_Line,
Ada.Exceptions;
use Ada.Strings, Ada.Strings.Unbounded, Ada.Streams, Ada.Exceptions;
procedure Main is
package IO renames Ada.Text_IO;
package Lexer is
type Token is (Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract, Op_negate,
Op_less, Op_lessequal, Op_greater, Op_greaterequal, Op_equal,
Op_notequal, Op_not, Op_assign, Op_and, Op_or,
LeftParen, RightParen, LeftBrace, RightBrace, Semicolon, Comma,
Keyword_if, Keyword_else, Keyword_while, Keyword_print, Keyword_putc,
Identifier, Token_Integer, Token_String, End_of_input,
Empty_Char_Error, Invalid_Escape_Error, Multi_Char_Error, EOF_Comment_Error,
EOF_String_Error, EOL_String_Error, Invalid_Char_Error, Invalid_Num_Error
);
subtype Operator is Token range Op_multiply .. Op_or;
subtype Symbol is Token range Token'Succ(Operator'Last) .. Comma;
subtype Keyword is Token range Token'Succ(Symbol'Last) .. Keyword_putc;
subtype Error is Token range Empty_Char_Error .. Invalid_Num_Error;
subtype Operator_or_Error is Token
with Static_Predicate => Operator_or_Error in Operator | Error;
subtype Whitespace is Character
with Static_Predicate => Whitespace in ' ' | ASCII.HT | ASCII.CR | ASCII.LF;
Lexer_Error : exception;
Invalid_Escape_Code : constant Character := ASCII.NUL;
procedure run(input : Stream_IO.File_Type);
end Lexer;
package body Lexer is
use type Stream_IO.Count;
procedure run(input : Stream_IO.File_Type) is
type State is (State_Start, State_Identifier, State_Integer, State_Char, State_String,
State_Comment);
curr_state : State := State_Start;
curr_char : Character;
curr_col, curr_row, token_col, token_row : Positive := 1;
token_text : Unbounded_String := Unbounded.Null_Unbounded_String;
function look_ahead return Character is
next_char : Character := ASCII.LF;
begin
if not Stream_IO.End_Of_File(input) then
next_char := Character'Input(Stream_IO.Stream(input));
Stream_IO.Set_Index(input, Stream_IO.Index(input) - 1);
end if;
return next_char;
end look_ahead;
procedure next_char is
next : Character := Character'Input(Stream_IO.Stream(input));
begin
curr_col := curr_col + 1;
if curr_char = ASCII.LF then
curr_row := curr_row + 1;
curr_col := 1;
end if;
curr_char := next;
end next_char;
procedure print_token(tok : Token; text : String := "") is
procedure raise_error(text : String) is
begin
raise Lexer_Error with "Error: " & text;
end;
begin
IO.Put(token_row'Image & ASCII.HT & token_col'Image & ASCII.HT);
case tok is
when Operator | Symbol | Keyword | End_of_input => IO.Put_Line(tok'Image);
when Token_Integer => IO.Put_Line("INTEGER" & ASCII.HT & text);
when Token_String => IO.Put_Line("STRING" & ASCII.HT & ASCII.Quotation & text & ASCII.Quotation);
when Identifier => IO.Put_Line(tok'Image & ASCII.HT & text);
when Empty_Char_Error => raise_error("empty character constant");
when Invalid_Escape_Error => raise_error("unknown escape sequence: " & text);
when Multi_Char_Error => raise_error("multi-character constant: " & text);
when EOF_Comment_Error => raise_error("EOF in comment");
when EOF_String_Error => raise_error("EOF in string");
when EOL_String_Error => raise_error("EOL in string");
when Invalid_Char_Error => raise_error("invalid character: " & curr_char);
when Invalid_Num_Error => raise_error("invalid number: " & text);
end case;
end print_token;
procedure lookahead_choose(determiner : Character; a, b : Operator_or_Error) is
begin
if look_ahead = determiner then
print_token(a);
next_char;
else
print_token(b);
end if;
end lookahead_choose;
function to_escape_code(c : Character) return Character is
begin
case c is
when 'n' => return ASCII.LF;
when '\' => return '\';
when others =>
print_token(Invalid_Escape_Error, ASCII.Back_Slash & c);
return Invalid_Escape_Code;
end case;
end to_escape_code;
begin
curr_char := Character'Input(Stream_IO.Stream(input));
loop
case curr_state is
when State_Start =>
token_col := curr_col;
token_row := curr_row;
case curr_char is
when '*' => print_token(Op_multiply);
when '/' =>
if look_ahead = '*' then
next_char;
curr_state := State_Comment;
else
print_token(Op_divide);
end if;
when '%' => print_token(Op_mod);
when '+' => print_token(Op_add);
when '-' => print_token(Op_subtract);
when '(' => print_token(LeftParen);
when ')' => print_token(RightParen);
when '{' => print_token(LeftBrace);
when '}' => print_token(RightBrace);
when ';' => print_token(Semicolon);
when ',' => print_token(Comma);
when '<' => lookahead_choose('=', Op_lessequal, Op_less);
when '>' => lookahead_choose('=', Op_greaterequal, Op_greater);
when '!' => lookahead_choose('=', Op_notequal, Op_not);
when '=' => lookahead_choose('=', Op_equal, Op_assign);
when '&' => lookahead_choose('&', Op_and, Invalid_Char_Error);
when '|' => lookahead_choose('|', Op_or, Invalid_Char_Error);
when 'a' .. 'z' | 'A' .. 'Z' | '_' =>
Unbounded.Append(token_text, curr_char);
curr_state := State_Identifier;
when '0' .. '9' =>
Unbounded.Append(token_text, curr_char);
curr_state := State_Integer;
when ''' => curr_state := State_Char;
when ASCII.Quotation => curr_state := State_String;
when Whitespace => null;
when others => null;
end case;
next_char;
when State_Identifier =>
case curr_char is
when 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' =>
Unbounded.Append(token_text, curr_char);
next_char;
when others =>
if token_text = "if" then
print_token(Keyword_if);
elsif token_text = "else" then
print_token(Keyword_else);
elsif token_text = "while" then
print_token(Keyword_while);
elsif token_text = "print" then
print_token(Keyword_print);
elsif token_text = "putc" then
print_token(Keyword_putc);
else
print_token(Identifier, To_String(token_text));
end if;
Unbounded.Set_Unbounded_String(token_text, "");
curr_state := State_Start;
end case;
when State_Integer =>
case curr_char is
when '0' .. '9' =>
Unbounded.Append(token_text, curr_char);
next_char;
when 'a' .. 'z' | 'A' .. 'Z' | '_' =>
print_token(Invalid_Num_Error, To_String(token_text));
when others =>
print_token(Token_Integer, To_String(token_text));
Unbounded.Set_Unbounded_String(token_text, "");
curr_state := State_Start;
end case;
when State_Char =>
case curr_char is
when ''' =>
if Unbounded.Length(token_text) = 0 then
print_token(Empty_Char_Error);
elsif Unbounded.Length(token_text) = 1 then
print_token(Token_Integer, Character'Pos(Element(token_text, 1))'Image);
else
print_token(Multi_Char_Error, To_String(token_text));
end if;
Set_Unbounded_String(token_text, "");
curr_state := State_Start;
when '\' =>
Unbounded.Append(token_text, to_escape_code(look_ahead));
next_char;
when others => Unbounded.Append(token_text, curr_char);
end case;
next_char;
when State_String =>
case curr_char is
when ASCII.Quotation =>
print_token(Token_String, To_String(token_text));
Set_Unbounded_String(token_text, "");
curr_state := State_Start;
when '\' =>
if to_escape_code(look_ahead) /= Invalid_Escape_Code then
Unbounded.Append(token_text, curr_char);
end if;
when ASCII.LF | ASCII.CR => print_token(EOL_String_Error);
when others => Unbounded.Append(token_text, curr_char);
end case;
next_char;
when State_Comment =>
case curr_char is
when '*' =>
if look_ahead = '/' then
next_char;
curr_state := State_Start;
end if;
when others => null;
end case;
next_char;
end case;
end loop;
exception
when error : Stream_IO.End_Error =>
if curr_state = State_String then
print_token(EOF_String_Error);
else
print_token(End_of_input);
end if;
when error : Lexer.Lexer_Error => IO.Put_Line(Exception_Message(error));
end run;
end Lexer;
source_file : Stream_IO.File_Type;
begin
if Ada.Command_Line.Argument_Count < 1 then
IO.Put_Line("usage: lex [filename]");
return;
end if;
Stream_IO.Open(source_file, Stream_IO.In_File, Ada.Command_Line.Argument(1));
Lexer.run(source_file);
exception
when error : others => IO.Put_Line("Error: " & Exception_Message(error));
end Main;
- Output:
5 16 KEYWORD_PRINT
5 40 OP_SUBTRACT
6 16 KEYWORD_PUTC
6 40 OP_LESS
7 16 KEYWORD_IF
7 40 OP_GREATER
8 16 KEYWORD_ELSE
8 40 OP_LESSEQUAL
9 16 KEYWORD_WHILE
9 40 OP_GREATEREQUAL
10 16 LEFTBRACE
10 40 OP_EQUAL
11 16 RIGHTBRACE
11 40 OP_NOTEQUAL
12 16 LEFTPAREN
12 40 OP_AND
13 16 RIGHTPAREN
13 40 OP_OR
14 16 OP_SUBTRACT
14 40 SEMICOLON
15 16 OP_NOT
15 40 COMMA
16 16 OP_MULTIPLY
16 40 OP_ASSIGN
17 16 OP_DIVIDE
17 40 INTEGER 42
18 16 OP_MOD
18 40 STRING "String literal"
19 16 OP_ADD
19 40 IDENTIFIER variable_name
20 26 INTEGER 10
21 26 INTEGER 92
22 26 INTEGER 32
23 1 END_OF_INPUT
ALGOL 68
This is a simple token-in, line-out program. It doesn't keep an internal representation of tokens or anything like that, since none is needed.
As an addition, it emits a diagnostic if integer literals are too big.
BEGIN
# implement C-like getchar, where EOF and EOLn are "characters" (-1 and 10 resp.). #
INT eof = -1, eoln = 10;
BOOL eof flag := FALSE;
STRING buf := "";
INT col := 1;
INT line := 0;
on logical file end (stand in, (REF FILE f)BOOL: eof flag := TRUE);
PROC getchar = INT:
IF eof flag THEN eof
ELIF col = UPB buf THEN col +:= 1; eoln
ELIF col > UPB buf THEN IF line > 0 THEN read(newline) FI;
line +:= 1;
read(buf);
IF eof flag THEN col := 1; eof
ELSE col := 0; getchar
FI
ELSE col +:= 1; ABS buf[col]
FI;
PROC nextchar = INT: IF eof flag THEN eof ELIF col >= UPB buf THEN eoln ELSE ABS buf[col+1] FI;
PROC is blank = (INT ch) BOOL: ch = 0 OR ch = 9 OR ch = 10 OR ch = 13 OR ch = ABS " ";
PROC is digit = (INT ch) BOOL: ch >= ABS "0" AND ch <= ABS "9";
PROC is ident start = (INT ch) BOOL: ch >= ABS "A" AND ch <= ABS "Z" OR
ch >= ABS "a" AND ch <= ABS "z" OR
ch = ABS "_";
PROC is ident = (INT ch) BOOL: is ident start(ch) OR is digit(ch);
PROC ident or keyword = (INT start char) VOID:
BEGIN
STRING w := REPR start char;
INT start col = col;
WHILE is ident (next char) DO w +:= REPR getchar OD;
IF w = "if" THEN output2("Keyword_if", start col)
ELIF w = "else" THEN output2("Keyword_else", start col)
ELIF w = "while" THEN output2("Keyword_while", start col)
ELIF w = "print" THEN output2("Keyword_print", start col)
ELIF w = "putc" THEN output2("Keyword_putc", start col)
ELSE output2("Identifier " + w, start col)
FI
END;
PROC char = VOID:
BEGIN
INT start col = col;
INT ch := getchar;
IF ch = ABS "'" THEN error("Empty character constant")
ELIF ch = ABS "\" THEN ch := getchar;
IF ch = ABS "n" THEN ch := 10
ELIF ch = ABS "\" THEN SKIP
ELSE error("Unknown escape sequence. \" + REPR ch)
FI
FI;
IF nextchar /= ABS "'" THEN error("Multi-character constant.") FI;
getchar;
output2("Integer " + whole(ch, 0), start col)
END;
PROC string = VOID:
BEGIN
INT start col = col;
STRING s := """";
WHILE INT ch := getchar; ch /= ABS """"
DO
IF ch = eoln THEN error("End-of-line while scanning string literal. Closing string character not found before end-of-line.")
ELIF ch = eof THEN error("End-of-file while scanning string literal. Closing string character not found.")
ELIF ch = ABS "\" THEN s +:= REPR ch; ch := getchar;
IF ch /= ABS "\" AND ch /= ABS "n" THEN error("Unknown escape sequence. \" + REPR ch) FI;
s +:= REPR ch
ELSE s +:= REPR ch
FI
OD;
output2("String " + s + """", start col)
END;
PROC comment = VOID:
BEGIN
WHILE INT ch := getchar; NOT (ch = ABS "*" AND nextchar = ABS "/")
DO IF ch = eof THEN error("End-of-file in comment. Closing comment characters not found.") FI
OD;
getchar
END;
PROC number = (INT first digit) VOID:
BEGIN
INT start col = col;
INT n := first digit - ABS "0";
WHILE is digit (nextchar) DO
INT u := getchar - ABS "0";
IF LENG n * 10 + LENG u > max int THEN error("Integer too big") FI;
n := n * 10 + u
OD;
IF is ident start (nextchar) THEN error("Invalid number. Starts like a number, but ends in non-numeric characters.") FI;
output2("Integer " + whole(n, 0), start col)
END;
PROC output = (STRING s) VOID: output2(s, col);
PROC output2 = (STRING s, INT col) VOID: print((whole(line,-8), whole(col,-8), " ", s, newline));
PROC if follows = (CHAR second, STRING longer, shorter) VOID:
IF nextchar = ABS second
THEN output(longer); getchar
ELSE output(shorter)
FI;
PROC error = (STRING s)VOID: (put(stand error, ("At ", whole(line,0), ":", whole(col,0), " ", s, new line)); stop);
PROC unrecognized = (INT char) VOID: error("Unrecognized character " + REPR char);
PROC double char = (INT first, STRING op) VOID:
IF nextchar /= first THEN unrecognized(first)
ELSE output2(op, col-1); getchar
FI;
WHILE INT ch := getchar; ch /= eof
DO
IF is blank(ch) THEN SKIP
ELIF ch = ABS "(" THEN output("LeftParen")
ELIF ch = ABS ")" THEN output("RightParen")
ELIF ch = ABS "{" THEN output("LeftBrace")
ELIF ch = ABS "}" THEN output("RightBrace")
ELIF ch = ABS ";" THEN output("Semicolon")
ELIF ch = ABS "," THEN output("Comma")
ELIF ch = ABS "*" THEN output("Op_multiply")
ELIF ch = ABS "/" THEN IF next char = ABS "*" THEN comment
ELSE output("Op_divide")
FI
ELIF ch = ABS "%" THEN output("Op_mod")
ELIF ch = ABS "+" THEN output("Op_add")
ELIF ch = ABS "-" THEN output("Op_subtract")
ELIF ch = ABS "<" THEN if follows("=", "Op_lessequal", "Op_less")
ELIF ch = ABS ">" THEN if follows("=", "Op_greaterequal", "Op_greater")
ELIF ch = ABS "=" THEN if follows("=", "Op_equal", "Op_assign")
ELIF ch = ABS "!" THEN if follows("=", "Op_notequal", "Op_not")
ELIF ch = ABS "&" THEN double char(ch, "Op_and")
ELIF ch = ABS "|" THEN double char(ch, "Op_or")
ELIF is ident start (ch) THEN ident or keyword (ch)
ELIF ch = ABS """" THEN string
ELIF ch = ABS "'" THEN char
ELIF is digit(ch) THEN number(ch)
ELSE unrecognized(ch)
FI
OD;
output("End_Of_Input")
END
ALGOL W
begin
%lexical analyser %
% Algol W strings are limited to 256 characters in length so we limit source lines %
% and tokens to 256 characters %
integer lineNumber, columnNumber;
string(256) line;
string(256) tkValue;
integer tkType, tkLine, tkColumn, tkLength, tkIntegerValue;
logical tkTooLong;
string(1) currChar;
string(1) newlineChar;
integer LINE_WIDTH, MAX_TOKEN_LENGTH, MAXINTEGER_OVER_10, MAXINTEGER_MOD_10;
integer tOp_multiply , tOp_divide , tOp_mod , tOp_add
, tOp_subtract , tOp_negate , tOp_less , tOp_lessequal
, tOp_greater , tOp_greaterequal , tOp_equal , tOp_notequal
, tOp_not , tOp_assign , tOp_and , tOp_or
, tLeftParen , tRightParen , tLeftBrace , tRightBrace
, tSemicolon , tComma , tKeyword_if , tKeyword_else
, tKeyword_while , tKeyword_print , tKeyword_putc , tIdentifier
, tInteger , tString , tEnd_of_input , tComment
;
string(16) array tkName ( 1 :: 32 );
% reports an error %
procedure lexError( string(80) value message ); begin
integer errorPos;
write( i_w := 1, s_w := 0, "**** Error at(", lineNumber, ",", columnNumber, "): " );
errorPos := 0;
while errorPos < 80 and message( errorPos // 1 ) not = "." do begin
writeon( s_w := 0, message( errorPos // 1 ) );
errorPos := errorPos + 1
end while_not_at_end_of_message ;
writeon( s_w := 0, "." )
end lexError ;
% gets the next source character %
procedure nextChar ; begin
if columnNumber = LINE_WIDTH then begin
currChar := newlineChar;
columnNumber := columnNumber + 1
end
else if columnNumber > LINE_WIDTH then begin
readcard( line );
columnNumber := 1;
if not XCPNOTED(ENDFILE) then lineNumber := lineNumber + 1;
currChar := line( 0 // 1 )
end
else begin
currChar := line( columnNumber // 1 );
columnNumber := columnNumber + 1
end
end nextChar ;
% gets the next token, returns the token type %
integer procedure nextToken ; begin
% returns true if currChar is in the inclusive range lowerValue to upperValue %
% false otherwise %
logical procedure range( string(1) value lowerValue, upperValue ) ; begin
currChar >= lowerValue and currChar <= upperValue
end range ;
% returns true if the current character can start an identifier, false otherwise %
logical procedure identifierStartChar ; begin
currChar = "_" or range( "a", "z" ) or range( "A", "Z" )
end identifierStartChar ;
% add the current character to the token and get the next %
procedure addAndNextChar ; begin
if tkLength >= MAX_TOKEN_LENGTH then tkTooLong := true
else begin
tkValue( tkLength // 1 ) := currChar;
tkLength := tkLength + 1
end if_symbol_not_too_long ;
nextChar
end % addAndNextChar % ;
% handle a single character token %
procedure singleCharToken( integer value tokenType ) ; begin
tkType := tokenType;
nextChar
end singleCharToken ;
% handle a doubled character token: && or || %
procedure doubleCharToken( integer value tokenType ) ; begin
string(1) firstChar;
firstChar := currChar;
tkType := tokenType;
nextChar;
if currChar = firstChar then nextChar
else % the character wasn't doubled % lexError( "Unrecognised character." );
end doubleCharToken ;
% handle an operator or operator= token %
procedure opOrOpEqual( integer value opToken, opEqualToken ) ; begin
tkType := opToken;
nextChar;
if currChar = "=" then begin
% have operator= %
tkType := opEqualToken;
nextChar
end if_currChar_is_equal ;
end opOrOpEqual ;
% handle a / operator or /* comment %
procedure divideOrComment ; begin
tkType := tOp_divide;
nextChar;
if currChar = "*" then begin
% have a comment %
logical moreComment;
tkType := tComment;
moreComment := true;
while moreComment do begin
nextChar;
while currChar not = "*" and not XCPNOTED(ENDFILE) do nextChar;
while currChar = "*" and not XCPNOTED(ENDFILE) do nextChar;
moreComment := ( currChar not = "/" and not XCPNOTED(ENDFILE) )
end while_more_comment ;
if not XCPNOTED(ENDFILE)
then nextChar
else lexError( "End-of-file in comment." )
end if_currChar_is_star ;
end divideOrComment ;
% handle an identifier or keyword %
procedure identifierOrKeyword ; begin
tkType := tIdentifier;
while identifierStartChar or range( "0", "9" ) do addAndNextChar;
% there are only 5 keywords, so we just test each in turn here %
if tkValue = "if" then tkType := tKeyword_if
else if tkValue = "else" then tkType := tKeyword_else
else if tkValue = "while" then tkType := tKeyword_while
else if tkValue = "print" then tkType := tKeyword_print
else if tkValue = "putc" then tkType := tKeyword_putc;
if tkType not = tIdentifier then tkValue := "";
end identifierOrKeyword ;
% handle an integer literal %
procedure integerLiteral ; begin
logical overflowed;
integer digit;
overflowed := false;
tkType := tInteger;
while range( "0", "9" ) do begin
digit := ( decode( currChar ) - decode( "0" ) );
if tkIntegerValue > MAXINTEGER_OVER_10 then overflowed := true
else if tkIntegerValue = MAXINTEGER_OVER_10
and digit > MAXINTEGER_MOD_10 then overflowed := true
else begin
tkIntegerValue := tkIntegerValue * 10;
tkIntegerValue := tkIntegerValue + digit;
end;
nextChar
end while_have_a_digit ;
if overflowed then lexError( "Number too large." );
if identifierStartChar then lexError( "Number followed by letter or underscore." );
end integerLiteral ;
% handle a char literal %
procedure charLiteral ; begin
nextChar;
if currChar = "'" or currChar = newlineChar then lexError( "Invalid character constant." )
else if currChar = "\" then begin
% have an escape %
nextChar;
if currChar = "n" then currChar := newlineChar
else if currChar not = "\" then lexError( "Unknown escape sequence." )
end;
tkType := tInteger;
tkIntegerValue := decode( currChar );
% should have a closing quote next %
nextChar;
if currChar not = "'"
then lexError( "Multi-character constant." )
else nextChar
end charLiteral ;
% handle a string literal %
procedure stringLiteral ; begin
tkType := tString;
tkValue( 0 // 1 ) := currChar;
tkLength := 1;
nextChar;
while currChar not = """" and currChar not = newlineChar and not XCPNOTED(ENDFILE) do addAndNextChar;
if currChar = newlineChar then lexError( "End-of-line while scanning string literal." )
else if XCPNOTED(ENDFILE) then lexError( "End-of-file while scanning string literal." )
else % currChar must be """" % addAndNextChar
end stringLiteral ;
while begin
% skip white space %
while ( currChar = " " or currChar = newlineChar ) and not XCPNOTED(ENDFILE) do nextChar;
% get the token %
tkLine := lineNumber;
tkColumn := columnNumber;
tkValue := "";
tkLength := 0;
tkIntegerValue := 0;
tkTooLong := false;
if XCPNOTED(ENDFILE) then tkType := tEnd_of_input
else if currChar = "*" then singleCharToken( tOp_multiply )
else if currChar = "/" then divideOrComment
else if currChar = "%" then singleCharToken( tOp_mod )
else if currChar = "+" then singleCharToken( tOp_add )
else if currChar = "-" then singleCharToken( tOp_subtract )
else if currChar = "<" then opOrOpEqual( tOp_less, tOp_lessequal )
else if currChar = ">" then opOrOpEqual( tOp_greater, tOp_greaterequal )
else if currChar = "=" then opOrOpEqual( tOp_assign, tOp_equal )
else if currChar = "!" then opOrOpEqual( tOp_not, tOp_notequal )
else if currChar = "&" then doubleCharToken( tOp_and )
else if currChar = "|" then doubleCharToken( tOp_or )
else if currChar = "(" then singleCharToken( tLeftParen )
else if currChar = ")" then singleCharToken( tRightParen )
else if currChar = "{" then singleCharToken( tLeftBrace )
else if currChar = "}" then singleCharToken( tRightBrace )
else if currChar = ";" then singleCharToken( tSemicolon )
else if currChar = "," then singleCharToken( tComma )
else if identifierStartChar then identifierOrKeyword
else if range( "0", "9" ) then integerLiteral
else if currChar = "'" then charLiteral
else if currChar = """" then stringLiteral
else begin
lexError( "Unrecognised character." );
singleCharToken( tComment )
end ;
% continue until we get something other than a comment %
tkType = tComment
end do begin end;
if tkTooLong then if tkType = tString
then lexError( "String literal too long." )
else lexError( "Identifier too long." );
tkType
end nextToken ;
% outputs the current token %
procedure writeToken ; begin
write( i_w := 5, s_w := 2, tkLine, tkColumn, tkName( tkType ) );
if tkType = tInteger then writeon( i_w := 11, tkIntegerValue )
else if tkLength > 0 then begin
writeon( " " );
for tkPos := 0 until tkLength - 1 do writeon( s_w := 0, tkValue( tkPos // 1 ) );
end
end writeToken ;
LINE_WIDTH := 256; MAXINTEGER_MOD_10 := MAXINTEGER rem 10;
MAX_TOKEN_LENGTH := 256; MAXINTEGER_OVER_10 := MAXINTEGER div 10;
newlineChar := code( 10 );
tOp_multiply := 1; tkName( tOp_multiply ) := "Op_multiply";
tOp_divide := 2; tkName( tOp_divide ) := "Op_divide";
tOp_mod := 3; tkName( tOp_mod ) := "Op_mod";
tOp_add := 4; tkName( tOp_add ) := "Op_add";
tOp_subtract := 5; tkName( tOp_subtract ) := "Op_subtract";
tOp_negate := 6; tkName( tOp_negate ) := "Op_negate";
tOp_less := 7; tkName( tOp_less ) := "Op_less";
tOp_lessequal := 8; tkName( tOp_lessequal ) := "Op_lessequal";
tOp_greater := 9; tkName( tOp_greater ) := "Op_greater";
tOp_greaterequal := 10; tkName( tOp_greaterequal ) := "Op_greaterequal";
tOp_equal := 11; tkName( tOp_equal ) := "Op_equal";
tOp_notequal := 12; tkName( tOp_notequal ) := "Op_notequal";
tOp_not := 13; tkName( tOp_not ) := "Op_not";
tOp_assign := 14; tkName( tOp_assign ) := "Op_assign";
tOp_and := 15; tkName( tOp_and ) := "Op_and";
tOp_or := 16; tkName( tOp_or ) := "Op_or";
tLeftParen := 17; tkName( tLeftParen ) := "LeftParen";
tRightParen := 18; tkName( tRightParen ) := "RightParen";
tLeftBrace := 19; tkName( tLeftBrace ) := "LeftBrace";
tRightBrace := 20; tkName( tRightBrace ) := "RightBrace";
tSemicolon := 21; tkName( tSemicolon ) := "Semicolon";
tComma := 22; tkName( tComma ) := "Comma";
tKeyword_if := 23; tkName( tKeyword_if ) := "Keyword_if";
tKeyword_else := 24; tkName( tKeyword_else ) := "Keyword_else";
tKeyword_while := 25; tkName( tKeyword_while ) := "Keyword_while";
tKeyword_print := 26; tkName( tKeyword_print ) := "Keyword_print";
tKeyword_putc := 27; tkName( tKeyword_putc ) := "Keyword_putc";
tIdentifier := 28; tkName( tIdentifier ) := "Identifier";
tInteger := 29; tkName( tInteger ) := "Integer";
tString := 30; tkName( tString ) := "String";
tEnd_of_input := 31; tkName( tEnd_of_input ) := "End_of_input";
tComment := 32; tkName( tComment ) := "Comment";
% allow the program to continue after reaching end-of-file %
ENDFILE := EXCEPTION( false, 1, 0, false, "EOF" );
% ensure the first call to nextToken reads the first line %
lineNumber := 0;
columnNumber := LINE_WIDTH + 1;
currChar := " ";
% get and print all tokens from standard input %
while nextToken not = tEnd_of_input do writeToken;
writeToken
end.
- Output:
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
ATS
One interesting feature of this implementation is my liberal use of a pushback buffer for input characters. This kept the code modular and easier to write.
(One point of note: the C "EOF" pseudo-character is detected in the following code by looking for a negative number. That EOF has to be negative and the other characters non-negative is implied by the ISO C standard.)
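For readers unfamiliar with the technique, here is the pushback idea in a short, hypothetical C-flavoured sketch; the actual implementation below is the ATS program, which instead threads an explicit pushback list through its input state:
/* Illustrative sketch only: a tiny pushback buffer over getchar().
   Characters are read with next_ch() and can be returned with push_back(),
   so the scanner can peek ahead; EOF remains representable as a negative int. */
#include <stdio.h>
static int pushback[8];
static int npushed = 0;
static int next_ch(void)
{
    return npushed > 0 ? pushback[--npushed] : getchar();
}
static void push_back(int c)
{
    if (npushed < 8)
        pushback[npushed++] = c;
}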
(********************************************************************)
(* Usage: lex [INPUTFILE [OUTPUTFILE]]
If INPUTFILE or OUTPUTFILE is "-" or missing, then standard input
or standard output is used, respectively. *)
#define ATS_DYNLOADFLAG 0
#include "share/atspre_staload.hats"
staload UN = "prelude/SATS/unsafe.sats"
#define NIL list_nil ()
#define :: list_cons
%{^
/* alloca(3) is needed for ATS exceptions. */
#include <alloca.h>
%}
(********************************************************************)
#define NUM_TOKENS 31
#define RESERVED_WORD_HASHTAB_SIZE 9
#define TOKEN_ELSE 0
#define TOKEN_IF 1
#define TOKEN_PRINT 2
#define TOKEN_PUTC 3
#define TOKEN_WHILE 4
#define TOKEN_MULTIPLY 5
#define TOKEN_DIVIDE 6
#define TOKEN_MOD 7
#define TOKEN_ADD 8
#define TOKEN_SUBTRACT 9
#define TOKEN_NEGATE 10
#define TOKEN_LESS 11
#define TOKEN_LESSEQUAL 12
#define TOKEN_GREATER 13
#define TOKEN_GREATEREQUAL 14
#define TOKEN_EQUAL 15
#define TOKEN_NOTEQUAL 16
#define TOKEN_NOT 17
#define TOKEN_ASSIGN 18
#define TOKEN_AND 19
#define TOKEN_OR 20
#define TOKEN_LEFTPAREN 21
#define TOKEN_RIGHTPAREN 22
#define TOKEN_LEFTBRACE 23
#define TOKEN_RIGHTBRACE 24
#define TOKEN_SEMICOLON 25
#define TOKEN_COMMA 26
#define TOKEN_IDENTIFIER 27
#define TOKEN_INTEGER 28
#define TOKEN_STRING 29
#define TOKEN_END_OF_INPUT 30
typedef token_t =
[i : int | TOKEN_ELSE <= i; i <= TOKEN_END_OF_INPUT]
int i
typedef tokentuple_t = (token_t, String, ullint, ullint)
typedef token_names_vt = @[string][NUM_TOKENS]
vtypedef reserved_words_vt =
@[String][RESERVED_WORD_HASHTAB_SIZE]
vtypedef reserved_word_tokens_vt =
@[token_t][RESERVED_WORD_HASHTAB_SIZE]
vtypedef lookups_vt =
[p_toknames : addr]
[p_wordtab : addr]
[p_toktab : addr]
@{
pf_toknames = token_names_vt @ p_toknames,
pf_wordtab = reserved_words_vt @ p_wordtab,
pf_toktab = reserved_word_tokens_vt @ p_toktab |
toknames = ptr p_toknames,
wordtab = ptr p_wordtab,
toktab = ptr p_toktab
}
fn
reserved_word_lookup
(s : String,
lookups : !lookups_vt,
line_no : ullint,
column_no : ullint) : tokentuple_t =
if string_length s < 2 then
(TOKEN_IDENTIFIER, s, line_no, column_no)
else
let
macdef wordtab = !(lookups.wordtab)
macdef toktab = !(lookups.toktab)
val hashval =
g1uint_mod (g1ofg0 (char2ui s[0] + char2ui s[1]),
g1i2u RESERVED_WORD_HASHTAB_SIZE)
val token = toktab[hashval]
in
if token = TOKEN_IDENTIFIER || s <> wordtab[hashval] then
(TOKEN_IDENTIFIER, s, line_no, column_no)
else
(token, s, line_no, column_no)
end
(********************************************************************)
(* Input allows pushback into a buffer. *)
typedef ch_t =
@{
ichar = int,
line_no = ullint,
column_no = ullint
}
typedef inp_t (n : int) =
[0 <= n]
@{
file = FILEref,
pushback = list (ch_t, n),
line_no = ullint,
column_no = ullint
}
typedef inp_t = [n : int] inp_t n
fn
get_ch (inp : inp_t) : (ch_t, inp_t) =
case+ (inp.pushback) of
| NIL =>
let
val c = fileref_getc (inp.file)
val ch =
@{
ichar = c,
line_no = inp.line_no,
column_no = inp.column_no
}
in
if c = char2i '\n' then
let
val inp =
@{
file = inp.file,
pushback = inp.pushback,
line_no = succ (inp.line_no),
column_no = 1ULL
}
in
(ch, inp)
end
else
let
val inp =
@{
file = inp.file,
pushback = inp.pushback,
line_no = inp.line_no,
column_no = succ (inp.column_no)
}
in
(ch, inp)
end
end
| ch :: pushback =>
let
val inp =
@{
file = inp.file,
pushback = pushback,
line_no = inp.line_no,
column_no = inp.column_no
}
in
(ch, inp)
end
fn
push_back_ch (ch : ch_t,
inp : inp_t) : [n : pos] inp_t n =
let
prval _ = lemma_list_param (inp.pushback)
in
@{
file = inp.file,
pushback = ch :: (inp.pushback),
line_no = inp.line_no,
column_no = inp.column_no
}
end
(********************************************************************)
exception unterminated_comment of (ullint, ullint)
exception unterminated_character_literal of (ullint, ullint)
exception multicharacter_literal of (ullint, ullint)
exception unterminated_string_literal of (ullint, ullint, bool)
exception unsupported_escape of (ullint, ullint, int)
exception invalid_integer_literal of (ullint, ullint, String)
exception unexpected_character of (ullint, ullint, int)
fn
scan_comment (inp : inp_t,
line_no : ullint,
column_no : ullint) : inp_t =
let
fun
loop (inp : inp_t) : inp_t =
let
val (ch, inp) = get_ch inp
in
if (ch.ichar) < 0 then
$raise unterminated_comment (line_no, column_no)
else if (ch.ichar) = char2i '*' then
let
val (ch1, inp) = get_ch inp
in
if (ch.ichar) < 0 then
$raise unterminated_comment (line_no, column_no)
else if (ch1.ichar) = char2i '/' then
inp
else
loop inp
end
else
loop inp
end
in
loop inp
end
fn
skip_spaces_and_comments (inp : inp_t) : [n : pos] inp_t n =
let
fun
loop (inp : inp_t) : [n : pos] inp_t n =
let
val (ch, inp) = get_ch inp
in
if isspace (ch.ichar) then
loop inp
else if (ch.ichar) = char2i '/' then
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '*' then
loop (scan_comment (inp, ch.line_no, ch.column_no))
else
let
val inp = push_back_ch (ch1, inp)
val inp = push_back_ch (ch, inp)
in
inp
end
end
else
push_back_ch (ch, inp)
end
in
loop inp
end
fn
reverse_list_to_string
{m : int}
(lst : list (char, m)) : string m =
let
fun
fill_array {n : nat | n <= m} .<n>.
(arr : &(@[char][m + 1]),
lst : list (char, n),
n : size_t n) : void =
case+ lst of
| NIL => ()
| c :: tail =>
begin
arr[pred n] := c;
fill_array (arr, tail, pred n)
end
prval _ = lemma_list_param lst
val m : size_t m = i2sz (list_length lst)
val (pf, pfgc | p) = array_ptr_alloc<char> (succ m)
val _ = array_initize_elt<char> (!p, succ m, '\0')
val _ = fill_array (!p, lst, m)
in
$UN.castvwtp0 @(pf, pfgc | p)
end
extern fun {}
simple_scan$pred : int -> bool
fun {}
simple_scan {u : nat}
(lst : list (char, u),
inp : inp_t) :
[m : nat]
[n : pos]
(list (char, m), inp_t n) =
let
val (ch, inp) = get_ch inp
in
if simple_scan$pred (ch.ichar) then
simple_scan<> (int2char0 (ch.ichar) :: lst, inp)
else
let
val inp = push_back_ch (ch, inp)
in
(lst, inp)
end
end
fn
is_ident_start (c : int) :<> bool =
isalpha (c) || c = char2i '_'
fn
is_ident_continuation (c : int) :<> bool =
isalnum (c) || c = char2i '_'
fn
scan_identifier_or_reserved_word
(inp : inp_t,
lookups : !lookups_vt) :
(tokentuple_t, [n : pos] inp_t n) =
let
val (ch, inp) = get_ch inp
val _ = assertloc (is_ident_start (ch.ichar))
implement simple_scan$pred<> c = is_ident_continuation c
val (lst, inp) = simple_scan (int2char0 (ch.ichar) :: NIL, inp)
val s = reverse_list_to_string lst
val toktup =
reserved_word_lookup (s, lookups, ch.line_no, ch.column_no)
in
(toktup, inp)
end
fn
scan_integer_literal
(inp : inp_t,
lookups : !lookups_vt) :
(tokentuple_t, [n : pos] inp_t n) =
let
val (ch, inp) = get_ch inp
val _ = assertloc (isdigit (ch.ichar))
implement simple_scan$pred<> c = is_ident_continuation c
val (lst, inp) = simple_scan (int2char0 (ch.ichar) :: NIL, inp)
val s = reverse_list_to_string lst
fun
check_they_are_all_digits
{n : nat} .<n>.
(lst : list (char, n)) : void =
case+ lst of
| NIL => ()
| c :: tail =>
if isdigit c then
check_they_are_all_digits tail
else
$raise invalid_integer_literal (ch.line_no, ch.column_no, s)
val _ = check_they_are_all_digits lst
in
((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
end
fn
ichar2integer_literal (c : int) : String0 =
let
var buf = @[char][100] ('\0')
val _ = $extfcall (int, "snprintf", addr@ buf, i2sz 99, "%d", c)
val s = string1_copy ($UN.castvwtp0{String0} buf)
in
strnptr2string s
end
fn
scan_character_literal_without_checking_end (inp : inp_t) :
(tokentuple_t, inp_t) =
let
val (ch, inp) = get_ch inp
val _ = assertloc ((ch.ichar) = '\'')
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) < 0 then
$raise unterminated_character_literal (ch.line_no, ch.column_no)
else if (ch1.ichar) = char2i '\\' then
let
val (ch2, inp) = get_ch inp
in
if (ch2.ichar) < 0 then
$raise unterminated_character_literal (ch.line_no,
ch.column_no)
else if (ch2.ichar) = char2i 'n' then
let
val s = ichar2integer_literal (char2i '\n')
in
((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
end
else if (ch2.ichar) = char2i '\\' then
let
val s = ichar2integer_literal (char2i '\\')
in
((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
end
else
$raise unsupported_escape (ch1.line_no, ch1.column_no,
ch2.ichar)
end
else
let
val s = ichar2integer_literal (ch1.ichar)
in
((TOKEN_INTEGER, s, ch.line_no, ch.column_no), inp)
end
end
fn
scan_character_literal (inp : inp_t) : (tokentuple_t, inp_t) =
let
val (tok, inp) =
scan_character_literal_without_checking_end inp
val line_no = (tok.2)
val column_no = (tok.3)
fun
check_end (inp : inp_t) : inp_t =
let
val (ch, inp) = get_ch inp
in
if (ch.ichar) = char2i '\'' then
inp
else
let
fun
loop_to_end (ch1 : ch_t,
inp : inp_t) : inp_t =
if (ch1.ichar) < 0 then
$raise unterminated_character_literal (line_no,
column_no)
else if (ch1.ichar) = char2i '\'' then
$raise multicharacter_literal (line_no, column_no)
else
let
val (ch1, inp) = get_ch inp
in
loop_to_end (ch1, inp)
end
val inp = loop_to_end (ch, inp)
in
inp
end
end
val inp = check_end inp
in
(tok, inp)
end
fn
scan_string_literal (inp : inp_t) : (tokentuple_t, inp_t) =
let
val (ch, inp) = get_ch inp
val _ = assertloc ((ch.ichar) = '"')
fun
scan {u : pos}
(lst : list (char, u),
inp : inp_t) :
[m : pos] (list (char, m), inp_t) =
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) < 0 then
$raise unterminated_string_literal (ch.line_no,
ch.column_no, false)
else if (ch1.ichar) = char2i '\n' then
$raise unterminated_string_literal (ch.line_no,
ch.column_no, true)
else if (ch1.ichar) = char2i '"' then
(lst, inp)
else if (ch1.ichar) <> char2i '\\' then
scan (int2char0 (ch1.ichar) :: lst, inp)
else
let
val (ch2, inp) = get_ch inp
in
if (ch2.ichar) = char2i 'n' then
scan ('n' :: '\\' :: lst, inp)
else if (ch2.ichar) = char2i '\\' then
scan ('\\' :: '\\' :: lst, inp)
else
$raise unsupported_escape (ch1.line_no, ch1.column_no,
ch2.ichar)
end
end
val lst = '"' :: NIL
val (lst, inp) = scan (lst, inp)
val lst = '"' :: lst
val s = reverse_list_to_string lst
in
((TOKEN_STRING, s, ch.line_no, ch.column_no), inp)
end
fn
get_next_token (inp : inp_t,
lookups : !lookups_vt) : (tokentuple_t, inp_t) =
let
val inp = skip_spaces_and_comments inp
val (ch, inp) = get_ch inp
val ln = ch.line_no
val cn = ch.column_no
in
if ch.ichar < 0 then
((TOKEN_END_OF_INPUT, "", ln, cn), inp)
else
case+ int2char0 (ch.ichar) of
| ',' => ((TOKEN_COMMA, ",", ln, cn), inp)
| ';' => ((TOKEN_SEMICOLON, ";", ln, cn), inp)
| '\(' => ((TOKEN_LEFTPAREN, "(", ln, cn), inp)
| ')' => ((TOKEN_RIGHTPAREN, ")", ln, cn), inp)
| '\{' => ((TOKEN_LEFTBRACE, "{", ln, cn), inp)
| '}' => ((TOKEN_RIGHTBRACE, "}", ln, cn), inp)
| '*' => ((TOKEN_MULTIPLY, "*", ln, cn), inp)
| '/' => ((TOKEN_DIVIDE, "/", ln, cn), inp)
| '%' => ((TOKEN_MOD, "%", ln, cn), inp)
| '+' => ((TOKEN_ADD, "+", ln, cn), inp)
| '-' => ((TOKEN_SUBTRACT, "-", ln, cn), inp)
| '<' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '=' then
((TOKEN_LESSEQUAL, "<=", ln, cn), inp)
else
let
val inp = push_back_ch (ch1, inp)
in
((TOKEN_LESS, "<", ln, cn), inp)
end
end
| '>' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '=' then
((TOKEN_GREATEREQUAL, ">=", ln, cn), inp)
else
let
val inp = push_back_ch (ch1, inp)
in
((TOKEN_GREATER, ">", ln, cn), inp)
end
end
| '=' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '=' then
((TOKEN_EQUAL, "==", ln, cn), inp)
else
let
val inp = push_back_ch (ch1, inp)
in
((TOKEN_ASSIGN, "=", ln, cn), inp)
end
end
| '!' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '=' then
((TOKEN_NOTEQUAL, "!=", ln, cn), inp)
else
let
val inp = push_back_ch (ch1, inp)
in
((TOKEN_NOT, "!", ln, cn), inp)
end
end
| '&' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '&' then
((TOKEN_AND, "&&", ln, cn), inp)
else
$raise unexpected_character (ch.line_no, ch.column_no,
ch.ichar)
end
| '|' =>
let
val (ch1, inp) = get_ch inp
in
if (ch1.ichar) = char2i '|' then
((TOKEN_OR, "||", ln, cn), inp)
else
$raise unexpected_character (ch.line_no, ch.column_no,
ch.ichar)
end
| '"' =>
let
val inp = push_back_ch (ch, inp)
in
scan_string_literal inp
end
| '\'' =>
let
val inp = push_back_ch (ch, inp)
in
scan_character_literal inp
end
| _ when isdigit (ch.ichar) =>
let
val inp = push_back_ch (ch, inp)
in
scan_integer_literal (inp, lookups)
end
| _ when is_ident_start (ch.ichar) =>
let
val inp = push_back_ch (ch, inp)
in
scan_identifier_or_reserved_word (inp, lookups)
end
| _ => $raise unexpected_character (ch.line_no, ch.column_no,
ch.ichar)
end
fn
fprint_ullint_rightjust (outf : FILEref,
num : ullint) : void =
if num < 10ULL then
fprint! (outf, " ", num)
else if num < 100ULL then
fprint! (outf, " ", num)
else if num < 1000ULL then
fprint! (outf, " ", num)
else if num < 10000ULL then
fprint! (outf, " ", num)
else
fprint! (outf, num)
fn
print_token (outf : FILEref,
toktup : tokentuple_t,
lookups : !lookups_vt) : void =
let
macdef toknames = !(lookups.toknames)
val name = toknames[toktup.0]
val str = (toktup.1)
val line_no = (toktup.2)
val column_no = (toktup.3)
val _ = fprint_ullint_rightjust (outf, line_no)
val _ = fileref_puts (outf, " ")
val _ = fprint_ullint_rightjust (outf, column_no)
val _ = fileref_puts (outf, " ")
val _ = fileref_puts (outf, name)
in
begin
case+ toktup.0 of
| TOKEN_IDENTIFIER => fprint! (outf, " ", str)
| TOKEN_INTEGER => fprint! (outf, " ", str)
| TOKEN_STRING => fprint! (outf, " ", str)
| _ => ()
end;
fileref_putc (outf, '\n')
end
fn
scan_text (outf : FILEref,
inp : inp_t,
lookups : !lookups_vt) : void =
let
fun
loop (inp : inp_t,
lookups : !lookups_vt) : void =
let
val (toktup, inp) = get_next_token (inp, lookups)
in
print_token (outf, toktup, lookups);
if toktup.0 <> TOKEN_END_OF_INPUT then
loop (inp, lookups)
end
in
loop (inp, lookups)
end
(********************************************************************)
fn
main_program (inpf : FILEref,
outf : FILEref) : int =
let
(* Using a simple Scheme program, I found the following perfect
hash for the reserved words, using the sum of the first two
characters as the hash value. *)
var reserved_words =
@[String][RESERVED_WORD_HASHTAB_SIZE]
("if", "print", "else", "", "putc", "", "", "while", "")
var reserved_word_tokens =
@[token_t][RESERVED_WORD_HASHTAB_SIZE]
(TOKEN_IF, TOKEN_PRINT, TOKEN_ELSE, TOKEN_IDENTIFIER,
TOKEN_PUTC, TOKEN_IDENTIFIER, TOKEN_IDENTIFIER, TOKEN_WHILE,
TOKEN_IDENTIFIER)
var token_names =
@[string][NUM_TOKENS]
("Keyword_else",
"Keyword_if",
"Keyword_print",
"Keyword_putc",
"Keyword_while",
"Op_multiply",
"Op_divide",
"Op_mod",
"Op_add",
"Op_subtract",
"Op_negate",
"Op_less",
"Op_lessequal",
"Op_greater",
"Op_greaterequal",
"Op_equal",
"Op_notequal",
"Op_not",
"Op_assign",
"Op_and",
"Op_or",
"LeftParen",
"RightParen",
"LeftBrace",
"RightBrace",
"Semicolon",
"Comma",
"Identifier",
"Integer",
"String",
"End_of_input")
var lookups : lookups_vt =
@{
pf_toknames = view@ token_names,
pf_wordtab = view@ reserved_words,
pf_toktab = view@ reserved_word_tokens |
toknames = addr@ token_names,
wordtab = addr@ reserved_words,
toktab = addr@ reserved_word_tokens
}
val inp =
@{
file = inpf,
pushback = NIL,
line_no = 1ULL,
column_no = 1ULL
}
val _ = scan_text (outf, inp, lookups)
val @{
pf_toknames = pf_toknames,
pf_wordtab = pf_wordtab,
pf_toktab = pf_toktab |
toknames = toknames,
wordtab = wordtab,
toktab = toktab
} = lookups
prval _ = view@ token_names := pf_toknames
prval _ = view@ reserved_words := pf_wordtab
prval _ = view@ reserved_word_tokens := pf_toktab
in
0
end
macdef lex_error = "Lexical error: "
implement
main (argc, argv) =
let
val inpfname =
if 2 <= argc then
$UN.cast{string} argv[1]
else
"-"
val outfname =
if 3 <= argc then
$UN.cast{string} argv[2]
else
"-"
in
try
let
val inpf =
if (inpfname : string) = "-" then
stdin_ref
else
fileref_open_exn (inpfname, file_mode_r)
val outf =
if (outfname : string) = "-" then
stdout_ref
else
fileref_open_exn (outfname, file_mode_w)
in
main_program (inpf, outf)
end
with
| ~ unterminated_comment (line_no, column_no) =>
begin
fprintln! (stderr_ref, lex_error,
"unterminated comment starting at ",
line_no, ":", column_no);
1
end
| ~ unterminated_character_literal (line_no, column_no) =>
begin
fprintln! (stderr_ref, lex_error,
"unterminated character literal starting at ",
line_no, ":", column_no);
1
end
| ~ multicharacter_literal (line_no, column_no) =>
begin
fprintln! (stderr_ref, lex_error,
"unsupported multicharacter literal starting at ",
line_no, ":", column_no);
1
end
| ~ unterminated_string_literal (line_no, column_no,
end_of_line) =>
let
val s =
begin
if end_of_line then
"end of line"
else
"end of input"
end : String
in
fprintln! (stderr_ref, lex_error,
"unterminated string literal (", s,
") starting at ", line_no, ":", column_no);
1
end
| ~ unsupported_escape (line_no, column_no, c) =>
begin
fprintln! (stderr_ref, lex_error,
"unsupported escape \\",
int2char0 c, " starting at ",
line_no, ":", column_no);
1
end
| ~ invalid_integer_literal (line_no, column_no, s) =>
begin
fprintln! (stderr_ref, lex_error,
"invalid integer literal ", s,
" starting at ", line_no, ":", column_no);
1
end
| ~ unexpected_character (line_no, column_no, c) =>
begin
fprintln! (stderr_ref, lex_error,
"unexpected character '", int2char0 c,
"' at ", line_no, ":", column_no);
1
end
end
(********************************************************************)
- Output:
$ patscc -O2 -DATS_MEMALLOC_GCBDW -o lex lex-in-ATS.dats -lgc && ./lex compiler-tests/testcase3.t
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
AWK
Tested with gawk 4.1.1 and mawk 1.3.4.
BEGIN {
all_syms["tk_EOI" ] = "End_of_input"
all_syms["tk_Mul" ] = "Op_multiply"
all_syms["tk_Div" ] = "Op_divide"
all_syms["tk_Mod" ] = "Op_mod"
all_syms["tk_Add" ] = "Op_add"
all_syms["tk_Sub" ] = "Op_subtract"
all_syms["tk_Negate" ] = "Op_negate"
all_syms["tk_Not" ] = "Op_not"
all_syms["tk_Lss" ] = "Op_less"
all_syms["tk_Leq" ] = "Op_lessequal"
all_syms["tk_Gtr" ] = "Op_greater"
all_syms["tk_Geq" ] = "Op_greaterequal"
all_syms["tk_Eq" ] = "Op_equal"
all_syms["tk_Neq" ] = "Op_notequal"
all_syms["tk_Assign" ] = "Op_assign"
all_syms["tk_And" ] = "Op_and"
all_syms["tk_Or" ] = "Op_or"
all_syms["tk_If" ] = "Keyword_if"
all_syms["tk_Else" ] = "Keyword_else"
all_syms["tk_While" ] = "Keyword_while"
all_syms["tk_Print" ] = "Keyword_print"
all_syms["tk_Putc" ] = "Keyword_putc"
all_syms["tk_Lparen" ] = "LeftParen"
all_syms["tk_Rparen" ] = "RightParen"
all_syms["tk_Lbrace" ] = "LeftBrace"
all_syms["tk_Rbrace" ] = "RightBrace"
all_syms["tk_Semi" ] = "Semicolon"
all_syms["tk_Comma" ] = "Comma"
all_syms["tk_Ident" ] = "Identifier"
all_syms["tk_Integer"] = "Integer"
all_syms["tk_String" ] = "String"
## single character only symbols
symbols["{" ] = "tk_Lbrace"
symbols["}" ] = "tk_Rbrace"
symbols["(" ] = "tk_Lparen"
symbols[")" ] = "tk_Rparen"
symbols["+" ] = "tk_Add"
symbols["-" ] = "tk_Sub"
symbols["*" ] = "tk_Mul"
symbols["%" ] = "tk_Mod"
symbols[";" ] = "tk_Semi"
symbols["," ] = "tk_Comma"
key_words["if" ] = "tk_If"
key_words["else" ] = "tk_Else"
key_words["print"] = "tk_Print"
key_words["putc" ] = "tk_Putc"
key_words["while"] = "tk_While"
# Set up an array that emulates the ord() function.
for(n=0;n<256;n++)
ord[sprintf("%c",n)]=n
input_file = "-"
if (ARGC > 1)
input_file = ARGV[1]
RS=FS="" # read complete file into one line $0
getline < input_file
the_ch = " " # dummy first char - but it must be a space
the_col = 0 # always points to the current character
the_line = 1
for (the_nf=1; ; ) {
split(gettok(), t, SUBSEP)
printf("%5s %5s %-14s", t[2], t[3], all_syms[t[1]])
if (t[1] == "tk_Integer") printf(" %5s\n", t[4])
else if (t[1] == "tk_Ident" ) printf(" %s\n", t[4])
else if (t[1] == "tk_String" ) printf(" \"%s\"\n", t[4])
else print("")
if (t[1] == "tk_EOI")
break
}
}
#*** show error and exit
function error(line, col, msg) {
print(line, col, msg)
exit(1)
}
# get the next character from the input
function next_ch() {
the_ch = $the_nf
the_nf ++
the_col ++
if (the_ch == "\n") {
the_line ++
the_col = 0
}
return the_ch
}
#*** 'x' - character constants
function char_lit(err_line, err_col) {
n = ord[next_ch()] # skip opening quote
if (the_ch == "'") {
error(err_line, err_col, "empty character constant")
} else if (the_ch == "\\") {
next_ch()
if (the_ch == "n")
n = 10
else if (the_ch == "\\")
n = ord["\\"]
else
error(err_line, err_col, "unknown escape sequence " the_ch)
}
if (next_ch() != "'")
error(err_line, err_col, "multi-character constant")
next_ch()
return "tk_Integer" SUBSEP err_line SUBSEP err_col SUBSEP n
}
#*** process divide or comments
function div_or_cmt(err_line, err_col) {
if (next_ch() != "*")
return "tk_Div" SUBSEP err_line SUBSEP err_col
# comment found
next_ch()
while (1) {
if (the_ch == "*") {
if (next_ch() == "/") {
next_ch()
return gettok()
} else if (the_ch == "") {
error(err_line, err_col, "EOF in comment")
}
} else {
next_ch()
}
}
}
#*** "string"
function string_lit(start, err_line, err_col) {
text = ""
while (next_ch() != start) {
if (the_ch == "")
error(err_line, err_col, "EOF while scanning string literal")
if (the_ch == "\n")
error(err_line, err_col, "EOL while scanning string literal")
text = text the_ch
}
next_ch()
return "tk_String" SUBSEP err_line SUBSEP err_col SUBSEP text
}
#*** handle identifiers and integers
function ident_or_int(err_line, err_col) {
is_number = 1
text = ""
while ((the_ch ~ /^[0-9a-zA-Z]+$/) || (the_ch == "_")) {
text = text the_ch
if (! (the_ch ~ /^[0-9]+$/))
is_number = 0
next_ch()
}
if (text == "")
error(err_line, err_col, "ident_or_int: unrecognized character: " the_ch)
if (text ~ /^[0-9]/) {
if (! is_number)
error(err_line, err_col, "invalid number: " text)
n = text + 0
return "tk_Integer" SUBSEP err_line SUBSEP err_col SUBSEP n
}
if (text in key_words)
return key_words[text] SUBSEP err_line SUBSEP err_col
return "tk_Ident" SUBSEP err_line SUBSEP err_col SUBSEP text
}
#*** look ahead for '>=', etc.
function follow(expect, ifyes, ifno, err_line, err_col) {
if (next_ch() == expect) {
next_ch()
return ifyes SUBSEP err_line SUBSEP err_col
}
if (ifno == "tk_EOI")
error(err_line, err_col, "follow: unrecognized character: " the_ch)
return ifno SUBSEP err_line SUBSEP err_col
}
#*** return the next token type
function gettok() {
while (the_ch == " " || the_ch == "\n" || the_ch == "\r")
next_ch()
err_line = the_line
err_col = the_col
if (the_ch == "" ) return "tk_EOI" SUBSEP err_line SUBSEP err_col
else if (the_ch == "/") return div_or_cmt(err_line, err_col)
else if (the_ch == "'") return char_lit(err_line, err_col)
else if (the_ch == "<") return follow("=", "tk_Leq", "tk_Lss", err_line, err_col)
else if (the_ch == ">") return follow("=", "tk_Geq", "tk_Gtr", err_line, err_col)
else if (the_ch == "=") return follow("=", "tk_Eq", "tk_Assign", err_line, err_col)
else if (the_ch == "!") return follow("=", "tk_Neq", "tk_Not", err_line, err_col)
else if (the_ch == "&") return follow("&", "tk_And", "tk_EOI", err_line, err_col)
else if (the_ch == "|") return follow("|", "tk_Or", "tk_EOI", err_line, err_col)
else if (the_ch =="\"") return string_lit(the_ch, err_line, err_col)
else if (the_ch in symbols) {
sym = symbols[the_ch]
next_ch()
return sym SUBSEP err_line SUBSEP err_col
} else {
return ident_or_int(err_line, err_col)
}
}
- Output — count:
1 1 Identifier count
1 7 Op_assign
1 9 Integer 1
1 10 Semicolon
2 1 Keyword_while
2 7 LeftParen
2 8 Identifier count
2 14 Op_less
2 16 Integer 10
2 18 RightParen
2 20 LeftBrace
3 5 Keyword_print
3 10 LeftParen
3 11 String "count is: "
3 23 Comma
3 25 Identifier count
3 30 Comma
3 32 String "\n"
3 36 RightParen
3 37 Semicolon
4 5 Identifier count
4 11 Op_assign
4 13 Identifier count
4 19 Op_add
4 21 Integer 1
4 22 Semicolon
5 1 RightBrace
5 3 End_of_input
C
Tested with gcc 4.8.1 and later; compiles warning-free with -Wpedantic -pedantic -Wall -Wextra
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <limits.h>
#define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
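/* da_*: minimal dynamic ("growable") array macros, used to accumulate token text */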
#define da_dim(name, type) type *name = NULL; \
int _qy_ ## name ## _p = 0; \
int _qy_ ## name ## _max = 0
#define da_rewind(name) _qy_ ## name ## _p = 0
#define da_redim(name) do {if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \
name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]));} while (0)
#define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
#define da_len(name) _qy_ ## name ## _p
typedef enum {
tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
tk_Ident, tk_Integer, tk_String
} TokenType;
typedef struct {
TokenType tok;
int err_ln, err_col;
union {
int n; /* value for constants */
char *text; /* text for idents */
};
} tok_s;
static FILE *source_fp, *dest_fp;
static int line = 1, col = 0, the_ch = ' ';
da_dim(text, char);
tok_s gettok(void);
static void error(int err_line, int err_col, const char *fmt, ... ) {
char buf[1000];
va_list ap;
va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
va_end(ap);
printf("(%d,%d) error: %s\n", err_line, err_col, buf);
exit(1);
}
static int next_ch(void) { /* get next char from input */
the_ch = getc(source_fp);
++col;
if (the_ch == '\n') {
++line;
col = 0;
}
return the_ch;
}
static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */
if (the_ch == '\'')
error(err_line, err_col, "gettok: empty character constant");
if (the_ch == '\\') {
next_ch();
if (the_ch == 'n')
n = 10;
else if (the_ch == '\\')
n = '\\';
else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch);
}
if (next_ch() != '\'')
error(err_line, err_col, "multi-character constant");
next_ch();
return (tok_s){tk_Integer, err_line, err_col, {n}};
}
static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */
if (the_ch != '*')
return (tok_s){tk_Div, err_line, err_col, {0}};
/* comment found */
next_ch();
for (;;) {
if (the_ch == '*') {
if (next_ch() == '/') {
next_ch();
return gettok();
}
} else if (the_ch == EOF)
error(err_line, err_col, "EOF in comment");
else
next_ch();
}
}
static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */
da_rewind(text);
while (next_ch() != start) {
if (the_ch == '\n') error(err_line, err_col, "EOL in string");
if (the_ch == EOF) error(err_line, err_col, "EOF in string");
da_append(text, (char)the_ch);
}
da_append(text, '\0');
next_ch();
return (tok_s){tk_String, err_line, err_col, {.text=text}};
}
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct {
const char *s;
TokenType sym;
} kwds[] = {
{"else", tk_Else},
{"if", tk_If},
{"print", tk_Print},
{"putc", tk_Putc},
{"while", tk_While},
}, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
static tok_s ident_or_int(int err_line, int err_col) {
int n, is_number = true;
da_rewind(text);
while (isalnum(the_ch) || the_ch == '_') {
da_append(text, (char)the_ch);
if (!isdigit(the_ch))
is_number = false;
next_ch();
}
if (da_len(text) == 0)
error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch);
da_append(text, '\0');
if (isdigit(text[0])) {
if (!is_number)
error(err_line, err_col, "invalid number: %s\n", text);
        errno = 0;
        long val = strtol(text, NULL, 10);
        if (errno == ERANGE || val > INT_MAX)
            error(err_line, err_col, "Number exceeds maximum value");
        n = (int)val;
return (tok_s){tk_Integer, err_line, err_col, {n}};
}
return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};
}
static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */
if (the_ch == expect) {
next_ch();
return (tok_s){ifyes, err_line, err_col, {0}};
}
if (ifno == tk_EOI)
error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch);
return (tok_s){ifno, err_line, err_col, {0}};
}
tok_s gettok(void) { /* return the token type */
/* skip white space */
while (isspace(the_ch))
next_ch();
int err_line = line;
int err_col = col;
switch (the_ch) {
case '{': next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}};
case '}': next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}};
case '(': next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}};
case ')': next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}};
case '+': next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}};
case '-': next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}};
case '*': next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}};
case '%': next_ch(); return (tok_s){tk_Mod, err_line, err_col, {0}};
case ';': next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}};
case ',': next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}};
case '/': next_ch(); return div_or_cmt(err_line, err_col);
case '\'': next_ch(); return char_lit(the_ch, err_line, err_col);
case '<': next_ch(); return follow('=', tk_Leq, tk_Lss, err_line, err_col);
case '>': next_ch(); return follow('=', tk_Geq, tk_Gtr, err_line, err_col);
case '=': next_ch(); return follow('=', tk_Eq, tk_Assign, err_line, err_col);
case '!': next_ch(); return follow('=', tk_Neq, tk_Not, err_line, err_col);
case '&': next_ch(); return follow('&', tk_And, tk_EOI, err_line, err_col);
case '|': next_ch(); return follow('|', tk_Or, tk_EOI, err_line, err_col);
case '"' : return string_lit(the_ch, err_line, err_col);
default: return ident_or_int(err_line, err_col);
case EOF: return (tok_s){tk_EOI, err_line, err_col, {0}};
}
}
void run(void) { /* tokenize the given input */
tok_s tok;
do {
tok = gettok();
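        /* token names are packed as consecutive 16-character fields in a single
           string literal, indexed with tok.tok * 16 */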
fprintf(dest_fp, "%5d %5d %.15s",
            tok.err_ln, tok.err_col,
            &"End_of_input    Op_multiply     Op_divide       Op_mod          Op_add          "
             "Op_subtract     Op_negate       Op_not          Op_less         Op_lessequal    "
             "Op_greater      Op_greaterequal Op_equal        Op_notequal     Op_assign       "
             "Op_and          Op_or           Keyword_if      Keyword_else    Keyword_while   "
             "Keyword_print   Keyword_putc    LeftParen       RightParen      LeftBrace       "
             "RightBrace      Semicolon       Comma           Identifier      Integer         "
             "String          "
[tok.tok * 16]);
if (tok.tok == tk_Integer) fprintf(dest_fp, " %4d", tok.n);
else if (tok.tok == tk_Ident) fprintf(dest_fp, " %s", tok.text);
else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text);
fprintf(dest_fp, "\n");
} while (tok.tok != tk_EOI);
if (dest_fp != stdout)
fclose(dest_fp);
}
void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {
if (fn[0] == '\0')
*fp = std;
else if ((*fp = fopen(fn, mode)) == NULL)
error(0, 0, "Can't open %s\n", fn);
}
int main(int argc, char *argv[]) {
init_io(&source_fp, stdin, "r", argc > 1 ? argv[1] : "");
init_io(&dest_fp, stdout, "wb", argc > 2 ? argv[2] : "");
run();
return 0;
}
- Output — test case 3:
    5    16 Keyword_print
    5    40 Op_subtract
    6    16 Keyword_putc
    6    40 Op_less
    7    16 Keyword_if
    7    40 Op_greater
    8    16 Keyword_else
    8    40 Op_lessequal
    9    16 Keyword_while
    9    40 Op_greaterequal
   10    16 LeftBrace
   10    40 Op_equal
   11    16 RightBrace
   11    40 Op_notequal
   12    16 LeftParen
   12    40 Op_and
   13    16 RightParen
   13    40 Op_or
   14    16 Op_subtract
   14    40 Semicolon
   15    16 Op_not
   15    40 Comma
   16    16 Op_multiply
   16    40 Op_assign
   17    16 Op_divide
   17    40 Integer         42
   18    16 Op_mod
   18    40 String          "String literal"
   19    16 Op_add
   19    40 Identifier      variable_name
   20    26 Integer         10
   21    26 Integer         92
   22    26 Integer         32
   23     1 End_of_input
C#
Requires C# 6.0 or later because of the use of null-coalescing operators.
using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
namespace Rosetta {
public enum TokenType {
End_of_input, Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract,
Op_negate, Op_not, Op_less, Op_lessequal, Op_greater, Op_greaterequal,
Op_equal, Op_notequal, Op_assign, Op_and, Op_or, Keyword_if,
Keyword_else, Keyword_while, Keyword_print, Keyword_putc, LeftParen, RightParen,
LeftBrace, RightBrace, Semicolon, Comma, Identifier, Integer, String, None
}
/// <summary>
/// Storage class for tokens
/// </summary>
public class Token {
public TokenType Type { get; set; }
public int Line { get; set; }
public int Position { get; set; }
public string Value { get; set; }
public override string ToString() {
if (Type == TokenType.Integer || Type == TokenType.Identifier) {
return String.Format("{0,-5} {1,-5} {2,-14} {3}", Line, Position, Type.ToString(), Value);
} else if (Type == TokenType.String) {
return String.Format("{0,-5} {1,-5} {2,-14} \"{3}\"", Line, Position, Type.ToString(), Value.Replace("\n", "\\n"));
}
return String.Format("{0,-5} {1,-5} {2,-14}", Line, Position, Type.ToString());
}
}
/// <summary>
/// C# Example of Lexical scanner for Rosetta Compiler
/// </summary>
public class LexicalScanner {
// character classes
private const string _letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_";
private const string _numbers = "0123456789";
private const string _identifier = _letters + _numbers + "_";
private const string _whitespace = " \t\n\r";
// mappings from string keywords to token type
private Dictionary<string, TokenType> _keywordTokenTypeMap = new Dictionary<string, TokenType>() {
{ "if", TokenType.Keyword_if },
{ "else", TokenType.Keyword_else },
{ "while", TokenType.Keyword_while },
{ "print", TokenType.Keyword_print },
{ "putc", TokenType.Keyword_putc }
};
// mappings from simple operators to token type
private Dictionary<string, TokenType> _operatorTokenTypeMap = new Dictionary<string, TokenType>() {
{ "+", TokenType.Op_add },
{ "-", TokenType.Op_subtract },
{ "*", TokenType.Op_multiply },
{ "/", TokenType.Op_divide },
{ "%", TokenType.Op_mod },
{ "=", TokenType.Op_assign },
{ "<", TokenType.Op_less },
{ ">", TokenType.Op_greater },
{ "!", TokenType.Op_not },
};
private List<string> _keywords;
private string _operators = "+-*/%=<>!%";
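        // _code holds the remaining source text; advance() consumes characters from the front of it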
private string _code;
private List<Token> tokens = new List<Token>();
private int _line = 1;
private int _position = 1;
public string CurrentCharacter {
get {
try {
return _code.Substring(0, 1);
} catch (ArgumentOutOfRangeException) {
return "";
}
}
}
/// <summary>
/// Lexical scanner initialiser
/// </summary>
/// <param name="code">Code to be tokenised</param>
public LexicalScanner (string code) {
_code = code;
_keywords = _keywordTokenTypeMap.Keys.ToList();
}
/// <summary>
/// Advance the cursor forward given number of characters
/// </summary>
/// <param name="characters">Number of characters to advance</param>
private void advance(int characters=1) {
try {
// reset position when there is a newline
if (CurrentCharacter == "\n") {
_position = 0;
_line++;
}
_code = _code.Substring(characters, _code.Length - characters);
_position += characters;
} catch (ArgumentOutOfRangeException) {
_code = "";
}
}
/// <summary>
/// Outputs error message to the console and exits
/// </summary>
/// <param name="message">Error message to display to user</param>
/// <param name="line">Line error occurred on</param>
/// <param name="position">Line column that the error occurred at</param>
public void error(string message, int line, int position) {
// output error to the console and exit
Console.WriteLine(String.Format("{0} @ {1}:{2}", message, line, position));
Environment.Exit(1);
}
/// <summary>
/// Pattern matching using first & follow matching
/// </summary>
/// <param name="recogniseClass">String of characters that identifies the token type
        /// or the exact match to be made if exact:true</param>
/// <param name="matchClass">String of characters to match against remaining target characters</param>
/// <param name="tokenType">Type of token the match represents.</param>
/// <param name="notNextClass">Optional class of characters that cannot follow the match</param>
/// <param name="maxLen">Optional maximum length of token value</param>
/// <param name="exact">Denotes whether recogniseClass represents an exact match or class match.
/// Default: false</param>
/// <param name="discard">Denotes whether the token is kept or discarded. Default: false</param>
        /// <param name="offset">Optional line position offset to account for discarded tokens</param>
/// <returns>Boolean indicating if a match was made </returns>
public bool match(string recogniseClass, string matchClass, TokenType tokenType,
string notNextClass=null, int maxLen=Int32.MaxValue, bool exact=false,
bool discard=false, int offset=0) {
// if we've hit the end of the file, there's no more matching to be done
if (CurrentCharacter == "")
return false;
// store _current_ line and position so that our vectors point at the start
// of each token
int line = _line;
int position = _position;
// special case exact tokens to avoid needing to worry about backtracking
if (exact) {
if (_code.StartsWith(recogniseClass)) {
if (!discard)
tokens.Add(new Token() { Type = tokenType, Value = recogniseClass, Line = line, Position = position - offset});
advance(recogniseClass.Length);
return true;
}
return false;
}
// first match - denotes the token type usually
if (!recogniseClass.Contains(CurrentCharacter))
return false;
string tokenValue = CurrentCharacter;
advance();
// follow match while we haven't exceeded maxLen and there are still characters
// in the code stream
while ((matchClass ?? "").Contains(CurrentCharacter) && tokenValue.Length <= maxLen && CurrentCharacter != "") {
tokenValue += CurrentCharacter;
advance();
}
// ensure that any incompatible characters are not next to the token
            // e.g. 42fred is invalid: it is recognised neither as a number nor as an identifier.
// _letters would be the notNextClass
if (notNextClass != null && notNextClass.Contains(CurrentCharacter))
error("Unrecognised character: " + CurrentCharacter, _line, _position);
            // only add tokens to the stack that aren't marked as discard - don't want
// things like open and close quotes/comments
if (!discard) {
Token token = new Token() { Type = tokenType, Value = tokenValue, Line = line, Position = position - offset };
tokens.Add(token);
}
return true;
}
/// <summary>
/// Tokenise the input code
/// </summary>
/// <returns>List of Tokens</returns>
public List<Token> scan() {
while (CurrentCharacter != "") {
// match whitespace
match(_whitespace, _whitespace, TokenType.None, discard: true);
// match integers
match(_numbers, _numbers, TokenType.Integer, notNextClass:_letters);
// match identifiers and keywords
if (match(_letters, _identifier, TokenType.Identifier)) {
Token match = tokens.Last();
if (_keywords.Contains(match.Value))
match.Type = _keywordTokenTypeMap[match.Value];
}
// match string similarly to comments without allowing newlines
// this token doesn't get discarded though
if (match("\"", null, TokenType.String, discard:true)) {
string value = "";
int position = _position;
while (!match("\"", null, TokenType.String, discard:true)) {
// not allowed newlines in strings
if (CurrentCharacter == "\n")
error("End-of-line while scanning string literal. Closing string character not found before end-of-line", _line, _position);
// end of file reached before finding end of string
if (CurrentCharacter == "")
error("End-of-file while scanning string literal. Closing string character not found", _line, _position);
value += CurrentCharacter;
// deal with escape sequences - we only accept newline (\n)
if (value.Length >= 2) {
string lastCharacters = value.Substring(value.Length - 2, 2);
if (lastCharacters[0] == '\\') {
if (lastCharacters[1] != 'n') {
error("Unknown escape sequence. ", _line, position);
}
value = value.Substring(0, value.Length - 2).ToString() + "\n";
}
}
advance();
}
tokens.Add(new Token() { Type = TokenType.String, Value = value, Line = _line, Position = position - 1});
}
                // match character literals
if (match("'", null, TokenType.Integer, discard:true)) {
int value;
int position = _position;
value = CurrentCharacter.ToCharArray()[0];
advance();
// deal with empty literals ''
if (value == '\'')
error("Empty character literal", _line, _position);
// deal with escaped characters, only need to worry about \n and \\
                    // throw error on any other
if (value == '\\') {
if (CurrentCharacter == "n") {
value = '\n';
} else if (CurrentCharacter == "\\") {
value = '\\';
} else {
error("Unknown escape sequence. ", _line, _position - 1);
}
advance();
}
                    // if we haven't hit a closing ' here, there are too many characters
// in the literal
if (!match("'", null, TokenType.Integer, discard: true))
error("Multi-character constant", _line, _position);
tokens.Add(new Rosetta.Token() { Type = TokenType.Integer, Value = value.ToString(), Line = _line, Position = position - 1 });
}
// match comments by checking for starting token, then advancing
// until closing token is matched
if (match("/*", null, TokenType.None, exact: true, discard: true)) {
while (!match("*/", null, TokenType.None, exact: true, discard: true)) {
// reached the end of the file without closing comment!
if (CurrentCharacter == "")
error("End-of-file in comment. Closing comment characters not found.", _line, _position);
advance();
}
continue;
}
// match complex operators
match("<=", null, TokenType.Op_lessequal, exact: true);
match(">=", null, TokenType.Op_greaterequal, exact: true);
match("==", null, TokenType.Op_equal, exact: true);
match("!=", null, TokenType.Op_notequal, exact: true);
match("&&", null, TokenType.Op_and, exact: true);
match("||", null, TokenType.Op_or, exact: true);
// match simple operators
if (match(_operators, null, TokenType.None, maxLen:1)) {
Token match = tokens.Last();
match.Type = _operatorTokenTypeMap[match.Value];
}
// brackets, braces and separators
match("(", null, TokenType.LeftParen, exact: true);
match(")", null, TokenType.RightParen, exact: true);
match("{", null, TokenType.LeftBrace, exact: true);
match("}", null, TokenType.RightBrace, exact: true);
match(";", null, TokenType.Semicolon, exact: true);
match(",", null, TokenType.Comma, exact: true);
}
// end of file token
tokens.Add(new Rosetta.Token() { Type = TokenType.End_of_input, Line = _line, Position = _position });
return tokens;
}
static void Main (string[] args) {
StreamReader inputFile;
// if we passed in a filename, read code from that, else
// read code from stdin
if (args.Length > 0) {
string path = args[0];
try {
inputFile = new StreamReader(path);
} catch (IOException) {
inputFile = new StreamReader(Console.OpenStandardInput(8192));
}
} else {
inputFile = new StreamReader(Console.OpenStandardInput(8192));
}
string code = inputFile.ReadToEnd();
// strip windows line endings out
code = code.Replace("\r", "");
LexicalScanner scanner = new LexicalScanner(code);
List<Token> tokens = scanner.scan();
foreach(Token token in tokens) {
Console.WriteLine(token.ToString());
}
}
}
}
- Output — test case 3:
    5    16 Keyword_print
    5    40 Op_subtract
    6    16 Keyword_putc
    6    40 Op_less
    7    16 Keyword_if
    7    40 Op_greater
    8    16 Keyword_else
    8    40 Op_lessequal
    9    16 Keyword_while
    9    40 Op_greaterequal
   10    16 LeftBrace
   10    40 Op_equal
   11    16 RightBrace
   11    40 Op_notequal
   12    16 LeftParen
   12    40 Op_and
   13    16 RightParen
   13    40 Op_or
   14    16 Op_subtract
   14    40 Semicolon
   15    16 Op_not
   15    40 Comma
   16    16 Op_multiply
   16    40 Op_assign
   17    16 Op_divide
   17    40 Integer         42
   18    16 Op_mod
   18    40 String          "String literal"
   19    16 Op_add
   19    40 Identifier      variable_name
   20    26 Integer         10
   21    26 Integer         92
   22    26 Integer         32
   23     1 End_of_input
C++
Tested with GCC 9.3.0 (g++ -std=c++17)
#include <charconv> // std::from_chars
#include <fstream> // file_to_string, string_to_file
#include <functional> // std::invoke
#include <iomanip> // std::setw
#include <ios> // std::left
#include <iostream>
#include <map> // keywords
#include <sstream>
#include <string>
#include <utility> // std::forward
#include <variant> // TokenVal
using namespace std;
// =====================================================================================================================
// Machinery
// =====================================================================================================================
string file_to_string (const string& path)
{
// Open file
ifstream file {path, ios::in | ios::binary | ios::ate};
if (!file) throw (errno);
// Allocate string memory
string contents;
contents.resize(file.tellg());
// Read file contents into string
file.seekg(0);
file.read(contents.data(), contents.size());
return contents;
}
void string_to_file (const string& path, string contents)
{
ofstream file {path, ios::out | ios::binary};
if (!file) throw (errno);
file.write(contents.data(), contents.size());
}
template <class F>
void with_IO (string source, string destination, F&& f)
{
string input;
if (source == "stdin") getline(cin, input);
else input = file_to_string(source);
string output = invoke(forward<F>(f), input);
if (destination == "stdout") cout << output;
else string_to_file(destination, output);
}
// Add escaped newlines and backslashes back in for printing
string sanitize (string s)
{
for (auto i = 0u; i < s.size(); ++i)
{
if (s[i] == '\n') s.replace(i++, 1, "\\n");
else if (s[i] == '\\') s.replace(i++, 1, "\\\\");
}
return s;
}
class Scanner
{
public:
const char* pos;
int line = 1;
int column = 1;
Scanner (const char* source) : pos {source} {}
inline char peek () { return *pos; }
void advance ()
{
if (*pos == '\n') { ++line; column = 1; }
else ++column;
++pos;
}
char next ()
{
advance();
return peek();
}
void skip_whitespace ()
{
while (isspace(static_cast<unsigned char>(peek())))
advance();
}
}; // class Scanner
// =====================================================================================================================
// Tokens
// =====================================================================================================================
enum class TokenName
{
OP_MULTIPLY, OP_DIVIDE, OP_MOD, OP_ADD, OP_SUBTRACT, OP_NEGATE,
OP_LESS, OP_LESSEQUAL, OP_GREATER, OP_GREATEREQUAL, OP_EQUAL, OP_NOTEQUAL,
OP_NOT, OP_ASSIGN, OP_AND, OP_OR,
LEFTPAREN, RIGHTPAREN, LEFTBRACE, RIGHTBRACE, SEMICOLON, COMMA,
KEYWORD_IF, KEYWORD_ELSE, KEYWORD_WHILE, KEYWORD_PRINT, KEYWORD_PUTC,
IDENTIFIER, INTEGER, STRING,
END_OF_INPUT, ERROR
};
using TokenVal = variant<int, string>;
struct Token
{
TokenName name;
TokenVal value;
int line;
int column;
};
const char* to_cstring (TokenName name)
{
static const char* s[] =
{
"Op_multiply", "Op_divide", "Op_mod", "Op_add", "Op_subtract", "Op_negate",
"Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal", "Op_equal", "Op_notequal",
"Op_not", "Op_assign", "Op_and", "Op_or",
"LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
"Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
"Identifier", "Integer", "String",
"End_of_input", "Error"
};
return s[static_cast<int>(name)];
}
string to_string (Token t)
{
ostringstream out;
out << setw(2) << t.line << " " << setw(2) << t.column << " ";
switch (t.name)
{
case (TokenName::IDENTIFIER) : out << "Identifier " << get<string>(t.value); break;
case (TokenName::INTEGER) : out << "Integer " << left << get<int>(t.value); break;
case (TokenName::STRING) : out << "String \"" << sanitize(get<string>(t.value)) << '"'; break;
case (TokenName::END_OF_INPUT) : out << "End_of_input"; break;
case (TokenName::ERROR) : out << "Error " << get<string>(t.value); break;
default : out << to_cstring(t.name);
}
out << '\n';
return out.str();
}
// =====================================================================================================================
// Lexer
// =====================================================================================================================
class Lexer
{
public:
Lexer (const char* source) : s {source}, pre_state {s} {}
bool has_more () { return s.peek() != '\0'; }
Token next_token ()
{
s.skip_whitespace();
pre_state = s;
switch (s.peek())
{
case '*' : return simply(TokenName::OP_MULTIPLY);
case '%' : return simply(TokenName::OP_MOD);
case '+' : return simply(TokenName::OP_ADD);
case '-' : return simply(TokenName::OP_SUBTRACT);
case '{' : return simply(TokenName::LEFTBRACE);
case '}' : return simply(TokenName::RIGHTBRACE);
case '(' : return simply(TokenName::LEFTPAREN);
case ')' : return simply(TokenName::RIGHTPAREN);
case ';' : return simply(TokenName::SEMICOLON);
case ',' : return simply(TokenName::COMMA);
case '&' : return expect('&', TokenName::OP_AND);
case '|' : return expect('|', TokenName::OP_OR);
case '<' : return follow('=', TokenName::OP_LESSEQUAL, TokenName::OP_LESS);
case '>' : return follow('=', TokenName::OP_GREATEREQUAL, TokenName::OP_GREATER);
case '=' : return follow('=', TokenName::OP_EQUAL, TokenName::OP_ASSIGN);
case '!' : return follow('=', TokenName::OP_NOTEQUAL, TokenName::OP_NOT);
case '/' : return divide_or_comment();
case '\'' : return char_lit();
case '"' : return string_lit();
default : if (is_id_start(s.peek())) return identifier();
if (is_digit(s.peek())) return integer_lit();
return error("Unrecognized character '", s.peek(), "'");
case '\0' : return make_token(TokenName::END_OF_INPUT);
}
}
private:
Scanner s;
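    // pre_state is a copy of the scanner taken at the start of the current token;
    // make_token and error use it for line/column reporting and error excerpts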
Scanner pre_state;
static const map<string, TokenName> keywords;
template <class... Args>
Token error (Args&&... ostream_args)
{
string code {pre_state.pos, (string::size_type) s.column - pre_state.column};
ostringstream msg;
(msg << ... << forward<Args>(ostream_args)) << '\n'
<< string(28, ' ') << "(" << s.line << ", " << s.column << "): " << code;
if (s.peek() != '\0') s.advance();
return make_token(TokenName::ERROR, msg.str());
}
inline Token make_token (TokenName name, TokenVal value = 0)
{
return {name, value, pre_state.line, pre_state.column};
}
Token simply (TokenName name)
{
s.advance();
return make_token(name);
}
Token expect (char expected, TokenName name)
{
if (s.next() == expected) return simply(name);
else return error("Unrecognized character '", s.peek(), "'");
}
Token follow (char expected, TokenName ifyes, TokenName ifno)
{
if (s.next() == expected) return simply(ifyes);
else return make_token(ifno);
}
    Token divide_or_comment ()
    {
        if (s.next() != '*') return make_token(TokenName::OP_DIVIDE);

        // track the previously read character so that a closing "*/" is found
        // even when it is preceded by another '*' (e.g. a comment ending in "**/")
        char prev = '\0';
        for (char c = s.next(); c != '\0'; prev = c, c = s.next())
        {
            if (prev == '*' && c == '/')
            {
                s.advance();
                return next_token();
            }
        }
        return error("End-of-file in comment. Closing comment characters not found.");
    }
Token char_lit ()
{
int n = s.next();
if (n == '\'') return error("Empty character constant");
if (n == '\\') switch (s.next())
{
case 'n' : n = '\n'; break;
case '\\' : n = '\\'; break;
default : return error("Unknown escape sequence \\", s.peek());
}
if (s.next() != '\'') return error("Multi-character constant");
s.advance();
return make_token(TokenName::INTEGER, n);
}
Token string_lit ()
{
string text = "";
while (s.next() != '"')
switch (s.peek())
{
case '\\' : switch (s.next())
{
case 'n' : text += '\n'; continue;
case '\\' : text += '\\'; continue;
default : return error("Unknown escape sequence \\", s.peek());
}
case '\n' : return error("End-of-line while scanning string literal."
" Closing string character not found before end-of-line.");
case '\0' : return error("End-of-file while scanning string literal."
" Closing string character not found.");
default : text += s.peek();
}
s.advance();
return make_token(TokenName::STRING, text);
}
static inline bool is_id_start (char c) { return isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static inline bool is_id_end (char c) { return isalnum(static_cast<unsigned char>(c)) || c == '_'; }
static inline bool is_digit (char c) { return isdigit(static_cast<unsigned char>(c)); }
Token identifier ()
{
string text (1, s.peek());
while (is_id_end(s.next())) text += s.peek();
auto i = keywords.find(text);
if (i != keywords.end()) return make_token(i->second);
return make_token(TokenName::IDENTIFIER, text);
}
Token integer_lit ()
{
while (is_digit(s.next()));
if (is_id_start(s.peek()))
return error("Invalid number. Starts like a number, but ends in non-numeric characters.");
int n;
auto r = from_chars(pre_state.pos, s.pos, n);
if (r.ec == errc::result_out_of_range) return error("Number exceeds maximum value");
return make_token(TokenName::INTEGER, n);
}
}; // class Lexer
const map<string, TokenName> Lexer::keywords =
{
{"else", TokenName::KEYWORD_ELSE},
{"if", TokenName::KEYWORD_IF},
{"print", TokenName::KEYWORD_PRINT},
{"putc", TokenName::KEYWORD_PUTC},
{"while", TokenName::KEYWORD_WHILE}
};
int main (int argc, char* argv[])
{
string in = (argc > 1) ? argv[1] : "stdin";
string out = (argc > 2) ? argv[2] : "stdout";
with_IO(in, out, [](string input)
{
Lexer lexer {input.data()};
string s = "Location Token name Value\n"
"--------------------------------------\n";
while (lexer.has_more()) s += to_string(lexer.next_token());
return s;
});
}
- Output — test case 3:
Location Token name Value
--------------------------------------
    5    16 Keyword_print
    5    40 Op_subtract
    6    16 Keyword_putc
    6    40 Op_less
    7    16 Keyword_if
    7    40 Op_greater
    8    16 Keyword_else
    8    40 Op_lessequal
    9    16 Keyword_while
    9    40 Op_greaterequal
   10    16 LeftBrace
   10    40 Op_equal
   11    16 RightBrace
   11    40 Op_notequal
   12    16 LeftParen
   12    40 Op_and
   13    16 RightParen
   13    40 Op_or
   14    16 Op_subtract
   14    40 Semicolon
   15    16 Op_not
   15    40 Comma
   16    16 Op_multiply
   16    40 Op_assign
   17    16 Op_divide
   17    40 Integer         42
   18    16 Op_mod
   18    40 String          "String literal"
   19    16 Op_add
   19    40 Identifier      variable_name
   20    26 Integer         10
   21    26 Integer         92
   22    26 Integer         32
   23     1 End_of_input
COBOL
Using GnuCOBOL 2. By Steve Williams (with one change to work around a Rosetta Code syntax-highlighter problem).
>>SOURCE FORMAT IS FREE
*> this code is dedicated to the public domain
*> (GnuCOBOL) 2.3-dev.0
identification division.
program-id. lexer.
environment division.
configuration section.
repository. function all intrinsic.
input-output section.
file-control.
select input-file assign using input-name
status input-status
organization line sequential.
data division.
file section.
fd input-file.
01 input-record pic x(98).
working-storage section.
01 input-name pic x(32).
01 input-status pic xx.
01 input-length pic 99.
01 output-name pic x(32) value spaces.
01 output-status pic xx.
01 output-record pic x(64).
01 line-no pic 999 value 0.
01 col-no pic 99.
01 col-no-max pic 99.
01 col-increment pic 9 value 1.
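*> col-increment is zeroed when the current character must be re-scanned as the start of the next token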
01 start-col pic 99.
01 outx pic 99.
01 out-lim pic 99 value 48.
01 output-line value spaces.
03 out-line pic zzzz9.
03 out-column pic zzzzzz9.
03 message-area.
05 filler pic xxx.
05 token pic x(16).
05 out-value pic x(48).
05 out-integer redefines out-value pic zzzzz9.
05 out-integer1 redefines out-value pic zzzzzz9. *> to match the python lexer
01 error-record.
03 error-line pic zzzz9 value 0.
03 error-col pic zzzzzz9 value 0.
03 error-message pic x(68) value spaces.
01 scan-state pic x(16) value spaces.
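*> scan-state remembers the kind of token being assembled across characters:
*> 'identifier', 'integer', 'quote', 'character', 'comment', 'and', 'or' or 'ambiguous'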
01 current-character pic x.
01 previous-character pic x.
procedure division chaining input-name.
start-lexer.
if input-name <> spaces
open input input-file
if input-status = '35'
string 'in lexer ' trim(input-name) ' not found' into error-message
perform report-error
end-if
end-if
perform read-input-file
perform until input-status <> '00'
add 1 to line-no
move line-no to out-line
move length(trim(input-record,trailing)) to col-no-max
move 1 to col-no
move space to previous-character
perform until col-no > col-no-max
move col-no to out-column
move input-record(col-no:1) to current-character
evaluate scan-state
when 'identifier'
if current-character >= 'A' and <= 'Z'
or (current-character >= 'a' and <= 'z')
or (current-character >= '0' and <= '9')
or current-character = '_'
perform increment-outx
move current-character to out-value(outx:1)
if col-no = col-no-max
perform process-identifier
end-if
else
perform process-identifier
if current-character <> space
move 0 to col-increment
end-if
end-if
when 'integer'
evaluate true
when current-character >= '0' and <= '9'
perform increment-outx
move current-character to out-value(outx:1)
if col-no = col-no-max
move numval(out-value) to out-integer
move 'Integer' to token
end-if
when current-character >= 'A' and <= 'Z'
when current-character >= 'a' and <= 'z'
move 'in lexer invalid integer' to error-message
perform report-error
when other
if outx > 5
move numval(out-value) to out-integer1 *> to match the python lexer
else
move numval(out-value) to out-integer
end-if
move 'Integer' to token
if current-character <> space
move 0 to col-increment
end-if
end-evaluate
when 'comment'
if previous-character = '*' and current-character = '/'
move 'comment' to token
end-if
when 'quote'
evaluate current-character also outx
when '"' also 0
string 'in lexer empty string' into error-message
perform report-error
when '"' also any
perform increment-outx
move current-character to out-value(outx:1)
move 'String' to token
when other
if col-no = col-no-max
string 'in lexer missing close quote' into error-message
perform report-error
else
perform increment-outx
move current-character to out-value(outx:1)
end-if
end-evaluate
when 'character'
evaluate current-character also outx
when "'" also 0
string 'in lexer empty character constant' into error-message
perform report-error
when "'" also 1
subtract 1 from ord(out-value(1:1)) giving out-integer
move 'Integer' to token
when "'" also 2
evaluate true
when out-value(1:2) = '\n'
move 10 to out-integer
when out-value(1:2) = '\\'
subtract 1 from ord('\') giving out-integer *> ' (workaround a Rosetta Code highlighter problem)
when other
string 'in lexer unknown escape sequence ' out-value(1:2)
into error-message
perform report-error
end-evaluate
move 'Integer' to token
when "'" also any
string 'in lexer multicharacter constant' into error-message
perform report-error
when other
if col-no = col-no-max
string 'in lexer missing close quote' into error-message
perform report-error
end-if
perform increment-outx
move current-character to out-value(outx:1)
end-evaluate
when 'and'
evaluate previous-character also current-character
when '&' also '&'
move 'Op_and' to token
when other
string 'in lexer AND error' into error-message
perform report-error
end-evaluate
when 'or'
evaluate previous-character also current-character
when '|' also '|'
move 'Op_or' to token
when other
string 'in lexer OR error' into error-message
perform report-error
end-evaluate
when 'ambiguous'
evaluate previous-character also current-character
when '/' also '*'
move 'comment' to scan-state
subtract 1 from col-no giving start-col
when '/' also any
move 'Op_divide' to token
move 0 to col-increment
when '=' also '='
move 'Op_equal' to token
when '=' also any
move 'Op_assign' to token
move 0 to col-increment
when '<' also '='
move 'Op_lessequal' to token
when '<' also any
move 'Op_less' to token
move 0 to col-increment
when '>' also '='
move 'Op_greaterequal' to token
when '>'also any
move 'Op_greater' to token
move 0 to col-increment
when '!' also '='
move 'Op_notequal' to token
when '!' also any
move 'Op_not' to token
move 0 to col-increment
when other
display input-record
string 'in lexer ' trim(scan-state)
' unknown character "' current-character '"'
' with previous character "' previous-character '"'
into error-message
perform report-error
end-evaluate
when other
move col-no to start-col
evaluate current-character
when space
continue
when >= 'A' and <= 'Z'
when >= 'a' and <= 'z'
move 'identifier' to scan-state
move 1 to outx
move current-character to out-value
when >= '0' and <= '9'
move 'integer' to scan-state
move 1 to outx
move current-character to out-value
when '&'
move 'and' to scan-state
when '|'
move 'or' to scan-state
when '"'
move 'quote' to scan-state
move 1 to outx
move current-character to out-value
when "'"
move 'character' to scan-state
move 0 to outx
when '{'
move 'LeftBrace' to token
when '}'
move 'RightBrace' to token
when '('
move 'LeftParen' to token
when ')'
move 'RightParen' to token
when '+'
move 'Op_add' to token
when '-'
move 'Op_subtract' to token
when '*'
move 'Op_multiply' to token
when '%'
move 'Op_mod' to token
when ';'
move 'Semicolon' to token
when ','
move 'Comma' to token
when '/'
when '<'
when '>'
when '='
when '!'
move 'ambiguous' to scan-state
when other
string 'in lexer unknown character "' current-character '"'
into error-message
perform report-error
end-evaluate
end-evaluate
if token <> spaces
perform process-token
end-if
move current-character to previous-character
add col-increment to col-no
move 1 to col-increment
end-perform
if scan-state = 'ambiguous'
evaluate previous-character
when '/'
move 'Op_divide' to token
perform process-token
when '='
move 'Op_assign' to token
perform process-token
when '<'
move 'Op_less' to token
perform process-token
when '>'
move 'Op_greater' to token
perform process-token
when '!'
move 'Op_not' to token
perform process-token
when other
string 'in lexer unresolved ambiguous "' previous-character '" at end of line'
into error-message
perform report-error
end-evaluate
end-if
perform read-input-file
end-perform
evaluate true
when input-status <> '10'
string 'in lexer ' trim(input-name) ' invalid input status ' input-status
into error-message
perform report-error
when scan-state = 'comment'
string 'in lexer unclosed comment at end of input' into error-message
perform report-error
end-evaluate
move 'End_of_input' to token
move 1 to out-column
move 1 to start-col
add 1 to line-no
perform process-token
close input-file
stop run
.
process-identifier.
evaluate true
when out-value = 'print'
move 'Keyword_print' to token
move spaces to out-value
when out-value = 'while'
move 'Keyword_while' to token
move spaces to out-value
when out-value = 'if'
move 'Keyword_if' to token
move spaces to out-value
when out-value = 'else'
move 'Keyword_else' to token
move spaces to out-value
when out-value = 'putc'
move 'Keyword_putc' to token
move spaces to out-value
when other
move 'Identifier' to token
end-evaluate
.
increment-outx.
if outx >= out-lim
string 'in lexer token value length exceeds ' out-lim into error-message
perform report-error
end-if
add 1 to outx
.
process-token.
if token <> 'comment'
move start-col to out-column
move line-no to out-line
display output-line
end-if
move 0 to start-col
move spaces to scan-state message-area
.
report-error.
move line-no to error-line
move start-col to error-col
display error-record
close input-file
stop run with error status -1
.
read-input-file.
if input-name = spaces
move '00' to input-status
accept input-record on exception move '10' to input-status end-accept
else
read input-file
end-if
.
end program lexer.
- Output — test case 3:
prompt$ ./lexer <testcase3
    5    16 Keyword_print
    5    40 Op_subtract
    6    16 Keyword_putc
    6    40 Op_less
    7    16 Keyword_if
    7    40 Op_greater
    8    16 Keyword_else
    8    40 Op_lessequal
    9    16 Keyword_while
    9    40 Op_greaterequal
   10    16 LeftBrace
   10    40 Op_equal
   11    16 RightBrace
   11    40 Op_notequal
   12    16 LeftParen
   12    40 Op_and
   13    16 RightParen
   13    40 Op_or
   14    16 Op_subtract
   14    40 Semicolon
   15    16 Op_not
   15    40 Comma
   16    16 Op_multiply
   16    40 Op_assign
   17    16 Op_divide
   17    40 Integer         42
   18    16 Op_mod
   18    40 String          "String literal"
   19    16 Op_add
   19    40 Identifier      variable_name
   20    26 Integer         10
   21    26 Integer         92
   22    26 Integer         32
   23     1 End_of_input
Common Lisp
Lisp has a built-in reader, and you can customize it by modifying its readtable. I'm also using Gray streams, an almost-standard feature of Common Lisp, to count lines and columns.
(defpackage #:lexical-analyzer
(:use #:cl #:sb-gray)
(:export #:main))
(in-package #:lexical-analyzer)
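;;; Approach: build a custom readtable in which every character that can start
;;; a token is a macro character whose handler returns a TOKEN struct, so the
;;; ordinary READ loop in LEX yields lexer tokens instead of Lisp objects.
;;; The Gray stream class COUNTING-CHARACTER-INPUT-STREAM supplies the line
;;; and column numbers.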
(defconstant +lex-symbols-package+ (or (find-package :lex-symbols)
(make-package :lex-symbols)))
(defclass counting-character-input-stream (fundamental-character-input-stream)
((stream :type stream :initarg :stream :reader stream-of)
(line :type fixnum :initform 1 :accessor line-of)
(column :type fixnum :initform 0 :accessor column-of)
(prev-column :type (or null fixnum) :initform nil :accessor prev-column-of))
(:documentation "Character input stream that counts lines and columns."))
(defmethod stream-read-char ((stream counting-character-input-stream))
(let ((ch (read-char (stream-of stream) nil :eof)))
(case ch
(#\Newline
(incf (line-of stream))
(setf (prev-column-of stream) (column-of stream)
(column-of stream) 0))
(t
(incf (column-of stream))))
ch))
(defmethod stream-unread-char ((stream counting-character-input-stream) char)
(unread-char char (stream-of stream))
(case char
(#\Newline
(decf (line-of stream))
(setf (column-of stream) (prev-column-of stream)))
(t
(decf (column-of stream)))))
(defstruct token
(name nil :type symbol)
(value nil :type t)
(line nil :type fixnum)
(column nil :type fixnum))
(defun lexer-error (format-control &rest args)
(apply #'error format-control args))
(defun handle-divide-or-comment (stream char)
(declare (ignore char))
(case (peek-char nil stream t nil t)
(#\* (loop with may-end = nil
initially (read-char stream t nil t)
for ch = (read-char stream t nil t)
until (and may-end (char= ch #\/))
do (setf may-end (char= ch #\*))
finally (return (read stream t nil t))))
(t (make-token :name :op-divide :line (line-of stream) :column (column-of stream)))))
(defun make-constant-handler (token-name)
(lambda (stream char)
(declare (ignore char))
(make-token :name token-name :line (line-of stream) :column (column-of stream))))
(defun make-this-or-that-handler (expect then &optional else)
(lambda (stream char)
(declare (ignore char))
(let ((line (line-of stream))
(column (column-of stream))
(next (peek-char nil stream nil nil t)))
(cond ((and expect (char= next expect))
(read-char stream nil nil t)
(make-token :name then :line line :column column))
(else
(make-token :name else :line line :column column))
(t
(lexer-error "Unrecognized character '~A'" next))))))
(defun identifier? (symbol)
(and (symbolp symbol)
(not (keywordp symbol))
(let ((name (symbol-name symbol)))
(and (find (char name 0) "_abcdefghijklmnopqrstuvwxyz" :test #'char-equal)
(or (< (length name) 2)
(not (find-if-not (lambda (ch)
(find ch "_abcdefghijklmnopqrstuvwxyz0123456789"
:test #'char-equal))
name :start 1)))))))
(defun id->keyword (id line column)
(case id
(lex-symbols::|if| (make-token :name :keyword-if :line line :column column))
(lex-symbols::|else| (make-token :name :keyword-else :line line :column column))
(lex-symbols::|while| (make-token :name :keyword-while :line line :column column))
(lex-symbols::|print| (make-token :name :keyword-print :line line :column column))
(lex-symbols::|putc| (make-token :name :keyword-putc :line line :column column))
(t nil)))
(defun handle-identifier (stream char)
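  ;; Give CHAR plain constituent syntax (copied from #\z) in a temporary
  ;; readtable, push it back, and let READ collect the whole word; the
  ;; resulting symbol is then classified as a keyword or an identifier.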
(let ((*readtable* (copy-readtable)))
(set-syntax-from-char char #\z)
(let ((line (line-of stream))
(column (column-of stream)))
(unread-char char stream)
(let ((obj (read stream t nil t)))
(if (identifier? obj)
(or (id->keyword obj line column)
(make-token :name :identifier :value obj :line line :column column))
(lexer-error "Invalid identifier name: ~A" obj))))))
(defun handle-integer (stream char)
(let ((*readtable* (copy-readtable)))
(set-syntax-from-char char #\z)
(let ((line (line-of stream))
(column (column-of stream)))
(unread-char char stream)
(let ((obj (read stream t nil t)))
(if (integerp obj)
(make-token :name :integer :value obj :line line :column column)
(lexer-error "Invalid integer: ~A" obj))))))
(defun handle-char-literal (stream char)
(declare (ignore char))
(let* ((line (line-of stream))
(column (column-of stream))
(ch (read-char stream t nil t))
(parsed (case ch
(#\' (lexer-error "Empty character constant"))
(#\Newline (lexer-error "New line in character literal"))
(#\\ (let ((next-ch (read-char stream t nil t)))
(case next-ch
(#\n #\Newline)
(#\\ #\\)
(t (lexer-error "Unknown escape sequence: \\~A" next-ch)))))
(t ch))))
(if (char= #\' (read-char stream t nil t))
(make-token :name :integer :value (char-code parsed) :line line :column column)
(lexer-error "Only one character is allowed in character literal"))))
(defun handle-string (stream char)
(declare (ignore char))
(loop with result = (make-array 0 :element-type 'character :adjustable t :fill-pointer t)
with line = (line-of stream)
with column = (column-of stream)
for ch = (read-char stream t nil t)
until (char= ch #\")
do (setf ch (case ch
(#\Newline (lexer-error "New line in string"))
(#\\ (let ((next-ch (read-char stream t nil t)))
(case next-ch
(#\n #\Newline)
(#\\ #\\)
(t (lexer-error "Unknown escape sequence: \\~A" next-ch)))))
(t ch)))
(vector-push-extend ch result)
finally (return (make-token :name :string :value result :line line :column column))))
(defun make-lexer-readtable ()
(let ((*readtable* (copy-readtable nil)))
(setf (readtable-case *readtable*) :preserve)
(set-syntax-from-char #\\ #\z)
(set-syntax-from-char #\# #\z)
(set-syntax-from-char #\` #\z)
;; operators
(set-macro-character #\* (make-constant-handler :op-multiply))
(set-macro-character #\/ #'handle-divide-or-comment)
(set-macro-character #\% (make-constant-handler :op-mod))
(set-macro-character #\+ (make-constant-handler :op-add))
(set-macro-character #\- (make-constant-handler :op-subtract))
(set-macro-character #\< (make-this-or-that-handler #\= :op-lessequal :op-less))
(set-macro-character #\> (make-this-or-that-handler #\= :op-greaterequal :op-greater))
(set-macro-character #\= (make-this-or-that-handler #\= :op-equal :op-assign))
(set-macro-character #\! (make-this-or-that-handler #\= :op-notequal :op-not))
(set-macro-character #\& (make-this-or-that-handler #\& :op-and))
(set-macro-character #\| (make-this-or-that-handler #\| :op-or))
;; symbols
(set-macro-character #\( (make-constant-handler :leftparen))
(set-macro-character #\) (make-constant-handler :rightparen))
(set-macro-character #\{ (make-constant-handler :leftbrace))
(set-macro-character #\} (make-constant-handler :rightbrace))
(set-macro-character #\; (make-constant-handler :semicolon))
(set-macro-character #\, (make-constant-handler :comma))
;; identifiers & keywords
(set-macro-character #\_ #'handle-identifier t)
(loop for ch across "abcdefghijklmnopqrstuvwxyz"
do (set-macro-character ch #'handle-identifier t))
(loop for ch across "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
do (set-macro-character ch #'handle-identifier t))
;; integers
(loop for ch across "0123456789"
do (set-macro-character ch #'handle-integer t))
(set-macro-character #\' #'handle-char-literal)
;; strings
(set-macro-character #\" #'handle-string)
*readtable*))
(defun lex (stream)
(loop with *readtable* = (make-lexer-readtable)
with *package* = +lex-symbols-package+
with eof = (gensym)
with counting-stream = (make-instance 'counting-character-input-stream :stream stream)
for token = (read counting-stream nil eof)
until (eq token eof)
do (format t "~5D ~5D ~15A~@[ ~S~]~%"
(token-line token) (token-column token) (token-name token) (token-value token))
finally (format t "~5D ~5D ~15A~%"
(line-of counting-stream) (column-of counting-stream) :end-of-input)
(close counting-stream)))
(defun main ()
(lex *standard-input*))
- Output — test case 3:
    5    16 KEYWORD-PRINT
    5    40 OP-SUBTRACT
    6    16 KEYWORD-PUTC
    6    40 OP-LESS
    7    16 KEYWORD-IF
    7    40 OP-GREATER
    8    16 KEYWORD-ELSE
    8    40 OP-LESSEQUAL
    9    16 KEYWORD-WHILE
    9    40 OP-GREATEREQUAL
   10    16 LEFTBRACE
   10    40 OP-EQUAL
   11    16 RIGHTBRACE
   11    40 OP-NOTEQUAL
   12    16 LEFTPAREN
   12    40 OP-AND
   13    16 RIGHTPAREN
   13    40 OP-OR
   14    16 OP-SUBTRACT
   14    40 SEMICOLON
   15    16 OP-NOT
   15    40 COMMA
   16    16 OP-MULTIPLY
   16    40 OP-ASSIGN
   17    16 OP-DIVIDE
   17    40 INTEGER         42
   18    16 OP-MOD
   18    40 STRING          "String literal"
   19    16 OP-ADD
   19    40 IDENTIFIER      variable_name
   20    26 INTEGER         10
   21    26 INTEGER         92
   22    26 INTEGER         32
   23     1 END-OF-INPUT
Elixir
#!/bin/env elixir
# -*- elixir -*-
defmodule Lex do
def main args do
{inpf_name, outf_name, exit_status} =
case args do
[] -> {"-", "-", 0}
[name] -> {name, "-", 0}
[name1, name2] -> {name1, name2, 0}
[name1, name2 | _] -> {name1, name2, usage_error()}
end
{inpf, outf, exit_status} =
case {inpf_name, outf_name, exit_status} do
{"-", "-", 0} -> {:stdio, :stdio, 0}
{name1, "-", 0} ->
{inpf, exit_status} = open_file(name1, [:read])
{inpf, :stdio, exit_status}
{"-", name2, 0} ->
{outf, exit_status} = open_file(name2, [:write])
{:stdio, outf, exit_status}
{name1, name2, 0} ->
{inpf, exit_status} = open_file(name1, [:read])
if exit_status != 0 do
{inpf, name2, exit_status}
else
{outf, exit_status} = open_file(name2, [:write])
{inpf, outf, exit_status}
end
_ -> {inpf_name, outf_name, exit_status}
end
exit_status =
case exit_status do
0 -> main_program inpf, outf
_ -> exit_status
end
# Choose one.
System.halt exit_status # Fast exit.
#System.stop exit_status # Laborious cleanup.
end
def main_program inpf, outf do
inp = make_inp inpf
scan_text outf, inp
exit_status = 0
exit_status
end
def open_file name, rw do
case File.open name, rw do
{:ok, f} -> {f, 0}
_ ->
IO.write :stderr, "Cannot open "
IO.write :stderr, name
case rw do
[:read] -> IO.puts " for input"
[:write] -> IO.puts " for output"
end
{name, 1}
end
end
def scan_text outf, inp do
{toktup, inp} = get_next_token inp
print_token outf, toktup
case toktup do
{"End_of_input", _, _, _} -> :ok
_ -> scan_text outf, inp
end
end
def print_token outf, {tok, arg, line_no, column_no} do
IO.write outf, (String.pad_leading "#{line_no}", 5)
IO.write outf, " "
IO.write outf, (String.pad_leading "#{column_no}", 5)
IO.write outf, " "
IO.write outf, tok
case tok do
"Identifier" ->
IO.write outf, " "
IO.write outf, arg
"Integer" ->
IO.write outf, " "
IO.write outf, arg
"String" ->
IO.write outf, " "
IO.write outf, arg
_ -> :ok
end
IO.puts outf, ""
end
###-------------------------------------------------------------------
###
### The token dispatcher.
###
def get_next_token inp do
inp = skip_spaces_and_comments inp
{ch, inp} = get_ch inp
{chr, line_no, column_no} = ch
ln = line_no
cn = column_no
case chr do
:eof -> {{"End_of_input", "", ln, cn}, inp}
"," -> {{"Comma", ",", ln, cn}, inp}
";" -> {{"Semicolon", ";", ln, cn}, inp}
"(" -> {{"LeftParen", "(", ln, cn}, inp}
")" -> {{"RightParen", ")", ln, cn}, inp}
"{" -> {{"LeftBrace", "{", ln, cn}, inp}
"}" -> {{"RightBrace", "}", ln, cn}, inp}
"*" -> {{"Op_multiply", "*", ln, cn}, inp}
"/" -> {{"Op_divide", "/", ln, cn}, inp}
"%" -> {{"Op_mod", "%", ln, cn}, inp}
"+" -> {{"Op_add", "+", ln, cn}, inp}
"-" -> {{"Op_subtract", "-", ln, cn}, inp}
"<" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"=" -> {{"Op_lessequal", "<=", ln, cn}, inp}
_ -> {{"Op_less", "<", ln, cn}, (push_back ch1, inp)}
end
">" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"=" -> {{"Op_greaterequal", ">=", ln, cn}, inp}
_ -> {{"Op_greater", ">", ln, cn}, (push_back ch1, inp)}
end
"=" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"=" -> {{"Op_equal", "==", ln, cn}, inp}
_ -> {{"Op_assign", "=", ln, cn}, (push_back ch1, inp)}
end
"!" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"=" -> {{"Op_notequal", "!=", ln, cn}, inp}
_ -> {{"Op_not", "!", ln, cn}, (push_back ch1, inp)}
end
"&" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"&" -> {{"Op_and", "&&", ln, cn}, inp}
_ -> unexpected_character ln, cn, chr
end
"|" ->
{ch1, inp} = get_ch inp
{chr1, _, _} = ch1
case chr1 do
"|" -> {{"Op_or", "||", ln, cn}, inp}
_ -> unexpected_character ln, cn, chr
end
"\"" ->
inp = push_back ch, inp
scan_string_literal inp
"'" ->
inp = push_back ch, inp
scan_character_literal inp
_ ->
cond do
String.match? chr, ~r/^[[:digit:]]$/u ->
inp = push_back ch, inp
scan_integer_literal inp
String.match? chr, ~r/^[[:alpha:]_]$/u ->
inp = push_back ch, inp
scan_identifier_or_reserved_word inp
true -> unexpected_character ln, cn, chr
end
end
end
###-------------------------------------------------------------------
###
### Skipping past spaces and /* ... */ comments.
###
### Comments are treated exactly like a bit of whitespace. They never
### make it to the dispatcher.
###
def skip_spaces_and_comments inp do
{ch, inp} = get_ch inp
{chr, line_no, column_no} = ch
cond do
chr == :eof -> push_back ch, inp
String.match? chr, ~r/^[[:space:]]$/u ->
skip_spaces_and_comments inp
chr == "/" ->
{ch1, inp} = get_ch inp
case ch1 do
{"*", _, _} ->
inp = scan_comment inp, line_no, column_no
skip_spaces_and_comments inp
_ -> push_back ch, (push_back ch1, inp)
end
true -> push_back ch, inp
end
end
def scan_comment inp, line_no, column_no do
{ch, inp} = get_ch inp
case ch do
{:eof, _, _} -> unterminated_comment line_no, column_no
{"*", _, _} ->
{ch1, inp} = get_ch inp
case ch1 do
{:eof, _, _} -> unterminated_comment line_no, column_no
{"/", _, _} -> inp
_ -> scan_comment inp, line_no, column_no
end
_ -> scan_comment inp, line_no, column_no
end
end
###-------------------------------------------------------------------
###
### Scanning of integer literals, identifiers, and reserved words.
###
### These three types of token are very similar to each other.
###
def scan_integer_literal inp do
# Scan an entire word, not just digits. This way we detect
# erroneous text such as "23skidoo".
{line_no, column_no, inp} = get_position inp
{word, inp} = scan_word inp
if String.match? word, (~r/^[[:digit:]]+$/u) do
{{"Integer", word, line_no, column_no}, inp}
else
invalid_integer_literal line_no, column_no, word
end
end
def scan_identifier_or_reserved_word inp do
# It is assumed that the first character is of the correct type,
# thanks to the dispatcher.
{line_no, column_no, inp} = get_position inp
{word, inp} = scan_word inp
tok =
case word do
"if" -> "Keyword_if"
"else" -> "Keyword_else"
"while" -> "Keyword_while"
"print" -> "Keyword_print"
"putc" -> "Keyword_putc"
_ -> "Identifier"
end
{{tok, word, line_no, column_no}, inp}
end
def scan_word inp, word\\"" do
{ch, inp} = get_ch inp
{chr, _, _} = ch
if String.match? chr, (~r/^[[:alnum:]_]$/u) do
scan_word inp, (word <> chr)
else
{word, (push_back ch, inp)}
end
end
def get_position inp do
{ch, inp} = get_ch inp
{_, line_no, column_no} = ch
inp = push_back ch, inp
{line_no, column_no, inp}
end
###-------------------------------------------------------------------
###
### Scanning of string literals.
###
### It is assumed that the first character is the opening quote, and
### that the closing quote is the same character.
###
def scan_string_literal inp do
{ch, inp} = get_ch inp
{quote_mark, line_no, column_no} = ch
{contents, inp} = scan_str_lit inp, ch
{{"String", quote_mark <> contents <> quote_mark,
line_no, column_no},
inp}
end
def scan_str_lit inp, ch, contents\\"" do
{quote_mark, line_no, column_no} = ch
{ch1, inp} = get_ch inp
{chr1, line_no1, column_no1} = ch1
if chr1 == quote_mark do
{contents, inp}
else
case chr1 do
:eof -> eoi_in_string_literal line_no, column_no
"\n" -> eoln_in_string_literal line_no, column_no
"\\" ->
{ch2, inp} = get_ch inp
{chr2, _, _} = ch2
case chr2 do
"n" -> scan_str_lit inp, ch, (contents <> "\\n")
"\\" -> scan_str_lit inp, ch, (contents <> "\\\\")
_ -> unsupported_escape line_no1, column_no1, chr2
end
_ -> scan_str_lit inp, ch, (contents <> chr1)
end
end
end
###-------------------------------------------------------------------
###
### Scanning of character literals.
###
### It is assumed that the first character is the opening quote, and
### that the closing quote is the same character.
###
### The tedious part of scanning a character literal is distinguishing
### between the kinds of lexical error. (One might wish to modify the
### code to detect, as a distinct kind of error, end of line within a
### character literal.)
###
def scan_character_literal inp do
{ch, inp} = get_ch inp
{_, line_no, column_no} = ch
{ch1, inp} = get_ch inp
{chr1, line_no1, column_no1} = ch1
{intval, inp} =
case chr1 do
:eof -> unterminated_character_literal line_no, column_no
"\\" ->
{ch2, inp} = get_ch inp
{chr2, _, _} = ch2
case chr2 do
:eof -> unterminated_character_literal line_no, column_no
"n" -> {(:binary.first "\n"), inp}
"\\" -> {(:binary.first "\\"), inp}
_ -> unsupported_escape line_no1, column_no1, chr2
end
_ -> {(:binary.first chr1), inp}
end
inp = check_character_literal_end inp, ch
{{"Integer", "#{intval}", line_no, column_no}, inp}
end
def check_character_literal_end inp, ch do
{chr, _, _} = ch
{{chr1, _, _}, inp} = get_ch inp
if chr1 == chr do
inp
else
# Lexical error.
find_char_lit_end inp, ch
end
end
def find_char_lit_end inp, ch do
{chr, line_no, column_no} = ch
{{chr1, _, _}, inp} = get_ch inp
if chr1 == chr do
multicharacter_literal line_no, column_no
else
case chr1 do
:eof -> unterminated_character_literal line_no, column_no
_ -> find_char_lit_end inp, ch
end
end
end
###-------------------------------------------------------------------
###
### Character-at-a-time input, with unrestricted pushback, and with
### line and column numbering.
###
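### An input state is the tuple {file, pushback_list, line_no, column_no}.
###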
def make_inp inpf do
{inpf, [], 1, 1}
end
def get_ch {inpf, pushback, line_no, column_no} do
case pushback do
[head | tail] ->
{head, {inpf, tail, line_no, column_no}}
[] ->
case IO.read(inpf, 1) do
:eof ->
{{:eof, line_no, column_no},
{inpf, pushback, line_no, column_no}}
{:error, _} ->
{{:eof, line_no, column_no},
{inpf, pushback, line_no, column_no}}
chr ->
case chr do
"\n" ->
{{chr, line_no, column_no},
{inpf, pushback, line_no + 1, 1}}
_ ->
{{chr, line_no, column_no},
{inpf, pushback, line_no, column_no + 1}}
end
end
end
end
def push_back ch, {inpf, pushback, line_no, column_no} do
{inpf, [ch | pushback], line_no, column_no}
end
###-------------------------------------------------------------------
###
### Lexical and usage errors.
###
def unterminated_comment line_no, column_no do
raise "#{scriptname()}: unterminated comment at #{line_no}:#{column_no}"
end
def invalid_integer_literal line_no, column_no, word do
raise "#{scriptname()}: invalid integer literal #{word} at #{line_no}:#{column_no}"
end
def unsupported_escape line_no, column_no, chr do
raise "#{scriptname()}: unsupported escape \\#{chr} at #{line_no}:#{column_no}"
end
def eoi_in_string_literal line_no, column_no do
raise "#{scriptname()}: end of input in string literal starting at #{line_no}:#{column_no}"
end
def eoln_in_string_literal line_no, column_no do
raise "#{scriptname()}: end of line in string literal starting at #{line_no}:#{column_no}"
end
def multicharacter_literal line_no, column_no do
raise "#{scriptname()}: unsupported multicharacter literal at #{line_no}:#{column_no}"
end
def unterminated_character_literal line_no, column_no do
raise "#{scriptname()}: unterminated character literal starting at #{line_no}:#{column_no}"
end
def unexpected_character line_no, column_no, chr do
raise "#{scriptname()}: unexpected character '#{chr}' at #{line_no}:#{column_no}"
end
def usage_error() do
IO.puts "Usage: #{scriptname()} [INPUTFILE [OUTPUTFILE]]"
IO.puts "If either of INPUTFILE or OUTPUTFILE is not present or is \"-\","
IO.puts "standard input or standard output is used, respectively."
exit_status = 2
exit_status
end
def scriptname() do
Path.basename(__ENV__.file)
end
#---------------------------------------------------------------------
end ## module Lex
Lex.main(System.argv)
- Output:
$ ./lex testcase3.t
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
Emacs Lisp
#!/usr/bin/emacs --script
;;
;; The Rosetta Code lexical analyzer in GNU Emacs Lisp.
;;
;; Migrated from the ATS. However, Emacs Lisp is not friendly to the
;; functional style of the ATS implementation; therefore the
;; differences are vast.
;;
;; (A Scheme migration could easily, on the other hand, have been
;; almost exact. It is interesting to contrast Lisp dialects and see
;; how huge the differences are.)
;;
;; The script currently takes input only from standard input and
;; writes the token stream only to standard output.
;;
(require 'cl-lib)
;;; The type of a character, consisting of its code point and where it
;;; occurred in the text.
(cl-defstruct (ch_t (:constructor make-ch (ichar line-no column-no)))
ichar line-no column-no)
(defun ch-ichar (ch)
(ch_t-ichar ch))
(defun ch-line-no (ch)
(ch_t-line-no ch))
(defun ch-column-no (ch)
(ch_t-column-no ch))
;;; The type of an "inputter", consisting of an open file for the
;;; text, a pushback buffer (which is an indefinitely deep stack of
;;; ch_t), an input buffer for the current line, and a position in the
;;; text.
(cl-defstruct inp_t file pushback line line-no column-no)
(defun make-inp (file)
"Initialize a new inp_t."
(make-inp_t :file file
:pushback '()
:line ""
:line-no 0
:column-no 0))
(defvar inp (make-inp t)
"A global inp_t.")
(defun get-ch ()
"Get a ch_t, either from the pushback buffer or from the input."
(pcase (inp_t-pushback inp)
(`(,ch . ,tail)
;; Emacs Lisp has only single value return, so the results come
;; back as a list rather than multiple values.
(setq inp (make-inp_t :file (inp_t-file inp)
:pushback tail
:line (inp_t-line inp)
:line-no (inp_t-line-no inp)
:column-no (inp_t-column-no inp)))
ch)
('()
(let ((line (inp_t-line inp))
(line-no (inp_t-line-no inp))
(column-no (inp_t-column-no inp)))
(when (string= line "")
;; Refill the buffer.
(let ((text
(condition-case nil (read-string "")
nil (error 'eoi))))
(if (eq text 'eoi)
(setq line 'eoi)
(setq line (format "%s%c" text ?\n)))
(setq line-no (1+ line-no))
(setq column-no 1)))
(if (eq line 'eoi)
(progn
(setq inp (make-inp_t :file (inp_t-file inp)
:pushback (inp_t-pushback inp)
:line line
:line-no line-no
:column-no column-no))
(make-ch 'eoi line-no column-no))
(let ((c (elt line 0))
(line (substring line 1)))
(setq inp (make-inp_t :file (inp_t-file inp)
:pushback (inp_t-pushback inp)
:line line
:line-no line-no
:column-no (1+ column-no)))
(make-ch c line-no column-no)))))))
(defun get-new-line (file)
;; Currently "file" is ignored and the input must be from stdin.
(read-from-minibuffer "" :default 'eoi))
(defun push-back (ch)
"Push back a ch_t."
(setq inp (make-inp_t :file (inp_t-file inp)
:pushback (cons ch (inp_t-pushback inp))
:line (inp_t-line inp)
:line-no (inp_t-line-no inp)
:column-no (inp_t-column-no inp))))
(defun get-position ()
"Return the line-no and column-no of the next ch_t to be
returned by get-ch, assuming there are no more pushbacks
beforehand."
(let* ((ch (get-ch))
(line-no (ch-line-no ch))
(column-no (ch-column-no ch)))
(push-back ch)
(list line-no column-no)))
(defun scan-text (outf)
"The main loop."
(cl-loop for toktup = (get-next-token)
do (print-token outf toktup)
until (string= (elt toktup 0) "End_of_input")))
(defun print-token (outf toktup)
"Print a token, along with its position and possibly an
argument."
;; Currently outf is ignored, and the output goes to stdout.
(pcase toktup
(`(,tok ,arg ,line-no ,column-no)
(princ (format "%5d %5d %s" line-no column-no tok))
(pcase tok
("Identifier" (princ (format " %s\n" arg)))
("Integer" (princ (format " %s\n" arg)))
("String" (princ (format " %s\n" arg)))
(_ (princ "\n"))))))
(defun get-next-token ()
"The token dispatcher. Returns the next token, as a list along
with its argument and text position."
(skip-spaces-and-comments)
(let* ((ch (get-ch))
(ln (ch-line-no ch))
(cn (ch-column-no ch)))
(pcase (ch-ichar ch)
('eoi (list "End_of_input" "" ln cn))
(?, (list "Comma" "," ln cn))
(?\N{SEMICOLON} (list "Semicolon" ";" ln cn))
(?\N{LEFT PARENTHESIS} (list "LeftParen" "(" ln cn))
(?\N{RIGHT PARENTHESIS} (list "RightParen" ")" ln cn))
(?{ (list "LeftBrace" "{" ln cn))
(?} (list "RightBrace" "}" ln cn))
(?* (list "Op_multiply" "*" ln cn))
(?/ (list "Op_divide" "/" ln cn))
(?% (list "Op_mod" "%" ln cn))
(?+ (list "Op_add" "+" ln cn))
(?- (list "Op_subtract" "-" ln cn))
(?< (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?= (list "Op_lessequal" "<=" ln cn))
(_ (push-back ch1)
(list "Op_less" "<" ln cn)))))
(?> (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?= (list "Op_greaterequal" ">=" ln cn))
(_ (push-back ch1)
(list "Op_greater" ">" ln cn)))))
(?= (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?= (list "Op_equal" "==" ln cn))
(_ (push-back ch1)
(list "Op_assign" "=" ln cn)))))
(?! (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?= (list "Op_notequal" "!=" ln cn))
(_ (push-back ch1)
(list "Op_not" "!" ln cn)))))
(?& (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?& (list "Op_and" "&&" ln cn))
            (_ (unexpected-character ln cn (ch-ichar ch))))))
(?| (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?| (list "Op_or" "||" ln cn))
            (_ (unexpected-character ln cn (ch-ichar ch))))))
(?\N{QUOTATION MARK} (push-back ch) (scan-string-literal))
(?\N{APOSTROPHE} (push-back ch) (scan-character-literal))
((pred digitp) (push-back ch) (scan-integer-literal))
((pred identifier-start-p)
(progn
(push-back ch)
(scan-identifier-or-reserved-word)))
(c (unexpected-character ln cn c)))))
(defun skip-spaces-and-comments ()
"Skip spaces and comments. A comment is treated as equivalent
to a run of spaces."
(cl-loop for ch = (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?/ (let* ((ch2 (get-ch))
(line-no (ch-line-no ch1))
(column-no (ch-column-no ch1))
(position `(,line-no ,column-no)))
(pcase (ch-ichar ch2)
(?* (scan-comment position)
(get-ch))
(_ (push-back ch2)
ch1))))
(_ ch1)))
while (spacep (ch-ichar ch))
finally do (push-back ch)))
(defun scan-comment (position)
(cl-loop for ch = (get-ch)
for done = (comment-done-p ch position)
until done))
(defun comment-done-p (ch position)
(pcase (ch-ichar ch)
('eoi (apply 'unterminated-comment position))
(?* (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
('eoi (apply 'unterminated-comment position))
(?/ t)
(_ nil))))
(_ nil)))
(defun scan-integer-literal ()
"Scan an integer literal, on the assumption that a digit has
been seen and pushed back."
(let* ((position (get-position))
(lst (scan-word))
(s (list-to-string lst)))
(if (all-digits-p lst)
`("Integer" ,s . ,position)
(apply 'illegal-integer-literal `(,@position , s)))))
(defun scan-identifier-or-reserved-word ()
"Scan an identifier or reserved word, on the assumption that a
legal first character (for an identifier) has been seen and
pushed back."
(let* ((position (get-position))
(lst (scan-word))
(s (list-to-string lst))
(tok (pcase s
("else" "Keyword_else")
("if" "Keyword_if")
("while" "Keyword_while")
("print" "Keyword_print")
("putc" "Keyword_putc")
(_ "Identifier"))))
`(,tok ,s . ,position)))
(defun scan-word ()
(cl-loop for ch = (get-ch)
while (identifier-continuation-p (ch-ichar ch))
collect (ch-ichar ch)
finally do (push-back ch)))
(defun scan-string-literal ()
"Scan a string literal, on the assumption that a double quote
has been seen and pushed back."
(let* ((ch (get-ch))
(_ (cl-assert (= (ch-ichar ch) ?\N{QUOTATION MARK})))
(line-no (ch-line-no ch))
(column-no (ch-column-no ch))
(position `(,line-no ,column-no))
(lst (scan-str-lit position))
(lst `(?\N{QUOTATION MARK} ,@lst ?\N{QUOTATION MARK})))
`("String" ,(list-to-string lst) . ,position)))
(defun scan-str-lit (position)
(flatten
(cl-loop for ch = (get-ch)
until (= (ch-ichar ch) ?\N{QUOTATION MARK})
collect (process-str-lit-character
(ch-ichar ch) position))))
(defun process-str-lit-character (c position)
;; NOTE: This script might insert a newline before any eoi, so that
;; "end-of-input-in-string-literal" never actually occurs. It is a
;; peculiarity of the script's input mechanism.
(pcase c
('eoi (apply 'end-of-input-in-string-literal position))
(?\n (apply 'end-of-line-in-string-literal position))
(?\\ (let ((ch1 (get-ch)))
(pcase (ch-ichar ch1)
(?n '(?\\ ?n))
(?\\ '(?\\ ?\\))
(c (unsupported-escape (ch-line-no ch1)
(ch-column-no ch1)
c)))))
(c c)))
(defun scan-character-literal ()
"Scan a character literal, on the assumption that an ASCII
single quote (that is, a Unicode APOSTROPHE) has been seen and
pushed back."
(let* ((toktup (scan-character-literal-without-checking-end))
(line-no (elt toktup 2))
(column-no (elt toktup 3))
(position (list line-no column-no)))
(check-char-lit-end position)
toktup))
(defun check-char-lit-end (position)
(let ((ch (get-ch)))
(unless (and (integerp (ch-ichar ch))
(= (ch-ichar ch) ?\N{APOSTROPHE}))
(push-back ch)
(loop-to-char-lit-end position))))
(defun loop-to-char-lit-end (position)
(cl-loop for ch = (get-ch)
until (or (eq (ch-ichar ch) 'eoi)
(= (ch-ichar ch) ?\N{APOSTROPHE}))
finally do (if (eq (ch-ichar ch) 'eoi)
(apply 'unterminated-character-literal
position)
(apply 'multicharacter-literal position))))
(defun scan-character-literal-without-checking-end ()
(let* ((ch (get-ch))
(_ (cl-assert (= (ch-ichar ch) ?\N{APOSTROPHE})))
(line-no (ch-line-no ch))
(column-no (ch-column-no ch))
(position (list line-no column-no))
(ch1 (get-ch)))
(pcase (ch-ichar ch1)
('eoi (apply 'unterminated-character-literal position))
(?\\ (let ((ch2 (get-ch)))
(pcase (ch-ichar ch2)
('eoi (apply 'unterminated-character-literal position))
(?n `("Integer" ,(format "%d" ?\n) . ,position))
(?\\ `("Integer" ,(format "%d" ?\\) . ,position))
(c (unsupported-escape (ch-line-no ch1)
(ch-column-no ch1)
c)))))
(c `("Integer" ,(format "%d" c) . ,position)))))
(defun spacep (c)
(and (integerp c) (or (= c ?\N{SPACE})
(and (<= 9 c) (<= c 13)))))
(defun digitp (c)
(and (integerp c) (<= ?0 c) (<= c ?9)))
(defun lowerp (c)
;; Warning: in EBCDIC, this kind of test for "alphabetic" is no
;; good. The letters are not contiguous.
(and (integerp c) (<= ?a c) (<= c ?z)))
(defun upperp (c)
;; Warning: in EBCDIC, this kind of test for "alphabetic" is no
;; good. The letters are not contiguous.
(and (integerp c) (<= ?A c) (<= c ?Z)))
(defun alphap (c)
(or (lowerp c) (upperp c)))
(defun identifier-start-p (c)
(and (integerp c) (or (alphap c) (= c ?_))))
(defun identifier-continuation-p (c)
(and (integerp c) (or (alphap c) (= c ?_) (digitp c))))
(defun all-digits-p (thing)
(cl-loop for c in thing
if (not (digitp c)) return nil
finally return t))
(defun list-to-string (lst)
"Convert a list of characters to a string."
(apply 'string lst))
(defun flatten (lst)
"Flatten nested lists. (The implementation is recursive and not
for very long lists.)"
(pcase lst
('() '())
(`(,head . ,tail)
(if (listp head)
(append (flatten head) (flatten tail))
(cons head (flatten tail))))))
(defun unexpected-character (line-no column-no c)
(error (format "unexpected character '%c' at %d:%d"
c line-no column-no)))
(defun unsupported-escape (line-no column-no c)
(error (format "unsupported escape \\%c at %d:%d"
c line-no column-no)))
(defun illegal-integer-literal (line-no column-no s)
(error (format "illegal integer literal \"%s\" at %d:%d"
s line-no column-no)))
(defun unterminated-character-literal (line-no column-no)
(error (format "unterminated character literal starting at %d:%d"
line-no column-no)))
(defun multicharacter-literal (line-no column-no)
(error (format
"unsupported multicharacter literal starting at %d:%d"
line-no column-no)))
(defun end-of-input-in-string-literal (line-no column-no)
(error (format "end of input in string literal starting at %d:%d"
line-no column-no)))
(defun end-of-line-in-string-literal (line-no column-no)
(error (format "end of line in string literal starting at %d:%d"
line-no column-no)))
(defun unterminated-comment (line-no column-no)
(error (format "unterminated comment starting at %d:%d"
line-no column-no)))
(defun main ()
(setq inp (make-inp t))
(scan-text t))
(main)
- Output:
$ ./lex-in-el < compiler-tests/testcase3.t
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
Erlang
#!/bin/env escript
%%%-------------------------------------------------------------------
-record (inp_t, {inpf, pushback, line_no, column_no}).
main (Args) ->
main_program (Args).
main_program ([]) ->
scan_from_inpf_to_outf ("-", "-"),
halt (0);
main_program ([Inpf_filename]) ->
scan_from_inpf_to_outf (Inpf_filename, "-"),
halt (0);
main_program ([Inpf_filename, Outf_filename]) ->
scan_from_inpf_to_outf (Inpf_filename, Outf_filename),
halt (0);
main_program ([_, _ | _]) ->
ProgName = escript:script_name (),
io:put_chars (standard_error, "Usage: "),
io:put_chars (standard_error, ProgName),
io:put_chars (standard_error, " [INPUTFILE [OUTPUTFILE]]\n"),
halt (1).
scan_from_inpf_to_outf ("-", "-") ->
scan_input (standard_io, standard_io);
scan_from_inpf_to_outf (Inpf_filename, "-") ->
case file:open (Inpf_filename, [read]) of
{ok, Inpf} -> scan_input (Inpf, standard_io);
_ -> open_failure (Inpf_filename, "input")
end;
scan_from_inpf_to_outf ("-", Outf_filename) ->
case file:open (Outf_filename, [write]) of
{ok, Outf} -> scan_input (standard_io, Outf);
_ -> open_failure (Outf_filename, "output")
end;
scan_from_inpf_to_outf (Inpf_filename, Outf_filename) ->
case file:open(Inpf_filename, [read]) of
{ok, Inpf} ->
case file:open (Outf_filename, [write]) of
{ok, Outf} -> scan_input (Inpf, Outf);
_ -> open_failure (Outf_filename, "output")
end;
_ -> open_failure (Inpf_filename, "input")
end.
open_failure (Filename, ForWhat) ->
ProgName = escript:script_name (),
io:put_chars (standard_error, ProgName),
io:put_chars (standard_error, ": failed to open \""),
io:put_chars (standard_error, Filename),
io:put_chars (standard_error, "\" for "),
io:put_chars (standard_error, ForWhat),
io:put_chars (standard_error, "\n"),
halt (1).
scan_input (Inpf, Outf) ->
scan_text (Outf, make_inp (Inpf)).
scan_text (Outf, Inp) ->
{TokTup, Inp1} = get_next_token (Inp),
print_token (Outf, TokTup),
case TokTup of
{"End_of_input", _, _, _} -> ok;
_ -> scan_text (Outf, Inp1)
end.
print_token (Outf, {Tok, Arg, Line_no, Column_no}) ->
S_line_no = erlang:integer_to_list (Line_no),
S_column_no = erlang:integer_to_list (Column_no),
io:put_chars (Outf, string:pad (S_line_no, 5, leading)),
io:put_chars (Outf, " "),
io:put_chars (Outf, string:pad (S_column_no, 5, leading)),
io:put_chars (Outf, " "),
io:put_chars (Outf, Tok),
{Padding, Arg1} =
case Tok of
"Identifier" -> {" ", Arg};
"Integer" -> {" ", Arg};
"String" -> {" ", Arg};
_ -> {"", ""}
end,
io:put_chars (Outf, Padding),
io:put_chars (Outf, Arg1),
io:put_chars ("\n").
%%%-------------------------------------------------------------------
%%%
%%% The token dispatcher.
%%%
get_next_token (Inp) ->
Inp00 = skip_spaces_and_comments (Inp),
{Ch, Inp0} = get_ch (Inp00),
{Char, Line_no, Column_no} = Ch,
Ln = Line_no,
Cn = Column_no,
case Char of
eof -> {{"End_of_input", "", Ln, Cn}, Inp0};
"," -> {{"Comma", ",", Ln, Cn}, Inp0};
";" -> {{"Semicolon", ";", Ln, Cn}, Inp0};
"(" -> {{"LeftParen", "(", Ln, Cn}, Inp0};
")" -> {{"RightParen", ")", Ln, Cn}, Inp0};
"{" -> {{"LeftBrace", "{", Ln, Cn}, Inp0};
"}" -> {{"RightBrace", "}", Ln, Cn}, Inp0};
"*" -> {{"Op_multiply", "*", Ln, Cn}, Inp0};
"/" -> {{"Op_divide", "/", Ln, Cn}, Inp0};
"%" -> {{"Op_mod", "%", Ln, Cn}, Inp0};
"+" -> {{"Op_add", "+", Ln, Cn}, Inp0};
"-" -> {{"Op_subtract", "-", Ln, Cn}, Inp0};
"<" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"=" -> {{"Op_lessequal", "<=", Ln, Cn}, Inp1};
_ -> {{"Op_less", "<", Ln, Cn}, push_back (Ch1, Inp1)}
end;
">" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"=" -> {{"Op_greaterequal", ">=", Ln, Cn}, Inp1};
_ -> {{"Op_greater", ">", Ln, Cn}, push_back (Ch1, Inp1)}
end;
"=" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"=" -> {{"Op_equal", "==", Ln, Cn}, Inp1};
_ -> {{"Op_assign", "=", Ln, Cn}, push_back (Ch1, Inp1)}
end;
"!" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"=" -> {{"Op_notequal", "!=", Ln, Cn}, Inp1};
_ -> {{"Op_not", "!", Ln, Cn}, push_back (Ch1, Inp1)}
end;
"&" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"&" -> {{"Op_and", "&&", Ln, Cn}, Inp1};
_ -> unexpected_character (Ln, Cn, Char)
end;
"|" ->
{Ch1, Inp1} = get_ch (Inp0),
{Char1, _, _} = Ch1,
case Char1 of
"|" -> {{"Op_or", "||", Ln, Cn}, Inp1};
_ -> unexpected_character (Ln, Cn, Char)
end;
"\"" ->
Inp1 = push_back (Ch, Inp0),
scan_string_literal (Inp1);
"'" ->
Inp1 = push_back (Ch, Inp0),
scan_character_literal (Inp1);
_ ->
case is_digit (Char) of
true ->
Inp1 = push_back (Ch, Inp0),
scan_integer_literal (Inp1);
false ->
case is_alpha_or_underscore (Char) of
true ->
Inp1 = push_back (Ch, Inp0),
scan_identifier_or_reserved_word (Inp1);
false ->
unexpected_character (Ln, Cn, Char)
end
end
end.
%%%-------------------------------------------------------------------
%%%
%%% Skipping past spaces and /* ... */ comments.
%%%
%%% Comments are treated exactly like a bit of whitespace. They never
%%% make it to the dispatcher.
%%%
skip_spaces_and_comments (Inp) ->
{Ch, Inp0} = get_ch (Inp),
{Char, Line_no, Column_no} = Ch,
case classify_char (Char) of
eof -> push_back (Ch, Inp0);
space -> skip_spaces_and_comments (Inp0);
slash ->
{Ch1, Inp1} = get_ch (Inp0),
case Ch1 of
{"*", _, _} ->
Inp2 = scan_comment (Inp1, Line_no, Column_no),
skip_spaces_and_comments (Inp2);
_ -> push_back (Ch, (push_back (Ch1, Inp1)))
end;
other -> push_back (Ch, Inp0)
end.
classify_char (Char) ->
case Char of
eof -> eof;
"/" -> slash;
_ -> case is_space (Char) of
true -> space;
false -> other
end
end.
scan_comment (Inp, Line_no, Column_no) ->
{Ch0, Inp0} = get_ch (Inp),
case Ch0 of
{eof, _, _} -> unterminated_comment (Line_no, Column_no);
{"*", _, _} ->
{Ch1, Inp1} = get_ch (Inp0),
case Ch1 of
{eof, _, _} ->
unterminated_comment (Line_no, Column_no);
{"/", _, _} -> Inp1;
_ -> scan_comment (Inp1, Line_no, Column_no)
end;
_ -> scan_comment (Inp0, Line_no, Column_no)
end.
is_space (S) ->
case re:run (S, "^[[:space:]]+$") of
{match, _} -> true;
_ -> false
end.
%%%-------------------------------------------------------------------
%%%
%%% Scanning of integer literals, identifiers, and reserved words.
%%%
%%% These three types of token are very similar to each other.
%%%
scan_integer_literal (Inp) ->
%% Scan an entire word, not just digits. This way we detect
%% erroneous text such as "23skidoo".
{Line_no, Column_no, Inp1} = get_position (Inp),
{Word, Inp2} = scan_word (Inp1),
case is_digit (Word) of
true -> {{"Integer", Word, Line_no, Column_no}, Inp2};
false -> invalid_integer_literal (Line_no, Column_no, Word)
end.
scan_identifier_or_reserved_word (Inp) ->
%% It is assumed that the first character is of the correct type,
%% thanks to the dispatcher.
{Line_no, Column_no, Inp1} = get_position (Inp),
{Word, Inp2} = scan_word (Inp1),
Tok =
case Word of
"if" -> "Keyword_if";
"else" -> "Keyword_else";
"while" -> "Keyword_while";
"print" -> "Keyword_print";
"putc" -> "Keyword_putc";
_ -> "Identifier"
end,
{{Tok, Word, Line_no, Column_no}, Inp2}.
scan_word (Inp) ->
scan_word_loop (Inp, "").
scan_word_loop (Inp, Word0) ->
{Ch1, Inp1} = get_ch (Inp),
{Char1, _, _} = Ch1,
case is_alnum_or_underscore (Char1) of
true -> scan_word_loop (Inp1, Word0 ++ Char1);
false -> {Word0, push_back (Ch1, Inp1)}
end.
get_position (Inp) ->
{Ch1, Inp1} = get_ch (Inp),
{_, Line_no, Column_no} = Ch1,
Inp2 = push_back (Ch1, Inp1),
{Line_no, Column_no, Inp2}.
is_digit (S) ->
case re:run (S, "^[[:digit:]]+$") of
{match, _} -> true;
_ -> false
end.
is_alpha_or_underscore (S) ->
case re:run (S, "^[[:alpha:]_]+$") of
{match, _} -> true;
_ -> false
end.
is_alnum_or_underscore (S) ->
case re:run (S, "^[[:alnum:]_]+$") of
{match, _} -> true;
_ -> false
end.
%%%-------------------------------------------------------------------
%%%
%%% Scanning of string literals.
%%%
%%% It is assumed that the first character is the opening quote, and
%%% that the closing quote is the same character.
%%%
scan_string_literal (Inp) ->
{Ch1, Inp1} = get_ch (Inp),
{Quote_mark, Line_no, Column_no} = Ch1,
{Contents, Inp2} = scan_str_lit (Inp1, Ch1),
Toktup = {"String", Quote_mark ++ Contents ++ Quote_mark,
Line_no, Column_no},
{Toktup, Inp2}.
scan_str_lit (Inp, Ch) -> scan_str_lit_loop (Inp, Ch, "").
scan_str_lit_loop (Inp, Ch, Contents) ->
{Quote_mark, Line_no, Column_no} = Ch,
{Ch1, Inp1} = get_ch (Inp),
{Char1, Line_no1, Column_no1} = Ch1,
case Char1 of
Quote_mark -> {Contents, Inp1};
eof -> eoi_in_string_literal (Line_no, Column_no);
"\n" -> eoln_in_string_literal (Line_no, Column_no);
"\\" ->
{Ch2, Inp2} = get_ch (Inp1),
{Char2, _, _} = Ch2,
case Char2 of
"n" ->
scan_str_lit_loop (Inp2, Ch, Contents ++ "\\n");
"\\" ->
scan_str_lit_loop (Inp2, Ch, Contents ++ "\\\\");
_ ->
unsupported_escape (Line_no1, Column_no1, Char2)
end;
_ -> scan_str_lit_loop (Inp1, Ch, Contents ++ Char1)
end.
%%%-------------------------------------------------------------------
%%%
%%% Scanning of character literals.
%%%
%%% It is assumed that the first character is the opening quote, and
%%% that the closing quote is the same character.
%%%
%%% The tedious part of scanning a character literal is distinguishing
%%% between the kinds of lexical error. (One might wish to modify the
%%% code to detect, as a distinct kind of error, end of line within a
%%% character literal.)
%%%
scan_character_literal (Inp) ->
{Ch, Inp0} = get_ch (Inp),
{_, Line_no, Column_no} = Ch,
{Ch1, Inp1} = get_ch (Inp0),
{Char1, Line_no1, Column_no1} = Ch1,
{Intval, Inp3} =
case Char1 of
eof -> unterminated_character_literal (Line_no, Column_no);
"\\" ->
{Ch2, Inp2} = get_ch (Inp1),
{Char2, _, _} = Ch2,
case Char2 of
eof -> unterminated_character_literal (Line_no,
Column_no);
"n" -> {char_to_code ("\n"), Inp2};
"\\" -> {char_to_code ("\\"), Inp2};
_ -> unsupported_escape (Line_no1, Column_no1,
Char2)
end;
_ -> {char_to_code (Char1), Inp1}
end,
Inp4 = check_character_literal_end (Inp3, Ch),
{{"Integer", Intval, Line_no, Column_no}, Inp4}.
char_to_code (Char) ->
%% Hat tip to https://archive.ph/BxZRS
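   %% (Added note) For a one-character string this yields the decimal code
   %% point as a string, e.g. char_to_code ("\n") gives "10" and
   %% char_to_code ("A") gives "65".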
lists:flatmap (fun erlang:integer_to_list/1, Char).
check_character_literal_end (Inp, Ch) ->
{Char, _, _} = Ch,
{{Char1, _, _}, Inp1} = get_ch (Inp),
case Char1 of
Char -> Inp1;
_ -> find_char_lit_end (Inp1, Ch) % Handle a lexical error.
end.
find_char_lit_end (Inp, Ch) ->
%% There is a lexical error. Determine which kind it fits into.
{Char, Line_no, Column_no} = Ch,
{{Char1, _, _}, Inp1} = get_ch (Inp),
case Char1 of
Char -> multicharacter_literal (Line_no, Column_no);
eof -> unterminated_character_literal (Line_no, Column_no);
_ -> find_char_lit_end (Inp1, Ch)
end.
%%%-------------------------------------------------------------------
%%%
%%% Character-at-a-time input, with unrestricted pushback, and with
%%% line and column numbering.
%%%
make_inp (Inpf) ->
#inp_t{inpf = Inpf,
pushback = [],
line_no = 1,
column_no = 1}.
get_ch (Inp) ->
#inp_t{inpf = Inpf,
pushback = Pushback,
line_no = Line_no,
column_no = Column_no} = Inp,
case Pushback of
[Ch | Tail] ->
Inp1 = Inp#inp_t{pushback = Tail},
{Ch, Inp1};
[] ->
case io:get_chars (Inpf, "", 1) of
eof ->
Ch = {eof, Line_no, Column_no},
{Ch, Inp};
{error, _} ->
Ch = {eof, Line_no, Column_no},
{Ch, Inp};
Char ->
case Char of
"\n" ->
Ch = {Char, Line_no, Column_no},
Inp1 = Inp#inp_t{line_no = Line_no + 1,
column_no = 1},
{Ch, Inp1};
_ ->
Ch = {Char, Line_no, Column_no},
Inp1 =
Inp#inp_t{column_no = Column_no + 1},
{Ch, Inp1}
end
end
end.
push_back (Ch, Inp) ->
Inp#inp_t{pushback = [Ch | Inp#inp_t.pushback]}.
%%%-------------------------------------------------------------------
invalid_integer_literal (Line_no, Column_no, Word) ->
error_abort ("invalid integer literal \"" ++
Word ++ "\" at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
unsupported_escape (Line_no, Column_no, Char) ->
error_abort ("unsupported escape \\" ++
Char ++ " at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
unexpected_character (Line_no, Column_no, Char) ->
error_abort ("unexpected character '" ++
Char ++ "' at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
eoi_in_string_literal (Line_no, Column_no) ->
error_abort ("end of input in string literal starting at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
eoln_in_string_literal (Line_no, Column_no) ->
error_abort ("end of line in string literal starting at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
unterminated_character_literal (Line_no, Column_no) ->
error_abort ("unterminated character literal starting at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
multicharacter_literal (Line_no, Column_no) ->
error_abort ("unsupported multicharacter literal starting at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
unterminated_comment (Line_no, Column_no) ->
error_abort ("unterminated comment starting at " ++
integer_to_list (Line_no) ++ ":" ++
integer_to_list (Column_no)).
error_abort (Message) ->
ProgName = escript:script_name (),
io:put_chars (standard_error, ProgName),
io:put_chars (standard_error, ": "),
io:put_chars (standard_error, Message),
io:put_chars (standard_error, "\n"),
halt (1).
%%%-------------------------------------------------------------------
%%% Instructions to GNU Emacs --
%%% local variables:
%%% mode: erlang
%%% erlang-indent-level: 3
%%% end:
%%%-------------------------------------------------------------------
- Output:
$ ./lex-in-Erlang compiler-tests/testcase3.t
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
Euphoria
Tested with Euphoria 4.05.
include std/io.e
include std/map.e
include std/types.e
include std/convert.e
constant true = 1, false = 0, EOF = -1
enum tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
tk_Ident, tk_Integer, tk_String
constant all_syms = {"End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add",
"Op_subtract", "Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater",
"Op_greaterequal", "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or",
"Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
"LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
"Identifier", "Integer", "String"}
integer input_file, the_ch = ' ', the_col = 0, the_line = 1
sequence symbols
map key_words = new()
procedure error(sequence format, sequence data)
printf(STDOUT, format, data)
abort(1)
end procedure
-- get the next character from the input
function next_ch()
the_ch = getc(input_file)
the_col += 1
if the_ch = '\n' then
the_line += 1
the_col = 0
end if
return the_ch
end function
-- 'x' - character constants
function char_lit(integer err_line, integer err_col)
integer n = next_ch() -- skip opening quote
if the_ch = '\'' then
error("%d %d empty character constant", {err_line, err_col})
elsif the_ch = '\\' then
next_ch()
if the_ch = 'n' then
n = 10
elsif the_ch = '\\' then
n = '\\'
else
error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch})
end if
end if
if next_ch() != '\'' then
error("%d %d multi-character constant", {err_line, err_col})
end if
next_ch()
return {tk_Integer, err_line, err_col, n}
end function
-- process divide or comments
function div_or_cmt(integer err_line, integer err_col)
if next_ch() != '*' then
return {tk_Div, err_line, err_col}
end if
-- comment found
next_ch()
while true do
if the_ch = '*' then
if next_ch() = '/' then
next_ch()
return get_tok()
end if
elsif the_ch = EOF then
error("%d %d EOF in comment", {err_line, err_col})
else
next_ch()
end if
end while
end function
-- "string"
function string_lit(integer start, integer err_line, integer err_col)
string text = ""
while next_ch() != start do
if the_ch = EOF then
error("%d %d EOF while scanning string literal", {err_line, err_col})
end if
if the_ch = '\n' then
error("%d %d EOL while scanning string literal", {err_line, err_col})
end if
text &= the_ch
end while
next_ch()
return {tk_String, err_line, err_col, text}
end function
-- handle identifiers and integers
function ident_or_int(integer err_line, integer err_col)
integer n, is_number = true
string text = ""
while t_alnum(the_ch) or the_ch = '_' do
text &= the_ch
if not t_digit(the_ch) then
is_number = false
end if
next_ch()
end while
if length(text) = 0 then
error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch})
end if
if t_digit(text[1]) then
if not is_number then
error("%d %d invalid number: %s", {err_line, err_col, text})
end if
n = to_integer(text)
return {tk_Integer, err_line, err_col, n}
end if
if has(key_words, text) then
return {get(key_words, text), err_line, err_col}
end if
return {tk_Ident, err_line, err_col, text}
end function
-- look ahead for '>=', etc.
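-- (added note) e.g. for '<' the dispatcher calls follow('=', tk_Leq, tk_Lss, ...):
-- "<=" yields tk_Leq, while a lone '<' yields tk_Lss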
function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)
if next_ch() = expect then
next_ch()
return {ifyes, err_line, err_col}
end if
if ifno = tk_EOI then
error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch})
end if
return {ifno, err_line, err_col}
end function
-- return the next token type
function get_tok()
while t_space(the_ch) do
next_ch()
end while
integer err_line = the_line
integer err_col = the_col
switch the_ch do
case EOF then return {tk_EOI, err_line, err_col}
case '/' then return div_or_cmt(err_line, err_col)
case '\'' then return char_lit(err_line, err_col)
case '<' then return follow('=', tk_Leq, tk_Lss, err_line, err_col)
case '>' then return follow('=', tk_Geq, tk_Gtr, err_line, err_col)
case '=' then return follow('=', tk_Eq, tk_Assign, err_line, err_col)
case '!' then return follow('=', tk_Neq, tk_Not, err_line, err_col)
case '&' then return follow('&', tk_And, tk_EOI, err_line, err_col)
case '|' then return follow('|', tk_Or, tk_EOI, err_line, err_col)
case '"' then return string_lit(the_ch, err_line, err_col)
case else
integer sym = symbols[the_ch]
if sym != tk_EOI then
next_ch()
return {sym, err_line, err_col}
end if
return ident_or_int(err_line, err_col)
end switch
end function
procedure init()
put(key_words, "else", tk_Else)
put(key_words, "if", tk_If)
put(key_words, "print", tk_Print)
put(key_words, "putc", tk_Putc)
put(key_words, "while", tk_While)
symbols = repeat(tk_EOI, 256)
symbols['{'] = tk_Lbrace
symbols['}'] = tk_Rbrace
symbols['('] = tk_Lparen
symbols[')'] = tk_Rparen
symbols['+'] = tk_Add
symbols['-'] = tk_Sub
symbols['*'] = tk_Mul
symbols['%'] = tk_Mod
symbols[';'] = tk_Semi
symbols[','] = tk_Comma
end procedure
procedure main(sequence cl)
sequence file_name
input_file = STDIN
if length(cl) > 2 then
file_name = cl[3]
input_file = open(file_name, "r")
if input_file = -1 then
error("Could not open %s", {file_name})
end if
end if
init()
sequence t
loop do
t = get_tok()
printf(STDOUT, "%5d %5d %-8s", {t[2], t[3], all_syms[t[1]]})
switch t[1] do
case tk_Integer then printf(STDOUT, " %5d\n", {t[4]})
case tk_Ident then printf(STDOUT, " %s\n", {t[4]})
case tk_String then printf(STDOUT, " \"%s\"\n", {t[4]})
case else printf(STDOUT, "\n")
end switch
until t[1] = tk_EOI
end loop
end procedure
main(command_line())
- Output — test case 3:
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
Flex
Tested with Flex 2.5.4.
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
#define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
typedef enum {
tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While,
tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma,
tk_Ident, tk_Integer, tk_String
} TokenType;
void yyerror(char msg[]) {
printf(msg);
exit(1);
}
static int yynval;
struct yylloc {
int first_line, first_col;
int last_line, last_col;
} yylloc;
static void update_loc() {
static int curr_line = 1;
static int curr_col = 1;
yylloc.first_line = curr_line;
yylloc.first_col = curr_col;
{char *s; for (s = yytext; *s != '\0'; s++) {
if (*s == '\n') {
curr_line++;
curr_col = 1;
} else {
curr_col++;
}
}}
yylloc.last_line = curr_line;
yylloc.last_col = curr_col-1;
}
#define YY_USER_ACTION update_loc();
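/* (Added note) Flex runs YY_USER_ACTION before each matched rule's action,
   so yylloc always describes the token that is about to be returned. */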
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct {
char *s;
TokenType sym;
} kwds[] = {
{"else", tk_Else},
{"if", tk_If},
{"print", tk_Print},
{"putc", tk_Putc},
{"while", tk_While},
}, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
%}
%start COMMENT2
%option noyywrap
digit [0-9]
ident [a-zA-Z_][a-zA-Z_0-9]*
number {digit}+
string \"[^"\n]*\"
char_const \'([^'\n]|\\n|\\\\)\'
%%
<COMMENT2>[^*]+ ;
<COMMENT2>\*[^/] ;
<COMMENT2>\*\/ BEGIN 0; /* end comment */
"/*" BEGIN COMMENT2;
"{" {return tk_Lbrace;}
"}" {return tk_Rbrace;}
"(" {return tk_Lparen;}
")" {return tk_Rparen;}
"*" {return tk_Mul;}
"/" {return tk_Div;}
"%" {return tk_Mod;}
"+" {return tk_Add;}
"-" {return tk_Sub;}
"<" {return tk_Lss;}
">" {return tk_Gtr;}
"<=" {return tk_Leq;}
">=" {return tk_Geq;}
"!=" {return tk_Neq;}
"!" {return tk_Not;}
"&&" {return tk_And;}
"||" {return tk_Or;}
";" {return tk_Semi;}
"," {return tk_Comma;}
"==" {return tk_Eq;}
"=" {return tk_Assign;}
{ident} {return get_ident_type(yytext);}
{string} {return tk_String;}
[ \t\n]+ ; /* ignore whitespace */
{number} {
yynval = strtol(yytext, NULL, 0);
if (yynval == LONG_MAX && errno == ERANGE)
yyerror("Number exceeds maximum value");
return tk_Integer;
}
{char_const} {
int n = yytext[1];
char *p = yytext;
if (yyleng < 3)
yyerror("empty character constant");
++p;
if (p[0] == '\\') {
++p;
if (p[0] == 'n')
n = 10;
else if (p[0] == '\\')
n = '\\';
else
yyerror("unknown escape sequence");
}
yynval = n;
return tk_Integer;
}
. yyerror("Unknown character\n");
%%
int main(int argc, char *argv[]) {
int tok;
++argv, --argc; /* skip over program name */
yyin = stdin;
if (argc > 0)
yyin = fopen(argv[0], "r");
do {
tok = yylex();
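        /* (Added note) Each token name in the string constant below occupies a
           16-character field, so indexing it at tok * 16 and printing with
           %.15s selects the name for this token. */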
printf("%5d %5d %.15s", yylloc.first_line, yylloc.first_col,
&"End_of_input Op_multiply Op_divide Op_mod Op_add "
"Op_subtract Op_negate Op_not Op_less Op_lessequal "
"Op_greater Op_greaterequal Op_equal Op_notequal Op_assign "
"Op_and Op_or Keyword_if Keyword_else Keyword_while "
"Keyword_print Keyword_putc LeftParen RightParen LeftBrace "
"RightBrace Semicolon Comma Identifier Integer "
"String "
[tok * 16]);
if (tok == tk_Integer) printf(" %5d", yynval);
else if (tok == tk_Ident) printf(" %s", yytext);
else if (tok == tk_String) printf(" %s", yytext);
printf("\n");
} while (tok != tk_EOI);
return 0;
}
- Output — test case 3:
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
22 29 End_of_input
Forth
Tested with Gforth 0.7.3.
CREATE BUF 0 , \ single-character look-ahead buffer
CREATE COLUMN# 0 ,
CREATE LINE# 1 ,
: NEWLINE? ( c -- t|f) DUP 10 = SWAP 13 = OR ;
: +IN ( c --)
1 SWAP NEWLINE?
IF 0 COLUMN# ! LINE# ELSE COLUMN# THEN
+! 0 BUF ! ;
: PEEK BUF @ 0= IF STDIN KEY-FILE BUF ! THEN BUF @ ;
: GETC PEEK DUP +IN ;
: SKIP GETC DROP ;
: .LOCATION 7 .R 4 .R SPACE ;
: WHERE COLUMN# @ LINE# @ ;
: .WHERE WHERE .LOCATION ;
: .WHERE+ WHERE SWAP 1+ SWAP .LOCATION ;
: EXPECT GETC OVER OVER =
IF 2DROP
ELSE CR ." stdin:" COLUMN# @ 0 LINE# @ 0
<# #s #> TYPE ." :" <# #s #> TYPE ." : "
." unexpected `" EMIT ." ', expecting `" EMIT ." '" CR
BYE
THEN ;
: EQ PEEK [CHAR] = = IF SKIP 2SWAP THEN
." Op_" TYPE CR 2DROP ;
CREATE ESC 4 C, CHAR $ C, CHAR $ C, CHAR \ C, 0 C,
: ?ESC? CR ." Unknown escape sequence `\" EMIT ." '" CR BYE ;
: >ESC ESC 4 + C! ESC ;
: $$\n 10 ;
: $$\\ [CHAR] \ ;
: ESCAPE DUP >ESC FIND IF NIP EXECUTE ELSE DROP ?ESC? THEN ;
: ?ESCAPE DUP [CHAR] \ = IF DROP GETC ESCAPE THEN ;
: ?EOF DUP 4 = IF CR ." End-of-file in string" CR BYE THEN ;
: ?EOL DUP NEWLINE?
IF CR ." End-of-line in string" CR BYE THEN ;
: STRING PAD
BEGIN GETC ?EOF ?EOL DUP [CHAR] " <>
WHILE OVER C! CHAR+
REPEAT DROP PAD TUCK - ;
: "TYPE" [CHAR] " EMIT TYPE [CHAR] " EMIT ;
CREATE TOKEN 4 C, CHAR $ C, CHAR $ C, 0 C, 0 C,
: >HEX DUP 9 > IF 7 + THEN [CHAR] 0 + ;
: HI! $F0 AND 2/ 2/ 2/ 2/ >HEX TOKEN 3 + C! ;
: LO! $0F AND >HEX TOKEN 4 + C! ;
: >TOKEN DUP HI! LO! TOKEN ;
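\ (Added note) Dispatch trick: >TOKEN turns the next input character into the
\ name of a word such as $$2A, which CONSUME looks up with FIND and EXECUTEs;
\ the same scheme handles escapes ($$\n, via ESC) and keywords ($$if, via KEYWORD).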
: ?EOF DUP 4 = IF CR ." End-of-file in comment" CR BYE THEN ;
: $$2F PEEK [CHAR] * =
IF SKIP
BEGIN
GETC ?EOF [CHAR] * =
PEEK [CHAR] / = AND
UNTIL SKIP
ELSE .WHERE ." Op_divide" CR THEN ;
: $$22 .WHERE ." String " STRING "TYPE" CR ;
: $$27 .WHERE GETC ?ESCAPE ." Integer " . [CHAR] ' EXPECT CR ;
: $$04 .WHERE ." End_of_input" CR BYE ;
: $$2D .WHERE ." Op_subtract" CR ;
: $$2B .WHERE ." Op_add" CR ;
: $$25 .WHERE ." Op_mod" CR ;
: $$2A .WHERE ." Op_multiply" CR ;
: $$7B .WHERE ." LeftBrace" CR ;
: $$7D .WHERE ." RightBrace" CR ;
: $$2C .WHERE ." Comma" CR ;
: $$29 .WHERE ." RightParen" CR ;
: $$28 .WHERE ." LeftParen" CR ;
: $$3B .WHERE ." Semicolon" CR ;
: $$3D .WHERE s" equal" s" assign" EQ ;
: $$21 .WHERE s" notequal" s" not" EQ ;
: $$3C .WHERE s" lessequal" s" less" EQ ;
: $$3E .WHERE s" greaterequal" s" greater" EQ ;
: $$26 .WHERE [CHAR] & EXPECT ." Op_and" CR ;
: $$7C .WHERE [CHAR] | EXPECT ." Op_or" CR ;
: $$20 ; \ space
CREATE KEYWORD 0 C, CHAR $ C, CHAR $ C, 5 CHARS ALLOT
: >KEYWORD DUP 2 + KEYWORD C!
KEYWORD 3 + SWAP CMOVE KEYWORD ;
: FIND-KW DUP 5 <=
IF 2DUP >KEYWORD FIND
IF TRUE 2SWAP 2DROP ELSE DROP FALSE THEN
ELSE FALSE THEN ;
: $$if ." Keyword_if" ;
: $$else ." Keyword_else" ;
: $$while ." Keyword_while" ;
: $$print ." Keyword_print" ;
: $$putc ." Keyword_putc" ;
: DIGIT? 48 58 WITHIN ;
: ALPHA? DUP 95 = SWAP \ underscore?
DUP 97 123 WITHIN SWAP \ lower?
65 91 WITHIN OR OR ; \ upper?
: ALNUM? DUP DIGIT? SWAP ALPHA? OR ;
: INTEGER 0
BEGIN PEEK DIGIT?
WHILE GETC [CHAR] 0 - SWAP 10 * +
REPEAT ;
: ?INTEGER? CR ." Invalid number" CR BYE ;
: ?INTEGER PEEK ALPHA? IF ?INTEGER? THEN ;
: DIGIT .WHERE+ ." Integer " INTEGER ?INTEGER . CR ;
: NAME PAD
BEGIN PEEK ALNUM?
WHILE GETC OVER C! CHAR+
REPEAT PAD TUCK - ;
: IDENT ." Identifier " TYPE ;
: ALPHA .WHERE+ NAME FIND-KW
IF EXECUTE ELSE IDENT THEN CR ;
: ?CHAR? CR ." Character '" EMIT ." ' not recognized" CR BYE ;
: SPACE? DUP BL = SWAP 9 14 WITHIN OR ;
: SKIP-SPACE BEGIN PEEK SPACE? WHILE SKIP REPEAT ;
: CONSUME
SKIP-SPACE
PEEK DIGIT? IF DIGIT ELSE
PEEK ALPHA? IF ALPHA ELSE
PEEK >TOKEN FIND
IF SKIP EXECUTE ELSE GETC ?CHAR? BYE THEN
THEN THEN ;
: TOKENIZE BEGIN CONSUME AGAIN ;
TOKENIZE
- Output:
Tested against all programs in Compiler/Sample programs.
Fortran
You should call the source file ‘lex.F90’, so gfortran will know to use the C preprocessor. I use the preprocessor to select between different ways to read stream input from the standard input.
(Despite the ‘.F90’ extension that I recommend, this is Fortran 2008/2018 code.)
There is a ‘framework’ for supporting Unicode, but no actual Unicode support. To support Unicode reliably I would probably use the C interface and GNU libunistring.
The author has placed this Fortran code in the public domain.
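A minimal build-and-run sketch (an illustration, not part of the original submission; the file name lex.F90 and the test-case name are placeholders, and the flag trade-offs are explained in the comments near get_input_unit_char below):
$ gfortran -O2 -o lex lex.F90                           # default build; __GFORTRAN__ selects the FGETC path
$ gfortran -std=f2018 -fall-intrinsics -o lex lex.F90   # strict standard mode, keeping the FGETC path
$ ./lex < testcase3.t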
!!!
!!! An implementation of the Rosetta Code lexical analyzer task:
!!! https://rosettacode.org/wiki/Compiler/lexical_analyzer
!!!
!!! The C implementation was used as a reference on behavior, but was
!!! not adhered to for the implementation.
!!!
module string_buffers
use, intrinsic :: iso_fortran_env, only: error_unit
use, intrinsic :: iso_fortran_env, only: int64
implicit none
private
public :: strbuf_t
public :: strbuf_t_length_kind
public :: strbuf_t_character_kind
integer, parameter :: strbuf_t_length_kind = int64
! String buffers can handle Unicode.
integer, parameter :: strbuf_t_character_kind = selected_char_kind ('ISO_10646')
! Private abbreviations.
integer, parameter :: nk = strbuf_t_length_kind
integer, parameter :: ck = strbuf_t_character_kind
type :: strbuf_t
integer(kind = nk), private :: len = 0
!
! ‘chars’ is made public for efficient access to the individual
! characters.
!
character(1, kind = ck), allocatable, public :: chars(:)
contains
procedure, pass, private :: ensure_storage => strbuf_t_ensure_storage
procedure, pass :: to_unicode => strbuf_t_to_unicode
procedure, pass :: length => strbuf_t_length
procedure, pass :: set => strbuf_t_set
procedure, pass :: append => strbuf_t_append
generic :: assignment(=) => set
end type strbuf_t
contains
function strbuf_t_to_unicode (strbuf) result (s)
class(strbuf_t), intent(in) :: strbuf
character(:, kind = ck), allocatable :: s
!
! This does not actually ensure that the string is valid Unicode;
! any 31-bit ‘character’ is supported.
!
integer(kind = nk) :: i
allocate (character(len = strbuf%len, kind = ck) :: s)
do i = 1, strbuf%len
s(i:i) = strbuf%chars(i)
end do
end function strbuf_t_to_unicode
elemental function strbuf_t_length (strbuf) result (n)
class(strbuf_t), intent(in) :: strbuf
integer(kind = nk) :: n
n = strbuf%len
end function strbuf_t_length
elemental function next_power_of_two (x) result (y)
integer(kind = nk), intent(in) :: x
integer(kind = nk) :: y
!
! It is assumed that no more than 64 bits are used.
!
! The branch-free algorithm is that of
! https://archive.is/nKxAc#RoundUpPowerOf2
!
! Fill in bits until one less than the desired power of two is
! reached, and then add one.
!
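    ! (Added illustration) For x = 37 the first step gives y = 36, the
    ! shifted ORs fill in the low bits to give y = 63, and the final
    ! increment yields 64, the next power of two.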
y = x - 1
y = ior (y, ishft (y, -1))
y = ior (y, ishft (y, -2))
y = ior (y, ishft (y, -4))
y = ior (y, ishft (y, -8))
y = ior (y, ishft (y, -16))
y = ior (y, ishft (y, -32))
y = y + 1
end function next_power_of_two
elemental function new_storage_size (length_needed) result (size)
integer(kind = nk), intent(in) :: length_needed
integer(kind = nk) :: size
! Increase storage by orders of magnitude.
if (2_nk**32 < length_needed) then
size = huge (1_nk)
else
size = next_power_of_two (length_needed)
end if
end function new_storage_size
subroutine strbuf_t_ensure_storage (strbuf, length_needed)
class(strbuf_t), intent(inout) :: strbuf
integer(kind = nk), intent(in) :: length_needed
integer(kind = nk) :: new_size
type(strbuf_t) :: new_strbuf
if (.not. allocated (strbuf%chars)) then
! Initialize a new strbuf%chars array.
new_size = new_storage_size (length_needed)
allocate (strbuf%chars(1:new_size))
else if (ubound (strbuf%chars, 1) < length_needed) then
! Allocate a new strbuf%chars array, larger than the current
! one, but containing the same characters.
new_size = new_storage_size (length_needed)
allocate (new_strbuf%chars(1:new_size))
new_strbuf%chars(1:strbuf%len) = strbuf%chars(1:strbuf%len)
call move_alloc (new_strbuf%chars, strbuf%chars)
end if
end subroutine strbuf_t_ensure_storage
subroutine strbuf_t_set (dst, src)
class(strbuf_t), intent(inout) :: dst
class(*), intent(in) :: src
integer(kind = nk) :: n
integer(kind = nk) :: i
select type (src)
type is (character(*, kind = ck))
n = len (src, kind = nk)
call dst%ensure_storage(n)
do i = 1, n
dst%chars(i) = src(i:i)
end do
dst%len = n
type is (character(*))
n = len (src, kind = nk)
call dst%ensure_storage(n)
do i = 1, n
dst%chars(i) = src(i:i)
end do
dst%len = n
class is (strbuf_t)
n = src%len
call dst%ensure_storage(n)
dst%chars(1:n) = src%chars(1:n)
dst%len = n
class default
error stop
end select
end subroutine strbuf_t_set
subroutine strbuf_t_append (dst, src)
class(strbuf_t), intent(inout) :: dst
class(*), intent(in) :: src
integer(kind = nk) :: n_dst, n_src, n
integer(kind = nk) :: i
select type (src)
type is (character(*, kind = ck))
n_dst = dst%len
n_src = len (src, kind = nk)
n = n_dst + n_src
call dst%ensure_storage(n)
do i = 1, n_src
dst%chars(n_dst + i) = src(i:i)
end do
dst%len = n
type is (character(*))
n_dst = dst%len
n_src = len (src, kind = nk)
n = n_dst + n_src
call dst%ensure_storage(n)
do i = 1, n_src
dst%chars(n_dst + i) = src(i:i)
end do
dst%len = n
class is (strbuf_t)
n_dst = dst%len
n_src = src%len
n = n_dst + n_src
call dst%ensure_storage(n)
dst%chars((n_dst + 1):n) = src%chars(1:n_src)
dst%len = n
class default
error stop
end select
end subroutine strbuf_t_append
end module string_buffers
module lexical_analysis
use, intrinsic :: iso_fortran_env, only: input_unit
use, intrinsic :: iso_fortran_env, only: output_unit
use, intrinsic :: iso_fortran_env, only: error_unit
use, intrinsic :: iso_fortran_env, only: int32
use, non_intrinsic :: string_buffers
implicit none
private
public :: lexer_input_t
public :: lexer_output_t
public :: run_lexer
integer, parameter :: input_file_unit_no = 100
integer, parameter :: output_file_unit_no = 101
! Private abbreviations.
integer, parameter :: nk = strbuf_t_length_kind
integer, parameter :: ck = strbuf_t_character_kind
! Integers large enough for a Unicode code point. Unicode code
! points (and UCS-4) have never been allowed to go higher than
! 7FFFFFFF, and are even further restricted now.
integer, parameter :: ichar_kind = int32
character(1, kind = ck), parameter :: horizontal_tab_char = char (9, kind = ck)
character(1, kind = ck), parameter :: linefeed_char = char (10, kind = ck)
character(1, kind = ck), parameter :: vertical_tab_char = char (11, kind = ck)
character(1, kind = ck), parameter :: formfeed_char = char (12, kind = ck)
character(1, kind = ck), parameter :: carriage_return_char = char (13, kind = ck)
character(1, kind = ck), parameter :: space_char = ck_' '
! The following is correct for Unix and its relatives.
character(1, kind = ck), parameter :: newline_char = linefeed_char
character(1, kind = ck), parameter :: backslash_char = char (92, kind = ck)
character(*, kind = ck), parameter :: newline_intstring = ck_'10'
character(*, kind = ck), parameter :: backslash_intstring = ck_'92'
integer, parameter :: tk_EOI = 0
integer, parameter :: tk_Mul = 1
integer, parameter :: tk_Div = 2
integer, parameter :: tk_Mod = 3
integer, parameter :: tk_Add = 4
integer, parameter :: tk_Sub = 5
integer, parameter :: tk_Negate = 6
integer, parameter :: tk_Not = 7
integer, parameter :: tk_Lss = 8
integer, parameter :: tk_Leq = 9
integer, parameter :: tk_Gtr = 10
integer, parameter :: tk_Geq = 11
integer, parameter :: tk_Eq = 12
integer, parameter :: tk_Neq = 13
integer, parameter :: tk_Assign = 14
integer, parameter :: tk_And = 15
integer, parameter :: tk_Or = 16
integer, parameter :: tk_If = 17
integer, parameter :: tk_Else = 18
integer, parameter :: tk_While = 19
integer, parameter :: tk_Print = 20
integer, parameter :: tk_Putc = 21
integer, parameter :: tk_Lparen = 22
integer, parameter :: tk_Rparen = 23
integer, parameter :: tk_Lbrace = 24
integer, parameter :: tk_Rbrace = 25
integer, parameter :: tk_Semi = 26
integer, parameter :: tk_Comma = 27
integer, parameter :: tk_Ident = 28
integer, parameter :: tk_Integer = 29
integer, parameter :: tk_String = 30
character(len = 16), parameter :: token_names(0:30) = &
& (/ "End_of_input ", "Op_multiply ", "Op_divide ", "Op_mod ", "Op_add ", &
& "Op_subtract ", "Op_negate ", "Op_not ", "Op_less ", "Op_lessequal ", &
& "Op_greater ", "Op_greaterequal ", "Op_equal ", "Op_notequal ", "Op_assign ", &
& "Op_and ", "Op_or ", "Keyword_if ", "Keyword_else ", "Keyword_while ", &
& "Keyword_print ", "Keyword_putc ", "LeftParen ", "RightParen ", "LeftBrace ", &
& "RightBrace ", "Semicolon ", "Comma ", "Identifier ", "Integer ", &
& "String " /)
type :: token_t
integer :: token_no
! Our implementation stores the value of a tk_Integer as a
! string. The C reference implementation stores it as an int.
character(:, kind = ck), allocatable :: val
integer(nk) :: line_no
integer(nk) :: column_no
end type token_t
type :: lexer_input_t
logical, private :: using_input_unit = .true.
integer, private :: unit_no = -(huge (1))
integer(kind = nk) :: line_no = 1
integer(kind = nk) :: column_no = 0
integer, private :: unget_count = 0
! The maximum lookahead is 2, although I believe we are using
! only 1. In principle, the lookahead could be any finite number.
character(1, kind = ck), private :: unget_buffer(1:2)
logical, private :: unget_eof_buffer(1:2)
! Using the same strbuf_t multiple times reduces the need for
! reallocations. Putting that strbuf_t in the lexer_input_t is
! simply for convenience.
type(strbuf_t), private :: strbuf
contains
!
! Note: There is currently no facility for closing one input and
! switching to another.
!
! Note: There is currently no facility to decode inputs into
! Unicode codepoints. Instead, what happens is raw bytes of
! input get stored as strbuf_t_character_kind values. This
! behavior is adequate for ASCII inputs.
!
procedure, pass :: use_file => lexer_input_t_use_file
procedure, pass :: get_next_ch => lexer_input_t_get_next_ch
procedure, pass :: unget_ch => lexer_input_t_unget_ch
procedure, pass :: unget_eof => lexer_input_t_unget_eof
end type lexer_input_t
type :: lexer_output_t
integer, private :: unit_no = output_unit
contains
procedure, pass :: use_file => lexer_output_t_use_file
procedure, pass :: output_token => lexer_output_t_output_token
end type lexer_output_t
contains
subroutine lexer_input_t_use_file (inputter, filename)
class(lexer_input_t), intent(inout) :: inputter
character(*), intent(in) :: filename
integer :: stat
inputter%using_input_unit = .false.
inputter%unit_no = input_file_unit_no
inputter%line_no = 1
inputter%column_no = 0
open (unit = input_file_unit_no, file = filename, status = 'old', &
& action = 'read', access = 'stream', form = 'unformatted', &
& iostat = stat)
if (stat /= 0) then
write (error_unit, '("Error: failed to open ", A, " for input")') filename
stop 1
end if
end subroutine lexer_input_t_use_file
!!!
!!! If you tell gfortran you want -std=f2008 or -std=f2018, you likely
!!! will need to add also -fall-intrinsics or -U__GFORTRAN__
!!!
!!! The first way, you get the FGETC intrinsic. The latter way, you
!!! get the C interface code that uses getchar(3).
!!!
#ifdef __GFORTRAN__
subroutine get_input_unit_char (c, stat)
!
! The following works if you are using gfortran.
!
! (FGETC is considered a feature for backwards compatibility with
! g77. However, I know of no way to reconfigure input_unit as a
! Fortran 2003 stream, for use with ordinary ‘read’.)
!
character, intent(inout) :: c
integer, intent(out) :: stat
call fgetc (input_unit, c, stat)
end subroutine get_input_unit_char
#else
subroutine get_input_unit_char (c, stat)
!