Compiler/lexical analyzer: Difference between revisions
m (Vetted on 4 forums/mailing lists (2 programming, 2 compiler development) and it was well received. I believe it is ready to go!) |
m (Additional error checking.) |
||
Line 211: | Line 211: | ||
| Unrecognized character. |
| Unrecognized character. |
||
| <code>|</code> |
| <code>|</code> |
||
|- |
|||
| Invalid number. Starts like a number, but ends in non-numeric characters. |
|||
| <code>123abc</code> |
|||
|} |
|} |
||
Revision as of 15:54, 22 October 2016
You are encouraged to solve this task according to the task description, using any language you may know.
Lexical Analyzer
Definition from Wikipedia:
- Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.
The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:
- Operators
Name | Common name | Character sequence |
---|---|---|
Op_multiply | multiply | * |
Op_divide | divide | / |
Op_mod | mod | % |
Op_add | plus | + |
Op_subtract | minus | - |
Op_negate | unary minus | - |
Op_less | less than | < |
Op_lessequal | less than or equal | <= |
Op_greater | greater than | > |
Op_greaterequal | greater than or equal | >= |
Op_equal | equal | == |
Op_notequal | not equal | != |
Op_not | unary not | ! |
Op_assign | assignment | = |
Op_and | logical and | && |
Op_or | logical or | ¦¦ |
- The
-
token should always be interpreted as Op_subtract by the lexer. Turning some Op_subtract into Op_negate will be the job of the syntax analyzer, which is not part of this task.
- Symbols
Name | Common name | Character |
---|---|---|
LeftParen | left parenthesis | ( |
RightParen | right parenthesis | ) |
LeftBrace | left brace | { |
RightBrace | right brace | } |
Semicolon | semi-colon | ; |
Comma | comma | , |
- Keywords
Name | Character sequence |
---|---|
Keyword_if | if |
Keyword_else | else |
Keyword_while | while |
Keyword_print | |
Keyword_putc | putc |
- Identifiers and literals
These differ from the the previous tokens, in that each occurrence of them has a value associated with it.
Name | Common name | Format description | Format regex | Value |
---|---|---|---|---|
Identifier | identifier | one or more letter/number/underscore characters, but not starting with a number | [_a-zA-Z][_a-zA-Z0-9]*
|
as is |
Integer | integer literal | one or more digits | [0-9]+
|
as is, interpreted as a number |
Integer | char literal | exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes | '([^'\n]|\\n|\\\\)'
|
the ASCII code point number of the character, e.g. 65 for 'A' and 10 for '\n'
|
String | string literal | zero or more characters (anything except newline or double quote), enclosed by double quotes | "[^"\n]*"
|
the characters without the double quotes and with escape sequences converted |
- For char and string literals, the
\n
escape sequence is supported to represent a new-line character. - For char literals, to represent a backslash, use
\\
. - No other special sequences are supported. This means that:
- Char literals cannot represent a single quote character (value 39).
- String literals cannot represent strings containing double quote characters.
- Zero-width tokens
Name | Location |
---|---|
End_of_input | when the end of the input stream is reached |
- White space
- Zero or more whitespace characters, or comments enclosed in
/* ... */
, are allowed between any two tokens, with the exceptions noted below. - "Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
- Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
- This means: keywords, identifiers, and integer literals.
- e.g.
ifprint
is recognized as an identifier, instead of the keywords if and print. - e.g.
42fred
is invalid, and neither recognized as a number nor an identifier.
- Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
- e.g.
& &
is invalid, and not interpreted as the && operator.
- e.g.
For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:
- <lang c>if ( p /* meaning n is prime */ ) {
print ( n , " " ) ; count = count + 1 ; /* number of primes found so far */
}</lang>
- <lang c>if(p){print(n," ");count=count+1;}</lang>
- Complete list of token names
End_of_input Op_multiply Op_divide Op_mod Op_add Op_subtract Op_negate Op_not Op_less Op_lessequal Op_greater Op_greaterequal Op_equal Op_notequal Op_assign Op_and Op_or Keyword_if Keyword_else Keyword_while Keyword_print Keyword_putc LeftParen RightParen LeftBrace RightBrace Semicolon Comma Identifier Integer String
The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
- the line number where the token starts
- the column number where the token starts
- the token name
- the token value (only for Identifier, Integer, and String tokens)
- the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.
The following error conditions should be caught:
Error | Example |
---|---|
Empty character constant | ''
|
Unknown escape sequence. | \r
|
Multi-character constant. | 'xx'
|
End-of-file in comment. Closing comment characters not found. | |
End-of-file while scanning string literal. Closing string character not found. | |
End-of-line while scanning string literal. Closing string character not found before end-of-line. | |
Unrecognized character. | |
Invalid number. Starts like a number, but ends in non-numeric characters. | 123abc
|
Input | Output |
---|---|
<lang c>/* Hello world */ print("Hello, World!\n");</lang> |
4 1 Keyword_print 4 6 LeftParen 4 7 String "Hello, World!\n" 4 24 RightParen 4 25 Semicolon 5 1 End_of_input |
<lang c>/* Show Ident and Integers */ phoenix_number = 142857; print(phoenix_number, "\n");</lang> |
4 1 Identifier phoenix_number 4 16 Op_assign 4 18 Integer 142857 4 24 Semicolon 5 1 Keyword_print 5 6 LeftParen 5 7 Identifier phoenix_number 5 21 Comma 5 23 String "\n" 5 27 RightParen 5 28 Semicolon 6 1 End_of_input |
<lang c>/* All lexical tokens - not syntactically correct, but that will have to wait until syntax analysis */ /* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* Else */ else /* Leq */ <= /* While */ while /* Geq */ >= /* Lbrace */ { /* Eq */ == /* Rbrace */ } /* Neq */ != /* Lparen */ ( /* And */ && /* Rparen */ ) /* Or */ || /* Uminus */ - /* Semi */ ; /* Not */ ! /* Comma */ , /* Mul */ * /* Assign */ = /* Div */ / /* Integer */ 42 /* Mod */ % /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' '</lang> |
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input |
The C and Python versions can be considered reference implementations.
ALGOL W
<lang algolw>begin
%lexical analyser % % Algol W strings are limited to 256 characters in length so we limit source lines % % and tokens to 256 characters %
integer lineNumber, columnNumber; string(256) line; string(256) tkValue; integer tkType, tkLine, tkColumn, tkLength, tkIntegerValue; logical tkTooLong; string(1) currChar; string(1) newlineChar;
integer LINE_WIDTH, MAX_TOKEN_LENGTH, MAXINTEGER_OVER_10, MAXINTEGER_MOD_10; integer tOp_multiply , tOp_divide , tOp_mod , tOp_add , tOp_subtract , tOp_negate , tOp_less , tOp_lessequal , tOp_greater , tOp_greaterequal , tOp_equal , tOp_notequal , tOp_not , tOp_assign , tOp_and , tOp_or , tLeftParen , tRightParen , tLeftBrace , tRightBrace , tSemicolon , tComma , tKeyword_if , tKeyword_else , tKeyword_while , tKeyword_print , tKeyword_putc , tIdentifier , tInteger , tString , tEnd_of_input , tComment ;
string(16) array tkName ( 1 :: 32 );
% reports an error % procedure lexError( string(80) value message ); begin integer errorPos; write( i_w := 1, s_w := 0, "**** Error at(", lineNumber, ",", columnNumber, "): " ); errorPos := 0; while errorPos < 80 and message( errorPos // 1 ) not = "." do begin writeon( s_w := 0, message( errorPos // 1 ) ); errorPos := errorPos + 1 end while_not_at_end_of_message ; writeon( s_w := 0, "." ) end lexError ;
% gets the next source character % procedure nextChar ; begin if columnNumber = LINE_WIDTH then begin currChar := newlineChar; columnNumber := columnNumber + 1 end else if columnNumber > LINE_WIDTH then begin readcard( line ); columnNumber := 1; if not XCPNOTED(ENDFILE) then lineNumber := lineNumber + 1; currChar := line( 0 // 1 ) end else begin currChar := line( columnNumber // 1 ); columnNumber := columnNumber + 1 end end nextChar ;
% gets the next token, returns the token type % integer procedure nextToken ; begin
% returns true if currChar is in the inclusive range lowerValue to upperValue % % false otherwise % logical procedure range( string(1) value lowerValue, upperValue ) ; begin currChar >= lowerValue and currChar <= upperValue end range ;
% returns true if the current character can start an identifier, false otherwise % logical procedure identifierStartChar ; begin currChar = "_" or range( "a", "z" ) or range( "A", "Z" ) end identifierStartChar ;
% add the current character to the token and get the next % procedure addAndNextChar ; begin if tkLength >= MAX_TOKEN_LENGTH then tkTooLong := true else begin tkValue( tkLength // 1 ) := currChar; tkLength := tkLength + 1 end if_symbol_not_too_long ; nextChar end % addAndNextChar % ;
% handle a single character token % procedure singleCharToken( integer value tokenType ) ; begin tkType := tokenType; nextChar end singleCharToken ;
% handle a doubled character token: && or || % procedure doubleCharToken( integer value tokenType ) ; begin string(1) firstChar; firstChar := currChar; tkType := tokenType; nextChar; if currChar = firstChar then nextChar else % the character wasn't doubled % lexError( "Unrecognised character." ); end singleCharToken ;
% handle an operator or operator= token % procedure opOrOpEqual( integer value opToken, opEqualToken ) ; begin tkType := opToken; nextChar; if currChar = "=" then begin % have operator= % tkType := opEqualToken; nextChar end if_currChar_is_equal ; end opOrOpEqual ;
% handle a / operator or /* comment % procedure divideOrComment ; begin tkType := tOp_divide; nextChar; if currChar = "*" then begin % have a comment % logical moreComment; tkType := tComment; moreComment := true; while moreComment do begin nextChar; while currChar not = "*" and not XCPNOTED(ENDFILE) do nextChar; while currChar = "*" and not XCPNOTED(ENDFILE) do nextChar; moreComment := ( currChar not = "/" and not XCPNOTED(ENDFILE) ) end while_more_comment ; if not XCPNOTED(ENDFILE) then nextChar else lexError( "End-of-file in comment." ) end if_currChar_is_star ; end divideOrComment ;
% handle an indentifier or keyword % procedure identifierOrKeyword ; begin tkType := tIdentifier; while identifierStartChar or range( "0", "9" ) do addAndNextChar; % there are only 5 keywords, so we just test each in turn here % if tkValue = "if" then tkType := tKeyword_if else if tkValue = "else" then tkType := tKeyword_else else if tkValue = "while" then tkType := tKeyword_while else if tkValue = "print" then tkType := tKeyword_print else if tkValue = "putc" then tkType := tKeyword_putc; if tkType not = tIdentifier then tkValue := ""; end identifierOrKeyword ;
% handle an integer literal % procedure integerLiteral ; begin logical overflowed; integer digit; overflowed := false; tkType := tInteger; while range( "0", "9" ) do begin digit := ( decode( currChar ) - decode( "0" ) ); if tkIntegerValue > MAXINTEGER_OVER_10 then overflowed := true else if tkIntegerValue = MAXINTEGER_OVER_10 and digit > MAXINTEGER_MOD_10 then overflowed := true else begin tkIntegerValue := tkIntegerValue * 10; tkIntegerValue := tkIntegerValue + digit; end; nextChar end while_have_a_digit ; if overflowed then lexError( "Number too large." ); if identifierStartChar then lexError( "Number followed by letter or underscore." ); end integerLiteral ;
% handle a char literal % procedure charLiteral ; begin nextChar; if currChar = "'" or currChar = newlineChar then lexError( "Invalid character constant." ) else if currChar = "\" then begin % have an escape % nextChar; if currChar = "n" then currChar := newlineChar else if currChar not = "\" then lexError( "Unknown escape sequence." ) end; tkType := tInteger; tkIntegerValue := decode( currChar ); % should have a closing quoute next % nextChar; if currChar not = "'" then lexError( "Multi-character constant." ) else nextChar end charLiteral ;
% handle a string literal % procedure stringLiteral ; begin tkType := tString; tkValue( 0 // 1 ) := currChar; tkLength := 1; nextChar; while currChar not = """" and currChar not = newlineChar and not XCPNOTED(ENDFILE) do addAndNextChar; if currChar = newlineChar then lexError( "End-of-line while scanning string literal." ) else if XCPNOTED(ENDFILE) then lexError( "End-of-file while scanning string literal." ) else % currChar must be """" % addAndNextChar end stringLiteral ;
while begin % skip white space % while ( currChar = " " or currChar = newlineChar ) and not XCPNOTED(ENDFILE) do nextChar; % get the token % tkLine := lineNumber; tkColumn := columnNumber; tkValue := ""; tkLength := 0; tkIntegerValue := 0; tkTooLong := false; if XCPNOTED(ENDFILE) then tkType := tEnd_of_input else if currChar = "*" then singleCharToken( tOp_multiply ) else if currChar = "/" then divideOrComment else if currChar = "%" then singleCharToken( tOp_mod ) else if currChar = "+" then singleCharToken( tOp_add ) else if currChar = "-" then singleCharToken( tOp_subtract ) else if currChar = "<" then opOrOpEqual( tOp_less, tOp_lessequal ) else if currChar = ">" then opOrOpEqual( tOp_greater, tOp_greaterequal ) else if currChar = "=" then opOrOpEqual( tOp_assign, tOp_equal ) else if currChar = "!" then opOrOpEqual( tOp_not, tOp_notequal ) else if currChar = "&" then doubleCharToken( tOp_and ) else if currChar = "|" then doubleCharToken( tOp_or ) else if currChar = "(" then singleCharToken( tLeftParen ) else if currChar = ")" then singleCharToken( tRightParen ) else if currChar = "{" then singleCharToken( tLeftBrace ) else if currChar = "}" then singleCharToken( tRightBrace ) else if currChar = ";" then singleCharToken( tSemicolon ) else if currChar = "," then singleCharToken( tComma ) else if identifierStartChar then identifierOrKeyword else if range( "0", "9" ) then integerLiteral else if currChar = "'" then charLiteral else if currChar = """" then stringLiteral else begin lexError( "Unrecognised character." ); singleCharToken( tComment ) end ; % continue until we get something other than a comment % tkType = tComment end do begin end; if tkTooLong then if tkType = tString then lexError( "String literal too long." ) else lexError( "Identifier too long." ); tkType end nextToken ;
% outputs the current token % procedure writeToken ; begin write( i_w := 5, s_w := 2, tkLine, tkColumn, tkName( tkType ) ); if tkType = tInteger then writeon( i_w := 11, tkIntegerValue ) else if tkLength > 0 then begin writeon( " " ); for tkPos := 0 until tkLength - 1 do writeon( s_w := 0, tkValue( tkPos // 1 ) ); end end writeToken ;
LINE_WIDTH := 256; MAXINTEGER_MOD_10 := MAXINTEGER rem 10; MAX_TOKEN_LENGTH := 256; MAXINTEGER_OVER_10 := MAXINTEGER div 10; newlineChar := code( 10 ); tOp_multiply := 1; tkName( tOp_multiply ) := "Op_multiply"; tOp_divide := 2; tkName( tOp_divide ) := "Op_divide"; tOp_mod := 3; tkName( tOp_mod ) := "Op_mod"; tOp_add := 4; tkName( tOp_add ) := "Op_add"; tOp_subtract := 5; tkName( tOp_subtract ) := "Op_subtract"; tOp_negate := 6; tkName( tOp_negate ) := "Op_negate"; tOp_less := 7; tkName( tOp_less ) := "Op_less"; tOp_lessequal := 8; tkName( tOp_lessequal ) := "Op_lessequal"; tOp_greater := 9; tkName( tOp_greater ) := "Op_greater"; tOp_greaterequal := 10; tkName( tOp_greaterequal ) := "Op_greaterequal"; tOp_equal := 11; tkName( tOp_equal ) := "Op_equal"; tOp_notequal := 12; tkName( tOp_notequal ) := "Op_notequal"; tOp_not := 13; tkName( tOp_not ) := "Op_not"; tOp_assign := 14; tkName( tOp_assign ) := "Op_assign"; tOp_and := 15; tkName( tOp_and ) := "Op_and"; tOp_or := 16; tkName( tOp_or ) := "Op_or"; tLeftParen := 17; tkName( tLeftParen ) := "LeftParen"; tRightParen := 18; tkName( tRightParen ) := "RightParen"; tLeftBrace := 19; tkName( tLeftBrace ) := "LeftBrace"; tRightBrace := 20; tkName( tRightBrace ) := "RightBrace"; tSemicolon := 21; tkName( tSemicolon ) := "Semicolon"; tComma := 22; tkName( tComma ) := "Comma"; tKeyword_if := 23; tkName( tKeyword_if ) := "Keyword_if"; tKeyword_else := 24; tkName( tKeyword_else ) := "Keyword_else"; tKeyword_while := 25; tkName( tKeyword_while ) := "Keyword_while"; tKeyword_print := 26; tkName( tKeyword_print ) := "Keyword_print"; tKeyword_putc := 27; tkName( tKeyword_putc ) := "Keyword_putc"; tIdentifier := 28; tkName( tIdentifier ) := "Identifier"; tInteger := 29; tkName( tInteger ) := "Integer"; tString := 30; tkName( tString ) := "String"; tEnd_of_input := 31; tkName( tEnd_of_input ) := "End_of_input"; tComment := 32; tkName( tComment ) := "Comment";
% allow the program to continue after reaching end-of-file % ENDFILE := EXCEPTION( false, 1, 0, false, "EOF" ); % ensure the first call to nextToken reads the first line % lineNumber := 0; columnNumber := LINE_WIDTH + 1; currChar := " "; % get and print all tokens from standard input % while nextToken not = tEnd_of_input do writeToken; writeToken
end.</lang>
- Output:
Test case 3
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input
C
Tested with gcc 4.81 and later, compiles warning free with -Wall -Wextra <lang C>#include <stdlib.h>
- include <stdio.h>
- include <stdarg.h>
- include <ctype.h>
- include <string.h>
- include <errno.h>
- include <stdbool.h>
- include <limits.h>
- define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
- define da_dim(name, type) type *name = NULL; \
int _qy_ ## name ## _p = 0; \ int _qy_ ## name ## _max = 0
- define da_rewind(name) _qy_ ## name ## _p = 0
- define da_redim(name) do {if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \
name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]));} while (0)
- define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
- define da_len(name) _qy_ ## name ## _p
typedef enum {
tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq, tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
} TokenType;
typedef struct {
TokenType tok; int err_ln, err_col; union { int n; /* value for constants */ char *text; /* text for idents */ };
} tok_s;
static FILE *source_fp, *dest_fp; static int line = 1, col = 0, the_ch = ' '; da_dim(text, char);
tok_s gettok();
static void error(int err_line, int err_col, const char *fmt, ... ) {
char buf[1000]; va_list ap;
va_start(ap, fmt); vsprintf(buf, fmt, ap); va_end(ap); printf("(%d,%d) error: %s\n", err_line, err_col, buf); exit(1);
}
static int next_ch() { /* get next char from input */
the_ch = getc(source_fp); ++col; if (the_ch == '\n') { ++line; col = 0; } return the_ch;
}
static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */
if (the_ch == '\) error(err_line, err_col, "gettok: empty character constant"); if (the_ch == '\\') { next_ch(); if (the_ch == 'n') n = 10; else if (the_ch == '\\') n = '\\'; else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch); } if (next_ch() != '\) error(err_line, err_col, "multi-character constant"); next_ch(); return (tok_s){tk_Integer, err_line, err_col, {n}};
}
static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */
if (the_ch != '*') return (tok_s){tk_Div, err_line, err_col, {0}};
/* comment found */ for (;;) { if (next_ch() == '*' && next_ch() == '/') { next_ch(); return gettok(); } else if (the_ch == EOF) error(err_line, err_col, "EOF in comment"); }
}
static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */
da_rewind(text);
while (next_ch() != start) { if (the_ch == '\n') error(err_line, err_col, "EOL in string"); if (the_ch == EOF) error(err_line, err_col, "EOF in string"); da_append(text, (char)the_ch); } da_append(text, '\0');
next_ch(); return (tok_s){tk_String, err_line, err_col, {.text=text}};
}
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct { char *s; TokenType sym; } kwds[] = { {"else", tk_Else}, {"if", tk_If}, {"print", tk_Print}, {"putc", tk_Putc}, {"while", tk_While}, }, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
static tok_s ident_or_int(int err_line, int err_col) {
int n, is_number = true;
da_rewind(text); while (isalnum(the_ch) || the_ch == '_') { da_append(text, (char)the_ch); if (!isdigit(the_ch)) is_number = false; next_ch(); } if (da_len(text) == 0) error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch); da_append(text, '\0'); if (isdigit(text[0])) { if (!is_number) error(err_line, err_col, "invalid number: %s\n", text); n = strtol(text, NULL, 0); if (n == LONG_MAX && errno == ERANGE) error(err_line, err_col, "Number exceeds maximum value"); return (tok_s){tk_Integer, err_line, err_col, {n}}; } return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};
}
static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */
if (the_ch == expect) { next_ch(); return (tok_s){ifyes, err_line, err_col, {0}}; } if (ifno == tk_EOI) error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch); return (tok_s){ifno, err_line, err_col, {0}};
}
tok_s gettok() { /* return the token type */
/* skip white space */ while (isspace(the_ch)) next_ch(); int err_line = line; int err_col = col; switch (the_ch) { case '{': next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}}; case '}': next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}}; case '(': next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}}; case ')': next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}}; case '+': next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}}; case '-': next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}}; case '*': next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}}; case '%': next_ch(); return (tok_s){tk_Mod, err_line, err_col, {0}}; case ';': next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}}; case ',': next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}}; case '/': next_ch(); return div_or_cmt(err_line, err_col); case '\: next_ch(); return char_lit(the_ch, err_line, err_col); case '<': next_ch(); return follow('=', tk_Leq, tk_Lss, err_line, err_col); case '>': next_ch(); return follow('=', tk_Geq, tk_Gtr, err_line, err_col); case '=': next_ch(); return follow('=', tk_Eq, tk_Assign, err_line, err_col); case '!': next_ch(); return follow('=', tk_Neq, tk_Not, err_line, err_col); case '&': next_ch(); return follow('&', tk_And, tk_EOI, err_line, err_col); case '|': next_ch(); return follow('|', tk_Or, tk_EOI, err_line, err_col); case '"' : return string_lit(the_ch, err_line, err_col); default: return ident_or_int(err_line, err_col); case EOF: return (tok_s){tk_EOI, err_line, err_col, {0}}; }
}
void run() { /* tokenize the given input */
tok_s tok; do { tok = gettok(); fprintf(dest_fp, "%5d %5d %.15s", tok.err_ln, tok.err_col, &"End_of_input Op_multiply Op_divide Op_mod Op_add " "Op_subtract Op_negate Op_not Op_less Op_lessequal " "Op_greater Op_greaterequal Op_equal Op_notequal Op_assign " "Op_and Op_or Keyword_if Keyword_else Keyword_while " "Keyword_print Keyword_putc LeftParen RightParen LeftBrace " "RightBrace Semicolon Comma Identifier Integer " "String " [tok.tok * 16]); if (tok.tok == tk_Integer) fprintf(dest_fp, " %4d", tok.n); else if (tok.tok == tk_Ident) fprintf(dest_fp, " %s", tok.text); else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text); fprintf(dest_fp, "\n"); } while (tok.tok != tk_EOI); if (dest_fp != stdout) fclose(dest_fp);
}
void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {
if (fn[0] == '\0') *fp = std; else if ((*fp = fopen(fn, mode)) == NULL) error(0, 0, "Can't open %s\n", fn);
}
int main(int argc, char *argv[]) {
init_io(&source_fp, stdin, "r", argc > 1 ? argv[1] : ""); init_io(&dest_fp, stdout, "wb", argc > 2 ? argv[2] : ""); run(); return 0;
}</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input
C#
Requires C#6.0 because of the use of null coalescing operators. <lang csharp> using System; using System.IO; using System.Linq; using System.Collections.Generic;
namespace Rosetta {
public enum TokenType { End_of_input, Op_multiply, Op_divide, Op_mod, Op_add, Op_subtract, Op_negate, Op_not, Op_less, Op_lessequal, Op_greater, Op_greaterequal, Op_equal, Op_notequal, Op_assign, Op_and, Op_or, Keyword_if, Keyword_else, Keyword_while, Keyword_print, Keyword_putc, LeftParen, RightParen, LeftBrace, RightBrace, Semicolon, Comma, Identifier, Integer, String, None }
/// <summary> /// Storage class for tokens /// </summary> public class Token { public TokenType Type { get; set; } public int Line { get; set; } public int Position { get; set; } public string Value { get; set; } public override string ToString() { if (Type == TokenType.Integer || Type == TokenType.Identifier) { return String.Format("{0,-5} {1,-5} {2,-14} {3}", Line, Position, Type.ToString(), Value); } else if (Type == TokenType.String) { return String.Format("{0,-5} {1,-5} {2,-14} \"{3}\"", Line, Position, Type.ToString(), Value.Replace("\n", "\\n")); } return String.Format("{0,-5} {1,-5} {2,-14}", Line, Position, Type.ToString()); } }
/// <summary> /// C# Example of Lexical scanner for Rosetta Compiler /// </summary> public class LexicalScanner {
// character classes private const string _letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_"; private const string _numbers = "0123456789"; private const string _identifier = _letters + _numbers + "_"; private const string _whitespace = " \t\n\r"; // mappings from string keywords to token type private Dictionary<string, TokenType> _keywordTokenTypeMap = new Dictionary<string, TokenType>() { { "if", TokenType.Keyword_if }, { "else", TokenType.Keyword_else }, { "while", TokenType.Keyword_while }, { "print", TokenType.Keyword_print }, { "putc", TokenType.Keyword_putc } };
// mappings from simple operators to token type private Dictionary<string, TokenType> _operatorTokenTypeMap = new Dictionary<string, TokenType>() { { "+", TokenType.Op_add }, { "-", TokenType.Op_subtract }, { "*", TokenType.Op_multiply }, { "/", TokenType.Op_divide }, { "%", TokenType.Op_mod }, { "=", TokenType.Op_assign }, { "<", TokenType.Op_less }, { ">", TokenType.Op_greater }, { "!", TokenType.Op_not }, };
private List<string> _keywords; private string _operators = "+-*/%=<>!%";
private string _code; private List<Token> tokens = new List<Token>();
private int _line = 1; private int _position = 1;
public string CurrentCharacter { get { try { return _code.Substring(0, 1); } catch (ArgumentOutOfRangeException) { return ""; } } }
/// <summary> /// Lexical scanner initialiser /// </summary> /// <param name="code">Code to be tokenised</param> public LexicalScanner (string code) { _code = code; _keywords = _keywordTokenTypeMap.Keys.ToList(); }
/// <summary> /// Advance the cursor forward given number of characters /// </summary> /// <param name="characters">Number of characters to advance</param> private void advance(int characters=1) { try { // reset position when there is a newline if (CurrentCharacter == "\n") { _position = 0; _line++; } _code = _code.Substring(characters, _code.Length - characters); _position += characters; } catch (ArgumentOutOfRangeException) { _code = ""; } }
/// <summary> /// Outputs error message to the console and exits /// </summary> /// <param name="message">Error message to display to user</param> /// <param name="line">Line error occurred on</param> /// <param name="position">Line column that the error occurred at</param> public void error(string message, int line, int position) { // output error to the console and exit Console.WriteLine(String.Format("{0} @ {1}:{2}", message, line, position)); Environment.Exit(1); }
/// <summary> /// Pattern matching using first & follow matching /// </summary> /// <param name="recogniseClass">String of characters that identifies the token type /// or the exact match the be made if exact:true</param> /// <param name="matchClass">String of characters to match against remaining target characters</param> /// <param name="tokenType">Type of token the match represents.</param> /// <param name="notNextClass">Optional class of characters that cannot follow the match</param> /// <param name="maxLen">Optional maximum length of token value</param> /// <param name="exact">Denotes whether recogniseClass represents an exact match or class match. /// Default: false</param> /// <param name="discard">Denotes whether the token is kept or discarded. Default: false</param> /// <param name="offset">Optiona line position offset to account for discarded tokens</param> /// <returns>Boolean indicating if a match was made </returns> public bool match(string recogniseClass, string matchClass, TokenType tokenType, string notNextClass=null, int maxLen=Int32.MaxValue, bool exact=false, bool discard=false, int offset=0) {
// if we've hit the end of the file, there's no more matching to be done if (CurrentCharacter == "") return false;
// store _current_ line and position so that our vectors point at the start // of each token int line = _line; int position = _position;
// special case exact tokens to avoid needing to worry about backtracking if (exact) { if (_code.StartsWith(recogniseClass)) { if (!discard) tokens.Add(new Token() { Type = tokenType, Value = recogniseClass, Line = line, Position = position - offset}); advance(recogniseClass.Length); return true; } return false; }
// first match - denotes the token type usually if (!recogniseClass.Contains(CurrentCharacter)) return false;
string tokenValue = CurrentCharacter; advance();
// follow match while we haven't exceeded maxLen and there are still characters // in the code stream while ((matchClass ?? "").Contains(CurrentCharacter) && tokenValue.Length <= maxLen && CurrentCharacter != "") { tokenValue += CurrentCharacter; advance(); }
// ensure that any incompatible characters are not next to the token // eg 42fred is invalid, and neither recognized as a number nor an identifier. // _letters would be the notNextClass if (notNextClass != null && notNextClass.Contains(CurrentCharacter)) error("Unrecognised character: " + CurrentCharacter, _line, _position);
// only add tokens to the stack that aren't marked as discard - dont want // things like open and close quotes/comments if (!discard) { Token token = new Token() { Type = tokenType, Value = tokenValue, Line = line, Position = position - offset }; tokens.Add(token); }
return true; }
/// <summary> /// Tokenise the input code /// </summary> /// <returns>List of Tokens</returns> public List<Token> scan() {
while (CurrentCharacter != "") { // match whitespace match(_whitespace, _whitespace, TokenType.None, discard: true);
// match integers match(_numbers, _numbers, TokenType.Integer, notNextClass:_letters); // match identifiers and keywords if (match(_letters, _identifier, TokenType.Identifier)) { Token match = tokens.Last(); if (_keywords.Contains(match.Value)) match.Type = _keywordTokenTypeMap[match.Value]; }
// match string similarly to comments without allowing newlines // this token doesn't get discarded though if (match("\"", null, TokenType.String, discard:true)) { string value = ""; int position = _position; while (!match("\"", null, TokenType.String, discard:true)) { // not allowed newlines in strings if (CurrentCharacter == "\n") error("End-of-line while scanning string literal. Closing string character not found before end-of-line", _line, _position); // end of file reached before finding end of string if (CurrentCharacter == "") error("End-of-file while scanning string literal. Closing string character not found", _line, _position);
value += CurrentCharacter;
// deal with escape sequences - we only accept newline (\n) if (value.Length >= 2) { string lastCharacters = value.Substring(value.Length - 2, 2); if (lastCharacters[0] == '\\') { if (lastCharacters[1] != 'n') { error("Unknown escape sequence. ", _line, position); } value = value.Substring(0, value.Length - 2).ToString() + "\n"; } }
advance(); } tokens.Add(new Token() { Type = TokenType.String, Value = value, Line = _line, Position = position - 1}); }
// match string literals if (match("'", null, TokenType.Integer, discard:true)) { int value; int position = _position; value = CurrentCharacter.ToCharArray()[0]; advance();
// deal with empty literals if (value == '\) error("Empty character literal", _line, _position);
// deal with escaped characters, only need to worry about \n and \\ // throw werror on any other if (value == '\\') { if (CurrentCharacter == "n") { value = '\n'; } else if (CurrentCharacter == "\\") { value = '\\'; } else { error("Unknown escape sequence. ", _line, _position - 1); } advance(); }
// if we haven't hit a closing ' here, there are two many characters // in the literal if (!match("'", null, TokenType.Integer, discard: true)) error("Multi-character constant", _line, _position);
tokens.Add(new Rosetta.Token() { Type = TokenType.Integer, Value = value.ToString(), Line = _line, Position = position - 1 }); }
// match comments by checking for starting token, then advancing // until closing token is matched if (match("/*", null, TokenType.None, exact: true, discard: true)) { while (!match("*/", null, TokenType.None, exact: true, discard: true)) { // reached the end of the file without closing comment! if (CurrentCharacter == "") error("End-of-file in comment. Closing comment characters not found.", _line, _position); advance(); } continue; }
// match complex operators match("<=", null, TokenType.Op_lessequal, exact: true); match(">=", null, TokenType.Op_greaterequal, exact: true); match("==", null, TokenType.Op_equal, exact: true); match("!=", null, TokenType.Op_notequal, exact: true); match("&&", null, TokenType.Op_and, exact: true); match("||", null, TokenType.Op_or, exact: true);
// match simple operators if (match(_operators, null, TokenType.None, maxLen:1)) { Token match = tokens.Last(); match.Type = _operatorTokenTypeMap[match.Value]; }
// brackets, braces and separators match("(", null, TokenType.LeftParen, exact: true); match(")", null, TokenType.RightParen, exact: true); match("{", null, TokenType.LeftBrace, exact: true); match("}", null, TokenType.RightBrace, exact: true); match(";", null, TokenType.Semicolon, exact: true); match(",", null, TokenType.Comma, exact: true);
}
// end of file token tokens.Add(new Rosetta.Token() { Type = TokenType.End_of_input, Line = _line, Position = _position }); return tokens; }
static void Main (string[] args) { StreamReader inputFile;
// if we passed in a filename, read code from that, else // read code from stdin if (args.Length > 0) { string path = args[0]; try { inputFile = new StreamReader(path); } catch (IOException) { inputFile = new StreamReader(Console.OpenStandardInput(8192)); } } else { inputFile = new StreamReader(Console.OpenStandardInput(8192)); }
string code = inputFile.ReadToEnd();
// strip windows line endings out code = code.Replace("\r", "");
LexicalScanner scanner = new LexicalScanner(code); List<Token> tokens = scanner.scan();
foreach(Token token in tokens) { Console.WriteLine(token.ToString()); } } }
} </lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input
Euphoria
Tested with Euphoria 4.05. <lang euphoria>include std/io.e include std/map.e include std/types.e include std/convert.e
constant true = 1, false = 0, EOF = -1
enum tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq,
tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
constant all_syms = {"End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add",
"Op_subtract", "Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal", "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or", "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc", "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma", "Identifier", "Integer", "String"}
integer input_file, the_ch = ' ', the_col = 0, the_line = 1 sequence symbols map key_words = new()
procedure error(sequence format, sequence data)
printf(STDOUT, format, data) abort(1)
end procedure
-- get the next character from the input function next_ch()
the_ch = getc(input_file) the_col += 1 if the_ch = '\n' then the_line += 1 the_col = 0 end if return the_ch
end function
-- 'x' - character constants function char_lit(integer err_line, integer err_col)
integer n = next_ch() -- skip opening quote if the_ch = '\ then error("%d %d empty character constant", {err_line, err_col}) elsif the_ch = '\\' then next_ch() if the_ch = 'n' then n = 10 elsif the_ch = '\\' then n = '\\' else error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch}) end if end if if next_ch() != '\ then error("%d %d multi-character constant", {err_line, err_col}) end if next_ch() return {tk_Integer, err_line, err_col, n}
end function
-- process divide or comments function div_or_cmt(integer err_line, integer err_col)
if next_ch() != '*' then return {tk_Div, err_line, err_col} end if
-- comment found while true do if next_ch() = '*' and next_ch() = '/' then next_ch() return get_tok() elsif the_ch = EOF then error("%d %d EOF in comment", {err_line, err_col}) end if end while
end function
-- "string" function string_lit(integer start, integer err_line, integer err_col)
string text = ""
while next_ch() != start do if the_ch = EOF then error("%d %d EOF while scanning string literal", {err_line, err_col}) end if if the_ch = '\n' then error("%d %d EOL while scanning string literal", {err_line, err_col}) end if text &= the_ch end while
next_ch() return {tk_String, err_line, err_col, text}
end function
-- handle identifiers and integers function ident_or_int(integer err_line, integer err_col)
integer n, is_number = true string text = ""
while t_alnum(the_ch) or the_ch = '_' do text &= the_ch if not t_digit(the_ch) then is_number = false end if next_ch() end while
if length(text) = 0 then error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch}) end if
if t_digit(text[1]) then if not is_number then error("%d %d invalid number: %s", {err_line, err_col, text}) end if n = to_integer(text) return {tk_Integer, err_line, err_col, n} end if
if has(key_words, text) then return {get(key_words, text), err_line, err_col} end if
return {tk_Ident, err_line, err_col, text}
end function
-- look ahead for '>=', etc. function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)
if next_ch() = expect then next_ch() return {ifyes, err_line, err_col} end if
if ifno = tk_EOI then error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch}) end if
return {ifno, err_line, err_col}
end function
-- return the next token type function get_tok()
while t_space(the_ch) do next_ch() end while
integer err_line = the_line integer err_col = the_col
switch the_ch do case EOF then return {tk_EOI, err_line, err_col} case '/' then return div_or_cmt(err_line, err_col) case '\ then return char_lit(err_line, err_col)
case '<' then return follow('=', tk_Leq, tk_Lss, err_line, err_col) case '>' then return follow('=', tk_Geq, tk_Gtr, err_line, err_col) case '=' then return follow('=', tk_Eq, tk_Assign, err_line, err_col) case '!' then return follow('=', tk_Neq, tk_Not, err_line, err_col) case '&' then return follow('&', tk_And, tk_EOI, err_line, err_col) case '|' then return follow('|', tk_Or, tk_EOI, err_line, err_col)
case '"' then return string_lit(the_ch, err_line, err_col) case else integer sym = symbols[the_ch] if sym != tk_EOI then next_ch() return {sym, err_line, err_col} end if return ident_or_int(err_line, err_col) end switch
end function
procedure init()
put(key_words, "else", tk_Else) put(key_words, "if", tk_If) put(key_words, "print", tk_Print) put(key_words, "putc", tk_Putc) put(key_words, "while", tk_While)
symbols = repeat(tk_EOI, 256) symbols['{'] = tk_Lbrace symbols['}'] = tk_Rbrace symbols['('] = tk_Lparen symbols[')'] = tk_Rparen symbols['+'] = tk_Add symbols['-'] = tk_Sub symbols['*'] = tk_Mul symbols['%'] = tk_Mod symbols[';'] = tk_Semi symbols[','] = tk_Comma
end procedure
procedure main(sequence cl)
sequence file_name
input_file = STDIN if length(cl) > 2 then file_name = cl[3] input_file = open(file_name, "r") if input_file = -1 then error("Could not open %s", {file_name}) end if end if init() sequence t loop do t = get_tok() printf(STDOUT, "%5d %5d %-8s", {t[2], t[3], all_syms[t[1]]}) switch t[1] do case tk_Integer then printf(STDOUT, " %5d\n", {t[4]}) case tk_Ident then printf(STDOUT, " %s\n", {t[4]}) case tk_String then printf(STDOUT, " \"%s\"\n", {t[4]}) case else printf(STDOUT, "\n") end switch until t[1] = tk_EOI end loop
end procedure
main(command_line())</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input
Flex
Tested with Flex 2.5.4. <lang C>%{
- include <stdio.h>
- include <stdlib.h>
- include <string.h>
- include <errno.h>
- include <limits.h>
- define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
typedef enum {
tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq, tk_Gtr, tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
} TokenType;
void yyerror(char msg[]) {
printf(msg); exit(1);
}
static int yynval;
struct yylloc {
int first_line, first_col; int last_line, last_col;
} yylloc;
static void update_loc() {
static int curr_line = 1; static int curr_col = 1;
yylloc.first_line = curr_line; yylloc.first_col = curr_col;
{char *s; for (s = yytext; *s != '\0'; s++) { if (*s == '\n') { curr_line++; curr_col = 1; } else { curr_col++; } }}
yylloc.last_line = curr_line; yylloc.last_col = curr_col-1;
}
- define YY_USER_ACTION update_loc();
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct { char *s; TokenType sym; } kwds[] = { {"else", tk_Else}, {"if", tk_If}, {"print", tk_Print}, {"putc", tk_Putc}, {"while", tk_While}, }, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
%}
%start COMMENT2
%option noyywrap
digit [0-9] ident [a-zA-Z_][a-zA-Z_0-9]*
number {digit}+ string \"[^"\n]*\" char_const \'([^'\n]|\\n|\\\\)\'
%%
<COMMENT2>[^*]+ ; <COMMENT2>\*[^/] ; <COMMENT2>\*\/ BEGIN 0; /* end comment */ "/*" BEGIN COMMENT2;
"{" {return tk_Lbrace;} "}" {return tk_Rbrace;} "(" {return tk_Lparen;} ")" {return tk_Rparen;} "*" {return tk_Mul;} "/" {return tk_Div;} "%" {return tk_Mod;} "+" {return tk_Add;} "-" {return tk_Sub;} "<" {return tk_Lss;} ">" {return tk_Gtr;} "<=" {return tk_Leq;} ">=" {return tk_Geq;} "!=" {return tk_Neq;} "!" {return tk_Not;} "&&" {return tk_And;} "||" {return tk_Or;} ";" {return tk_Semi;} "," {return tk_Comma;} "==" {return tk_Eq;} "=" {return tk_Assign;} {ident} {return get_ident_type(yytext);} {string} {return tk_String;}
[ \t\n]+ ; /* ignore whitespace */
{number} {
yynval = strtol(yytext, NULL, 0); if (yynval == LONG_MAX && errno == ERANGE) yyerror("Number exceeds maximum value");
return tk_Integer; }
{char_const} {
int n = yytext[1]; char *p = yytext;
if (yyleng < 3) yyerror("empty character constant"); ++p; if (p[0] == '\\') { ++p; if (p[0] == 'n') n = 10; else if (p[0] == '\\') n = '\\'; else yyerror("unknown escape sequence"); } yynval = n; return tk_Integer; }
. yyerror("Unknown character\n");
%%
int main(int argc, char *argv[]) {
int tok;
++argv, --argc; /* skip over program name */ yyin = stdin; if (argc > 0) yyin = fopen(argv[0], "r");
do { tok = yylex(); printf("%5d %5d %.15s", yylloc.first_line, yylloc.first_col, &"End_of_input Op_multiply Op_divide Op_mod Op_add " "Op_subtract Op_negate Op_not Op_less Op_lessequal " "Op_greater Op_greaterequal Op_equal Op_notequal Op_assign " "Op_and Op_or Keyword_if Keyword_else Keyword_while " "Keyword_print Keyword_putc LeftParen RightParen LeftBrace " "RightBrace Semicolon Comma Identifier Integer " "String " [tok * 16]);
if (tok == tk_Integer) printf(" %5d", yynval); else if (tok == tk_Ident) printf(" %s", yytext); else if (tok == tk_String) printf(" %s", yytext); printf("\n"); } while (tok != tk_EOI); return 0;
}</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 22 29 End_of_input
FreeBASIC
Tested with FreeBASIC 1.05 <lang FreeBASIC>enum Token_type
tk_EOI tk_Mul tk_Div tk_Mod tk_Add tk_Sub tk_Negate tk_Not tk_Lss tk_Leq tk_Gtr tk_Geq tk_Eq tk_Neq tk_Assign tk_And tk_Or tk_If tk_Else tk_While tk_Print tk_Putc tk_Lparen tk_Rparen tk_Lbrace tk_Rbrace tk_Semi tk_Comma tk_Ident tk_Integer tk_String
end enum
const NewLine = chr(10) const DoubleQuote = chr(34)
' where we store keywords and variables type Symbol
s_name as string tok as Token_type
end type
dim shared symtab() as Symbol
dim shared cur_line as string dim shared cur_ch as string dim shared line_num as integer dim shared col_num as integer
function is_digit(byval ch as string) as long
is_digit = (ch <> "") and ch >= "0" and ch <= "9"
end function
function is_alnum(byval ch as string) as long
is_alnum = (ch <> "") and ((UCase(ch) >= "A" and UCase(ch) <= "Z") or (is_digit(ch)))
end function
sub error_msg(byval eline as integer, byval ecol as integer, byval msg as string)
print "("; eline; ":"; ecol; ")"; " "; msg system
end sub
' add an identifier to the symbol table function install(byval s_name as string, byval tok as Token_type) as integer
dim n as integer
n = ubound(symtab) redim preserve symtab(n + 1) n = ubound(symtab)
symtab(n).s_name = s_name symtab(n).tok = tok return n
end function
' search for an identifier in the symbol table function lookup(byval s_name as string) as integer
dim i as integer
for i = lbound(symtab) to ubound(symtab) if symtab(i).s_name = s_name then return i next return -1
end function
sub next_line() ' read the next line of input from the source file
cur_line = "" cur_ch = "" ' empty cur_ch means end-of-file if eof(1) then exit sub line input #1, cur_line cur_line = cur_line + NewLine line_num += + 1 col_num = 1
end sub
sub next_char() ' get the next char
cur_ch = "" col_num += 1 if col_num > len(cur_line) then next_line() if col_num <= len(cur_line) then cur_ch = mid(cur_line, col_num, 1)
end sub
function follow(byval err_line as integer, byval err_col as integer, byval expect as string, byval ifyes as Token_type, byval ifno as Token_type) as Token_type
if cur_ch = expect then next_char() return ifyes end if if ifno = tk_eoi then error_msg(err_line, err_col, "follow unrecognized character: " + cur_ch) return ifno
end function
sub gettok(byref err_line as integer, byref err_col as integer, byref tok as Token_type, byref v as string)
' skip whitespace do while (cur_ch = " " or cur_ch = chr(9) or cur_ch = NewLine) and (cur_ch <> "") next_char() loop
err_line = line_num err_col = col_num
select case cur_ch case "": tok = tk_eoi: exit sub case "{": tok = tk_lbrace: next_char(): exit sub case "}": tok = tk_rbrace: next_char(): exit sub case "(": tok = tk_lparen: next_char(): exit sub case ")": tok = tk_rparen: next_char(): exit sub case "+": tok = tk_add: next_char(): exit sub case "-": tok = tk_sub: next_char(): exit sub case "*": tok = tk_mul: next_char(): exit sub case "%": tok = tk_Mod: next_char(): exit sub case ";": tok = tk_semi: next_char(): exit sub case ",": tok = tk_comma: next_char(): exit sub case "/": ' div or comment next_char() if cur_ch <> "*" then tok = tk_div exit sub end if ' skip comments do next_char() if cur_ch = "*" or cur_ch = "" then next_char() if cur_ch = "/" or cur_ch = "" then next_char() gettok(err_line, err_col, tok, v) exit sub end if end if loop case "'": ' single char literals next_char() v = str(Asc(cur_ch)) if cur_ch = "'" then error_msg(err_line, err_col, "empty character constant") if cur_ch = "\" then next_char() if cur_ch = "n" then v = "10" elseif cur_ch = "\" then v = Str(Asc("\")) else error_msg(err_line, err_col, "unknown escape sequence: " + cur_ch) end if end if next_char() if cur_ch <> "'" then error_msg(err_line, err_col, "multi-character constant") next_char() tok = tk_integer exit sub case "<": next_char(): tok = follow(err_line, err_col, "=", tk_Leq, tk_Lss): exit sub case ">": next_char(): tok = follow(err_line, err_col, "=", tk_Geq, tk_Gtr): exit sub case "!": next_char(): tok = follow(err_line, err_col, "=", tk_Neq, tk_Not): exit sub case "=": next_char(): tok = follow(err_line, err_col, "=", tk_Eq, tk_Assign): exit sub case "&": next_char(): tok = follow(err_line, err_col, "&", tk_And, tk_EOI): exit sub case "|": next_char(): tok = follow(err_line, err_col, "|", tk_Or, tk_EOI): exit sub case DoubleQuote: ' string v = cur_ch next_char() do while cur_ch <> DoubleQuote if cur_ch = NewLine then error_msg(err_line, err_col, "EOL in string") if cur_ch = "" then error_msg(err_line, err_col, "EOF in string") v += cur_ch next_char() loop v += cur_ch next_char() tok = tk_string exit sub case else ' integers or identifiers dim is_number as boolean = is_digit(cur_ch) v = "" do while is_alnum(cur_ch) orelse cur_ch = "_" if not is_digit(cur_ch) then is_number = false v += cur_ch next_char() loop if len(v) = 0 then error_msg(err_line, err_col, "unknown character: " + cur_ch) if is_digit(mid(v, 1, 1)) then if not is_number then error_msg(err_line, err_col, "invalid number: " + v) tok = tk_integer exit sub end if dim as integer index = lookup(v) if index = -1 then tok = tk_ident else tok = symtab(index).tok end if exit sub end select
end sub
sub init_lex(byval filein as string)
install("else", tk_else) install("if", tk_if) install("print", tk_print) install("putc", tk_putc) install("while", tk_while)
open filein for input as #1
cur_line = "" line_num = 0 col_num = 0 next_char()
end sub
sub scanner()
dim err_line as integer dim err_col as integer dim tok as Token_type dim v as string dim tok_list(tk_eoi to tk_string) as string
tok_list(tk_EOI ) = "End_of_input" tok_list(tk_Mul ) = "Op_multiply" tok_list(tk_Div ) = "Op_divide" tok_list(tk_Mod ) = "Op_mod" tok_list(tk_Add ) = "Op_add" tok_list(tk_Sub ) = "Op_subtract" tok_list(tk_Negate ) = "Op_negate" tok_list(tk_Not ) = "Op_not" tok_list(tk_Lss ) = "Op_less" tok_list(tk_Leq ) = "Op_lessequal" tok_list(tk_Gtr ) = "Op_greater" tok_list(tk_Geq ) = "Op_greaterequal" tok_list(tk_Eq ) = "Op_equal" tok_list(tk_Neq ) = "Op_notequal" tok_list(tk_Assign ) = "Op_assign" tok_list(tk_And ) = "Op_and" tok_list(tk_Or ) = "Op_or" tok_list(tk_If ) = "Keyword_if" tok_list(tk_Else ) = "Keyword_else" tok_list(tk_While ) = "Keyword_while" tok_list(tk_Print ) = "Keyword_print" tok_list(tk_Putc ) = "Keyword_putc" tok_list(tk_Lparen ) = "LeftParen" tok_list(tk_Rparen ) = "RightParen" tok_list(tk_Lbrace ) = "LeftBrace" tok_list(tk_Rbrace ) = "RightBrace" tok_list(tk_Semi ) = "Semicolon" tok_list(tk_Comma ) = "Comma" tok_list(tk_Ident ) = "Identifier" tok_list(tk_Integer) = "Integer" tok_list(tk_String ) = "String"
do gettok(err_line, err_col, tok, v) print using "##### ##### \ \"; err_line; err_col; tok_list(tok); if tok = tk_integer orelse tok = tk_ident orelse tok = tk_string then print " " + v; print loop until tok = tk_eoi
end sub
sub main()
if command(1) = "" then print "filename required" : system init_lex(command(1)) scanner()
end sub
main() system</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 22 30 End_of_input
Perl
<lang perl>#!/usr/bin/env perl
use strict; use warnings; no warnings 'once';
- ----- Definition of the language to be lexed -----#
my @tokens = (
# Name | Format | Value # # -------------- |----------------------|-------------# ['Op_multiply' , '*' , ], ['Op_divide' , '/' , ], ['Op_mod' , '%' , ], ['Op_add' , '+' , ], ['Op_subtract' , '-' , ], ['Op_lessequal' , '<=' , ], ['Op_less' , '<' , ], ['Op_greaterequal', '>=' , ], ['Op_greater' , '>' , ], ['Op_equal' , '==' , ], ['Op_assign' , '=' , ], ['Op_not' , '!' , ], ['Op_notequal' , '!=' , ], ['Op_and' , '&&' , ], ['Op_or' , '||' , ], ['Keyword_else' , qr/else\b/ , ], ['Keyword_if' , qr/if\b/ , ], ['Keyword_while' , qr/while\b/ , ], ['Keyword_print' , qr/print\b/ , ], ['Keyword_putc' , qr/putc\b/ , ],
['LeftParen' , '(' , ], ['RightParen' , ')' , ], ['LeftBrace' , '{' , ], ['RightBrace' , '}' , ], ['Semicolon' , ';' , ], ['Comma' , ',' , ],
['Identifier' , qr/[_a-z][_a-z0-9]*/i, \&raw ], ['Integer' , qr/[0-9]+\b/ , \&raw ], ['Integer' , qr/'([^']*)(')?/ , \&char_val ], ['String' , qr/"([^"]*)(")?/ , \&string_raw],
['End_of_input' , qr/$/ , ],
);
my $comment = qr/\/\* .+? (?: \*\/ | $ (?{die "End-of-file in comment\n"}) )/xs; my $whitespace = qr/(?: \s | $comment)*/x; my $unrecognized = qr/\w+ | ./x;
- | Returns the value of a matched char literal, or dies if it is invalid
sub char_val {
my $str = string_val(); die "Multiple characters\n" if length $str > 1; die "No character\n" if length $str == 0; ord $str;
}
- | Returns the value of a matched string literal, or dies if it is invalid
sub string_val {
my ($str, $end) = ($1, $2); die "End-of-file\n" if not defined $end; die "End-of-line\n" if $str =~ /\n/; $str =~ s/\\(.)/ $1 eq 'n' ? "\n" : $1 eq '\\' ? $1 : $1 eq $end ? $1 : die "Unknown escape sequence \\$1\n" /rge;
}
- | Returns the source string of a matched literal
sub raw { $& }
- | Returns the source string of a matched string literal, or dies if invalid
sub string_raw {
string_val(); # Just for the error handling side-effects $&;
}
- ----- Lexer "engine" -----#
- Construct the scanner regex:
my $tokens =
join "|", map { my $format = $tokens[$_][1]; "\n".(ref $format ? $format : quotemeta $format)." (*MARK:$_) "; } 0..$#tokens;
my $regex = qr/
\G (?| $whitespace \K (?| $tokens ) | $whitespace? \K ($unrecognized) (*MARK:!) )
/x;
- Run the lexer:
my $input = do { local $/ = undef; <STDIN> }; my $pos = 0; my $linecol = linecol_accumulator();
while ($input =~ /$regex/g) {
# Get the line and column number my ($line, $col) = $linecol->(substr $input, $pos, $-[0] - $pos); $pos = $-[0];
# Get the token type that was identified by the scanner regex my $type = $main::REGMARK; die "Unrecognized token $1 at line $line, col $col\n" if $type eq '!'; my ($name, $evaluator) = @{$tokens[$type]}[0, 2];
# Get the token value my $value; if ($evaluator) { eval { $value = $evaluator->() }; if ($@) { chomp $@; die "$@ in $name at line $line, col $col\n" } }
# Print the output line print "$line\t$col\t$name".($value ? "\t$value" : )."\n";
}
- | Returns a closure, which can be fed a string one piece at a time and gives
- | back the cumulative line and column number each time
sub linecol_accumulator {
my ($line, $col) = (1, 1); sub { my $str = shift; my @lines = split "\n", $str, -1; my ($l, $c) = @lines ? (@lines - 1, length $lines[-1]) : (0, 0); if ($l) { $line += $l; $col = 1 + $c } else { $col += $c } ($line, $col) }
}</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_not 11 41 Op_assign 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input
Perl 6
This is more complicated than strictly necessary for this task. It is set up to be easily adapted to do syntax analysis.
(Note: there are several bogus comments added solely to help with syntax highlighting.)
<lang perl6>grammar tiny_C {
rule TOP { ^ <.whitespace>? <tokens> + % <.whitespace> <.whitespace> <eoi> }
rule whitespace { [ <comment> + % <ws> | <ws> ] }
token comment { '/*' ~ '*/' .*? }
token tokens { [ | <operator> { make $/<operator>.ast } | <keyword> { make $/<keyword>.ast } | <symbol> { make $/<symbol>.ast } | <identifier> { make $/<identifier>.ast } | <integer> { make $/<integer>.ast } | <char> { make $/<char>.ast } | <string> { make $/<string>.ast } | <error> ] }
proto token operator {*} token operator:sym<*> { '*' { make 'Op_multiply' } } token operator:sym</> { '/'<!before '*'> { make 'Op_divide' } } token operator:sym<%> { '%' { make 'Op_mod' } } token operator:sym<+> { '+' { make 'Op_add' } } token operator:sym<-> { '-' { make 'Op_subtract' } } token operator:sym('<='){ '<=' { make 'Op_lessequal' } } token operator:sym('<') { '<' { make 'Op_less' } } token operator:sym('>='){ '>=' { make 'Op_greaterequal'} } token operator:sym('>') { '>' { make 'Op_greater' } } token operator:sym<==> { '==' { make 'Op_equal' } } token operator:sym<!=> { '!=' { make 'Op_notequal' } } token operator:sym<!> { '!' { make 'Op_not' } } token operator:sym<=> { '=' { make 'Op_assign' } } token operator:sym<&&> { '&&' { make 'Op_and' } } token operator:sym<||> { '||' { make 'Op_or' } }
proto token keyword {*} token keyword:sym<if> { 'if' { make 'Keyword_if' } } token keyword:sym<else> { 'else' { make 'Keyword_else' } } token keyword:sym<putc> { 'putc' { make 'Keyword_putc' } } token keyword:sym<while> { 'while' { make 'Keyword_while' } } token keyword:sym<print> { 'print' { make 'Keyword_print' } }
proto token symbol {*} token symbol:sym<(> { '(' { make 'LeftParen' } } token symbol:sym<)> { ')' { make 'RightParen' } } token symbol:sym<{> { '{' { make 'LeftBrace' } } token symbol:sym<}> { '}' { make 'RightBrace' } } token symbol:sym<;> { ';' { make 'Semicolon' } } token symbol:sym<,> { ',' { make 'Comma' } }
token identifier { <[_A..Za..z]><[_A..Za..z0..9]>* { make 'Identifier ' ~ $/ } } token integer { <[0..9]>+ { make 'Integer ' ~ $/ } }
token char { '\ [<-[']> | '\n' | '\\\\'] '\ { make 'Char_Literal ' ~ $/.subst("\\n", "\n").substr(1, *-1).ord } }
token string { '"' <-["\n]>* '"' #' { make 'String ' ~ $/; note 'Error: Unknown escape sequence.' and exit if (~$/ ~~ m:r/ <!after <[\\]>>[\\<-[n\\]>]<!before <[\\]>> /); } }
token eoi { $ { make 'End_of_input' } }
token error { | '\'\ { note 'Error: Empty character constant.' and exit } | '\ <-[']> ** {2..*} '\ { note 'Error: Multi-character constant.' and exit } | '/*' <-[*]>* $ { note 'Error: End-of-file in comment.' and exit } | '"' <-["]>* $ { note 'Error: End-of-file in string.' and exit } | '"' <-["]>*? \n { note 'Error: End of line in string.' and exit } #' }
}
sub parse_it ( $c_code ) {
my $l; my @pos = gather for $c_code.lines>>.chars.kv -> $line, $v { take [ $line + 1, $_ ] for 1 .. ($v+1); # v+1 for newline $l = $line+2; } @pos.push: [ $l, 1 ]; # capture eoi
for flat $c_code<tokens>.list, $c_code<eoi> -> $m { say join "\t", @pos[$m.from].fmt('%3d'), $m.ast; }
}
my $tokenizer = tiny_C.parse(@*ARGS[0].IO.slurp); parse_it( $tokenizer );</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Char_Literal 10 21 26 Char_Literal 92 22 26 Char_Literal 32 23 1 End_of_input
Python
Tested with Python 2.7 and 3.x <lang Python>from __future__ import print_function import sys
- following two must remain in the same order
tk_EOI, tk_Mul, tk_Div, tk_Mod, tk_Add, tk_Sub, tk_Negate, tk_Not, tk_Lss, tk_Leq, tk_Gtr, \ tk_Geq, tk_Eq, tk_Neq, tk_Assign, tk_And, tk_Or, tk_If, tk_Else, tk_While, tk_Print, \ tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, \ tk_Integer, tk_String = range(31)
all_syms = ["End_of_input", "Op_multiply", "Op_divide", "Op_mod", "Op_add", "Op_subtract",
"Op_negate", "Op_not", "Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal", "Op_equal", "Op_notequal", "Op_assign", "Op_and", "Op_or", "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc", "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma", "Identifier", "Integer", "String"]
- single character only symbols
symbols = { '{': tk_Lbrace, '}': tk_Rbrace, '(': tk_Lparen, ')': tk_Rparen, '+': tk_Add, '-': tk_Sub,
'*': tk_Mul, '%': tk_Mod, ';': tk_Semi, ',': tk_Comma }
key_words = {'if': tk_If, 'else': tk_Else, 'print': tk_Print, 'putc': tk_Putc, 'while': tk_While}
the_ch = " " # dummy first char - but it must be a space the_col = 0 the_line = 1 input_file = None
- show error and exit
def error(line, col, msg):
print(line, col, msg) exit(1)
- get the next character from the input
def next_ch():
global the_ch, the_col, the_line
the_ch = input_file.read(1) the_col += 1 if the_ch == '\n': the_line += 1 the_col = 0 return the_ch
- 'x' - character constants
def char_lit(err_line, err_col):
n = ord(next_ch()) # skip opening quote if the_ch == '\: error(err_line, err_col, "empty character constant") elif the_ch == '\\': next_ch() if the_ch == 'n': n = 10 elif the_ch == '\\': n = ord('\\') else: error(err_line, err_col, "unknown escape sequence \\%c" % (the_ch)) if next_ch() != '\: error(err_line, err_col, "multi-character constant") next_ch() return tk_Integer, err_line, err_col, n
- process divide or comments
def div_or_cmt(err_line, err_col):
if next_ch() != '*': return tk_Div, err_line, err_col
# comment found while True: if next_ch() == '*' and next_ch() == '/': next_ch() return gettok() elif len(the_ch) == 0: error(err_line, err_col, "EOF in comment")
- "string"
def string_lit(start, err_line, err_col):
text = ""
while next_ch() != start: if len(the_ch) == 0: error(err_line, err_col, "EOF while scanning string literal") if the_ch == '\n': error(err_line, err_col, "EOL while scanning string literal") text += the_ch
next_ch() return tk_String, err_line, err_col, text
- handle identifiers and integers
def ident_or_int(err_line, err_col):
is_number = True text = ""
while the_ch.isalnum() or the_ch == '_': text += the_ch if not the_ch.isdigit(): is_number = False next_ch()
if len(text) == 0: error(err_line, err_col, "ident_or_int: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
if text[0].isdigit(): if not is_number: error(err_line, err_col, "invalid number: %s" % (text)) n = int(text) return tk_Integer, err_line, err_col, n
if text in key_words: return key_words[text], err_line, err_col
return tk_Ident, err_line, err_col, text
- look ahead for '>=', etc.
def follow(expect, ifyes, ifno, err_line, err_col):
if next_ch() == expect: next_ch() return ifyes, err_line, err_col
if ifno == tk_EOI: error(err_line, err_col, "follow: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
return ifno, err_line, err_col
- return the next token type
def gettok():
while the_ch.isspace(): next_ch()
err_line = the_line err_col = the_col
if len(the_ch) == 0: return tk_EOI, err_line, err_col elif the_ch == '/': return div_or_cmt(err_line, err_col) elif the_ch == '\: return char_lit(err_line, err_col) elif the_ch == '<': return follow('=', tk_Leq, tk_Lss, err_line, err_col) elif the_ch == '>': return follow('=', tk_Geq, tk_Gtr, err_line, err_col) elif the_ch == '=': return follow('=', tk_Eq, tk_Assign, err_line, err_col) elif the_ch == '!': return follow('=', tk_Neq, tk_Not, err_line, err_col) elif the_ch == '&': return follow('&', tk_And, tk_EOI, err_line, err_col) elif the_ch == '|': return follow('|', tk_Or, tk_EOI, err_line, err_col) elif the_ch == '"': return string_lit(the_ch, err_line, err_col) elif the_ch in symbols: sym = symbols[the_ch] next_ch() return sym, err_line, err_col else: return ident_or_int(err_line, err_col)
- main driver
input_file = sys.stdin if len(sys.argv) > 1:
try: input_file = open(sys.argv[1], "r", 4096) except IOError as e: error(0, 0, "Can't open %s" % sys.argv[1])
while True:
t = gettok() tok = t[0] line = t[1] col = t[2]
print("%5d %5d %-14s" % (line, col, all_syms[tok]), end=)
if tok == tk_Integer: print(" %5d" % (t[3])) elif tok == tk_Ident: print(" %s" % (t[3])) elif tok == tk_String: print(' "%s"' % (t[3])) else: print("")
if tok == tk_EOI: break</lang>
- Output — test case 3:
5 16 Keyword_print 5 40 Op_subtract 6 16 Keyword_putc 6 40 Op_less 7 16 Keyword_if 7 40 Op_greater 8 16 Keyword_else 8 40 Op_lessequal 9 16 Keyword_while 9 40 Op_greaterequal 10 16 LeftBrace 10 40 Op_equal 11 16 RightBrace 11 40 Op_notequal 12 16 LeftParen 12 40 Op_and 13 16 RightParen 13 40 Op_or 14 16 Op_subtract 14 40 Semicolon 15 16 Op_not 15 40 Comma 16 16 Op_multiply 16 40 Op_assign 17 16 Op_divide 17 40 Integer 42 18 16 Op_mod 18 40 String "String literal" 19 16 Op_add 19 40 Identifier variable_name 20 26 Integer 10 21 26 Integer 92 22 26 Integer 32 23 1 End_of_input