Compiler/lexical analyzer: Difference between revisions
(removing the clarify template - I think everything has been cleared up now.) |
(use shorter example in the whitespace section) |
||
Line 144: | Line 144: | ||
** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator. |
** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator. |
||
⚫ | |||
The following programs are equivalent: |
|||
* <lang c>if ( p /* meaning n is prime */ ) { |
|||
⚫ | |||
⚫ | |||
count = 1 ; |
|||
count = count + 1 ; /* number of primes found so far */ |
|||
n = 1 ; |
|||
⚫ | |||
limit = 100 ; |
|||
* <lang c>if(p){print(n," ");count=count+1;</lang> |
|||
while ( n < limit ) { |
|||
k = 3 ; |
|||
p = 1 ; |
|||
n = n + 2 ; |
|||
while ( ( k * k <= n ) && ( p ) ) { |
|||
p = n / k * k != n ; |
|||
k = k + 2 ; |
|||
} |
|||
if ( p ) { |
|||
⚫ | |||
count = count + 1 ; |
|||
} |
|||
} |
|||
print ( count , "\n" ) ; |
|||
</lang> |
|||
<lang c> |
|||
count=1;n=1;limit=100;while(n<limit){k=3;p=1;n=n+2;while((k*k<=n)&&(p)){p=n/k*k!=n;k=k+2;}if(p){print(n," ");count=count+1;}}print(count,"\n"); |
|||
</lang> |
|||
⚫ | |||
;Complete list of token names |
;Complete list of token names |
Revision as of 08:41, 20 August 2016
Definition from Wikipedia:
- Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin, and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: One without the lexer module, and one with.
The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:
- Operators
Name | Common name | Character sequence |
---|---|---|
OP_MULTIPLY | multiply | * |
OP_DIVIDE | divide | / |
OP_ADD | plus | + |
OP_SUBTRACT | minus | - |
OP_NEGATE | unary minus | - |
OP_LESS | less than | < |
OP_LESSEQUAL | less than or equal | <= |
OP_GREATER | greater than | > |
OP_NOTEQUAL | not equal | != |
OP_ASSIGN | assignment | = |
OP_AND | logical and | && |
- The
-
token should always be interpreted as OP_SUBTRACT by the lexer. Turning some OP_SUBTRACT into OP_NEGATE will be the job of the syntax analyzer, which is not part of this task.
- Symbols
Name | Common name | Character |
---|---|---|
LEFTPAREN | left parenthesis | ( |
RIGHTPAREN | right parenthesis | ) |
LEFTBRACE | left brace | { |
RIGHTBRACE | right brace | } |
SEMICOLON | semi-colon | ; |
COMMA | comma | , |
- Keywords
Name | Character sequence |
---|---|
KEYWORD_IF | if |
KEYWORD_WHILE | while |
KEYWORD_PRINT | |
KEYWORD_PUTC | putc |
- Identifiers and literals
These differ from the the previous tokens, in that each occurrence of them has a value associated with it.
Name | Common name | Format description | Format regex | Value |
---|---|---|---|---|
IDENTIFIER | identifier | one or more letter/number/underscore characters, but not starting with a number | [_a-zA-Z][_a-zA-Z0-9]*
|
as is |
INTEGER | integer literal | one or more digits | [0-9]+
|
as is, interpreted as a number |
INTEGER | char literal | exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes | '([^'\n]|\\n|\\\\)'
|
the ASCII code point number of the character, e.g. 65 for 'A' and 10 for '\n'
|
STRING | string literal | zero or more characters (anything except newline or double quote), enclosed by double quotes | "[^"\n]*"
|
the characters without the double quotes and with escape sequences converted |
- For char and string literals, the
\n
escape sequence is supported to represent a new-line character. - For char literals, to represent a backslash, use
\\
. - No other special sequences are supported. This means that:
- Char literals cannot represent a single quote character (value 39).
- String literals cannot represent strings containing double quote characters.
- Zero-width tokens
Name | Location |
---|---|
END_OF_INPUT | when the end of the input stream is reached |
- White space
- Zero or more whitespace characters, or comments enclosed in
/* ... */
, are allowed between any two tokens, with the exceptions noted below. - "Longest token matching" is used to resolve conflicts (e.g., in order to match <= as a single token rather than the two tokens < and =).
- Whitespace is required between two tokens that have an alphanumeric character or underscore at the edge.
- This means: keywords, identifiers, and integer literals.
- e.g.
ifprint
is recognized as an identifier, instead of the keywords if and print. - e.g.
42fred
is invalid, and neither recognized as a number nor an identifier.
- Whitespace is not allowed inside of tokens (except for chars and strings where they are part of the value).
- e.g.
& &
is invalid, and not interpreted as the && operator.
- e.g.
For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:
- <lang c>if ( p /* meaning n is prime */ ) {
print ( n , " " ) ; count = count + 1 ; /* number of primes found so far */
}</lang>
- <lang c>if(p){print(n," ");count=count+1;</lang>
- Complete list of token names
END_OF_INPUT OP_MULTIPLY OP_DIVIDE OP_ADD OP_SUBTRACT OP_NEGATE OP_LESS OP_LESSEQUAL OP_GREATER OP_NOTEQUAL OP_ASSIGN OP_AND KEYWORD_IF KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC LEFTPAREN RIGHTPAREN LEFTBRACE RIGHTBRACE SEMICOLON COMMA IDENTIFIER INTEGER STRING
The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
- the line number where the token starts
- the column number where the token starts
- the token name
- the token value (only for IDENTIFIER, INTEGER, and STRING tokens)
The following error conditions should be caught:
Error | Example |
---|---|
Empty character constant | ''
|
Unknown escape sequence. | \r
|
Multi-character constant. | 'xx'
|
End-of-file in comment. Closing comment characters not found. | |
End-of-file while scanning string literal. Closing string character not found. | |
End-of-line while scanning string literal. Closing string character not found before end-of-line. | |
Unrecognized character. |
Input | Output |
---|---|
<lang c> /* Hello world */ print("Hello, World!\n"); </lang> |
4 1 KEYWORD_PRINT 4 6 LEFTPAREN 4 7 STRING "Hello, World!\n" 4 24 RIGHTPAREN 4 25 SEMICOLON 5 1 END_OF_INPUT |
<lang c> /* Show Ident and Integers */ phoenix_number = 142857; print(phoenix_number, "\n"); </lang> |
4 1 IDENTIFIER phoenix_number 4 16 OP_ASSIGN 4 18 INTEGER 142857 4 24 SEMICOLON 5 1 KEYWORD_PRINT 5 6 LEFTPAREN 5 7 IDENTIFIER phoenix_number 5 21 COMMA 5 23 STRING "\n" 5 27 RIGHTPAREN 5 28 SEMICOLON 6 1 END_OF_INPUT |
<lang c> /* All lexical tokens - not syntactically correct, but that will have to wait until syntax analysis */ /* Print */ print /* Sub */ - /* Putc */ putc /* Lss */ < /* If */ if /* Gtr */ > /* While */ while /* Leq */ <= /* Lbrace */ { /* Neq */ != /* Rbrace */ } /* And */ && /* Lparen */ ( /* Semi */ ; /* Rparen */ ) /* Comma */ , /* Uminus */ - /* Assign */ = /* Mul */ * /* Integer */ 42 /* Div */ / /* String */ "String literal" /* Add */ + /* Ident */ variable_name /* character literal */ '\n' /* character literal */ '\\' /* character literal */ ' ' </lang> |
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT |
The Flex, C, Python and Euphoria versions can be considered reference implementations.
C
Tested with gcc 4.81 and later, compiles warning free with -Wall -Wextra <lang C>
- include <stdlib.h>
- include <stdio.h>
- include <stdarg.h>
- include <ctype.h>
- include <string.h>
- include <errno.h>
- include <stdbool.h>
- include <limits.h>
- define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
- define da_dim(name, type) type *name = NULL; \
int _qy_ ## name ## _p = 0; \ int _qy_ ## name ## _max = 0
- define da_rewind(name) _qy_ ## name ## _p = 0
- define da_redim(name) if (_qy_ ## name ## _p >= _qy_ ## name ## _max) \
name = realloc(name, (_qy_ ## name ## _max += 32) * sizeof(name[0]))
- define da_append(name, x) do {da_redim(name); name[_qy_ ## name ## _p++] = x;} while (0)
- define da_len(name) _qy_ ## name ## _p
typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,
tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
} TokenType;
typedef struct {
int tok; int err_ln, err_col; union { int n; /* value for constants */ char *text; /* text for idents */ };
} tok_s;
static FILE *source_fp, *dest_fp; static int line = 1, col = 0, the_ch = ' '; da_dim(text, char);
tok_s gettok();
static void error(int err_line, int err_col, const char *fmt, ... ) {
char buf[1000]; va_list ap;
va_start(ap, fmt); vsprintf(buf, fmt, ap); va_end(ap); printf("(%d,%d) error: %s\n", err_line, err_col, buf); exit(1);
}
static int next_ch() { /* get next char from input */
the_ch = getc(source_fp); ++col; if (the_ch == '\n') { ++line; col = 0; } return the_ch;
}
static tok_s char_lit(int n, int err_line, int err_col) { /* 'x' */
if (the_ch == '\) error(err_line, err_col, "gettok: empty character constant"); if (the_ch == '\\') { next_ch(); if (the_ch == 'n') n = 10; else if (the_ch == '\\') n = '\\'; else error(err_line, err_col, "gettok: unknown escape sequence \\%c", the_ch); } if (next_ch() != '\) error(err_line, err_col, "multi-character constant"); next_ch(); return (tok_s){tk_Integer, err_line, err_col, {n}};
}
static tok_s div_or_cmt(int err_line, int err_col) { /* process divide or comments */
if (the_ch != '*') return (tok_s){tk_Div, err_line, err_col, {0}};
/* comment found */ for (;;) { if (next_ch() == '*' && next_ch() == '/') { next_ch(); return gettok(); } else if (the_ch == EOF) error(err_line, err_col, "EOF in comment"); }
}
static tok_s string_lit(int start, int err_line, int err_col) { /* "st" */
da_rewind(text);
while (next_ch() != start) { if (the_ch == '\n') error(err_line, err_col, "EOL in string"); if (the_ch == EOF) error(err_line, err_col, "EOF in string"); da_append(text, (char)the_ch); } da_append(text, '\0');
next_ch(); return (tok_s){tk_String, err_line, err_col, {.text=text}};
}
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct { char *s; TokenType sym; } kwds[] = { {"if", tk_If}, {"print", tk_Print}, {"putc", tk_Putc}, {"while", tk_While}, }, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
static tok_s ident_or_int(int err_line, int err_col) {
int n, is_number = true;
da_rewind(text); while (isalnum(the_ch) || the_ch == '_') { da_append(text, (char)the_ch); if (!isdigit(the_ch)) is_number = false; next_ch(); } if (da_len(text) == 0) error(err_line, err_col, "gettok: unrecognized character (%d) '%c'\n", the_ch, the_ch); da_append(text, '\0'); if (isdigit(text[0])) { if (!is_number) error(err_line, err_col, "invalid number: %s\n", text); n = strtol(text, NULL, 0); if (n == LONG_MAX && errno == ERANGE) error(err_line, err_col, "Number exceeds maximum value"); return (tok_s){tk_Integer, err_line, err_col, {n}}; } return (tok_s){get_ident_type(text), err_line, err_col, {.text=text}};
}
static tok_s follow(int expect, TokenType ifyes, TokenType ifno, int err_line, int err_col) { /* look ahead for '>=', etc. */
if (the_ch == expect) { next_ch(); return (tok_s){ifyes, err_line, err_col, {0}}; } if (ifno == tk_EOI) error(err_line, err_col, "follow: unrecognized character '%c' (%d)\n", the_ch, the_ch); return (tok_s){ifno, err_line, err_col, {0}};
}
tok_s gettok() { /* return the token type */
/* skip white space */ while (isspace(the_ch)) next_ch(); int err_line = line; int err_col = col; switch (the_ch) { case '{': next_ch(); return (tok_s){tk_Lbrace, err_line, err_col, {0}}; case '}': next_ch(); return (tok_s){tk_Rbrace, err_line, err_col, {0}}; case '(': next_ch(); return (tok_s){tk_Lparen, err_line, err_col, {0}}; case ')': next_ch(); return (tok_s){tk_Rparen, err_line, err_col, {0}}; case '+': next_ch(); return (tok_s){tk_Add, err_line, err_col, {0}}; case '-': next_ch(); return (tok_s){tk_Sub, err_line, err_col, {0}}; case '*': next_ch(); return (tok_s){tk_Mul, err_line, err_col, {0}}; case ';': next_ch(); return (tok_s){tk_Semi, err_line, err_col, {0}}; case ',': next_ch(); return (tok_s){tk_Comma,err_line, err_col, {0}}; case '>': next_ch(); return (tok_s){tk_Gtr, err_line, err_col, {0}}; case '=': next_ch(); return (tok_s){tk_Assign, err_line, err_col, {0}}; case '/': next_ch(); return div_or_cmt(err_line, err_col); case '\: next_ch(); return char_lit(the_ch, err_line, err_col); case '<': next_ch(); return follow('=', tk_Leq, tk_Lss, err_line, err_col); case '!': next_ch(); return follow('=', tk_Neq, tk_EOI, err_line, err_col); case '&': next_ch(); return follow('&', tk_And, tk_EOI, err_line, err_col); case '"' : return string_lit(the_ch, err_line, err_col); default: return ident_or_int(err_line, err_col); case EOF: return (tok_s){tk_EOI, err_line, err_col, {0}}; }
}
void run() { /* tokenize the given input */
tok_s tok; do { tok = gettok(); fprintf(dest_fp, "%5d %5d %.14s", tok.err_ln, tok.err_col, &"END_OF_INPUT OP_MULTIPLY OP_DIVIDE OP_ADD OP_SUBTRACT OP_NEGATE " "OP_LESS OP_LESSEQUAL OP_GREATER OP_NOTEQUAL OP_ASSIGN OP_AND " "KEYWORD_IF KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC LEFTPAREN RIGHTPAREN " "LEFTBRACE RIGHTBRACE SEMICOLON COMMA IDENTIFIER INTEGER " "STRING "[tok.tok * 14]);
if (tok.tok == tk_Integer) fprintf(dest_fp, " %4d", tok.n); else if (tok.tok == tk_Ident) fprintf(dest_fp, " %s", tok.text); else if (tok.tok == tk_String) fprintf(dest_fp, " \"%s\"", tok.text); fprintf(dest_fp, "\n"); } while (tok.tok != tk_EOI); if (dest_fp != stdout) fclose(dest_fp);
}
void init_io(FILE **fp, FILE *std, const char mode[], const char fn[]) {
if (fn[0] == '\0') *fp = std; else if ((*fp = fopen(fn, mode)) == NULL) error(0, 0, "Can't open %s\n", fn);
}
int main(int argc, char *argv[]) {
init_io(&source_fp, stdin, "r", argc > 1 ? argv[1] : ""); init_io(&dest_fp, stdout, "wb", argc > 2 ? argv[2] : ""); run(); return 0;
} </lang>
Output from test case 3:
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT
Euphoria
Tested with Euphoria 4.05. <lang euphoria> include std/io.e include std/map.e include std/types.e include std/convert.e
constant true = 1, false = 0, EOF = -1
enum tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq,
tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
constant all_syms = {"END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT",
"OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN", "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN", "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER", "INTEGER", "STRING"}
integer input_file, the_ch = ' ', the_col = 0, the_line = 1 sequence symbols map key_words = new()
procedure error(sequence format, sequence data)
printf(STDOUT, format, data) abort(1)
end procedure
-- get the next character from the input function next_ch()
the_ch = getc(input_file) the_col += 1 if the_ch = '\n' then the_line += 1 the_col = 0 end if return the_ch
end function
-- 'x' - character constants function char_lit(integer err_line, integer err_col)
integer n = next_ch() -- skip opening quote if the_ch = '\ then error("%d %d empty character constant", {err_line, err_col}) elsif the_ch = '\\' then next_ch() if the_ch = 'n' then n = 10 elsif the_ch = '\\' then n = '\\' else error("%d %d unknown escape sequence \\%c", {err_line, err_col, the_ch}) end if end if if next_ch() != '\ then error("%d %d multi-character constant", {err_line, err_col}) end if next_ch() return {tk_Integer, err_line, err_col, n}
end function
-- process divide or comments function div_or_cmt(integer err_line, integer err_col)
if next_ch() != '*' then return {tk_Div, err_line, err_col} end if
-- comment found while true do if next_ch() = '*' and next_ch() = '/' then next_ch() return get_tok() elsif the_ch = EOF then error("%d %d EOF in comment", {err_line, err_col}) end if end while
end function
-- "string" function string_lit(integer start, integer err_line, integer err_col)
string text = ""
while next_ch() != start do if the_ch = EOF then error("%d %d EOF while scanning string literal", {err_line, err_col}) end if if the_ch = '\n' then error("%d %d EOL while scanning string literal", {err_line, err_col}) end if text &= the_ch end while
next_ch() return {tk_String, err_line, err_col, text}
end function
-- handle identifiers and integers function ident_or_int(integer err_line, integer err_col)
integer n, is_number = true string text = ""
while t_alnum(the_ch) or the_ch = '_' do text &= the_ch if not t_digit(the_ch) then is_number = false end if next_ch() end while
if length(text) = 0 then error("%d %d ident_or_int: unrecognized character: (%d) '%s'", {err_line, err_col, the_ch, the_ch}) end if
if t_digit(text[1]) then if not is_number then error("%d %d invalid number: %s", {err_line, err_col, text}) end if n = to_integer(text) return {tk_Integer, err_line, err_col, n} end if
if has(key_words, text) then return {get(key_words, text), err_line, err_col} end if
return {tk_Ident, err_line, err_col, text}
end function
-- look ahead for '>=', etc. function follow(integer expect, integer ifyes, integer ifno, integer err_line, integer err_col)
if next_ch() = expect then next_ch() return {ifyes, err_line, err_col} end if
if ifno = tk_EOI then error("%d %d follow: unrecognized character: (%d)", {err_line, err_col, the_ch}) end if
return {ifno, err_line, err_col}
end function
-- return the next token type function get_tok()
while t_space(the_ch) do next_ch() end while
integer err_line = the_line integer err_col = the_col
switch the_ch do case EOF then return {tk_EOI, err_line, err_col} case '/' then return div_or_cmt(err_line, err_col) case '\ then return char_lit(err_line, err_col) case '<' then return follow('=', tk_Leq, tk_Lss, err_line, err_col) case '!' then return follow('=', tk_Neq, tk_EOI, err_line, err_col) case '&' then return follow('&', tk_And, tk_EOI, err_line, err_col) case '"' then return string_lit(the_ch, err_line, err_col) case else integer sym = symbols[the_ch] if sym != tk_EOI then next_ch() return {sym, err_line, err_col} end if return ident_or_int(err_line, err_col) end switch
end function
procedure init()
put(key_words, "if", tk_If) put(key_words, "print", tk_Print) put(key_words, "putc", tk_Putc) put(key_words, "while", tk_While)
symbols = repeat(tk_EOI, 256) symbols['{'] = tk_Lbrace symbols['}'] = tk_Rbrace symbols['('] = tk_Lparen symbols[')'] = tk_Rparen symbols['+'] = tk_Add symbols['-'] = tk_Sub symbols['*'] = tk_Mul symbols[';'] = tk_Semi symbols[','] = tk_Comma symbols['>'] = tk_Gtr symbols['='] = tk_Assign
end procedure
procedure main(sequence cl)
sequence file_name
input_file = STDIN if length(cl) > 2 then file_name = cl[3] input_file = open(file_name, "r") if input_file = -1 then error("Could not open %s", {file_name}) end if end if init() sequence t loop do t = get_tok() printf(STDOUT, "%5d %5d %-8s", {t[2], t[3], all_syms[t[1]]}) switch t[1] do case tk_Integer then printf(STDOUT, " %5d\n", {t[4]}) case tk_Ident then printf(STDOUT, " %s\n", {t[4]}) case tk_String then printf(STDOUT, " \"%s\"\n", {t[4]}) case else printf(STDOUT, "\n") end switch until t[1] = tk_EOI end loop
end procedure
main(command_line()) </lang>
Output from test case 3:
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT
Flex
Tested with Flex 2.5.4. <lang C> %{
- include <stdio.h>
- include <stdlib.h>
- include <string.h>
- include <errno.h>
- include <limits.h>
- define NELEMS(arr) (sizeof(arr) / sizeof(arr[0]))
typedef enum {tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr,
tk_Neq, tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String
} TokenType;
void yyerror(char msg[]) {
printf(msg); exit(1);
}
static int yynval;
struct yylloc {
int first_line, first_col; int last_line, last_col;
} yylloc;
static void update_loc() {
static int curr_line = 1; static int curr_col = 1;
yylloc.first_line = curr_line; yylloc.first_col = curr_col;
{char *s; for (s = yytext; *s != '\0'; s++) { if (*s == '\n') { curr_line++; curr_col = 1; } else { curr_col++; } }}
yylloc.last_line = curr_line; yylloc.last_col = curr_col-1;
}
- define YY_USER_ACTION update_loc();
static int kwd_cmp(const void *p1, const void *p2) {
return strcmp(*(char **)p1, *(char **)p2);
}
static TokenType get_ident_type(const char *ident) {
static struct { char *s; TokenType sym; } kwds[] = { {"if", tk_If}, {"print", tk_Print}, {"putc", tk_Putc}, {"while", tk_While}, }, *kwp;
return (kwp = bsearch(&ident, kwds, NELEMS(kwds), sizeof(kwds[0]), kwd_cmp)) == NULL ? tk_Ident : kwp->sym;
}
%}
%start COMMENT2
%option noyywrap
digit [0-9] ident [a-zA-Z_][a-zA-Z_0-9]*
number {digit}+ string \"[^"\n]*\" char_const \'([^'\n]|\\n|\\\\)\'
%%
<COMMENT2>[^*]+ ; <COMMENT2>\*[^/] ; <COMMENT2>\*\/ BEGIN 0; /* end comment */ "/*" BEGIN COMMENT2;
"{" {return tk_Lbrace;} "}" {return tk_Rbrace;} "(" {return tk_Lparen;} ")" {return tk_Rparen;} "*" {return tk_Mul;} "/" {return tk_Div;} "+" {return tk_Add;} "-" {return tk_Sub;} "<" {return tk_Lss;} ">" {return tk_Gtr;} "<=" {return tk_Leq;} "!=" {return tk_Neq;} "&&" {return tk_And;} ";" {return tk_Semi;} "," {return tk_Comma;} "=" {return tk_Assign;} {ident} {return get_ident_type(yytext);} {string} {return tk_String;}
[ \t\n]+ ; /* ignore whitespace */
{number} {
yynval = strtol(yytext, NULL, 0); if (yynval == LONG_MAX && errno == ERANGE) yyerror("Number exceeds maximum value");
return tk_Integer; }
{char_const} {
int n = yytext[1]; char *p = yytext;
if (yyleng < 3) yyerror("empty character constant"); ++p; if (p[0] == '\\') { ++p; if (p[0] == 'n') n = 10; else if (p[0] == '\\') n = '\\'; else yyerror("unknown escape sequence"); } yynval = n; return tk_Integer; }
. yyerror("Unknown character\n");
%%
int main(int argc, char *argv[]) {
int tok;
++argv, --argc; /* skip over program name */ yyin = stdin; if (argc > 0) yyin = fopen(argv[0], "r");
do { tok = yylex(); printf("%5d %5d %.14s", yylloc.first_line, yylloc.first_col, &"END_OF_INPUT OP_MULTIPLY OP_DIVIDE OP_ADD OP_SUBTRACT OP_NEGATE " "OP_LESS OP_LESSEQUAL OP_GREATER OP_NOTEQUAL OP_ASSIGN OP_AND " "KEYWORD_IF KEYWORD_WHILE KEYWORD_PRINT KEYWORD_PUTC LEFTPAREN RIGHTPAREN " "LEFTBRACE RIGHTBRACE SEMICOLON COMMA IDENTIFIER INTEGER " "STRING "[tok * 14]);
if (tok == tk_Integer) printf(" %5d", yynval); else if (tok == tk_Ident) printf(" %s", yytext); else if (tok == tk_String) printf(" %s", yytext); printf("\n"); } while (tok != tk_EOI); return 0;
} </lang>
Output from test case 3:
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 19 29 END_OF_INPUT
FreeBASIC
Tested with FreeBASIC 1.05 <lang FreeBASIC> enum Token_type
tk_EOI tk_Mul tk_Div tk_Add tk_Sub tk_Uminus tk_Lss tk_Leq tk_Gtr tk_Neq tk_Assign tk_And tk_If tk_While tk_Print tk_Putc tk_Lparen tk_Rparen tk_Lbrace tk_Rbrace tk_Semi tk_Comma tk_Ident tk_Integer tk_String
end enum
const NewLine = chr(10) const DoubleQuote = chr(34)
' where we store keywords and variables type Symbol
s_name as string tok as Token_type
end type
dim shared symtab() as Symbol
dim shared cur_line as string dim shared cur_ch as string dim shared line_num as integer dim shared col_num as integer
function is_digit(byval ch as string) as long
is_digit = (ch <> "") and ch >= "0" and ch <= "9"
end function
function is_alnum(byval ch as string) as long
is_alnum = (ch <> "") and ((UCase(ch) >= "A" and UCase(ch) <= "Z") or (is_digit(ch)))
end function
sub error_msg(byval eline as integer, byval ecol as integer, byval msg as string)
print "("; eline; ":"; ecol; ")"; " "; msg system
end sub
' add an identifier to the symbol table function install(byval s_name as string, byval tok as Token_type) as integer
dim n as integer
n = ubound(symtab) redim preserve symtab(n + 1) n = ubound(symtab)
symtab(n).s_name = s_name symtab(n).tok = tok return n
end function
' search for an identifier in the symbol table function lookup(byval s_name as string) as integer
dim i as integer
for i = lbound(symtab) to ubound(symtab) if symtab(i).s_name = s_name then return i next return -1
end function
sub next_line() ' read the next line of input from the source file
cur_line = "" cur_ch = "" ' empty cur_ch means end-of-file if eof(1) then exit sub line input #1, cur_line cur_line = cur_line + NewLine line_num += + 1 col_num = 1
end sub
sub next_char() ' get the next char
cur_ch = "" col_num += 1 if col_num > len(cur_line) then next_line() if col_num <= len(cur_line) then cur_ch = mid(cur_line, col_num, 1)
end sub
function follow(byval err_line as integer, byval err_col as integer, byval expect as string, byval ifyes as Token_type, byval ifno as Token_type) as Token_type
if cur_ch = expect then next_char() return ifyes end if if ifno = tk_eoi then error_msg(err_line, err_col, "follow unrecognized character: " + cur_ch) return ifno
end function
sub gettok(byref err_line as integer, byref err_col as integer, byref tok as Token_type, byref v as string)
' skip whitespace do while (cur_ch = " " or cur_ch = chr(9) or cur_ch = NewLine) and (cur_ch <> "") next_char() loop
err_line = line_num err_col = col_num
select case cur_ch case "": tok = tk_eoi: exit sub case "{": tok = tk_lbrace: next_char(): exit sub case "}": tok = tk_rbrace: next_char(): exit sub case "(": tok = tk_lparen: next_char(): exit sub case ")": tok = tk_rparen: next_char(): exit sub case "+": tok = tk_add: next_char(): exit sub case "-": tok = tk_sub: next_char(): exit sub case "*": tok = tk_mul: next_char(): exit sub case ";": tok = tk_semi: next_char(): exit sub case ",": tok = tk_comma: next_char(): exit sub case ">": tok = tk_gtr: next_char(): exit sub case "=": tok = tk_assign: next_char(): exit sub case "/": ' div or comment next_char() if cur_ch <> "*" then tok = tk_div exit sub end if ' skip comments do next_char() if cur_ch = "*" or cur_ch = "" then next_char() if cur_ch = "/" or cur_ch = "" then next_char() gettok(err_line, err_col, tok, v) exit sub end if end if loop case "'": ' single char literals next_char() v = str(Asc(cur_ch)) if cur_ch = "'" then error_msg(err_line, err_col, "empty character constant") if cur_ch = "\" then next_char() if cur_ch = "n" then v = "10" elseif cur_ch = "\" then v = Str(Asc("\")) else error_msg(err_line, err_col, "unknown escape sequence: " + cur_ch) end if end if next_char() if cur_ch <> "'" then error_msg(err_line, err_col, "multi-character constant") next_char() tok = tk_integer exit sub case "<": next_char(): tok = follow(err_line, err_col, "=", tk_Leq, tk_Lss): exit sub case "!": next_char(): tok = follow(err_line, err_col, "=", tk_Neq, tk_EOI): exit sub case "&": next_char(): tok = follow(err_line, err_col, "&", tk_And, tk_EOI): exit sub case DoubleQuote: ' string v = cur_ch next_char() do while cur_ch <> DoubleQuote if cur_ch = NewLine then error_msg(err_line, err_col, "EOL in string") if cur_ch = "" then error_msg(err_line, err_col, "EOF in string") v += cur_ch next_char() loop v += cur_ch next_char() tok = tk_string exit sub case else ' integers or identifiers dim is_number as boolean = is_digit(cur_ch) v = "" do while is_alnum(cur_ch) orelse cur_ch = "_" if not is_digit(cur_ch) then is_number = false v += cur_ch next_char() loop if len(v) = 0 then error_msg(err_line, err_col, "unknown character: " + cur_ch) if is_digit(mid(v, 1, 1)) then if not is_number then error_msg(err_line, err_col, "invalid number: " + v) tok = tk_integer exit sub end if dim as integer index = lookup(v) if index = -1 then tok = tk_ident else tok = symtab(index).tok end if exit sub end select
end sub
sub init_lex(byval filein as string)
install("if", tk_if) install("print", tk_print) install("putc", tk_putc) install("while", tk_while)
open filein for input as #1
cur_line = "" line_num = 0 col_num = 0 next_char()
end sub
sub scanner()
dim err_line as integer dim err_col as integer dim tok as Token_type dim v as string dim tok_list(tk_eoi to tk_string) as string
tok_list(tk_EOI ) = "END_OF_INPUT" tok_list(tk_Mul ) = "OP_MULTIPLY" tok_list(tk_Div ) = "OP_DIVIDE" tok_list(tk_Add ) = "OP_ADD" tok_list(tk_Sub ) = "OP_SUBTRACT" tok_list(tk_Uminus ) = "OP_NEGATE" tok_list(tk_Lss ) = "OP_LESS" tok_list(tk_Leq ) = "OP_LESSEQUAL" tok_list(tk_Gtr ) = "OP_GREATER" tok_list(tk_Neq ) = "OP_NOTEQUAL" tok_list(tk_Assign ) = "OP_ASSIGN" tok_list(tk_And ) = "OP_AND" tok_list(tk_If ) = "KEYWORD_IF" tok_list(tk_While ) = "KEYWORD_WHILE" tok_list(tk_Print ) = "KEYWORD_PRINT" tok_list(tk_Putc ) = "KEYWORD_PUTC" tok_list(tk_Lparen ) = "LEFTPAREN" tok_list(tk_Rparen ) = "RIGHTPAREN" tok_list(tk_Lbrace ) = "LEFTBRACE" tok_list(tk_Rbrace ) = "RIGHTBRACE" tok_list(tk_Semi ) = "SEMICOLON" tok_list(tk_Comma ) = "COMMA" tok_list(tk_Ident ) = "IDENTIFIER" tok_list(tk_Integer) = "INTEGER" tok_list(tk_String ) = "STRING"
do gettok(err_line, err_col, tok, v) print using "##### ##### \ \"; err_line; err_col; tok_list(tok); if tok = tk_integer orelse tok = tk_ident orelse tok = tk_string then print " " + v; print loop until tok = tk_eoi
end sub
sub main()
if command(1) = "" then print "filename required" : system init_lex(command(1)) scanner()
end sub
main() system </lang>
Output from test case 3:
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 19 30 END_OF_INPUT
Perl
<lang perl>#!/usr/bin/env perl
use strict; use warnings; no warnings 'once';
- ----- Definition of the language to be lexed -----#
my @tokens = (
# Name | Format | Value # # --------------|----------------------|-------------# ['OP_MULTIPLY' , '*' , ], ['OP_DIVIDE' , '/' , ], ['OP_ADD' , '+' , ], ['OP_SUBTRACT' , '-' , ], ['OP_LESSEQUAL' , '<=' , ], ['OP_LESS' , '<' , ], ['OP_GREATER' , '>' , ], ['OP_NOTEQUAL' , '!=' , ], ['OP_ASSIGN' , '=' , ], ['OP_AND' , '&&' , ], ['KEYWORD_IF' , qr/if\b/ , ], ['KEYWORD_WHILE', qr/while\b/ , ], ['KEYWORD_PRINT', qr/print\b/ , ], ['KEYWORD_PUTC' , qr/putc\b/ , ],
['LEFTPAREN' , '(' , ], ['RIGHTPAREN' , ')' , ], ['LEFTBRACE' , '{' , ], ['RIGHTBRACE' , '}' , ], ['SEMICOLON' , ';' , ], ['COMMA' , ',' , ],
['IDENTIFIER' , qr/[_a-z][_a-z0-9]*/i, \&raw ], ['INTEGER' , qr/[0-9]+\b/ , \&raw ], ['INTEGER' , qr/'([^']*)(')?/ , \&char_val ], ['STRING' , qr/"([^"]*)(")?/ , \&string_raw], ['END_OF_INPUT' , qr/$/ , ],
);
my $comment = qr/\/\* .+? (?: \*\/ | $ (?{die "End-of-file in comment\n"}) )/xs; my $whitespace = qr/(?: \s | $comment)*/x; my $unrecognized = qr/\w+ | ./x;
- | Returns the value of a matched char literal, or dies if it is invalid
sub char_val {
my $str = string_val(); die "Multiple characters\n" if length $str > 1; die "No character\n" if length $str == 0; ord $str;
}
- | Returns the value of a matched string literal, or dies if it is invalid
sub string_val {
my ($str, $end) = ($1, $2); die "End-of-file\n" if not defined $end; die "End-of-line\n" if $str =~ /\n/; $str =~ s/\\(.)/ $1 eq 'n' ? "\n" : $1 eq '\\' ? $1 : $1 eq $end ? $1 : die "Unknown escape sequence \\$1\n" /rge;
}
- | Returns the source string of a matched literal
sub raw { $& }
- | Returns the source string of a matched string literal, or dies if invalid
sub string_raw {
string_val(); # Just for the error handling side-effects $&;
}
- ----- Lexer "engine" -----#
- Construct the scanner regex:
my $tokens =
join "|", map { my $format = $tokens[$_][1]; "\n".(ref $format ? $format : quotemeta $format)." (*MARK:$_) "; } 0..$#tokens;
my $regex = qr/
\G (?| $whitespace \K (?| $tokens ) | $whitespace? \K ($unrecognized) (*MARK:!) )
/x;
- Run the lexer:
my $input = do { local $/ = undef; <STDIN> }; my $pos = 0; my $linecol = linecol_accumulator();
while ($input =~ /$regex/g) {
# Get the line and column number my ($line, $col) = $linecol->(substr $input, $pos, $-[0] - $pos); $pos = $-[0]; # Get the token type that was identified by the scanner regex my $type = $main::REGMARK; die "Unrecognized token $1 at line $line, col $col\n" if $type eq '!'; my ($name, $evaluator) = @{$tokens[$type]}[0, 2]; # Get the token value my $value; if ($evaluator) { eval { $value = $evaluator->() }; if ($@) { chomp $@; die "$@ in $name at line $line, col $col\n" } } # Print the output line print "$line\t$col\t$name".($value ? "\t$value" : )."\n";
}
- | Returns a closure, which can be fed a string one piece at a time and gives
- | back the cumulative line and column number each time
sub linecol_accumulator {
my ($line, $col) = (1, 1); sub { my $str = shift; my @lines = split "\n", $str, -1; my ($l, $c) = @lines ? (@lines - 1, length $lines[-1]) : (0, 0); if ($l) { $line += $l; $col = 1 + $c } else { $col += $c } ($line, $col) }
}</lang>
- Output:
(for test-case 3)
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT
Perl 6
<lang perl6>grammar C_Language { rule TOP { ^ <all>+ %% \s* };
token all { | '*' { make 'OP_MULTIPLY' } | '/' { make 'OP_DIVIDE' } | '+' { make 'OP_ADD' } | '-' { make 'OP_SUBTRACT' } | '<=' { make 'OP_LESSEQUAL' } | '<' { make 'OP_LESS' } | '>' { make 'OP_GREATER' } | '!=' { make 'OP_NOTEQUAL' } | '=' { make 'OP_ASSIGN' } | '&&' { make 'OP_AND' } | '(' { make 'LEFT_PAREN' } | ')' { make 'RIGHT_PAREN' } | '{' { make 'LEFT_BRACE' } | '}' { make 'RIGHT_BRACE' } | ';' { make 'SEMICOLON' } | ',' { make 'COMMA' } | 'if' { make 'KEYWORD_IF' } | 'putc' { make 'KEYWORD_PUTC' } | 'while' { make 'KEYWORD_WHILE' } | 'print' { make 'KEYWORD_PRINT' } | '/*' .*? '*/' { make 'COMMENT' } | '"' .*? '"' { make 'STRING ' ~ $/.subst("\n", "\\n", :g) } | "'" .*? "'" { make 'INTEGER ' ~ $/.subst("\\n", "\n").substr(1, *-1).ord } | <.digit>+ { make 'INTEGER ' ~ $/ } | <.ident>+ { make 'IDENTIFER ' ~ $/ } } }
sub parse_it ( $c_code ) { my @pos = gather for $c_code.lines».chars.kv -> $line, $v { take [ $line + 1, $_ ] for 1 .. ($v+1); # v+1 for newline }
$/ = C_Language.parse( $c_code ) or warn "Cannot parse";
for $/<all>.list.grep(*.ast ne 'COMMENT') -> $m { say join "\t", @pos[$m.from].fmt('%3d'), $m.ast; }
if $c_code.substr( $/<all>[*-1].to ).trim-trailing -> $unparsed { say "Leftover: ", $unparsed.perl; } }
parse_it( slurp('input.txt') );</lang>
Python
Tested with Python 2.7 and 3.x <lang Python> from __future__ import print_function import sys
- following two must remain in the same order
tk_EOI, tk_Mul, tk_Div, tk_Add, tk_Sub, tk_Uminus, tk_Lss, tk_Leq, tk_Gtr, tk_Neq, \ tk_Assign, tk_And, tk_If, tk_While, tk_Print, tk_Putc, tk_Lparen, tk_Rparen, tk_Lbrace, \ tk_Rbrace, tk_Semi, tk_Comma, tk_Ident, tk_Integer, tk_String = range(25)
all_syms = ["END_OF_INPUT", "OP_MULTIPLY", "OP_DIVIDE", "OP_ADD", "OP_SUBTRACT", "OP_NEGATE", "OP_LESS", "OP_LESSEQUAL", "OP_GREATER", "OP_NOTEQUAL", "OP_ASSIGN", "OP_AND", "KEYWORD_IF", "KEYWORD_WHILE", "KEYWORD_PRINT", "KEYWORD_PUTC", "LEFTPAREN", "RIGHTPAREN", "LEFTBRACE", "RIGHTBRACE", "SEMICOLON", "COMMA", "IDENTIFIER", "INTEGER", "STRING"]
- single character only symbols
symbols = { '{': tk_Lbrace, '}': tk_Rbrace, '(': tk_Lparen, ')': tk_Rparen, '+': tk_Add, '-': tk_Sub, '*': tk_Mul, ';': tk_Semi, ',': tk_Comma, '>': tk_Gtr, '=': tk_Assign }
key_words = {'if': tk_If, 'print': tk_Print, 'putc': tk_Putc, 'while': tk_While}
the_ch = " " # dummy first char - but it must be a space the_col = 0 the_line = 1 input_file = None
- show error and exit
def error(line, col, msg): print(line, col, msg) exit(1)
- get the next character from the input
def next_ch(): global the_ch, the_col, the_line
the_ch = input_file.read(1) the_col += 1 if the_ch == '\n': the_line += 1 the_col = 0 return the_ch
- 'x' - character constants
def char_lit(err_line, err_col): n = ord(next_ch()) # skip opening quote if the_ch == '\: error(err_line, err_col, "empty character constant") elif the_ch == '\\': next_ch() if the_ch == 'n': n = 10 elif the_ch == '\\': n = ord('\\') else: error(err_line, err_col, "unknown escape sequence \\%c" % (the_ch)) if next_ch() != '\: error(err_line, err_col, "multi-character constant") next_ch() return tk_Integer, err_line, err_col, n
- process divide or comments
def div_or_cmt(err_line, err_col): if next_ch() != '*': return tk_Div, err_line, err_col
# comment found while True: if next_ch() == '*' and next_ch() == '/': next_ch() return gettok() elif len(the_ch) == 0: error(err_line, err_col, "EOF in comment")
- "string"
def string_lit(start, err_line, err_col): text = ""
while next_ch() != start: if len(the_ch) == 0: error(err_line, err_col, "EOF while scanning string literal") if the_ch == '\n': error(err_line, err_col, "EOL while scanning string literal") text += the_ch
next_ch() return tk_String, err_line, err_col, text
- handle identifiers and integers
def ident_or_int(err_line, err_col): is_number = True text = ""
while the_ch.isalnum() or the_ch == '_': text += the_ch if not the_ch.isdigit(): is_number = False next_ch()
if len(text) == 0: error(err_line, err_col, "ident_or_int: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
if text[0].isdigit(): if not is_number: error(err_line, err_col, "invalid number: %s" % (text)) n = int(text) return tk_Integer, err_line, err_col, n
if text in key_words: return key_words[text], err_line, err_col
return tk_Ident, err_line, err_col, text
- look ahead for '>=', etc.
def follow(expect, ifyes, ifno, err_line, err_col): if next_ch() == expect: next_ch() return ifyes, err_line, err_col
if ifno == tk_EOI: error(err_line, err_col, "follow: unrecognized character: (%d) '%c'" % (ord(the_ch), the_ch))
return ifno, err_line, err_col
- return the next token type
def gettok(): while the_ch.isspace(): next_ch()
err_line = the_line err_col = the_col
if len(the_ch) == 0: return tk_EOI, err_line, err_col elif the_ch == '/': return div_or_cmt(err_line, err_col) elif the_ch == '\: return char_lit(err_line, err_col) elif the_ch == '<': return follow('=', tk_Leq, tk_Lss, err_line, err_col) elif the_ch == '!': return follow('=', tk_Neq, tk_EOI, err_line, err_col) elif the_ch == '&': return follow('&', tk_And, tk_EOI, err_line, err_col) elif the_ch == '"': return string_lit(the_ch, err_line, err_col) elif the_ch in symbols: sym = symbols[the_ch] next_ch() return sym, err_line, err_col else: return ident_or_int(err_line, err_col)
- main driver
input_file = sys.stdin if len(sys.argv) > 1: try: input_file = open(sys.argv[1], "r", 4096) except IOError as e: error(0, 0, "Can't open %s" % sys.argv[1])
while True: t = gettok() tok = t[0] line = t[1] col = t[2]
print("%5d %5d %-14s" % (line, col, all_syms[tok]), end=)
if tok == tk_Integer: print(" %5d" % (t[3])) elif tok == tk_Ident: print(" %s" % (t[3])) elif tok == tk_String: print(' "%s"' % (t[3])) else: print("")
if tok == tk_EOI: break </lang>
Output from test case 3:
5 15 KEYWORD_PRINT 5 41 OP_SUBTRACT 6 15 KEYWORD_PUTC 6 41 OP_LESS 7 15 KEYWORD_IF 7 41 OP_GREATER 8 15 KEYWORD_WHILE 8 41 OP_LESSEQUAL 9 15 LEFTBRACE 9 41 OP_NOTEQUAL 10 15 RIGHTBRACE 10 41 OP_AND 11 15 LEFTPAREN 11 41 SEMICOLON 12 15 RIGHTPAREN 12 41 COMMA 13 15 OP_SUBTRACT 13 41 OP_ASSIGN 14 15 OP_MULTIPLY 14 41 INTEGER 42 15 15 OP_DIVIDE 15 41 STRING "String literal" 16 15 OP_ADD 16 41 IDENTIFIER variable_name 17 26 INTEGER 10 18 26 INTEGER 92 19 26 INTEGER 32 20 1 END_OF_INPUT