User:Ed Davis
You are encouraged to solve this task according to the task description, using any language you may know.
Description of the task
Lexical Analyzer
From Wikipedia
Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
The Task
Create a lexical analyzer for the Tiny programming language. The program should read input from a file and/or stdin, and write output to a file and/or stdout.
Specification
The various token types are denoted below.
Operators
Characters | Common name | Name |
---|---|---|
'*' | multiply | Mul |
'/' | divide | Div |
'+' | plus | Add |
'-' | minus and unary minus | Sub and Uminus |
'<' | less than | Lss |
'<=' | less than or equal | Leq |
'>' | greater than | Gtr |
'!=' | not equal | Neq |
'=' | assign | Assign |
'&&' | and | And |
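Several operators share a first character ('<' and '<=', for example), so a lexer typically tries the two-character forms before the one-character ones. Below is a minimal sketch in Python. The helper name `scan_operator` and the two lookup tables are hypothetical and not part of the task. As simplifying assumptions, the sketch always reports '-' as Sub (telling Sub from Uminus needs context it does not have) and does not check whether '/' begins a comment.

```python
# Hypothetical lookup tables built from the operator table above.
TWO_CHAR = {"<=": "Leq", "!=": "Neq", "&&": "And"}
ONE_CHAR = {"*": "Mul", "/": "Div", "+": "Add", "-": "Sub",
            "<": "Lss", ">": "Gtr", "=": "Assign"}

def scan_operator(text, pos):
    """Return (token_name, length) for the operator at text[pos], or None."""
    pair = text[pos:pos + 2]
    if pair in TWO_CHAR:          # longest match first: '<=' before '<'
        return TWO_CHAR[pair], 2
    ch = text[pos:pos + 1]
    if ch in ONE_CHAR:
        return ONE_CHAR[ch], 1
    return None                   # e.g. a lone '!' or '&' is not an operator
```

A lone '!' or '&' falls through to `None`, which a caller could report as an unrecognized character.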
Symbols
Characters | Common name | Name |
---|---|---|
'(' | left parenthesis | Lparen |
')' | right parenthesis | Rparen |
'{' | left brace | Lbrace |
'}' | right brace | Rbrace |
';' | semicolon | Semi |
',' | comma | Comma |
Keywords
Characters | Name |
---|---|
"if" | If |
"while" | While |
"print" | |
"putc" | Putc |
Other entities
Characters | Regular expression | Name |
---|---|---|
integers | [0-9]+ | Integer |
char literal | 'x' | Integer |
identifiers | [_a-zA-Z][_a-zA-Z0-9]* | Ident |
string literal | "[^"\n]*" | String |
Notes: For char literals, '\n' is supported as a new line character. To represent \, use: '\\'. \n may also be used in Strings, to print a newline. No other special sequences are supported.
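The escape rules for char literals can be sketched as follows. The helper `char_literal_value` is hypothetical. It receives the text between the two single quotes and, as an assumption, returns the character's code as the Integer value. The error messages are illustrative only.

```python
def char_literal_value(body):
    """Map the text between the quotes of a char literal to an integer."""
    if body == "":
        raise ValueError("empty character constant")
    if body == "\\n":             # the two characters backslash, n
        return 10                 # newline
    if body == "\\\\":            # two backslashes
        return 92                 # a single backslash
    if body.startswith("\\"):
        raise ValueError("unknown escape sequence " + body)
    if len(body) > 1:
        raise ValueError("multi-character constant")
    return ord(body)
```

For example, `char_literal_value("a")` gives 97, while the body of '\r' raises the unknown-escape error.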
Comments /* ... */ (multi-line)
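Because a comment may span several lines, the lexer has to keep its line and column counters up to date while skipping it. A minimal sketch, assuming a hypothetical helper `skip_comment` that works on the whole input as one string and uses 1-based line and column numbers:

```python
def skip_comment(text, pos, line, col):
    """text[pos:pos+2] must be '/*'. Return (pos, line, col) just past '*/'."""
    pos += 2                      # step over the opening '/*'
    col += 2
    while pos < len(text):
        if text.startswith("*/", pos):
            return pos + 2, line, col + 2
        if text[pos] == "\n":     # a newline inside the comment starts a new line
            line += 1
            col = 1
        else:
            col += 1
        pos += 1
    raise SyntaxError("end-of-file in comment")
```

Running off the end of the input without finding the closing characters raises an error instead of returning.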
Complete list of token names
EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add, Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integer, String, Ident
Program output
The program should output the line and column where each token starts, followed by the token name. For Integer, Ident, and String tokens, the integer value, identifier, or string should follow.
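One way to produce such a record is sketched below. The helper `format_token` is hypothetical, and the exact layout (fields separated by single spaces) is an assumption, since the task only fixes the order of the fields.

```python
def format_token(line, col, name, value=None):
    """Build one output record: position first, then token name, then optional value."""
    fields = ["line", str(line), "col", str(col), name]
    if value is not None:         # only Integer, Ident and String carry a value
        fields.append(str(value))
    return " ".join(fields)
```

For example, `format_token(5, 1, "EOI")` returns the text `line 5 col 1 EOI`.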
Test Cases
<lang c>/*
  Hello world
 */
print("Hello, World!\n");
</lang>
Output
line | 4 | col | 1 | Print | |
line | 4 | col | 6 | Lparen | |
line | 4 | col | 7 | String | "Hello, World!\n" |
line | 4 | col | 24 | Rparen | |
line | 4 | col | 25 | Semi | |
line | 5 | col | 1 | EOI |
<lang c>/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");
</lang>
Output
line | 4 | col | 1 | Ident | phoenix_number |
line | 4 | col | 16 | Assign | |
line | 4 | col | 18 | Integer | 142857 |
line | 4 | col | 24 | Semi | |
line | 5 | col | 1 | Print | |
line | 5 | col | 6 | Lparen | |
line | 5 | col | 7 | Ident | phoenix_number |
line | 5 | col | 21 | Comma | |
line | 5 | col | 23 | String | "\n" |
line | 5 | col | 27 | Rparen | |
line | 5 | col | 28 | Semi | |
line | 6 | col | 1 | EOI |
Diagnostics
The following error conditions should be caught:
- Empty character constant. Example: ''
- Unknown escape sequence. Example: '\r'
- Multi-character constant. Example: 'xx'
- End-of-file in comment. Closing comment characters not found.
- End-of-file while scanning string literal. Closing string character not found.
- End-of-line while scanning string literal. Closing string character not found before end-of-line.
- Unrecognized character. Example: |
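The two string-literal diagnostics above can be sketched as follows. The helper `scan_string` is hypothetical and the error messages are illustrative. Since no escape for the double quote is listed among the supported sequences, the sketch assumes the first '"' after the opening one always ends the literal.

```python
def scan_string(text, pos):
    """text[pos] must be '"'. Return (lexeme, next_pos), quotes included."""
    end = pos + 1
    while True:
        if end >= len(text):
            raise SyntaxError("end-of-file while scanning string literal")
        ch = text[end]
        if ch == "\n":            # a string may not continue onto the next line
            raise SyntaxError("end-of-line while scanning string literal")
        if ch == '"':
            return text[pos:end + 1], end + 1
        end += 1
```

The returned lexeme keeps the two characters of \n as written in the source, which matches how String values appear in the test case output.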
Refer additional questions to the C and Python implementations.