User:Ed Davis

Description of the task

Lexical Analyzer

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).

The Task

Create a lexical analyzer for the Tiny programming language. The program should read input from a file and/or stdin, and write output to a file and/or stdout.

Specification

The various token types are denoted below.

Operators

Characters	Common name	Name
'*'	multiply	Mul
'/'	divide	Div
'+'	plus	Add
'-'	minus and unary minus	Sub and Uminus
'<'	less than	Lss
'<='	less than or equal	Leq
'>'	greater than	Gtr
'!='	not equal	Neq
'='	assign	Assign
'&&'	and	And
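
The operator table contains pairs such as '<' and '<=' that share a first character, so a lexer has to prefer the longer match. A minimal sketch in Python, assuming a hypothetical helper `match_operator` and a lookup table `OPERATORS` (neither name comes from the task). It always reports '-' as Sub, leaves the Sub/Uminus distinction to whatever consumes the tokens, and expects the caller to check for a comment opener before treating '/' as Div:

```python
# Hypothetical table: operator text -> token name, copied from the table above.
OPERATORS = {
    "*": "Mul", "/": "Div", "+": "Add", "-": "Sub",
    "<": "Lss", "<=": "Leq", ">": "Gtr", "!=": "Neq",
    "=": "Assign", "&&": "And",
}

def match_operator(text, pos):
    """Return (token_name, length) for the operator at text[pos], or None."""
    # Try the two-character candidate first so "<=" is not read as "<" then "=".
    for length in (2, 1):
        candidate = text[pos:pos + length]
        if len(candidate) == length and candidate in OPERATORS:
            return OPERATORS[candidate], length
    return None
```

A lone '!' or '&' is not in the table, so the sketch returns None for it and the caller can report an unrecognized character.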

Symbols

Characters	Common name	Name
'('	left parenthesis	Lparen
')'	right parenthesis	Rparen
'{'	left brace	Lbrace
'}'	right brace	Rbrace
';'	semicolon	Semi
','	comma	Comma

Keywords

Characters	Name
"if"	If
"while"	While
"print"	Print
"putc"	Putc
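
One common way to handle keywords is to scan a whole word first and then look it up, so that a word such as printer is not mistaken for print followed by er. A small Python sketch, where `KEYWORDS` and `classify_word` are illustrative names and not part of the task:

```python
# Hypothetical table: keyword text -> token name, copied from the table above.
KEYWORDS = {"if": "If", "while": "While", "print": "Print", "putc": "Putc"}

def classify_word(word):
    """Return (token_name, value) for an already-scanned word."""
    name = KEYWORDS.get(word)
    if name is not None:
        return (name, None)      # keywords carry no extra value
    return ("Ident", word)       # anything else is an identifier
```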

Other entities

Characters	Regular expression	Name
integers	[0-9]+	Integer
char literal	'x'	Integer
identifiers	[_a-zA-Z][_a-zA-Z0-9]*	Ident
string literal	".*"	String
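
The integer and identifier patterns can be tried directly at the current position. The following Python sketch uses the standard `re` module; `match_word` is a hypothetical helper name, and the identifier pattern assumes that single-character names such as x are valid identifiers:

```python
import re

INTEGER_RE = re.compile(r"[0-9]+")
# Assumption: one-character identifiers are allowed, so the tail may be empty.
IDENT_RE = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

def match_word(text, pos):
    """Return (token_name, lexeme, next_pos) for an integer or identifier, or None."""
    m = INTEGER_RE.match(text, pos)
    if m:
        return "Integer", m.group(), m.end()
    m = IDENT_RE.match(text, pos)
    if m:
        return "Ident", m.group(), m.end()
    return None
```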

Notes: In char literals, '\n' denotes the newline character and '\\' denotes a backslash. \n may also be used in strings to print a newline. No other escape sequences are supported.
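
Since a char literal produces an Integer token, the lexer has to turn the text between the quotes into a number. The Python sketch below assumes that number is the character's code (so 'a' gives 97), which the task text does not state explicitly; `char_literal_value` is a hypothetical name, and the error messages are only illustrative:

```python
def char_literal_value(lexeme):
    """Return the integer value of a char literal, quotes included in lexeme."""
    body = lexeme[1:-1]              # strip the surrounding single quotes
    if body == "":
        raise ValueError("empty character constant")
    if body == "\\n":                # backslash followed by n
        return 10                    # newline
    if body == "\\\\":               # two backslashes
        return 92                    # a single backslash
    if body.startswith("\\"):
        raise ValueError("unknown escape sequence " + body)
    if len(body) != 1:
        raise ValueError("multi-character constant")
    return ord(body)
```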

Comments: /* ... */ (may span multiple lines)
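
Skipping a comment amounts to searching for the closing characters and failing if they never appear. A minimal Python sketch, with `skip_comment` as a hypothetical helper that is called only when the two characters at `pos` are the comment opener:

```python
def skip_comment(text, pos):
    """Return the index just past the comment that starts at text[pos]."""
    # Search after the opener so that "/*/" is not treated as a closed comment.
    end = text.find("*/", pos + 2)
    if end == -1:
        raise SyntaxError("end-of-file in comment")
    return end + 2
```

The skipped text may contain newlines, so a real lexer still has to account for them when it reports line numbers.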

Complete list of token names

EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add, Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integer, String, Ident

Program output

For each token, the program should output the line and column where the token starts, followed by the token name. For Integer, Ident and String tokens, the integer value, identifier name, or string literal should follow.
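
As a rough illustration of this output format, the Python sketch below derives a 1-based line and column from a character offset and builds one output line. The names `line_col` and `format_token` are hypothetical, and the tab-separated layout is an assumption based on the sample output on this page:

```python
def line_col(text, pos):
    """Return the 1-based (line, column) of offset pos in text."""
    line = text.count("\n", 0, pos) + 1
    line_start = text.rfind("\n", 0, pos) + 1   # 0 when no newline precedes pos
    return line, pos - line_start + 1

def format_token(line, col, name, value=None):
    """Build one output line; value is given only for Integer, Ident and String."""
    fields = ["line", str(line), "col", str(col), name]
    if value is not None:
        fields.append(str(value))
    return "\t".join(fields)
```

Recomputing the position from the offset on every token is simple but rescans the text each time; a lexer that reads character by character would more likely keep running line and column counters.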

Test Cases

<lang c>/*
  Hello world
 */
print("Hello, World!\n");</lang>

Output

line	4	col	1	Print
line	4	col	6	Lparen
line	4	col	7	String	"Hello, World!\n"
line	4	col	24	Rparen
line	4	col	25	Semi
line	5	col	1	EOI

<lang c>/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");</lang>

Output

line	4	col	1	Ident	phoenix_number
line	4	col	16	Assign
line	4	col	18	Integer	142857
line	4	col	24	Semi
line	5	col	1	Print
line	5	col	6	Lparen
line	5	col	7	Ident	phoenix_number
line	5	col	21	Comma
line	5	col	23	String	"\n"
line	5	col	27	Rparen
line	5	col	28	Semi
line	6	col	1	EOI

Diagnostics

The following error conditions should be caught:

Empty character constant. Example: ''
Unknown escape sequence. Example: '\r'
Multi-character constant. Example: 'xx'
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character. Example: |
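
The two string-literal conditions differ only in what ends the scan first. A minimal Python sketch of how they might be told apart; `scan_string` is a hypothetical helper, and the wording of the messages is illustrative, not prescribed by the task:

```python
def scan_string(text, pos):
    """text[pos] is the opening quote; return (lexeme, next_pos)."""
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Keep the quotes in the lexeme, as in the sample output.
            return text[pos:i + 1], i + 1
        if ch == "\n":
            raise SyntaxError("end-of-line while scanning string literal")
        i += 1
    raise SyntaxError("end-of-file while scanning string literal")
```

The two-character sequence \n inside a string is an ordinary backslash followed by n at this stage, so it does not trigger the end-of-line error; only a real line break does.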

Refer additional questions to the C and Python implementations.

C

Python