User:Ed Davis
Revision as of 22:01, 9 August 2016
You are encouraged to solve this task according to the task description, using any language you may know.
Lexical Analyzer
From [Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis)
Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
- The Task
Create a lexical analyzer for the Tiny programming language. The program should read input from a file and/or stdin, and write output to a file and/or stdout.
- Specification
The various token types are denoted below.
- Operators
Characters | Common name | Name |
---|---|---|
'*' | multiply | Mul |
'/' | divide | Div |
'+' | plus | Add |
'-' | minus and unary minus | Sub and Uminus |
'<' | less than | Lss |
'<=' | less than or equal | Leq |
'>' | greater than | Gtr |
'!=' | not equal | Neq |
'=' | assign | Assign |
'&&' | and | And |
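Most operators above are a single character, but '<', '!', '=', and '&' need one character of lookahead to separate Lss from Leq, Neq from an error, Assign from nothing longer, and And from a stray '&' ('/' similarly needs lookahead to tell Div from a comment opener, and Sub vs. Uminus is usually decided later, by context). A minimal sketch of that dispatch, with illustrative names not taken from the task:

```c
/* Illustrative token codes for this sketch only; names follow the table
 * above, with Op_error standing in for "not an operator". */
typedef enum { Lss, Leq, Gtr, Neq, Assign, And, Op_error } OpTok;

/* Classify an operator from the current character c and the lookahead
 * character next; *consumed receives how many characters it used. */
OpTok classify_op(char c, char next, int *consumed) {
    *consumed = 1;
    switch (c) {
    case '<':
        if (next == '=') { *consumed = 2; return Leq; }
        return Lss;
    case '>':
        return Gtr;
    case '!':
        if (next == '=') { *consumed = 2; return Neq; }
        return Op_error;        /* a bare '!' is not a Tiny operator */
    case '=':
        return Assign;
    case '&':
        if (next == '&') { *consumed = 2; return And; }
        return Op_error;        /* a bare '&' is not a Tiny operator */
    default:
        return Op_error;
    }
}
```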
- Symbols
Characters | Common name | Name |
---|---|---|
'(' | left parenthesis | Lparen |
')' | right parenthesis | Rparen |
'{' | left brace | Lbrace |
'}' | right brace | Rbrace |
';' | semi colon | Semi |
',' | comma | Comma |
- Keywords
Characters | Name |
---|---|
"if" | If |
"while" | While |
"print" | Print |
"putc" | Putc |
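Keywords look exactly like identifiers, so a common approach is to scan an identifier first and then check it against a small table. A sketch under that assumption (the helper name is made up for illustration):

```c
#include <string.h>

/* Map a scanned identifier to its keyword token name, or return NULL
 * if it is an ordinary Ident. The table mirrors the Keywords section. */
const char *keyword_name(const char *s) {
    static const char *kw[][2] = {
        { "if",    "If"    },
        { "while", "While" },
        { "print", "Print" },
        { "putc",  "Putc"  },
    };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i][0]) == 0)
            return kw[i][1];
    return NULL;                /* not a keyword: token is Ident */
}
```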
- Other entities
Characters | Regular expression | Name |
---|---|---|
integers | [0-9]+ | Integer |
char literal | 'x' | Integer |
identifiers | [_a-zA-Z][_a-zA-Z0-9]+ | Ident |
string literal | ".*" | String |
Notes: In char literals, '\n' is supported as a newline character; to represent a backslash, use '\\'. \n may also be used in strings, to print a newline. No other escape sequences are supported.
Comments /* ... */ (multi-line)
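The char-literal rules above (only '\n' and '\\' as escapes, exactly one character between the quotes) can be sketched as follows; the function name and the -1 error convention are assumptions for illustration, and a real lexer would report the specific diagnostics listed later:

```c
/* Sketch: scan a character literal. p points just past the opening
 * quote; on success the character's integer value is returned (the
 * token is Integer) and *end points past the closing quote. Returns
 * -1 on any of the error conditions the task names. */
int scan_char_literal(const char *p, const char **end) {
    int v;
    if (*p == '\'')
        return -1;              /* empty character constant: '' */
    if (*p == '\\') {
        p++;
        if (*p == 'n')       v = '\n';
        else if (*p == '\\') v = '\\';
        else return -1;         /* unknown escape sequence */
        p++;
    } else {
        v = (unsigned char)*p++;
    }
    if (*p != '\'')
        return -1;              /* multi-character constant, e.g. 'xx' */
    *end = p + 1;
    return v;
}
```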
- Complete list of token names
EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add, Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integer, String, Ident
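The list above translates directly into an enumeration; one possible encoding (the tk_ prefix is an assumption, not part of the task):

```c
/* One possible encoding of the 25 token names listed above,
 * in the order given. */
typedef enum {
    tk_EOI, tk_Print, tk_Putc, tk_If, tk_While,
    tk_Lbrace, tk_Rbrace, tk_Lparen, tk_Rparen,
    tk_Uminus, tk_Mul, tk_Div, tk_Add, tk_Sub,
    tk_Lss, tk_Gtr, tk_Leq, tk_Neq, tk_And,
    tk_Semi, tk_Comma, tk_Assign,
    tk_Integer, tk_String, tk_Ident
} TokenType;
```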
- Program output
Output of the program should be the line and column where the found token starts, followed by the token name. For the tokens Integer, Ident, and String, the integer, identifier, or string value should follow.
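A formatting helper for that output line might look like the following sketch; the plain space-separated layout is an assumption, since the task does not fix the exact alignment:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: format one output line as "line col name [value]", where
 * value is passed as NULL for tokens that carry none. */
int format_token(char *buf, size_t n, int line, int col,
                 const char *name, const char *value) {
    if (value != NULL)
        return snprintf(buf, n, "%d %d %s %s", line, col, name, value);
    return snprintf(buf, n, "%d %d %s", line, col, name);
}
```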
- Test Cases
<lang c>
/*
  Hello world
 */
print("Hello, World!\n");
</lang>
- Output
line | col | token | value
---|---|---|---
4 | 1 | Print |
4 | 6 | Lparen |
4 | 7 | String | "Hello, World!\n"
4 | 24 | Rparen |
4 | 25 | Semi |
5 | 1 | EOI |
<lang c>
/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");
</lang>
- Output
line | col | token | value
---|---|---|---
1 | 1 | Ident | phoenix_number
1 | 16 | Assign |
1 | 18 | Integer | 142857
1 | 24 | Semi |
2 | 1 | Print |
2 | 6 | Lparen |
2 | 7 | Ident | phoenix_number
2 | 21 | Comma |
2 | 23 | String | "\n"
2 | 27 | Rparen |
2 | 28 | Semi |
3 | 1 | EOI |
- Diagnostics
The following error conditions should be caught:
- Empty character constant. Example: ''
- Unknown escape sequence. Example: '\r'
- Multi-character constant. Example: 'xx'
- End-of-file in comment. Closing comment characters not found.
- End-of-file while scanning string literal. Closing string character not found.
- End-of-line while scanning string literal. Closing string character not found before end-of-line.
- Unrecognized character. Example: |
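Each diagnostic needs to carry the position at which it was detected. One way to sketch this is a helper that builds the message; the "(line, col) error: ..." format is an assumption, since the task does not prescribe one, and a real lexer would print it to stderr and stop scanning:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: build a diagnostic message carrying the error position.
 * The message texts would mirror the conditions listed above. */
int format_error(char *buf, size_t n, int line, int col, const char *msg) {
    return snprintf(buf, n, "(%d, %d) error: %s", line, col, msg);
}
```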
Refer additional questions to the C and Python implementations.