User:Ed Davis
Revision as of 22:01, 9 August 2016
You are encouraged to solve this task according to the task description, using any language you may know.
Lexical Analyzer
From [Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis)
Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).
- The Task
Create a lexical analyzer for the Tiny programming language. The program should read input from a file and/or stdin, and write output to a file and/or stdout.
- Specification
The various token types are denoted below.
- Operators
Characters | Common name | Name |
---|---|---|
'*' | multiply | Mul |
'/' | divide | Div |
'+' | plus | Add |
'-' | minus and unary minus | Sub and Uminus |
'<' | less than | Lss |
'<=' | less than or equal | Leq |
'>' | greater than | Gtr |
'!=' | not equal | Neq |
'=' | assign | Assign |
'&&' | and | And |
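Most operators above are a single character, but '<', '!', '=', and '&' need one character of lookahead to separate Lss from Leq, Neq from an error, Assign from nothing longer, and And from a stray '&' ('/' similarly needs lookahead to tell Div from a comment opener, and Sub vs. Uminus is usually decided later, by context). A minimal sketch of that dispatch, with illustrative names not taken from the task:

```c
/* Illustrative token codes for this sketch only; names follow the table
 * above, with Op_error standing in for "not an operator". */
typedef enum { Lss, Leq, Gtr, Neq, Assign, And, Op_error } OpTok;

/* Classify an operator from the current character c and the lookahead
 * character next; *consumed receives how many characters it used. */
OpTok classify_op(char c, char next, int *consumed) {
    *consumed = 1;
    switch (c) {
    case '<':
        if (next == '=') { *consumed = 2; return Leq; }
        return Lss;
    case '>':
        return Gtr;
    case '!':
        if (next == '=') { *consumed = 2; return Neq; }
        return Op_error;        /* a bare '!' is not a Tiny operator */
    case '=':
        return Assign;
    case '&':
        if (next == '&') { *consumed = 2; return And; }
        return Op_error;        /* a bare '&' is not a Tiny operator */
    default:
        return Op_error;
    }
}
```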
- Symbols
Characters | Common name | Name |
---|---|---|
'(' | left parenthesis | Lparen |
')' | right parenthesis | Rparen |
'{' | left brace | Lbrace |
'}' | right brace | Rbrace |
';' | semi colon | Semi |
',' | comma | Comma |
- Keywords
Characters | Name |
---|---|
"if" | If |
"while" | While |
"print" | Print |
"putc" | Putc |
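Keywords look exactly like identifiers, so a common approach is to scan an identifier first and then check it against a small table. A sketch under that assumption (the helper name is made up for illustration):

```c
#include <string.h>

/* Map a scanned identifier to its keyword token name, or return NULL
 * if it is an ordinary Ident. The table mirrors the Keywords section. */
const char *keyword_name(const char *s) {
    static const char *kw[][2] = {
        { "if",    "If"    },
        { "while", "While" },
        { "print", "Print" },
        { "putc",  "Putc"  },
    };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i][0]) == 0)
            return kw[i][1];
    return NULL;                /* not a keyword: token is Ident */
}
```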
- Other entities
Characters | Regular expression | Name |
---|---|---|
integers | [0-9]+ | Integer |
char literal | 'x' | Integer |
identifiers | [_a-zA-Z][_a-zA-Z0-9]+ | Ident |
string literal | ".*" | String |
Notes: In char literals, '\n' is supported as a newline character; to represent a backslash, use '\\'. \n may also be used in strings, to print a newline. No other escape sequences are supported.
Comments /* ... */ (multi-line)
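The char-literal rules above (only '\n' and '\\' as escapes, exactly one character between the quotes) can be sketched as follows; the function name and the -1 error convention are assumptions for illustration, and a real lexer would report the specific diagnostics listed later:

```c
/* Sketch: scan a character literal. p points just past the opening
 * quote; on success the character's integer value is returned (the
 * token is Integer) and *end points past the closing quote. Returns
 * -1 on any of the error conditions the task names. */
int scan_char_literal(const char *p, const char **end) {
    int v;
    if (*p == '\'')
        return -1;              /* empty character constant: '' */
    if (*p == '\\') {
        p++;
        if (*p == 'n')       v = '\n';
        else if (*p == '\\') v = '\\';
        else return -1;         /* unknown escape sequence */
        p++;
    } else {
        v = (unsigned char)*p++;
    }
    if (*p != '\'')
        return -1;              /* multi-character constant, e.g. 'xx' */
    *end = p + 1;
    return v;
}
```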
- Complete list of token names
EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add, Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integer, String, Ident
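The list above translates directly into an enumeration; one possible encoding (the tk_ prefix is an assumption, not part of the task):

```c
/* One possible encoding of the 25 token names listed above,
 * in the order given. */
typedef enum {
    tk_EOI, tk_Print, tk_Putc, tk_If, tk_While,
    tk_Lbrace, tk_Rbrace, tk_Lparen, tk_Rparen,
    tk_Uminus, tk_Mul, tk_Div, tk_Add, tk_Sub,
    tk_Lss, tk_Gtr, tk_Leq, tk_Neq, tk_And,
    tk_Semi, tk_Comma, tk_Assign,
    tk_Integer, tk_String, tk_Ident
} TokenType;
```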
- Program output
Output of the program should be the line and column where the found token starts, followed by the token name. For the tokens Integer, Ident, and String, the integer, identifier, or string value should follow.
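A formatting helper for that output line might look like the following sketch; the plain space-separated layout is an assumption, since the task does not fix the exact alignment:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: format one output line as "line col name [value]", where
 * value is passed as NULL for tokens that carry none. */
int format_token(char *buf, size_t n, int line, int col,
                 const char *name, const char *value) {
    if (value != NULL)
        return snprintf(buf, n, "%d %d %s %s", line, col, name, value);
    return snprintf(buf, n, "%d %d %s", line, col, name);
}
```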
- Test Cases
<lang c>
/*
  Hello world
 */
print("Hello, World!\n");
</lang>
- Output
line | col | token | value
---|---|---|---
4 | 1 | Print |
4 | 6 | Lparen |
4 | 7 | String | "Hello, World!\n"
4 | 24 | Rparen |
4 | 25 | Semi |
5 | 1 | EOI |
<lang c>
/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");
</lang>
- Output
line | col | token | value
---|---|---|---
1 | 1 | Ident | phoenix_number
1 | 16 | Assign |
1 | 18 | Integer | 142857
1 | 24 | Semi |
2 | 1 | Print |
2 | 6 | Lparen |
2 | 7 | Ident | phoenix_number
2 | 21 | Comma |
2 | 23 | String | "\n"
2 | 27 | Rparen |
2 | 28 | Semi |
3 | 1 | EOI |
- Diagnostics
The following error conditions should be caught:
- Empty character constant. Example: ''
- Unknown escape sequence. Example: '\r'
- Multi-character constant. Example: 'xx'
- End-of-file in comment. Closing comment characters not found.
- End-of-file while scanning string literal. Closing string character not found.
- End-of-line while scanning string literal. Closing string character not found before end-of-line.
- Unrecognized character. Example: |
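Each diagnostic needs to carry the position at which it was detected. One way to sketch this is a helper that builds the message; the "(line, col) error: ..." format is an assumption, since the task does not prescribe one, and a real lexer would print it to stderr and stop scanning:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: build a diagnostic message carrying the error position.
 * The message texts would mirror the conditions listed above. */
int format_error(char *buf, size_t n, int line, int col, const char *msg) {
    return snprintf(buf, n, "(%d, %d) error: %s", line, col, msg);
}
```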
Refer additional questions to the C and Python implementations.