Compiler/lexical analyzer: Difference between revisions

Content added Content deleted
m (Move to draft status)
(improve formatting and wording of task description; make it a draft task until everything is clarified)
Line 1: Line 1:
{{draft task}}Lexical Analyzer
{{draft task}}
{{clarify task}}


Definition from [https://en.wikipedia.org/wiki/Lexical_analysis Wikipedia]:
<br>


: ''Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).''
Definition from [https://en.wikipedia.org/wiki/Lexical_analysis Wikipedia]


{{introheader|The Task}}
Lexical analysis is the process of converting a sequence of characters (such as in a
computer program or web page) into a sequence of tokens (strings with an identified
"meaning"). A program that performs lexical analysis may be called a lexer, tokenizer,
or scanner (though "scanner" is also used to refer to the first stage of a lexer).

;The Task


Create a lexical analyzer for the simple programming language specified below. The
Create a lexical analyzer for the simple programming language specified below. The
Line 17: Line 13:
if two versions of the solution are provided: One without the lexer module, and one with.
if two versions of the solution are provided: One without the lexer module, and one with.


;Specification
{{introheader|Input Specification}}


The simple programming language to be analyzed is more or less a subset of [[C]]. It supports the following tokens:
The various token types are denoted below.


;Operators
;Operators
Line 41: Line 37:
| > || greater than || Gtr
| > || greater than || Gtr
|-
|-
| != || not equal || Neq
| &#33;= || not equal || Neq
|-
|-
| = || assign || Assign
| = || assign || Assign
Line 97: Line 93:
|}
|}


Notes: For char literals, '\n' is supported as a new line
Notes: For char and string literals, <code>\n</code> is supported as a new line character. To represent a backslash, use <code>\\</code>. No other special sequences are supported.
character. To represent \, use: '\\'. \n may also be used in
Strings, to print a newline. No other special sequences are
supported.


;White space
;White space
Line 132: Line 125:
They should produce the same token stream, except for the line and column positions.
They should produce the same token stream, except for the line and column positions.


'''Comments''' /* ... */ (multi-line)
Comments enclosed in <code>/* ... */</code> are also treated as whitespace outside of strings.


;Complete list of token names
;Complete list of token names


<pre>
'''EOI, Print, Putc, If, While, Lbrace, Rbrace, Lparen, Rparen, Uminus, Mul, Div, Add, Sub, Lss, Gtr, Leq, Neq, And, Semi, Comma, Assign, Integerk, Stringk, Ident'''
EOI Print Putc If While Lbrace Rbrace
Lparen Rparen Uminus Mul Div Add Sub
Lss Gtr Leq Neq And Semi Comma
Assign Integer String Ident
</pre>


{{introheader|Output Format}}
;Program output


The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
Output of the program should be:


# <code>line</code>
* the word '''line''', followed by:
* the line number where the token starts, followed by:
# the line number where the token starts
# <code>col</code>
* the abbreviation '''col''', followed by:
* the column number where the token starts, followed by:
# the column number where the token starts
* the '''token name'''.
# the token name
* If the token name is one of Integer, Ident or String, the actual value of the same should follow.
# the token value, in case of ''Integer'', ''String'', or ''Ident''


{{introheader|Diagnostics}}
;Test Cases


The following error conditions should be caught:
'''Test Case 1'''

{| class="wikitable"
|-
! Error
! Example
|-
| Empty character constant
| <code>&apos;&apos;</code>
|-
| Unknown escape sequence.
| <code>\r</code>
|-
| Multi-character constant.
| <code>xx</code>
|-
| End-of-file in comment. Closing comment characters not found.
|-
| End-of-file while scanning string literal. Closing string character not found.
|-
| End-of-line while scanning string literal. Closing string character not found before end-of-line.
|-
| Unrecognized character.
| <code>&#124;</code>
|}

{{introheader|Test Cases}}

{| class="wikitable"
|-
! Input
! Output
|-
| style="vertical-align:top" |
<lang c>
<lang c>
/*
/*
Line 159: Line 190:
</lang>
</lang>


| style="vertical-align:top" |
;Output
<b><pre>

<b>
<pre>
line 4 col 1 Print
line 4 col 1 Print
line 4 col 6 Lparen
line 4 col 6 Lparen
Line 169: Line 198:
line 4 col 25 Semi
line 4 col 25 Semi
line 5 col 1 EOI
line 5 col 1 EOI
</pre>
</pre></b>
</b>


|-
'''Test Case 2'''
| style="vertical-align:top" |
<lang c>
<lang c>
/*
/*
Line 181: Line 210:
</lang>
</lang>


| style="vertical-align:top" |
;Output
<b><pre>

<b>
<pre>
line 4 col 1 Ident phoenix_number
line 4 col 1 Ident phoenix_number
line 4 col 16 Assign
line 4 col 16 Assign
Line 197: Line 224:
line 5 col 28 Semi
line 5 col 28 Semi
line 6 col 1 EOI
line 6 col 1 EOI
</pre>
</pre></b>
</b>


|-
'''Test Case 3'''
| style="vertical-align:top" |
<lang c>
<lang c>
/*
/*
Line 222: Line 249:
</lang>
</lang>


| style="vertical-align:top" |
;Output
<b><pre>

<b>
<pre>
line 5 col 15 Print
line 5 col 15 Print
line 5 col 41 Sub
line 5 col 41 Sub
Line 252: Line 277:
line 17 col 26 Integer 10
line 17 col 26 Integer 10
line 18 col 26 Integer 32
line 18 col 26 Integer 32
line 19 col 1 EOI</pre>
line 19 col 1 EOI
</b>
</pre></b>
|}


{{introheader|Reference}}
;Diagnostics
The following error conditions should be caught:

* Empty character constant. Example: &apos;&apos;
* Unknown escape sequence. Example: '\r'
* Multi-character constant. Example: 'xx'
* End-of-file in comment. Closing comment characters not found.
* End-of-file while scanning string literal. Closing string character not found.
* End-of-line while scanning string literal. Closing string character not found before end-of-line.
* Unrecognized character. Example: |

;Reference


The Flex, C, Python and Euphoria versions can be considered reference implementations.
The Flex, C, Python and Euphoria versions can be considered reference implementations.


<hr>
;Implementations

__TOC__
__TOC__