Compiler/lexical analyzer: Difference between revisions
=={{header|J}}==
Here, we first build a tokenizer state machine sufficient to recognize our mini-language. This tokenizer must not discard any characters, because we will be using cumulative character offsets to identify line numbers and column numbers.
Then, we refine this result: we generate those line and column numbers, discard whitespace and comments, and classify tokens based on their structure.
(Also, in this version, rather than building out a full state machine to recognize character literals, we treat character literals as a sequence of tokens which we must then refine. It might have been wiser to build character literals as single tokens.)
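The article's implementation is in J, but the offset-to-position idea above is language-independent. As a hypothetical sketch (not the article's code), here is how a token's cumulative character offset can be turned into line and column numbers, which is why the tokenizer must not discard any characters:

```python
# Hypothetical sketch: derive a token's line and column numbers from its
# cumulative character offset into the (unmodified) source text.
def line_and_column(text, offset):
    # line = 1 + number of newlines strictly before the offset
    line = 1 + text.count('\n', 0, offset)
    # column = distance from the most recent newline (1-based);
    # rfind returns -1 when there is no earlier newline, which
    # conveniently makes the first line's columns come out 1-based too
    last_nl = text.rfind('\n', 0, offset)
    return line, offset - last_nl

src = 'if x {\n  print 1;\n}'
print(line_and_column(src, src.index('print')))  # (2, 3)
```

Because the computation only needs the original text and an offset, whitespace and comments can be discarded *after* positions are assigned, exactly as the refinement step above describes.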
Implementation:
RightBrace Keyword_while Semicolon Keyword_print Comma Keyword_putc
}}-.LF)=: tkref=: tokenize '*/%+-<<=>>===!=!&&||=()if{else}while;print,putc'
NB. the reference tokens here were arranged to avoid whitespace tokens
NB. also, we reserve multiple token instances where a literal string
NB. appears in different syntactic productions. Here, we only use the initial
NB. instances -- the others will be used in the syntax analyzer which
NB. uses the same tkref and tknames.
shift=: |.!.0
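The ordering of the reference string matters: multi-character operators such as <code><=</code> and <code>==</code> appear before their one-character prefixes so that the longer form wins. As a hypothetical illustration of that longest-match principle in Python (names here are not from the J code):

```python
# Hypothetical sketch: greedy longest-match against a fixed operator table,
# mirroring the ordering trick in the J reference-token string, where
# two-character operators must be tried before their one-character prefixes.
OPS = ['*', '/', '%', '+', '-', '<', '<=', '>', '>=', '==', '!=',
       '!', '&&', '||', '=', '(', ')', '{', '}', ';', ',']
OPS.sort(key=len, reverse=True)  # longer operators take priority

def match_op(text, i):
    # return the longest operator starting at position i, or None
    for op in OPS:
        if text.startswith(op, i):
            return op
    return None

print(match_op('a<=b', 1))  # '<=' rather than '<'
```

A table-driven tokenizer like the one above only needs this one comparison loop per token start; the J version achieves the same effect by the order in which tokens are listed in <code>tkref</code>.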