Compiler/lexical analyzer: Difference between revisions

For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:

* <syntaxhighlight lang="c">if ( p /* meaning n is prime */ ) {
print ( n , " " ) ;
count = count + 1 ; /* number of primes found so far */
}</syntaxhighlight>
* <syntaxhighlight lang="c">if(p){print(n," ");count=count+1;}</syntaxhighlight>
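This equivalence can be checked mechanically. The following is a minimal, hypothetical Python sketch (not one of the task's solutions; the token classes are abbreviated for illustration): it skips whitespace and comments, classifies everything else, and confirms that both fragments above yield the same token sequence.

```python
import re

# Skipped forms (whitespace, /* ... */ comments) have no named group;
# everything else is captured as a (kind, text) token.
TOKEN = re.compile(
    r'\s+|/\*.*?\*/'            # skipped: whitespace and comments
    r'|(?P<id>[A-Za-z_]\w*)'    # identifiers and keywords
    r'|(?P<int>\d+)'            # integer literals
    r'|(?P<str>"[^"]*")'        # string literals
    r'|(?P<op>.)',              # any other single character
    re.S)

def tokens(src):
    """Return (kind, text) pairs, ignoring whitespace and comments."""
    return [(m.lastgroup, m.group())
            for m in TOKEN.finditer(src) if m.lastgroup]

a = '''if ( p /* meaning n is prime */ ) {
    print ( n , " " ) ;
    count = count + 1 ; /* number of primes found so far */
}'''
b = 'if(p){print(n," ");count=count+1;}'
assert tokens(a) == tokens(b)  # same stream, only positions differ
```

Only the token kinds and spellings are compared; as the task text says, the line and column positions are where the two fragments differ.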


;Complete list of token names
| style="vertical-align:top" |
Test Case 1:
<syntaxhighlight lang="c">/*
Hello world
*/
print("Hello, World!\n");</syntaxhighlight>


| style="vertical-align:top" |
| style="vertical-align:top" |
Test Case 2:
<syntaxhighlight lang="c">/*
Show Ident and Integers
*/
phoenix_number = 142857;
print(phoenix_number, "\n");</syntaxhighlight>


| style="vertical-align:top" |
| style="vertical-align:top" |
Test Case 3:
<syntaxhighlight lang="c">/*
All lexical tokens - not syntactically correct, but that will
have to wait until syntax analysis
/* character literal */ '\n'
/* character literal */ '\\'
/* character literal */ ' '</syntaxhighlight>


| style="vertical-align:top" |
| style="vertical-align:top" |
Test Case 4:
<syntaxhighlight lang="c">/*** test printing, embedded \n and comments with lots of '*' ***/
print(42);
print("\nHello World\nGood Bye\nok\n");
print("Print a slash n - \\n.\n");</syntaxhighlight>


| style="vertical-align:top" |


=={{header|Ada}}==
<syntaxhighlight lang="ada">with Ada.Text_IO, Ada.Streams.Stream_IO, Ada.Strings.Unbounded, Ada.Command_Line,
Ada.Exceptions;
use Ada.Strings, Ada.Strings.Unbounded, Ada.Streams, Ada.Exceptions;
when error : others => IO.Put_Line("Error: " & Exception_Message(error));
end Main;
</syntaxhighlight>
{{out}} Test case 3:
<pre>


As an addition, it emits a diagnostic if integer literals are too big.
<syntaxhighlight lang="algol68">BEGIN
# implement C-like getchar, where EOF and EOLn are "characters" (-1 and 10 resp.). #
INT eof = -1, eoln = 10;
OD;
output("End_Of_Input")
END</syntaxhighlight>


=={{header|ALGOL W}}==
<syntaxhighlight lang="algolw">begin
%lexical analyser %
% Algol W strings are limited to 256 characters in length so we limit source lines %
while nextToken not = tEnd_of_input do writeToken;
writeToken
end.</syntaxhighlight>
{{out}} Test case 3:
<pre>
(One point of note: the C "EOF" pseudo-character is detected in the following code by looking for a negative number. That EOF has to be negative and the other characters non-negative is implied by the ISO C standard.)
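That convention is easy to model outside C as well. The following Python sketch (an illustration added here, not part of the ATS entry) mimics getchar: real character codes are non-negative, so a negative value such as -1 is safely out of band for end of input.

```python
import io

EOF = -1  # out-of-band sentinel, as in C's <stdio.h>

def getchar(stream):
    """Return the next character code, or EOF (-1) at end of input."""
    ch = stream.read(1)
    return EOF if ch == '' else ord(ch)

s = io.StringIO("ab")
codes = [getchar(s), getchar(s), getchar(s)]
assert codes == [97, 98, EOF]
# Detecting end of input by sign, as described above:
assert all(c >= 0 for c in codes[:2]) and codes[2] < 0
```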


<syntaxhighlight lang="ats">(********************************************************************)
(* Usage: lex [INPUTFILE [OUTPUTFILE]]
If INPUTFILE or OUTPUTFILE is "-" or missing, then standard input
end

(********************************************************************)</syntaxhighlight>


{{out}}
=={{header|AWK}}==
Tested with gawk 4.1.1 and mawk 1.3.4.
<syntaxhighlight lang="awk">
BEGIN {
all_syms["tk_EOI" ] = "End_of_input"
}
}
</syntaxhighlight>
{{out|case=count}}
<b>
=={{header|C}}==
Tested with gcc 4.8.1 and later; compiles warning-free with -Wpedantic -pedantic -Wall -Wextra.
<syntaxhighlight lang="c">#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
run();
return 0;
}</syntaxhighlight>

{{out|case=test case 3}}
=={{header|C sharp|C#}}==
Requires C#6.0 because of the use of null coalescing operators.
<syntaxhighlight lang="csharp">
using System;
using System.IO;
}
}
</syntaxhighlight>

{{out|case=test case 3}}
=={{header|C++}}==
Tested with GCC 9.3.0 (g++ -std=c++17)
<syntaxhighlight lang="cpp">#include <charconv> // std::from_chars
#include <fstream> // file_to_string, string_to_file
#include <functional> // std::invoke
});
}
</syntaxhighlight>

{{out|case=test case 3}}
Using GnuCOBOL 2. By Steve Williams (with one change to get around a Rosetta Code code highlighter problem).

<syntaxhighlight lang="cobol"> >>SOURCE FORMAT IS FREE
*> this code is dedicated to the public domain
*> (GnuCOBOL) 2.3-dev.0
end-if
.
end program lexer.</syntaxhighlight>

{{out|case=test case 3}}
Lisp has a built-in reader and you can customize the reader by modifying its readtable. I'm also using Gray streams, an almost-standard feature of Common Lisp, for counting lines and columns.

<syntaxhighlight lang="lisp">(defpackage #:lexical-analyzer
(:use #:cl #:sb-gray)
(:export #:main))

(defun main ()
(lex *standard-input*))</syntaxhighlight>
{{out|case=test case 3}}
<pre> 5 16 KEYWORD-PRINT
{{trans|ATS}}

<syntaxhighlight lang="elixir">#!/bin/env elixir
# -*- elixir -*-

end ## module Lex

Lex.main(System.argv)</syntaxhighlight>

{{out}}




<syntaxhighlight lang="lisp">#!/usr/bin/emacs --script
;;
;; The Rosetta Code lexical analyzer in GNU Emacs Lisp.
(scan-text t))

(main)</syntaxhighlight>








<syntaxhighlight lang="erlang">#!/bin/env escript
%%%-------------------------------------------------------------------

%%% erlang-indent-level: 3
%%% end:
%%%-------------------------------------------------------------------</syntaxhighlight>




=={{header|Euphoria}}==
Tested with Euphoria 4.05.
<syntaxhighlight lang="euphoria">include std/io.e
include std/map.e
include std/types.e
end procedure

main(command_line())</syntaxhighlight>

{{out|case=test case 3}}
=={{header|Flex}}==
Tested with Flex 2.5.4.
<syntaxhighlight lang="c">%{
#include <stdio.h>
#include <stdlib.h>
} while (tok != tk_EOI);
return 0;
}</syntaxhighlight>

{{out|case=test case 3}}
=={{header|Forth}}==
Tested with Gforth 0.7.3.
<syntaxhighlight lang="forth">CREATE BUF 0 , \ single-character look-ahead buffer
CREATE COLUMN# 0 ,
CREATE LINE# 1 ,
THEN THEN ;
: TOKENIZE BEGIN CONSUME AGAIN ;
TOKENIZE</syntaxhighlight>

{{out}}


The author has placed this Fortran code in the public domain.
<syntaxhighlight lang="fortran">!!!
!!! An implementation of the Rosetta Code lexical analyzer task:
!!! https://rosettacode.org/wiki/Compiler/lexical_analyzer
end subroutine print_usage
end program lex</syntaxhighlight>

{{out}}
=={{header|FreeBASIC}}==
Tested with FreeBASIC 1.05
<syntaxhighlight lang="freebasic">enum Token_type
tk_EOI
tk_Mul
print : print "Hit any to end program"
sleep
system</syntaxhighlight>
{{out|case=test case 3}}
<b>
=={{header|Go}}==
{{trans|FreeBASIC}}
<syntaxhighlight lang="go">package main

import (
initLex()
process()
}</syntaxhighlight>

{{out}}
=={{header|Haskell}}==
Tested with GHC 8.0.2
<syntaxhighlight lang="haskell">import Control.Applicative hiding (many, some)
import Control.Monad.State.Lazy
import Control.Monad.Trans.Maybe (MaybeT, runMaybeT)
where (Just t, s') = runState (runMaybeT lexer) s
(txt, _, _) = s'
</syntaxhighlight>

{{out|case=test case 3}}
Global variables are avoided except for some constants that require initialization.

<syntaxhighlight lang="icon">#
# The Rosetta Code lexical analyzer in Icon with co-expressions. Based
# upon the ATS implementation.
procedure max(x, y)
return (if x < y then y else x)
end</syntaxhighlight>




Implementation:

<syntaxhighlight lang="j">symbols=:256#0
ch=: {{1 0+x[symbols=: x (a.i.y)} symbols}}
'T0 token' =: 0 ch '%+-!(){};,<>=!|&'
keep=. (tokens~:<,'''')*-.comments+.whitespace+.unknown*a:=values
keep&#each ((1+lines),.columns);<names,.values
}}</syntaxhighlight>


Test case 3:

<syntaxhighlight lang="j">
flex=: {{
'A B'=.y
21 28 Integer 92
22 27 Integer 32
23 1 End_of_input </syntaxhighlight>


Here, it seems expedient to retain a structured representation of the lexical result. As shown, it's straightforward to produce a "pure" textual result for a hypothetical alternative implementation of the syntax analyzer, but the structured representation will be easier to deal with.
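As a hedged illustration of that last point (the token names and layout below are invented for the sketch, not taken from the J code), rendering a structured token list into the flat textual form used by other entries is a single small step:

```python
# A structured token stream: (line, column, name, optional value).
tokens = [
    (4, 1, "Keyword_print", None),
    (4, 6, "LeftParen", None),
    (4, 7, "Integer", 42),
    (5, 1, "End_of_input", None),
]

def render(toks):
    """Flatten structured tokens into a textual line/column listing."""
    lines = []
    for line, col, name, value in toks:
        suffix = "" if value is None else " " + str(value)
        lines.append(f"{line:5} {col:7} {name}{suffix}")
    return "\n".join(lines)

out = render(tokens)
assert "Integer 42" in out
assert out.splitlines()[-1].strip().endswith("End_of_input")
```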


=={{header|Java}}==
<syntaxhighlight lang="java">
// Translated from python source

}
}
</syntaxhighlight>


=={{header|JavaScript}}==
{{incorrect|Javascript|Please show output. Code is identical to [[Compiler/syntax_analyzer]] task}}
<syntaxhighlight lang="javascript">
/*
Token: type, value, line, pos
l.printTokens()
})
</syntaxhighlight>


=={{header|Julia}}==
<syntaxhighlight lang="julia">struct Tokenized
startline::Int
startcol::Int
println(lpad(tok.startline, 3), lpad(tok.startcol, 5), lpad(tok.name, 18), " ", tok.value != nothing ? tok.value : "")
end
</syntaxhighlight>{{output}}<pre>
Line Col Name Value
5 16 Keyword_print
=={{header|kotlin}}==
{{trans|Java}}
<syntaxhighlight lang="kotlin">// Input: command line argument of file to process or console input. A two or
// three character console input of digits followed by a new line will be
// checked for an integer between zero and twenty-five to select a fixed test
System.exit(1)
} // try
} // main</syntaxhighlight>
{{out|case=test case 3: All Symbols}}
<b>


The first module is simply a table defining the names of tokens which don't have an associated value.
<syntaxhighlight lang="lua">-- module token_name (in a file "token_name.lua")
local token_name = {
['*'] = 'Op_multiply',
['putc'] = 'Keyword_putc',
}
return token_name</syntaxhighlight>


This module exports a function <i>find_token</i>, which attempts to find the next valid token from a specified position in a source line.
<syntaxhighlight lang="lua">-- module lpeg_token_finder
local M = {} -- only items added to M will be public (via 'return M' at end)
local table, concat = table, table.concat
end
return M</syntaxhighlight>
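The shape of such a <i>find_token</i> interface, i.e. match the token starting at a given position of a line and report where the next search should resume, can be sketched in Python (a loose analogue with an invented, much-reduced pattern set, not a translation of the Lua module):

```python
import re

# Ordered patterns: more specific classes first where prefixes overlap.
PATTERNS = [
    ("Integer", re.compile(r"\d+")),
    ("Identifier", re.compile(r"[A-Za-z_]\w*")),
    ("Op_assign", re.compile(r"=")),
    ("Semicolon", re.compile(r";")),
]

def find_token(line, pos):
    """Return (name, text, next_pos) for the token at pos, or None at end."""
    while pos < len(line) and line[pos].isspace():
        pos += 1
    if pos >= len(line):
        return None
    for name, pat in PATTERNS:
        m = pat.match(line, pos)
        if m:
            return name, m.group(), m.end()
    raise ValueError(f"invalid token at column {pos + 1}")

assert find_token("count = 1;", 0) == ("Identifier", "count", 5)
assert find_token("count = 1;", 5) == ("Op_assign", "=", 7)
```

Repeatedly calling it with the returned position walks the whole line, which is exactly how the lexer module below drives the finder.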


The <i>lexer</i> module uses <i>finder.find_token</i> to produce an iterator over the tokens in a source.
<syntaxhighlight lang="lua">-- module lexer
local M = {} -- only items added to M will be publicly available (via 'return M' at end)
local string, io, coroutine, yield = string, io, coroutine, coroutine.yield
-- M._INTERNALS = _ENV
return M
</syntaxhighlight>


This script uses <i>lexer.tokenize_text</i> to show the token sequence produced from a source text.

<syntaxhighlight lang="lua">lexer = require 'lexer'
format, gsub = string.format, string.gsub

-- etc.
end
</syntaxhighlight>


===Using only standard libraries===
This version replaces the <i>lpeg_token_finder</i> module of the LPeg version with this <i>basic_token_finder</i> module, altering the <i>require</i> expression near the top of the <i>lexer</i> module accordingly. Tested with Lua 5.3.5. (Note that <i>select</i> is a standard function as of Lua 5.2.)

<syntaxhighlight lang="lua">-- module basic_token_finder
local M = {} -- only items added to M will be public (via 'return M' at end)
local table, string = table, string

-- M._ENV = _ENV
return M</syntaxhighlight>


=={{header|M2000 Interpreter}}==
<syntaxhighlight lang="m2000 interpreter">
Module lexical_analyzer {
a$={/*
}
lexical_analyzer
</syntaxhighlight>

{{out}}




<syntaxhighlight lang="mercury">% -*- mercury -*-
%
% Compile with maybe something like:

:- func eof = int is det.
eof = -1.</syntaxhighlight>

{{out}}
Tested with Nim v0.19.4. Both examples are tested against all programs in [[Compiler/Sample programs]].
===Using string with regular expressions===
<syntaxhighlight lang="nim">
import re, strformat, strutils


echo input.tokenize.output
</syntaxhighlight>
===Using stream with lexer library===
<syntaxhighlight lang="nim">
import lexbase, streams
from strutils import Whitespace
echo &"({l.lineNumber},{l.getColNumber l.bufpos + 1}) {l.error}"
main()
</syntaxhighlight>


===Using nothing but system and strutils===
<syntaxhighlight lang="nim">import strutils

type
stdout.write('\n')
if token.kind == tokEnd:
break</syntaxhighlight>


=={{header|ObjectIcon}}==




<syntaxhighlight lang="objecticon"># -*- ObjectIcon -*-
#
# The Rosetta Code lexical analyzer in Object Icon. Based upon the ATS
write!([FileStream.stderr] ||| args)
exit(1)
end</syntaxhighlight>




(Much of the extra complication in the ATS comes from arrays being a linear type (whose "views" need tending), and from values of linear type having to be local to any function using them. This limitation could have been worked around, and arrays more similar to OCaml arrays could have been used, but at a cost in safety and efficiency.)

<syntaxhighlight lang="ocaml">(*------------------------------------------------------------------*)
(* The Rosetta Code lexical analyzer, in OCaml. Based on the ATS. *)


main ()

(*------------------------------------------------------------------*)</syntaxhighlight>

{{out}}
Note: for simplicity, we do not print the line and column position of each token.
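For comparison with the entries that do report positions, the omitted bookkeeping is small: the scanner keeps a current line and column and records them at the start of each token. A minimal illustrative Python sketch:

```python
def positions(src):
    """Record (char, line, column) for each non-space character."""
    line, col = 1, 1
    out = []
    for ch in src:
        if not ch.isspace():
            out.append((ch, line, col))
        if ch == "\n":
            line, col = line + 1, 1  # newline resets the column
        else:
            col += 1
    return out

assert positions("ab\n c") == [("a", 1, 1), ("b", 1, 2), ("c", 2, 2)]
```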


<syntaxhighlight lang="scheme">
(import (owl parse))

(if (null? (cdr stream))
(print 'End_of_input))))
</syntaxhighlight>


==== Testing ====

Testing function:
<syntaxhighlight lang="scheme">
(define (translate source)
(let ((stream (try-parse token-parser (str-iter source) #t)))
(if (null? (force (cdr stream)))
(print 'End_of_input))))
</syntaxhighlight>


====== Testcase 1 ======

<syntaxhighlight lang="scheme">
(translate "
/*
*/
print(\"Hello, World!\\\\n\");
")</syntaxhighlight>
{{Out}}
<pre>
====== Testcase 2 ======

<syntaxhighlight lang="scheme">
(translate "
/*
phoenix_number = 142857;
print(phoenix_number, \"\\\\n\");
")</syntaxhighlight>
{{Out}}
<pre>
====== Testcase 3 ======

<syntaxhighlight lang="scheme">
(translate "
/*
/* character literal */ '\\\\'
/* character literal */ ' '
")</syntaxhighlight>
{{Out}}
<pre>
====== Testcase 4 ======

<syntaxhighlight lang="scheme">
(translate "
/*** test printing, embedded \\\\n and comments with lots of '*' ***/
print(\"Print a slash n - \\\\\\\\n.\\\\n\");
")
</syntaxhighlight>
{{Out}}
<pre>
=={{header|Perl}}==

<syntaxhighlight lang="perl">#!/usr/bin/env perl

use strict;
($line, $col)
}
}</syntaxhighlight>

{{out|case=test case 3}}
===Alternate Perl Solution===
Tested on perl v5.26.1
<syntaxhighlight lang="perl">#!/usr/bin/perl

use strict; # lex.pl - source to tokens
1 + $` =~ tr/\n//, 1 + length $` =~ s/.*\n//sr, $^R;
}
printf "%5d %7d %s\n", 1 + tr/\n//, 1, 'End_of_input';</syntaxhighlight>


=={{header|Phix}}==
form. If required, demo\rosetta\Compiler\extra.e (below) contains some code that achieves the latter.
Code to print the human readable forms is likewise kept separate from any re-usable parts.
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #000080;font-style:italic;">--
-- demo\rosetta\Compiler\core.e
<span style="color: #008080;">return</span> <span style="color: #000000;">s</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
For running under pwa/p2js, we also have a "fake file/io" component:
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #000080;font-style:italic;">--
-- demo\rosetta\Compiler\js_io.e
<span style="color: #008080;">return</span> <span style="color: #000000;">EOF</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
The main lexer is also written to be reusable by later stages.
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #000080;font-style:italic;">--
-- demo\\rosetta\\Compiler\\lex.e
<span style="color: #008080;">return</span> <span style="color: #000000;">toks</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
Optional: use this if you need human-readable output/input at each (later) stage, so that you can use pipes.
<!--<syntaxhighlight lang="phix">-->
<span style="color: #000080;font-style:italic;">--
-- demo\rosetta\Compiler\extra.e
Line 14,936: Line 14,936:
<span style="color: #008080;">return</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">n_type</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">left</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">right</span><span style="color: #0000FF;">}</span>
<span style="color: #008080;">return</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">n_type</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">left</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">right</span><span style="color: #0000FF;">}</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
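The pipe-friendly idea is just a line-oriented serialization of the token stream: each stage writes one token per line and the next stage parses the lines back into tuples. A rough Python sketch (the exact field layout is an assumption, not taken from the Phix code):

```python
# One token per line: "line col name [value]" - a format simple enough that
# lexer, parser, and interpreter can run as separate processes joined by pipes.

def write_tokens(tokens):
    lines = []
    for ln, col, name, value in tokens:
        field = "" if value is None else " " + str(value)
        lines.append("%d %d %s%s" % (ln, col, name, field))
    return "\n".join(lines)

def read_tokens(text):
    tokens = []
    for row in text.splitlines():
        parts = row.split(None, 3)          # at most 4 fields; value may hold spaces
        ln, col, name = int(parts[0]), int(parts[1]), parts[2]
        value = parts[3] if len(parts) > 3 else None
        tokens.append((ln, col, name, value))
    return tokens
```

Round-tripping through `write_tokens`/`read_tokens` lets each stage be tested against saved text files as well as live pipes.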
Finally, a simple test driver for the specific task:
<!--<syntaxhighlight lang="phix">(phixonline)-->
<span style="color: #000080;font-style:italic;">--
-- demo\rosetta\Compiler\lex.exw
<span style="color: #000080;font-style:italic;">--main(command_line())</span>
<span style="color: #000000;">main</span><span style="color: #0000FF;">({</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"test4.c"</span><span style="color: #0000FF;">})</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
=={{header|Prolog}}==
<syntaxhighlight lang="prolog">/*
Test harness for the analyzer, not needed if we are actually using the output.
*/
% anything else is an error
tok(_,_,L,P) --> { format(atom(Error), 'Invalid token at line ~d,~d', [L,P]), throw(Error) }.</syntaxhighlight>
{{out}}
<pre>
=={{header|Python}}==
Tested with Python 2.7 and 3.x
<syntaxhighlight lang="python">from __future__ import print_function
import sys
if tok == tk_EOI:
break</syntaxhighlight>
{{out|case=test case 3}}
=={{header|QB64}}==
Tested with QB64 1.5
<syntaxhighlight lang="vb">dim shared source as string, the_ch as string, tok as string, toktyp as string
dim shared line_n as integer, col_n as integer, text_p as integer, err_line as integer, err_col as integer, errors as integer
end
end sub
</syntaxhighlight>
{{out|case=test case 3}}
<b>
=={{header|Racket}}==
<syntaxhighlight lang="racket">
#lang racket
(require parser-tools/lex)
"TEST 5"
(display-tokens (string->tokens test5))
</syntaxhighlight>
=={{header|Raku}}==
{{works with|Rakudo|2016.08}}
<syntaxhighlight lang="raku" line>grammar tiny_C {
rule TOP { ^ <.whitespace>? <tokens> + % <.whitespace> <.whitespace> <eoi> }
my $tokenizer = tiny_C.parse(@*ARGS[0].IO.slurp);
parse_it( $tokenizer );</syntaxhighlight>
{{out|case=test case 3}}
<syntaxhighlight lang="ratfor">######################################################################
#
# The Rosetta Code scanner in Ratfor 77.
end
######################################################################</syntaxhighlight>
The following code implements a configurable (from a symbol map and keyword map provided as parameters) lexical analyzer.
<syntaxhighlight lang="scala">
package xyz.hyperreal.rosettacodeCompiler
}
</syntaxhighlight>
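The same configurability can be sketched in a few lines of Python: the lexer is a plain function parameterized by a keyword map and a symbol map, so the token vocabulary lives in data rather than code, and sorting symbols longest-first gives the longest-match behaviour the task requires (e.g. "<=" before "<"). Names below are illustrative, not the Scala API:

```python
# Table-driven tokenizer: keywords and multi-character symbols are data.
# Sorting symbol keys longest-first implements longest-match scanning.

def make_lexer(keywords, symbols):
    ordered = sorted(symbols, key=len, reverse=True)

    def lex(src):
        tokens, i = [], 0
        while i < len(src):
            if src[i].isspace():
                i += 1
                continue
            if src[i].isalpha():
                j = i
                while j < len(src) and src[j].isalnum():
                    j += 1
                word = src[i:j]
                tokens.append(keywords.get(word, "Identifier"))
                i = j
                continue
            for sym in ordered:                 # longest symbol wins
                if src.startswith(sym, i):
                    tokens.append(symbols[sym])
                    i += len(sym)
                    break
            else:
                raise ValueError("invalid character %r" % src[i])
        return tokens

    return lex
```

For example, `make_lexer({"if": "Keyword_if"}, {"<": "Op_less", "<=": "Op_lessequal"})` returns a lexer for which `lex("a <= b")` yields `["Identifier", "Op_lessequal", "Identifier"]`.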
=={{header|Scheme}}==

<syntaxhighlight lang="scheme">
(import (scheme base)
(scheme char)
(display-tokens (lexer (cadr (command-line))))
(display "Error: provide program filename\n"))
</syntaxhighlight>
{{out}}
<syntaxhighlight lang="sml">(*------------------------------------------------------------------*)
(* The Rosetta Code lexical analyzer, in Standard ML. Based on the ATS
and the OCaml. The intended compiler is Mlton or Poly/ML; there is
(* sml-indent-args: 2 *)
(* end: *)
(*------------------------------------------------------------------*)</syntaxhighlight>
{{libheader|Wren-fmt}}
{{libheader|Wren-ioutil}}
<syntaxhighlight lang="ecmascript">import "/dynamic" for Enum, Struct, Tuple
import "/str" for Char
import "/fmt" for Fmt
lineCount = lines.count
initLex.call()
process.call()</syntaxhighlight>
{{out}}
=={{header|Zig}}==
<syntaxhighlight lang="zig">
const std = @import("std");
return result.items;
}
</syntaxhighlight>