Discuss issues related to the Syntax Highlighting system here. The old page got huge, and it became hard to discern what problems were current.

Recent changes

2010/01/06

Updated language files: c, c_mac, cpp, cpp-qt, clojure, erlang, lisp, java5, prolog
Parser Changes: Generally allowed .. before plain integers. Affects at least: delphi, modula3, pascal, perl and maybe some more.
- Just fixed the changed regexp as I forgot to escape two chars which made the number highlighting behave somewhat unexpectedly --BenBE 17:25, 7 January 2010 (UTC)

2009/12/25 - GeSHi update to GeSHi 1.0.8.6

Hi folks,

I'm helping out Short Circuit with the Server and the GeSHi installation that drives the syntax highlighting of this Wiki. I therefore updated the installation of GeSHi on the RC server to the latest official release - version 1.0.8.6.

I hope nothing went horribly wrong during the update. If you notice any problems, please notify me here on the wiki page or upstream by mail.

I'll try to work through the (highlighting-related) issues ASAP, though this might take some time.

--BenBE 01:11, 26 December 2009 (UTC)

It doesn't seem to have taken (yet?). Look at the listing in the #Language tags section below. It still says 1.0.8.3.--Mwn3d 01:37, 26 December 2009 (UTC)

Caching, most likely. I just touched the MediaWiki configuration file, so it should expire all the caches. Caches may last up to 24 hours. --Michael Mol 07:38, 26 December 2009 (UTC)

Relationship Between Rosetta Code and GeSHi

Rosetta Code first started using GeSHi for syntax highlighting a long while back, but due to our nature, we quickly discovered, and frequently continue to discover, programming languages for which GeSHi does not provide highlighting functionality. Additionally, we've uncovered bugs in various releases of the software, and the GeSHi folks have been welcoming of fixes sent to them.

Due to the way language support is managed in GeSHi, it's fairly trivial for someone who can follow PHP syntax to create a PHP file that adds support for the language of their choice. GeSHi's support for Oberon-2, Modula-3 and AutoHotKey is the direct result of contributions by MBishop and Tinku99. (If you like a language, and it doesn't appear to have syntax highlighting support, I strongly suggest you follow their lead. ;-) )

I am now also part of the GeSHi project, focusing for now on adding and improving support for programming languages present on Rosetta Code. As a community of language aficionados and enthusiasts, Rosetta Code is a hotbed for opportunities for improving GeSHi, and improvement of GeSHi is extremely helpful for improving the readability of code on Rosetta Code. --Short Circuit 06:43, 18 June 2009 (UTC)

So how can I add a language? I have made a GeSHi file for Vedit macro language. Where can I upload it? (I can not test it myself so I don't know if it is OK). I could do a file for RapidQ quite easily, too, since I have Vedit syntax file for it.

It would be nice if there was an example GeSHi file downloadable somewhere (or perhaps even a few of them). The only one I found was at the GeSHi web page. I had to cut and paste the source from the browser, and then do quite a lot of editing to get newlines etc. fixed. --PauliKL 15:26, 30 June 2009 (UTC)

Email them to me at mikemol@gmail.com, and I'll get them tested, staged on RC and committed to SVN as quickly as I can. --Short Circuit 17:06, 30 June 2009 (UTC)

Or upstream to BenBE at geshi (ddoott) org --BenBE 01:15, 26 December 2009 (UTC)

If anyone else would like to make a language file, look at this explanation of the language file format from the GeSHi site: http://qbnz.com/highlighter/geshi-doc.html#language-files. --Mwn3d 19:56, 8 September 2009 (UTC)

Language tags

Due to changes I want to implement in the way processing of the lang tag is done, we need to standardize codes for all of the languages currently on Rosetta Code ASAP. Subsequently, all pages need to be scoured for code snippets.

Here is a list of the languages that currently exist on Rosetta Code as I write this. Next to each language, place the code for that language. If the code is already established in GeSHi, make it bold. If no code is provided by GeSHi, and no code is currently in use for that language on the wiki, make one up. Once this list is fully populated, all the pages for each language need to be checked to ensure that the code snippets for that language are correct. (Not sure what to do about command-line one-liners or similar yet, though.)

Note: AutoHotKey currently has two codes, ahk and autohotkey. When AutoHotKey support was added to GeSHi, it was added with the longer code. (lang codes are derived from the filename.) Since there were already code snippets on RC that used ahk for AutoHotKey, I created a symlink that allowed ahk to be used as a language code as well. This should likely be reversed, meaning instances of the ahk tag need to be replaced with the autohotkey tag. --Short Circuit 07:01, 18 June 2009 (UTC)
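A minimal sketch of how the ahk tags could be rewritten to autohotkey in page source, in place of the symlink. The alias table, the function names, and the lowercasing of tag values are all assumptions made for illustration; this is not how the RC extension actually processes the lang tag.

```python
import re

# Hypothetical alias table: legacy tag -> canonical GeSHi code.
LANG_ALIASES = {"ahk": "autohotkey"}


def canonical_lang(tag: str) -> str:
    """Return the canonical language code for a lang tag value.

    Assumes codes are case-insensitive and stored in lowercase.
    """
    key = tag.strip().lower()
    return LANG_ALIASES.get(key, key)


def rewrite_lang_tags(wikitext: str) -> str:
    """Rewrite every opening <lang xxx> tag to use its canonical code."""
    return re.sub(
        r"<lang\s+([^>\s]+)\s*>",
        lambda m: f"<lang {canonical_lang(m.group(1))}>",
        wikitext,
    )
```

For example, rewrite_lang_tags("<lang ahk>MsgBox</lang>") yields "<lang autohotkey>MsgBox</lang>", while tags that have no alias only get lowercased.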

Recommended language tag usage

4D 4d

A

ALGOL 60 algol60
ALGOL 68 algol68
APL apl
AWK awk
ActionScript actionscript
Ada ada
Agda2 agda2
AmigaE amigae
AppleScript applescript
Assembly asm (for x86)
AutoHotkey autohotkey

B

BASIC qbasic freebasic thinbasic
Bc bc
Befunge befunge
Brainf*** bf

C

C c
C# csharp
C++ cpp
Caml caml
Clean clean
Clojure lisp
Cobol cobol
ColdFusion cfm
Common Lisp lisp
Component Pascal pascal
Coq coq

D

D d
DOS Batch File dos winbatch?
Dc dc
Delphi delphi

E

E e
EC ec
ELLA ella
ESQL sql
Eiffel eiffel
Emacs Lisp lisp
Erlang erlang

F

F f
F# fsharp
FALSE false
FP fp
Factor factor
Fan fan
Forth forth
Fortran fortran

G

GAP gap
Gnuplot gnuplot
Groovy groovy

H

HaXe haxe
Haskell haskell

I

IDL idl
Icon icon
Io io

J

J j
JSON json
JScript.NET jscript
Java java java5
JavaScript javascript
JoCaml jocaml
Joy joy
JudoScript judoscript

K

Korn Shell korn

L

LSE64 lse64
LaTeX latex
LabVIEW labview
Lisaac lisaac
Lisp lisp
Logo logo
Logtalk logtalk
LotusScript lotusscript
Lua lua
Lucid lucid

M

M4 m4
MAXScript maxscript
MIRC Scripting Language mirc
MS SQL sql
Make make
Maple maple
Mathematica mathematica
MATLAB matlab
Maxima maxima
Metafont metafont
Modula-3 modula3

N

NewLISP lisp
Nial nial

O

OCaml ocaml
Oberon-2 oberon2
Object Pascal pascal
Objective-C objc
Octave octave
Omega omega
OpenEdge/Progress openedge
Oz oz

P

PHP php
PL/I pli
PL/SQL plsql
Pascal pascal
Perl perl
Perl 6 perl6
Pike pike
PlainTeX tex
Pop11 pop11
PostScript postscript
PowerShell powershell
Prolog prolog
Python python

Q

Q q

R

R r
REXX rexx
RapidQ rapidq
Raven raven
Rhope rhope
Ruby ruby

S

SAS sas
SETL setl
SMEQL smeql
SNUSP snusp
SQL sql
Scala scala
Scheme scheme
Script3D script3d
Seed7 seed7
Self self
Slate slate
Smalltalk smalltalk
Standard ML sml

T

TI-83 BASIC ti83b
TI-89 BASIC ti89b
Tcl tcl
Toka toka
Tr tr
Transact-SQL sql
Twelf twelf

U

UNIX Shell bash
UnixPipes bash
Unlambda unlambda

V

V v
VBScript vbscript
Vedit macro language vedit
Visual Basic vb
Visual Basic .NET vbnet
Visual Objects visobj

W

Wrapl wrapl

X

XQuery xquery
XSLT xml
XTalk xtalk

GeSHi extension self-report

Here is a list of the codes currently provided by GeSHi.

Non-GeSHi-issues to take care of

Whitespace at program beginning/end

I just wanted to make sure that this problem doesn't get forgotten now that the thread moved to the archive page. --Ce 23:18, 6 July 2009 (UTC)

I think you're going to have to reiterate exactly what the problem is, and how it manifests itself under the current configuration. --Short Circuit 23:50, 6 July 2009 (UTC)

In

<lang cpp>
int main {}
</lang>

there are additional blank lines,

which don't appear in

<pre>
int main()
</pre>

which results in

int main()

Those lines shouldn't be there. Currently the only way to get rid of them is to put the beginning/end tag on the same line as the first/last line of the code, which makes editing (and especially modifying existing snippets) unnecessarily hard (in particular, it's easy to miss an end tag, since it's bolted onto the last line of the code). In addition, it's an inconsistency. The lang tags should work exactly like pre tags, except for syntax highlighting.

The old discussion is in Village Pump:Home/Syntax Highlighting ( archived 2009-06-18 )#Another problem with eating whitespace characters (note that the start of that discussion concerns an older state with a different problem; initially too much was removed). --Ce 10:42, 7 July 2009 (UTC)

Is there still hope that this will get fixed some day? --Ce 20:41, 7 October 2009 (UTC)

I don't know. The syntax highlighter was my top priority just before I got slammed with work months ago, and things are just settling down. An update of the highlighter is definitely in order, but it's going to result in a significant number of breakages of things that currently work. And I'm hoping for a change in some MW internals that may make the GeSHi work simpler.

In all honesty, I'm hopeful, but not expectant, of a solution that's going to work for the rest of the code examples, while at the same time working for the Whitespace language. --Michael Mol 00:45, 8 October 2009 (UTC)

Test...

Fix is simpler than it appeared. Don't put newlines between the tags. :) --Michael Mol 01:15, 8 October 2009 (UTC)

But that makes the page source worse. Especially it makes it easy to miss an end tag. (Yes, it already happened.) I strongly hope that one day, it will work correctly. --Ce 16:16, 10 October 2009 (UTC)

Yes, it does, but in the case of Whitespace, I think it's justified. Otherwise, you're putting language-significant information between the tags, and effectually asking the wiki to strip out some of your program. I don't see how that's better behavior. If copy/paste is necessary, ideally they would copy from the rendered wiki page, not the wiki source. --Michael Mol 22:14, 10 October 2009 (UTC)

IMHO, the code begins in the line after the start tag and ends at the line before the end tag (putting it in the same line is just a hack to work around the bug). Therefore IMHO currently Whitespace is broken because the GeSHi adds extra newlines at the beginning/end. Also note that this is also how the HTML pre tag works, example:

As you see, a line break directly after the pre tag is removed.

One thing I previously got wrong is that this deletion happens only if there's nothing but a newline; any extra whitespace after the pre or before the /pre causes the newline not to be removed. But then, this should simplify the algorithm considerably: Instead of general trimming, one only has to check whether the first resp. last character of the text inside the tags is a newline, and if so, remove it. --Ce 08:39, 11 October 2009 (UTC)
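The rule proposed above can be sketched as follows. This is only an illustration of the suggested check (at most one newline removed at each end, nothing else trimmed), written in Python rather than in the extension's PHP; the function name is made up.

```python
def strip_edge_newlines(code: str) -> str:
    """Drop one leading and one trailing newline, if present.

    Unlike a general trim, any other whitespace is left alone, so
    " \\ncode" keeps both its space and its newline.
    """
    if code.startswith("\n"):
        code = code[1:]
    if code.endswith("\n"):
        code = code[:-1]
    return code
```

With this rule, a snippet whose tags sit on their own lines loses only the two line breaks next to the tags; whitespace that belongs to the program (as in the Whitespace language) is untouched.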

Looking at the HTML source of the page, I now noticed several things:

* GeSHi already uses pre tags internally, but:

* it adds an additional &nbsp; to the first and last (empty) line. This space is not in the source, and I'd therefore consider it a bug in its own right.

* it replaces all newline characters by <br /> which is not only a waste of bandwidth (because pre already honors newline characters), but also defeats the pre tag handling of newlines at the beginning/end.

So the fix would probably be to just remove the two latter points.

BTW, whitespace highlighting currently seems broken anyway, because it doesn't currently highlight newlines: <lang whitespace>

  <-_just_three_spaces
  <-_a_newline,_followed_by_three_more_spaces

</lang>

(my comments apply to what I think should be highlighted; from your interpretation, you'd expect two additional newlines at the beginning/end to be highlighted) Note that the extra spaces in the extra lines (although not highlighted) will break the whitespace code when doing copy/paste, even if using your interpretation. --Ce 09:03, 11 October 2009 (UTC)

It seems like it would be nice to have the lang tag work like the pre tag, but it may not be worth the work. An easy way to catch a lot (still not all) forgotten /lang tags would be to require a preview for all edits (I think it's a MW option). Everyone should preview anyway. And you're right the newline highlighting isn't showing up. It's even easier to see if you add a blank line between the text you showed: <lang whitespace>

  <-_just_three_spaces

  <-_a_newline,_followed_by_three_more_spaces

</lang> I think they used to be red? Maybe that was tabs. --Mwn3d 17:59, 11 October 2009 (UTC)

Given the information I gathered lately (see my latest comment above), I don't think it would be much work. It would be just

Find the code which changes newline characters to br tags and remove it (my guess is that it's also the reason why whitespace highlighting is broken; probably the GeSHi highlighting code looks for newline characters and doesn't find them because they are replaced by br tags).
Find the code which adds the extra &nbsp; at the beginning/end and remove it.

I cannot imagine that taking a lot of time. --Ce 20:21, 11 October 2009 (UTC)

Here's the hacked-up MediaWiki extension RC is currently using. It differs fairly significantly from the original version due to fixes and issues that were brought up since its initial use here. RC is currently using GeSHi 1.0.8.2. The latest in the 1.0.x branch is 1.0.8.4, but we encountered significant breakage when I upgraded to that months ago, so I had to roll it back. I joined the GeSHi project with commit access, with the intention of adding languages, and finding and fixing issues, as well as running SVN HEAD on RC, but between ImplSearchBot issues and the StumbleUpon flood, my server (and remote backup target) at home suffering a catastrophic hardware failure, my other computer's screen flaking out, as well as time cramps relating to family, work emergencies running fairly continually since June, and personal health issues coming to a head in the past month, I really haven't had time.

The 1.0.x GeSHi branch is no longer under active development, and I don't know how painful the transition to the 1.1.x branch is going to be. 1.1.x was supposed to be released in August when I last had the time to talk to the other members of that project, and I don't know why it's still in alpha.

Opticron has made significant headway on the ISB replacement, but has slammed into a problem that may require filing a ticket with the MW folks. I just finished getting my home server running again. Once I've got my home server pulling daily site backups again, I should finally be able to turn my attention toward the syntax highlighting again.

As an aside, I don't think anything changed in the server software between when newlines were highlighted and when they weren't. It may also be a bug brought on by a shift in browser usage. --Michael Mol 01:31, 12 October 2009 (UTC)

I think that it's not a browser problem. Inspecting the generated HTML shows there's no span created around the line breaks.

BTW, the changing of newlines to br tags is in the second-to-last line in the linked file, http://rosettacode.org/resources/gct_rcode.phps (i.e. in the return). Just replacing str_replace("\n",'<br />', $geshi->parse_code()) with $geshi->parse_code() should fix that part. However, that line also shows that my suspicion that this is responsible for the missing newline highlighting is wrong: It is only applied after highlighting.

I also see that you've added my previously suggested function prestyletrim; now it's also clear to me why it didn't work: You applied it after the parsing instead of before. But since I have now recognized that the semantics would be wrong anyway, it should probably be removed. However, instead it may be modified to simply patch away the spurious &nbsp; after the fact (the cleaner way would be not to generate it, but that seems to be in $geshi->parse_code(), which is not in the file you linked to).

I think changing prestyletrim to

<lang php> function prestyletrim($text) {

 return preg_replace("/^&nbsp;/","",preg_replace("/&nbsp;$/","",$text));

} </lang>

should work (I'm not completely sure because I'm no PHP programmer). This modified version should be called exactly where the original is called now (i.e. while it was the wrong place for the original version, it would be the right place for the new one).

Maybe if it works, the function should also be renamed.

BTW, I now note that the extra empty line at the end does not appear on the PHP example, so adding the spurious &nbsp; seems to be part of the programming language dependent code. --Ce 08:02, 12 October 2009 (UTC)

The extra line didn't appear because the parser was in the middle of a literal string at the end of the example. I changed the text to what I think you might have meant to put and the blank line shows up. --Mwn3d 12:11, 12 October 2009 (UTC)

Ah, thanks. However, you mis-fixed it (having the function call twice was intentional; probably it could have been done with one function call and a better regexp, but I chose the easy way, which didn't require me to dig up PHP regex info from the net again). I now replaced with the correctly corrected version (and at the same time also fixed another, unrelated bug in the function). --Ce 16:28, 12 October 2009 (UTC)
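The "one function call and a better regexp" variant mentioned above might look like the sketch below. It is written in Python for illustration and assumes the spurious text is the literal entity &nbsp; at the very start and very end of the highlighted output; the function name is made up and this is not the code running on RC.

```python
import re

# One alternation handles both edges. \Z anchors at the true end of the
# string, whereas a plain $ would also match just before a final newline.
_EDGE_NBSP = re.compile(r"^&nbsp;|&nbsp;\Z")


def strip_edge_nbsp(highlighted: str) -> str:
    """Remove one &nbsp; entity from the start and one from the end."""
    return _EDGE_NBSP.sub("", highlighted)
```

Entities in the middle of the text are not touched, because neither alternative can match away from the two edges.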

OCaml for Standard ML?

It has been suggested to use the OCaml syntax highlighting for Standard ML, and some SML code has been changed to OCaml highlighting already. While the languages are similar, they have many differences in keywords and stuff, and I feel that Standard ML should have a separate highlighting scheme. For example, SML has "datatype" keyword and OCaml does not; logical operators are "andalso" and "orelse" instead of "&&" and "||"; pattern matching is "case ... of" instead of "match ... with"; the "fn" keyword; and lots of other stuff. If I have time I could try to translate the OCaml GeSHi language file into SML; but I am reluctant to do so as I do not own a copy of the Definition of Standard ML, and so I am not confident I will get everything. --76.91.63.71 08:22, 19 July 2009 (UTC)

Sounds reasonable. Grepping around, I find that the list of keywords is this:

and abstype as case datatype else end eqtype exception do fn fun functor funsig handle if in include infix infixr lazy let local nonfix of op open overload raise rec sharing sig signature struct structure then type val where while with withtype orelse andalso

This list was extracted directly from the lexer's keyword table in the source to the SML/NJ implementation, so it should be complete and accurate. I've not categorized their meaning at all. —Donal Fellows 13:46, 19 July 2009 (UTC)

If the languages really are that similar, copy the language file, adjust it for any differences, and send it to me. I'll see that it gets added in as a proper language (and winds up in GeSHi upstream, as well.) --Short Circuit 14:47, 20 July 2009 (UTC)

Non-ASCII characters and unrecognized languages

Look at the below, then look at the source markup of this section.

A smiley face: ☺

Have a ☺ nice day!

The first three smiley faces come through fine, but the last is somehow corrupted. It appears that in general, if GeSHi doesn't recognize a language name, any non-ASCII characters between the lang tags get screwed up. I noticed this bug when using non-ASCII operators in Perl 6 examples. —Underscore 00:19, 7 November 2009 (UTC)

I can probably fix that by changing my "not found" code path when it looks at languages. It's using a PHP character-escaping builtin that's not Unicode-safe, as a brute method of guarding against code injection. --Michael Mol 23:48, 14 November 2009 (UTC)

For the record, this bug is now fixed. —Underscore (Talk) 01:25, 2 December 2009 (UTC)
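For illustration, a Unicode-safe version of the "not found" path only needs to escape the characters that are special in HTML and can pass everything else through. The sketch below uses Python's standard html.escape; the function name is made up, and the actual fix in the PHP extension may well look different.

```python
import html


def escape_for_pre(code: str) -> str:
    """Escape &, < and > so the text is safe inside a pre element.

    Non-ASCII characters such as the smiley are left as they are;
    only the markup-significant ASCII characters are replaced.
    """
    return html.escape(code, quote=False)
```

This still guards against markup injection through the snippet, since any tag in the input comes out as plain text.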

Webform-driven language-file generator

(Moved from Category talk:J.)

While we're on the subject, I would be interested in a programmatic approach to generating these files. The structure of the GeSHi language files is very, very simple; it's little more than a PHP-native serialization of a few regex strings and symbol constants. If GeSHi supported JSON for that structure, it would be trivial to import language highlighting as a JSON file, and such a file would be trivial to generate programmatically. But GeSHi doesn't, so I'm stuck with PHP files until I (or someone else) writes a JSON->PHP conversion. That said, a number of folks have sent me language files, and so are familiar with its structure. Would anyone be interested in writing a webform-driven language-file generator? For security's sake, I can't automate the import of the generated files, but it would greatly open up the process of generating the files, and perhaps make maintenance easier. I'd give it a subdomain such as geshi.rosettacode.org. --Michael Mol 03:33, 10 November 2009 (UTC)
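A rough sketch of what such a JSON->PHP conversion could look like, written in Python. It assumes the language file boils down to one nested array assigned to a variable named $language_data and that keys such as LANG_NAME and KEYWORDS are used as shown; check both against a real language file. The function names are made up, and regex-driven entries are not treated specially.

```python
import json


def php_literal(value, indent=0):
    """Render a JSON-derived Python value as PHP source text."""
    pad = "    " * indent
    inner = "    " * (indent + 1)
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Single-quoted PHP strings only need \ and ' escaped.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    if isinstance(value, list):
        items = [f"{inner}{php_literal(v, indent + 1)}," for v in value]
    elif isinstance(value, dict):
        # JSON keys are always strings; PHP turns a key like '1' into the
        # integer key 1, so numbered groups survive the round trip.
        items = [
            f"{inner}{php_literal(k)} => {php_literal(v, indent + 1)},"
            for k, v in value.items()
        ]
    else:
        raise TypeError(f"unsupported value: {value!r}")
    if not items:
        return "array()"
    return "array(\n" + "\n".join(items) + "\n" + pad + ")"


def json_to_language_file(json_text: str) -> str:
    """Wrap the converted array in a minimal PHP file."""
    return "<?php\n$language_data = " + php_literal(json.loads(json_text)) + ";\n"
```

Because the output is generated text that would later be executed as PHP, it should still be reviewed by hand before being installed, in line with the security concern above.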

Hm, I think I could write a CGI script (in Perl 5) to let folks fill out a form, perhaps with some minimal markup, to create a language definition. With careful use of resource limits and sanitization of the input, we could even let the user test the new definition without a local copy of PHP. But I'm not sure how using such a Web application would actually be easier or quicker than writing the literal definition. I mean, one of the things that makes writing new definitions so easy is that you can use a preexisting definition of a similar language as a starting point. Can you describe in more detail the interface you're imagining? —Underscore (Talk) 13:23, 10 November 2009 (UTC)

It's been ages since I've had a chance to look at the structure, but an editable list for each of the list-type members would be good. I don't know what to do about the regex-driven ones; A wizard would be sweet, but I don't know that that would be possible.

I could provide a MySQL backend for persistence. That would make it plausible to tweak existing support.

Not sure how far to go as far as sanitation and execution. The more secure I try to make it, the more time I'll need to spend responding to problems. --Michael Mol 16:09, 10 November 2009 (UTC)

How about this: sometime in the near future, I'll write the simplest such program that could possibly be useful. Then I'll take feature requests, or anybody who wants to can submit patches. How does that sound? Could you give me FTP access to geshi.rosettacode.org, or some other way of controlling what appears there? —Underscore (Talk) 23:28, 10 November 2009 (UTC)

Sounds like a plan. I'll send you an email regarding connectivity. --Michael Mol 00:17, 11 November 2009 (UTC)

Okay, everyone, a basic implementation is up at http://rosettacode.org/geshi. The source is available upon request. —Underscore (Talk) 23:16, 17 November 2009 (UTC)

Here are some notes about the current version:

You could add some more instructions for filling the field. For example, what is used as separator character (a space?).
The current version only has one keyword group. If you could add at least a 2nd keyword group, then it would be easier to add more groups by editing the resulting file.

Will the form be inserted within a normal wiki page so that there will be navigation links?

--PauliKL 16:04, 16 December 2009 (UTC)

I figured the program would be easier to use if the user could just deduce the syntax from the example input rather than reading a detailed description of it. For instance, I think that the example provided for COMMENTS_MULTI (/* */, {- -}) is much easier to understand than an explanation like "separate the two delimiters of each pair with whitespace and separate each pair with a comma and optionally some whitespace". Tell me if there's a particular detail that you think could use explicit explanation.
Not a bad idea; I think I'll make that change when I get the opportunity. (I'm using a public computer running Windoze at the moment.)
It appears that MediaWiki blithely ignores <form>, unfortunately. Are there any particular links you'd like to see on rosettacode.org/geshi? —Underscore (Talk) 17:40, 16 December 2009 (UTC)
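The input convention described for the multi-line comment field (pairs separated by commas, the two delimiters of a pair separated by whitespace) can be parsed along these lines. This is a Python sketch for illustration only; AutoGeSHi itself is a Perl program and its real parsing code is not shown on this page. The function name is made up.

```python
def parse_comment_pairs(field: str) -> dict:
    """Parse e.g. "/* */, {- -}" into {"/*": "*/", "{-": "-}"}."""
    pairs = {}
    for chunk in field.split(","):
        parts = chunk.split()
        if not parts:
            continue  # tolerate empty entries such as a trailing comma
        if len(parts) != 2:
            raise ValueError(f"expected an opener and a closer, got: {chunk!r}")
        opener, closer = parts
        pairs[opener] = closer
    return pairs
```

One limitation of this convention is visible here: a delimiter that itself contains a comma or whitespace cannot be expressed.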

If it is not possible to embed the form into the Rosetta Code navigation frame, then maybe just add a link to the main page. And maybe a link to the GeSHi site page that describes the format of the syntax file (I once found the page but I have lost the link). --PauliKL 12:23, 23 December 2009 (UTC)

Okay, now, there are a few links at the top of the page, and you can define multiple keyword groups. —Underscore (Talk) 14:57, 23 December 2009 (UTC)

For the form to be inserted, it would probably best be done as a Special page MediaWiki extension, which would require the extension to be written in PHP. A couple possibilities come to mind for marshalling from PHP to AutoGeSHi (which is Perl). Probably the best (least work for me, least invasiveness to MediaWiki processes) way would be if AutoGeSHi could export a descriptor of the fields, field types, field labels, form target and method, the extension would build the form, and the Submit button would pass the data back to AutoGeSHi. --Michael Mol 19:49, 16 December 2009 (UTC)

Actually, AutoGeSHi doesn't produce the form; it only processes the input. rosettacode.org/geshi/index.html is a static HTML document that I hand-wrote. —Underscore (Talk) 20:47, 16 December 2009 (UTC)

Currently there's a language file creation tool under construction for GeSHi itself that will be included in upcoming releases and will offer basic support for writing language files. This tool isn't finished yet though. --BenBE 13:58, 6 January 2010 (UTC)

Solved issues

Prolog

The Prolog highlighter is emitting garbage instead of hyperlinks. <lang prolog>write(X)</lang>
—Underscore (Talk) 13:49, 1 January 2010 (UTC)

This issue has been fixed on the server and upstream. I had forgotten to forbid one character after variables. --BenBE 22:37, 1 January 2010 (UTC)

C syntax highlight

I've noticed that the "string" word is highlighted as a type in C; it shouldn't be so. --ShinTakezou 23:00, 3 July 2009 (UTC)

Removed string from the plain C and C for Mac language files, and added all the standard integer types from stdint.h instead. --BenBE

Small task: For the C language file the section containing the standard function names like printf is awfully empty. I'd need some guys to fill it up ;-) Basically the list should contain all the functions usually found in the libc that ships with the C compiler. --BenBE

Java5 Highlighting

The java5 highlighting removes links to classes when generics are specified for them without a space between the class name and the <. Example: <lang java5>LinkedList<T></lang> "LinkedList" should be a link. If you add a space the link shows up: <lang java5>LinkedList <T></lang> I think a lot of people leave the space out so the highlighting should put a link in with or without the space. --Mwn3d 12:35, 29 July 2009 (UTC)

Should be fixed now --BenBE 13:03, 2 January 2010 (UTC)

The java5 (and java) highlighting doesn't highlight javadoc comments as multi-line comments: <lang java5>/**
 * this is a comment
 */</lang>

The style for co3 is missing. GeSHi itself already highlights it. --BenBE 13:03, 2 January 2010 (UTC)

Which style does it use for regular comments? I can copy that and change the color slightly to make sure it looks like there's a little bit of difference. --Mwn3d 19:23, 7 January 2010 (UTC)

It also doesn't highlight "import static" lines properly (this should only be in java5): <lang java5>import static SomeClass.someMethod;</lang> "import" and "static" should both be blue and bold and "SomeClass.someMethod" should be gray, bold, and italic. --Mwn3d 13:34, 23 September 2009 (UTC)

Wasn't supported until now. Added support for import static. Should work now --BenBE 13:03, 2 January 2010 (UTC)

It mostly works now. Look at this example:

<lang java5>Map<String, Integer> </lang>

"String" should also be a link like it is in this example:

<lang java5>Map < String, Integer > </lang>

--Mwn3d 23:27, 5 January 2010 (UTC)

Done --BenBE 00:20, 6 January 2010 (UTC)

Known bugs with GeSHi Syntax Highlighting

OCaml syntax highlighting issues

Here are bugs in the ocaml syntax highlighting

there is a difference between integer arithmetic and float arithmetic, for example:

<lang ocaml>3 + 4 (* integer addition *) 3.4 +. 1.2 (* float addition *)</lang> the point that indicates the float operation (+. *. /. -.) is caught as the separator between a module name and a function from that module: <lang ocaml>x +. y</lang> here x and y should have the same color.
This problem also occurs with fields of structures: <lang ocaml>type t = { x:int; y:int } ;; let v = {x=2; y=3} ;; print_int (v.x + v.y) ;;</lang>

there is also something that I consider an issue: when a function or a constructor follows a module name, these two different entities are colored in the same way, example (from there):

<lang ocaml>Unix.sleep (* function *) Unix.O_CREAT (* constructor *) Unix.LargeFile (* a sub-module *) (ref 3).contents (* the field of a structure *)</lang> also I don't find it very uniform to colorise elements that follow a module name, and not when the module is opened: <lang ocaml>open Unix (* equivalent elements, but here with different colors *) sleep O_CREAT open LargeFile { contents = 3 }</lang> so I think GeSHi should not colorise names after a dot. This would by the way fix the bug with float operations.
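To make the float-operator point concrete, here is a small tokenizer sketch in Python. It tries the float operators before anything involving a dot, and only treats a capitalised name as a module qualifier. The token names are made up, comments and string literals are ignored, and this is not how GeSHi's parser is structured.

```python
import re

TOKEN = re.compile(
    r"""
      (?P<float_op>[-+*/]\.)                   # +.  -.  *.  /.
    | (?P<module>[A-Z][A-Za-z0-9_']*(?=\.))    # capitalised name right before a dot
    | (?P<name>[A-Za-z_][A-Za-z0-9_']*)        # any other identifier
    | (?P<other>\S)                            # everything else, one char at a time
    """,
    re.VERBOSE,
)


def tokenize(src: str):
    """Return (kind, text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(src)]
```

With this ordering, x and y in "x +. y" come out as the same kind of token, and the record fields in "v.x + v.y" are not mistaken for module members because v is not capitalised.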

when one opens a module, one can also open a module inside another module

<lang ocaml>open Ode open Ode.LowLevel (* LowLevel is a module inside the module Ode *)</lang>

Modules that are not part of the standard library are not colorised, IMHO this is not very uniform and relevant. In editors with ocaml syntax-highlight, all modules are highlighted. (modules from the standard lib have a link to the doc, please keep this feature)
- (identifiers (identifiers to values (often called variables) and identifiers to functions) have the first letter lower case or an underscore)
- a name that starts with an upper-case letter is a module name or a constructor (they are capitalised)

we can make the difference between a module name and a constructor when it's followed by a dot it's a module name (maybe I'm wrong, but I can't find an example of a constructor followed by a dot) <lang ocaml>Unix.sleep (* module *) Some "text" (* a constructor *)</lang> when there is no dot after it but the keyword open before it, it is a module name: <lang ocaml>open Unix</lang> we can also open a submodule: <lang ocaml>open ExtString.String (* both modules ExtString and String should have the same color *)</lang> there can be any depth: <lang ocaml>open M1.M2.M3.M4.M5.M6.M7.M8</lang> we can also recognize a module name if there is the keyword module before it: <lang ocaml>module Attrib</lang> in all other cases it is a constructor.
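The rules above for telling modules from constructors can be sketched as follows, in Python for illustration. The function name is made up, and comments and string literals are not handled, so a capitalised word inside a string would be misclassified.

```python
import re

_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_']*|\.|\S")
_CAPITALISED = re.compile(r"[A-Z][A-Za-z0-9_']*")


def classify_capitalised(src: str):
    """Label each capitalised name as 'module' or 'constructor'."""
    tokens = _TOKEN.findall(src)
    result = []
    in_open_path = False  # inside the dotted path following open/module
    for i, tok in enumerate(tokens):
        if tok == ".":
            continue
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if not _CAPITALISED.fullmatch(tok):
            in_open_path = False
            continue
        if prev in ("open", "module") or (prev == "." and in_open_path):
            result.append((tok, "module"))       # open M1.M2..., module M
            in_open_path = True
        elif nxt == ".":
            result.append((tok, "module"))       # followed by a dot
            in_open_path = False
        else:
            result.append((tok, "constructor"))  # all other cases
            in_open_path = False
    return result
```

For "Unix.O_CREAT" this gives Unix as a module and O_CREAT as a constructor, while every component of "open M1.M2.M3" is a module.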

I think constructors should be highlighted too (with another color than modules), (in editors constructors are highlighted).

(see previous paragraph to see when a capitalised word is a module or a constructor)

Modules from the standard library are not treated the same, some are highlighted with a link to the doc, and some other are not:

<lang ocaml>Arg Arith_status Array Array1 (* * *) Array2 (* * *) Array3 (* * *) ArrayLabels Big_int Bigarray Buffer Callback CamlinternalLazy (* TODO *) CamlinternalMod (* TODO *) CamlinternalOO Char Complex Condition Dbm Digest Dynlink Event Filename Format Gc Genarray (* * *) Genlex Graphics GraphicsX11 Hashtbl Int32 Int64 LargeFile (* !!! *) Lazy Lexing List ListLabels Make (* !!! *) Map Marshal MoreLabels Mutex Nativeint Num Obj Oo Parsing Pervasives Printexc Printf Queue Random Scanf Scanning (* * *) Scanf Set MoreLabels Sort Stack State (* * *) Random StdLabels Str Stream String StringLabels Sys Thread ThreadUnix Tk Unix (* TODO *) UnixLabels (* TODO *) Weak (* TODO *)</lang>
Those tagged with TODO should be added (Unix in particular, as a priority); the others could be omitted. Those with a star are modules inside another module; for example Array1, Array2, Array3 and Genarray are inside Bigarray.
For a root module named Foo the URL is http://caml.inria.fr/pub/docs/manual-ocaml/libref/Foo.html
For a module Bar inside Foo the URL is http://caml.inria.fr/pub/docs/manual-ocaml/libref/Foo.Bar.html
The !!! tags are for modules whose URL cannot be resolved because the name is ambiguous; for example, there are LargeFile modules inside several root modules: Pervasives, Unix, and UnixLabels.
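The URL scheme described above could be captured as in the Python sketch below. The parent table lists only the nestings spelled out in the text (the Bigarray submodules and the ambiguous LargeFile); the function and table names are made up for illustration.

```python
DOC_BASE = "http://caml.inria.fr/pub/docs/manual-ocaml/libref/"

# Hypothetical lookup table: nested module name -> possible parent modules.
PARENTS = {
    "Array1": ["Bigarray"],
    "Array2": ["Bigarray"],
    "Array3": ["Bigarray"],
    "Genarray": ["Bigarray"],
    "LargeFile": ["Pervasives", "Unix", "UnixLabels"],
}


def module_doc_url(*path: str) -> str:
    """Foo -> .../Foo.html, (Foo, Bar) -> .../Foo.Bar.html"""
    return DOC_BASE + ".".join(path) + ".html"


def resolve_doc_url(name: str):
    """Return the doc URL for a bare module name, or None if ambiguous."""
    parents = PARENTS.get(name)
    if parents is None:
        return module_doc_url(name)  # treated as a root module
    if len(parents) == 1:
        return module_doc_url(parents[0], name)
    return None  # e.g. LargeFile: several candidate parents
```

Returning None for ambiguous names matches the suggestion that such modules be highlighted without a link.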

not a bug nor an issue, but a suggestion: maybe we could highlight the core OCaml types:

<lang ocaml>list array int float (* !!! *) bool char string unit int32 int64 nativeint <fun> in_channel out_channel file_descr exn</lang>

I have kept in_channel and out_channel in the list because I think we should use the same color.

float is a type in contexts where ocaml expects a type and a function in contexts where ocaml expects a function. So we could keep it as a function, or convert it to a type. I'm not sure which is best, but in RC context it seems it is most often the function.

not a bug nor an issue, but a suggestion: maybe we could highlight labels (names with a tilde before them):

<lang ocaml>let my_compare ~left ~right =
(* ... *)</lang>
I propose this style (font-weight:bold; color:#339933;): ~left

not a bug nor an issue, but a suggestion: maybe we could highlight polymorphic variants (names with a backtick before it):

<lang ocaml>`left (* constructor of a polymorphic variant *)
Left (* constructor of a non-polymorphic variant *)</lang> IMHO these two kinds of constructors should be highlighted with the same color.
I suggest this style (font-weight:bold; color:#993399;): Left `left (if the highlight for names following a dot is removed as suggested above, because this is the same color)

This bug is not related to the OCaml syntax (but affects ocaml syntax), GeSHi doesn't handle nested comments (see Comments)

<lang ocaml>(* This a comment
(* containing nested comment *) *)</lang> just a suggestion, maybe an easy way to handle this would be to match *) ... *)?

-- Blue Prawn 21:29, 22 January 2010 (UTC)
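For the nested-comment case, a depth counter is one way to find where a comment really ends. The sketch below is in Python for illustration only; the function name is made up, it does not reflect how GeSHi scans comments, and it ignores the fact that OCaml also lexes string literals inside comments.

```python
def find_comment_end(src: str, start: int) -> int:
    """Return the index just past the "*)" that closes the comment
    opening at src[start], or -1 if the comment is unterminated."""
    if not src.startswith("(*", start):
        raise ValueError("start must point at '(*'")
    depth = 0
    i = start
    while i < len(src) - 1:
        pair = src[i:i + 2]
        if pair == "(*":
            depth += 1  # entering a (possibly nested) comment
            i += 2
        elif pair == "*)":
            depth -= 1  # leaving one nesting level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1
```

Applied to the example above, the scan stops after the second "*)" rather than the first, so the whole outer comment is covered.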