Rosetta Code:Village Pump/Syntax highlighting: Difference between revisions

From Rosetta Code

Revision as of 23:28, 10 November 2009

Discuss issues related to the Syntax Highlighting system here. The old page got huge, and it became hard to discern what problems were current.

Relationship Between Rosetta Code and GeSHi

Rosetta Code first started using GeSHi for syntax highlighting a long while back, but due to our nature, we quickly discovered, and frequently continue to discover, programming languages for which GeSHi does not provide highlighting functionality. Additionally, we've uncovered bugs in various releases of the software, and the GeSHi folks have been welcoming of fixes sent to them.

Due to the way language support is managed in GeSHi, it's fairly trivial for someone who can follow PHP syntax to create a PHP file that adds support for the language of their choice. GeSHi's support for Oberon-2, Modula-3 and AutoHotKey is the direct result of contributions by MBishop and Tinku99. (If you like a language, and it doesn't appear to have syntax highlighting support, I strongly suggest you follow their lead. ;-) )

I am now also part of the GeSHi project, focusing for now on adding and improving support for programming languages present on Rosetta Code. As a community of language aficionados and enthusiasts, Rosetta Code is a hotbed of opportunities for improving GeSHi, and improvement of GeSHi is extremely helpful for improving the readability of code on Rosetta Code. --Short Circuit 06:43, 18 June 2009 (UTC)

So how can I add a language? I have made a GeSHi file for Vedit macro language. Where can I upload it? (I cannot test it myself, so I don't know if it is OK.) I could do a file for RapidQ quite easily, too, since I have a Vedit syntax file for it.
It would be nice if there was an example GeSHi file downloadable somewhere (or perhaps even a few of them). The only one I found was on the GeSHi web page. I had to cut and paste the source from the browser, and then do quite a lot of editing to get newlines etc. fixed. --PauliKL 15:26, 30 June 2009 (UTC)
Email them to me at mikemol@gmail.com, and I'll get them tested, staged on RC and committed to SVN as quickly as I can. --Short Circuit 17:06, 30 June 2009 (UTC)

If anyone else would like to make a language file, look at this explanation of the language file format from the GeSHi site: http://qbnz.com/highlighter/geshi-doc.html#language-files. --Mwn3d 19:56, 8 September 2009 (UTC)

C syntax highlight

I've noticed that the "string" word is highlighted as a type in C; it shouldn't be so. --ShinTakezou 23:00, 3 July 2009 (UTC)

Whitespace at program beginning/end

I just wanted to make sure that this problem doesn't get forgotten now that the thread moved to the archive page. --Ce 23:18, 6 July 2009 (UTC)

I think you're going to have to reiterate exactly what the problem is, and how it manifests itself under the current configuration. --Short Circuit 23:50, 6 July 2009 (UTC)
In
<lang cpp>
int main {}
</lang>
there are additional blank lines,

<lang cpp> int main {} </lang>

which don't appear in
<pre>
int main()
</pre>
which results in
int main()
Those lines shouldn't be there. Currently the only way to get rid of them is to put the beginning/end tag on the same line as the first/last line of the code, which makes editing (and especially modifying existing snippets) unnecessarily hard; in particular, it's easy to miss an end tag, since it's bolted onto the last line of the code. In addition, it's an inconsistency. The lang tags should work exactly like pre tags, except for syntax highlighting.
The old discussion is in Village Pump:Home/Syntax Highlighting ( archived 2009-06-18 )#Another problem with eating whitespace characters (note that the start of that discussion concerns an older state with a different problem; initially too much was removed). --Ce 10:42, 7 July 2009 (UTC)
Is there still hope that this will get fixed some day? --Ce 20:41, 7 October 2009 (UTC)
I don't know. The syntax highlighter was my top priority just before I got slammed with work months ago, and things are just settling down. An update of the highlighter is definitely in order, but it's going to result in a significant number of breakages of things that currently work. And I'm hoping for a change in some MW internals that may make the GeSHi work simpler.
In all honesty, I'm hopeful, but not expectant, of a solution that's going to work for the rest of the code examples, while at the same time working for the Whitespace language. --Michael Mol 00:45, 8 October 2009 (UTC)

Test...

<lang cpp>int main {}</lang>

Fix is simpler than it appeared. Don't put newlines between the tags. :) --Michael Mol 01:15, 8 October 2009 (UTC)
But that makes the page source worse. Especially it makes it easy to miss an end tag. (Yes, it already happened.) I strongly hope that one day, it will work correctly. --Ce 16:16, 10 October 2009 (UTC)
Yes, it does, but in the case of Whitespace, I think it's justified. Otherwise, you're putting language-significant information between the tags, and effectually asking the wiki to strip out some of your program. I don't see how that's better behavior. If copy/paste is necessary, ideally they would copy from the rendered wiki page, not the wiki source. --Michael Mol 22:14, 10 October 2009 (UTC)
IMHO, the code begins in the line after the start tag and ends at the line before the end tag (putting it in the same line is just a hack to work around the bug). Therefore IMHO currently Whitespace is broken because the GeSHi adds extra newlines at the beginning/end. Also note that this is also how the HTML pre tag works, example:
  
   
As you see, a line break directly after the pre tag is removed.
One thing I previously got wrong is that this deletion happens only if there's nothing but a newline; any extra whitespace after the pre or before the /pre causes the newline not to be removed. But then, this should simplify the algorithm considerably: instead of general trimming, one has only to check whether the first or, respectively, last character of the text inside the tags is a newline, and if so, remove it. --Ce 08:39, 11 October 2009 (UTC)
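The simplified rule described above can be sketched as follows. This is only a minimal Python illustration of the rule as stated in this comment, not code from the extension; the function name `trim_like_pre` is hypothetical.

```python
def trim_like_pre(text):
    """Drop at most one newline at each end of the text between the tags.

    Sketch of the rule described in the discussion: no general trimming,
    only a check of the very first and the very last character.
    """
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


# A lone newline after the start tag and before the end tag is removed.
print(repr(trim_like_pre("\nint main() {}\n")))
# Extra whitespace around the newlines means nothing is removed.
print(repr(trim_like_pre(" \nint main() {}\n ")))
```

Because only one character is inspected at each end, a program whose own text starts or ends with further newlines (as Whitespace programs may) keeps them.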
Looking at the HTML source of the page, I now noticed several things:
* GeSHi already uses pre tags internally, but:
* it adds an additional &nbsp; to the first and last (empty) line. This space is not in the source, and I'd therefore consider it a bug in its own right.
* it replaces all newline characters by <br /> which is not only a waste of bandwidth (because pre already honors newline characters), but also defeats the pre tag handling of newlines at the beginning/end.
So the fix would probably be to just remove the behavior described in the latter two points.
BTW, whitespace highlighting currently seems broken anyway, because it doesn't currently highlight newlines:

<lang whitespace>

  <-_just_three_spaces
  <-_a_newline,_followed_by_three_more_spaces

</lang>

(my comments apply to what I think should be highlighted; from your interpretation, you'd expect two additional newlines at the beginning/end to be highlighted) Note that the extra spaces in the extra lines (although not highlighted) will break the whitespace code when doing copy/paste, even if using your interpretation. --Ce 09:03, 11 October 2009 (UTC)

It seems like it would be nice to have the lang tag work like the pre tag, but it may not be worth the work. An easy way to catch a lot (still not all) of the forgotten /lang tags would be to require a preview for all edits (I think it's a MW option). Everyone should preview anyway. And you're right, the newline highlighting isn't showing up. It's even easier to see if you add a blank line between the text you showed: <lang whitespace>

  <-_just_three_spaces
  <-_a_newline,_followed_by_three_more_spaces

</lang> I think they used to be red? Maybe that was tabs. --Mwn3d 17:59, 11 October 2009 (UTC)

Given the information I gathered lately (see my latest comment above), I don't think it would be much work. It would be just
  • Find the code which changes newline characters to br tags and remove it (my guess is that it's also the reason why whitespace highlighting is broken; probably the GeSHi highlighting code looks for newline characters and doesn't find them because they are replaced by br tags).
  • Find the code which adds the extra &nbsp; at the beginning/end and remove it.
I cannot imagine that taking a lot of time. --Ce 20:21, 11 October 2009 (UTC)
Here's the hacked-up MediaWiki extension RC is currently using. It differs fairly significantly from the original version due to fixes and issues that were brought up since its initial use here. RC is currently using GeSHi 1.0.8.2. The latest in the 1.0.x branch is 1.0.8.4, but we encountered significant breakage when I upgraded to that months ago, and I had to roll it back. I joined the GeSHi project with commit access, with the intention of adding languages, finding and fixing issues, and running SVN HEAD on RC. But between ImplSearchBot issues and the StumbleUpon flood, my server (and remote backup target) at home suffering a catastrophic hardware failure, my other computer's screen flaking out, time cramps relating to family, work emergencies running fairly continually since June, and personal health issues coming to a head in the past month, I really haven't had time.
The 1.0.x GeSHi branch is no longer under active development, and I don't know how painful the transition to the 1.1.x branch is going to be. 1.1.x was supposed to be released in August when I last had the time to talk to the other members of that project, and I don't know why it's still in alpha.
Opticron has made significant headway on the ISB replacement, but has slammed into a problem that may require filing a ticket with the MW folks. I just finished getting my home server running again. Once I've got my home server pulling daily site backups again, I should finally be able to turn my attention toward the syntax highlighting again.
As an aside, I don't think anything changed in the server software between when newlines were highlighted and when they weren't. It may also be a bug brought on by a shift in browser usage. --Michael Mol 01:31, 12 October 2009 (UTC)
I think that it's not a browser problem. Inspecting the generated HTML shows there's no span created around the line breaks.
BTW, the changing of newlines to br tags is in the second-to-last line in the linked file, http://rosettacode.org/resources/gct_rcode.phps (i.e. in the return). Just replacing str_replace("\n",'<br />', $geshi->parse_code()) with $geshi->parse_code() should fix that part. However, that line also shows that my suspicion that this is responsible for the missing newline highlighting is wrong: It is only applied after highlighting.
I also see that you've added my previously suggested function prestyletrim; now it's also clear to me why it didn't work: You applied it after the parsing instead of before. But since I now recognized that the semantics would be wrong anyways, it should probably be removed. However, instead it may be modified to simply patch away the spurious &nbsp; after the fact (the cleaner way would be not to generate it, but that seems to be in $geshi->parse_code(), which is not in the file you linked to).
I think changing prestyletrim to

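<lang php> function prestyletrim($text) {
  // Strip the spurious &nbsp; entity from the start and the end of the highlighted output.
  return preg_replace("/^&nbsp;/", "", preg_replace("/&nbsp;$/", "", $text));
} </lang>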

should work (I'm not completely sure because I'm no PHP programmer). This modified version should be called exactly where the original is called now (i.e. while it was the wrong place for the original version, it would be the right place for the new one).
Maybe if it works, the function should also be renamed.
BTW, I now note that the extra empty line at the end does not appear on the PHP example, so adding the spurious &nbsp; seems to be part of the programming language dependent code. --Ce 08:02, 12 October 2009 (UTC)
The extra line didn't appear because the parser was in the middle of a literal string at the end of the example. I changed the text to what I think you might have meant to put and the blank line shows up. --Mwn3d 12:11, 12 October 2009 (UTC)
Ah, thanks. However, you mis-fixed it (having the function call twice was intentional; probably it could have been done with one function call and a better regexp, but I chose the easy way, which didn't require me to dig up PHP regex info from the net again). I now replaced with the correctly corrected version (and at the same time also fixed another, unrelated bug in the function). --Ce 16:28, 12 October 2009 (UTC)
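The "one function call and a better regexp" variant mentioned above might look like the following. This is a hedged Python sketch rather than the PHP actually used on RC; it assumes the spurious entity is the literal string `&nbsp;` at the very start and very end of the highlighted output, and the names `EDGE_NBSP` and `strip_edge_nbsp` are hypothetical.

```python
import re

# Assumption: the unwanted entity is the literal text "&nbsp;" sitting at the
# very beginning and/or the very end of the highlighted output.
EDGE_NBSP = re.compile(r"\A&nbsp;|&nbsp;\Z")


def strip_edge_nbsp(html):
    """Remove a leading and a trailing "&nbsp;" in a single substitution."""
    return EDGE_NBSP.sub("", html)


print(strip_edge_nbsp("&nbsp;int main() {}&nbsp;"))
```

The alternation handles both ends in one pass, and any `&nbsp;` in the middle of the text is left alone.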

Java5 Highlighting

The java5 highlighting removes links to classes when generics are specified for them without a space between the class name and the <. Example: <lang java5>LinkedList<T></lang> "LinkedList" should be a link. If you add a space the link shows up: <lang java5>LinkedList <T></lang> I think a lot of people leave the space out so the highlighting should put a link in with or without the space. --Mwn3d 12:35, 29 July 2009 (UTC)

The java5 (and java) highlighting doesn't highlight javadoc comments as multi-line comments: <lang java5>/**
 * this is a comment
 */</lang>

It also doesn't highlight "import static" lines properly (this should only be in java5): <lang java5>import static SomeClass.someMethod;</lang> "import" and "static" should both be blue and bold and "SomeClass.someMethod" should be gray, bold, and italic. --Mwn3d 13:34, 23 September 2009 (UTC)

Language tags

Due to changes I want to implement in the way processing of the lang tag is done, we need to standardize codes for all of the languages currently on Rosetta Code ASAP. Subsequently, all pages need to be scoured for code snippets.

Here is a list of the languages that currently exist on Rosetta Code as I write this. Next to each language, place the code for that language. If the code is already established in GeSHi, make it bold. If no code is provided by GeSHi, and no code is currently in use for that language on the wiki, make one up. Once this list is fully populated, all the pages for each language need to be checked to ensure that the code snippets for that language use the correct code. (Not sure what to do about command-line one-liners or similar yet, though.)

Note: AutoHotKey currently has two codes, ahk and autohotkey. When AutoHotKey support was added to GeSHi, it was added with the longer code. (lang codes are derived from the filename.) Since there were already code snippets on RC that used ahk for AutoHotKey, I created a symlink that allowed ahk to be used as a language code as well. This should likely be reversed, meaning instances of the ahk tag need to be replaced with the autohotkey tag. --Short Circuit 07:01, 18 June 2009 (UTC)
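Replacing legacy codes such as ahk in page source could be sketched like this. It is an illustrative Python sketch only: the alias table holds just the one pair named above, the helper names are hypothetical, and how the pages would actually be fetched and saved is left out.

```python
import re

# Hypothetical alias table: legacy lang code -> canonical GeSHi code.
# Only the ahk/autohotkey pair comes from the discussion above.
LANG_ALIASES = {"ahk": "autohotkey"}

# Matches an opening tag such as <lang ahk> and captures the code.
LANG_OPEN_TAG = re.compile(r"<lang\s+([^>\s]+)\s*>")


def canonical_lang(code):
    """Return the canonical code for a lang tag, or the code itself."""
    return LANG_ALIASES.get(code, code)


def normalize_lang_tags(wikitext):
    """Rewrite every opening lang tag to use the canonical code."""
    return LANG_OPEN_TAG.sub(
        lambda m: "<lang %s>" % canonical_lang(m.group(1)), wikitext
    )


print(normalize_lang_tags("<lang ahk>MsgBox, Hello</lang>"))
```

Codes that are not in the table pass through unchanged, so the same pass could be run over every page once the list below is populated.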

  • 4D

A

  • ALGOL 60
  • ALGOL 68
  • APL apl
  • AWK awk
  • ActionScript actionscript
  • Ada ada
  • Agda2
  • AmigaE amigae
  • AppleScript applescript
  • Assembly asm (for x86)
  • AutoHotkey autohotkey

B

  • BASIC qbasic freebasic thinbasic
  • Bc
  • Befunge
  • Brainf*** bf

C

  • C c
  • C sharp csharp
  • C++ cpp
  • Caml
  • Clean clean
  • Clojure
  • Cobol cobol
  • ColdFusion
  • Common Lisp lisp
  • Component Pascal
  • Coq

D

  • D d
  • DOS Batch File dos
  • Dc
  • Delphi delphi

E

  • E e
  • EC
  • ELLA
  • ESQL
  • Eiffel eiffel
  • Emacs Lisp
  • Erlang

F

  • F
  • F Sharp fsharp
  • FALSE
  • FP
  • Factor
  • Fan
  • Forth forth
  • Fortran fortran

G

  • GAP
  • Gnuplot
  • Groovy groovy

H

  • HaXe haxe
  • Haskell haskell

I

  • IDL idl
  • Icon
  • Io io

J

  • J j
  • JSON
  • JScript.NET
  • Java java java5
  • JavaScript javascript
  • JoCaml
  • Joy joy
  • JudoScript

K

  • Korn Shell

L

  • LSE64
  • LaTeX latex
  • LabVIEW
  • Lisaac
  • Lisp lisp
  • Logo logo
  • Logtalk
  • LotusScript lotusscript
  • Lua lua
  • Lucid lucid

M

  • M4 m4
  • MAXScript maxscript
  • MIRC Scripting Language mirc
  • MS SQL
  • Make make
  • Maple
  • Mathematica mathematica
  • MATLAB matlab
  • Maxima
  • Metafont metafont
  • Modula-3 modula3

N

  • NewLISP
  • Nial nial

O

  • OCaml ocaml
  • Oberon-2 oberon2
  • Object Pascal
  • Objective-C objc
  • Octave octave
  • Omega
  • OpenEdge/Progress
  • Oz

P

  • PHP php
  • PL/I
  • PL/SQL plsql
  • Pascal pascal
  • Perl perl
  • Perl 6 perl6
  • Pike
  • PlainTeX tex
  • Pop11 pop11
  • PostScript postscript
  • PowerShell powershell
  • Prolog prolog
  • Python python

Q

  • Q q

R

  • R r R
  • REXX rexx
  • RapidQ rapidq
  • Raven
  • Rhope
  • Ruby ruby

S

  • SAS sas
  • SETL
  • SMEQL
  • SNUSP
  • SQL sql
  • Scala scala
  • Scheme scheme
  • Script3D
  • Seed7
  • Self
  • Slate slate
  • Smalltalk smalltalk
  • Standard ML sml (ocaml?)

T

  • TI-83 BASIC
  • TI-89 BASIC
  • Tcl tcl
  • Toka
  • Tr
  • Transact-SQL
  • Twelf

U

  • UNIX Shell bash
  • UnixPipes
  • Unlambda

V

  • V v
  • VBScript
  • Vedit macro language vedit
  • Visual Basic vb
  • Visual Basic .NET vbnet
  • Visual Objects

W

  • Wrapl

X

  • XSLT
  • XTalk

Here is a list of the codes currently provided by GeSHi.

<lang list></lang>

OCaml for Standard ML?

It has been suggested to use the OCaml syntax highlighting for Standard ML, and some SML code has been changed to OCaml highlighting already. While the languages are similar, they have many differences in keywords and stuff, and I feel that Standard ML should have a separate highlighting scheme. For example, SML has "datatype" keyword and OCaml does not; logical operators are "andalso" and "orelse" instead of "&&" and "||"; pattern matching is "case ... of" instead of "match ... with"; the "fn" keyword; and lots of other stuff. If I have time I could try to translate the OCaml GeSHi language file into SML; but I am reluctant to do so as I do not own a copy of the Definition of Standard ML, and so I am not confident I will get everything. --76.91.63.71 08:22, 19 July 2009 (UTC)

Sounds reasonable. Grepping around, I find that the list of keywords is this:
and abstype as case datatype else end eqtype exception do fn fun functor funsig handle if in include infix infixr lazy let local nonfix of op open overload raise rec sharing sig signature struct structure then type val where while with withtype orelse andalso
This list was extracted directly from the lexer's keyword table in the source to the SML/NJ implementation, so it should be complete and accurate. I've not categorized their meaning at all. —Donal Fellows 13:46, 19 July 2009 (UTC)
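As a rough illustration of how the keyword list above could feed a highlighter, here is a small Python sketch that picks the keywords out of a line of SML. It is not a GeSHi language file; the identifier pattern is a simplification and the names `SML_KEYWORDS` and `find_keywords` are hypothetical.

```python
import re

# Keyword list copied from the comment above.
SML_KEYWORDS = frozenset("""
and abstype as case datatype else end eqtype exception do fn fun functor
funsig handle if in include infix infixr lazy let local nonfix of op open
overload raise rec sharing sig signature struct structure then type val
where while with withtype orelse andalso
""".split())

# Simplified identifier pattern (an assumption, not the full SML lexical rules).
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")


def find_keywords(source):
    """Return the keywords found in source, in order of appearance."""
    return [w for w in WORD.findall(source) if w in SML_KEYWORDS]


print(find_keywords("fun fact n = if n = 0 then 1 else n * fact (n - 1)"))
# -> ['fun', 'if', 'then', 'else']
```

Note that OCaml-only words such as "match" are not in the set, which is the kind of difference raised at the start of this section.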
If the languages really are that similar, copy the language file, adjust it for any differences, and send it to me. I'll see that it gets added in as a proper language (and winds up in GeSHi upstream, as well.) --Short Circuit 14:47, 20 July 2009 (UTC)

Non-ASCII characters and unrecognized languages

Look at the below, then look at the source markup of this section.

A smiley face: ☺

<lang haskell>(☺) :: Mood</lang>

Have a ☺ nice day!

<lang ratatouille>Have a ☺ nice day!</lang>

The first three smiley faces come through fine, but the last is somehow corrupted. It appears that in general, if GeSHi doesn't recognize a language name, any non-ASCII characters between the lang tags get screwed up. I noticed this bug when using non-ASCII operators in Perl 6 examples. —Underscore 00:19, 7 November 2009 (UTC)
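The cause is not identified in this thread. The Python sketch below only illustrates two things under stated assumptions: a plain-text fallback for unrecognized languages that escapes markup characters without touching non-ASCII text, and one common way such corruption can arise (UTF-8 bytes reinterpreted in a single-byte encoding). Whether the latter is what happens on RC is unconfirmed, and the function name `plain_fallback` is hypothetical.

```python
import html


def plain_fallback(code):
    """Render code for an unrecognized language without highlighting.

    Only &, < and > are escaped; every other character, including
    non-ASCII ones, is passed through unchanged.
    """
    return "<pre>" + html.escape(code, quote=False) + "</pre>"


print(plain_fallback("Have a ☺ nice day!"))

# Assumption, not a diagnosis: reading UTF-8 bytes as Latin-1 turns one
# character into several unrelated ones.
garbled = "☺".encode("utf-8").decode("latin-1")
print(len(garbled))  # the single smiley has become three characters
```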

Webform-driven language-file generator

(Moved from Category talk:J.)

While we're on the subject, I would be interested in a programmatic approach to generating these files. The structure of the GeSHi language files is very, very simple; it's little more than a PHP-native serialization of a few regex strings and symbol constants. If GeSHi supported JSON for that structure, it would be trivial to import language highlighting as a JSON file, and such a file would be trivial to generate programmatically. But GeSHi doesn't, so I'm stuck with PHP files until I (or someone else) writes a JSON->PHP conversion. That said, a number of folks have sent me language files, and so are familiar with their structure. Would anyone be interested in writing a webform-driven language-file generator? For security's sake, I can't automate the import of the generated files, but it would greatly open up the process of generating the files, and perhaps make maintenance easier. I'd give it a subdomain such as geshi.rosettacode.org. --Michael Mol 03:33, 10 November 2009 (UTC)
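A JSON->PHP conversion of the kind mentioned above could be sketched roughly as follows. This is a minimal Python sketch under assumptions: it treats a language file as a nested PHP array assigned to `$language_data`, and the keys in the usage example are illustrative, not a complete or verified GeSHi definition. The function names are hypothetical.

```python
import json


def php_literal(value, indent=0):
    """Render a JSON-compatible Python value as PHP array()/scalar source."""
    pad = "    " * indent
    inner = "    " * (indent + 1)
    if isinstance(value, bool):  # must be tested before int
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Single-quoted PHP strings only need \ and ' escaped.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    if isinstance(value, list):
        items = [inner + php_literal(v, indent + 1) + "," for v in value]
    elif isinstance(value, dict):
        items = [
            inner + php_literal(str(k)) + " => " + php_literal(v, indent + 1) + ","
            for k, v in value.items()
        ]
    else:
        raise TypeError("unsupported value: %r" % (value,))
    if not items:
        return "array()"
    return "array(\n" + "\n".join(items) + "\n" + pad + ")"


def json_to_geshi_php(json_text):
    """Convert a JSON language definition into PHP source text."""
    data = json.loads(json_text)
    return "<?php\n$language_data = " + php_literal(data) + ";\n"


# Illustrative input only; the key names are assumptions.
example = '{"LANG_NAME": "Example", "KEYWORDS": {"1": ["if", "then", "else"]}}'
print(json_to_geshi_php(example))
```

Generating source text rather than importing it automatically keeps a manual review step in place, in line with the security concern raised above.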

Hm, I think I could write a CGI script (in Perl 5) to let folks fill out a form, perhaps with some minimal markup, to create a language definition. With careful use of resource limits and sanitization of the input, we could even let the user test the new definition without a local copy of PHP. But I'm not sure how using such a Web application would actually be easier or quicker than writing the literal definition. I mean, one of the things that makes writing new definitions so easy is that you can use a preexisting definition of a similar language as a starting point. Can you describe in more detail the interface you're imagining? —Underscore (Talk) 13:23, 10 November 2009 (UTC)
It's been ages since I've had a chance to look at the structure, but an editable list for each of the list-type members would be good. I don't know what to do about the regex-driven ones; a wizard would be sweet, but I don't know that that would be possible.
I could provide a MySQL backend for persistence. That would make it plausible to tweak existing support.
Not sure how far to go as far as sanitization and execution. The more secure I try to make it, the more time I'll need to spend responding to problems. --Michael Mol 16:09, 10 November 2009 (UTC)
How about this: sometime in the near future, I'll write the simplest such program that could possibly be useful. Then I'll take feature requests, or anybody who wants to can submit patches. How does that sound? Could you give me FTP access to geshi.rosettacode.org, or some other way of controlling what appears there? —Underscore (Talk) 23:28, 10 November 2009 (UTC)