Unicode strings: Difference between revisions

(Added FreeBASIC)
Line 856:
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
A string or regex literal using an "any" modifier may include any code point without an escape sequence. Without the modifier, such literals are restricted to Graphic, Space, and Private Use Area code points, plus a select set of invisible spaces. Restricting the "allowed" characters keeps hidden text or codes out of source code and helps prevent confusion and deception.
 
The following is an example of using the "any" modifier on a string literal.
 
<syntaxhighlight lang="langur">q:any"any code points here"</syntaxhighlight>
 
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with the b2s() and s2b() functions.
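 
The conversions described above can be sketched in Python as a rough analogy. This is not langur code, and the mapping of each step to cp2s(), s2cp(), b2s(), and s2b() is an assumption based only on the function names given here. Grapheme conversion (s2gc()) is not shown, since the Python standard library has no grapheme segmentation.

```python
# Python analogy only; none of these calls are langur functions.
text = "na\u00efve \u2713"  # "naïve ✓" with a precomposed ï

# String -> list of code point numbers (assumed to resemble s2cp()).
code_points = [ord(ch) for ch in text]

# Code point numbers -> string (assumed to resemble cp2s()).
rebuilt = "".join(chr(cp) for cp in code_points)

# String -> list of UTF-8 byte values and back (assumed to resemble s2b() and b2s()).
utf8_bytes = list(text.encode("utf-8"))
from_bytes = bytes(utf8_bytes).decode("utf-8")

# "e" followed by a combining acute accent: two code points, one visible grapheme.
combining = "e\u0301"
combining_length = len(combining)  # counts code points, not graphemes
```

The sample string has 7 code points but 10 UTF-8 bytes, which is why code point lists and byte lists need separate conversions.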
Line 875 ⟶ 869:
 
A "for of" loop over a string gives the code point indices, and a "for in" loop over a string gives the code point numbers.
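 
The difference between iterating over indices and iterating over code point numbers can be illustrated with a Python analogy. This is not langur syntax. Python indices start at 0, and the index base in langur may differ.

```python
# Python analogy only; langur's "for of" and "for in" loops are not shown.
text = "a\u00e9\u2713"  # "aé✓"

# Iterating over positions, comparable to looping over code point indices.
indices = [i for i in range(len(text))]

# Iterating over values, comparable to looping over code point numbers.
numbers = [ord(ch) for ch in text]
```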
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.