Unicode strings: Difference between revisions

Line 793:
 
=={{header|langur}}==
Source code in langur is UTF-8 with no BOM allowed.
 
Comments and string literals may use Unicode.
 
A string or regex literal using an "any" modifier may include any code point directly, without an escape sequence. Without that modifier, literals are restricted to Graphic and Space characters.
 
For clarity, identifiers are ASCII only.
Line 801 ⟶ 803:
Indexing on a string indexes by code point. The index may be a single number, a range, or an array of such things.
 
Conversion between code point numbers and strings can be done with the cp2s() and s2cp() functions. The s2cp() function accepts a single index number or range, returning a single code point number or an array of them. The s2s() function returns a string instead (while allowing you to index by code points). The cp2s() function accepts a single code point or an array and returns a string.
 
Conversion between UTF-8 byte arrays and langur strings can be done with the b2s() and s2b() functions.
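
For readers unfamiliar with the distinction between code points and UTF-8 bytes, the round trips described above can be sketched in Python. This is an analogue only, not langur code: ord(), chr(), encode(), and decode() are Python built-ins that loosely correspond to the roles of s2cp(), cp2s(), s2b(), and b2s(), and the sample string is chosen purely for illustration.

```python
# Python analogue (not langur): code points versus UTF-8 bytes.
text = "na\u00efve \u2603"  # "naïve ☃", written with escapes to fix the exact code points

# String -> list of code point numbers (roughly the role of s2cp()).
code_points = [ord(ch) for ch in text]

# Code point numbers -> string (roughly the role of cp2s()).
rebuilt = "".join(chr(cp) for cp in code_points)

# String -> UTF-8 bytes and back (roughly the roles of s2b() and b2s()).
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

# Indexing a Python str is by code point (0-based here), not by byte.
third = text[2]

print(len(code_points), len(utf8_bytes))
```

The string has 7 code points but 10 UTF-8 bytes, because U+00EF takes two bytes and U+2603 takes three. This is why indexing by code point and indexing by byte give different results.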
Line 808 ⟶ 810:
 
Normalization can be handled with the functions nfc(), nfd(), nfkc(), and nfkd().
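
As an illustration of what the four normalization forms do, here is a minimal sketch using Python's standard unicodedata module. It is a Python analogue, not langur code; the form names passed to unicodedata.normalize() correspond to the langur function names listed above only by convention.

```python
import unicodedata

# Python analogue (not langur): the four Unicode normalization forms.
composed = "\u00e9"        # é as a single precomposed code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"        # LATIN SMALL LIGATURE FI

# Canonical decomposition and composition.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compatibility forms also replace compatibility characters such as ligatures.
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature)

print(len(nfd), len(nfc), nfkc, nfkd)
```

The canonical forms (NFC, NFD) leave the ligature unchanged, while the compatibility forms (NFKC, NFKD) replace it with the two letters "fi".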
 
Using a "for of" loop over a string gives the code point indices, and using a "for in" loop over a string gives the code point numbers.
 
See langurlang.org for more details.