Unicode strings: Difference between revisions

add link to a Tom Scott video
 
(2 intermediate revisions by 2 users not shown)
Line 2:
As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend.
 
It is a very capable and [https://www.youtube.com/watch?v=MijmeoH9LT4 remarkable] tool, but also quite complex compared to older single- and double-byte character encodings.
 
How well prepared is your programming language for Unicode?
Line 640:
=={{header|Erlang}}==
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
 
=={{header|FreeBASIC}}==
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
 
* How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These files can be freely mixed with other source or header files in the same project.
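
A minimal sketch of the escape-sequence route, assuming the default <code>fb</code> dialect, where the <code>!</code> prefix enables escape processing in a string literal. The variable name and buffer size are illustrative choices; the source file itself can stay plain ASCII.
<syntaxhighlight lang="vbnet">' Five hiragana characters written as \u escapes,
' so no non-ASCII bytes appear in the source file
Dim As WString * 8 greeting = !"\u3053\u3093\u306B\u3061\u306F"

' Len counts wide characters, not bytes
Print Len(greeting)

' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>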
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet; you just need a text editor that supports one of the Unicode formats mentioned above.
 
* How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings takes more care, because many string functions are more complicated to use with them.
 
* Is it good at input/output with Unicode?
The <code>Open</code> function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier.
The <code>Input#</code> and <code>Line Input#</code> functions, as well as <code>Print#</code> and <code>Write#</code>, can be used normally, and any conversion between Unicode and ASCII is done automatically if necessary. The <code>Print</code> function also supports Unicode output.
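
The file-handling calls described above might be combined as in the following sketch. The file name <code>unicode_demo.txt</code> and the variable names are hypothetical, and the text is written with <code>\u</code> escapes so that the example does not depend on the encoding of the source file.
<syntaxhighlight lang="vbnet">' Text to write, plus a buffer for reading it back (names are illustrative)
Dim As WString * 16 written = !"\u3053\u3093\u306B\u3061\u306F"
Dim As WString * 16 readBack
Dim As Integer f = FreeFile

' Write the wide string; the Encoding specifier makes Print# emit UTF-8
If Open("unicode_demo.txt" For Output Encoding "utf8" As #f) = 0 Then
    Print #f, written
    Close #f
End If

' Read the line back; Line Input# converts the UTF-8 text to a wide string
f = FreeFile
If Open("unicode_demo.txt" For Input Encoding "utf8" As #f) = 0 Then
    Line Input #f, readBack
    Close #f
End If

' Compare what was read with what was written
If readBack = written Then Print "round trip OK" Else Print "round trip failed"

' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>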
 
* Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in strings through the <code>WString</code> type, it does not support dynamic (variable-length) wide strings. However, some libraries included with FreeBASIC can decode UTF-8 to <code>WString</code>.
 
* How broad/deep does the language support Unicode?
Unicode support in FreeBASIC is quite extensive, but not as deep as in other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
 
* Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
 
For example, <syntaxhighlight lang="vbnet">' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
 
' Print the Unicode string
Print unicodeString
 
' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>
 
 
=={{header|Go}}==
Line 817 ⟶ 856:
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
A string or regex literal using an "any" modifier may include any code point (without using an escape sequence). Otherwise, literals are restricted to Graphic, Space, and Private Use Area code points, and a select set of invisible spaces. The idea behind the "allowed" characters is to keep source code free of hidden text or codes and to reduce confusion and deception.
 
The following is an example of using the "any" modifier on a string literal.
 
<syntaxhighlight lang="langur">q:any"any code points here"</syntaxhighlight>
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
 
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with b2s() and s2b() functions.
Line 836 ⟶ 869:
 
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.