Unicode strings: Difference between revisions

As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend.
 
It is a very capable and [https://www.youtube.com/watch?v=MijmeoH9LT4 remarkable] tool, but also quite complex compared to older single- and double-byte character encodings.
 
How well prepared is your programming language for Unicode?
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
 
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
* How easy is it to present Unicode strings in source code?
ELENA supports both UTF-8 and UTF-16 strings; Unicode identifiers are also supported:
 
ELENA 6.x:
<syntaxhighlight lang="elena">public program()
{
var строка := "Привет"w; // UTF16 string
console.writeLine(строка);
console.writeLine(四十二);
}</syntaxhighlight>
{{out}}
=={{header|Erlang}}==
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
 
=={{header|FreeBASIC}}==
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
 
* How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These files can be freely mixed with other source or header files in the same project.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet; you just need a text editor that supports one of the mentioned Unicode formats.
 
* How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings can be more complicated because many string functions become more complex.
 
* Is it good at input/output with Unicode?
The <code>Open</code> function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier.
The <code>Input#</code> and <code>Line Input#</code> functions as well as <code>Print#</code> <code>Write#</code> can be used normally, and any conversion between Unicode and ASCII is done automatically if necessary. The <code>Print</code> function also supports Unicode output.
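The automatic conversion that encoding-aware file I/O performs can be sketched in Python for comparison (this is an illustration, not FreeBASIC code; the filename is arbitrary):

<syntaxhighlight lang="python">import os
import tempfile

# Like FreeBASIC's Open with an encoding specifier, Python's open()
# converts between the on-disk encoding and the program's internal
# string type automatically.
path = os.path.join(tempfile.gettempdir(), "unicode_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("héllo wörld\n")
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
print(text.strip())  # héllo wörld
os.remove(path)</syntaxhighlight>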
 
* Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in a string, it does not support dynamic wide strings. However, some libraries included with FreeBASIC can decode UTF-8 to wstring.
 
* How broad/deep is the language's Unicode support?
Unicode support in FreeBASIC is quite extensive, but not as deep as in some other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
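As a cross-language illustration (Python here, not FreeBASIC), the same short string occupies a different number of bytes in each of these encodings:

<syntaxhighlight lang="python"># One 3-byte UTF-8 character (€) plus two ASCII digits.
s = "€42"

# Encode the same text in each of the encodings listed above.
for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    data = s.encode(enc)
    print(enc, len(data), data.hex())</syntaxhighlight>

UTF-8 uses 5 bytes here, the UTF-16 variants 6, and the UTF-32 variants 12.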
 
* Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
 
For example, <syntaxhighlight lang="freebasic">' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
 
' Print the Unicode string
Print unicodeString
 
' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>
 
 
=={{header|Go}}==
1</syntaxhighlight>
 
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8, utf-16 or utf-32, and in each case the resulting literal strings would match. (8 u: string produces a utf-8 result.)
 
Output uses characters in whatever format they happen to be in.
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
A string or regex literal using an "any" modifier may include any code point (without using an escape sequence). Otherwise, they are restricted to Graphic, Space, and Private Use Area code points, and a select set of invisible spaces. The idea around the "allowed" characters is to keep source code from having hidden text or codes and to allay confusion and deception.
 
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with the b2s() and s2b() functions.
The following is an example of using the "any" modifier on a string literal.
 
<syntaxhighlight lang="langur">q:any"any code points here"</syntaxhighlight>
 
The s2cp() function accepts a single index number or range, returning a single code point number or a list of them. The s2s() function returns a string instead (while allowing you to index by code points). The cp2s() function accepts a single code point or a list and returns a string.
 
The len() function returns the number of code points in a string.
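The distinction between bytes, code points, and graphemes that this counting rule rests on can be sketched in Python for comparison (Python's len() also counts code points):

<syntaxhighlight lang="python">import unicodedata

s = "voila\u0300"  # "voilà" written with a combining grave accent
print(len(s))                  # 6 code points
print(len(s.encode("utf-8")))  # 7 bytes: the combining mark takes 2 bytes

# After NFC normalization, a + combining grave compose into one à.
nfc = unicodedata.normalize("NFC", s)
print(len(nfc))                # 5 code points</syntaxhighlight>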
 
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.
λ ä
</pre>
 
=={{header|PowerShell}}==
 
Unicode escape sequence (added in PowerShell 6<ref>https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-7.3</ref>):
 
<syntaxhighlight lang="powershell">
# `u{x}
"I`u{0307}" # => İ
</syntaxhighlight>
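Other languages use a similar escape syntax; for comparison, a Python sketch of the same example (PowerShell's <code>`u{x}</code> corresponds to Python's <code>\uXXXX</code>):

<syntaxhighlight lang="python"># Capital I followed by U+0307 COMBINING DOT ABOVE.
s = "I\u0307"
print(s)          # renders as İ
print(len(s))     # 2 code points, even though it displays as one character</syntaxhighlight>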
 
=={{header|Python}}==
Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.
 
Raku tracks the Unicode consortium standards releases and is generally up to the latest standard within a few months or so of its release. (currently at 15.0 as of February 2023)
As of the latest available version (2022.12) of Rakudo (the Raku compiler), the official [https://docs.raku.org/language/unicode Raku documentation about Unicode support] says:
 
<blockquote>
Raku has a high level of support of Unicode, with the latest version supporting Unicode 12.1.
</blockquote>
 
So Unicode 13.0, 14.0 and 15.0 are not yet supported (or the documentation is outdated).
 
However, Raku supports the following Unicode features:
 
* Supports the normalized forms NFC, NFD, NFKC, and NFKD, and character equivalence as specified in [http://unicode.org/reports/tr15/ Unicode technical report #15].
In general, it tries to make dealing with Unicode "just work".
 
Raku intends to support Unicode even better than Perl 5, which already does a great job in recent versions of accessing large swaths of Unicode spec. functionality. Raku improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.
 
A very important distinctive characteristic of Raku to keep in mind is that it applies Unicode NFC (Normalization Form Canonical) normalization automatically by default to all strings, as showcased and explained on the [[String comparison#Unicode_normalization_by_default|String comparison page]].
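What NFC-by-default buys can be sketched in Python, where normalization is explicit rather than automatic:

<syntaxhighlight lang="python">import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# Python compares code point sequences, so these differ:
print(composed == decomposed)  # False

# After explicit NFC normalization (which Raku applies for you), they match:
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True</syntaxhighlight>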
 
=={{header|REXX}}==
 
The standard library does not support normalization but the above module does allow one to split a string into ''user perceived characters'' (or ''graphemes'').
<syntaxhighlight lang="wren">import "./upc" for Graphemes
 
var w = "voilà"