Unicode strings

As the world gets smaller each day, internationalization becomes more and more important.   For handling multiple languages, [[Unicode]] is your best friend.
 
It is a very capable and [https://www.youtube.com/watch?v=MijmeoH9LT4 remarkable] tool, but also quite complex compared to older single- and double-byte character encodings.
 
How well prepared is your programming language for Unicode?
*   [[Terminal control/Display an extended character]]
<br><br>
 
=={{header|11l}}==
11l source code is specified to be UTF-8 encoded.
 
All strings in 11l are UTF-16 encoded.
 
=={{header|80386 Assembly}}==
{{works with|ALGOL 68G|Any - tested with release [http://sourceforge.net/projects/algol68/files/algol68g/algol68g-1.18.0/algol68g-1.18.0-9h.tiny.el5.centos.fc11.i386.rpm/download 1.18.0-9h.tiny].}}
{{wont work with|ELLA ALGOL 68|Any (with appropriate job cards) - tested with release [http://sourceforge.net/projects/algol68/files/algol68toc/algol68toc-1.8.8d/algol68toc-1.8-8d.fc9.i386.rpm/download 1.8-8d] - due to extensive use of '''format'''[ted] ''transput''.}}
<syntaxhighlight lang="algol68">#!/usr/local/bin/a68g --script #
# -*- coding: utf-8 -*- #
 
))
 
)</syntaxhighlight>
{{out}}
<pre>
=={{header|Arturo}}==
 
<syntaxhighlight lang="rebol">text: "你好"
 
print ["text:" text]
print ["contains string '好'?:" contains? text "好"]
print ["contains character '平'?:" contains? text `平`]
print ["text as ascii:" as.ascii text]</langsyntaxhighlight>
 
{{out}}
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
 
==={{header|BBC BASIC}}===
{{works with|BBC BASIC for Windows}}
* How easy is it to present Unicode strings in source code?
'''Code example:'''
(whether this listing displays correctly will depend on your browser)
<syntaxhighlight lang="bbcbasic"> VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*FONT Times New Roman, 20
B$ += CHR$?A%
NEXT
= LEFT$(B$)</syntaxhighlight>
[[Image:unicode_bbc.gif]]
 
 
=={{header|C}}==
C is not the most Unicode-friendly language, to put it mildly. Generally, using Unicode in C requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source might be a bad idea, too; it's safer to use their hex values. Here's a short example of the simplest string handling: printing a string.<syntaxhighlight lang="c">#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#endif
return 0;
}</syntaxhighlight>
 
=={{header|C sharp|C#}}==
Most implementations use Unicode strings by default. Unicode characters can be used in variable and function names.
Tested in SBCL 1.2.7 and ECL 13.5.1
<syntaxhighlight lang="lisp">
(defvar ♥♦♣♠ "♥♦♣♠")
(defun ✈ () "a plane unicode function")
</syntaxhighlight>
 
=={{header|D}}==
<syntaxhighlight lang="d">import std.stdio;
import std.uni; // standard package for normalization, composition/decomposition, etc..
import std.utf; // standard package for decoding/encoding, etc...
 
// escape sequences like what is defined in C are also allowed inside of strings and characters.
}</syntaxhighlight>
 
=={{header|DWScript}}==
ELENA supports both UTF8 and UTF16 strings; Unicode identifiers are also supported:
 
ELENA 6.x:
<syntaxhighlight lang="elena">public program()
{
var 四十二 := "♥♦♣♠"; // UTF8 string
var строка := "Привет"w; // UTF16 string
console.writeLine(строка);
console.writeLine(四十二);
}</syntaxhighlight>
{{out}}
<pre>
 
=={{header|Elixir}}==
Elixir has exceptionally good Unicode support in Strings. Its String module is fully compliant with the Unicode Standard, version 6.3.0. Internally, Strings are encoded in UTF-8. As source files are also typically Unicode encoded, String literals can be either written directly or via escape sequences. However, non-ASCII Unicode identifiers (variables, functions, ...) are not allowed.
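
To illustrate the points above, here is a small sketch (not part of the original task entry) using only the standard <code>String</code> module:

<syntaxhighlight lang="elixir">s = "héllo wörld"

String.length(s)       # => 11 (graphemes, not bytes)
byte_size(s)           # => 13 (UTF-8 bytes)
String.codepoints("é") # => ["é"]
String.upcase(s)       # => "HÉLLO WÖRLD" (full Unicode case mapping)</syntaxhighlight>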
 
=={{header|Erlang}}==
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
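
To illustrate the data side, here is a minimal shell sketch (not part of the original entry), assuming a UTF-8 terminal:

<syntaxhighlight lang="erlang">%% Binaries can hold UTF-8 encoded text via the /utf8 suffix:
1> Bin = <<"åäö"/utf8>>.
<<195,165,195,164,195,182>>
%% ~ts treats its argument as Unicode text:
2> io:format("~ts~n", [Bin]).
åäö
%% Convert between UTF-8 binaries and lists of code points:
3> unicode:characters_to_list(Bin, utf8).
[229,228,246]</syntaxhighlight>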
 
=={{header|FreeBASIC}}==
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
 
* How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These files can be freely mixed with other source or header files in the same project.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet; you just need to use a text editor that supports one of the mentioned Unicode formats.
 
* How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings can be more complicated because many string functions become more complex.
 
* Is it good at input/output with Unicode?
The <code>Open</code> function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier (a file-reading sketch appears after the example below).
The <code>Input#</code> and <code>Line Input#</code> functions, as well as <code>Print#</code> and <code>Write#</code>, can be used normally, and any conversion between Unicode and ASCII is done automatically if necessary. The <code>Print</code> function also supports Unicode output.
 
* Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in a string, it does not support dynamic strings. However, there are some libraries included with FreeBASIC to decode UTF-8 to wstring.
 
* How broad/deep does the language support Unicode?
Unicode support in FreeBASIC is quite extensive, but not as deep as in other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
 
* Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
 
For example, <syntaxhighlight lang="vbnet">' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
 
' Print the Unicode string
Print unicodeString
 
' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>
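
As a companion sketch (not part of the original entry), reading a UTF-8 text file via the <code>Encoding</code> clause of <code>Open</code>; the file name is hypothetical, and the encoding string ("utf8") follows the FreeBASIC manual's spelling:

<syntaxhighlight lang="vbnet">' Read a UTF-8 encoded text file line by line into a wide string
Dim lineText As WString * 256

Open "example_utf8.txt" For Input Encoding "utf8" As #1
Do Until Eof(1)
    Line Input #1, lineText  ' decoded from UTF-8 automatically
    Print lineText
Loop
Close #1</syntaxhighlight>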
 
 
=={{header|Go}}==
The <code>string</code> data type represents a read-only sequence of bytes, conventionally but not necessarily representing UTF-8-encoded text.
A number of built-in features interpret <code>string</code>s as UTF-8. For example,
<syntaxhighlight lang="go"> var i int
var u rune
for i, u = range "voilà" {
fmt.Println(i, u)
}</syntaxhighlight>
{{out}}
<pre>
 
In contrast,
<syntaxhighlight lang="go"> w := "voilà"
for i := 0; i < len(w); i++ {
fmt.Println(i, w[i])
}
</syntaxhighlight>
{{out}}
<pre>
Unicode characters can be represented directly in J strings:
 
<syntaxhighlight lang="j"> '♥♦♣♠'
♥♦♣♠</syntaxhighlight>
 
By default, they are represented as utf-8:
 
<syntaxhighlight lang="j"> #'♥♦♣♠'
12</syntaxhighlight>
 
The above string requires 12 literal elements to represent the four characters using utf-8.
However, they can be represented as utf-16 instead:
 
<syntaxhighlight lang="j"> 7 u:'♥♦♣♠'
♥♦♣♠
#7 u:'♥♦♣♠'
4</syntaxhighlight>
 
The above string requires 4 literal elements to represent the four characters using utf-16. (7 u: string produces a utf-16 result.)
These forms are not treated as equivalent:
 
<syntaxhighlight lang="j"> '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</syntaxhighlight>
 
The utf-8 string of literals is a different string of literals from the utf-16 string.
unless the character literals themselves are equivalent:
 
<syntaxhighlight lang="j"> 'abcd'-:7 u:'abcd'
1</syntaxhighlight>
 
Here, we were dealing with ascii characters, so the four literals needed to represent the characters using utf-8 matched the four literals needed to represent the characters using utf-16.
When this is likely to be an issue, you should enforce a single representation. For example:
 
<syntaxhighlight lang="j"> '♥♦♣♠' -:&(7&u:) 7 u:'♥♦♣♠'
1
'♥♦♣♠' -:&(8&u:) 7 u:'♥♦♣♠'
1</syntaxhighlight>
 
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8, utf-16 or utf-32, and the resulting literal strings would match. (8 u: string produces a utf-8 result.)
 
Output uses characters in whatever format they happen to be in.
=={{header|Julia}}==
Non-ASCII strings in Julia are UTF8-encoded by default, and Unicode identifiers are also supported:
<syntaxhighlight lang="julia">julia> 四十二 = "voilà";
julia> println(四十二)
voilà</syntaxhighlight>
And you can also specify unicode characters by ordinal:
<syntaxhighlight lang="julia">julia> println("\u2708")
✈</syntaxhighlight>
 
=={{header|Kotlin}}==
 
Here's a simple example of using both unicode identifiers and unicode strings in Kotlin:
<syntaxhighlight lang="scala">// version 1.1.2
 
fun main(args: Array<String>) {
val åäö = "as⃝df̅ ♥♦♣♠ 頰"
println(åäö)
}</syntaxhighlight>
 
{{out}}
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.

Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. The s2cp() function accepts a single index number or range, returning a single code point number or a list of them; the cp2s() function accepts a single code point or a list and returns a string. Conversion between UTF-8 byte lists and langur strings can be done with the b2s() and s2b() functions.
 
The len() function returns the number of code points in a string.
 
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.
Variable names cannot contain anything but ASCII.
 
<syntaxhighlight lang="lasso">local(unicode = '♥♦♣♠')
#unicode -> append('\u9830')
#unicode
#unicode -> get (2)
'<br />'
#unicode -> get (4) -> integer</syntaxhighlight>
{{out}}
<pre>♥♦♣♠頰
 
Here is an example of UTF-8 encoding:
<syntaxhighlight lang="lisp">
> (set encoded (binary ("åäö ð" utf8)))
#B(195 165 195 164 195 182 32 195 176)
</syntaxhighlight>
 
Display it in native Erlang format:
 
<syntaxhighlight lang="lisp">
> (io:format "~tp~n" (list encoded))
<<"åäö ð"/utf8>>
</syntaxhighlight>
 
Example UTF-8 decoding:
<syntaxhighlight lang="lisp">
> (unicode:characters_to_list encoded 'utf8)
"åäö ð"
</syntaxhighlight>
 
=={{header|Lingo}}==
In recent versions (since v11.5) of Lingo's only implementation, "Director", UTF-8 is the default encoding for both scripts and strings. Therefore Unicode string literals can be specified directly in the code, and variable names support Unicode as well. To represent and deal with string data in other encodings, you have to use the ByteArray data type. Various ByteArray as well as FileIO methods support an optional 'charSet' parameter that allows transcoding data to/from UTF-8 on the fly. The supported 'charSet' strings can be displayed like this:
<syntaxhighlight lang="lingo">put _system.getInstalledCharSets()
-- ["big5", "cp1026", "cp866", "ebcdic-cp-us", "gb2312", "ibm437", "ibm737",
"ibm775", "ibm850", "ibm852", "ibm857", "ibm861", "ibm869", "iso-8859-1",
"windows-1256", "windows-1257", "windows-1258", "windows-874",
"x-ebcdic-greekmodern", "x-mac-ce", "x-mac-cyrillic", "x-mac-greek",
"x-mac-icelandic", "x-mac-turkish"]</syntaxhighlight>
 
=={{header|Locomotive Basic}}==
It should be added however that the character set can be easily redefined from BASIC with the SYMBOL and SYMBOL AFTER commands, so the CPC character set can be turned into e.g. Latin-1. As two-byte UTF-8 characters can be converted to Latin-1, at least a subset of Unicode can be printed in this way:
 
<syntaxhighlight lang="locobasic">10 CLS:DEFINT a-z
20 ' define German umlauts as in Latin-1
30 SYMBOL AFTER 196
200 ' zero-terminated UTF-8 string
210 DATA &48,&C3,&A4,&6C,&6C,&C3,&B6,&20,&4C,&C3,&BC,&64,&77,&69,&67,&2E,&20,&C3,&84,&C3,&96,&C3,&9C
220 DATA &20,&C3,&A4,&C3,&B6,&C3,&BC,&20,&56,&69,&65,&6C,&65,&20,&47,&72,&C3,&BC,&C3,&9F,&65,&21,&00</syntaxhighlight>
 
Produces this (slightly nonsensical) output:
 
[[File:Unicode print locomotive basic.png]]
 
=={{header|Lua}}==
 
By default, Lua doesn't support Unicode. Most string methods work properly on the ASCII range only, like [[String case#Lua|case transformation]]. But there is a [https://www.lua.org/manual/5.4/manual.html#6.5 <code>utf8</code>] module that adds some very basic support with a very limited number of functions. For example, this module brings a new [[String length#Lua|length method]] adapted to UTF-8. But there is no method to transform the case of Unicode strings correctly. So overall, Unicode support is very limited and not the default.
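
To illustrate, a small sketch (not part of the original entry) of what the <code>utf8</code> module does provide in Lua 5.3 and later:

<syntaxhighlight lang="lua">local s = "voilà"

print(#s)          --> 6 (raw byte length)
print(utf8.len(s)) --> 5 (number of code points)

-- iterate over code points: byte position and numeric code point
for pos, cp in utf8.codes(s) do
  print(pos, cp)
end

-- case transformation stays byte-oriented and misses non-ASCII letters
print(string.upper(s)) --> VOILà</syntaxhighlight>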
 
=={{header|M2000 Interpreter}}==
 
 
<syntaxhighlight lang="m2000 interpreter">
Font "Arial"
Mode 32
القديم=10
Print القديم+1=11 ' true
</syntaxhighlight>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
 
=={{header|Perl}}==
In Perl, "Unicode" means "UTF-8". If you want to include utf8 characters in your source file, unless you have set the <code>PERL_UNICODE</code> environment variable correctly, you should do<syntaxhighlight lang="perl">use utf8;</syntaxhighlight> or you risk the parser treating the file as raw bytes.
 
Inside the script, utf8 characters can be used both as identifiers and in literal strings, and built-in string functions will respect them:<syntaxhighlight lang="perl">$四十二 = "voilà";
print "$四十二"; # voilà
print uc($四十二); # VOILÀ</syntaxhighlight>
or you can specify unicode characters by name or ordinal:<syntaxhighlight lang="perl">use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</syntaxhighlight>
 
Regular expressions also have support for Unicode based on properties, for example, finding characters that are normally written from right to left:<syntaxhighlight lang="perl">print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</syntaxhighlight>
 
When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<syntaxhighlight lang="perl">open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</syntaxhighlight>
The default IO behavior can also be set via <code>PERL_UNICODE</code>.
 
=={{header|PicoLisp}}==
PicoLisp can directly handle ''only'' Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: they must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in 8859 encoding:
<syntaxhighlight lang="picolisp">(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</syntaxhighlight>
 
=={{header|Pike}}==
writing it out.
 
<syntaxhighlight lang="pike">
#charset utf8
void main()
write( string_to_utf8(nånsense) );
}
</syntaxhighlight>
{{Out}}
<pre>
λ ä
</pre>
 
=={{header|PowerShell}}==
 
Unicode escape sequence (added in PowerShell 6<ref>https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-7.3</ref>):
 
<syntaxhighlight lang="powershell">
# `u{x}
"I`u{0307}" # => İ
</syntaxhighlight>
 
=={{header|Python}}==
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
<syntaxhighlight lang="python">#!/usr/bin/env python
# -*- coding: latin-1 -*-
 
u = 'abcdé'
print(ord(u[-1]))</syntaxhighlight>
In Python 3, the default encoding is UTF-8. Before that it was ASCII.
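
A brief sketch of Python 3's native Unicode handling (not from the original entry), using only the standard library:

<syntaxhighlight lang="python">import unicodedata

s = "abcdé"              # str is a sequence of code points
print(len(s))            # 5
print(ord(s[-1]))        # 233
print(s.encode("utf-8")) # b'abcd\xc3\xa9' (explicit encoding to bytes)

# Normalization: combining sequence -> composed form
nfc = unicodedata.normalize("NFC", "e\u0301")
print(nfc == "é", len(nfc))  # True 1</syntaxhighlight>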
 
=={{header|Racket}}==
 
<syntaxhighlight lang="racket">
#lang racket
 
Line 1,203 ⟶ 1,254:
;; and in fact the standard language makes use of some of these
(λ(x) x) ; -> an identity function
</syntaxhighlight>
 
Further points:
(formerly Perl 6)
 
Raku programs and strings are all in Unicode and operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:
 
<syntaxhighlight lang="raku" line>sub prefix:<∛> (\𝐕) { 𝐕 ** (1/3) }
say ∛27; # prints 3</syntaxhighlight>
 
Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.
 
Raku tracks the Unicode consortium standards releases and is generally up to the latest standard within a few months or so of its release (currently at 15.0 as of February 2023).
 
* Supports the normalized forms NFC, NFD, NFKC, and NFKD, and character equivalence as specified in [http://unicode.org/reports/tr15/ Unicode technical report #15].
Line 1,234 ⟶ 1,284:
* Works seamlessly with upper plane and private use plane character codepoints.
* Provides tools to deal with strings that contain invalid Unicode characters.
 
 
In general, it tries to make dealing with Unicode "just work".
 
Raku intends to support Unicode even better than Perl 5, which already does a great job in recent versions of exposing large swaths of the Unicode specification's functionality. Raku improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.
 
A very important distinctive characteristic of Raku to keep in mind is that it applies normalization (Unicode NFC form (Normalization Form Canonical)) automatically by default to all strings as showcased and explained on the [[String comparison#Unicode_normalization_by_default|String comparison page]].
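
For instance, a small sketch (not part of the original entry) of the default NFC normalization and the grapheme-level view:

<syntaxhighlight lang="raku">my $s = "e\x[0301]"; # 'e' followed by COMBINING ACUTE ACCENT
say $s.chars;        # 1 -- one grapheme
say $s.codes;        # 1 -- normalized to the composed 'é' (U+00E9)
say $s eq "é";       # True -- canonically equivalent strings compare equal</syntaxhighlight>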
 
=={{header|REXX}}==
 
=={{header|Ring}}==
<syntaxhighlight lang="ring">
see "Hello, World!"
 
Line 1,276 ⟶ 1,327:
ok
ring_see("Converted To (Hindi): " + cText + nl)
</syntaxhighlight>
{{out}}
<pre>
Unicode strings are no problem:
 
<syntaxhighlight lang="ruby">str = "你好"
str.include?("好") # => true</syntaxhighlight>
 
Unicode code is no problem either:
 
<syntaxhighlight lang="ruby">def Σ(array)
array.inject(:+)
end
 
puts Σ([4,5,6]) #=>15
</syntaxhighlight>
Ruby 2.2 introduced a method to normalize unicode strings:
<syntaxhighlight lang="ruby">
p bad = "¿como\u0301 esta\u0301s?" # => "¿comó estás?"
p bad.unicode_normalized? # => false
p bad.unicode_normalize! # => "¿comó estás?"
p bad.unicode_normalized? # => true
</syntaxhighlight>
 
Since Ruby 2.4 Ruby strings have full Unicode case mapping.
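For instance (a brief sketch, not part of the original entry):

<syntaxhighlight lang="ruby">p "straße".upcase   # => "STRASSE"
p "Türkçe".downcase # => "türkçe"</syntaxhighlight>
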
=={{header|Scala}}==
{{libheader|Scala}}
<syntaxhighlight lang="scala">object UTF8 extends App {
 
def charToInt(s: String) = {
val a = "$abcde¢£¤¥©ÇßçIJijŁłʒλπ•₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₵←→⇒∙⌘☺☻ア字文𪚥".
map(c => "%s\t\\u%04X".format(c, c.toInt)).foreach(println)
}</syntaxhighlight>
{{out}}
<pre style="height:20ex;overflow:scroll">true true
Swift has an [https://swiftdoc.org/v5.1/type/string/ advanced string type] that defaults to i18n operations and exposes encoding through views:
 
<syntaxhighlight lang="swift">let flag = "🇵🇷"
print(flag.count)
// Prints "1"
print(nfc == nfd) //NFx: true
print(nfc == nfkx) //NFKx: false
</syntaxhighlight>
 
Swift [https://forums.swift.org/t/string-s-abi-and-utf-8/17676 apparently uses a null-terminated char array] for storage to provide compatibility with C, but does a lot of work under the covers to make things more ergonomic:
=={{header|Sidef}}==
Sidef uses UTF-8 encoding for pretty much everything, such as source files, chars, strings, stdout, stderr and stdin.
<syntaxhighlight lang="ruby"># International class; name and street
class 国際( なまえ, Straße ) {
 
民族.each { |garçon|
garçon.言え;
}</syntaxhighlight>
{{out}}
<pre>
 
Japanese test case:
<syntaxhighlight lang="txr">@{TITLE /[あ-ん一-耙]+/} (@ROMAJI/@ENGLISH)
@(freeform)
@(coll)@{STANZA /[^\n\x3000 ]+/}@(end)@/.*/
</syntaxhighlight>
 
Test data: Japanese traditional song:
 
Vala strings are UTF-8 encoded by default. In order to print them correctly on the screen, use stdout.printf instead of print.
<syntaxhighlight lang="vala">stdout.printf ("UTF-8 encoded string. Let's go to a café!");</syntaxhighlight>
 
=={{header|Visual Basic .NET}}==
See the C# entry for some general information about the .NET runtime.
Below is an example of certain parts based on the information in the D entry.
<syntaxhighlight lang="vbnet">Module Module1
 
Sub Main()
Line 1,648 ⟶ 1,699:
End Sub
 
End Module</syntaxhighlight>
{{out}}
<pre>some text
Line 1,660 ⟶ 1,711:
WDTE supports Unicode in both identifiers and strings. WDTE is very loose about identifier rules: if it doesn't conflict with a syntactic structure, such as a keyword, literal, or operator, then it's allowed as an identifier.
 
<syntaxhighlight lang="wdte">let プリント t => io.writeln io.stdout t;
 
プリント 'これは実験です。';</syntaxhighlight>
 
=={{header|Wren}}==
 
The standard library does not support normalization but the above module does allow one to split a string into ''user perceived characters'' (or ''graphemes'').
<syntaxhighlight lang="wren">import "./upc" for Graphemes
 
var w = "voilà"
for (c in w) {
System.write("%(c) ") // prints the 5 Unicode 'characters'.
System.print(" %(zwe.bytes.count) bytes: %(zwe.bytes.toList.join(" "))")
System.print(" %(zwe.codePoints.count) code-points: %(zwe.codePoints.toList.join(" "))")
System.print(" %(Graphemes.clusterCount(zwe)) grapheme")</langsyntaxhighlight>
 
{{out}}
 
* How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings. A decoder and output routine would need to be written, but this is easy to do on the Spectrum.
 
=={{header|Zig}}==
 
The encoding of a string in Zig is de-facto assumed to be UTF-8. Because Zig source code is UTF-8 encoded, any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string in the Zig program; the bytes are not modified by the compiler. However, it is possible to embed non-UTF-8 bytes into a string literal using \xNN notation.<ref>[https://ziglang.org/documentation/master/#String-Literals-and-Unicode-Code-Point-Literals Zig Documentation - String Literals and Unicode Code Point Literals]</ref>
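
To illustrate both behaviors, a small sketch (not part of the original entry), assuming a recent Zig toolchain for the <code>std.unicode</code> API:

<syntaxhighlight lang="zig">const std = @import("std");

pub fn main() !void {
    const s = "voilà"; // non-ASCII bytes keep their UTF-8 meaning
    std.debug.print("{d} bytes\n", .{s.len}); // 6 bytes for 5 code points

    // Iterate over the code points of the UTF-8 string.
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X:0>4}\n", .{cp});
    }

    const raw = "\xff\xfe"; // non-UTF-8 bytes embedded via \xNN notation
    std.debug.print("{d}\n", .{raw.len}); // 2
}</syntaxhighlight>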
 
 
 
 
{{omit from|GUISS}}