Unicode strings: Difference between revisions

(Added 11l)
m (syntax highlighting fixup automation)
{{works with|ALGOL 68G|Any - tested with release [http://sourceforge.net/projects/algol68/files/algol68g/algol68g-1.18.0/algol68g-1.18.0-9h.tiny.el5.centos.fc11.i386.rpm/download 1.18.0-9h.tiny].}}
{{wont work with|ELLA ALGOL 68|Any (with appropriate job cards) - tested with release [http://sourceforge.net/projects/algol68/files/algol68toc/algol68toc-1.8.8d/algol68toc-1.8-8d.fc9.i386.rpm/download 1.8-8d] - due to extensive use of '''format'''[ted] ''transput''.}}
<syntaxhighlight lang="algol68">#!/usr/local/bin/a68g --script #
# -*- coding: utf-8 -*- #

))

)</syntaxhighlight>
{{out}}
<pre>
=={{header|Arturo}}==

<syntaxhighlight lang="rebol">text: "你好"

print ["text:" text]
print ["contains string '好'?:" contains? text "好"]
print ["contains character '平'?:" contains? text `平`]
print ["text as ascii:" as.ascii text]</syntaxhighlight>

{{out}}
'''Code example:'''
(whether this listing displays correctly will depend on your browser)
<syntaxhighlight lang="bbcbasic"> VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*FONT Times New Roman, 20
B$ += CHR$?A%
NEXT
= LEFT$(B$)</syntaxhighlight>
[[Image:unicode_bbc.gif]]


=={{header|C}}==
C is not the most Unicode-friendly language, to put it mildly. Using Unicode in C generally requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source may also be a bad idea; it is safer to use their hex values. Here is a short example of the simplest string handling: printing a string.<syntaxhighlight lang="c">#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#endif
return 0;
}</syntaxhighlight>

=={{header|C sharp|C#}}==
Default unicode strings for most implementations. Unicode chars can be used in variable and function names.
Tested in SBCL 1.2.7 and ECL 13.5.1.
<syntaxhighlight lang="lisp">
(defvar ♥♦♣♠ "♥♦♣♠")
(defun ✈ () "a plane unicode function")
</syntaxhighlight>

=={{header|D}}==
<syntaxhighlight lang="d">import std.stdio;
import std.uni; // standard package for normalization, composition/decomposition, etc..
import std.utf; // standard package for decoding/encoding, etc...

// escape sequences like what is defined in C are also allowed inside of strings and characters.
}</syntaxhighlight>

=={{header|DWScript}}==

ELENA 4.x:
<syntaxhighlight lang="elena">public program()
{
var 四十二 := "♥♦♣♠"; // UTF8 string
console.writeLine:строка;
console.writeLine:四十二;
}</syntaxhighlight>
{{out}}
<pre>
The <code>string</code> data type represents a read-only sequence of bytes, conventionally but not necessarily representing UTF-8-encoded text.
A number of built-in features interpret <code>string</code>s as UTF-8. For example,
<syntaxhighlight lang="go"> var i int
var u rune
for i, u = range "voilà" {
fmt.Println(i, u)
}</syntaxhighlight>
{{out}}
<pre>

In contrast,
<syntaxhighlight lang="go"> w := "voilà"
for i := 0; i < len(w); i++ {
fmt.Println(i, w[i])
}
</syntaxhighlight>
{{out}}
<pre>
Unicode characters can be represented directly in J strings:

<syntaxhighlight lang="j"> '♥♦♣♠'
♥♦♣♠</syntaxhighlight>

By default, they are represented as utf-8:

<syntaxhighlight lang="j"> #'♥♦♣♠'
12</syntaxhighlight>

The above string requires 12 literal elements to represent the four characters using utf-8.
However, they can be represented as utf-16 instead:

<syntaxhighlight lang="j"> 7 u:'♥♦♣♠'
♥♦♣♠
 #7 u:'♥♦♣♠'
4</syntaxhighlight>

The above string requires 4 literal elements to represent the four characters using utf-16. (7 u: string produces a utf-16 result.)
These forms are not treated as equivalent:

<syntaxhighlight lang="j"> '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</syntaxhighlight>

The utf-8 string of literals is a different string of literals from the utf-16 string.
unless the character literals themselves are equivalent:

<syntaxhighlight lang="j"> 'abcd'-:7 u:'abcd'
1</syntaxhighlight>

Here, we were dealing with ascii characters, so the four literals needed to represent the characters using utf-8 matched the four literals needed to represent the characters using utf-16.
When this is likely to be an issue, you should enforce a single representation. For example:

<syntaxhighlight lang="j"> '♥♦♣♠' -:&(7&u:) 7 u:'♥♦♣♠'
1
 '♥♦♣♠' -:&(8&u:) 7 u:'♥♦♣♠'
1</syntaxhighlight>

Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8 or utf-16 and in either case the resulting literal strings match. (8 u: string produces a utf-8 result.)
=={{header|Julia}}==
Non-ASCII strings in Julia are UTF8-encoded by default, and Unicode identifiers are also supported:
<syntaxhighlight lang="julia">julia> 四十二 = "voilà";
julia> println(四十二)
voilà</syntaxhighlight>
And you can also specify unicode characters by ordinal:
<syntaxhighlight lang="julia">julia> println("\u2708")
✈</syntaxhighlight>

=={{header|Kotlin}}==

Here's a simple example of using both unicode identifiers and unicode strings in Kotlin:
<syntaxhighlight lang="scala">// version 1.1.2

fun main(args: Array<String>) {
    val åäö = "as⃝df̅ ♥♦♣♠ 頰"
    println(åäö)
}</syntaxhighlight>

{{out}}
The following is an example of using the "any" modifier on a string literal.

<syntaxhighlight lang="langur">q:any"any code points here"</syntaxhighlight>

Indexing on a string indexes by code point. The index may be a single number, a range, or an array of such things.
Variable names cannot contain anything but ASCII.

<syntaxhighlight lang="lasso">local(unicode = '♥♦♣♠')
#unicode -> append('\u9830')
#unicode
#unicode -> get (2)
'<br />'
#unicode -> get (4) -> integer</syntaxhighlight>
{{out}}
<pre>♥♦♣♠頰

Here is an example of UTF-8 encoding:
<syntaxhighlight lang="lisp">
> (set encoded (binary ("åäö ð" utf8)))
#B(195 165 195 164 195 182 32 195 176)
</syntaxhighlight>

Display it in native Erlang format:

<syntaxhighlight lang="lisp">
> (io:format "~tp~n" (list encoded))
<<"åäö ð"/utf8>>
</syntaxhighlight>

Example UTF-8 decoding:
<syntaxhighlight lang="lisp">
> (unicode:characters_to_list encoded 'utf8)
"åäö ð"
</syntaxhighlight>

=={{header|Lingo}}==
In recent versions (since v11.5) of Lingo's only implementation, "Director", UTF-8 is the default encoding for both scripts and strings. Therefore Unicode string literals can be specified directly in the code, and variable names also support Unicode. To represent or deal with string data in other encodings, you have to use the ByteArray data type. Various ByteArray and FileIO methods support an optional 'charSet' parameter that allows transcoding data to/from UTF-8 on the fly. The supported 'charSet' strings can be displayed like this:
<syntaxhighlight lang="lingo">put _system.getInstalledCharSets()
-- ["big5", "cp1026", "cp866", "ebcdic-cp-us", "gb2312", "ibm437", "ibm737",
"ibm775", "ibm850", "ibm852", "ibm857", "ibm861", "ibm869", "iso-8859-1",
"windows-1256", "windows-1257", "windows-1258", "windows-874",
"x-ebcdic-greekmodern", "x-mac-ce", "x-mac-cyrillic", "x-mac-greek",
"x-mac-icelandic", "x-mac-turkish"]</syntaxhighlight>

=={{header|Locomotive Basic}}==
It should be added however that the character set can be easily redefined from BASIC with the SYMBOL and SYMBOL AFTER commands, so the CPC character set can be turned into e.g. Latin-1. As two-byte UTF-8 characters can be converted to Latin-1, at least a subset of Unicode can be printed in this way:

<syntaxhighlight lang="locobasic">10 CLS:DEFINT a-z
20 ' define German umlauts as in Latin-1
30 SYMBOL AFTER 196
200 ' zero-terminated UTF-8 string
210 DATA &48,&C3,&A4,&6C,&6C,&C3,&B6,&20,&4C,&C3,&BC,&64,&77,&69,&67,&2E,&20,&C3,&84,&C3,&96,&C3,&9C
220 DATA &20,&C3,&A4,&C3,&B6,&C3,&BC,&20,&56,&69,&65,&6C,&65,&20,&47,&72,&C3,&BC,&C3,&9F,&65,&21,&00</syntaxhighlight>

Produces this (slightly nonsensical) output:

<syntaxhighlight lang="m2000 interpreter">
Font "Arial"
Mode 32
القديم=10
Print القديم+1=11 ' true
</syntaxhighlight>

=={{header|Mathematica}}/{{header|Wolfram Language}}==


=={{header|Perl}}==
In Perl, "Unicode" means "UTF-8". If you want to include UTF-8 characters in your source file, unless you have set the <code>PERL_UNICODE</code> environment variable correctly, you should do<syntaxhighlight lang="perl">use utf8;</syntaxhighlight> or you risk the parser treating the file as raw bytes.

Inside the script, UTF-8 characters can be used both in identifiers and in literal strings, and built-in string functions will respect the encoding:<syntaxhighlight lang="perl">$四十二 = "voilà";
print "$四十二"; # voilà
print uc($四十二); # VOILÀ</syntaxhighlight>
or you can specify Unicode characters by name or ordinal:<syntaxhighlight lang="perl">use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</syntaxhighlight>

Regular expressions also support Unicode properties; for example, finding characters that are normally written from right to left:<syntaxhighlight lang="perl">print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</syntaxhighlight>

When it comes to IO, one should specify whether a file is to be opened in UTF-8 or raw byte mode:<syntaxhighlight lang="perl">open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</syntaxhighlight>
The default IO behavior can also be set via <code>PERL_UNICODE</code>.

=={{header|PicoLisp}}==
PicoLisp can directly handle _only_ Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: They must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in 8859 encoding:
<syntaxhighlight lang="picolisp">(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</syntaxhighlight>

=={{header|Pike}}==
writing it out.

<syntaxhighlight lang="pike">
#charset utf8
void main()
write( string_to_utf8(nånsense) );
}
</syntaxhighlight>
{{Out}}
<pre>
=={{header|Python}}==
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
<syntaxhighlight lang="python">#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))</syntaxhighlight>
In Python 3, the default source encoding is UTF-8; before that it was ASCII.
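Under Python 3, then, the coding declaration above becomes unnecessary. A minimal sketch of the default behavior (the byte count assumes the string is encoded back to UTF-8):

```python
# Python 3: source files are UTF-8 by default, so no coding comment is needed.
u = 'abcdé'
print(len(u))                  # 5 code points
print(ord(u[-1]))              # 233, the code point of 'é'
print(len(u.encode('utf-8')))  # 6 bytes: 'é' takes two bytes in UTF-8
```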


=={{header|Racket}}==

<syntaxhighlight lang="racket">
#lang racket

;; and in fact the standard language makes use of some of these
(λ(x) x) ; -> an identity function
</syntaxhighlight>

Further points:
Raku programs and strings are all in Unicode and operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:

<syntaxhighlight lang="raku" line>sub prefix:<∛> (\𝐕) { 𝐕 ** (1/3) }
say ∛27; # prints 3</syntaxhighlight>

Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.

=={{header|Ring}}==
<syntaxhighlight lang="ring">
see "Hello, World!"

ok
ring_see("Converted To (Hindi): " + cText + nl)
</syntaxhighlight>
{{out}}
<pre>
Unicode strings are no problem:

<syntaxhighlight lang="ruby">str = "你好"
str.include?("好") # => true</syntaxhighlight>

Unicode code is no problem either:

<syntaxhighlight lang="ruby">def Σ(array)
  array.inject(:+)
end

puts Σ([4,5,6]) #=>15
</syntaxhighlight>
Ruby 2.2 introduced a method to normalize unicode strings:
<syntaxhighlight lang="ruby">
p bad = "¿como\u0301 esta\u0301s?" # => "¿comó estás?"
p bad.unicode_normalized? # => false
p bad.unicode_normalize! # => "¿comó estás?"
p bad.unicode_normalized? # => true
</syntaxhighlight>

Since Ruby 2.4 Ruby strings have full Unicode case mapping.
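A small illustration of that case mapping (behavior as of Ruby 2.4+; earlier versions only case-mapped ASCII letters):

```ruby
# Full Unicode case mapping (Ruby 2.4+): non-ASCII letters are handled too.
puts "voilà".upcase    # => "VOILÀ"
puts "straße".upcase   # => "STRASSE"  (ß upcases to SS)
puts "ÇÖĞÜŞ".downcase  # => "çöğüş"
```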
=={{header|Scala}}==
{{libheader|Scala}}
<syntaxhighlight lang="scala">object UTF8 extends App {

def charToInt(s: String) = {
val a = "$abcde¢£¤¥©ÇßçIJijŁłʒλπ•₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₵←→⇒∙⌘☺☻ア字文𪚥".
map(c => "%s\t\\u%04X".format(c, c.toInt)).foreach(println)
}</syntaxhighlight>
{{out}}
<pre style="height:20ex;overflow:scroll">true true
Swift has an [https://swiftdoc.org/v5.1/type/string/ advanced string type] that defaults to i18n operations and exposes encoding through views:

<syntaxhighlight lang="swift">let flag = "🇵🇷"
print(flag.characters.count)
// Prints "1"
print(nfc == nfd) //NFx: true
print(nfc == nfkx) //NFKx: false
</syntaxhighlight>

Swift [https://forums.swift.org/t/string-s-abi-and-utf-8/17676 apparently uses a null-terminated char array] for storage to provide compatibility with C, but does a lot of work under the covers to make things more ergonomic:
=={{header|Sidef}}==
Sidef uses UTF-8 encoding for pretty much everything, such as source files, chars, strings, stdout, stderr and stdin.
<syntaxhighlight lang="ruby"># International class; name and street
class 国際( なまえ, Straße ) {

民族.each { |garçon|
    garçon.言え;
}</syntaxhighlight>
{{out}}
<pre>

Japanese test case:
<syntaxhighlight lang="txr">@{TITLE /[あ-ん一-耙]+/} (@ROMAJI/@ENGLISH)
@(freeform)
@(coll)@{STANZA /[^\n\x3000 ]+/}@(end)@/.*/
</syntaxhighlight>

Test data: Japanese traditional song:

Vala strings are UTF-8 encoded by default. In order to print them correctly on the screen, use stdout.printf instead of print.
<syntaxhighlight lang="vala">stdout.printf ("UTF-8 encoded string. Let's go to a café!");</syntaxhighlight>

=={{header|Visual Basic .NET}}==
See the C# entry for general information about the .NET runtime.
Below is an example of certain parts, based on the information in the D entry.
<syntaxhighlight lang="text">Module Module1

    Sub Main()
    End Sub

End Module</syntaxhighlight>
{{out}}
<pre>some text
WDTE supports Unicode in both identifiers and strings. WDTE is very loose about identifier rules: if it doesn't conflict with a syntactic structure, such as a keyword, literal, or operator, then it's allowed as an identifier.

<syntaxhighlight lang="wdte">let プリント t => io.writeln io.stdout t;

プリント 'これは実験です。';</syntaxhighlight>

=={{header|Wren}}==

The standard library does not support normalization but the above module does allow one to split a string into ''user perceived characters'' (or ''graphemes'').
<syntaxhighlight lang="ecmascript">var w = "voilà"
for (c in w) {
System.write("%(c) ") // prints the 5 Unicode 'characters'.
System.print(" %(zwe.bytes.count) bytes: %(zwe.bytes.toList.join(" "))")
System.print(" %(zwe.codePoints.count) code-points: %(zwe.codePoints.toList.join(" "))")
System.print(" %(Graphemes.clusterCount(zwe)) grapheme")</syntaxhighlight>

{{out}}