Unicode strings: Difference between revisions

(→‎{{header|Wren}}: Added a upc example.)
(Add comment for Rust)
Line 523: Line 523:


int main() {
int main() {
/* Set the locale to alert C's multibyte output routines */
/* Set the locale to alert C's multibyte output routines */
if (!setlocale(LC_CTYPE, "")) {
if (!setlocale(LC_CTYPE, "")) {
fprintf(stderr, "Locale failure, check your env vars\n");
fprintf(stderr, "Locale failure, check your env vars\n");
return 1;
return 1;
}
}


#ifdef __STDC_ISO_10646__
#ifdef __STDC_ISO_10646__
/* C99 compilers should understand these */
/* C99 compilers should understand these */
printf("%lc\n", 0x2708); /* ✈ */
printf("%lc\n", 0x2708); /* ✈ */
printf("%ls\n", poker); /* ♥♦♣♠ */
printf("%ls\n", poker); /* ♥♦♣♠ */
printf("%ls\n", four_two); /* 四十二 */
printf("%ls\n", four_two); /* 四十二 */
#else
#else
/* oh well */
/* oh well */
printf("airplane\n");
printf("airplane\n");
printf("heart diamond club spade\n");
printf("heart diamond club spade\n");
printf("for ty two\n");
printf("for ty two\n");
#endif
#endif
return 0;
return 0;
}</lang>
}</lang>


Line 1,066: Line 1,066:


Inside the script, utf8 characters can be used both as identifiers and literal strings, and built-in string functions will respect it:<lang Perl>$四十二 = "voilà";
Inside the script, utf8 characters can be used both as identifiers and literal strings, and built-in string functions will respect it:<lang Perl>$四十二 = "voilà";
print "$四十二"; # voilà
print "$四十二"; # voilà
print uc($四十二); # VOILÀ</lang>
print uc($四十二); # VOILÀ</lang>
or you can specify unicode characters by name or ordinal:<lang Perl>use charnames qw(greek);
or you can specify unicode characters by name or ordinal:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>
print scalar reverse("$x $y"); # ✈ Σ σ</lang>


Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>
Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>


When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";
When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";
Line 1,301: Line 1,301:
𪚥
𪚥
𝄞
𝄞
$ \u0024
$ \u0024
a \u0061
a \u0061
b \u0062
b \u0062
c \u0063
c \u0063
d \u0064
d \u0064
e \u0065
e \u0065
¢ \u00A2
¢ \u00A2
£ \u00A3
£ \u00A3
¤ \u00A4
¤ \u00A4
¥ \u00A5
¥ \u00A5
© \u00A9
© \u00A9
Ç \u00C7
Ç \u00C7
ß \u00DF
ß \u00DF
ç \u00E7
ç \u00E7
IJ \u0132
IJ \u0132
ij \u0133
ij \u0133
Ł \u0141
Ł \u0141
ł \u0142
ł \u0142
ʒ \u0292
ʒ \u0292
λ \u03BB
λ \u03BB
π \u03C0
π \u03C0
\u2022
\u2022
\u20A0
\u20A0
\u20A1
\u20A1
\u20A2
\u20A2
\u20A3
\u20A3
\u20A4
\u20A4
\u20A5
\u20A5
\u20A6
\u20A6
\u20A7
\u20A7
\u20A8
\u20A8
\u20A9
\u20A9
\u20AA
\u20AA
\u20AB
\u20AB
\u20AC
\u20AC
\u20AD
\u20AD
\u20AE
\u20AE
\u20AF
\u20AF
\u20B0
\u20B0
\u20B1
\u20B1
\u20B2
\u20B2
\u20B3
\u20B3
\u20B4
\u20B4
\u20B5
\u20B5
\u20B5
\u20B5
\u2190
\u2190
\u2192
\u2192
\u21D2
\u21D2
\u2219
\u2219
\u2318
\u2318
\u263A
\u263A
\u263B
\u263B
\u30A2
\u30A2
\u5B57
\u5B57
\u6587
\u6587
? \uD869
? \uD869
? \uDEA5
? \uDEA5
\uF8FF
\uF8FF
</pre>
</pre>

=={{header|Rust}}==

Source code must be encoded in UTF-8.
Non-ASCII characters are accepted in character literals, string literals and comments; since Rust 1.53 they are also permitted in identifiers.
Literals may specify Unicode characters in the form of escape sequences as well.
An escape sequence has the form <code>\u{X}</code>, where <code>X</code> is the hexadecimal code point of the character (up to 6 digits).
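
As a minimal sketch of the literal forms described above, the following writes the same character directly and as an escape sequence. The function name <code>airplane_variants</code> is made up for this illustration.

```rust
// Hypothetical helper: the same character written three ways.
fn airplane_variants() -> (char, char, String) {
    let literal = '✈'; // non-ASCII character typed directly in a char literal
    let escaped = '\u{2708}'; // the same code point as an escape sequence
    let in_string = String::from("\u{2708}"); // escapes work in string literals too
    (literal, escaped, in_string)
}

fn main() {
    let (literal, escaped, in_string) = airplane_variants();
    println!("{} {} {}", literal, escaped, in_string);
    // Code points above U+FFFF use the same syntax with more hex digits.
    println!("{}", "\u{1D11E}");
}
```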

Unicode characters can be represented by the built-in type <code>char</code>.
Unicode strings can be represented by several types, which differ in how they own the string data.
The most primitive string type is the built-in type <code>str</code>, also called a string slice.
Other string types usually allow borrowing a string slice to access the actual string data.
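
The relationship between these types might be sketched as follows. The function <code>char_count</code> is a hypothetical helper; <code>String</code> and <code>Cow</code> are chosen here only as two examples of standard library types that can lend out a <code>&str</code>.

```rust
use std::borrow::Cow;

// Hypothetical helper: accepts a borrowed string slice,
// regardless of which type owns the underlying data.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let c: char = 'λ'; // a single Unicode scalar value
    let slice: &str = "四十二"; // borrowed string slice
    let owned: String = String::from("voilà"); // owned, growable string
    let cow: Cow<str> = Cow::Borrowed("♥♦♣♠"); // either borrowed or owned

    println!("{} takes {} bytes in UTF-8", c, c.len_utf8());
    println!("{}", char_count(slice));
    println!("{}", char_count(&owned)); // &String coerces to &str
    println!("{}", char_count(&cow)); // &Cow<str> coerces to &str
}
```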

String slices are stored as UTF-8 encoded byte sequences.
This representation does not allow constant-time indexing by character, which many other languages provide.
As a result, string handling usually requires a slightly different approach that leverages general concepts like slices and iterators.
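
A small sketch of the iterator-based approach, assuming the goal is to pick out characters by position. The helper names <code>nth_char</code> and <code>first_chars</code> are invented for this example; both walk the string from the start, so they run in linear time.

```rust
// Hypothetical helper: the n-th character counted by code point, not by byte.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

// Hypothetical helper: the prefix holding the first n characters.
// char_indices yields byte offsets that always fall on character boundaries,
// so slicing at such an offset cannot panic.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // fewer than n characters: return everything
    }
}

fn main() {
    let s = "voilà ✈";
    // len() counts bytes, chars().count() counts code points.
    println!("{} bytes, {} chars", s.len(), s.chars().count());
    println!("{:?}", nth_char(s, 4));
    println!("{}", first_chars(s, 5));
}
```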

Rust is strict about correct Unicode representation.
It even has distinct types for filesystem paths, because these may contain byte sequences that do not form valid Unicode characters.
Conversions between different string representations may result in errors rather than producing an invalid or inexact result.
Some inexact (lossy) conversions can be requested explicitly.
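
The difference between strict and lossy conversions can be illustrated with a sketch like the one below. The wrappers <code>strict</code> and <code>lossy</code> are hypothetical names around the standard functions <code>std::str::from_utf8</code> and <code>String::from_utf8_lossy</code>; the file name is an arbitrary example.

```rust
use std::path::Path;

// Hypothetical wrapper: strict conversion, fails on invalid UTF-8.
fn strict(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

// Hypothetical wrapper: lossy conversion, replaces invalid
// sequences with U+FFFD REPLACEMENT CHARACTER.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let bad = [b'a', 0xFF, b'b']; // 0xFF never occurs in valid UTF-8
    println!("{:?}", strict(&bad));
    println!("{}", lossy(&bad));

    // Paths are not guaranteed to be valid Unicode, so conversion
    // to &str returns an Option, and a lossy variant is available.
    let path = Path::new("四十二.txt");
    match path.to_str() {
        Some(s) => println!("valid Unicode path: {}", s),
        None => println!("not valid Unicode: {}", path.to_string_lossy()),
    }
}
```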

Besides UTF-8, the standard library provides functions for handling UTF-16.
More advanced Unicode operations (e.g., segmentation into graphemes) are available in third-party libraries (crates).
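
For the UTF-16 support mentioned above, a minimal sketch using only standard library calls (<code>str::encode_utf16</code>, <code>String::from_utf16</code>, <code>String::from_utf16_lossy</code>) could look like this. The helper names <code>to_utf16</code> and <code>from_utf16</code> are made up for the example.

```rust
// Hypothetical helper: encode a string slice as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

// Hypothetical helper: strict decoding, fails on unpaired surrogates.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // A character outside the Basic Multilingual Plane
    // becomes a surrogate pair in UTF-16.
    let units = to_utf16("𝄞");
    println!("{:04X?}", units);
    println!("{:?}", from_utf16(&units));

    // A lone high surrogate is invalid; the strict conversion reports an
    // error, while the lossy one substitutes U+FFFD.
    let broken = [0xD834u16];
    println!("{:?}", from_utf16(&broken));
    println!("{}", String::from_utf16_lossy(&broken));
}
```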



=={{header|Seed7}}==
=={{header|Seed7}}==
Line 1,551: Line 1,576:
Здравствуйте
Здравствуйте
שלום
שלום
text ✈"blue</pre>
text ✈"blue</pre>


=={{header|WDTE}}==
=={{header|WDTE}}==