Unicode strings: Difference between revisions
(→{{header|Wren}}: Added a upc example.) |
(Add comment for Rust) |
||
Line 523: | Line 523: | ||
int main() { |
int main() { |
||
/* Set the locale to alert C's multibyte output routines */ |
|||
if (!setlocale(LC_CTYPE, "")) { |
|||
fprintf(stderr, "Locale failure, check your env vars\n"); |
|||
return 1; |
|||
} |
|||
} |
|||
#ifdef __STDC_ISO_10646__ |
#ifdef __STDC_ISO_10646__ |
||
/* C99 compilers should understand these */ |
|||
printf("%lc\n", 0x2708); /* ✈ */ |
|||
printf("%ls\n", poker); /* ♥♦♣♠ */ |
|||
printf("%ls\n", four_two); /* 四十二 */ |
|||
#else |
#else |
||
/* oh well */ |
|||
printf("airplane\n"); |
|||
printf("heart diamond club spade\n"); |
|||
printf("forty two\n"); |
|||
#endif |
#endif |
||
return 0; |
|||
}</lang> |
}</lang> |
||
Line 1,066: | Line 1,066: | ||
Inside the script, UTF-8 characters can be used both in identifiers and in literal strings, and built-in string functions will respect them:<lang Perl>$四十二 = "voilà"; |
Inside the script, UTF-8 characters can be used both in identifiers and in literal strings, and built-in string functions will respect them:<lang Perl>$四十二 = "voilà"; |
||
print "$四十二"; |
print "$四十二"; # voilà |
||
print uc($四十二); |
print uc($四十二); # VOILÀ</lang> |
||
or you can specify unicode characters by name or ordinal:<lang Perl>use charnames qw(greek); |
or you can specify unicode characters by name or ordinal:<lang Perl>use charnames qw(greek); |
||
$x = "\N{sigma} \U\N{sigma}"; |
$x = "\N{sigma} \U\N{sigma}"; |
||
$y = "\x{2708}"; |
$y = "\x{2708}"; |
||
print scalar reverse("$x $y"); |
print scalar reverse("$x $y"); # ✈ Σ σ</lang> |
||
Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; |
Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang> |
||
When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf"; |
When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf"; |
||
Line 1,301: | Line 1,301: | ||
𪚥 |
𪚥 |
||
𝄞 |
𝄞 |
||
$ |
$ \u0024 |
||
a |
a \u0061 |
||
b |
b \u0062 |
||
c |
c \u0063 |
||
d |
d \u0064 |
||
e |
e \u0065 |
||
¢ |
¢ \u00A2 |
||
£ |
£ \u00A3 |
||
¤ |
¤ \u00A4 |
||
¥ |
¥ \u00A5 |
||
© |
© \u00A9 |
||
Ç |
Ç \u00C7 |
||
ß |
ß \u00DF |
||
ç |
ç \u00E7 |
||
IJ |
IJ \u0132 |
||
ij |
ij \u0133 |
||
Ł |
Ł \u0141 |
||
ł |
ł \u0142 |
||
ʒ |
ʒ \u0292 |
||
λ |
λ \u03BB |
||
π |
π \u03C0 |
||
• |
• \u2022 |
||
₠ |
₠ \u20A0 |
||
₡ |
₡ \u20A1 |
||
₢ |
₢ \u20A2 |
||
₣ |
₣ \u20A3 |
||
₤ |
₤ \u20A4 |
||
₥ |
₥ \u20A5 |
||
₦ |
₦ \u20A6 |
||
₧ |
₧ \u20A7 |
||
₨ |
₨ \u20A8 |
||
₩ |
₩ \u20A9 |
||
₪ |
₪ \u20AA |
||
₫ |
₫ \u20AB |
||
€ |
€ \u20AC |
||
₭ |
₭ \u20AD |
||
₮ |
₮ \u20AE |
||
₯ |
₯ \u20AF |
||
₰ |
₰ \u20B0 |
||
₱ |
₱ \u20B1 |
||
₲ |
₲ \u20B2 |
||
₳ |
₳ \u20B3 |
||
₴ |
₴ \u20B4 |
||
₵ |
₵ \u20B5 |
||
₵ |
₵ \u20B5 |
||
← |
← \u2190 |
||
→ |
→ \u2192 |
||
⇒ |
⇒ \u21D2 |
||
∙ |
∙ \u2219 |
||
⌘ |
⌘ \u2318 |
||
☺ |
☺ \u263A |
||
☻ |
☻ \u263B |
||
ア |
ア \u30A2 |
||
字 |
字 \u5B57 |
||
文 |
文 \u6587 |
||
? |
? \uD869 |
||
? |
? \uDEA5 |
||
|
\uF8FF |
||
</pre> |
</pre> |
||
=={{header|Rust}}== |
|||
Source code must be encoded in UTF-8. |
|||
Non-ASCII characters may appear in comments and in character and string literals; since Rust 1.53 they are also allowed in identifiers. |
|||
Literals may also specify Unicode characters as escape sequences. |
|||
An escape sequence has the form <code>\u{X}</code>, where <code>X</code> is the hexadecimal code point of the character (one to six digits). |
|||
A Unicode character (more precisely, a Unicode scalar value) is represented by the built-in type <code>char</code>. |
|||
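The escape syntax and the <code>char</code> type described above can be illustrated with a minimal sketch. The function names <code>airplane</code> and <code>forty_two</code> are hypothetical helpers chosen for this example; only standard library calls are used.

```rust
/// Hypothetical helper: the airplane character written as a Unicode escape.
fn airplane() -> char {
    '\u{2708}' // same character as the literal '✈'
}

/// Hypothetical helper: the string "四十二" spelled only with escape sequences.
fn forty_two() -> &'static str {
    "\u{56DB}\u{5341}\u{4E8C}"
}

fn main() {
    let plane = airplane();
    // A char holds one Unicode scalar value, so it converts losslessly to u32.
    println!("{} = U+{:04X}", plane, plane as u32);

    // The escaped and the literal spelling denote the same string.
    println!("{}", forty_two() == "四十二");

    // char::from_u32 returns None for values that are not scalar values,
    // such as the lone surrogate 0xD869.
    println!("{:?}", char::from_u32(0xD869));
}
```

Because <code>char</code> can only hold scalar values, surrogate code points cannot be stored in it at all; the fallible <code>char::from_u32</code> makes that check explicit.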
Unicode strings are represented by several types, which differ in how they own the string data. |
|||
The most primitive string type is the built-in type <code>str</code>, also called a string slice. |
|||
Other string types usually allow borrowing a string slice to access the actual string data. |
|||
String slices are stored as UTF-8 encoded byte sequences. |
|||
Because UTF-8 is a variable-width encoding, this representation does not allow indexing by character in constant time, which is common in many other languages. |
|||
As a result, string handling usually requires a slightly different approach that leverages general concepts such as slices and iterators. |
|||
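A minimal sketch of the slice-and-iterator approach follows. The helpers <code>nth_char</code> and <code>first_chars</code> are hypothetical names introduced for illustration; the sample string is written with escapes so that its byte layout is unambiguous.

```rust
/// Hypothetical helper: the n-th character (0-based), found by walking
/// the string from the start, so the cost grows with n.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Hypothetical helper: the first `n` characters as a borrowed sub-slice.
fn first_chars(s: &str, n: usize) -> &str {
    // char_indices yields the byte offset at which each character starts.
    match s.char_indices().nth(n) {
        Some((byte_offset, _)) => &s[..byte_offset],
        None => s, // fewer than n characters: return the whole slice
    }
}

fn main() {
    // "voilà ✈" with the non-ASCII characters spelled as escapes.
    let s = "voil\u{E0} \u{2708}";

    // len() counts bytes, chars().count() counts Unicode scalar values.
    println!("{} bytes, {} chars", s.len(), s.chars().count());

    println!("{:?}", nth_char(s, 4));
    println!("{}", first_chars(s, 5));

    // Byte 5 lies inside the two-byte encoding of 'à'. Slicing with
    // &s[0..5] would panic; `get` reports the problem as None instead.
    println!("{:?}", s.get(0..5));
}
```

Slicing works on byte offsets, which is why the sketch obtains the offset from <code>char_indices</code> rather than using the character count directly.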
Rust is strict about valid Unicode representation. |
|||
It even has distinct types for filesystem paths, because these may contain byte sequences that do not form valid Unicode characters. |
|||
Conversions between different string representations may result in errors rather than producing an invalid or inexact result. |
|||
Some inexact (lossy) conversions can be requested explicitly. |
|||
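The difference between strict and lossy conversions can be sketched as follows. The wrappers <code>strict</code> and <code>lossy</code> are hypothetical names for this example, and the path <code>file_utf</code> is only an illustrative value; the underlying calls are from the standard library.

```rust
use std::path::Path;
use std::string::FromUtf8Error;

/// Hypothetical wrapper: strict conversion, fails on invalid UTF-8.
fn strict(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Hypothetical wrapper: lossy conversion, replaces each invalid
/// sequence with U+FFFD REPLACEMENT CHARACTER.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let good = "\u{2708}".as_bytes();
    let bad = [b'a', 0xFF, b'b']; // 0xFF never occurs in valid UTF-8

    println!("{:?}", strict(good));
    println!("{}", strict(&bad).is_err());
    println!("{}", lossy(&bad));

    // A path need not be valid Unicode, so converting it to &str is
    // fallible, while to_string_lossy always produces a result.
    let path = Path::new("file_utf");
    println!("{:?}", path.to_str());
    println!("{}", path.to_string_lossy());
}
```

The strict variant forces the caller to handle the error, while the lossy variant has to be chosen explicitly by name.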
Besides UTF-8, the standard library provides functions for handling UTF-16. |
|||
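As a small sketch of the UTF-16 support, the following round-trips a character outside the Basic Multilingual Plane. The helper name <code>to_utf16</code> is hypothetical; <code>encode_utf16</code>, <code>String::from_utf16</code> and <code>String::from_utf16_lossy</code> are standard library functions.

```rust
/// Hypothetical helper: encodes a string slice as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // U+2A6A5 lies outside the Basic Multilingual Plane, so UTF-16
    // needs a surrogate pair (0xD869, 0xDEA5) to encode it.
    let units = to_utf16("\u{2A6A5}");
    println!("{:X?}", units);

    // Decoding is fallible: a well-formed pair succeeds...
    println!("{:?}", String::from_utf16(&units));

    // ...while a lone surrogate is rejected,
    println!("{}", String::from_utf16(&[0xD869]).is_err());

    // unless the lossy variant is requested, which substitutes U+FFFD.
    println!("{}", String::from_utf16_lossy(&[0xD869]));
}
```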
More advanced Unicode operations (e.g., segmentation into graphemes) are available in third-party libraries (crates). |
|||
=={{header|Seed7}}== |
=={{header|Seed7}}== |
||
Line 1,551: | Line 1,576: | ||
Здравствуйте |
Здравствуйте |
||
שלום |
שלום |
||
text ✈"blue</pre> |
|||
=={{header|WDTE}}== |
=={{header|WDTE}}== |