Unicode strings: Difference between revisions
(→{{header|Ruby}}: added unicode normalisation) |
m ({{out}}) |
||
Line 1: | Line 1: | ||
{{task}} |
{{task}} |
||
As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend. |
As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend. <br> |
||
It is a very capable tool, but also quite complex compared to older single- |
|||
and double-byte character encodings. How well prepared is your programming language for Unicode? Discuss and demonstrate its unicode awareness and capabilities. |
|||
Some suggested topics: |
|||
* How easy is it to present Unicode strings in source code? Can Unicode literals be written directly, or be part of identifiers/keywords/etc? |
* How easy is it to present Unicode strings in source code? Can Unicode literals be written directly, or be part of identifiers/keywords/etc? |
||
Line 7: | Line 11: | ||
* How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization? |
* How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization? |
||
'''Note''' This task is a bit unusual in that it encourages general discussion |
'''Note''' This task is a bit unusual in that it encourages general discussion |
||
rather than clever coding. |
|||
See also: |
See also: |
||
Line 362: | Line 367: | ||
)</lang> |
)</lang> |
||
{{out}} |
|||
Output: |
|||
<pre> |
<pre> |
||
aircraft: 16r2708 => ✈ |
aircraft: 16r2708 => ✈ |
||
Line 382: | Line 387: | ||
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? UTF-8 is most often used. StrPut/StrGet and FileRead/FileAppend allow unicode in AutoHotkey_L (the current build) |
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? UTF-8 is most often used. StrPut/StrGet and FileRead/FileAppend allow unicode in AutoHotkey_L (the current build) |
||
=={{header|AWK}}== |
=={{header|AWK}}== |
||
Line 398: | Line 404: | ||
Is it convenient to manipulate Unicode strings in the language? - No |
Is it convenient to manipulate Unicode strings in the language? - No |
||
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings. |
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings. |
||
=={{header|BBC BASIC}}== |
=={{header|BBC BASIC}}== |
||
Line 409: | Line 415: | ||
Identifiers (variable names) and keywords cannot use Unicode characters. |
Identifiers (variable names) and keywords cannot use Unicode characters. |
||
* How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? |
* How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? |
||
Output of Unicode text to both the screen and the printer is supported, but must be enabled using a '''VDU 23,22''' command since the default output mode is ANSI. |
Output of Unicode text to both the screen and the printer is supported, but must be enabled using a '''VDU 23,22''' command since the default output mode is ANSI. |
||
The text printing direction can be set to right-to-left for languages such as Hebrew and Arabic. Run-time support for Arabic ligatures is not built-in, but is provided by means of the FNarabic() function. No specific support for Unicode input at run time is provided, although this is possible by means of Windows controls. |
|||
* Is it convenient to manipulate Unicode strings in the language? |
* Is it convenient to manipulate Unicode strings in the language? |
||
The supported character encoding is UTF-8 which, being a byte stream, is compatible with most of the language's string manipulation functions. However, the parameters in functions like '''LEFT$''' and '''MID$''' refer to byte counts rather than character counts. |
The supported character encoding is UTF-8 which, being a byte stream, is compatible with most of the language's string manipulation functions. However, the parameters in functions like '''LEFT$''' and '''MID$''' refer to byte counts rather than character counts. |
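The gap between byte counts and character counts can be sketched as follows. This is an illustration in Python rather than BBC BASIC; the sample text and the helper names <code>left_bytes</code> and <code>left_chars</code> are hypothetical and only mimic the idea of a byte-counting prefix function versus a character-counting one.

```python
# Illustration in Python, not BBC BASIC: byte counts vs character counts in UTF-8.

SAMPLE = "naïve ✈"                 # hypothetical sample text, 7 code points
ENCODED = SAMPLE.encode("utf-8")   # the same text as a UTF-8 byte stream


def left_bytes(data: bytes, n: int) -> bytes:
    """Take the first n bytes, as a byte-counting prefix function would."""
    return data[:n]


def left_chars(data: bytes, n: int) -> bytes:
    """Take the first n characters (code points) by decoding before slicing."""
    return data.decode("utf-8")[:n].encode("utf-8")


print(len(SAMPLE))             # 7 characters
print(len(ENCODED))            # 10 bytes: "ï" takes 2 bytes, "✈" takes 3
print(left_bytes(ENCODED, 3))  # b'na\xc3', which cuts "ï" in half
print(left_chars(ENCODED, 3))  # b'na\xc3\xaf', the complete first 3 characters
```

A byte-oriented prefix can therefore end in the middle of a multi-byte character, which is why byte offsets need care whenever the text is not pure ASCII.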
||
Line 543: | Line 550: | ||
fmt.Println(i, u) |
fmt.Println(i, u) |
||
}</lang> |
}</lang> |
||
{{out}} |
|||
outputs |
|||
<pre> |
<pre> |
||
0 118 |
0 118 |
||
Line 560: | Line 567: | ||
} |
} |
||
</lang> |
</lang> |
||
{{out}} |
|||
outputs |
|||
<pre> |
<pre> |
||
0 118 |
0 118 |
||
Line 634: | Line 641: | ||
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8 or utf-16 and in either case the resulting literal strings match. (8 u: string produces a utf-8 result.) |
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8 or utf-16 and in either case the resulting literal strings match. (8 u: string produces a utf-8 result.) |
||
Output uses characters in whatever format they happen to be in. |
Output uses characters in whatever format they happen to be in. |
||
Character input assumes 8 bit characters but places no additional interpretation on them. |
|||
See also: http://www.jsoftware.com/help/dictionary/duco.htm |
See also: http://www.jsoftware.com/help/dictionary/duco.htm |
||
Line 690: | Line 698: | ||
=={{header|Lasso}}== |
=={{header|Lasso}}== |
||
All string data in Lasso is processed as double-byte Unicode characters. Any input is assumed to be UTF-8 if not otherwise told. |
All string data in Lasso is processed as double-byte Unicode characters. Any input is assumed to be UTF-8 if not otherwise told. |
||
All output is UTF-8 unless specified to a different encoding. |
|||
You can specify unicode characters by ordinal. |
|||
Variable names can not contain anything but ASCII. |
Variable names can not contain anything but ASCII. |
||
Line 701: | Line 711: | ||
'<br />' |
'<br />' |
||
#unicode -> get (4) -> integer</lang> |
#unicode -> get (4) -> integer</lang> |
||
{{out}} |
|||
Output: |
|||
<pre>♥♦♣♠頰 |
<pre>♥♦♣♠頰 |
||
♦ |
♦ |
||
Line 782: | Line 792: | ||
Can Unicode literals be written directly, or be part of identifiers/keywords/etc? '''Yes; identifiers, literals and such can be written directly as UTF8 strings.''' |
Can Unicode literals be written directly, or be part of identifiers/keywords/etc? '''Yes; identifiers, literals and such can be written directly as UTF8 strings.''' |
||
How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? '''Nemerle plays well with the 'rest of the world' (it was developed in Poland and most of its user-base is Polish or Russian, so the 'rest of the world' from the language developers/users perspective is different than that typically envisioned by an Anglophone.) |
How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? '''Nemerle plays well with the 'rest of the world' (it was developed in Poland and most of its user-base is Polish or Russian, so the 'rest of the world' from the language developers/users perspective is different than that typically envisioned by an Anglophone.) |
||
Input/output in UTF8 is handled readily; other encodings, text directions and such are handled by classes in the <tt>System.Text</tt> and <tt>System.Globalization</tt> namespaces. See [http://msdn.microsoft.com/en-us/library/h6270d0z.aspx this] MSDN page for recommendations on globalization/localization of applications.''' |
|||
Is it convenient to manipulate Unicode strings in the language? '''Yes; string methods expect UTF8 strings''' |
Is it convenient to manipulate Unicode strings in the language? '''Yes; string methods expect UTF8 strings''' |
||
Line 1,006: | Line 1,017: | ||
=={{header|Tcl}}== |
=={{header|Tcl}}== |
||
All characters in Tcl are ''always'' Unicode characters, with ordinary string operations (as listed elsewhere on Rosetta Code) always performed on Unicode. |
|||
⚫ | |||
Input and output characters are translated from and to the system's native encoding automatically (with this being able to be overridden on a per file-handle basis via <code>fconfigure -encoding</code>). |
|||
Source files can be written in encodings other than the native encoding (from Tcl 8.5 onwards, the encoding to use for a file can be controlled by the <code>-encoding</code> option to [[tclsh]], [[wish]] and <code>source</code>), though it is usually recommended that programmers maximize their portability by writing in the ASCII subset and using the <code>\uXXXX</code> escape sequence for all other characters. |
||
Tcl does ''not'' handle byte-order marks by default, because that requires deeper understanding of the application level (and sometimes the encoding information is available in metadata anyway, such as when handling HTTP connections). |
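What it means for an application to handle a byte-order mark itself can be sketched as below. The example is in Python, not Tcl, and does not show any Tcl mechanism; it only illustrates that a plain UTF-8 decode keeps the mark as a leading U+FEFF character, while an explicit choice by the application removes it. The helper names and sample bytes are hypothetical.

```python
# Illustration in Python, not Tcl: a byte-order mark survives a plain UTF-8
# decode unless the application explicitly asks for it to be removed.
import codecs

# Hypothetical input: UTF-8 text preceded by a byte-order mark.
RAW = codecs.BOM_UTF8 + "héllo".encode("utf-8")


def decode_keep_bom(data: bytes) -> str:
    """Plain UTF-8 decode: a leading BOM stays in the result as U+FEFF."""
    return data.decode("utf-8")


def decode_strip_bom(data: bytes) -> str:
    """Decode with Python's 'utf-8-sig' codec, which drops one leading BOM."""
    return data.decode("utf-8-sig")


print(repr(decode_keep_bom(RAW)))   # '\ufeffhéllo'
print(repr(decode_strip_bom(RAW)))  # 'héllo'
```

Whether the mark should be stripped depends on where the data came from, which matches the point above that the decision belongs to the application level.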
|||
The way in which characters are encoded in memory is not defined by the Tcl language (the implementation uses byte arrays, UTF-16 arrays and UCS-2 strings as appropriate) and the only characters with any restriction on use as command or variable names are the ASCII parenthesis and colon characters. |
The way in which characters are encoded in memory is not defined by the Tcl language (the implementation uses byte arrays, UTF-16 arrays and UCS-2 strings as appropriate) and the only characters with any restriction on use as command or variable names are the ASCII parenthesis and colon characters. |
||
However, the <code>$var</code> shorthand syntax is much more restricted (to ASCII alphanumeric plus underline only); other cases have to use the more verbose form: <code>[set funny–var–name]</code>. |
|||
=={{header|TXR}}== |
=={{header|TXR}}== |
||
TXR source code and I/O are all assumed to be text which is UTF-8 encoded. |
TXR source code and I/O are all assumed to be text which is UTF-8 encoded. |
||
This is a self-contained implementation, not relying on any encoding library. |
|||
TXR ignores LANG and such environment variables. |
|||
One of the regression test cases uses Japanese text. |
One of the regression test cases uses Japanese text. |
||
Line 1,071: | Line 1,088: | ||
=={{header|UNIX Shell}}== |
=={{header|UNIX Shell}}== |
||
The Bourne shell does not have any inbuilt Unicode functionality. |
|||
⚫ | |||
However, Unicode can be represented as ASCII based hexadecimal number sequences, or by using form of escape sequence encoding, such as \uXXXX. |
|||
The shell will produce its output in ASCII, but can call other programs to produce the Unicode output. |
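The two ASCII-only representations mentioned above can be sketched as follows. The example is in Python rather than shell, standing in for the kind of external program a shell script might call; the function names are hypothetical, and the character used is U+2708 (✈).

```python
# Illustration in Python, not shell: carrying Unicode text as plain ASCII,
# either as hexadecimal UTF-8 bytes or as a \uXXXX escape sequence.


def utf8_hex(text: str) -> str:
    """Render text as space-separated hexadecimal UTF-8 bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


def from_utf8_hex(hex_string: str) -> str:
    """Rebuild text from hexadecimal UTF-8 bytes (spaces are allowed)."""
    return bytes.fromhex(hex_string).decode("utf-8")


def from_escape(escaped: str) -> str:
    """Rebuild text from an ASCII string containing \\uXXXX escapes."""
    return escaped.encode("ascii").decode("unicode_escape")


print(utf8_hex("✈"))                    # E2 9C 88
print(from_utf8_hex("E2 9C 88"))        # ✈
print(from_escape(r"aircraft: \u2708")) # aircraft: ✈
```

Both forms stay within ASCII while they pass through the shell, and only the program that finally decodes them needs to understand Unicode.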
|||
⚫ | |||
* How well prepared is the programming language for Unicode? - Fine. All Unicode strings can be represented as hexadecimal sequences. |
* How well prepared is the programming language for Unicode? - Fine. All Unicode strings can be represented as hexadecimal sequences. |