Unicode strings: Difference between revisions

m
{{out}}
(→‎{{header|Ruby}}: added unicode normalisation)
m ({{out}})
Line 1:
{{task}}
As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend. It is a very capable tool, but also quite complex compared to older single- and double-byte character encodings. How well prepared is your programming language for Unicode? Discuss and demonstrate its Unicode awareness and capabilities.

Some suggested topics:
 
* How easy is it to present Unicode strings in source code? Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
Line 7 ⟶ 11:
* How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization?
 
'''Note''' This task is a bit unusual in that it encourages general discussion rather than clever coding.
 
See also:
Line 362 ⟶ 367:
 
)</lang>
{{out}}
<pre>
aircraft: 16r2708 => ✈
Line 382 ⟶ 387:
 
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? UTF-8 is most often used. StrPut/StrGet and FileRead/FileAppend allow Unicode in AutoHotkey_L (the current build).
 
=={{header|AWK}}==
 
Line 398 ⟶ 404:
Is it convenient to manipulate Unicode strings in the language? - No
 
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
 
=={{header|BBC BASIC}}==
Line 409 ⟶ 415:
Identifiers (variable names) and keywords cannot use Unicode characters.
* How well can the language communicate with the rest of the world? Is it good at input/output with Unicode?
Output of Unicode text to both the screen and the printer is supported, but must be enabled using a '''VDU 23,22''' command since the default output mode is ANSI. The text printing direction can be set to right-to-left for languages such as Hebrew and Arabic. Run-time support for Arabic ligatures is not built-in, but is provided by means of the FNarabic() function. No specific support for Unicode input at run time is provided, although this is possible by means of Windows controls.
* Is it convenient to manipulate Unicode strings in the language?
The supported character encoding is UTF-8 which, being a byte stream, is compatible with most of the language's string manipulation functions. However, the parameters in functions like '''LEFT$''' and '''MID$''' refer to byte counts rather than character counts.
Line 543 ⟶ 550:
fmt.Println(i, u)
}</lang>
{{out}}
<pre>
0 118
Line 560 ⟶ 567:
}
</lang>
{{out}}
<pre>
0 118
Line 634 ⟶ 641:
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8 or utf-16 and in either case the resulting literal strings match. (8 u: string produces a utf-8 result.)
 
Output uses characters in whatever format they happen to be in. Character input assumes 8-bit characters but places no additional interpretation on them.
 
See also: http://www.jsoftware.com/help/dictionary/duco.htm
Line 690 ⟶ 698:
 
=={{header|Lasso}}==
All string data in Lasso is processed as double-byte Unicode characters. Any input is assumed to be UTF-8 unless told otherwise. All output is UTF-8 unless a different encoding is specified. You can specify Unicode characters by ordinal.
 
Variable names cannot contain anything but ASCII.
Line 701 ⟶ 711:
'<br />'
#unicode -> get (4) -> integer</lang>
{{out}}
<pre>♥♦♣♠頰
Line 782 ⟶ 792:
Can Unicode literals be written directly, or be part of identifiers/keywords/etc? '''Yes; identifiers, literals and such can be written directly as UTF8 strings.'''
 
How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? '''Nemerle plays well with the 'rest of the world' (it was developed in Poland and most of its user-base is Polish or Russian, so the 'rest of the world' from the language developers'/users' perspective is different from that typically envisioned by an Anglophone). Input/output in UTF8 is handled readily; other encodings, text directions and such are handled by classes in the <tt>System.Text</tt> and <tt>System.Globalization</tt> namespaces. See [http://msdn.microsoft.com/en-us/library/h6270d0z.aspx this] MSDN page for recommendations on globalization/localization of applications.'''
 
Is it convenient to manipulate Unicode strings in the language? '''Yes; string methods expect UTF8 strings'''
Line 1,006 ⟶ 1,017:
 
=={{header|Tcl}}==
All characters in Tcl are ''always'' Unicode characters, with ordinary string operations (as listed elsewhere on Rosetta Code) always performed on Unicode. Input and output characters are translated from and to the system's native encoding automatically (this can be overridden on a per file-handle basis via <code>fconfigure -encoding</code>). Source files can be written in encodings other than the native encoding (from Tcl 8.5 onwards, the encoding to use for a file can be controlled by the <code>-encoding</code> option to [[tclsh]], [[wish]] and <code>source</code>), though it is usually recommended that programmers maximize their portability by writing in the ASCII subset and using the <code>\uXXXX</code> escape sequence for all other characters. Tcl does ''not'' handle byte-order marks by default, because that requires deeper understanding of the application level (and sometimes the encoding information is available in metadata anyway, such as when handling HTTP connections).
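
When a byte-order mark is not handled automatically, the application has to decide what to do with it. The following is a minimal Python sketch of that application-level step, not Tcl code; the function name <code>strip_utf8_bom</code> is hypothetical, and the sketch assumes the data is already known to be UTF-8.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 byte-order mark, if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


if __name__ == "__main__":
    payload = codecs.BOM_UTF8 + "héllo".encode("utf-8")
    # A plain UTF-8 decode keeps the BOM as the character U+FEFF.
    print(repr(payload.decode("utf-8")))
    # Stripping it first gives the text the application wanted.
    print(repr(strip_utf8_bom(payload).decode("utf-8")))
```

Whether the mark should be dropped, kept, or used to choose an encoding depends on where the data came from, which is why it is left to the application.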
 
The way in which characters are encoded in memory is not defined by the Tcl language (the implementation uses byte arrays, UTF-16 arrays and UCS-2 strings as appropriate) and the only characters with any restriction on use as command or variable names are the ASCII parenthesis and colon characters. However, the <code>$var</code> shorthand syntax is much more restricted (to ASCII alphanumeric plus underline only); other cases have to use the more verbose form: <code>[set funny–var–name]</code>.
 
=={{header|TXR}}==
 
TXR source code and I/O are all assumed to be text which is UTF-8 encoded. This is a self-contained implementation, not relying on any encoding library. TXR ignores LANG and such environment variables.
 
One of the regression test cases uses Japanese text.
Line 1,071 ⟶ 1,088:
=={{header|UNIX Shell}}==
 
The Bourne shell does not have any inbuilt Unicode functionality. However, Unicode can be represented as ASCII-based hexadecimal number sequences, or by using a form of escape-sequence encoding, such as \uXXXX. The shell will produce its output in ASCII, but can call other programs to produce the Unicode output. The shell does not have any inbuilt string manipulation utilities, so it relies on external tools such as cut, expr, grep, sed and awk. These would typically manipulate the hexadecimal sequences to provide string manipulation, or dedicated Unicode-based tools could be used.
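
As an illustration of what such an external program might do, here is a minimal Python sketch (not shell code). The function names <code>expand_u_escapes</code> and <code>utf8_hex</code> are hypothetical. The sketch assumes exactly four hexadecimal digits after <code>\u</code>, so characters outside the Basic Multilingual Plane are not covered.

```python
import re


def expand_u_escapes(s: str) -> str:
    """Replace each \\uXXXX escape (4 hex digits) with its character."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )


def utf8_hex(s: str) -> str:
    """Return the UTF-8 encoding of s as an ASCII hexadecimal string."""
    return s.encode("utf-8").hex()


if __name__ == "__main__":
    # An ASCII-only string, as a shell script might hold it.
    ascii_form = r"aircraft: \u2708"
    print(expand_u_escapes(ascii_form))
    # The same character as a hexadecimal byte sequence, and back again.
    print(utf8_hex("\u2708"))
    print(bytes.fromhex(utf8_hex("\u2708")).decode("utf-8"))
```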
 
* How well prepared is the programming language for Unicode? - Fine. All Unicode strings can be represented as hexadecimal sequences.