 
=={{header|Go}}==
Go source code is specified to be UTF-8 encoded.
This directly allows any Unicode code point in character and string literals.
Unicode is also allowed in identifiers like variables and field names, with some restrictions.
The <code>string</code> data type represents a read-only sequence of bytes, which conventionally, but not necessarily, holds UTF-8-encoded text.
A number of built-in features interpret <code>string</code>s as UTF-8. For example,
<lang go> var i int
var u rune
    for i, u = range "voilà" {
        fmt.Println(i, u)
    }</lang>
Output:
<pre>
0 118
1 111
2 105
3 108
4 224
</pre>
224 being the Unicode code point for the à character.
Note <code>rune</code> is predefined to be a type that can hold a Unicode code point.
 
In contrast,
<lang go>    w := "voilà"
    for i := 0; i < len(w); i++ {
        fmt.Println(i, w[i])
    }</lang>
Output:
<pre>
0 118
1 111
2 105
3 108
4 195
5 160
</pre>
bytes 4 and 5 showing the UTF-8 encoding of à.
The expression <code>w[i]</code> in this case has the type <code>byte</code> rather than <code>rune</code>.
A Go blog post covers this in more detail: [http://blog.golang.org/strings Strings, bytes, runes and characters in Go].
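For illustration, a minimal complete program exercising non-ASCII characters in both literals and identifiers (the names <code>voilà</code> and <code>π</code> here are arbitrary examples) might look like:
<lang go>package main

import "fmt"

func main() {
    // Non-ASCII letters are legal in identifiers as well as literals.
    voilà := "voilà"
    π := 3.14159
    fmt.Println(voilà, π)
}</lang>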
 
The heavily used standard packages <code>bytes</code> and <code>strings</code> both have functions for working with strings as UTF-8 and as encoding-unspecified bytes.
The standard packages <code>unicode</code>, <code>unicode/utf8</code>, and <code>unicode/utf16</code> have additional functions.
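A short sketch of a few of these functions, contrasting byte length with rune count and showing Unicode-aware case mapping and character classification:
<lang go>package main

import (
    "fmt"
    "strings"
    "unicode"
    "unicode/utf8"
)

func main() {
    s := "voilà"
    fmt.Println(len(s))                    // 6: length in bytes
    fmt.Println(utf8.RuneCountInString(s)) // 5: length in runes
    fmt.Println(strings.ToUpper(s))        // VOILÀ
    fmt.Println(unicode.IsLetter('à'))     // true
}</lang>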
 
Normalization support is available in the [[:Category:Go sub-repositories|sub-repository]] package <code>code.google.com/p/go.text/unicode/norm</code>.
It contains a number of string manipulation functions that work with the four normalization forms NFC, NFD, NFKC, and NFKD.
The normalization form type in this package provides wrappers implementing the <code>io.Reader</code> and <code>io.WriteCloser</code> interfaces to enable on-the-fly normalization during I/O.
A Go blog post covers this in more detail: [http://blog.golang.org/normalization Text normalization in Go].
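A minimal sketch of composing a decomposed string with this package (using the sub-repository import path given above; the import path may differ in later versions):
<lang go>package main

import (
    "fmt"

    "code.google.com/p/go.text/unicode/norm"
)

func main() {
    d := "voila\u0300"          // à written as 'a' + combining grave accent
    c := norm.NFC.String(d)     // compose to the single code point à
    fmt.Println(len(d), len(c)) // 7 6
    fmt.Println(c == "voilà")   // true
}</lang>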
 
There is no built-in or automatic handling of byte order marks (which are at best unnecessary with UTF-8).
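If input might begin with a BOM, it can be stripped by hand; a minimal sketch:
<lang go>package main

import (
    "fmt"
    "strings"
)

func main() {
    // U+FEFF encodes as the three bytes EF BB BF in UTF-8.
    input := "\uFEFFvoilà"
    clean := strings.TrimPrefix(input, "\uFEFF")
    fmt.Println(len(input), len(clean)) // 9 6
}</lang>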
 
=={{header|Haskell}}==