Unicode strings: Difference between revisions

Revision as of 19:12, 26 April 2021

2,397 bytes added , 3 years ago

Replaced by a more detailed description based on the questions presented in the task.

Anonymous user

rosettacode>Lscrd

Previous revision as of 18:53, 25 March 2021, by Petelomax (m, →{{header|Phix}}: phix/basics)
=={{header|Nim}}==
Nim strings are sequences of 8-bit bytes, so they can readily be encoded in UTF-8. Other than allowing \uXXXX and \u{x+} UTF-8 escape sequences in strings, the language is agnostic about their content. All Unicode handling is done via libraries. A minimal “unicode” module in the standard distribution converts UTF-8 strings to and from UTF-32 (the type of a UTF-32 value is Rune) and performs find, append, etc. directly on strings considered as UTF-8. More advanced Unicode handling is available via installable modules from the community library.

::– How easy is it to present Unicode strings in source code?
It is very easy, provided that the editor understands UTF-8. Indeed, Nim considers that source is encoded in UTF-8.

::– Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
Unicode literals can be written directly in strings, again provided that the editor (and the font) are able to display them. It is of course possible to use the \uXXXX form. Identifiers may contain Unicode characters, but the restrictions regarding the allowed characters apply: the first character must be a letter (for instance “é” is allowed, as is “Δ”) and the following characters may be letters, “_” or digits. For instance “x²” is allowed.

::– How well can the language communicate with the rest of the world?
Nim strings may contain any UTF-8 combination. Note however that, by default, no check is done regarding the validity of the string. This is the usual behavior for systems languages, for obvious performance reasons. Note also that Nim strings are easily converted to C strings.
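The points above about literals, escapes, identifiers, unchecked content and C strings can be tried out with a short sketch. It reuses the literals of the earlier example; the other names (`fromEscape`, `notValidUtf8`, the `strlen` import) are illustrative assumptions, not part of the original text.

```nim
# Sketch only: names other than c and Δ are made up for illustration.

# C's strlen, imported to show that a Nim string can be handed to C code.
proc strlen(s: cstring): csize_t {.importc, header: "<string.h>".}

let c = "abcdé"                  # literal typed directly (source file is UTF-8)
let fromEscape = "abcd\u00E9"    # the same text written with a \uXXXX escape
let Δ = 12                       # identifier starting with a non-ASCII letter
let x² = Δ * Δ                   # non-ASCII character after the first one

echo c == fromEscape   # true when the editor stored é as the single code point U+00E9
echo fromEscape.len    # 6: len counts bytes, and é takes two bytes in UTF-8
echo x²                # 144

# No validity check is made: this string is not valid UTF-8 but is accepted.
let notValidUtf8 = "ab\xFF"
echo notValidUtf8.len  # 3

# Explicit conversion to a C string, then a call into the C library.
echo strlen(cstring(fromEscape))   # 6
```

The explicit `cstring(...)` conversion is used here because recent compilers may warn about implicit conversions.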
Interoperability with C is an important feature in Nim (helped by the fact that C is one of the intermediate languages used to produce native code).

::– Is it good at input/output with Unicode?
Again, this is more a question regarding the environment than the language itself. For instance, if a terminal accepts Unicode strings as input and output, Nim programs will be able to read and write them as UTF-8 strings. So, in practice, with modern systems, reading and writing Unicode is flawless (I write this as a user of Unicode myself).

::– Is it convenient to manipulate Unicode strings in the language?
It depends. If manipulation consists of reading and writing, it is easy. But strings are not aware that their content is encoded in UTF-8: if you loop over a string, you get 8-bit character values, not code points. To operate on code points, Nim provides the “unicode” module. It works with only two types: strings, which contain values encoded in UTF-8, and runes, which are in fact UTF-32 values. The module provides procedures to convert strings to and from sequences of runes, iterators to get the code points of a UTF-8 encoded string, a set of operations such as find or append and, of course, a procedure to check the validity of a string (which is not done by default).

::– How broad/deep does the language support Unicode?
The “unicode” module provides only basic functionality. For more advanced Unicode handling, third-party modules are available from the community.

::– What encodings (e.g. UTF-8, UTF-16, etc) can be used?
UTF-8 is the default encoding. UTF-32 is available with the “unicode” module. The standard “encodings” module gives access to many other encodings. The available encodings depend on the operating system: on Unix/Linux the “iconv” library is used, and on Windows the Windows API.

::– Does it support normalization?
The “unicode” module provides only basic Unicode handling.
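As a rough illustration of that basic handling, the sketch below contrasts byte-level and code-point-level access using the standard “unicode” module. The variable names are illustrative assumptions; the string is written with a \uXXXX escape so that its byte content does not depend on the editor.

```nim
import std/unicode

# Sketch only: s, bytes and rs are illustrative names.
let s = "h\u00E9llo"       # "héllo", with é stored as U+00E9

echo s.len                 # 6: number of bytes
echo s.runeLen             # 5: number of code points

# Looping over the string yields 8-bit values, not code points.
var bytes: seq[int]
for ch in s:
  bytes.add ord(ch)
echo bytes

# The runes iterator decodes UTF-8 and yields one Rune per code point.
for r in s.runes:
  echo r, " ", int32(r)

# Round trip: string -> seq[Rune] -> string.
let rs = s.toRunes
echo rs.len                # 5
echo $rs == s              # true

# Validity is checked only on request: -1 means valid,
# otherwise the position of the first invalid byte is returned.
echo validateUtf8(s)           # -1
echo validateUtf8("ab\xFF")    # 2
```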
It doesn’t provide normalization, but at least one library offers this functionality: “nim-normalize” (https://github.com/nitely/nim-normalize).

=={{header|Oforth}}==