Talk:String case

:I disagree: this is a question of the abstraction level the language provides. On a high level, a string is a collection of characters, and I really do not care how it is encoded internally. I may care when talking to the outside world via a file or socket. On a low level, a string is a sequence of bytes, which have to be interpreted according to a rule in order to know which character is represented. If a programming language mixes those two, you are in trouble, as you need to know the encoding in order for a string to be interpreted. Higher-level String datatypes should hide this (separate those two concerns) and provide conversions. Smalltalk, Java, JS and many others do it. In Smalltalk, for example, I would write <code>(CharacterEncoder encoderFor: #'iso8859-5') encode: 'someString'</code> to get a string in a particular encoding. The result is conceptually no longer a sequence of characters, but a sequence of bytes which represent those characters in that particular encoding. So, as soon as you ask for a particular encoding to be part of the task, we are actually no longer talking about the language's String implementation and capabilities, but instead about the language's byte-collection support. Of course, I see the problem that in many low-level languages these are the same. [[User:Cg|Cg]] 09:37, 25 January 2013 (UTC)
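:A minimal sketch of that separation in Java, which is mentioned above: a <code>String</code> carries no encoding of its own, and asking for a particular encoding yields a <code>byte[]</code>. The class and method names below are from the standard library; the string literal is only a placeholder.
<pre>
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "someString";                                  // a sequence of characters, no encoding attached
        byte[] iso = s.getBytes(Charset.forName("ISO-8859-5"));  // the characters as bytes in one particular encoding
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // the same characters, different bytes
        String back = new String(utf8, StandardCharsets.UTF_8);  // decoding with the matching rule restores the characters
        System.out.println(s.equals(back));                      // prints: true
    }
}
</pre>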
 
::Conceptually, though, the issue of representing fat Unicode characters is not much different from the issue of supporting different numeric types (consider, especially, the distinction between <code>signed char</code> and <code>float</code> in C). The significant differences between numeric types and character types are the conversion process, along with the fact that most non-ASCII characters are represented by a sequence of bytes in UTF-8 rather than by a single byte. That said, in this context we are not talking about which storage formats the language uses to represent Unicode characters - any language that can represent bytes can represent sequences of Unicode characters. And we do not know, without knowing the language, whether 'String' represents a Unicode type, or an ASCII type, or whether it even exists in a particular language. In other words, I am inclined to consider Short Circuit's point of view to be more relevant here than Cg's disagreement. Still, I agree with both that expanding this to Unicode will significantly increase the complexity of the task. Simply representing the translation between upper and lower case, in a language which does not implement that for you, will be bulky. --[[User:Rdm|Rdm]] 13:24, 25 January 2013 (UTC)
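::As a rough illustration of that last point, here is a hand-rolled, ASCII-only upper-casing sketch in Java (the class and method names are invented for the example): even this trivial subset needs an explicit mapping, and full Unicode case rules would need large tables and locale handling on top of it.
<pre>
// ASCII-only upper-casing done by hand; everything outside 'a'..'z' is passed through untouched.
public class ManualCase {
    static String toUpperAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            out.append(c >= 'a' && c <= 'z' ? (char) (c - ('a' - 'A')) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUpperAscii("alphaBETA"));  // prints: ALPHABETA
    }
}
</pre>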
 
== C++ ==