Talk:Reverse a string: Difference between revisions

 
Um. If your character set includes Unicode, a reversing routine should handle it. If your character set does not include Unicode, the reversing routine need not handle it. --[[User:Kevin Reid|Kevin Reid]] 00:41, 30 July 2009 (UTC)
 
== Notes about Unicode combining characters ==
 
[[Ruby]] has the regular expression <tt>/\p{M}/</tt> which matches a combining mark. With this expression, I might be able to reverse a string while preserving the combining marks.
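
A minimal sketch of what <tt>/\p{M}/</tt> picks out, and why a plain <tt>String#reverse</tt> is not enough. It assumes Ruby 1.9 or later and a UTF-8 string; the sample string is only an illustration, not part of the task.

```ruby
# Assumes Ruby 1.9+ and UTF-8 text; the sample string is illustrative only.
s = "e\u0301x"             # "e" + U+0301 COMBINING ACUTE ACCENT + "x"

marks = s.scan(/\p{M}/)    # every combining mark in s  => ["\u0301"]
naive = s.reverse          # reverses code point by code point

# After the naive reversal the accent follows "x" instead of "e",
# so it would be rendered on the wrong letter.
puts naive == "x\u0301e"   # prints "true"
```

So the marks have to travel together with the character they follow.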
 
# The most relevant parts of [http://www.unicode.org/versions/Unicode6.0.0/ Unicode 6.0.0] seem to be section 3.6 "Combination" and section 3.12 "Conjoining Jamo Behavior".
# I am not yet certain whether to preserve "combining character sequences" or "grapheme clusters". My best guess for now is to preserve the combining character sequences (CCS), not the grapheme clusters.
# The regular expression for a CCS-or-char might look like <tt>/(?>#{base}\p{M}*|\p{M}+|.)/</tt> where <tt>#{base}</tt> is whatever regular expression matches a base character or extended base. The <tt>(?>...)</tt> atomic group prevents backtracking, so once <tt>\p{M}*</tt> has consumed every following combining mark the regexp cannot give any of them back, and it always matches the longest possible CCS.
# I need some way in Ruby to comb a string for all matches of a regular expression. For example, with <tt>/[aeiou]./</tt> and <tt>"Rosetta Code"</tt>, I want <tt>["os", "et", "a ", "od"]</tt>. Then I would comb a string for CCS-or-char, reverse the array, and join it.
# Korean hangul is a special case. A group of 2 or 3 jamo characters might form an extended base (a syllable with a leading consonant, a medial vowel and perhaps a trailing consonant). Because a CCS may contain an extended base, I need some way to group jamos.
# I probably want a Korean test string. I must enter this string with jamo characters, not syllable characters, to test the code to group jamos.
# If EUC-KR has jamo characters, then the code should work with both EUC-KR and UTF-8.
# Avoid normalization. A normalization to NFC would replace some CCS with individual characters, but Unicode does not have individual characters for every possible sequence.
# Do I have a library that already does some of this?
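
The notes above can be sketched as follows. This is only a rough attempt under stated assumptions: <tt>String#scan</tt> does the "combing"; <tt>JAMO</tt>, <tt>BASE</tt>, <tt>CCS_OR_CHAR</tt> and <tt>reverse_keeping_marks</tt> are hypothetical names; the jamo pattern covers only the U+1100..U+11FF block (leading, vowel and trailing ranges) and ignores the extended jamo blocks; and any single non-mark character is treated as a base. It keeps combining character sequences together, not full grapheme clusters, and it does no normalization.

```ruby
# String#scan returns every non-overlapping match, left to right.
pairs = "Rosetta Code".scan(/[aeiou]./)   # => ["os", "et", "a ", "od"]

# Hypothetical pattern for a jamo group: one or more leading consonants,
# one or more vowels, then optional trailing consonants.
# Assumption: only the main Hangul Jamo block (U+1100..U+11FF) is covered.
JAMO = /[\u1100-\u115F]+[\u1160-\u11A7]+[\u11A8-\u11FF]*/

# Hypothetical stand-in for #{base}: a jamo group, else any non-mark character.
BASE = /#{JAMO}|\P{M}/

# A base followed by its marks, or a run of marks with no base.
# The atomic group (?>...) keeps the match from giving marks back.
CCS_OR_CHAR = /(?>#{BASE}\p{M}*|\p{M}+)/

# Comb the string for CCS-or-char, reverse the array, join.
def reverse_keeping_marks(str)
  str.scan(CCS_OR_CHAR).reverse.join
end

# "s" keeps U+20DD and "f" keeps U+0305 after the reversal.
puts reverse_keeping_marks("as\u20DDdf\u0305") == "f\u0305ds\u20DDa"   # prints "true"
```

For a Korean test entered as jamo, <tt>"\u1112\u1161\u11AB\u1100\u1173\u11AF"</tt> (two syllables of three jamo each) should come back with the two groups swapped but each group intact. Whether this works unchanged with EUC-KR is not something the sketch checks.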
 
--[[User:Kernigh|Kernigh]] 04:00, 31 January 2011 (UTC)