Multisplit: Difference between revisions
(tidyup of task description) |
m (wordliness) |
||
Line 1: | Line 1: | ||
{{draft task}}It is often necessary to split a string into pieces based on several different (potentially multi-character) separator strings, while still retaining the information about which separators were present in the input. This is particularly useful when doing small parsing tasks. |
{{draft task}}It is often necessary to split a string into pieces based on several different (potentially multi-character) separator strings, while still retaining the information about which separators were present in the input. This is particularly useful when doing small parsing tasks. |
||
Write code to demonstrate this. The function (or procedure or method, as appropriate) should take an input string and an ordered collection of |
Write code to demonstrate this. The function (or procedure or method, as appropriate) should take an input string and an ordered collection of separators, and split the string into pieces representing the various substrings. The order of the separators is significant: The delimiter order represents priority in matching. The first defined delimiter has the highest priority, so where there would be an ambiguity as to which separator to use at a particular point (e.g., because one separator is a prefix of another) the first separator in the collection should be used. Delimiters can be reused. The result of the function should be an ordered sequence of substrings. |
||
'''Extra Credit:''' include match information that indicates which separator was matched at each separation point and where in the input string that separator was matched. |
'''Extra Credit:''' include match information that indicates which separator was matched at each separation point and where in the input string that separator was matched. |
Revision as of 08:24, 6 July 2011
It is often necessary to split a string into pieces based on several different (potentially multi-character) separator strings, while still retaining the information about which separators were present in the input. This is particularly useful when doing small parsing tasks.
Write code to demonstrate this. The function (or procedure or method, as appropriate) should take an input string and an ordered collection of separators, and split the string into pieces representing the various substrings. The order of the separators is significant: The delimiter order represents priority in matching. The first defined delimiter has the highest priority, so where there would be an ambiguity as to which separator to use at a particular point (e.g., because one separator is a prefix of another) the first separator in the collection should be used. Delimiters can be reused. The result of the function should be an ordered sequence of substrings.
Extra Credit: include match information that indicates which separator was matched at each separation point and where in the input string that separator was matched.
Test your code using the input string “a!===b=!=c
” and the separators “==
”, “!=
” and “=
”.
For these inputs the string should be parsed as a != == b = != c
.
Ada
multisplit.adb: <lang Ada>with Ada.Containers.Indefinite_Doubly_Linked_Lists; with Ada.Text_IO;
procedure Multisplit is
package String_Lists is new Ada.Containers.Indefinite_Doubly_Linked_Lists (Element_Type => String); use type String_Lists.Cursor;
function Split (Source : String; Separators : String_Lists.List) return String_Lists.List is Result : String_Lists.List; Next_Position : Natural := Source'First; Prev_Position : Natural := Source'First; Separator_Position : String_Lists.Cursor; Separator_Length : Natural; Changed : Boolean; begin loop Changed := False; Separator_Position := Separators.First; while Separator_Position /= String_Lists.No_Element loop Separator_Length := String_Lists.Element (Separator_Position)'Length; if Next_Position + Separator_Length - 1 <= Source'Last and then Source (Next_Position .. Next_Position + Separator_Length - 1) = String_Lists.Element (Separator_Position) then if Next_Position > Prev_Position then Result.Append (Source (Prev_Position .. Next_Position - 1)); end if; Result.Append (String_Lists.Element (Separator_Position)); Next_Position := Next_Position + Separator_Length; Prev_Position := Next_Position; Changed := True; exit; end if; Separator_Position := String_Lists.Next (Separator_Position); end loop; if not Changed then Next_Position := Next_Position + 1; end if; if Next_Position > Source'Last then Result.Append (Source (Prev_Position .. Source'Last)); exit; end if; end loop; return Result; end Split;
Test_Input : constant String := "a!===b=!=c"; Test_Separators : String_Lists.List; Test_Result : String_Lists.List; Pos : String_Lists.Cursor;
begin
Test_Separators.Append ("=="); Test_Separators.Append ("!="); Test_Separators.Append ("="); Test_Result := Split (Test_Input, Test_Separators); Pos := Test_Result.First; while Pos /= String_Lists.No_Element loop Ada.Text_IO.Put (" " & String_Lists.Element (Pos)); Pos := String_Lists.Next (Pos); end loop; Ada.Text_IO.New_Line; -- other order of separators Test_Separators.Clear; Test_Separators.Append ("="); Test_Separators.Append ("!="); Test_Separators.Append ("=="); Test_Result := Split (Test_Input, Test_Separators); Pos := Test_Result.First; while Pos /= String_Lists.No_Element loop Ada.Text_IO.Put (" " & String_Lists.Element (Pos)); Pos := String_Lists.Next (Pos); end loop;
end Multisplit;</lang>
output:
a != == b = != c a != = = b = != c
D
<lang d>import std.stdio, std.array, std.algorithm;
string[] multiSplit(in string s, string[] divisors) {
string[] result; auto rest = s.idup; while (true) {
auto best= ""; auto delim= ""; bool done= true; foreach (div; divisors) { auto maybe= find(rest, div); if (maybe.length > best.length) { best= maybe; delim= div; done= false; } } result.length+= 1; if (done) { result[$ - 1]= rest.idup; return result; } else { auto t= findSplit(rest, delim); result[$ - 1]= t[0].idup; rest= t[2]; }
}
}
void main() {
auto s = "a!===b=!=c"; auto divs = ["==", "!=", "="]; auto lst = multiSplit(s, divs); foreach (i, p; lst) { write(p);
if (i < lst.length-1) {
write(" {} ");
}
} writeln();
}</lang> Output (separator locations indicated by braces):
a {} {} b {} {} c
F#
If we ignore the "Extra Credit" requirements and skip 'ordered separators' condition (i.e. solving absolute different task), this is exactly what one of the overloads of .NET's String.Split
method does. Using F# Interactive:
<lang fsharp>> "a!===b=!=c".Split([|"=="; "!="; "="|], System.StringSplitOptions.None);;
val it : string [] = [|"a"; ""; "b"; ""; "c"|]
> "a!===b=!=c".Split([|"="; "!="; "=="|], System.StringSplitOptions.None);; val it : string [] = [|"a"; ""; ""; "b"; ""; "c"|]</lang>
System.StringSplitOptions.None
specifies that empty strings should be included in the result.
Icon and Unicon
<lang Icon>procedure main()
s := "a!===b=!=c" # just list the tokens every writes(multisplit(s,["==", "!=", "="])," ") | write() # list tokens and indices every ((p := "") ||:= t := multisplit(s,sep := ["==", "!=", "="])) | break write() do if t == !sep then writes(t," (",*p+1-*t,") ") else writes(t," ")
end
procedure multisplit(s,L) s ? while not pos(0) do {
t := =!L | 1( arb(), match(!L)|pos(0) ) suspend t }
end
procedure arb() suspend .&subject[.&pos:&pos <- &pos to *&subject + 1] end</lang>
Sample Output:
a != == b = != c a != (2) == (4) b = (7) != (8) c
J
<lang j>multisplit=:4 :0
'sep begin'=.|:t=. y /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) x end=. begin + sep { #@>y last=.next=.0 r=.2 0$0 while.next<#begin do. r=.r,.(last}.x{.~next{begin);next{t last=.next{end next=.1 i.~(begin>next{begin)*.begin>:last end. r=.r,.;~last}.x
)</lang>
Explanation:
First find all potentially relevant separator instances, and sort them in increasing order, by starting location and separator index. sep
is separator index, and begin
is starting location. end
is ending location.
Then, loop through the possibilities, skipping over those separators which would overlap with previously used separators.
The result consists of two rows: The first row is the extracted substrings, the second row is the "extra credit" part -- for each extracted substring, the numbers in the second row are the separator index (0 for the first index, 1 for the seconde, ...), and the location in the original string where the separator appeared. Note that the very last substring does not have a separator following it, so the extra credit part is blank for that substring.
Example use:
<lang j> S=:'a!===b=!=c'
S multisplit '==';'!=';'='
┌───┬───┬───┬───┬─┐ │a │ │b │ │c│ ├───┼───┼───┼───┼─┤ │1 1│0 3│2 6│1 7│ │ └───┴───┴───┴───┴─┘
S multisplit '=';'!=';'=='
┌───┬───┬───┬───┬───┬─┐ │a │ │ │b │ │c│ ├───┼───┼───┼───┼───┼─┤ │1 1│0 3│0 4│0 6│1 7│ │ └───┴───┴───┴───┴───┴─┘
'X123Y' multisplit '1';'12';'123';'23';'3'
┌───┬───┬─┐ │X │ │Y│ ├───┼───┼─┤ │0 1│3 2│ │ └───┴───┴─┘</lang>
Perl 6
<lang perl6>sub multisplit($str, @seps) { $str.split(/ ||@seps /, :all) }
my @chunks = multisplit( 'a!===b=!=c==d', < == != = > );
- Print the strings.
say @chunks».Str.perl;
- Print the positions of the separators.
for grep Match, @chunks -> $s {
say " $s from $s.from() to $s.to()";
}</lang> Output:
("a", "!=", "", "==", "b", "=", "", "!=", "c", "==", "d") != from 1 to 3 == from 3 to 5 = from 6 to 7 != from 7 to 9 == from 10 to 12
Using the array @seps in a pattern automatically does alternation. By default this would do longest-term matching (that is, | semantics), but we can force it to do left-to-right matching by embedding the array in a short-circuit alternation (that is, || semantics). As it happens, with the task's specified list of separators, it doesn't make any difference.
Perl 6 automatically returns Match objects that will stringify to the matched pattern, but can also be interrogated for their match positions, as illustrated above by post-processing the results two different ways.
PicoLisp
<lang PicoLisp>(de multisplit (Str Sep)
(setq Sep (mapcar chop Sep)) (make (for (S (chop Str) S) (let L (make (loop (T (find head Sep (circ S)) (link (list (- (length Str) (length S)) (pack (cut (length @) 'S)) ) ) ) (link (pop 'S)) (NIL S (link NIL)) ) ) (link (pack (cdr (rot L)))) (and (car L) (link @)) ) ) ) )
(println (multisplit "a!===b=!=c" '("==" "!=" "="))) (println (multisplit "a!===b=!=c" '("=" "!=" "==")))</lang> Output:
("a" (1 "!=") NIL (3 "==") "b" (6 "=") NIL (7 "!=") "c") ("a" (1 "!=") NIL (3 "=") NIL (4 "=") "b" (6 "=") NIL (7 "!=") "c")
Prolog
Works with SWI-Prolog. <lang Prolog>multisplit(_LSep, ) --> {!}, [].
multisplit(LSep, T) --> {next_sep(LSep, T, [], Token, Sep, T1)}, ( {Token \= },[Token], {!}; []), ( {Sep \= },[Sep], {!}; []), multisplit(LSep, T1).
next_sep([], T, Lst, Token, Sep, T1) :- % if we can't find any separator, the game is over ( Lst = [] -> Token = T, Sep = , T1 = ;
% we sort the list to get nearest longest separator predsort(my_sort, Lst, [(_,_, Sep)|_]), atomic_list_concat([Token|_], Sep, T), atom_concat(Token, Sep, Tmp), atom_concat(Tmp, T1, T)).
next_sep([HSep|TSep], T, Lst, Token, Sep, T1) :- sub_atom(T, Before, Len, _, HSep), next_sep(TSep, T, [(Before, Len,HSep) | Lst], Token, Sep, T1).
next_sep([_HSep|TSep], T, Lst, Token, Sep, T1) :- next_sep(TSep, T, Lst, Token, Sep, T1).
my_sort(<, (N1, _, _), (N2, _, _)) :-
N1 < N2.
my_sort(>, (N1, _, _), (N2, _, _)) :- N1 > N2.
my_sort(>, (N, N1, _), (N, N2, _)) :- N1 < N2.
my_sort(<, (N, N1, _), (N, N2, _)) :- N1 > N2. </lang> Output :
?- parse(['==', '!=', '='], 'ax!===b=!=c', Lst, []). Lst = [ax,'!=',==,b,=,'!=',c] .
Python
Using Regular expressions
<lang python>>>> import re >>> def ms2(txt="a!===b=!=c", sep=["==", "!=", "="]): if not txt or not sep: return [] ans = m = [] for m in re.finditer('(.*?)(?:' + '|'.join('('+re.escape(s)+')' for s in sep) + ')', txt): ans += [m.group(1), (m.lastindex-2, m.start(m.lastindex))] if m and txt[m.end(m.lastindex):]: ans += [txt[m.end(m.lastindex):]] return ans
>>> ms2() ['a', (1, 1), , (0, 3), 'b', (2, 6), , (1, 7), 'c'] >>> ms2(txt="a!===b=!=c", sep=["=", "!=", "=="]) ['a', (1, 1), , (0, 3), , (0, 4), 'b', (0, 6), , (1, 7), 'c']</lang>
Not using RE's
<lang python>>>> def ms(txt="a!===b=!=c", sep=["==", "!=", "="]): if not txt or not sep: return [] size = [len(s) for s in sep] ans, pos0 = [], 0 def getfinds(): return [(-txt.find(s, pos0), -sepnum, size[sepnum]) for sepnum, s in enumerate(sep) if s in txt[pos0:]]
finds = getfinds() while finds: pos, snum, sz = max(finds) pos, snum = -pos, -snum ans += [ txt[pos0:pos], [snum, pos] ] pos0 = pos+sz finds = getfinds() if txt[pos0:]: ans += [ txt[pos0:] ] return ans
>>> ms() ['a', [1, 1], , [0, 3], 'b', [2, 6], , [1, 7], 'c'] >>> ms(txt="a!===b=!=c", sep=["=", "!=", "=="]) ['a', [1, 1], , [0, 3], , [0, 4], 'b', [0, 6], , [1, 7], 'c']</lang>
Alternative version <lang python>def min_pos(List): return List.index(min(List))
def find_all(S, Sub, Start = 0, End = -1, IsOverlapped = 0): Res = [] if End == -1: End = len(S) if IsOverlapped: DeltaPos = 1 else: DeltaPos = len(Sub) Pos = Start while True: Pos = S.find(Sub, Pos, End) if Pos == -1: break Res.append(Pos) Pos += DeltaPos return Res
def multisplit(S, SepList): SepPosListList = [] SLen = len(S) SepNumList = [] ListCount = 0 for i, Sep in enumerate(SepList): SepPosList = find_all(S, Sep, 0, SLen, IsOverlapped = 1) if SepPosList != []: SepNumList.append(i) SepPosListList.append(SepPosList) ListCount += 1 if ListCount == 0: return [S] MinPosList = [] for i in range(ListCount): MinPosList.append(SepPosListList[i][0]) SepEnd = 0 MinPosPos = min_pos(MinPosList) Res = [] while True: Res.append( S[SepEnd : MinPosList[MinPosPos]] ) Res.append([SepNumList[MinPosPos], MinPosList[MinPosPos]]) SepEnd = MinPosList[MinPosPos] + len(SepList[SepNumList[MinPosPos]]) while True: MinPosPos = min_pos(MinPosList) if MinPosList[MinPosPos] < SepEnd: del SepPosListList[MinPosPos][0] if len(SepPosListList[MinPosPos]) == 0: del SepPosListList[MinPosPos] del MinPosList[MinPosPos] del SepNumList[MinPosPos] ListCount -= 1 if ListCount == 0: break else: MinPosList[MinPosPos] = SepPosListList[MinPosPos][0] else: break if ListCount == 0: break Res.append(S[SepEnd:]) return Res
S = "a!===b=!=c"
multisplit(S, ["==", "!=", "="]) # output: ['a', [1, 1], , [0, 3], 'b', [2, 6], , [1, 7], 'c']
multisplit(S, ["=", "!=", "=="]) # output: ['a', [1, 1], , [0, 3], , [0, 4], 'b', [0, 6], , [1, 7], 'c']</lang>
Ruby
The simple method, using a regular expression to split the text.
<lang ruby>text = 'a!===b=!=c' separators = ['==', '!=', '=']
def multisplit_simple(text, separators)
sep_regex = Regexp.new(separators.collect {|sep| Regexp.escape(sep)}.join('|')) text.split(sep_regex)
end
p multisplit_simple(text, separators) ["a", "", "b", "", "c"] => nil p multisplit_simple(text, ['=', '!=', '==']) ["a", "", "", "b", "", "c"] => nil</lang>
The version that also returns the information about the separations.
<lang ruby>def multisplit(text, separators)
sep_regex = Regexp.new(separators.collect {|sep| Regexp.escape(sep)}.join('|')) separator_info = [] pieces = [] i = prev = 0 while i = text.index(sep_regex, i) separator = Regexp.last_match(0) pieces << text[prev .. i-1] separator_info << [separator, i] i = i + separator.length prev = i end pieces << text[prev .. -1] [pieces, separator_info]
end
p multisplit(text, separators)
- => [["a", "", "b", "", "c"], [["!=", 1], ["==", 3], ["=", 6], ["!=", 7]]]</lang>
Also demonstrating a method to rejoin the string given the separator information.
<lang ruby>def multisplit_rejoin(info)
str = info[0].zip(info[1])[0..-2].inject("") {|str, (piece, (sep, idx))| str << piece << sep} str << info[0].last
end
p multisplit_rejoin(multisplit(text, separators)) == text
- => true</lang>
Tcl
This simple version does not retain information about what the separators were: <lang tcl>proc simplemultisplit {text sep} {
set map {}; foreach s $sep {lappend map $s "\uffff"} return [split [string map $map $text] "\uffff"]
} puts [simplemultisplit "a!===b=!=c" {"==" "!=" "="}]</lang>
Output:
a {} b {} c
However, to keep the match information a more sophisticated technique is best. Note that the most natural model of result here is to return the split substrings as a separate list to the match information (because the two collections of information are of different lengths). <lang tcl>proc multisplit {text sep} {
foreach s $sep {lappend sr [regsub -all {\W} $s {\\&}]} set sepRE [join $sr "|"] set pieces {} set match {} set start 0 while {[regexp -indices -start $start -- $sepRE $text found]} {
lassign $found x y lappend pieces [string range $text $start [expr {$x-1}]] lappend match [lsearch -exact $sep [string range $text {*}$found]] $x set start [expr {$y + 1}]
} return [list [lappend pieces [string range $text $start end]] $match]
}</lang> Demonstration code: <lang tcl>set input "a!===b=!=c" set matchers {"==" "!=" "="} lassign [multisplit $input $matchers] substrings matchinfo puts $substrings puts $matchinfo</lang> Output:
a {} b {} c 1 1 0 3 2 6 1 7