Multisplit: Difference between revisions


Revision as of 10:24, 28 February 2011

It is often necessary to split a string into pieces based on several different (potentially multi-character) separator strings, while still retaining the information about which separators were present in the input. This is particularly useful when doing small parsing tasks.

Write code to demonstrate this. The function (or procedure or method, as appropriate) should take an input string and an ordered collection of separator strings, and split the string into pieces representing the various substrings. Note that the order of the separators is significant; where there would otherwise be an ambiguity as to which separator to use at a particular point (e.g., because one separator is a prefix of another) the first separator in the collection should be used. The result of the function should be an ordered sequence of substrings.

Extra Credit: include match information that indicates which separator was matched at each separation point and where in the input string that separator was matched.

Test your code using the input string “a!===b=!=c” and the separators “==”, “!=” and “=”.
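The basic requirement (substrings only, without the extra-credit match information) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: the name `multisplit_basic` is hypothetical, empty separators are simply ignored, and a trailing empty substring is kept.

```python
def multisplit_basic(text, separators):
    """Split text on any of the separators; earlier separators win ties."""
    pieces = []
    start = 0   # start of the substring currently being collected
    pos = 0     # current scan position
    while pos < len(text):
        for sep in separators:              # tried in priority order
            if sep and text.startswith(sep, pos):
                pieces.append(text[start:pos])
                pos += len(sep)
                start = pos
                break
        else:
            pos += 1                        # no separator starts here
    pieces.append(text[start:])             # trailing substring (may be empty)
    return pieces


print(multisplit_basic("a!===b=!=c", ["==", "!=", "="]))
# ['a', '', 'b', '', 'c']
print(multisplit_basic("a!===b=!=c", ["=", "!=", "=="]))
# ['a', '', '', 'b', '', 'c']
```

Reordering the separators changes the result because “=” now takes precedence over “==” wherever both could match.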

J

<lang j>multisplit=:4 :0

 'sep begin'=.|:t=. y /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) x
 end=. begin + sep { #@>y
 last=.next=.0
 r=.2 0$0
 while.next<#begin do.
   r=.r,.(last}.x{.~next{begin);next{t
   last=.next{end
   next=.1 i.~(begin>next{begin)*.begin>:last
 end.
 r=.r,.;~last}.x

)</lang>

Explanation:

First, find every potentially relevant separator instance and sort the instances in increasing order by starting location, then by separator index. Here sep is the separator index, begin is the starting location, and end is the ending location of each instance.

Then, loop through the possibilities, skipping over those which conflict with the currently selected sequence.

Example use:

<lang j>   S multisplit '==';'!=';'='
┌───┬───┬───┬───┬─┐
│a  │   │b  │   │c│
├───┼───┼───┼───┼─┤
│1 1│0 3│2 6│1 7│ │
└───┴───┴───┴───┴─┘

   S multisplit '=';'!=';'=='
┌───┬───┬───┬───┬───┬─┐
│a  │   │   │b  │   │c│
├───┼───┼───┼───┼───┼─┤
│1 1│0 3│0 4│0 6│1 7│ │
└───┴───┴───┴───┴───┴─┘

   'X123Y' multisplit '1';'12';'123';'23';'3'
┌───┬───┬─┐
│X  │   │Y│
├───┼───┼─┤
│0 1│3 2│ │
└───┴───┴─┘</lang>

Python

Using Regular expressions

<lang python>>>> import re
>>> def ms2(txt="a!===b=!=c", sep=["==", "!=", "="]):
    if not txt or not sep:
        return []
    ans = m = []
    for m in re.finditer('(.*?)(?:' + '|'.join('('+re.escape(s)+')' for s in sep) + ')', txt):
        ans += [m.group(1), (m.lastindex-2, m.start(m.lastindex))]
    if m and txt[m.end(m.lastindex):]:
        ans += [txt[m.end(m.lastindex):]]
    return ans

>>> ms2()
['a', (1, 1), '', (0, 3), 'b', (2, 6), '', (1, 7), 'c']
>>> ms2(txt="a!===b=!=c", sep=["=", "!=", "=="])
['a', (1, 1), '', (0, 3), '', (0, 4), 'b', (0, 6), '', (1, 7), 'c']</lang>

Not using regular expressions

<lang python>>>> def ms(txt="a!===b=!=c", sep=["==", "!=", "="]):
    if not txt or not sep:
        return []
    size = [len(s) for s in sep]
    ans, pos0 = [], 0
    def getfinds():
        return [(-txt.find(s, pos0), -sepnum, size[sepnum])
                for sepnum, s in enumerate(sep) if s in txt[pos0:]]

    finds = getfinds()
    while finds:
        pos, snum, sz = max(finds)
        pos, snum = -pos, -snum
        ans += [ txt[pos0:pos], [snum, pos] ]
        pos0 = pos+sz
        finds = getfinds()
    if txt[pos0:]:
        ans += [ txt[pos0:] ]
    return ans

>>> ms()
['a', [1, 1], '', [0, 3], 'b', [2, 6], '', [1, 7], 'c']
>>> ms(txt="a!===b=!=c", sep=["=", "!=", "=="])
['a', [1, 1], '', [0, 3], '', [0, 4], 'b', [0, 6], '', [1, 7], 'c']</lang>

Small inaccuracy in the version above: ms("", ["="]) outputs [] instead of [''].
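One possible way to address this is sketched below. It assumes that the trailing substring should always be emitted, even when it is empty, so an empty input yields a single empty substring. The name `ms_fixed` is hypothetical, and the sketch also changes the result for inputs that end in a separator, which gain a final empty string.

```python
def ms_fixed(txt="a!===b=!=c", sep=("==", "!=", "=")):
    """Variant of ms() that always emits the trailing substring."""
    ans, pos0 = [], 0
    while True:
        # Earliest occurrence of each separator at or after pos0.
        # Tuples compare by position first, then by separator index.
        finds = [(txt.find(s, pos0), num) for num, s in enumerate(sep)
                 if s and txt.find(s, pos0) != -1]
        if not finds:
            break
        pos, num = min(finds)
        ans += [txt[pos0:pos], [num, pos]]
        pos0 = pos + len(sep[num])
    ans.append(txt[pos0:])   # kept even when empty, so ms_fixed("") == ['']
    return ans


print(ms_fixed("", ["="]))
# ['']
print(ms_fixed())
# ['a', [1, 1], '', [0, 3], 'b', [2, 6], '', [1, 7], 'c']
```

Using `min` on (position, index) tuples replaces the negated tuples and `max` of the original while selecting the same separator at each step.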

Alternative version

<lang python>def min_pos(List):
    return List.index(min(List))

def find_all(S, Sub, Start = 0, End = -1, IsOverlapped = 0):
    Res = []
    if End == -1:
        End = len(S)
    if IsOverlapped:
        DeltaPos = 1
    else:
        DeltaPos = len(Sub)
    Pos = Start
    while 1:
        Pos = S.find(Sub, Pos, End)
        if Pos == -1:
            break
        Res.append(Pos)
        Pos += DeltaPos
    return Res

def multisplit(S, SepList):
    SepPosListList = []
    SLen = len(S)
    SepNumList = []
    ListCount = 0
    for i in range(len(SepList)):
        Sep = SepList[i]
        SepPosList = find_all(S, Sep, 0, SLen, IsOverlapped = 1)
        if SepPosList != []:
            SepNumList.append(i)
            SepPosListList.append(SepPosList)
            ListCount += 1
    if ListCount == 0:
        return [S]
    MinPosList = []
    for i in range(ListCount):
        MinPosList.append(SepPosListList[i][0])
    SepEnd = 0
    MinPosPos = min_pos(MinPosList)
    Res = []
    while 1:
        Res.append( S[SepEnd : MinPosList[MinPosPos]] )
        Res.append([SepNumList[MinPosPos], MinPosList[MinPosPos]])
        SepEnd = MinPosList[MinPosPos] + len(SepList[SepNumList[MinPosPos]])
        while 1:
            MinPosPos = min_pos(MinPosList)
            if MinPosList[MinPosPos] < SepEnd:
                del(SepPosListList[MinPosPos][0])
                if len(SepPosListList[MinPosPos]) == 0:
                    del(SepPosListList[MinPosPos])
                    del(MinPosList[MinPosPos])
                    del(SepNumList[MinPosPos])
                    ListCount -= 1
                    if ListCount == 0:
                        break
                else:
                    MinPosList[MinPosPos] = SepPosListList[MinPosPos][0]
            else:
                break
        if ListCount == 0:
            break
    Res.append(S[SepEnd:])
    return Res

S = "a!===b=!=c"
multisplit(S, ["==", "!=", "="]) # output: ['a', [1, 1], '', [0, 3], 'b', [2, 6], '', [1, 7], 'c']
multisplit(S, ["=", "!=", "=="]) # output: ['a', [1, 1], '', [0, 3], '', [0, 4], 'b', [0, 6], '', [1, 7], 'c']
</lang>

@@ Line 1: / Line 1: @@
+{{draft task}}It is often necessary to split a string into pieces based on several different (potentially multi-character) separator strings, while still retaining the information about which separators were present in the input. This is particularly useful when doing small parsing tasks.
-{{draft task}}Code to split string with several separators.<br>
-Input: string, list of separators<br>
+Write code to demonstrate this. The function (or procedure or method, as appropriate) should take an input string and an ordered collection of separator strings, and split the string into pieces representing the various substrings. Note that the order of the separators is significant; where there would otherwise be an ambiguity as to which separator to use at a particular point (e.g., because one separator is a prefix of another) the first separator in the collection should be used. The result of the function should be an ordered sequence of substrings.
-Output: [Sub0, [Sep0Num, Sep0Pos], Sub1, [Sep1Num, Sep1Pos], ..., SubN]<br>
-Note: Sub - substring, SepNum - separator number in input list, SepPos - separator position in input string.<br>
+'''Extra Credit:''' include match information that indicates which separator was matched at each separation point and where in the input string that separator was matched.
-Input order of separators is important: they are considered in that order.
+Test your code using the input string “<code>a!===b=!=c</code>” and the separators “<code>==</code>”, “<code>!=</code>” and “<code>=</code>”.
 =={{header|J}}==