Talk:Multisplit: Difference between revisions

From Rosetta Code
Content added Content deleted
Line 197: Line 197:
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1
(i.&1)0 1 1 1 1 1 1 1 1 *. 0 0 0 1 1 1 1 1 1
(i.&1)0 1 1 1 1 1 1 1 1 *. 0 0 0 1 1 1 1 1 1
3</code>
3</lang>


The next time through the loop, <code>next> will be 3. And, when we are done, one of these bitmasks will be all zeros, so next wil l be larger than the the bitmask length.
The next time through the loop, <code>next> will be 3. And, when we are done, one of these bitmasks will be all zeros, so next wil l be larger than the the bitmask length.

Revision as of 17:27, 28 February 2011

What is going on here?

So I see that we have this for some sample output:

['a', [1, 1], '', [0, 3], 'b', [2, 6], '', [1, 7], 'c']

What's with the empty strings there? Is it saying that there are empty strings at position 3 and 7 in the input string and that they correspond to separators 0 and 1? This doesn't make sense to me at all. --Mwn3d 04:27, 27 February 2011 (UTC)

The task description doesn't stand on its own at the moment. Do you mean to have arbitrary interpretation of what constitutes a match for example?
I.e. given a text of 'X123Y' and separators '1', '12', '123', '23', and '3' then there are a multitude of possible answers which would fit the initial task description. This may lead to answers that are difficult to compare due to diverging interpretations of the spec.

The output format seems to be Python specific. Do you mean it to be or could any ordered output of sub<N>, sepnum<N>, seppos<N>, ... work? --Paddy3118 04:46, 27 February 2011 (UTC)

@Mwn3d: This format is for convenience: at even (from 0) positions we have strings, at odd - separators. Different formats are suitable for different tasks.
@Paddy3118: Separators are considered in input order.

Input: 'X123Y', ['1', '12', '123', '23', '3']
Output: ['X', [0, 1], '', [3, 2], 'Y']

It is acceptable to output this information in data structures native to programming language. --DSblizzard

'a' is at an odd position according to the output, but it's a string. And that still doesn't explain where "''" came from. --Mwn3d 03:07, 28 February 2011 (UTC)
odd - if we will count from 1, OK. This format with empty strings allows us to quickly answer, what is the 1548-th separator in a string, without looking over first 1547 separators. Another format which allows to answer such question will be more complicated. --DSblizzard

I've reworded the task so that it actually describes the task rather than referring to an implementation. I'm also trying to avoid having it state that the strings and separator information have to be interleaved; that's a very odd thing to do in some languages. My aim was that the solutions given should continue to be solutions, but that the description won't stop other ways of doing the challenge; after all, we want solutions to tasks to be as idiomatic as possible in their language. –Donal Fellows 10:31, 28 February 2011 (UTC)

Small inaccuracy in the smaller non-RE Python version?

This is a reason to tighten the task description, as I think the task description relies too much on the original Python implementation at the moment. --Paddy3118 08:16, 28 February 2011 (UTC)

I'm not tempted in a formulation of Rosetta tasks, therefore feel free to rewrite the task as you see fit.--DSblizzard 09:33, 28 February 2011 (UTC)

J implementation

implementation

The J implementation currently looks like this:

<lang j>multisplit=:4 :0

 'sep begin'=.|:t=. y /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) x
 end=. begin + sep { #@>y
 last=.next=.0
 r=.2 0$0
 while.next<#begin do.
   r=.r,.(last}.x{.~next{begin);next{t
   last=.next{end
   next=.1 i.~(begin>next{begin)*.begin>:last
 end.
 r=.r,.;~last}.x

)</lang>

The first line uses a fair bit of point free code (because it was convenient, and easy for me) and many Rosetta readers are likely to have problems reading it. And task currently has people puzzled, so perhaps a fuller discussion would also be helpful.

sep begin

The first line of this multisplit defines two variables: sep and begin

<lang j> y=. '==';'!=';'='

  x=. 'a!===b=!=c'
  'sep begin'=.|:t=. y /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) x
  sep

1 0 2 0 2 2 2 1 2

  begin

1 2 2 3 3 4 6 7 8</lang>

(Note: its normally bad practice to use variables named 'x' and 'y' because they get values bound to them in functions. However, for purposes of illustration I think it's fine to pull code out of a function and use that with explicitly assigned values.)

Here, sep is a list of 0, 1 or 2, with 0 corresponding to '==', 1 corresponding to '!=' and 2 corresponding to '='. And, begin is a list of corresponding character positions. The values in begin are in non-decreasing order, and the values in sep are ascending when they correspond to equal values in begin.

So, how does that work?

E. finds locations where substrings match:

<lang j> '==' E. 'a!===b=!=c' 0 0 1 1 0 0 0 0 0 0</lang>

In other words, for each character in the right argument, we have a 1 if the substring in the left argument begins there.

Or, we can get indices instead of a bit mask:

<lang j> '==' I.@E. 'a!===b=!=c' 2 3</lang>

Except we need to deal with multiple separators, and they can be differing lengths which means that representing them as a homogeneous array would be bad (every item in a homogeneous array must have the same length). But we can put them in boxes, and have an array of boxes:

<lang j> '==';'!=';'=' ┌──┬──┬─┐ │==│!=│=│ └──┴──┴─┘</lang>

But now we need to tell I.@E. that it is not supposed to work on the boxes directly, instead it needs to work inside the boxes.

<lang j> ('==';'!=';'=') I.@E.L:0 'a!===b=!=c' ┌───┬───┬─────────┐ │2 3│1 7│2 3 4 6 8│ └───┴───┴─────────┘</lang>

Ok, so great, but now we need to combine the results from those different boxes so we can work with them. Except, before that, right now the boxes distinguish which separator we are using, and we are going to need to add that, explicitly, so we do not lose track of that information after the boxes are gone. In other words, we have three boxes so we need three different values to label their contents with:

<lang j> i.@# '==';'!=';'=' 0 1 2

  ('==';'!=';'=') i.@#@[ 'a!===b=!=c'

0 1 2</lang>

So, let's add the labels to each of the values in each of the boxes:

<lang j> 0 1 2 (,.L:0"0) 2 3;1 7;2 3 4 6 8 ┌───┬───┬───┐ │0 2│1 1│2 2│ │0 3│1 7│2 3│ │ │ │2 4│ │ │ │2 6│ │ │ │2 8│ └───┴───┴───┘

  ('==';'!=';'=') (i.@#@[ ,.L:0"0 I.@E.L:0) 'a!===b=!=c'

┌───┬───┬───┐ │0 2│1 1│2 2│ │0 3│1 7│2 3│ │ │ │2 4│ │ │ │2 6│ │ │ │2 8│ └───┴───┴───┘</lang>

(In the first example, we needed parenthesis to tell the computer that 0 2 3 was not meant to be combined. In the second example, we need parenthesis to tell the computer that we want the result of the left and right verbs to be the arguments for the verb in the middle.)

So, now we are ready to combine them:

<lang j>  ;('==';'!=';'=') (i.@#@[ ,.L:0"0 I.@E.L:0) 'a!===b=!=c' 0 2 0 3 1 1 1 7 2 2 2 3 2 4 2 6 2 8</lang>

and sort them:

<lang j> ('==';'!=';'=') /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) 'a!===b=!=c' 1 1 0 2 2 2 0 3 2 3 2 4 2 6 1 7 2 8</lang>

This suggests a simplification -- The &.:(."1) means we are reversing each row before we sort and then reversing them back after each sort. But this busy work could be eliminated by reversing the order of the columns. I might change the code (and this writeup), but I am not going to do that now -- and someone else can (and delete this paragraph) if they get to that before I do.

Finally, to assign this array to two different values, we need an array of two items, so we transpose the array.

<lang j> |:('==';'!=';'=') /:~&.:(|."1)@;@(i.@#@[ ,.L:0"0 I.@E.L:0) 'a!===b=!=c' 1 0 2 0 2 2 2 1 2 1 2 2 3 3 4 6 7 8</lang>

But before transposing it we save a copy in the variable t because we need that for the "extra credit" part of this task.

other pre-loop setup

<lang j> end=. begin + sep { #@>y

 last=.next=.0
 r=.2 0$0</lang>

end has the same structure as begin and sep, and gives the character position where each separator ends. It's just the beginning character location plus the length of each corresponding separator.

next will be our loop variable -- it's the index of the next value from begin/sep/t/end to be used.

last will be our "the last separator ended here" value, which we will be using to search for the next value for next

Finally, r will hold our result. It's arranged horizontally rather than vertically, because that uses less screen real estate on rosettacode. And, fortuitously, that makes it easier to select out just the string values (the useful part) from the "extra credit" result.

while. loop

next is an index into any of begin/sep/t/end and when it goes off the end, we are done.

<lang j>r=.r,.(last}.x{.~next{begin);next{t</lang>

This has three important parts:

  1. r=.r,.value (the result builder).
  2. last}.x{.~next{begin (the substring extractor), and
  3. next{t (the extra credit assignment)

The first time through the loop, last and <next> will be 0, and the first value from begin is 1, so last}.x{.~next{begin is equivalent to 0}.'a!===b=!=c'{.~1 or: take the first 1 character from that string and drop the first 0 characters from it. That's the substring containing the single character 'a'. Likewise, 0{t will be the value 1 1.

Next, we save the value where this delimiter instance ends. In other words last takes on the value 3.

Finally, we use that ending location to find our next delimiter instance. begin>:last, the first time through the loop, is equivalent to

<lang j> 1 2 2 3 3 4 6 7 8 >: 3 0 0 0 1 1 1 1 1 1</lang>

And, that would be enough, but I did not want an infinite loop if someone happened to use an empty delimiter, so I added an explicit test for that. It's just a safety measure, and does not actually matter for the required test case. But next{begin is still 1, so:

<lang j> 1 2 2 3 3 4 6 7 8 > 1 0 1 1 1 1 1 1 1 1</lang>

Anyways, I find the index of the first value where these two bitmasks are both set. (*. is J's boolean and operator.) <lang j> 0 1 1 1 1 1 1 1 1 *. 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1

  (i.&1)0 1 1 1 1 1 1 1 1 *. 0 0 0 1 1 1 1 1 1

3</lang>

The next time through the loop, next> will be 3. And, when we are done, one of these bitmasks will be all zeros, so next wil l be larger than the the bitmask length.

ending

Finally, we have one last bit of code to execute: <lang j>r=.r,.;~last}.x</lang>

This gives us the "tail end" -- everything appearing after the last valid separator (or the entire argument string if none were valid). Since there is no separator terminating this instance, I use an empty value to represent its details.