Talk:Dutch national flag problem: Difference between revisions

Content added Content deleted

Inline

Revision as of 17:10, 2 July 2012

Flag request

This task needs a picture of the Dutch national flag. --Paddy3118 09:34, 1 July 2012 (UTC)

http://en.wikipedia.org/wiki/Netherlands Fwend 06:04, 2 July 2012 (UTC)

Algorithms

I kinda ducked the issue of optimized algorithms in favour of giving an answer in keeping with how things would be done idiomatically in a language. This shouldn't stop those who want to give famous algorithms as well though. --Paddy3118 09:34, 1 July 2012 (UTC)

That's completely backwards. Making a Dutch flag isn't an idiomatic thing to do no matter what language you use, and the Python and J examples' using built-in sort completely missed the point. What's interesting about the problem is that, if you know sorting the array only involves comparing a samll set of values (3 in this case), you can have an O(n) method instead of O(n log(n)) as with a standard sort. Using an "idiomatic", non-specific method here makes no sense: there are plenty other sorting tasks on RC you can go idiomaticize with, why make yet another one?

BTW, the J comment about bin sort (counting sort) isn't entirely correct: two elements may compare equal for this sorting purpose, but it doesn't mean they are necessarily identical. That is, balls with the same color may still be distinguishable from each other, and the sort routine shouldn't automatically assume otherwise. --Ledrug 02:57, 2 July 2012 (UTC)

The wording does not stop you from using other algorithms and listing their relative strengths. I've just added a second Python entry myself. --Paddy3118 06:39, 2 July 2012 (UTC)

The task, as currently written, does not support the idea of balls containing information which is ignored for the purpose of the sort. (For example, imagine a task which asks us to sort numbers in the range 1000..3999 by their most significant digit, preserving the relative order of values with the same leading digit. Or, imagine a task which asks us to generate a variety of random objects and then sort a list of them based on one of their properties.)

That said, if it did, we could still use the concept of a counting sort. For example, we could replace the counters with stacks (or queues) -- that would still give us an O(n) algorithm. (Or we could use some mathematically equivalent construct.)

That said, note that O(n) algorithm does not guarantee O(n) behavior. The cache architecture of modern machines will add a Heaviside step function (of n, where n is the amount of memory used) to the time needed for the algorithm to complete. And, typical OS (and memory management) infrastructure will add random noise to this time. --Rdm 13:22, 2 July 2012 (UTC)

With three colors, the sorting can be done in-place, so cache miss is not relevant--the O(n) method requires absolutely minimal swaps, so if the entire array can't fit into cache, there isn't anything better you can do anyway. Plus, the same cache issue is in your default sort routine, too. But what's really wrong with initial Python and J examples is that they completely disregarded the essential constraint the problem posed, and provided something uninstructive. In a sense, they are more like counter-examples than examples: what not to do to solve the original problem. --Ledrug 15:36, 2 July 2012 (UTC)

When you say "cache miss is not relevant" you are making assumptions about the language implementation, about the OS, and about the hardware. (You have assumed, for example, that intermediate results will not have to be garbage collected by an arbitrary gc implementation.) Note also that in a typical machine architecture you have fixed costs associated with the hardware -- if you do not use those resources you're wasting them. --Rdm 15:59, 2 July 2012 (UTC)

Cache miss is irrelevant because it's not worse in the O(n) method than in whatever other method you can think of. It's funny how cache suddenly becomes an issue now, but not when you were using a generic sort routine, hmm? In any event, let it be known that I wasn't the one who dragged cache into this discussion, someone else did. --Ledrug 17:01, 2 July 2012 (UTC)

Correct, I raised that issue. I raised it because I tried to devise a test that would let me compare the efficiency of the builtin with the efficiency of an O(n) userspace sort. That said, I failed to point out that cache miss is a measurable issue even for in-place modify: data which is smaller than a cache element will have one timing and data which is larger than a cache element will have a different timing (a different constant multiplier), even for in-place modify. In a test rig, this can look similar to the kind of effect you would see in a low exponent polynomial time implementation, for the kinds of numbers usually reported in benchmarks. Anyways, when builtin sort is faster than userspace sort, and efficiency is the grounds for using userspace sort instead of builtin sort, what does that mean for this task? --Rdm 17:08, 2 July 2012 (UTC)

Ignoring those sort of technical details (because they're not important to this task)...it seems that the original intent of the problem is to find a way to do it that in fact does minimize comparisons and reads on the data. O(n) would do that. O(n log(n)) would not do that. It seems like a big requirement of the original problem. If that isn't going to be part of the task here, then it might not be very valuable. --Mwn3d 16:14, 2 July 2012 (UTC)

Actually, O(n log(n)) would do that for the original data if it's being applied to an extracted result rather than the original data. --Rdm 16:21, 2 July 2012 (UTC)

"ensuring that they are not in the order of the Dutch national flag"

Please ensure that entries check for this too. (J language?). --Paddy3118 22:01, 1 July 2012 (UTC)

That's a low probability event for any reasonably sized collection of balls, so I did it manually in the example. If you want it automatic, replacing ? with (3||.@/:)^:(-:/:~)@:? will work unless the user tries to generate a selection of only one random ball. [I fussed with that one a bit -- the issue behind my fussing was that I wanted a single correction step for all the improbable things that could go wrong in a random sort.] --Rdm 00:58, 2 July 2012 (UTC)

Hi, The Python is set up to generate a randomized, random sample of from one to five balls of each colour so it only has to check that the balls returned are not ordered. I don't read J code but I guess if you have the possibility of generating zero balls of a colour then the check would have to be for more? --Paddy3118 06:29, 2 July 2012 (UTC)

Yes. (3||.@/:)^:(-:/:~)@:? first tries to generate values independently of each other, then it tests to see if they are sorted. If they are sorted, it generates a red,white,blue,red,white... sequence of the same length as the original sequence, and then reverses it. This "eliminate random values which are sorted" thing seems to be a dubious requirement though, for the test data -- as near as I can tell the main thing it accomplishes is allowing sort implementations which are broken when the data is already sorted. --Rdm 13:28, 2 July 2012 (UTC)

@@ Line 22: / Line 22: @@
 :::: When you say "cache miss is not relevant" you are making assumptions about the language implementation, about the OS, and about the hardware.  (You have assumed, for example, that intermediate results will not have to be garbage collected by an arbitrary gc implementation.)  Note also that in a typical machine architecture you have fixed costs associated with the hardware -- if you do not use those resources you're wasting them. --[[User:Rdm|Rdm]] 15:59, 2 July 2012 (UTC)
 ::::: Cache miss is irrelevant because it's not worse in the O(n) method than in whatever other method you can think of.  It's funny how cache suddenly becomes an issue now, but not when you were using a generic sort routine, hmm?  In any event, let it be known that ''I'' wasn't the one who dragged cache into this discussion, someone else did. --[[User:Ledrug|Ledrug]] 17:01, 2 July 2012 (UTC)
-:::::: Correct, I raised that issue.  I raised it because I tried to devise a test that would let me compare the efficiency of the builtin with the efficiency of an O(n) userspace sort.  That said, I failed to point out that cache miss is a measurable issue even for in-place modify: data which is smaller than a cache element will have one timing and data which is larger than a cache element will have a different timing, even for in-place modify.  Anyways, when builtin sort is faster than userspace sort, and efficiency is the grounds for using userspace sort instead of builtin sort, what does that mean for this task?  --[[User:Rdm|Rdm]] 17:08, 2 July 2012 (UTC)
+:::::: Correct, I raised that issue.  I raised it because I tried to devise a test that would let me compare the efficiency of the builtin with the efficiency of an O(n) userspace sort.  That said, I failed to point out that cache miss is a measurable issue even for in-place modify: data which is smaller than a cache element will have one timing and data which is larger than a cache element will have a different timing (a different constant multiplier), even for in-place modify.  In a test rig, this can look similar to the kind of effect you would see in a low exponent polynomial time implementation, for the kinds of numbers usually reported in benchmarks.  Anyways, when builtin sort is faster than userspace sort, and efficiency is the grounds for using userspace sort instead of builtin sort, what does that mean for this task?  --[[User:Rdm|Rdm]] 17:08, 2 July 2012 (UTC)
 :::::Ignoring those sort of technical details (because they're not important to this task)...it seems that the original intent of the problem is to find a way to do it that in fact ''does'' minimize comparisons and reads on the data. O(n) would do that. O(n log(n)) would not do that. It seems like a big requirement of the original problem. If that isn't going to be part of the task here, then it might not be very valuable. --[[User:Mwn3d|Mwn3d]] 16:14, 2 July 2012 (UTC)
 :::::: Actually, O(n log(n)) would do that for the original data if it's being applied to an extracted result rather than the original data.  --[[User:Rdm|Rdm]] 16:21, 2 July 2012 (UTC)