Talk:Fivenum

== Large vs not large ==
I removed the requirement, as it seems unrelated to the task. We are faced with a choice here:
* Either the important part is the large dataset. But then, how large? Does the data fit in memory? On a single hard drive? Does it require multiple hard drives in a network of computers? A dataset that fits in memory does not look large to me. Of course, it's a matter of hardware: a server with 256 GB of memory can do in-memory computations that would require a hard drive on most PCs. A really large file would require a network, and technology like [https://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] or [https://en.wikipedia.org/wiki/Apache_Spark Spark], or some other cluster-computing facility. If we insist on requiring all of this (which looks perfectly acceptable, as it would be a good exercise in managing large data), the task becomes much more difficult, or impossible for most languages. And the R solution would be wrong (but I imagine there are packages to do that correctly in R).
* Or the important part is computing these numbers. Then it's all about computing the median and quartiles (min and max are trivially doable in O(n)). A much simpler task, but one every language should be able to do; see the sketch after this list.
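
To make the second reading concrete, here is a minimal in-memory sketch in Python (Python and the sample data are my own choices, not anything from the task) that computes the five numbers on sorted data using Tukey's hinges, which is, as far as I know, what R's fivenum() does. It deliberately ignores the large-data question: it just sorts the whole dataset in memory.

<syntaxhighlight lang="python">
import math

def fivenum(data):
    """Five-number summary (min, lower hinge, median, upper hinge, max)
    using Tukey's hinges, which I believe matches R's fivenum()."""
    x = sorted(data)          # plain in-memory sort: assumes the data fits in RAM
    n = len(x)
    if n == 0:
        raise ValueError("fivenum requires at least one observation")
    n4 = math.floor((n + 3) / 2) / 2
    # 1-based (possibly half-integer) positions of the five summary points
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # a half-integer position is resolved by averaging its two neighbours
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]

print(fivenum([0, 0, 1, 2, 63, 61, 27, 13]))   # arbitrary sample data
# -> [0.0, 0.5, 7.5, 44.0, 63.0]
</syntaxhighlight>

If I haven't misread the R source, fivenum() on the same vector returns the same five values. The point is that the computation itself is a few lines once the data is sorted; all the real difficulty, if any, is in the "large" part.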