Talk:Statistics/Basic: Difference between revisions

Revision as of 18:03, 2 July 2011

Wrong emphasis in 'Extra'?

Once one example shows how to calculate them by keeping running sums of x and x-squared, all the others should be able to copy it. Better to just add a reference to the other formulas so we can compare the language implementations.
I guess this is because RC is about showing off language capabilities and (trying to) rely less on the knowledge of individual contributors. --Paddy3118 12:53, 2 July 2011 (UTC)
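
For readers comparing implementations, the running-sums approach mentioned above might look roughly like the following in Python. This is only a sketch: the function name `running_stats` is made up for illustration, and it computes the population (not sample) standard deviation, which is an assumption about what the task wants.

```python
import math


def running_stats(values):
    """Return (mean, population stddev) using running sums of x and x*x.

    `running_stats` is a hypothetical name used for illustration only.
    """
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum_x / n
    # Var(x) = E[x^2] - (E[x])^2; clamp at zero because rounding
    # can push the difference slightly negative.
    variance = max(sum_x2 / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    print(running_stats([2, 4, 4, 4, 5, 5, 7, 9]))
```

Only the count and the two sums are kept, so the data can be consumed as a stream without being stored, which suits the large-dataset case.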

There isn't much of a formula to talk about, and the requirement is a real world need. This task isn't meant to be difficult, but more related to what actually happens in real data analysis. And why narrow down on "language capabilities"? What kind of programmer would be hurt by a little thinking about simple algorithms? --Ledrug 13:38, 2 July 2011 (UTC)

The idea is to aid language comparison rather than be yet another generic programmer challenge site. --Paddy3118 16:09, 2 July 2011 (UTC)

I agree. I don't think this is really the place for challenges. We don't have much of a framework for it anyway since the solutions are all on the task page. I think it's best to just name or describe an algorithm right from the start. A "better" algorithm could be used as extra credit (e.g. one that greatly reduces error or handles corner cases well). --Mwn3d 16:27, 2 July 2011 (UTC)

How about stating the possible patterns in the standard deviation you might find and adjusting the task to give languages a chance to show them? --Paddy3118 16:10, 2 July 2011 (UTC)

Which part of it looks like a challenge? You add up some numbers, then maybe divide by another number; it's not like there's tricky coding to be done. A large dataset is a real scenario and not hard to deal with, as long as you don't artificially complicate it. And as sample size increases, numbers such as mean and stddev become stable, which is almost the whole point of statistics: it's an easily noticeable trend, I'm not asking you to find the face of Jesus in the output numbers. As a programmer, none of this should be hard to understand, and I never said anything about greatly reducing errors: you can only avoid greatly increasing them, but that's a natural requirement for anyone doing numerical work. --Ledrug 17:02, 2 July 2011 (UTC)

Making it numerically stable, that's challenging. It's easy enough if you have a small number of values, all of about the same scale, but that's not always the case. –Donal Fellows 17:11, 2 July 2011 (UTC)

That's actually rarely a problem in the real world. If a distribution is narrow, the scale difference is small; if the distribution is wide, losing some precision on really small numbers wouldn't affect either average or stddev. It will probably be a concern only when you have a few very large numbers and a lot of smaller ones (say, smaller than 10^-16 of the large ones in absolute value, but about 10^16 of them in number), but what kind of physical measurement would give a distribution like that? In any event, I didn't say anything about that in the task; the distribution used is uniform, it really can't get much simpler than that. --Ledrug 17:35, 2 July 2011 (UTC)

That kind of distribution comes up when working with quantities that follow a power law (i.e., where they are distributed more evenly in log space), which is actually quite often. In any case, the warning about such things is relevant because someone will copy the code on this page and use it unwisely; there are whole legions of fools who want to program by cut-n-paste only and without any thought for side conditions, but even so it is still something that we should note for our own consciences. Write robust code for extra credit! –Donal Fellows 18:03, 2 July 2011 (UTC)
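
As one possible reading of "robust code", the sketch below uses Welford's online algorithm in Python. Nobody in this thread names that algorithm, so treat it as an assumption about what a more stable version could look like, not as what the task requires. The name `welford_stats` is hypothetical, and population standard deviation is assumed.

```python
import math


def welford_stats(values):
    """Return (mean, population stddev) via Welford's online algorithm.

    `welford_stats` is a hypothetical name used for illustration only.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        # Uses the deviation from the old mean times the deviation
        # from the updated mean, avoiding the subtraction of two
        # large, nearly equal sums.
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("need at least one value")
    return mean, math.sqrt(m2 / n)


if __name__ == "__main__":
    # Values with a large common offset, where sums of x*x lose digits.
    print(welford_stats([1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]))
```

It is still a single pass with constant memory, so it fits the same large-dataset use as plain running sums; the difference is that it never forms the sum of x-squared, which is where precision tends to be lost when values are large relative to their spread.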