Talk:Welch's t-test

Needs better task description

I haven't looked at the C code yet, but I'm assuming it's using t-test? The description should provide more context and explanations of concepts, and preferably links to algorithms. --Ledrug (talk) 20:16, 26 May 2015 (UTC)

Yes, this uses Welch's 2-sided t-test, as I commented inside the code.

Hi, you need to take all those nuggets out of your code comments and put them into an improved task description. The task description needs to stand on its own as a clear and concise description of what needs to be accomplished.

(P.S. Please sign your comments). --Paddy3118 (talk) 19:09, 27 May 2015 (UTC)

Hi Hailholyghost, I just had a look at the link you give and it is inadequate as a description for an RC task. The task description needs to be written for an audience of enthusiastic programmers - not necessarily maths or stats or whatever enthusiasts. It seems that you are new to RC and maybe you need to take time and lurk a bit more to understand a little more about how things are done.

This task needs a full description of the calculation method to use, probably in pseudocode, together with a description of what the algorithm should be used for, to complete a good task. The code you give is not enough for a task description. --Paddy3118 (talk) 19:23, 27 May 2015 (UTC)

I added that link as a pointer in the right direction for now. To be fair though, null hypothesis testing is very involved and sometimes borders on black magic, so it may be difficult to explain everything clearly in a short text. The following wiki links may be relevant: wp:Statistical hypothesis testing, wp:ANOVA, and more specifically wp:Student's t-test and wp:Welch's t-test. The Student's t-test article has more details on the actual computations, which form the basis for Welch's test. --Ledrug (talk) 19:44, 27 May 2015 (UTC)

So task description should perhaps also include explicit cautions about p values... Perhaps xkcd 882 and 1478? --Rdm (talk) 13:32, 3 June 2015 (UTC)

I've improved the C function to work with larger arrays using tgammal instead of tgamma, and have exception handling if the entered array is too small. I have made some modifications to the Simpson integration part, and the function now runs about twice as fast as before. I have also added a description. I have removed comments in my code. I hope this is satisfactory.--Hailholyghost (talk) 18:28, 3 June 2015 (UTC)

Looks like you pulled some of that math out of wikipedia, but even there there's not quite enough context. For example, what is the definition of u and of f(u)? That kind of stuff works in a classroom context where representative examples have been recently referenced, but that's not the case here.

Also, if you are going the math route I think you should mention basic assumptions (for example, I think you are assuming that the list of values were taken from what would be some normal distribution). --Rdm (talk) 20:04, 3 June 2015 (UTC)

I can work on the task description later. On a more practical matter, this code cannot calculate p-value for very large array sizes (> about 1755 elements). Does anyone know how to solve this? ==hailholyghost 15:18 Friday 5 June (UTC)

The fraction ${\frac {\Gamma (a)}{\Gamma (a+0.5)}}$ blows up. How can I get ratio in terms of lgammal? ==hailholyghost 15:26 7 June 2015.

I can get this fraction in terms of $B(a,a+0.5)$ , but it is computationally expensive. At least it works now. As for the task description, how much detail is required? I only put what I considered necessary to the computation, as this is work I did myself. The internet is awash with articles about p-value, so I only linked to those wikipedia articles. The reason I wrote this page is because I was unable to find a way to implement this computation directly, after weeks and weeks of internet searches. I hope that this computer code can be beneficial to others.--hailholyghost 14:25 Tuesday 9 June 2015 UTC.

Just use exp(lgamma(a) - lgamma(a+0.5)). Replacing tgamma with tgammal is only delaying the overflow until longer data (10000 or so?), while loggamma function should not overflow with any reasonable data. --Ledrug (talk) 18:35, 9 June 2015 (UTC)

Hi Ledrug, I think you used the logarithm identity

\log({\frac {a}{b}})=\log(a)-\log(b)

but this doesn't apply here because

{\frac {\Gamma (a)}{\Gamma (a+0.5)}}\implies {\frac {\ln(\Gamma (a))}{\ln(\Gamma (a+0.5))}}\neq \ln(\Gamma (a))-\ln(\Gamma (a+0.5))

rather,

{\frac {\ln(\Gamma (a))}{\ln(\Gamma (a+0.5))}}=\log _{\Gamma (a+0.5)}(\Gamma (a))

which unfortunately doesn't seem to go anywhere.

The answer to this has to be buried somewhere in the bowels of the internet... but I can't find it...--hailholyghost 14:07 10 June 2015 (UTC)

How does that not apply?

e^{\ln A-\ln B}=e^{\ln {A \over B}}={A \over B}

where A and B are the gammas, isn't that what you want? --Ledrug (talk) 19:02, 10 June 2015 (UTC)

If I understand the mass of expressions on the task page, you want to evaluate

\mathrm {B} (x,y)={\dfrac {\Gamma (x)\,\Gamma (y)}{\Gamma (x+y)}}\!

But this is equivalent to

\mathrm {B} (x,y)=\exp((\ln(\Gamma (x))+\ln(\Gamma (y)))-\ln(\Gamma (x+y)))\!

Or have I misunderstood what the task needs? --Rdm (talk) 13:31, 10 June 2015 (UTC)

Rdm, thank you so much!!!! --hailholyghost 15:11 EST 10 June 2015 (UST)

--Ledrug you are correct, of course, I put what you said into the task description. I'll put more about the definition of the p-value and warnings, maybe split the task description into two different sections.--hailholyghost 16:00 UTC 13 June 2015 (UST)

Task description complete?

I have made the task description more complete. I consider this page as ready to be published as a complete task. If someone else feels it is not ready, please give me a *specific* description of what's missing or why this isn't yet ready. I tried adding references but had formatting issues. I would like to cite this link, among others, if someone could please show me how to do this: http://www.nature.com/polopoly_fs/1.14700!/menu/main/topColumns/topLeftColumn/pdf/506150a.pdf --Hailholyghost (talk) 13:28, 23 June 2015 (UTC)--

If you look at most of the other tasks and consider the authorship of the examples, then you might agree with me that your task description stands out as being couched in heavy mathematical notation rather than in, say, pseudo-code for example. This puts a barrier between the readership and the task as perfectly proficient programmers would also need to be statisticians to follow the description.

This is not what RC is about - as you can see from other examples where very mathematical concepts such as Quaternions are given in task form that is explained to a programming audience. That has not been done in your draft task.

In short; explain it to the RC audience rather than to yourself - If you don't have an idea of the RC audience, (and that might be the case as you are asking how to create links), then you need to both lurk more on the site and read other tasks until you do. --Paddy3118 (talk) 14:03, 23 June 2015 (UTC)

Probably would be a good idea to include a link to Gamma_function and also to explain how to handle the definite integrals. I think we also need some documentation on how to implement lngamma given a decent implementation of gamma (log of gamma of fractional part of n plus sum of logs of the positive integers less than the integer part of n, or something like that). We might need a bit more than that, but I think we need at least that much. --Rdm (talk) 18:17, 23 June 2015 (UTC)

One more thing we need here is how to calculate the degrees of freedom for a single dataset. You've only supplied an expression for approximating the degrees of freedom of the two sets combined if we already know the degrees of freedom for each of the sets. (Presumably - since you are asking for sample variance - it's N-1 - but that should be specified.) --Rdm (talk) 05:07, 3 July 2015 (UTC)

Another issue: according to wp:Welch's_t_test "$s_{1}^{2}$ is sample variance" and you seem to be using the equations from there, but in your task description you currently instead say that "$s_{n}$ is the sample variance of set $n$" (Also, sample variance can be calculated in one of two ways - the expression you gave corresponds to what they label "unbiased sample variance" .. perhaps a minor issue? But I implemented what I think your task description has declared I should be calculating and I get a different result than the other task implementations, so I'm having to review all the basics...) --Rdm (talk) 06:03, 3 July 2015 (UTC)

I also noticed the unexplained nu1 and nu2 and the s instead of s^2. In addition in the sample variance formula I think the subscript on the mean should be lower case n. And the term p2-tail is unexplained. This means the same as p-value? —Sonia (talk) 00:07, 7 July 2015 (UTC)

Hello Sonia and Rdm, I think I've answered your questions about sample variance and other errors, and thank you for catching the mistakes. Please let me know if you see any other errors.--Hailholyghost (talk) 06:08, 7 July 2015 (UTC)

is there anything else that should be done? It was said that some other people should weigh in on whether or not this should be a complete task, but no one seems to be reading this.--Hailholyghost (talk) 14:29, 26 July 2015 (UTC)

I think things are pretty good, and if no one speaks up soon (let's say, within the next month?) with any specific problems that need to be addressed, we can promote the task from draft status. Or, if you are feeling optimistic, you could take it out of draft status now and if someone has concerns they can put it back into draft status along with a description of what problems they feel need to be addressed. --Rdm (talk) 14:47, 26 July 2015 (UTC)

Task definition vs. Task implementation

Currently the task implementation asks for

$p=1-{\frac {1}{2}}\times {\frac {\int _{0}^{\frac {\nu }{t^{2}+\nu }}{\frac {r^{{\frac {\nu }{2}}-1}}{\sqrt {1-r}}}\,\mathrm {d} r}{\exp(\ln(\Gamma ({\frac {\nu }{2}}))+\ln(\Gamma (0.5))-\ln(\Gamma ({\frac {\nu }{2}}+0.5)))}}$

But after reading the C implementation, what is actually being calculated is

$p={\frac {\int _{0}^{\frac {\nu }{t^{2}+\nu }}{\frac {r^{{\frac {\nu }{2}}-1}}{\sqrt {1-r}}}\,\mathrm {d} r}{\exp(\ln(\Gamma ({\frac {\nu }{2}}))+\ln(\Gamma (0.5))-\ln(\Gamma ({\frac {\nu }{2}}+0.5)))}}$

In other words, for <27.5 21 19 23.6 17 17.9 16.9 20.1 21.9 22.6 23.1 19.6 19 21.7 21.4> and <27.1 22 20.8 23.4 23.4 23.5 25.8 22 24.8 20.2 21.9 22.1 22.9 20.5 24.4> the task description would have us calculate a value of 0.989311 but the implementations give a value of 0.021378. And you can easily see this in the code -

<lang c> double return_value = ((h / 6.0) * ((pow(x,a-1))/(sqrt(1-x)) + 4.0 * sum1 + 2.0 * sum2))/(expl(lgammal(a)+0.57236494292470009-lgammal(a+0.5)))</lang>

There is no 1-expression here (except deeply inside parentheses) and there is no divide by 2 or multiply by 0.5 (again, except deeply inside parentheses).

I think that either the task description needs to be changed to match the implementation, or the implementation needs to be changed to match the task description. --Rdm (talk) 07:47, 3 July 2015 (UTC)

Is there anything missing/in error that this page is still considered in draft mode?--Hailholyghost (talk) 13:40, 8 July 2015 (UTC)

Well... the lngamma algorithm will be important for anyone who doesn't have a native implementation of that. That could be a separate task, and linked in the task description, if you are not comfortable documenting it here. Paddy's suggestion of pseudocode is also worth considering (perhaps on a separate page such as Calculate_P-Value/Pseudocode linked into the task description?), though at this point there is perhaps enough real code that that is not such an issue? I guess, let's give some of the other people here time to weigh in on this... --Rdm (talk) 13:48, 8 July 2015 (UTC)

I saw that C99 standard math.h has lgammal, so I figured it was standard in every language like log or pow. However, I see now that lgammal is not standard in every language. Is there something from math.h that is like <lang c>

#include <stdio.h>

long double lgammal (const long double input) { //... some math....

return result;

} </lang> that I could paste into my code?--Hailholyghost (talk) 18:22, 8 July 2015 (UTC)

You probably do not need to have duplicate copies of your implementation just for the missing lngamma issue. Instead, name your implementation of it "lngamma" and include a note that this routine should be included in the implementation if it is not supplied when linked with -lm. --Rdm (talk) 11:40, 15 July 2015 (UTC)

Hi Rdm, I'm a little confused, but I think I understand what you mean. The copies I put are not complete duplicates, because lgammal and LnGamma are spelled differently. I am concerned that lgamma and LnGamma do not output exactly the same numbers, as you can see in the output section. I don't see alternate implementation coded on any other Rosetta Code page, could you please provide an example of how to format the implementations on the page? I'm trying to imitate the formatting of the quaternion page, but I don't see alternate implementations there or on a few other pages.--Hailholyghost (talk) 14:13, 15 July 2015 (UTC)

How bad are the differences from the alternate implementation of ln gamma? Do they matter for the task example? What kind of example would they matter for?

Anyways, yes, I would get rid of the LnGamma implementation. And, I would change the spelling of its name to 'lngamma' so it can be used as a drop in replacement. If accuracy is a concern, I would document that as an issue so that an interested programmer could address the problem(s).

Does that address your concerns adequately? --Rdm (talk) 15:39, 15 July 2015 (UTC)

The max difference between lgamma and LnGamma for double is 1.862645e-09, and for long double is 1.919034e-10. I don't think this matters for most applications. The code below shows how I calculated the difference.

<lang C>#include <stdio.h>//printf

#include <math.h>//lgamma

long double LnGamma(const double xx) {

  unsigned int j;
  long double x,y,tmp,ser;
  const double cof[6] = {
     76.18009172947146,    -86.50532032941677,
     24.01409824083091,    -1.231739572450155,
     0.1208650973866179e-2,-0.5395239384953e-5
  };

  y = x = xx;
  tmp = x + 5.5 - (x + 0.5) * logl(x + 5.5);
  ser = 1.000000000190015;
  for (j=0;j<=5;j++)
     ser += (cof[j] / ++y);
  return(log(2.5066282746310005 * ser / x) - tmp);

}

int main (void) {
  long double max_difference = 0.0;
  long double lgamma_ans = 0.0, LnGamma_ans = 0.0;
  long double worst_lgamma_ans = 0.0, worst_LnGamma_ans = 0.0;
  unsigned int worst_x = 3;
  for (unsigned int x = 3; x < 965535; x++) {
     lgamma_ans = lgammal(x);
     LnGamma_ans = LnGamma(x);
     if (fabsl(LnGamma_ans-lgamma_ans) > max_difference) {
        max_difference = fabsl(LnGamma_ans-lgamma_ans);
        worst_lgamma_ans = lgamma_ans;
        worst_LnGamma_ans= LnGamma_ans;
        worst_x = x;
     }
     // printf("%d\t%.15f\t%.15f\t%e\n",x,lgamma(x),LnGamma(x),lgamma(x)-LnGamma(x));
  }
  printf("Max difference between lgamma & LnGamma = %Le, for x = %d, lgamma(%d) = %Le; LnGamma(%d) = %Le\n\n",max_difference,worst_x,worst_x,worst_lgamma_ans,worst_x,worst_LnGamma_ans);
  return 0;
}
</lang>

I'll modify the main page accordingly.--Hailholyghost (talk) 21:01, 15 July 2015 (UTC)

lngamma

You can build an adequate lngamma from the rosettacode gamma function implementation, something like this:

<lang pseudocode>function lngamma(x) {

  if x < 3 then
     return ln(gamma(x))
  else
     frac= x mod 1
     r= ln(gamma(1+frac))
     for index= 1 thru x-1
        r= r + ln(index+frac)
     end for
     return r
  end if

end function</lang>

(I think I got that right - this is based on an email from Roger Hui, but I've adapted it from J to pseudo-code. Any errors are my own. Except errors in the gamma function implementation - those might be someone else's. --Rdm (talk) 20:23, 8 July 2015 (UTC))

That's short and sweet! I looked up a few library implementations and they were much more involved with lots of special cases and magic numbers. It all depends on what is "adequate." For the example data used so far in this task, the Racket and Tcl examples show that even the non-log gamma is adequate. It seems okay to me to allow the non-log gamma for the task. I think it's still nice though that the task description points out the limitation of non-log gamma and shows the preferred solution using log gamma. —Sonia (talk) 21:46, 8 July 2015 (UTC)

For languages that have a native lgamma implementation, such as C, no lgamma function I write will be as good, and it would only detract from the C code. Nonetheless, your point is taken, and I'll write a perl translation of my code with an lgamma. However, I want to make the code as good as I can. How can I get the code for math.h lgamma? I've looked all over and it's buried in my laptop somewhere but I can't find it.--Hailholyghost (talk) 10:08, 10 July 2015 (UTC)

My copy of math.h does not have lngamma. Perhaps you should ask whoever you got your copy from? --Rdm (talk) 11:11, 10 July 2015 (UTC)

Hi Rdm, you can call lgamma not lngamma, by just placing #include <math.h> in your C program, and calling lgamma(5.5) or whatever inside main. My copy of math.h came standard with my Ubuntu installation.

<lang perl>sub lgamma { # per code from numerical recipies, http://hea-www.harvard.edu/~alexey/calc-src.txt

 my $xx = $_[0];
 my ($j, $ser, $tmp, $x, $y);
 my @cof = (0.0, 76.18009172947146, -86.50532032941677,
            24.01409824083091, -1.231739572450155,
            0.1208650973866179e-2, -0.5395239384953e-5);

 my $stp = 2.5066282746310005;
   
 $x = $xx; $y = $x;
 $tmp = $x + 5.5;
 $tmp = ($x+0.5)*log($tmp)-$tmp;
 $ser = 1.000000000190015;
 foreach $j ( 1 .. 6 ) {
   $y+=1.0;
   $ser+=$cof[$j]/$y;
 }
 return $tmp + log($stp*$ser/$x);

} </lang>

I got this code from another website (commented) is this acceptable to use in Perl on Rosetta Code?--Hailholyghost (talk) 12:56, 10 July 2015 (UTC)

I'm not using ubuntu at the moment. I do have access to some ubuntu 12.x boxes, but those boxes don't have lngamma nor lgamma. But ubuntu provides the complete sources - so you could, if you wanted to, look through them yourself? Presumably your version is in the source package for libc6 (that's eglibc on the ubuntu box that I was looking at, but it might be different on your machine?). But anyways - no, it doesn't work that way for me.

As for "acceptable to use in perl on this site"... I guess that should be ok. A quick web search finds http://paulbourke.net/miscellaneous/functions/ suggesting that the original source for the original version of that code was the Numerical Recipes book. And that code gets heavy use all over the place so I expect that no one should object to your using it here. --Rdm (talk) 13:17, 10 July 2015 (UTC)

is it acceptable to have two different implementations of this C code: 1. for computers with lgamma implemented, the first C program, and 2. for computers without lgamma implemented, a 2nd C program?

there are minor differences between LnGamma and lgamma, as you can see when you run this program: <lang C>

#include <stdio.h> // printf
#include <math.h>//lgamma,log

double LnGamma(const double xx) {

  unsigned int j;
  double x,y,tmp,ser;
  const double cof[6] = {
     76.18009172947146,    -86.50532032941677,
     24.01409824083091,    -1.231739572450155,
     0.1208650973866179e-2,-0.5395239384953e-5
  };

  y = x = xx;
  tmp = x + 5.5 - (x + 0.5) * logl(x + 5.5);
  ser = 1.000000000190015;
  for (j=0;j<=5;j++)
     ser += (cof[j] / ++y);
  return(log(2.5066282746310005 * ser / x) - tmp);

}

int main (void) {
  printf("x\tlgamma(x)\tLnGamma(x)\tlgamma(x)-LnGamma(x)\n");
  for (unsigned short int x = 3; x < 65535; x++) {
     printf("%d\t%.15f\t%.15f\t%e\n",x,lgamma(x),LnGamma(x),lgamma(x)-LnGamma(x));
  }
  return 0;
}
</lang> --Hailholyghost (talk) 16:34, 10 July 2015 (UTC)

I have added an lngamma function in case the user does not have his/her own implementation of lgamma. Is there anything else with the C code or task description that can be improved?--Hailholyghost (talk) 15:06, 14 July 2015 (UTC)

Perl

Hello, I have created a perl translation of my C code. Here is my perl script:

<lang = Perl>#!/usr/bin/env perl

use strict; use warnings;

sub lngamma { # per code from numerical recipies, http://hea-www.harvard.edu/~alexey/calc-src.txt

 my $xx = $_[0];
 my ($j, $ser, $tmp, $x, $y);
 my @cof = (0.0, 76.18009172947146, -86.50532032941677,
            24.01409824083091, -1.231739572450155,
            0.1208650973866179e-2, -0.5395239384953e-5);

 my $stp = 2.5066282746310005;
   
 $x = $xx; $y = $x;
 $tmp = $x + 5.5;
 $tmp = ($x+0.5)*log($tmp)-$tmp;
 $ser = 1.000000000190015;
 foreach $j ( 1 .. 6 ) {
   $y+=1.0;
   $ser+=$cof[$j]/$y;
 }
 return $tmp + log($stp*$ser/$x);

}

sub calculate_Pvalue {
  my $array1 = shift;
  my $array2 = shift;
  if ((@$array1 <= 1) || (@$array2 <= 1)) {
    return 1.0;
  }
  my $mean1 = 0.0;
  my $mean2 = 0.0;
  foreach my $e (@$array1) {
    $mean1 += $e;
  }
  foreach my $e (@$array2) {
    $mean2 += $e;
  }
  if ($mean1 == $mean2) {
    return 1.0;
  }
  $mean1 /= @$array1;
  $mean2 /= @$array2;
  my ($variance1,$variance2) = (0,0);
  foreach my $e (@$array1) {
    $variance1 += ($mean1-$e)*($mean1-$e);
  }
  foreach my $e (@$array2) {
    $variance2 += ($mean2-$e)*($mean2-$e);
  }
  if (($variance1 == 0.0) && ($variance2 == 0.0)) {
    return 1.0;
  }
  $variance1 = $variance1/(@$array1-1);
  $variance2 = $variance2/(@$array2-1);
  my $WELCH_T_STATISTIC = ($mean1-$mean2)/sqrt($variance1/scalar(@$array1)+$variance2/scalar(@$array2));
  my $DEGREES_OF_FREEDOM = (($variance1/@$array1+$variance2/scalar(@$array2))**2)#numerator
  /
  (
    ($variance1*$variance1)/(scalar(@$array1)*scalar(@$array1)*(scalar(@$array1)-1))+
    ($variance2*$variance2)/(scalar(@$array2)*scalar(@$array2)*(scalar(@$array2)-1))
  );
  printf("t = %lf; DOF = %lf\n",$WELCH_T_STATISTIC,$DEGREES_OF_FREEDOM);
  my $sa = $DEGREES_OF_FREEDOM/2;
  my $x = $DEGREES_OF_FREEDOM/($WELCH_T_STATISTIC*$WELCH_T_STATISTIC+$DEGREES_OF_FREEDOM);
  my $N = 65355;
  my $h = $x/$N;
  my ($sum1,$sum2) = (0.0,0.0);
  for (my $i = 0; $i < $N; $i++) {
    $sum1 += (($h * $i + $h / 2.0)**($sa-1))/(sqrt(1-($h * $i + $h / 2.0)));
    $sum2 += (($h * $i)**($sa-1))/(sqrt(1-$h * $i));
  }
  print "sum1 = $sum1; sum2 = $sum2\n";
  return ($h / 6.0) * ((($x**($sa-1))/(sqrt(1-$x)) + 4.0 * $sum1 + 2.0 * $sum2))/(exp(&lngamma($sa)+0.57236494292470009-&lngamma($sa+0.5)));
}

my @d1 = (27.5,21.0,19.0,23.6,17.0,17.9,16.9,20.1,21.9,22.6,23.1,19.6,19.0,21.7,21.4);
my @d2 = (27.1,22.0,20.8,23.4,23.4,23.5,25.8,22.0,24.8,20.2,21.9,22.1,22.9,20.5,24.4);
my @d3 = (17.2,20.9,22.6,18.1,21.7,21.4,23.5,24.2,14.7,21.8);
my @d4 = (21.5,22.8,21.0,23.0,21.6,23.6,22.5,20.7,23.4,21.8,20.7,21.7,21.5,22.5,23.6,21.5,22.5,23.5,21.5,21.8);
my @d5 = (19.8,20.4,19.6,17.8,18.5,18.9,18.3,18.9,19.5,22.0);
my @d6 = (28.2,26.6,20.1,23.3,25.2,22.1,17.7,27.6,20.6,13.7,23.2,17.5,20.6,18.0,23.9,21.6,24.3,20.4,24.0,13.2);
my @d7 = (30.02,29.99,30.11,29.97,30.01,29.99);
my @d8 = (29.89,29.93,29.72,29.98,30.02,29.98);
my @x = (3.0,4.0,1.0,2.1);
my @y = (490.2,340.0,433.9);
printf("Test sets 1 p-value = %lf\n",&calculate_Pvalue(\@d1,\@d2));
printf("Test sets 2 p-value = %lf\n",&calculate_Pvalue(\@d3,\@d4));
printf("Test sets 3 p-value = %lf\n",&calculate_Pvalue(\@d5,\@d6));
printf("Test sets 4 p-value = %lf\n",&calculate_Pvalue(\@d7,\@d8));
printf("Test sets 5 p-value = %lf\n",&calculate_Pvalue(\@x,\@y));
</lang>

and comparing with the C code:

Output:

con@e:~/DNA_Methylation$ perl pvalue.pl
t = -2.455356; DOF = 24.988529
sum1 = 878.360186998937; sum2 = 878.265618888918
Test sets 1 p-value = 0.021378
t = -1.565434; DOF = 9.904741
sum1 = 9911.19818303728; sum2 = 9910.72960543568
Test sets 2 p-value = 0.148842
t = -2.219241; DOF = 24.496223
sum1 = 1444.70812944688; sum2 = 1444.55247891539
Test sets 3 p-value = 0.035972
t = 1.959006; DOF = 7.030560
sum1 = 8982.44374807736; sum2 = 8982.16243331816
Test sets 4 p-value = 0.090773
t = -9.559498; DOF = 2.000852
sum1 = 65573.4697898075; sum2 = 65572.4685866506
Test sets 5 p-value = 0.010752

real	0m0.287s
user	0m0.284s
sys	0m0.000s
con@e:~/DNA_Methylation$ ./pvalue 
t = -2.455356; DOF = 24.988529
sum1 = 880.779357; sum2 = 880.684789
Test sets 1 p-value = 0.021378001462867
t = -1.565434; DOF = 9.904741
sum1 = 9938.495493; sum2 = 9938.026915
Test sets 2 p-value = 0.148841696605327
t = -2.219241; DOF = 24.496223
sum1 = 1448.687128; sum2 = 1448.531478
Test sets 3 p-value = 0.035972271029797
t = 1.959006; DOF = 7.030560
sum1 = 9007.183093; sum2 = 9006.901778
Test sets 4 p-value = 0.090773324285661
t = -9.559498; DOF = 2.000852
sum1 = 65754.071496; sum2 = 65753.070294
Test sets 5 p-value = 0.010751534107903

The t statistics and DOF are the same between perl and C, but the sums are slightly off. How can I fix this?--Hailholyghost (talk) 15:46, 16 July 2015 (UTC)

I would start by finding in which part of the algorithm your implementations start producing different values.

Once you have isolated where they differ in value, and you've a relatively good set of representative test cases, the next step would be to find which is mathematically accurate (if either of them significantly more accurate than the other).

And once you have that, you can work on fixing the other implementation (and we might be able to help you, there, if you can explain the issue). --Rdm (talk) 22:55, 16 July 2015 (UTC)

Hi Rdm, as the output shows, the sums are different while the t statistic and DOF are the same.

That is, this C code <lang = C>for(unsigned short int i = 0;i < N; i++) {

     sum1 += (pow(h * i + h / 2.0,a-1))/(sqrt(1-(h * i + h / 2.0)));
     sum2 += (pow(h * i,a-1))/(sqrt(1-h * i));

} </lang> and this Perl code <lang = Perl> for (my $i = 0; $i < $N; $i++) {

     $sum1 += (($h * $i + $h / 2.0)**($sa-1))/(sqrt(1-($h * $i + $h / 2.0)));
     $sum2 += (($h * $i)**($sa-1))/(sqrt(1-$h * $i));

} </lang>

add differently. I am not a programmer by trade, and I'm completely self-taught, so there is a lot I don't know. Why would Perl and C sum differently? Is there a computer science reason?--Hailholyghost (talk) 13:49, 17 July 2015 (UTC)

Presumably either the numbers are different or the implementation of addition is different. (Though you have other operations in there, besides addition, which could also be different.) Hypothetically speaking, you might also have type differences giving you different results, but I don't see any obvious expressions where that seems likely (still, you might try changing your C implementation so that everything is type double - I don't think that will change anything but when isolating problems you sort of have to assume that all assumptions are suspect).

I suppose you might try logging summed numbers to a file, instead of adding them up. And then verifying whether or not you get the same or different sums in the same language or a different language.

Or often it's useful to just guess what might be going wrong and then putting print statements in the calculation to test that. (Maybe shrink N to something much smaller, just for isolating the differences, so that you are not spamming yourself.)

Or, I suppose I could try taking apart your code and doing these experiments myself. But I've got some other things I need to be doing, so that's not going to happen today. --Rdm (talk) 14:19, 17 July 2015 (UTC)

Hi Rdm, I tried switching the C code int to double, but the answer still comes out the same. I have no idea why the perl code would output differently than the C code, or how to fix it.--Hailholyghost (talk) 23:39, 19 July 2015 (UTC)

When I run into this kind of problem, I start looking at intermediate values. I also try to simplify the underlying dataset so that I can review all the intermediate data (which, in this case, would mean debugging with a much smaller value for $N, and then seeing if any fixes discovered there are adequate with the original value of N). --Rdm (talk) 01:35, 20 July 2015 (UTC)