Verify distribution uniformity/Chi-squared test: Difference between revisions

(→‎{{header|jq}}: simplify)
Line 741:
enough for the χ2 distribution to be used.
 
The implementation of `gammaIncompleteChi2_cdf` below is intended to be fairly
simple and robust rather than fast and highly accurate. For an
industrial-strength algorithm, see
e.g. https://people.sc.fsu.edu/~jburkardt/c_src/asa239/asa239.c
 
'''Generic Functions'''
<syntaxhighlight lang=jq>
Line 756 ⟶ 755:
# sum of squares
def ss(s): reduce s as $x (0; . + ($x * $x));
 
# f must have arity 0
def integrate($a; $b; $n; f):
(($b - $a) / $n) as $h
| reduce range(0;$n) as $i (0;
($a + $i * $h) as $x
| . + ((( ($x|f) + 4 * (($x + ($h/2))|f) + (($x + $h)|f)) / 6)) )
| . * $h;
 
# Lower Incomplete Gamma Function
# Integral(from:0; to:$x) t^($a-1)*(-t | exp) dt
def gammaIncomplete($a; $x):
def small: 1e-3; # 1e-3 is adequate; 1e-4 takes significantly longer for small $x
def small($a;$b): (($a-$b)/($a+$b)) | length < small or
(($a-$b)|length) < 1e-10; # length here is abs
def f0: (if . == 0 then 1 else pow(.; $a-1) end) * ((- .) |exp) ;
 
# If $upper is large then divide to conquer:
def integrate($upper; $n):
if $upper < $a * 2 then integrate(0; $upper; $n; f0)
else integrate( 0; 2*$a; $n; f0) +
integrate(2*$a; $upper; $n; f0)
end ;
 
if $a == 1 then 1 - ((-$x)|exp)
elif $x > $a * 1e4 then 1
else {current: 0, n: 25, iteration: 0}
| .prev = 1
| until( small(.current; .prev);
.n *= 2
| .iteration += 1
| .prev = .current
| .current = integrate($x; .n) )
| .current
end ;
 
# Cumulative density function of the chi-squared distribution with $k
# degrees of freedom
# Use lgamma to avoid $x^m (for large $x and large m) and
def cdf($x; $k):
# to avoid calling gamma for large $k
gammaIncomplete($k/2;$x/2) / (($k/2)|gamma) ;
def cdfChi2_cdf($x; $k):
if $ax == 10 then 1 - ((-$x)|exp)0
elif $x > $a(1e3 * 1e4$k) then 1
else 1e-15 as $tol # for example
| { s: 0, m: 0, term: $tol}
| until (($a-$b).term|length) < 1e-10$tol; # length here is abs
# .term = (pow($x/2; .m) / (($k/2 + .m + 1)|gamma))
| .term += ((( ($x|f) + 4.m * (($x + ($h/2))|flog)) +- (($xk/2 + $h.m + 1)|flgamma)) / 6))| exp)
| .ns *+= 2.term
| .m *+= $h;1)
| .s * ( (-$x/2) + ($k/2)*(($x/2)|log)|exp)
end ;
</syntaxhighlight>
'''The Tasks'''
<syntaxhighlight lang=jq>
# Input: array of frequencies
def chi2UniformDistance:
Line 807 ⟶ 785:
# of freedom
def chi2Probability($dof):
(1 - cdfChi2_cdf(.; $dof))
| if . < 1e-10 then "< 1e-10"
else .
end;
 
</syntaxhighlight>
'''The Tasks
<syntaxhighlight lang=jq>
# Input: array of frequencies
# Output: result of a two-tailed test based on the chi-squared statistic
Line 820 ⟶ 796:
(length - 1) as $dof
| chi2UniformDistance
| cdfChi2_cdf(.; $dof) as $cdf
| if $cdf
then ($significance/2) as $s
Line 827 ⟶ 803:
end;
def datasetsdsets: [
[199809, 200665, 199607, 200270, 199649],
[522573, 244456, 139979, 71531, 21461],
Line 835 ⟶ 811:
];
 
def taskstask:
datasetsdsets[]
| "Dataset: \(.)",
( chi2UniformDistance as $dist
Line 842 ⟶ 818:
| "DOF: \($dof) D (Distance): \($dist)",
" Estimated probability of observing a value >= D: \($dist|chi2Probability($dof)|round(2))",
" Uniform? \( (select(chiIsUniform(0.05)) | "Yes") // "No" )\n" ) ;
| "Yes") // "No" )\n" ) ;
 
def task0($a):
tasks
def f0: (if . == 0 then 1 else pow(.; $a-1) end) * ((- .) |exp) ;
def integrate($a0; $b1; $n100; ff0):;
 
task
</syntaxhighlight>
{{output}}
<pre>
Dataset: [199809,200665,199607,200270,199649]
Line 861 ⟶ 839:
Dataset: [19,14,6,18,7,5,1]
DOF: 6 D (Distance): 29.2
Estimated probability of observing a value >= D: < 1e-100
Uniform? No
 
Line 874 ⟶ 852:
Uniform? No
</pre>
 
 
=={{header|Julia}}==
2,442

edits