Verify distribution uniformity/Chi-squared test: Difference between revisions

(preferably a two-tailed test)
Line 732:
4 4,146 0,38657083 YES [199809.0, 200665.0, 199607.0, 200270.0, 199649.0]
4 790063,276 0,00000000 NO [522573.0, 244456.0, 139979.0, 71531.0, 21461.0]</pre>
 
=={{header|jq}}==
{{works with|jq}}
'''Also works with gojq, the Go implementation of jq.'''
 
This entry uses a two-tailed test, as is appropriate for this type of problem as illustrated
by the last example. The test is based on the assumption that the sample size is large
enough for the χ2 distribution to be used.
 
The implementation of `gammaIncomplete` below is intended to be fairly
simple and robust rather than fast and highly accurate. For an
industrial-strength algorithm, see
e.g. https://people.sc.fsu.edu/~jburkardt/c_src/asa239/asa239.c
 
'''Generic Functions'''
<syntaxhighlight lang=jq>
def round($dec):
if type == "string" then .
else pow(10;$dec) as $m
| . * $m | floor / $m
end;
 
# sum of squares
def ss(s): reduce s as $x (0; . + ($x * $x));
 
# f must have arity 0
def integrate($a; $b; $n; f):
(($b - $a) / $n) as $h
| reduce range(0;$n) as $i (0;
($a + $i * $h) as $x
| . + ((( ($x|f) + 4 * (($x + ($h/2))|f) + (($x + $h)|f)) / 6)) )
| . * $h;
 
# Lower Incomplete Gamma Function
# Integral(from:0; to:$x) t^($a-1)*(-t | exp) dt
def gammaIncomplete($a; $x):
def small: 1e-3; # 1e-3 is adequate; 1e-4 takes significantly longer for small $x
def small($a;$b): (($a-$b)/($a+$b)) | length < small or
(($a-$b)|length) < 1e-10; # length here is abs
def f0: (if . == 0 then 1 else pow(.; $a-1) end) * ((- .) |exp) ;
 
# If $upper is large then divide to conquer:
def integrate($upper; $n):
if $upper < $a * 2 then integrate(0; $upper; $n; f0)
else integrate( 0; 2*$a; $n; f0) +
integrate(2*$a; $upper; $n; f0)
end ;
 
if $a == 1 then 1 - ((-$x)|exp)
elif $x > $a * 1e4 then 1
else {current: 0, n: 25, iteration: 0}
| .prev = 1
| until( small(.current; .prev);
.n *= 2
| .iteration += 1
| .prev = .current
| .current = integrate($x; .n) )
| .current
end ;
 
# Cumulative density function of the chi-squared distribution with $k
# degrees of freedom
def cdf($x; $k):
gammaIncomplete($k/2;$x/2) / (($k/2)|gamma) ;
# Input: array of frequencies
def chi2UniformDistance:
(add / length) as $expected
| ss(.[] - $expected) / $expected;
 
# Input: a number
# Output: an indication of the probability of observing this value or higher
# assuming the value is drawn from a chi-squared distribution with $dof degrees
# of freedom
def chi2Probability($dof):
(1 - cdf(.; $dof))
| if . < 1e-10 then "< 1e-10"
else .
end;
</syntaxhighlight>
'''The Tasks
<syntaxhighlight lang=jq>
# Input: array of frequencies
# Output: result of a two-tailed test based on the chi-squared statistic
# assuming the sample size is large enough
def chiIsUniform($significance):
(length - 1) as $dof
| chi2UniformDistance
| cdf(.; $dof) as $cdf
| if $cdf
then ($significance/2) as $s
| $cdf > $s and $cdf < (1-$s)
else false
end;
def datasets: [
[199809, 200665, 199607, 200270, 199649],
[522573, 244456, 139979, 71531, 21461],
[19,14,6,18,7,5,1], # low entropy
[9,11,9,10,15,11,5], # high entropy
[20,20,20] # made-up
];
 
def tasks:
datasets[]
| "Dataset: \(.)",
( chi2UniformDistance as $dist
| (length - 1) as $dof
| "DOF: \($dof) D (Distance): \($dist)",
" Estimated probability of observing a value >= D: \($dist|chi2Probability($dof)|round(2))",
" Uniform? \( (select(chiIsUniform(0.05))
| "Yes") // "No" )\n" ) ;
 
tasks
</syntaxhighlight>
{{output}}
<pre>
Dataset: [199809,200665,199607,200270,199649]
DOF: 4 D (Distance): 4.14628
Estimated probability of observing a value >= D: 0.38
Uniform? Yes
 
Dataset: [522573,244456,139979,71531,21461]
DOF: 4 D (Distance): 790063.27594
Estimated probability of observing a value >= D: < 1e-10
Uniform? No
 
Dataset: [19,14,6,18,7,5,1]
DOF: 6 D (Distance): 29.2
Estimated probability of observing a value >= D: < 1e-10
Uniform? No
 
Dataset: [9,11,9,10,15,11,5]
DOF: 6 D (Distance): 5.4
Estimated probability of observing a value >= D: 0.49
Uniform? Yes
 
Dataset: [20,20,20]
DOF: 2 D (Distance): 0
Estimated probability of observing a value >= D: 1
Uniform? No
</pre>
 
 
=={{header|Julia}}==
2,449

edits