Verify distribution uniformity/Chi-squared test
In this task, write a function to verify that a given distribution of values is uniform by using the test to see if the distribution has a likelihood of happening of at least the significance level (conventionally 5%). The function should return a boolean that is true if the distribution is one that a uniform distribution (with appropriate number of degrees of freedom) may be expected to produce.
You are encouraged to solve this task according to the task description, using any language you may know.
More information about the distribution may be found at mathworld.
J
J has two types of verb definitions, Explicit and Tacit.
This is an Explicit solution: <lang j>require 'stats/base'
NB.*isUniform v Tests whether y is uniformly distributed NB. result is: boolean describing if distribution y is uniform NB. y is: distribution to test NB. x is: optionally specify number of categories possible isUniform=: verb define
(#~. y) isUniform y
signif=. 0.95 NB. set significance level expected=. (#y) % x NB. number of items divided by the category count observed=. #/.~ y NB. frequency count for each category X2=. +/ (*: observed - expected) % expected NB. the test statistic degfreedom=. <: x NB. degrees of freedom signif > degfreedom chisqcdf :: 1: X2
)</lang> This is the equivalent Tacit solution: <lang j>require 'stats/base'
SIGNIF=: 0.95 NB. set significance level countCats=: #@~. NB. counts the number of unique items getExpected=: #@] % [ NB. divides no of items by category count getObserved=: #/.~@] NB. counts frequency for each category calcX2=: [: +/ *:@(getObserved - getExpected) % getExpected NB. calculates test statistic calcDf=: <:@[ NB. calculates degrees of freedom for uniform distribution
NB.*isUniformT v Tests whether y is uniformly distributed NB. result is: boolean describing if distribution y is uniform NB. y is: distribution to test NB. x is: optionally specify number of categories possible isUniformT=: (countCats $: ]) : (SIGNIF > calcDf chisqcdf :: 1: calcX2)</lang>
Verbs and Distributions for testing: <lang j>freqCount=: ~. ,. #/.~
testFair=: monad define
'distribution ', (":,freqCount y) , ' assessed as ', (isUniform y) {:: ;:'unfair fair'
)
FairDistrib=: 1e6 ?@$ 5 UnfairDistrib=: (9.5e5 ?@$ 5) , (5e4 ?@$ 4)</lang>
Usage: <lang j> testFair FairDistrib distribution 4 143155 0 143085 1 143111 2 142706 6 142666 5 142365 3 142912 assessed as fair
testFair UnfairDistrib
distribution 0 203086 1 201897 2 202648 3 202388 4 189981 assessed as unfair
isUniformT 4 4 4 5 5 5 5 5 5 5 NB. uniform if only 2 categories possible
1
4 isUniformT 4 4 4 5 5 5 5 5 5 5 NB. not uniform if 4 categories possible
0</lang>
OCaml
This code needs to be compiled with library gsl.cma.
<lang ocaml>let sqr x = x *. x
let chi2UniformDistance distrib =
let count, len = Array.fold_left (fun (s, c) e -> s + e, succ c) (0, 0) distrib in let expected = float count /. float len in let distance = Array.fold_left (fun s e -> s +. sqr (float e -. expected) /. expected ) 0. distrib in let dof = float (pred len) in dof, distance
let chi2Proba dof distance =
Gsl_sf.gamma_inc_Q (0.5 *. dof) (0.5 *. distance)
let chi2IsUniform distrib significance =
let dof, distance = chi2UniformDistance distrib in let likelihoodOfRandom = chi2Proba dof distance in likelihoodOfRandom > significance
let _ =
List.iter (fun distrib -> let dof, distance = chi2UniformDistance distrib in Printf.printf "distribution "; Array.iter (Printf.printf "\t%d") distrib; Printf.printf "\tdistance %g" distance; Printf.printf "\t[%g > 0.05]" (chi2Proba dof distance); if chi2IsUniform distrib 0.05 then Printf.printf " fair\n" else Printf.printf " unfair\n" ) [ [| 199809; 200665; 199607; 200270; 199649 |]; [| 522573; 244456; 139979; 71531; 21461 |] ]</lang>
Output
distribution 199809 200665 199607 200270 199649 distance 4.14628 [0.386571 > 0.05] fair distribution 522573 244456 139979 71531 21461 distance 790063 [0 > 0.05] unfair
Python
Implements the Chi Square Probability function with an integration. I'm sure there are better ways to do this. Compare to OCaml implementation. <lang c>import math
def GammaInc_Q( a, x):
a1 = a-1 a2 = a-2 def a0( t ): return pow(t, a1)*math.exp(-t)
def da0(t): return (a1 -t)*pow(t,a2)*math.exp(-t) y = 0.2 while(a0(y) >1.0e-7) and (y < x): y += 1 if y > x: y = x
h = 1.0e-4 n = int(y/h) h = y/n s = h * sum( a0(t)+.5*h*da0(t) for t in ( h*j for j in xrange(n))) return 1.0 - s
def chi2UniformDistance( dataSet ):
expected = sum(dataSet)*1.0/len(dataSet) e = (d*1.0-expected for d in dataSet) f = (x*x for x in e) fsum = sum(f) distance = fsum/expected return distance
def chi2Probability(dof, distance):
return GammaInc_Q( 0.5*dof, 0.5*distance)
def chi2IsUniform(dataSet, significance):
dof = len(dataSet)-1 dist = chi2UniformDistance(dataSet) return chi2Probability( dof, dist ) > significance
dset1 = [ 199809, 200665, 199607, 200270, 199649 ]
dset2 = [ 522573, 244456, 139979, 71531, 21461 ]
for ds in (dset1, dset2):
print "Data set:", ds dof = len(ds)-1 distance =chi2UniformDistance(ds) print "dof: %d distance: %.4f" % (dof, distance), prob = chi2Probability( dof, distance) print "probability: %.4f"%prob, print "uniform? ", "Yes"if chi2IsUniform(ds,0.05) else "No"</lang>
Output:
Data set: [199809, 200665, 199607, 200270, 199649] dof: 4 distance: 4.146280 probability: 0.3866 uniform? Yes Data set: [522573, 244456, 139979, 71531, 21461] dof: 4 distance: 790063.275940 probability: 0.0000 uniform? No
Tcl
<lang tcl>package require Tcl 8.5 package require math::statistics
proc isUniform {distribution {significance 0.05}} {
set count [tcl::mathop::+ {*}[dict values $distribution]] set expected [expr {double($count) / [dict size $distribution]}] set X2 0.0 foreach value [dict values $distribution] {
set X2 [expr {$X2 + ($value - $expected)**2 / $expected}]
} set degreesOfFreedom [expr {[dict size $distribution] - 1}] set likelihoodOfRandom [::math::statistics::incompleteGamma \
[expr {$degreesOfFreedom / 2.0}] [expr {$X2 / 2.0}]]
expr {$likelihoodOfRandom > $significance}
}</lang> Testing: <lang tcl>proc makeDistribution {operation {count 1000000}} {
for {set i 0} {$i<$count} {incr i} {incr distribution([uplevel 1 $operation])} return [array get distribution]
}
set distFair [makeDistribution {expr int(rand()*5)}] puts "distribution \"$distFair\" assessed as [expr [isUniform $distFair]?{fair}:{unfair}]" set distUnfair [makeDistribution {expr int(rand()*rand()*5)}] puts "distribution \"$distUnfair\" assessed as [expr [isUniform $distUnfair]?{fair}:{unfair}]"</lang> Output:
distribution "0 199809 4 199649 1 200665 2 199607 3 200270" assessed as fair distribution "4 21461 0 522573 1 244456 2 139979 3 71531" assessed as unfair