Statistics/Basic

Statistics is all about large groups of numbers. When talking about a set of sampled data, the most frequently used quantities are the mean value and the standard deviation (stddev). If you have a set of data $x_{i}$ where $i=1,2,\cdots n$ , the mean is ${\bar {x}}\equiv {1 \over n}\sum _{i}x_{i}$ , while the stddev is $\sigma \equiv {\sqrt {{1 \over n}\sum _{i}\left(x_{i}-{\bar {x}}\right)^{2}}}$ .

When examining a large quantity of data, one often uses a histogram, which shows the counts of data samples falling into a prechosen set of intervals (or bins). When plotted, often as a bar graph, it visually indicates how often each data value occurs.
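As a small illustration of the binning described above, here is a minimal sketch that counts values from [0, 1] into equal-width bins. The function name `bin_counts` and the parameter `n_bins` are made up for this example, and inputs are assumed to lie within [0, 1].

```python
def bin_counts(values, n_bins=10):
    """Count how many values in [0, 1] fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        # min() keeps v == 1.0 in the last bin instead of running off the end
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

sample = [0.05, 0.12, 0.18, 0.55, 0.97, 1.0]
print(bin_counts(sample, n_bins=5))   # prints [3, 0, 1, 0, 2]
```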

Task

Using your language's random number routine, generate real numbers in the range [0, 1]. It doesn't matter whether you choose an open or closed range. Create 100 such numbers (i.e. sample size 100) and calculate their mean and stddev. Do the same for sample sizes of 1,000 and 10,000, or even higher if you feel like it. Show a histogram of any of these sets. Do you notice any pattern in the standard deviation?
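As a reference point for the question above, the exact values for a uniform distribution on [0, 1] can be worked out directly. Sample estimates should settle toward them as the sample size grows:

$\mu =\int _{0}^{1}x\,dx={1 \over 2},\qquad \sigma ^{2}=\int _{0}^{1}\left(x-{1 \over 2}\right)^{2}dx={1 \over 12},\qquad \sigma ={1 \over {\sqrt {12}}}\approx 0.2887$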

Extra

Sometimes so much data needs to be processed that it's impossible to keep all of it in memory at once. Can you calculate the mean, stddev and histogram of a trillion numbers? (You don't really need to process a trillion numbers; just show how it can be done.)
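One possible approach is to keep only a few running totals and update them as each value arrives, so memory use does not depend on the stream length. The sketch below uses Welford's online update for the variance, which is a swapped-in technique rather than something the task prescribes. The class name `RunningStats` and its fields are hypothetical, and values are assumed to lie in [0, 1].

```python
import math
import random

class RunningStats:
    """Accumulates count, mean, variance and a histogram one value at a time."""

    def __init__(self, n_bins=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # sum of squared deviations from the current mean
        self.hist = [0] * n_bins

    def add(self, x):
        # Welford's update: adjust the mean, then the squared-deviation sum
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        # assumes 0 <= x <= 1; min() keeps x == 1.0 in the last bin
        self.hist[min(int(x * len(self.hist)), len(self.hist) - 1)] += 1

    def stddev(self):
        # population standard deviation (divide by n), matching the task's definition
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

stats = RunningStats()
for _ in range(100_000):         # stand-in for a much longer stream
    stats.add(random.random())
print(stats.n, stats.mean, stats.stddev())
print(stats.hist)
```

Compared with keeping the sum and the sum of squares, this update avoids subtracting two large, nearly equal totals at the end, at the cost of one division per value.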

Hint

For a finite population with equal probabilities at all points, one can derive:

${\sqrt {{\frac {1}{N}}\sum _{i=1}^{N}(x_{i}-{\overline {x}})^{2}}}={\sqrt {{\frac {1}{N}}\left(\sum _{i=1}^{N}x_{i}^{2}\right)-{\overline {x}}^{2}}}$ .
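The identity follows from expanding the square under the root and using $\sum _{i=1}^{N}x_{i}=N{\overline {x}}$ :

${\frac {1}{N}}\sum _{i=1}^{N}(x_{i}-{\overline {x}})^{2}={\frac {1}{N}}\sum _{i=1}^{N}x_{i}^{2}-2{\overline {x}}\cdot {\frac {1}{N}}\sum _{i=1}^{N}x_{i}+{\overline {x}}^{2}={\frac {1}{N}}\left(\sum _{i=1}^{N}x_{i}^{2}\right)-{\overline {x}}^{2}$

In practice this means only the count, the sum of the values and the sum of their squares need to be kept, so a single pass over the data is enough. The subtraction can lose precision when the mean is large compared with the spread of the data.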

C

Sample code. <lang C>#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>

#define n_bins 10

/* uniform random number in [0, 1) */
double rand01(void) { return rand() / (RAND_MAX + 1.0); }

/* draw count samples; return the mean, store stddev and histogram */
double avg(int count, double *stddev, int *hist)
{
    double x[count];
    double m = 0, s = 0;

    for (int i = 0; i < n_bins; i++) hist[i] = 0;
    for (int i = 0; i < count; i++) {
        m += (x[i] = rand01());
        hist[(int)(x[i] * n_bins)] ++;
    }

    m /= count;
    for (int i = 0; i < count; i++) s += x[i] * x[i];
    *stddev = sqrt(s / count - m * m);
    return m;
}

void hist_plot(int *hist)
{
    int max = 0, step = 1;
    double inc = 1.0 / n_bins;

    for (int i = 0; i < n_bins; i++)
        if (hist[i] > max) max = hist[i];

    /* scale if numbers are too big */
    if (max >= 60) step = (max + 59) / 60;

    for (int i = 0; i < n_bins; i++) {
        printf("[%5.2g,%5.2g]%5d ", i * inc, (i + 1) * inc, hist[i]);
        for (int j = 0; j < hist[i]; j += step)
            printf("#");
        printf("\n");
    }
}

/* record for moving average and stddev. Values kept are sums and sum data^2
 * to avoid excessive precision loss due to divisions, but some loss is inevitable
 */
typedef struct {
    uint64_t size;
    double sum, x2;
    uint64_t hist[n_bins];
} moving_rec;

void moving_avg(moving_rec *rec, double *data, int count)
{
    double sum = 0, x2 = 0;
    /* not adding data directly to the sum in case both recorded sum and
     * count of this batch are large; slightly less likely to lose precision */
    for (int i = 0; i < count; i++) {
        sum += data[i];
        x2 += data[i] * data[i];
        rec->hist[(int)(data[i] * n_bins)]++;
    }

    rec->sum += sum;
    rec->x2 += x2;
    rec->size += count;
}

int main()
{
    double m, stddev;
    int hist[n_bins], samples = 10;

    while (samples <= 10000) {
        m = avg(samples, &stddev, hist);
        printf("size %5d: %g %g\n", samples, m, stddev);
        samples *= 10;
    }

    printf("\nHistogram:\n");
    hist_plot(hist);

    printf("\nMoving average:\n N Mean Sigma\n");
    moving_rec rec = { 0, 0, 0, {0} };
    double data[100];
    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 100; j++) data[j] = rand01();

        moving_avg(&rec, data, 100);

        if ((i % 1000) == 999) {
            /* cast so the argument matches %llu on every platform */
            printf("%4lluk %f %f\n",
                (unsigned long long)(rec.size / 1000),
                rec.sum / rec.size,
                sqrt(rec.x2 * rec.size - rec.sum * rec.sum) / rec.size
            );
        }
    }
    return 0;
}</lang>

Icon and Unicon

The following uses the stddev procedure from the Standard_deviation task.

<lang Icon>procedure main(A)

  W := 50                      # avg width for histogram bar
  B := 10                      # histogram bins
  if *A = 0 then put(A,100)    # 100 if none specified

while N := get(A) do { # once per argument

  write("\nN=",N)

  N := 0 < integer(N) | next   # skip if invalid 
  
  stddev() # reset
  m := 0.
  H := list(B,0)               # Histogram of 
  every i := 1 to N do {       # calc running ...
     s := stddev(r := ?0)      # ... std dev 
     m +:= r/N                 # ... mean
     H[integer(*H*r)+1] +:= 1  # ... histogram
     }

  write("mean=",m)
  write("stddev=",s)
  every i := 1 to *H do        # show histogram 
     write(right(real(i)/*H,5)," : ",repl("*",integer(*H*50./N*H[i]))) 
  }

end</lang>

Output:

N=100
mean=0.4941076275054806
stddev=0.2812938788216594
  0.1 : ****************************************
  0.2 : *******************************************************
  0.3 : *******************************************************
  0.4 : **********************************************************************
  0.5 : ****************************************
  0.6 : *********************************************
  0.7 : ****************************************
  0.8 : *****************************************************************
  0.9 : ****************************************
  1.0 : **************************************************

N=10000
mean=0.4935428224375008
stddev=0.2884171825227816
  0.1 : ***************************************************
  0.2 : ***************************************************
  0.3 : ***************************************************
  0.4 : **************************************************
  0.5 : ****************************************************
  0.6 : *************************************************
  0.7 : ***********************************************
  0.8 : ************************************************
  0.9 : **************************************************
  1.0 : ***********************************************

N=1000000
mean=0.4997503773607869
stddev=0.2886322440610256
  0.1 : *************************************************
  0.2 : **************************************************
  0.3 : **************************************************
  0.4 : **************************************************
  0.5 : *************************************************
  0.6 : **************************************************
  0.7 : *************************************************
  0.8 : *************************************************
  0.9 : **************************************************
  1.0 : *************************************************

J

J has library routines to compute mean and standard deviation: <lang j> require'statfns'

  (mean,stddev) ?1000#0

0.484669 0.287482

  (mean,stddev) ?10000#0

0.503642 0.290777

  (mean,stddev) ?100000#0

0.499677 0.288726</lang>

but they are not quite what is being asked for here.

Instead:

<lang j>meanstddevP=:3 :0

 NB. compute mean and std dev of y random numbers 
 NB. picked from even distribution between 0 and 1
 s=.t=. 0
 for_n.i.<.y%1e6 do.
   s=.s+ +/ data=. ?1e6#0
   t=.t+ +/(data-0.5)^2
 end.
 s=.s+ +/ data=. ?(1e6|y)#0
 t=.t++/(data-0.5)^2
 (s%y),%:t%y

)</lang>

For a histogram:

<lang j>histogram=: <: @ (#/.~) @ (i.@#@[ , I.)
require'plot'
plot ((%*1+i.)100) ([;histogram) ?10000#0</lang>

(should upload an image generated this way)

Example use:

<lang j>   meanstddevP 1000
0.501617 0.288271

  meanstddevP 10000

0.49732 0.290061

  meanstddevP 100000

0.498912 0.289179</lang>

(That said, note that these numbers are random, so reported standard deviation will vary with the random sample being tested.)

This could handle a trillion random numbers on a bog-standard computer, but I am not inclined to wait that long.

Python

The second function, sd2, needs only one pass through the numbers, so it can handle large streams of numbers more efficiently. <lang python>def sd1(numbers):

   if numbers:
       mean = sum(numbers) / len(numbers)
       sd = (sum((n - mean)**2 for n in numbers) / len(numbers))**0.5
       return sd, mean
   else:
       return 0, 0

def sd2(numbers):

   if numbers:
       sx = sxx = n = 0
       for x in numbers:
           sx += x
           sxx += x*x
           n += 1
       sd = (n * sxx - sx*sx)**0.5 / n
       return sd, sx / n
   else:
       return 0, 0

def histogram(numbers):

   h = [0] * 10
   maxwidth = 50 # characters
   for n in numbers:
       h[int(n*10)] += 1
   mx = max(h)
   print()
   for n, i in enumerate(h):
       print('%3.1f: %s' % (n / 10, '+' * int(i / mx * maxwidth)))
   print()

if __name__ == '__main__':

   import random
   for i in range(1,8):
       n = [random.random() for i in range(10**i)]
       print("\n##\n## %i numbers\n##" % 10**i)
       print('  Naive  method: sd: %8.6f, mean: %8.6f' % sd1(n))
       print('  Second method: sd: %8.6f, mean: %8.6f' % sd2(n))
       histogram(n)</lang>

Section of output

For larger sets of random numbers, the distribution of numbers between the bins of the histogram evens out.

...
##
## 100 numbers
##
  Naive  method: sd: 0.288911, mean: 0.508686
  Second method: sd: 0.288911, mean: 0.508686

0.0: +++++++++++++++++++++++++++++++
0.1: ++++++++++++++++++++++++++++
0.2: +++++++++++++++++++++++++
0.3: ++++++++++++++++++++++++++++++++++++++++++++++++++
0.4: ++++++++++++++++++
0.5: +++++++++++++++++++++++++++++++
0.6: ++++++++++++++++++
0.7: +++++++++++++++++++++++++++++++++++++
0.8: ++++++++++++++++++++++++++++++++++++++++
0.9: +++++++++++++++++++++++++++++++

...

##
## 10000000 numbers
##
  Naive  method: sd: 0.288750, mean: 0.499839
  Second method: sd: 0.288750, mean: 0.499839

0.0: ++++++++++++++++++++++++++++++++++++++++++++++++++
0.1: +++++++++++++++++++++++++++++++++++++++++++++++++
0.2: +++++++++++++++++++++++++++++++++++++++++++++++++
0.3: +++++++++++++++++++++++++++++++++++++++++++++++++
0.4: +++++++++++++++++++++++++++++++++++++++++++++++++
0.5: +++++++++++++++++++++++++++++++++++++++++++++++++
0.6: +++++++++++++++++++++++++++++++++++++++++++++++++
0.7: +++++++++++++++++++++++++++++++++++++++++++++++++
0.8: +++++++++++++++++++++++++++++++++++++++++++++++++
0.9: +++++++++++++++++++++++++++++++++++++++++++++++++