Cumulative standard deviation: Difference between revisions

Content added Content deleted
(→‎{{header|Ada}}: Marked incomplete.)
(→‎Old style, two ways: Demonstrate truncation error.)
Line 1,213: Line 1,213:
</lang>
</lang>


===Old style, two ways===
===Old style, four ways===
Early computers loaded the entire programme and its working storage into memory and left it there throughout the run. Uninitialised variables would start with whatever had been left in memory at their address by whatever last used those addresses, though some systems would clear all of memory to zero or possibly some other value before each load. Either way, if a routine was invoked a second time, its variables would have the values left in them by their previous invocation. The DATA statement allows initial values to be specified, and allows repeat counts as well. It is not an executable statement: it is not re-executed on second and subsequent invocations of the containing routine. Thus, it is easy to have a routine employ counters and the like, visible only within themselves and initialised to zero or whatever suited.
Early computers loaded the entire programme and its working storage into memory and left it there throughout the run. Uninitialised variables would start with whatever had been left in memory at their address by whatever last used those addresses, though some systems would clear all of memory to zero or possibly some other value before each load. Either way, if a routine was invoked a second time, its variables would have the values left in them by their previous invocation. The DATA statement allows initial values to be specified, and allows repeat counts as well. It is not an executable statement: it is not re-executed on second and subsequent invocations of the containing routine. Thus, it is easy to have a routine employ counters and the like, visible only within themselves and initialised to zero or whatever suited.


With more complex operating systems, routines that relied on retaining values across invocations might no longer work - perhaps a fresh version of the routine would be loaded to memory (perhaps at odd intervals), or, on exit, the working storage would be discarded. There was a half-way scheme, whereby variables that had appeared in DATA statements would be retained while the others would be discarded. This subtle indication has been discarded in favour of the explicit SAVE statement, naming those variables whose values are to be retained between invocations, though compilers might also offer an option such as "automatic" (for each invocation always allocate then discard working memory) and "static" (retain values), possibly introducing non-standard keywords as well. Otherwise, the routines would have to use storage global to them such as additional parameters, or, COMMON storage and in later Fortran, the MODULE arrangements for shared items. The persistence of such storage can still be limited, but by naming them in the main line can be ensured for the life of the run. The other routines with access to such storage could enable re-initialisation, additional reports, or multiple accumulations, etc.
With more complex operating systems, routines that relied on retaining values across invocations might no longer work - perhaps a fresh version of the routine would be loaded to memory (perhaps at odd intervals), or, on exit, the working storage would be discarded. There was a half-way scheme, whereby variables that had appeared in DATA statements would be retained while the others would be discarded. This subtle indication has been discarded in favour of the explicit SAVE statement, naming those variables whose values are to be retained between invocations, though compilers might also offer an option such as "automatic" (for each invocation always allocate then discard working memory) and "static" (retain values), possibly introducing non-standard keywords as well. Otherwise, the routines would have to use storage global to them such as additional parameters, or, COMMON storage and in later Fortran, the MODULE arrangements for shared items. The persistence of such storage can still be limited, but by naming them in the main line can be ensured for the life of the run. The other routines with access to such storage could enable re-initialisation, additional reports, or multiple accumulations, etc.


Since the standard deviation can be calculated in a single pass through the data, producing values for the standard deviation of all values so far supplied is easily done without re-calculation. Accuracy is quite another matter. Calculations using deviances from a working mean are much better, and capturing the first X as the working mean would be easy, just test on N = 0. The sum and sum-of-squares method is quite capable of generating a negative variance, but the second method cannot, because the terms going added in to V are never negative. This could be explored by comparing the results computed from StdDev(A), StdDev(A + 10), StdDev(A + 100), StdDev(A + 1000), etc.
Since the standard deviation can be calculated in a single pass through the data, producing values for the standard deviation of all values so far supplied is easily done without re-calculation. Accuracy is quite another matter. Calculations using deviances from a working mean are much better, and capturing the first X as the working mean would be easy, just test on N = 0. The sum and sum-of-squares method is quite capable of generating a negative variance, but the second method cannot, because the terms going added in to V are never negative. This is demonstrated by comparing the results computed from StdDev(A), StdDev(A + 10), StdDev(A + 100), StdDev(A + 1000), etc.


Incidentally, Fortran implementations rarely enable reentrancy for the WRITE statement, so, since here the functions are invoked in a WRITE statement, the functions cannot themselves use WRITE statements, say for debugging.
Incidentally, Fortran implementations rarely enable re-entrancy for the WRITE statement, so, since here the functions are invoked in a WRITE statement, the functions cannot themselves use WRITE statements, say for debugging.
<lang Fortran>
<lang Fortran>
REAL FUNCTION STDDEV(X) !Standard deviation for successive values.
REAL FUNCTION STDDEV(X) !Standard deviation for successive values.
Line 1,232: Line 1,232:
EX = X + EX !Augment the total.
EX = X + EX !Augment the total.
EX2 = X**2 + EX2 !Augment the sum of squares.
EX2 = X**2 + EX2 !Augment the sum of squares.
STDDEV = SQRT(EX2/N - (EX/N)**2) !The variance, but, it might come out negative!
V = EX2/N - (EX/N)**2 !The variance, but, it might come out negative!
STDDEV = SIGN(SQRT(ABS(V)),V) !Protext the SQRT, but produce a negative result if so.
END FUNCTION STDDEV !For the sequence of received X values.
END FUNCTION STDDEV !For the sequence of received X values.


Line 1,246: Line 1,247:
STDDEVP = SQRT(V/N) !V can never be negative, even with limited precision.
STDDEVP = SQRT(V/N) !V can never be negative, even with limited precision.
END FUNCTION STDDEVP !For the sequence of received X values.
END FUNCTION STDDEVP !For the sequence of received X values.

REAL FUNCTION STDDEVW(X) !Standard deviation for successive values.
REAL X !The latest value.
REAL V !Scratchpad.
INTEGER N !Ongoing: count of the values.
REAL EX,EX2 !Ongoing: sum of X and X**2.
REAL W,D !Ongoing: working mean.
SAVE N,EX,EX2,W !Retain values from one invocation to the next.
DATA N,EX,EX2/0,0.0,0.0/ !Initial values.
IF (N.LE.0) W = X !Take the first value as the working mean.
N = N + 1 !Another value arrives.
D = X - W !Its deviation from the working mean.
EX = D + EX !Augment the total.
EX2 = D**2 + EX2 !Augment the sum of squares.
V = EX2/N - (EX/N)**2 !The variance, but, it might come out negative!
STDDEVW = SIGN(SQRT(ABS(V)),V) !Protext the SQRT, but produce a negative result if so.
END FUNCTION STDDEVW !For the sequence of received X values.

REAL FUNCTION STDDEVPW(X) !Standard deviation for successive values.
REAL X !The latest value.
INTEGER N !Ongoing: count of the values.
REAL A,V !Ongoing: average, and sum of squared deviations.
REAL W !Ongoing: working mean.
SAVE N,A,V,W !Retain values from one invocation to the next.
DATA N,A,V/0,0.0,0.0/ !Initial values.
IF (N.LE.0) W = X !Take the first value as the working mean.
N = N + 1 !Another value arrives.
D = X - W !Its deviation from the working mean.
V = (N - 1)*(D - A)**2 /N + V !First, as it requires the existing average.
A = (D - A)/N + A != [x + (n - 1).A)]/n: recover the total from the average.
STDDEVPW = SQRT(V/N) !V can never be negative, even with limited precision.
END FUNCTION STDDEVPW !For the sequence of received X values.


PROGRAM TEST
PROGRAM TEST
Line 1,251: Line 1,284:
REAL A(8) !The example data.
REAL A(8) !The example data.
DATA A/2.0,3*4.0,2*5.0,7.0,9.0/ !Alas, another opportunity to use @ passed over.
DATA A/2.0,3*4.0,2*5.0,7.0,9.0/ !Alas, another opportunity to use @ passed over.
REAL B !An offsetting base.
WRITE (6,1)
WRITE (6,1)
1 FORMAT ("Progressive calculation of the standard deviation."/
1 FORMAT ("Progressive calculation of the standard deviation."/
1 " I, A(I), EX EX2, Av V*N.")
1 " I",7X,"A(I) EX EX2 Av V*N Ew Ew2 wAv V*N")
B = 1000000 !Provoke truncation error.
DO I = 1,8 !Step along the data series,
DO I = 1,8 !Step along the data series,
WRITE (6,2) I,A(I),STDDEV(A(I)),STDDEVP(A(I)) !Showing progressive values.
WRITE (6,2) I,INT(A(I) + B), !No fractional part, so I don't want F11.0.
1 STDDEV(A(I) + B),STDDEVP(A(I) + B), !Showing progressive values.
2 FORMAT (I2,F6.1,2F10.6) !Should do for the example.
2 STDDEVW(A(I) + B),STDDEVPW(A(I) + B) !These with a working mean.
2 FORMAT (I2,I11,1X,4F12.6) !Should do for the example.
END DO !On to the next value.
END DO !On to the next value.
END
END
</lang>
</lang>


Output: the second pair of columns have the calculations done with a working mean.
Output:
Progressive calculation of the standard deviation.
Progressive calculation of the standard deviation.
I, A(I), EX EX2, Av V*N.
I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 2.0 0.000000 0.000000
1 2 0.000000 0.000000 0.000000 0.000000
2 4.0 1.000000 1.000000
2 4 1.000000 1.000000 1.000000 1.000000
3 4.0 0.942809 0.942809
3 4 0.942809 0.942809 0.942809 0.942809
4 4.0 0.866025 0.866025
4 4 0.866025 0.866025 0.866025 0.866025
5 5.0 0.979796 0.979796
5 5 0.979796 0.979796 0.979796 0.979796
6 5.0 1.000000 1.000000
6 5 1.000000 1.000000 1.000000 1.000000
7 7.0 1.399708 1.399708
7 7 1.399708 1.399708 1.399708 1.399708
8 9.0 2.000000 2.000000
8 9 2.000000 2.000000 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 12 0.000000 0.000000 0.000000 0.000000
2 14 1.000000 1.000000 1.000000 1.000000
3 14 0.942809 0.942809 0.942809 0.942809
4 14 0.866025 0.866025 0.866025 0.866025
5 15 0.979796 0.979796 0.979796 0.979796
6 15 1.000000 1.000000 1.000000 1.000000
7 17 1.399708 1.399708 1.399708 1.399708
8 19 2.000000 2.000000 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 102 0.000000 0.000000 0.000000 0.000000
2 104 1.000000 1.000000 1.000000 1.000000
3 104 0.942809 0.942809 0.942809 0.942809
4 104 0.866025 0.866025 0.866025 0.866025
5 105 0.979796 0.979796 0.979796 0.979796
6 105 1.000000 0.999999 1.000000 1.000000
7 107 1.399708 1.399708 1.399708 1.399708
8 109 2.000000 1.999999 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 1002 0.000000 0.000000 0.000000 0.000000
2 1004 1.000000 1.000000 1.000000 1.000000
3 1004 0.942809 0.942809 0.942809 0.942809
4 1004 0.866025 0.866028 0.866025 0.866025
5 1005 0.979796 0.979798 0.979796 0.979796
6 1005 1.000000 1.000004 1.000000 1.000000
7 1007 1.399708 1.399711 1.399708 1.399708
8 1009 2.000000 1.999997 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 10002 -2.000000 0.000000 0.000000 0.000000
2 10004 -1.000000 1.000000 1.000000 1.000000
3 10004 -0.666667 0.942809 0.942809 0.942809
4 10004 1.936492 0.866072 0.866025 0.866025
5 10005 2.181742 0.979829 0.979796 0.979796
6 10005 2.309401 1.000060 1.000000 1.000000
7 10007 1.801360 1.399745 1.399708 1.399708
8 10009 2.645751 1.999987 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 100002 19.493589 0.000000 0.000000 0.000000
2 100004 7.416198 1.000000 1.000000 1.000000
3 100004 -7.333333 0.942809 0.942809 0.942809
4 100004 20.093531 0.865650 0.866025 0.866025
5 100005 -1.280625 0.979531 0.979796 0.979796
6 100005 -16.492422 1.000305 1.000000 1.000000
7 100007 17.851427 1.399895 1.399708 1.399708
8 100009 20.566963 1.999835 2.000000 2.000000

I A(I) EX EX2 Av V*N Ew Ew2 wAv V*N
1 1000002 -80.024994 0.000000 0.000000 0.000000
2 1000004 158.767120 1.000000 1.000000 1.000000
3 1000004 -89.146576 0.942809 0.942809 0.942809
4 1000004 90.795097 0.869074 0.866025 0.866025
5 1000005 193.357590 0.981953 0.979796 0.979796
6 1000005 238.361069 0.999691 1.000000 1.000000
7 1000007 153.462296 1.399519 1.399708 1.399708
8 1000009 151.284500 1.997653 2.000000 2.000000

Speaking loosely, to square a number of d digits accurately requires the ability to represent 2d digits accurately, with more digits needed if many such squares are to be added together accurately. In this example, 1000 when squared, is pushing at the limits of single precision. The average&variance method is resistant to this problem (and does not generate negative variances either!) because it manipulates differences from the running average, but it is still better to use a working mean.

In other words, a two-pass method will be more accurate (where the second pass calculates the variance by accumulating deviations from the actual average, itself calculated with a working mean) but at the cost of that second pass and the saving of all the values. Higher precision variables for the accumulations are the easiest way towards accurate results.


=={{header|Go}}==
=={{header|Go}}==