Jump to content

Kahan summation: Difference between revisions

Line 390:
The expression <code>S = A(1) + A(2) + A(3)</code> was rendered as four operation codes in a row: FLD, FADD, FADD, FSTP - omitting the addressing details. This means that the arithmetic was conducted in the cpu's floating-point unit using eighty-bit floating-point arithmetic (or, REAL*10), however the FSTP that stored the result to S, stored to a 32-bit recipient so that the result was in single precision and would have been rounded to fit.
 
If instead a simple DO-loop is used to calculate the sum, each iteration involved a FLD, FADD, FSTP. This means that at each step, the value of S (single precision) is loaded into the 80-bit register, a (single precision) value is added using 80-bit arithmetic, then the result is stored back to the single-precision variable, S. The third option is to use the supplied facility, <code>S = SUM(A)</code>, but alas, this accumulates the result in the same stepwise manner, using a temporary storage area of the same type as the array: single precision. It makes no attempt to recognise that the summation could be accumulated in a register, still less to do so with 80-bit precision. Similarly, the compiler made no attempt to recognise that on completion of one expression, the next expression used the result of the previous expression. Thus, in the sequence <code>T:=S + Y; C:=(T - S) - Y;</code> the value of T was reloaded for the second statement, even though the previous statement's last step had been to save a result to T. Some compilers and some languages do allow this possibility (possibly as an option) but on computers where the working register has a different precision from that of a variable in storage, it will not be just the speed of the calculation that changes. Such changes will almost surely wreck the workings of compensated summation! Fortran has strict rules on the order of evaluation, and there is a long history of needing care. For instance, to compute m/h² where m is the mass of an electron and h is Planck's constant, h² risks exponent overflow so (m/h)/h would be safer.
 
For the test data, all methods, including compensated summation, returned exactly the same result, even to the last bit, so one wonders just what is going on. Using decimal numbers when concerned with the exact details of binary arithmetic is troublesome because the conversions are generally inexact, so the plan is to reveal the numbers in base two. Besides F format, later Fortran version offers further options such as B O, and Z formats for binary, octal and hexadecimal (the code H being spoken for another use) however these present the bit-patterns as packed into the storage used for the variable, they do not reveal its ''numerical'' value in binary, etc. So, for 3.14159 in decimal,
Line 398:
11.001001000011111101101010100010001000010110100011 is 4*ATAN(1) in double precision.
11.001001000011111101101010100010001000010110100011000010001101001100010... is more accurate.
 
It is a small relief to see that 4*ATAN(1) gives a correctly-rounded value - at least with this system. Specifying a decimal value, even if with additional digits, may not yield the best result in binary. To find out exactly what binary values are being used, one needs a scheme as follows:
<lang Fortran>
MODULE PROBE
CONTAINS
CHARACTER*76 FUNCTION FP8DIGITS(X,BASE,W) !Full expansion of the value of X in BASE.
Converts a number X to a specified BASE. For integers, successive division by BASE, for fractions, successive multiplication.
Can't use the FORMAT system in case this function is invoked in a formatted WRITE: some furrytrans don't handle reentrancy.
REAL*8 X,T !The value, and an associate.
INTEGER BASE !As desired.
INTEGER W !Allowance for the integer part, to promote alignment.
CHARACTER*(76) TEXT !Scratchpad for results.
INTEGER L !The length of the result.
INTEGER N,ND !Counters.
INTEGER D !The digit of the moment.
LOGICAL NEG !Annoyance with signs.
CHARACTER*1 DIGIT(0:15) !Want more than just decimal.
PARAMETER (DIGIT = (/"0","1","2","3","4","5","6","7","8","9", !So no CHAR(ICHAR("0") + d) tricks.
1 "A","B","C","D","E","F"/)) !Because these are not adjacent character codes.
IF (BASE.LE.1 .OR. BASE.GT.16) BASE = 10 !Preclude oddities.
TEXT = "Base" !Scrub the TEXT with an announcement.
IF (BASE.GE.10) TEXT (6:6) = "1" !I'm playing with a restricted range only.
TEXT (7:7) = DIGIT(MOD(BASE,10)) !So in-line code will do.
TEXT(8:8) = ":" !So much for the preamble.
T = X !Grab the value.
N = T !Its integer part, with truncation.
T = ABS(T - N) !Thus obtain the fractional part.
NEG = N .LT. 0 !Negative numbers are a nuisance.
IF (NEG) N = -N !So simplify for what follows.
L = 10 + W !Finger the units position..
ND = 0 !No digits have been rolled.
Crunch the integer part.
10 D = MOD(N,BASE) !Extract the low-order digit in BASE.
TEXT(L:L) = DIGIT(D) !Place it as text.
ND = ND + 1 !Count another digit rolled.
N = N/BASE !Drop down a power.
L = L - 1 !Move back correspondingly.
IF (L.LE.0) THEN !Run out of space?
TEXT(10:10) = "!" !"Overflow!"
GO TO 900 !TEXT might be far too short.
END IF !But, space is expected.
IF (N.GT.0) GO TO 10 !Are we there yet?
IF (NEG) THEN !Yes! Is a negative sign needed?
TEXT(L:L) = "-" !Yes. Place it.
L = L - 1 !And retreat another place.
END IF !No + sign for positive numbers.
L = 10 + W + 1 !Finger what follows the units position.
TEXT(L:L) = "." !Laziness leads to a full stop for a decimal point.
Crunch through the fractional part until nothing remains.
DO WHILE(T.GT.0) !Eventually, this will be zero.
IF (L.GE.LEN(TEXT)) THEN !Provided I have enough space!
L = LEN(TEXT) !If not, use the whole supply.
TEXT(L:L) = "~" !Place a marker suggesting that more should follow.
GO TO 900 !And give up.
END IF !Otherwise, a digit is to be found.
T = T*BASE !Shift up a power.
N = T !The integer part is the digit.
T = T - N !Remove that integer part from T.
L = L + 1 !Advance the finger.
TEXT(L:L) = DIGIT(N) !Place the digit.
ND = ND + 1 !Count it also.
END DO !And see if anything remains.
Cast forth an addendum, to save the reader from mumbling while counting long strings of digits.
IF (LEN(TEXT) - L .GT. 11) THEN !Err, is there space for an addendum?
TEXT(L + 2:L + 8) = "Digits:" !Yes! Reveal the number of digits.
L = L + 10 !Advance to the tens-to-be location.
IF (ND.GT.9) TEXT(L:L) = DIGIT(ND/10) !I expect no more than two-digit digit counts.
L = L + 1 !Finger the units position.
TEXT(L:L) = DIGIT(MOD(ND,10)) !Thus, no use of the FORMAT system.
END IF !So m uch for the addendum.
900 FP8DIGITS = TEXT !Anyway, here it all is.
END FUNCTION FP8DIGITS !Bases play best with related bases, such as 4 and 8. Less so with (say) 3 and 7...
 
REAL FUNCTION SUMC(A,N) !Add elements of the array, using limited precision.
Compensated summation. C will not stay zero, despite mathematics.
REAL A(N) !The array. Presumably with at least one element. Horror if N is wrongly supplied!
INTEGER N !The number of elements.
REAL S,C,Y,T !Assistants.
INTEGER I !A stepper.
S = A(1) !Start with the first element.
C = 0.0 !No previous omissions to carry forward.
DO I = 2,N !Step through the remainder of the array.
WRITE (6,*) "i=",I
WRITE (6,666) "A(i)", FP8DIGITS(DBLE(A(I)),2,14)
666 FORMAT (A8,":",A)
WRITE (6,666) "C", FP8DIGITS(DBLE(C),2,14)
Y = A(I) - C !Combine the next value with the compensation.
WRITE (6,666) "Y=A-C",FP8DIGITS(DBLE(Y),2,14)
WRITE (6,666) "S", FP8DIGITS(DBLE(S),2,14)
T = S + Y !Augment the sum, temporarily in T.
WRITE (6,666) "T=S+Y",FP8DIGITS(DBLE(T),2,14)
WRITE (6,666) "T-S", FP8DIGITS(DBLE(T - S),2,14)
C = (T - S) - Y !Catch what part of Y didn't get added to T.
WRITE (6,666) "C=T-S-Y",FP8DIGITS(DBLE(C),2,14)
S = T !Place the sum.
WRITE (6,666) "S",FP8DIGITS(DBLE(S),2,14)
END DO !On to the next element.
SUMC = S !C will no longer be zero.
END FUNCTION SUMC !Using a working mean might help.
END MODULE PROBE
 
USE PROBE
REAL A(3)
REAL*4 X4E,X4L,X4S,X4C
REAL*8 X8E
INTEGER I
INTEGER BASE,W
BASE = 2
W = 14
A(1) = 10000.0
A(2) = 3.14159
A(3) = 2.71828
 
WRITE (6,*) "Sum via the additions in one expression."
X4E = A(1) + A(2) + A(3) !Calc. in R10, saved to R4.
WRITE (6,1) "4",X4E
WRITE (6,666) "1exprn",FP8DIGITS(DBLE(X4E),BASE,W)
 
WRITE (6,*) "Sum via a loop."
X4L = 0
DO I = 1,3
X4L = X4L + A(I) !Terms in R10, saved to R4.
WRITE (6,666) "A(i)",FP8DIGITS(DBLE(A(I)),BASE,W)
WRITE (6,666) "X4L",FP8DIGITS(DBLE(X4L),BASE,W)
END DO
WRITE (6,1) "L",X4L
WRITE (6,666) "Loop",FP8DIGITS(DBLE(X4L),BASE,W)
 
WRITE (6,*) "Sum via SUM(A)"
X4S = SUM(A)
WRITE (6,1) "s",X4S
WRITE (6,666) "SUM(A)",FP8DIGITS(DBLE(X4S),BASE,W)
X8E = A(1) + A(2) + A(3) !Calc in R10, saved to R8.
WRITE (6,*) "X4S - X4L",X4S - X4L
WRITE (6,*) "X4E - X8E=",X4E - X8E
WRITE (6,1) "8",X8E
WRITE (6,666) "1exprn*8",FP8DIGITS(X8E,BASE,W)
 
WRITE (6,*) "Sum via SUMC"
X4C = SUMC(A,3)
WRITE (6,1) "C",X4C
WRITE (6,666) "SUMC",FP8DIGITS(DBLE(X4C),BASE,W)
 
WRITE (6,666) "Dec. calc",FP8DIGITS(10005.85987D0,BASE,W)
WRITE (6,666) "Pie",FP8DIGITS(10000D0+4*ATAN(1D0)+EXP(1D0),BASE,W)
WRITE (6,*) "The array..."
DO I = 1,3
WRITE (6,*) FP8DIGITS(DBLE(A(I)),BASE,W)
END DO
WRITE (6,"('3.14159 bits, packed',B32)") A(2)
1 FORMAT (A1,"Sum = ",F12.1)
666 FORMAT (A8,":",A)
END
</lang>
Because of the use of a function returning a CHARACTER value, and an array with a lower bound of zero, F90 features are needed, and to reduce the complaints over non-declaration of types of invoked functions, the MODULE protocol was convenient. Startlingly, if function SUMC was invoked with the wrong value of N (after removing a fourth element, Euler's constant, 0·5772156649015..., but forgetting to change the 4 back to 3) there was no complaint from the array bound-checking, and a fourth element was accessed! (It was zero) Changing the declaration from the old-style of A(N) to A(:) revived the bound checking, but at the loss of a documentation hint.
 
The output is rather messy, so after some slight editing,
<pre>
Sum via the additions in one expression.
4Sum = 10005.9
1exprn: 10011100010101.1101110001
Sum via a loop.
A(i): 10011100010000.
X4L: 10011100010000.
A(i): 11.001001000011111101
X4L: 10011100010011.0010010001
A(i): 10.1011011111100001001101
X4L: 10011100010101.1101110001
LSum = 10005.9
Loop: 10011100010101.1101110001
Sum via SUM(A)
sSum = 10005.9
SUM(A): 10011100010101.1101110001
X4S - X4L 0.0000000E+00
X4E - X8E= 4.813671112060547E-004
8Sum = 10005.9
1exprn*8: 10011100010101.1101110000100000011101
Sum via SUMC
i= 2
A(i): 11.001001000011111101
C: 0.
Y=A-C: 11.001001000011111101
S: 10011100010000.
T=S+Y: 10011100010011.0010010001
T-S: 11.0010010001
C=T-S-Y: 0.000000000000000011
S: 10011100010011.0010010001
i= 3
A(i): 10.1011011111100001001101
C: 0.000000000000000011
Y=A-C: 10.1011011111100000011101
S: 10011100010011.0010010001
T=S+Y: 10011100010101.1101110001
T-S: 10.10111
C=T-S-Y: 0.0000000000011111100011
S: 10011100010101.1101110001
CSum = 10005.9
SUMC: 10011100010101.1101110001
Dec. cal: 10011100010101.110111000010000001110000101110001101
Pie: 10011100010101.110111000010000010111011111010110001
The array...
10011100010000.
11.001001000011111101
10.1011011111100001001101
3.14159 bits, packed 1000000010010010000111111010000
</pre>
The key point is that twenty-four bits are employed for the mantissa of a single-precision floating-point number, and this is using the implicit leading-one bit of normalised numbers - that bit is not stored so that 23 bits of storage deliver 24 bits... When 10,000 is represented in binary that requires fourteen bits so a further ten bits below the fractional point are yet available. So, placing a | after the tenth fractional bit,
3.14159: 11.0010010000 | 11111101
2.71828: +10.1011011111 | 100001001101
= 101.1101101111 |1100000011101 There should be a carry from the lower bits...
= 101.1101110000 This should be the result in the lower bits.
101.1101110001 This is the result in the lower bits.
Because both values round up one (because the eleventh bit is a 1) in the actual calculation, the result is one bit high.
 
=={{header|Go}}==
1,220

edits

Cookies help us deliver our services. By using our services, you agree to our use of cookies.