Posit numbers/decoding
Posit is a quantization of the real projective line proposed by John Gustafson in 2015. It is claimed to be an improvement over IEEE 754.
Posit numbers/decoding is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
The purpose of this task is to write a program capable of decoding a posit number. You will use the example provided by Gustafson in his paper : 0b0000110111011101, representing a 16-bit long real number with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.
Jeff Johnson from Facebook research, described posit numbers as such:
- A more efficient representation for tapered floating points is the recent posit format by Gustafson. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code, with the exponent encoded as a Golomb-Rice quotient and remainder with in unary and in binary (in posit terminology, is the regime). Remainder encoding size is defined by the exponent scale , where is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by , where is the word length in bits and is the exponent scale. The minimum and maximum positive finite numbers in are Failed to parse (syntax error): {\displaystyle f_\mathrm{min} = 2^{−(N−2)2^s}} and Failed to parse (syntax error): {\displaystyle f_\mathrm{max} = 2^{(N−2)2^s}} . The number line is represented much as the projective reals, with a single point at bounding Failed to parse (syntax error): {\displaystyle −f_\mathrm{max}} and . and 0 have special encodings; there is no NaN. The number system allows any choice of and Failed to parse (syntax error): {\displaystyle 0\le s\le N − 3} .
- controls the dynamic range achievable; e.g., 8-bit (8, 5)-posit is larger than in float32. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with of 64 and 4096 accordingly. Precision is maximized in the range Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \pm\left[2^{−(s+1)}, 2^{s+1}\right)} with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N − 3 − s} significand fraction bits, tapering to no fraction bits at .
- — Jeff Johnson, Rethinking floating point for deep learning, Facebook research.
raku
unit role Posit[UInt $nbits, UInt $es];
has Bool @.bits[$nbits];
method Str { sprintf('%0b' x $nbits, @!bits) }
sub useed { 2**(2**$es) }
sub two-complement(Str $n where /^<[01]>+$/) {
(
(
$n
.trans("01" => "10")
.parse-base(2)
+ 1
) +& (2**$n.chars - 1)
).polymod(2 xx $n.chars - 1)
.reverse
.join
}
method Real {
return 0 unless @!bits.any;
return Inf if self ~~ /^10*$/;
my $sign = @!bits.head ?? -1 !! +1;
$sign *
grammar {
token TOP { ^ <regime> <exponent>? <fraction>? $ }
token regime { [ 1+ 0? ] | [ 0+ 1? ] }
token exponent { <.bit> ** {1..$es} }
token fraction { <.bit>+ }
token bit { <[01]> }
}.parse(
($sign > 0 ?? {$_} !! &two-complement)(self.Str.substr(1)),
actions => class {
method TOP($/) {
make $<regime>.made *
($<exponent> ?? $<exponent>.made !! 1) *
($<fraction> ?? $<fraction>.made !! 1);
}
method regime($/) {
my $first-bit = $/.Str.substr(0,1);
my $m = $/.comb.Bag{$first-bit};
make useed**($first-bit eq '1' ?? $m - 1 !! -$m);
}
method exponent($/) { make 2**($/.Str.parse-base: 2); }
method fraction($/) {
make reduce { $^a + $^b / ($*=2.FatRat) }, 1, |$/.comb;
}
}
)
.made
}
CHECK {
use Test;
# example from L<http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf>
is Posit[16, 3]
.new(bits => '0000110111011101'.comb.map({.Int.Bool})).Real.nude,
(477, 134217728);
}
- Output:
ok 1 -