Posit numbers/decoding

Revision as of 07:36, 18 September 2023 by Grondilu (talk | contribs) (adding quote from Jeff Johnson)

Posit is a quantization of the real projective line proposed by John Gustafson in 2015. It is claimed to be an improvement over IEEE 754.

Posit numbers/decoding is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

The purpose of this task is to write a program capable of decoding a posit number. You will use the example provided by Gustafson in his paper : 0b0000110111011101, representing a 16-bit long real number with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.

Jeff Johnson from Facebook research, described posit numbers as such:

A more efficient representation for tapered floating points is the recent posit format by Gustafson. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code, with the exponent encoded as a Golomb-Rice quotient and remainder with in unary and in binary (in posit terminology, is the regime). Remainder encoding size is defined by the exponent scale , where is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by , where is the word length in bits and is the exponent scale. The minimum and maximum positive finite numbers in are Failed to parse (syntax error): {\displaystyle f_\mathrm{min} = 2^{−(N−2)2^s}} and Failed to parse (syntax error): {\displaystyle f_\mathrm{max} = 2^{(N−2)2^s}} . The number line is represented much as the projective reals, with a single point at bounding Failed to parse (syntax error): {\displaystyle −f_\mathrm{max}} and . and 0 have special encodings; there is no NaN. The number system allows any choice of and Failed to parse (syntax error): {\displaystyle 0\le s\le N − 3} .
controls the dynamic range achievable; e.g., 8-bit (8, 5)-posit is larger than in float32. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with of 64 and 4096 accordingly. Precision is maximized in the range Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \pm\left[2^{−(s+1)}, 2^{s+1}\right)} with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N − 3 − s} significand fraction bits, tapering to no fraction bits at .
— Jeff Johnson, Rethinking floating point for deep learning, Facebook research.


raku

unit role Posit[UInt $nbits, UInt $es];

has Bool @.bits[$nbits];

method Str { sprintf('%0b' x $nbits, @!bits) }
sub useed { 2**(2**$es) }

sub two-complement(Str $n where /^<[01]>+$/) {
  (
   (
    $n
    .trans("01" => "10")
    .parse-base(2)
    + 1
   ) +& (2**$n.chars - 1)
  ).polymod(2 xx $n.chars - 1)
  .reverse
  .join
}

method Real {
  return 0 unless @!bits.any;
  return Inf if self ~~ /^10*$/;
  my $sign = @!bits.head ?? -1 !! +1;
  $sign *
    grammar {
      token TOP { ^ <regime> <exponent>? <fraction>? $ }
      token regime { [ 1+ 0? ] | [ 0+ 1? ] }
      token exponent { <.bit> ** {1..$es} }
      token fraction { <.bit>+ }
      token bit { <[01]> }
    }.parse(
      ($sign > 0 ?? {$_} !! &two-complement)(self.Str.substr(1)),
      actions => class {
        method TOP($/) {
          make $<regime>.made *
            ($<exponent> ?? $<exponent>.made !! 1) *
            ($<fraction> ?? $<fraction>.made !! 1);
        }
        method regime($/) {
          my $first-bit = $/.Str.substr(0,1);
          my $m = $/.comb.Bag{$first-bit};
          make useed**($first-bit eq '1' ?? $m - 1 !! -$m);
        }
        method exponent($/) { make 2**($/.Str.parse-base: 2); }
        method fraction($/) {
          make reduce { $^a + $^b / ($*=2.FatRat) }, 1, |$/.comb;
        }
      }
    )
    .made
}

CHECK {
  use Test;
  # example from L<http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf>
  is Posit[16, 3]
    .new(bits => '0000110111011101'.comb.map({.Int.Bool})).Real.nude,
    (477, 134217728);
}
Output:
ok 1 -