Posit numbers/decoding

Posits are a quantization of the real projective line proposed by John Gustafson in 2015. Their proponents claim that they improve on IEEE 754 floating point.

The purpose of this task is to write a program capable of decoding a posit number. Use the example provided by Gustafson in his paper: 0b0000110111011101, a 16-bit posit with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.
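As a rough illustration of what a decoder has to do, here is a minimal Python sketch. The function name `decode_posit` and its signature are invented for this example and are not part of the task. It assumes the usual posit layout: a sign bit, a run-length-coded regime, up to es exponent bits, then a fraction with an implicit leading 1. It also assumes that negative values are two's-complemented before decoding, and that exponent bits cut off by the end of the word count as zeros.

```python
from fractions import Fraction

def decode_posit(bits, es):
    """Decode a posit given as a string of '0'/'1' characters.

    Returns an exact Fraction, or None for the single point at infinity.
    """
    n = len(bits)
    word = int(bits, 2)
    if word == 0:
        return Fraction(0)
    if word == 1 << (n - 1):
        return None  # 1 followed by zeros: the point at +/- infinity

    negative = bits[0] == '1'
    if negative:
        # Assumption: negative posits are the two's complement of their magnitude.
        word = (-word) & ((1 << n) - 1)
    body = format(word, '0{}b'.format(n))[1:]  # drop the sign bit

    # Regime: a run of identical bits, ended by the opposite bit or the end of the word.
    first = body[0]
    run = len(body) - len(body.lstrip(first))
    k = run - 1 if first == '1' else -run
    rest = body[run + 1:]  # skip the terminating bit, if there is one

    # Exponent: up to es bits; missing low-order bits are assumed to be zero.
    exp_bits = rest[:es]
    exponent = int(exp_bits, 2) << (es - len(exp_bits)) if exp_bits else 0

    # Fraction: whatever is left, with an implicit leading 1.
    frac_bits = rest[es:]
    fraction = 1 + (Fraction(int(frac_bits, 2), 1 << len(frac_bits)) if frac_bits else 0)

    scale = k * (1 << es) + exponent  # total power of two
    value = fraction * Fraction(2) ** scale
    return -value if negative else value

print(decode_posit('0000110111011101', 3))  # → 477/134217728
```

For the task's bit pattern, the regime `0001` gives k = -3, the exponent bits `101` give 5, and the fraction bits `11011101` give 1 + 221/256 = 477/256, so the value is 477/256 × 2^(-3·8 + 5) = 477/2^27.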

Jeff Johnson of Facebook Research described posit numbers as follows:

A more efficient representation for tapered floating points is the recent posit format by Gustafson. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code, with the exponent $e$ encoded as a Golomb-Rice quotient and remainder $(q, r)$ with $q$ in unary and $r$ in binary (in posit terminology, $q$ is the regime). Remainder encoding size is defined by the exponent scale $s$, where $2^{s}$ is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by $(N, s)$, where $N$ is the word length in bits and $s$ is the exponent scale. The minimum and maximum positive finite numbers in $(N, s)$ are $f_\mathrm{min} = 2^{-(N-2)2^{s}}$ and $f_\mathrm{max} = 2^{(N-2)2^{s}}$. The number line is represented much as the projective reals, with a single point at $\pm\infty$ bounding $-f_\mathrm{max}$ and $f_\mathrm{max}$. $\pm\infty$ and 0 have special encodings; there is no NaN. The number system allows any choice of $N \geq 3$ and $0 \le s \le N - 3$. $s$ controls the dynamic range achievable; e.g., 8-bit (8, 5)-posit $f_\mathrm{max} = 2^{192}$ is larger than $f_\mathrm{max}$ in float32. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with $f_\mathrm{max}$ of 64 and 4096 accordingly. Precision is maximized in the range $\pm\left[2^{-(s+1)}, 2^{s+1}\right)$ with $N - 3 - s$ significand fraction bits, tapering to no fraction bits at $\pm f_\mathrm{max}$.

— Jeff Johnson, Rethinking floating point for deep learning, Facebook research.
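The quantities in the quotation can be checked with a few lines of Python. This is only a sketch: `split_exponent` and `posit_range` are names invented here, and the code simply assumes the formulas as quoted, namely a Golomb-Rice divisor of 2^s and f_max = 2^((N-2)·2^s) with f_min = 1/f_max.

```python
from fractions import Fraction

def split_exponent(e, s):
    """Golomb-Rice split of a power-of-two exponent e into (q, r).

    q is the regime (stored in unary in a posit), r is the remainder
    (stored in s binary bits). Floor division keeps 0 <= r < 2**s,
    including for negative e.
    """
    return divmod(e, 1 << s)

def posit_range(n, s):
    """Return (f_min, f_max) for an (N, s) posit system as exact Fractions."""
    f_max = Fraction(2) ** ((n - 2) * (1 << s))
    return 1 / f_max, f_max

print(split_exponent(-19, 3))            # → (-3, 5)
print(posit_range(8, 0)[1])              # → 64
print(posit_range(8, 1)[1])              # → 4096
print(posit_range(8, 5)[1] == 2 ** 192)  # → True
```

The first call uses the task's example, whose value 477/134217728 equals 477/256 × 2^-19: the exponent -19 splits into regime -3 and remainder 5 when s = 3.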

raku

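unit role Posit[UInt $nbits, UInt $es];

# One Bool per bit, most significant bit first.
has Bool @.bits[$nbits];

# Render the bits as a string of 0s and 1s.
method Str { sprintf('%0b' x $nbits, @!bits) }
# useed = 2^(2^es): the factor contributed by each regime step.
sub useed { 2**(2**$es) }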

# Two's complement of a bit string, preserving its width.
sub two-complement(Str $n where /^<[01]>+$/) {
  (
   (
    $n
    .trans("01" => "10")
    .parse-base(2)
    + 1
   ) +& (2**$n.chars - 1)
  ).polymod(2 xx $n.chars - 1)
  .reverse
  .join
}

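method Real {
  # All bits clear encodes zero.
  return 0 unless @!bits.any;
  # A leading 1 followed only by 0s encodes the single point at infinity.
  return Inf if self ~~ /^10*$/;
  # A set sign bit means the remaining bits are two's-complemented before parsing.
  my $sign = @!bits.head ?? -1 !! +1;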
  $sign *
    grammar {
      token TOP { ^ <regime> <exponent>? <fraction>? $ }
      token regime { [ 1+ 0? ] | [ 0+ 1? ] }
      token exponent { <.bit> ** {1..$es} }
      token fraction { <.bit>+ }
      token bit { <[01]> }
    }.parse(
      ($sign > 0 ?? {$_} !! &two-complement)(self.Str.substr(1)),
      actions => class {
        method TOP($/) {
          make $<regime>.made *
            ($<exponent> ?? $<exponent>.made !! 1) *
            ($<fraction> ?? $<fraction>.made !! 1);
        }
        method regime($/) {
          my $first-bit = $/.Str.substr(0,1);
          my $m = $/.comb.Bag{$first-bit};
          make useed**($first-bit eq '1' ?? $m - 1 !! -$m);
        }
        # Exponent bits cut off by the end of the word count as trailing zeros.
        method exponent($/) { make 2**(($/.Str ~ '0' x ($es - $/.chars)).parse-base(2)); }
        method fraction($/) {
          make reduce { $^a + $^b / ($*=2.FatRat) }, 1, |$/.comb;
        }
      }
    )
    .made
}

CHECK {
  use Test;
  # example from L<http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf>
  is Posit[16, 3]
    .new(bits => '0000110111011101'.comb.map({.Int.Bool})).Real.nude,
    (477, 134217728);
}

Output:

ok 1 -