Posit numbers/decoding: Difference between revisions

From Rosetta Code
Content added Content deleted
(new draft : Posit numbers)
 
(adding quote from Jeff Johnson)
Line 4: Line 4:


The purpose of this task is to write a program capable of decoding a posit number. You will use the example provided by Gustafson in his paper : 0b0000110111011101, representing a 16-bit long real number with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.
The purpose of this task is to write a program capable of decoding a posit number. You will use the example provided by Gustafson in his paper : 0b0000110111011101, representing a 16-bit long real number with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.

Jeff Johnson from Facebook research, described posit numbers as such:

:A more efficient representation for tapered floating points is the recent ''posit'' format by Gustafson. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code, with the exponent <math>e</math> encoded as a Golomb-Rice quotient and remainder <math>(q, r)</math> with <math>q</math> in unary and <math>r</math> in binary (in posit terminology, <math>q</math> is the ''regime''). Remainder encoding size is defined by the ''exponent scale'' <math>s</math>, where <math>2^s</math> is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by <math>(N, s)</math>, where <math>N</math> is the word length in bits and <math>s</math> is the exponent scale. The minimum and maximum positive finite numbers in <math>(N, s)</math> are <math>f_\mathrm{min} = 2^{−(N−2)2^s}</math> and <math>f_\mathrm{max} = 2^{(N−2)2^s}</math>. The number line is represented much as the projective reals, with a single point at <math>\pm\infty</math> bounding <math>−f_\mathrm{max}</math> and <math>f_\mathrm{max}</math>. <math>\pm\infty</math> and 0 have special encodings; there is no <tt>NaN</tt>. The number system allows any choice of <math>N\ge 3</math> and <math>0\le s\le N − 3</math>.
:<math>s</math> controls the dynamic range achievable; e.g., 8-bit (8, 5)-posit <math>f_\mathrm{max} = 2^{192}</math> is larger than <math>f_\mathrm{max}</math> in <tt>float32</tt>. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with <math>f_\mathrm{max}</math> of 64 and 4096 accordingly. Precision is maximized in the range <math>\pm\left[2^{−(s+1)}, 2^{s+1}\right)</math> with <math>N − 3 − s</math> significand fraction bits, tapering to no fraction bits at <math>\pm f_\mathrm{max}</math>.
:— Jeff Johnson, ''[https://arxiv.org/abs/1811.01721 Rethinking floating point for deep learning]'', Facebook research.



=={{header|raku}}==
=={{header|raku}}==

Revision as of 07:36, 18 September 2023

Posit numbers/decoding is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Posit is a quantization of the real projective line proposed by John Gustafson in 2015. It is claimed to be an improvement over IEEE 754.

The purpose of this task is to write a program capable of decoding a posit number. You will use the example provided by Gustafson in his paper : 0b0000110111011101, representing a 16-bit long real number with three bits for the exponent. Once decoded, you should obtain either the fraction 477/134217728 or the floating point value 3.55393E−6.

Jeff Johnson from Facebook research, described posit numbers as such:

A more efficient representation for tapered floating points is the recent posit format by Gustafson. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code, with the exponent encoded as a Golomb-Rice quotient and remainder with in unary and in binary (in posit terminology, is the regime). Remainder encoding size is defined by the exponent scale , where is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by , where is the word length in bits and is the exponent scale. The minimum and maximum positive finite numbers in are Failed to parse (syntax error): {\displaystyle f_\mathrm{min} = 2^{−(N−2)2^s}} and Failed to parse (syntax error): {\displaystyle f_\mathrm{max} = 2^{(N−2)2^s}} . The number line is represented much as the projective reals, with a single point at bounding Failed to parse (syntax error): {\displaystyle −f_\mathrm{max}} and . and 0 have special encodings; there is no NaN. The number system allows any choice of and Failed to parse (syntax error): {\displaystyle 0\le s\le N − 3} .
controls the dynamic range achievable; e.g., 8-bit (8, 5)-posit is larger than in float32. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with of 64 and 4096 accordingly. Precision is maximized in the range Failed to parse (syntax error): {\displaystyle \pm\left[2^{−(s+1)}, 2^{s+1}\right)} with Failed to parse (syntax error): {\displaystyle N − 3 − s} significand fraction bits, tapering to no fraction bits at .
— Jeff Johnson, Rethinking floating point for deep learning, Facebook research.


raku

unit role Posit[UInt $nbits, UInt $es];

has Bool @.bits[$nbits];

method Str { sprintf('%0b' x $nbits, @!bits) }
sub useed { 2**(2**$es) }

sub two-complement(Str $n where /^<[01]>+$/) {
  (
   (
    $n
    .trans("01" => "10")
    .parse-base(2)
    + 1
   ) +& (2**$n.chars - 1)
  ).polymod(2 xx $n.chars - 1)
  .reverse
  .join
}

method Real {
  return 0 unless @!bits.any;
  return Inf if self ~~ /^10*$/;
  my $sign = @!bits.head ?? -1 !! +1;
  $sign *
    grammar {
      token TOP { ^ <regime> <exponent>? <fraction>? $ }
      token regime { [ 1+ 0? ] | [ 0+ 1? ] }
      token exponent { <.bit> ** {1..$es} }
      token fraction { <.bit>+ }
      token bit { <[01]> }
    }.parse(
      ($sign > 0 ?? {$_} !! &two-complement)(self.Str.substr(1)),
      actions => class {
        method TOP($/) {
          make $<regime>.made *
            ($<exponent> ?? $<exponent>.made !! 1) *
            ($<fraction> ?? $<fraction>.made !! 1);
        }
        method regime($/) {
          my $first-bit = $/.Str.substr(0,1);
          my $m = $/.comb.Bag{$first-bit};
          make useed**($first-bit eq '1' ?? $m - 1 !! -$m);
        }
        method exponent($/) { make 2**($/.Str.parse-base: 2); }
        method fraction($/) {
          make reduce { $^a + $^b / ($*=2.FatRat) }, 1, |$/.comb;
        }
      }
    )
    .made
}

CHECK {
  use Test;
  # example from L<http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf>
  is Posit[16, 3]
    .new(bits => '0000110111011101'.comb.map({.Int.Bool})).Real.nude,
    (477, 134217728);
}
Output:
ok 1 -