Jump to content

UTF-8 encode and decode

From Rosetta Code
Task
UTF-8 encode and decode
You are encouraged to solve this task according to the task description, using any language you may know.

As described in UTF-8 and in Wikipedia, UTF-8 is a popular encoding of (multi-byte) Unicode code-points into eight-bit octets.

The goal of this task is to write a encoder that takes a unicode code-point (an integer representing a unicode character) and returns a sequence of 1–4 bytes representing that character in the UTF-8 encoding.

Then you have to write the corresponding decoder that takes a sequence of 1–4 UTF-8 encoded bytes and return the corresponding unicode character.

Demonstrate the functionality of your encoder and decoder on the following five characters:

Character   Name                                  Unicode    UTF-8 encoding (hex)
---------------------------------------------------------------------------------
A           LATIN CAPITAL LETTER A                U+0041     41
ö           LATIN SMALL LETTER O WITH DIAERESIS   U+00F6     C3 B6
Ж           CYRILLIC CAPITAL LETTER ZHE           U+0416     D0 96
€           EURO SIGN                             U+20AC     E2 82 AC
𝄞           MUSICAL SYMBOL G CLEF                 U+1D11E    F0 9D 84 9E

Provided below is a reference implementation in Common Lisp.

11l

Translation of: Python
F unicode_code(ch)
   R ‘U+’hex(ch.code).zfill(4)

F utf8hex(ch)
   R ch.encode(‘utf-8’).map(c -> hex(c)).join(‘ ’)

print(‘#<11 #<15 #<15’.format(‘Character’, ‘Unicode’, ‘UTF-8 encoding (hex)’))
V chars = [‘A’, ‘ö’, ‘Ж’, ‘€’]
L(char) chars
   print(‘#<11 #<15 #<15’.format(char, unicode_code(char), utf8hex(char)))
Output:
Character   Unicode         UTF-8 encoding (hex)
A           U+0041          41             
ö           U+00F6          C3 B6          
Ж           U+0416          D0 96          
€           U+20AC          E2 82 AC       

8th

hex \ so bytes print nicely

[
  "\u0041", 
  "\u00F6",
  "\u0416",
  "\u20AC"
] 
\ add the 0x1D11E one; the '\u' string notation requires four hex digits
  "" 1D11E s:+ a:push

\ for each test, print it out and its bytes:
(
  dup . space
  b:new 
  ( . space drop ) b:each 
  cr
) a:each! drop

cr
\ now the inverse:
[
  [41],
  [C3,B6],
  [D0,96],
  [E2,82,AC],
  [$F0,9D,84,9E]
]

(
  dup . space
  b:new >s  .  cr
) a:each! drop

bye

Output:

A 41 
ö c3 b6 
Ж d0 96 
€ e2 82 ac 
𝄞 f0 9d 84 9e 

[41] A
[c3,b6] ö
[d0,96] Ж
[e2,82,ac] €
[f0,9d,84,9e] 𝄞

Action!

TYPE Unicode=[BYTE bc1,bc2,bc3]
BYTE ARRAY hex=['0 '1 '2 '3 '4 '5 '6 '7 '8 '9 'A 'B 'C 'D 'E 'F]

BYTE FUNC DecodeHex(CHAR c)
  BYTE i

  FOR i=0 TO 15
  DO
    IF c=hex(i) THEN
      RETURN (i)
    FI
  OD
  Break()
RETURN (255)

BYTE FUNC DecodeHex2(CHAR c1,c2)
  BYTE h1,h2,res

  h1=DecodeHex(c1)
  h2=DecodeHex(c2)
  res=(h1 LSH 4)%h2
RETURN (res)

PROC ValUnicode(CHAR ARRAY s Unicode POINTER u)
  BYTE i,len

  len=s(0)
  IF len<6 AND len>8 THEN Break() FI
  IF s(1)#'U OR s(2)#'+ THEN Break() FI
  
  IF len=6 THEN
    u.bc1=0
  ELSEIF len=7 THEN
    u.bc1=DecodeHex(s(3))
    IF u.bc1>$10 THEN Break() FI
  ELSE
    u.bc1=DecodeHex2(s(3),s(4))
  FI
  u.bc2=DecodeHex2(s(len-3),s(len-2))
  u.bc3=DecodeHex2(s(len-1),s(len))
RETURN

PROC PrintHex2(BYTE x)
  Put(hex(x RSH 4))
  Put(hex(x&$0F))
RETURN

PROC StrUnicode(Unicode POINTER u)
  Print("U+")
  IF u.bc1>$F THEN
    PrintHex2(u.bc1)
  ELSEIF u.bc1>0 THEN
    Put(hex(u.bc1))
  FI
  PrintHex2(u.bc2)
  PrintHex2(u.bc3)
RETURN

PROC PrintArray(BYTE ARRAY a BYTE len)
  BYTE i

  Put('[)
  FOR i=0 TO len-1
  DO
    IF i>0 THEN Put(32 )FI
    PrintHex2(a(i))
  OD
  Put('])
RETURN

PROC Encode(Unicode POINTER u BYTE ARRAY buf BYTE POINTER len)
  IF u.bc1>0 THEN
    len^=4
    buf(0)=$F0 % (u.bc1 RSH 2)
    buf(1)=$80 % ((u.bc1 & $03) LSH 4) % (u.bc2 RSH 4)
    buf(2)=$80 % ((u.bc2 & $0F) LSH 2) % (u.bc3 RSH 6)
    buf(3)=$80 % (u.bc3 & $3F)
  ELSEIF u.bc2>=$08 THEN
    len^=3
    buf(0)=$E0 % (u.bc2 RSH 4)
    buf(1)=$80 % ((u.bc2 & $0F) LSH 2) % (u.bc3 RSH 6)
    buf(2)=$80 % (u.bc3 & $3F)
  ELSEIF u.bc2>0 OR u.bc3>=$80 THEN
    len^=2
    buf(0)=$C0 % (u.bc2 LSH 2) % (u.bc3 RSH 6)
    buf(1)=$80 % (u.bc3 & $3F)
  ELSE
    len^=1
    buf(0)=u.bc3
  FI
RETURN

PROC Decode(BYTE ARRAY buf BYTE len Unicode POINTER u)
  IF len=1 THEN
    u.bc1=0
    u.bc2=0
    u.bc3=buf(0)
  ELSEIF len=2 THEN
    u.bc1=0
    u.bc2=(buf(0) & $1F) RSH 2
    u.bc3=(buf(0) LSH 6) % (buf(1) & $3F)
  ELSEIF len=3 THEN
    u.bc1=0
    u.bc2=(buf(0) LSH 4) % ((buf(1) & $3F) RSH 2)
    u.bc3=(buf(1) LSH 6) % (buf(2) & $3F)
  ELSEIF len=4 THEN
    u.bc1=((buf(0) & $07) LSH 2) % ((buf(1) & $3F) RSH 4)
    u.bc2=(buf(1) LSH 4) % ((buf(2) & $3F) RSH 2)
    u.bc3=((buf(2) & $03) LSH 6) % (buf(3) & $3F)
  ELSE
    Break()
  FI
RETURN

PROC Main()
  DEFINE PTR="CARD"
  DEFINE COUNT="11"
  PTR ARRAY case(COUNT)
  Unicode uni,res
  BYTE ARRAY buf(4)
  BYTE i,len

  case(0)="U+0041"
  case(1)="U+00F6"
  case(2)="U+0416"
  case(3)="U+20AC"
  case(4)="U+1D11E"
  case(5)="U+0024"
  case(6)="U+00A2"
  case(7)="U+0939"
  case(8)="U+20AC"
  case(9)="U+D55C"
  case(10)="U+10348"

  FOR i=0 TO COUNT-1
  DO
    IF i=0 THEN
      PrintE("From RosettaCode:")
    ELSEIF i=5 THEN
      PutE() PrintE("From Wikipedia:")
    FI
    ValUnicode(case(i),uni)
    Encode(uni,buf,@len)
    Decode(buf,len,res)

    StrUnicode(uni) Print(" -> ")
    PrintArray(buf,len) Print(" -> ")
    StrUnicode(res) PutE()
  OD
RETURN
Output:

Screenshot from Atari 8-bit computer

From RosettaCode:
U+0041 -> [41] -> U+0041
U+00F6 -> [C3 B6] -> U+00F6
U+0416 -> [D0 96] -> U+0416
U+20AC -> [E2 82 AC] -> U+20AC
U+1D11E -> [F0 9D 84 9E] -> U+1D11E

From Wikipedia:
U+0024 -> [24] -> U+0024
U+00A2 -> [C2 A2] -> U+00A2
U+0939 -> [E0 A4 B9] -> U+0939
U+20AC -> [E2 82 AC] -> U+20AC
U+D55C -> [ED 95 9C] -> U+D55C
U+10348 -> [F0 90 8D 88] -> U+10348

Ada

Works with: Ada version 2012
with Ada.Strings.Fixed; use Ada.Strings.Fixed;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Integer_Text_IO;
with Ada.Text_IO;
with Ada.Wide_Wide_Text_IO;

procedure UTF8_Encode_And_Decode
is
   package TIO renames Ada.Text_IO;
   package WWTIO renames Ada.Wide_Wide_Text_IO;
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   function To_Hex
     (i : in Integer;
      width : in Natural := 0;
      fill : in Character := '0') return String
   is
      holder : String(1 .. 20);
   begin
      Ada.Integer_Text_IO.Put(holder, i, 16);
      declare
         hex : constant String := holder(Index(holder, "#")+1 .. holder'Last-1);
         filled : String := Natural'Max(width, hex'Length) * fill;
      begin
         filled(filled'Last - hex'Length + 1 .. filled'Last) := hex;
         return filled;
      end;
   end To_Hex;

   input : constant Wide_Wide_String := "AöЖ€𝄞";
begin
   TIO.Put_Line("Character   Unicode    UTF-8 encoding (hex)");
   TIO.Put_Line(43 * '-');
   for WWC of input loop
      WWTIO.Put(WWC & "           ");
      declare
         filled : String := 11 * ' ';
         unicode : constant String := "U+" & To_Hex(Wide_Wide_Character'Pos(WWC), width => 4);
         utf8_string : constant String := WWS.Encode((1 => WWC));
      begin
         filled(filled'First .. filled'First + unicode'Length - 1) := unicode;
         TIO.Put(filled);
         for C of utf8_string loop
            TIO.Put(To_Hex(Character'Pos(C)) & " ");
         end loop;
         TIO.New_Line;
      end;
   end loop;
end UTF8_Encode_And_Decode;
Output:
Character   Unicode    UTF-8 encoding (hex)
-------------------------------------------
A           U+0041     41 
ö           U+00F6     C3 B6 
Ж           U+0416     D0 96 
€           U+20AC     E2 82 AC 
𝄞           U+1D11E    F0 9D 84 9E 

ATS

The following code is long but consists largely of proofs. UTF-8 is a complicated encoding, but also a well defined one that lends itself to compile-time verification methods.

Despite this complexity, what actually gets generated is highly optimizable C code. Note that the following demonstration requires no ATS-specific library support whatsoever.

(*

  UTF-8 encoding and decoding in ATS2.

  This is adapted from library code I wrote long ago and actually
  handles the original 1-to-6-byte encoding, as well. Valid Unicode
  requires only 1 to 4 bytes.

  What is remarkable about the following rather lengthy code is its
  use of proofs of what is and what is not valid UTF-8. Much of what
  follows is proofs rather than executable code. It seemed simpler to
  "mostly copy" my old code, than to try to pare that code down to a
  minimum that still demonstrated some of what makes ATS different
  from all but a few (if any) other languages.

*)

#define ATS_EXTERN_PREFIX "utf8_encoding_"

#include "share/atspre_define.hats"
#include "share/atspre_staload.hats"

(* A variant of ‘andalso’, with dependent types. *)
fn {}
andalso1 {b1, b2 : bool} (b1 : bool b1, b2 : bool b2) :<>
    [b3 : bool | b3 == (b1 && b2)] bool b3 =
  if b1 then
    b2
  else
    false

infixl (&&) &&&
macdef &&& = andalso1

(*###################### C CODE ####################################*)

%{^

_Static_assert (4 <= sizeof (int),
                "sizeof(int) must equal at least 4");

#define utf8_encoding_2_entries(v)   \
  (v), (v)

#define utf8_encoding_4_entries(v)   \
  utf8_encoding_2_entries (v),       \
    utf8_encoding_2_entries (v)

#define utf8_encoding_8_entries(v)   \
  utf8_encoding_4_entries (v),       \
    utf8_encoding_4_entries (v)

#define utf8_encoding_16_entries(v)  \
  utf8_encoding_8_entries (v),       \
    utf8_encoding_8_entries (v)

#define utf8_encoding_32_entries(v)  \
  utf8_encoding_16_entries (v),      \
    utf8_encoding_16_entries (v)

#define utf8_encoding_64_entries(v)  \
  utf8_encoding_32_entries (v),      \
    utf8_encoding_32_entries (v)

#define utf8_encoding_128_entries(v) \
  utf8_encoding_64_entries (v),      \
    utf8_encoding_64_entries (v)

static const atstype_int8 utf8_encoding_extended_utf8_lengths__[256] = {
  utf8_encoding_128_entries (1),
  utf8_encoding_64_entries (-1),
  utf8_encoding_32_entries (2),
  utf8_encoding_16_entries (3),
  utf8_encoding_8_entries (4),
  utf8_encoding_4_entries (5),
  utf8_encoding_2_entries (6),
  utf8_encoding_2_entries (-1)
};

static const atstype_int8 utf8_encoding_utf8_lengths__[256] = {
  utf8_encoding_128_entries (1),
  utf8_encoding_64_entries (-1),
  utf8_encoding_32_entries (2),
  utf8_encoding_16_entries (3),
  utf8_encoding_8_entries (4),
  utf8_encoding_4_entries (-1),
  utf8_encoding_2_entries (-1),
  utf8_encoding_2_entries (-1)
};

#define utf8_encoding_extended_utf8_character_length(c)      \
  (utf8_encoding_extended_utf8_lengths__[(atstype_uint8) (c)])

#define utf8_encoding_utf8_character_length(c)               \
  (utf8_encoding_utf8_lengths__[(atstype_uint8) (c)])

%}

(*###################### INTERFACE #################################*)

stadef
is_valid_unicode_code_point (u : int) : bool =
  (0x0 <= u && u < 0xD800) || (0xE000 <= u && u <= 0x10FFFF)

extern fun {}
is_valid_unicode_code_point :
  {u : int} int u -<>
    [b : bool | b == is_valid_unicode_code_point u]
    bool b

(*------------------------------------------------------------------*)

stadef
is_extended_utf8_1byte_first_byte (c0 : int) : bool =
  0x00 <= c0 && c0 <= 0x7F

stadef
is_extended_utf8_2byte_first_byte (c0 : int) : bool =
  0xC0 <= c0 && c0 <= 0xDF

stadef
is_extended_utf8_3byte_first_byte (c0 : int) : bool =
  0xE0 <= c0 && c0 <= 0xEF

stadef
is_extended_utf8_4byte_first_byte (c0 : int) : bool =
  0xF0 <= c0 && c0 <= 0xF7

stadef
is_extended_utf8_5byte_first_byte (c0 : int) : bool =
  0xF8 <= c0 && c0 <= 0xFB

stadef
is_extended_utf8_6byte_first_byte (c0 : int) : bool =
  0xFC <= c0 && c0 <= 0xFD

stadef
extended_utf8_character_length (c0 : int) : int =
  ifint (is_extended_utf8_1byte_first_byte c0, 1,
    ifint (is_extended_utf8_2byte_first_byte c0, 2,
      ifint (is_extended_utf8_3byte_first_byte c0, 3,
        ifint (is_extended_utf8_4byte_first_byte c0, 4,
          ifint (is_extended_utf8_5byte_first_byte c0, 5,
            ifint (is_extended_utf8_6byte_first_byte c0, 6, ~1))))))

stadef
extended_utf8_char_length_relation (c0 : int, n : int) : bool =
  (n == 1 && is_extended_utf8_1byte_first_byte c0) ||
  (n == 2 && is_extended_utf8_2byte_first_byte c0) ||
  (n == 3 && is_extended_utf8_3byte_first_byte c0) ||
  (n == 4 && is_extended_utf8_4byte_first_byte c0) ||
  (n == 5 && is_extended_utf8_5byte_first_byte c0) ||
  (n == 6 && is_extended_utf8_6byte_first_byte c0)

extern prfun
extended_utf8_char_length_relation_to_length :
  {c0 : int}
  {n : int | extended_utf8_char_length_relation (c0, n) ||
             (n == ~1 && ((c0 < 0x00) ||
                          (0x7F < c0 && c0 < 0xC0) ||
                          (0xFD < c0)))}
  () -<prf> [n == extended_utf8_character_length c0] void

extern prfun
extended_utf8_char_length_to_length_relation :
  {c0 : int}
  {n : int | n == extended_utf8_character_length c0}
  () -<prf>
    [extended_utf8_char_length_relation (c0, n) ||
     (n == ~1 && ((c0 < 0x00) || (0x7F < c0 && c0 < 0xC0) || (0xFD < c0)))]
    void

// extended_utf8_character_length:
//
// Return value = 1, 2, 3, or 4 indicates that there may be a valid
// UTF-8 character, of the given length. It may also be a `valid'
// overlong sequence. Otherwise there definitely is not a valid
// character of any sort starting with the given byte. Return
// value = 5 or 6 indicates the possible start of an `extended
// UTF-8' character of the given length, including code points up
// to 0xffffffff. Return value = ~1 means the byte is not the
// start of a valid sequence.
extern fun
extended_utf8_character_length :
  {c0 : int | 0x00 <= c0; c0 <= 0xFF} int c0 -<>
    [n : int | n == extended_utf8_character_length c0] int n = "mac#%"

stadef
utf8_character_length (c0 : int) : int =
  ifint (is_extended_utf8_1byte_first_byte c0, 1,
    ifint (is_extended_utf8_2byte_first_byte c0, 2,
      ifint (is_extended_utf8_3byte_first_byte c0, 3,
        ifint (is_extended_utf8_4byte_first_byte c0, 4, ~1))))

stadef
utf8_char_length_relation (c0 : int, n : int) : bool =
  (n == 1 && is_extended_utf8_1byte_first_byte c0) ||
  (n == 2 && is_extended_utf8_2byte_first_byte c0) ||
  (n == 3 && is_extended_utf8_3byte_first_byte c0) ||
  (n == 4 && is_extended_utf8_4byte_first_byte c0)

extern prfun
utf8_char_length_relation_to_length :
  {c0 : int}
  {n : int | utf8_char_length_relation (c0, n) ||
             (n == ~1 && ((c0 < 0x00) ||
                          (0x7F < c0 && c0 < 0xC0) ||
                          (0xF7 < c0)))}
  () -<prf> [n == utf8_character_length c0] void

extern prfun
utf8_char_length_to_length_relation :
  {c0 : int}
  {n : int | n == utf8_character_length c0}
  () -<prf>
    [utf8_char_length_relation (c0, n) ||
     (n == ~1 && ((c0 < 0x00) || (0x7F < c0 && c0 < 0xC0) || (0xF7 < c0)))]
    void

extern fun
utf8_character_length :
  {c0 : int | 0x00 <= c0; c0 <= 0xFF} int c0 -<>
    [n : int | n == utf8_character_length c0] int n = "mac#%"

(*------------------------------------------------------------------*)

stadef
is_valid_utf8_continuation_byte (c : int) : bool =
  0x80 <= c && c <= 0xBF

stadef
is_invalid_utf8_continuation_byte (c : int) : bool =
  c < 0x80 || 0xBF < c

extern fun {}
is_valid_utf8_continuation_byte :
  {c : int} int c -<>
    [b : bool | b == is_valid_utf8_continuation_byte c] bool b

(*------------------------------------------------------------------*)

stadef
extended_utf8_char_1byte_decoding (c0 : int) : int =
  c0

stadef
extended_utf8_char_2byte_decoding (c0 : int, c1 : int) : int =
  64 * (c0 - 0xC0) + (c1 - 0x80)

stadef
extended_utf8_char_3byte_decoding (c0 : int, c1 : int, c2 : int) : int =
  64 * 64 * (c0 - 0xE0) + 64 * (c1 - 0x80) + (c2 - 0x80)

stadef
extended_utf8_char_4byte_decoding (c0 : int, c1 : int, c2 : int,
                                   c3 : int) : int =
  64 * 64 * 64 * (c0 - 0xF0) + 64 * 64 * (c1 - 0x80) +
    64 * (c2 - 0x80) + (c3 - 0x80)

stadef
extended_utf8_char_5byte_decoding (c0 : int, c1 : int, c2 : int,
                                   c3 : int, c4 : int) : int =
  64 * 64 * 64 * 64 * (c0 - 0xF8) + 64 * 64 * 64 * (c1 - 0x80) +
    64 * 64 * (c2 - 0x80) + 64 * (c3 - 0x80) + (c4 - 0x80)

stadef
extended_utf8_char_6byte_decoding (c0 : int, c1 : int, c2 : int,
                                   c3 : int, c4 : int, c5 : int) : int =
  64 * 64 * 64 * 64 * 64 * (c0 - 0xFC) + 64 * 64 * 64 * 64 * (c1 - 0x80) +
    64 * 64 * 64 * (c2 - 0x80) + 64 * 64 * (c3 - 0x80) +
       64 * (c4 - 0x80) + (c5 - 0x80)

dataprop EXTENDED_UTF8_CHAR (length : int, u : int, c0 : int, c1 : int,
                             c2 : int, c3 : int, c4 : int, c5 : int) =
  | {u, c0 : int |
     0 <= u; u <= 0x7F;
     is_extended_utf8_1byte_first_byte c0;
     u == extended_utf8_char_1byte_decoding (c0)}
    EXTENDED_UTF8_CHAR_1byte (1, u, c0, ~1, ~1, ~1, ~1, ~1)
    //
  | {u, c0, c1 : int |
     0 <= u; u <= 0x7FF;
     is_extended_utf8_2byte_first_byte c0;
     is_valid_utf8_continuation_byte c1;
     u == extended_utf8_char_2byte_decoding (c0, c1)}
    EXTENDED_UTF8_CHAR_2byte (2, u, c0, c1, ~1, ~1, ~1, ~1)
    //
  | {u, c0, c1, c2 : int |
     0 <= u; u <= 0xFFFF;
     is_extended_utf8_3byte_first_byte c0;
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2;
     u == extended_utf8_char_3byte_decoding (c0, c1, c2)}
    EXTENDED_UTF8_CHAR_3byte (3, u, c0, c1, c2, ~1, ~1, ~1)
    //
  | {u, c0, c1, c2, c3 : int |
     0 <= u; u <= 0x1FFFFF;
     is_extended_utf8_4byte_first_byte c0;
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2;
     is_valid_utf8_continuation_byte c3;
     u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)}
    EXTENDED_UTF8_CHAR_4byte (4, u, c0, c1, c2, c3, ~1, ~1)
    //
  | {u, c0, c1, c2, c3, c4 : int |
     0 <= u; u <= 0x3FFFFFF;
     is_extended_utf8_5byte_first_byte c0;
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2;
     is_valid_utf8_continuation_byte c3;
     is_valid_utf8_continuation_byte c4;
     u == extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4)}
    EXTENDED_UTF8_CHAR_5byte (5, u, c0, c1, c2, c3, c4, ~1)
    //
  | {u, c0, c1, c2, c3, c4, c5 : int |
     0 <= u; u <= 0x7FFFFFFF;
     is_extended_utf8_6byte_first_byte c0;
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2;
     is_valid_utf8_continuation_byte c3;
     is_valid_utf8_continuation_byte c4;
     is_valid_utf8_continuation_byte c5;
     u == extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5)}
    EXTENDED_UTF8_CHAR_6byte (6, u, c0, c1, c2, c3, c4, c5)

extern prfun
decode_extended_utf8_istot :
  {n : int | 1 <= n; n <= 6}
  {c0, c1, c2, c3, c4, c5 : int |
   extended_utf8_char_length_relation (c0, n);
   n <= 1 || is_valid_utf8_continuation_byte c1;
   n <= 2 || is_valid_utf8_continuation_byte c2;
   n <= 3 || is_valid_utf8_continuation_byte c3;
   n <= 4 || is_valid_utf8_continuation_byte c4;
   n <= 5 || is_valid_utf8_continuation_byte c5;
   1 < n || c1 == ~1;
   2 < n || c2 == ~1;
   3 < n || c3 == ~1;
   4 < n || c4 == ~1;
   5 < n || c5 == ~1}
   () -<prf> [u : int] EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5)

extern prfun
decode_extended_utf8_isfun :
  {na : int | 1 <= na; na <= 6}
  {ua : int}
  {c0a, c1a, c2a, c3a, c4a, c5a : int}
  {nb : int | nb == na}
  {ub : int}
  {c0b, c1b, c2b, c3b, c4b, c5b : int |
   c0b == c0a;
   c1b == c1a;
   c2b == c2a;
   c3b == c3a;
   c4b == c4a;
   c5b == c5a}
  (EXTENDED_UTF8_CHAR (na, ua, c0a, c1a, c2a, c3a, c4a, c5a),
   EXTENDED_UTF8_CHAR (nb, ub, c0b, c1b, c2b, c3b, c4b, c5b)) -<prf>
    [ua == ub] void

extern prfun
lemma_extended_utf8_char_length :
  {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int}
  EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) -<prf>
    [n == extended_utf8_character_length c0] void

extern fun {}
decode_extended_utf8_1byte :
  {c0 : int | 0x00 <= c0; c0 <= 0x7F} int c0 -<>
    [u : int] (EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1) | int u)

extern fun {}
decode_extended_utf8_2byte :
  {c0, c1 : int | 0xC0 <= c0; c0 <= 0xDF;
                  is_valid_utf8_continuation_byte c1}
  (int c0, int c1) -<>
    [u : int] (EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1) | int u)

extern fun {}
decode_extended_utf8_3byte :
  {c0, c1, c2 : int | 0xE0 <= c0; c0 <= 0xEF;
                      is_valid_utf8_continuation_byte c1;
                      is_valid_utf8_continuation_byte c2}
  (int c0, int c1, int c2) -<>
    [u : int] (EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1) | int u)

extern fun {}
decode_extended_utf8_4byte :
  {c0, c1, c2, c3 : int | 0xF0 <= c0; c0 <= 0xF7;
                          is_valid_utf8_continuation_byte c1;
                          is_valid_utf8_continuation_byte c2;
                          is_valid_utf8_continuation_byte c3}
  (int c0, int c1, int c2, int c3) -<>
    [u : int] (EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1) | int u)

extern fun {}
decode_extended_utf8_5byte :
  {c0, c1, c2, c3, c4 : int | 0xF8 <= c0; c0 <= 0xFB;
                              is_valid_utf8_continuation_byte c1;
                              is_valid_utf8_continuation_byte c2;
                              is_valid_utf8_continuation_byte c3;
                              is_valid_utf8_continuation_byte c4}
  (int c0, int c1, int c2, int c3, int c4) -<>
    [u : int] (EXTENDED_UTF8_CHAR (5, u, c0, c1, c2, c3, c4, ~1) | int u)

extern fun {}
decode_extended_utf8_6byte :
  {c0, c1, c2, c3, c4, c5 : int | 0xFC <= c0; c0 <= 0xFD;
                                  is_valid_utf8_continuation_byte c1;
                                  is_valid_utf8_continuation_byte c2;
                                  is_valid_utf8_continuation_byte c3;
                                  is_valid_utf8_continuation_byte c4;
                                  is_valid_utf8_continuation_byte c5}
  (int c0, int c1, int c2, int c3, int c4, int c5) -<>
    [u : int]
    (EXTENDED_UTF8_CHAR (6, u, c0, c1, c2, c3, c4, c5) | int u)

(*------------------------------------------------------------------*)

stadef extended_utf8_shortest_length (u : int) =
  ifint (u < 0, ~1,
    ifint (u <= 0x7F, 1,
      ifint (u <= 0x7FF, 2,
        ifint (u <= 0xFFFF, 3,
          ifint (u <= 0x1FFFFF, 4,
            ifint (u <= 0x3FFFFFF, 5,
              ifint (u <= 0x7FFFFFFF, 6, ~1)))))))

dataprop EXTENDED_UTF8_SHORTEST (length : int, u : int, c0 : int, c1 : int,
                                 c2 : int, c3 : int, c4 : int, c5 : int) =
  | {u, c0 : int |
     0 <= u; u <= 0x7F;
     c0 == u}
    EXTENDED_UTF8_SHORTEST_1byte (1, u, c0, ~1, ~1, ~1, ~1, ~1) of
      EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1)
    //
  | {u, c0, c1 : int |
     0x7F < u; u <= 0x7FF;
     c0 == 0xC0 + (u \ndiv 64);
     c1 == 0x80 + (u \nmod 64)}
    EXTENDED_UTF8_SHORTEST_2byte (2, u, c0, c1, ~1, ~1, ~1, ~1) of
      EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1)
    //
  | {u, c0, c1, c2 : int |
     0x7FF < u; u <= 0xFFFF;
     c0 == 0xE0 + (u \ndiv (64 * 64));
     c1 == 0x80 + ((u \ndiv 64) \nmod 64);
     c2 == 0x80 + (u \nmod 64)}
    EXTENDED_UTF8_SHORTEST_3byte (3, u, c0, c1, c2, ~1, ~1, ~1) of
      EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1)
    //
  | {u, c0, c1, c2, c3 : int |
     0xFFFF < u; u <= 0x1FFFFF;
     c0 == 0xF0 + (u \ndiv (64 * 64 * 64));
     c1 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64);
     c2 == 0x80 + ((u \ndiv 64) \nmod 64);
     c3 == 0x80 + (u \nmod 64)}
    EXTENDED_UTF8_SHORTEST_4byte (4, u, c0, c1, c2, c3, ~1, ~1) of
      EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1)
    //
  | {u, c0, c1, c2, c3, c4 : int |
     0x1FFFFF < u; u <= 0x3FFFFFF;
     c0 == 0xF8 + (u \ndiv (64 * 64 * 64 * 64));
     c1 == 0x80 + ((u \ndiv (64 * 64 * 64)) \nmod 64);
     c2 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64);
     c3 == 0x80 + ((u \ndiv 64) \nmod 64);
     c4 == 0x80 + (u \nmod 64)}
    EXTENDED_UTF8_SHORTEST_5byte (5, u, c0, c1, c2, c3, c4, ~1) of
      EXTENDED_UTF8_CHAR (5, u, c0, c1, c2, c3, c4, ~1)
    //
  | {u, c0, c1, c2, c3, c4, c5 : int |
     0x3FFFFFF < u; u <= 0x7FFFFFFF;
     c0 == 0xFC + (u \ndiv (64 * 64 * 64 * 64 * 64));
     c1 == 0x80 + ((u \ndiv (64 * 64 * 64 * 64)) \nmod 64);
     c2 == 0x80 + ((u \ndiv (64 * 64 * 64)) \nmod 64);
     c3 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64);
     c4 == 0x80 + ((u \ndiv 64) \nmod 64);
     c5 == 0x80 + (u \nmod 64)}
    EXTENDED_UTF8_SHORTEST_6byte (6, u, c0, c1, c2, c3, c4, c5) of
      EXTENDED_UTF8_CHAR (6, u, c0, c1, c2, c3, c4, c5)

extern prfun
extended_utf8_shortest_is_char :
  {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int}
  EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) -<prf>
    EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5)

extern prfun
lemma_extended_utf8_shortest_length :
  {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int}
  EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) -<prf>
    [n == extended_utf8_shortest_length u] void

extern fun {}
encode_extended_utf8_character :
  {u : nat | u <= 0x7FFFFFFF}
  int u -<>
    [n : int] [c0, c1, c2, c3, c4, c5 : int]
    @(EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) |
      int n, int c0, int c1, int c2, int c3, int c4, int c5)

(*------------------------------------------------------------------*)
//
// A valid UTF-8 character is one that encodes a valid Unicode
// code point and is not overlong.
//

dataprop UTF8_CHAR_INVALID_CASES (c0 : int, c1 : int, c2 : int, c3 : int) =
  // The cases are not mutually exclusive.
  | {c0, c1, c2, c3 : int | extended_utf8_character_length c0 == ~1}
    // We might not be using this case, presently (has that changed?),
    // but it is included for completeness.
    UTF8_CHAR_INVALID_bad_c0 (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int | 2 <= extended_utf8_character_length c0;
                            ~(is_valid_utf8_continuation_byte c1)}
    UTF8_CHAR_INVALID_bad_c1 (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int | 3 <= extended_utf8_character_length c0;
                            ~(is_valid_utf8_continuation_byte c2)}
    UTF8_CHAR_INVALID_bad_c2 (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int | extended_utf8_character_length c0 == 4;
                            ~(is_valid_utf8_continuation_byte c3)}
    UTF8_CHAR_INVALID_bad_c3 (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int |
     extended_utf8_character_length c0 == 2;
     extended_utf8_char_2byte_decoding (c0, c1) <= 0x7F}
    UTF8_CHAR_INVALID_invalid_2byte (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int |
     extended_utf8_character_length c0 == 3
       && (extended_utf8_char_3byte_decoding (c0, c1, c2) <= 0x7FF
             || ~(is_valid_unicode_code_point
                    (extended_utf8_char_3byte_decoding (c0, c1, c2))))}
    UTF8_CHAR_INVALID_invalid_3byte (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int |
     extended_utf8_character_length c0 == 4
       && (extended_utf8_char_4byte_decoding (c0, c1, c2, c3) <= 0xFFFF
             || ~(is_valid_unicode_code_point
                    (extended_utf8_char_4byte_decoding (c0, c1, c2, c3))))}
    UTF8_CHAR_INVALID_invalid_4byte (c0, c1, c2, c3)

dataprop UTF8_CHAR_VALID_BYTES (c0 : int, c1 : int, c2 : int, c3 : int) =
  | {c0 : int | 0 <= c0 && c0 <= 0x7F}
    UTF8_CHAR_VALID_BYTES_1byte (c0, ~1, ~1, ~1)
  | {c0, c1 : int | 0xC2 <= c0 && c0 <= 0xDF;
                    is_valid_utf8_continuation_byte c1}
    UTF8_CHAR_VALID_BYTES_2byte (c0, c1, ~1, ~1)
  | {c0, c1, c2 : int | (0xE1 <= c0 && c0 <= 0xEC)
                          || c0 == 0xEE
                          || c0 == 0xEF
                          || (c0 == 0xE0 && 0xA0 <= c1)
                          || (c0 == 0xED && c1 < 0xA0);
                        is_valid_utf8_continuation_byte c1;
                        is_valid_utf8_continuation_byte c2}
    UTF8_CHAR_VALID_BYTES_3byte (c0, c1, c2, ~1)
  | {c0, c1, c2, c3 : int | (0xF1 <= c0 && c0 <= 0xF3)
                              || (c0 == 0xF0 && 0x90 <= c1)
                              || (c0 == 0xF4 && c1 < 0x90);
                            is_valid_utf8_continuation_byte c1;
                            is_valid_utf8_continuation_byte c2;
                            is_valid_utf8_continuation_byte c3}
    UTF8_CHAR_VALID_BYTES_4byte (c0, c1, c2, c3)

dataprop UTF8_CHAR_INVALID_BYTES (c0 : int, c1 : int, c2 : int, c3 : int) =
  | // This should never occur.
    {c0 : int | is_extended_utf8_1byte_first_byte c0}
    UTF8_CHAR_INVALID_BYTES_1byte (c0, ~1, ~1, ~1)
  | {c0, c1 : int | is_extended_utf8_2byte_first_byte c0;
                    c0 == 0xC0 || c0 == 0xC1 || 
                    is_invalid_utf8_continuation_byte c1}
    UTF8_CHAR_INVALID_BYTES_2byte (c0, c1, ~1, ~1)
  | {c0, c1, c2 : int | is_extended_utf8_3byte_first_byte c0;
                        (c0 == 0xE0 && c1 < 0xA0) ||
                        (c0 == 0xED && 0xA0 <= c1) ||
                        is_invalid_utf8_continuation_byte c1 ||
                        is_invalid_utf8_continuation_byte c2}
    UTF8_CHAR_INVALID_BYTES_3byte (c0, c1, c2, ~1)
  | {c0, c1, c2, c3 : int | is_extended_utf8_4byte_first_byte c0;
                            0xF4 < c0 ||
                            (c0 == 0xF0 && c1 < 0x90) ||
                            (c0 == 0xF4 && 0x90 <= c1) ||
                            is_invalid_utf8_continuation_byte c1 ||
                            is_invalid_utf8_continuation_byte c2 ||
                            is_invalid_utf8_continuation_byte c3}
    UTF8_CHAR_INVALID_BYTES_4byte (c0, c1, c2, c3)
  | {c0, c1, c2, c3 : int | ~(is_extended_utf8_1byte_first_byte c0);
                            ~(is_extended_utf8_2byte_first_byte c0);
                            ~(is_extended_utf8_3byte_first_byte c0);
                            ~(is_extended_utf8_4byte_first_byte c0)}
    UTF8_CHAR_INVALID_BYTES_bad_c0 (c0, c1, c2, c3)

dataprop UTF8_CHAR_VALIDITY (n : int, u : int, c0 : int, c1 : int,
                             c2 : int, c3 : int, b : bool) =
  | {n : int} {u : int | is_valid_unicode_code_point u}
    {c0, c1, c2, c3 : int}
    UTF8_CHAR_valid (n, u, c0, c1, c2, c3, true) of
      (EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, ~1, ~1),
       UTF8_CHAR_VALID_BYTES (c0, c1, c2, c3))
  | {n : int} {u : int} {c0, c1, c2, c3 : int}
    UTF8_CHAR_invalid (n, u, c0, c1, c2, c3, false) of
      (UTF8_CHAR_INVALID_CASES (c0, c1, c2, c3),
       UTF8_CHAR_INVALID_BYTES (c0, c1, c2, c3))

propdef UTF8_CHAR_VALID (n : int, u : int, c0 : int, c1 : int,
                         c2 : int, c3 : int) =
  UTF8_CHAR_VALIDITY (n, u, c0, c1, c2, c3, true)

propdef UTF8_CHAR_INVALID (c0 : int, c1 : int, c2 : int, c3 : int) =
  [n : int] [u : int] UTF8_CHAR_VALIDITY (n, u, c0, c1, c2, c3, false)

extern prfun
utf8_char_valid_implies_shortest :
  {n : int} {u : int} {c0, c1, c2, c3 : int}
  UTF8_CHAR_VALID (n, u, c0, c1, c2, c3) -<prf>
    EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, ~1, ~1)

extern prfun
lemma_valid_utf8_character_1byte :
  {u, c0 : int |
   is_extended_utf8_1byte_first_byte c0;
   u == extended_utf8_char_1byte_decoding c0;
   0 <= u; u <= 0x7F}
  () -<prf> UTF8_CHAR_VALID (1, u, c0, ~1, ~1, ~1)

extern prfun
lemma_valid_utf8_character_2byte :
  {u, c0, c1 : int |
   is_extended_utf8_2byte_first_byte c0;
   is_valid_utf8_continuation_byte c1;
   u == extended_utf8_char_2byte_decoding (c0, c1);
   0x7F < u; u <= 0x7FF}
  () -<prf> UTF8_CHAR_VALID (2, u, c0, c1, ~1, ~1)

extern prfun
lemma_valid_utf8_character_3byte :
  {u, c0, c1, c2 : int |
   is_extended_utf8_3byte_first_byte c0;
   is_valid_utf8_continuation_byte c1;
   is_valid_utf8_continuation_byte c2;
   u == extended_utf8_char_3byte_decoding (c0, c1, c2);
   0x7FF < u; u <= 0xFFFF;
   //
   // Exclude the UTF-16 surrogate halves.
   //
   ~(0xD800 <= u && u < 0xE000)}
  () -<prf> UTF8_CHAR_VALID (3, u, c0, c1, c2, ~1)

extern prfun
lemma_valid_utf8_character_4byte :
  {u, c0, c1, c2, c3 : int |
   is_extended_utf8_4byte_first_byte c0;
   is_valid_utf8_continuation_byte c1;
   is_valid_utf8_continuation_byte c2;
   is_valid_utf8_continuation_byte c3;
   u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3);
   0xFFFF < u; u <= 0x10FFFF}
  () -<prf> UTF8_CHAR_VALID (4, u, c0, c1, c2, c3)

extern prfun
utf8_character_1byte_valid_bytes :
  // This does not really do anything, but is included
  // for completeness.
  {u, c0 : int | is_extended_utf8_1byte_first_byte c0}
  UTF8_CHAR_VALID (1, u, c0, ~1, ~1, ~1) -<prf> void

extern prfun
utf8_character_2byte_valid_bytes :
  {u, c0, c1 : int | is_extended_utf8_2byte_first_byte c0}
  UTF8_CHAR_VALID (2, u, c0, c1, ~1, ~1) -<prf>
    [0xC2 <= c0; c0 <= 0xDF;
     is_valid_utf8_continuation_byte c1]
    void

extern prfun
utf8_character_3byte_valid_bytes :
  {u, c0, c1, c2 : int | is_extended_utf8_3byte_first_byte c0}
  UTF8_CHAR_VALID (3, u, c0, c1, c2, ~1) -<prf>
    [(0xE1 <= c0 && c0 <= 0xEC)
        || c0 == 0xEE
        || c0 == 0xEF
        || (c0 == 0xE0 && 0xA0 <= c1)
        || (c0 == 0xED && c1 < 0xA0);
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2]
    void

extern prfun
utf8_character_4byte_valid_bytes :
  {u, c0, c1, c2, c3 : int | is_extended_utf8_4byte_first_byte c0}
  UTF8_CHAR_VALID (4, u, c0, c1, c2, c3) -<prf>
    [(0xF1 <= c0 && c0 <= 0xF3)
        || (c0 == 0xF0 && 0x90 <= c1)
        || (c0 == 0xF4 && c1 < 0x90);
     is_valid_utf8_continuation_byte c1;
     is_valid_utf8_continuation_byte c2;
     is_valid_utf8_continuation_byte c3]
    void

extern prfun
utf8_character_1byte_invalid_bytes :
  // This does not really do anything, but is included
  // for completeness.
  {c0 : int | is_extended_utf8_1byte_first_byte c0}
  UTF8_CHAR_INVALID (c0, ~1, ~1, ~1) -<prf> void

extern prfun
utf8_character_2byte_invalid_bytes :
  {c0, c1 : int | is_extended_utf8_2byte_first_byte c0}
  UTF8_CHAR_INVALID (c0, c1, ~1, ~1) -<prf>
    [c0 == 0xC0 || c0 == 0xc1 || is_invalid_utf8_continuation_byte c1]
    void

extern prfun
utf8_character_3byte_invalid_bytes :
  {c0, c1, c2 : int | is_extended_utf8_3byte_first_byte c0}
  UTF8_CHAR_INVALID (c0, c1, c2, ~1) -<prf>
    [(c0 == 0xE0 && c1 < 0xA0) ||
     (c0 == 0xED && 0xA0 <= c1) ||
     is_invalid_utf8_continuation_byte c1 ||
     is_invalid_utf8_continuation_byte c2]
    void

extern prfun
utf8_character_4byte_invalid_bytes :
  {c0, c1, c2, c3 : int | is_extended_utf8_4byte_first_byte c0}
  UTF8_CHAR_INVALID (c0, c1, c2, c3) -<prf>
    [0xF4 < c0 ||
     (c0 == 0xF0 && c1 < 0x90) ||
     (c0 == 0xF4 && 0x90 <= c1) ||
     is_invalid_utf8_continuation_byte c1 ||
     is_invalid_utf8_continuation_byte c2 ||
     is_invalid_utf8_continuation_byte c3]
    void

extern fun {}
is_valid_utf8_character_1byte :
  {c0 : int | is_extended_utf8_1byte_first_byte c0}
  int c0 -<>
    [b : bool | b == true] [u : int]
    (UTF8_CHAR_VALIDITY (1, u, c0, ~1, ~1, ~1, b) | bool b)

extern fun {}
is_valid_utf8_character_2byte :
  {c0, c1 : int | is_extended_utf8_2byte_first_byte c0}
  (int c0, int c1) -<>
    [b : bool] [u : int]
    (UTF8_CHAR_VALIDITY (2, u, c0, c1, ~1, ~1, b) | bool b)

extern fun {}
is_valid_utf8_character_3byte :
  {c0, c1, c2 : int | is_extended_utf8_3byte_first_byte c0}
  (int c0, int c1, int c2) -<>
    [b : bool] [u : int]
    (UTF8_CHAR_VALIDITY (3, u, c0, c1, c2, ~1, b) | bool b)

extern fun {}
is_valid_utf8_character_4byte :
  {c0, c1, c2, c3 : int | is_extended_utf8_4byte_first_byte c0}
  (int c0, int c1, int c2, int c3) -<>
    [b : bool] [u : int]
    (UTF8_CHAR_VALIDITY (4, u, c0, c1, c2, c3, b) | bool b)

extern fun {}
decode_utf8_1byte :
  {c0 : int | is_extended_utf8_1byte_first_byte c0}
  int c0 -<> [u : int | 0 <= u; u <= 0x7F] int u

extern fun {}
decode_utf8_2byte :
  {c0, c1 : int | 0xC2 <= c0; c0 <= 0xDF;
                  is_valid_utf8_continuation_byte c1}
  (int c0, int c1) -<> [u : int | 0x7F < u; u <= 0x7FF] int u

extern fun {}
decode_utf8_3byte :
  {c0, c1, c2 : int | (0xE1 <= c0 && c0 <= 0xEC)
                        || c0 == 0xEE
                        || c0 == 0xEF
                        || (c0 == 0xE0 && 0xA0 <= c1)
                        || (c0 == 0xED && c1 < 0xA0);
                      is_valid_utf8_continuation_byte c1;
                      is_valid_utf8_continuation_byte c2}
  (int c0, int c1, int c2) -<>
    [u : int | 0x7FF < u; u <= 0xFFFF; u < 0xD800 || 0xE000 <= u] int u

extern fun {}
decode_utf8_4byte :
  {c0, c1, c2, c3 : int | (0xF1 <= c0 && c0 <= 0xF3)
                            || (c0 == 0xF0 && 0x90 <= c1)
                            || (c0 == 0xF4 && c1 < 0x90);
                          is_valid_utf8_continuation_byte c1;
                          is_valid_utf8_continuation_byte c2;
                          is_valid_utf8_continuation_byte c3}
  (int c0, int c1, int c2, int c3) -<>
    [u : int | 0xFFFF < u; u <= 0x10FFFF] int u

extern fun {}
encode_utf8_character :
  {u : int | is_valid_unicode_code_point u}
  int u -<>
    [n : int]
    [c0, c1, c2, c3 : int | extended_utf8_char_length_relation (c0, n)]
    @(UTF8_CHAR_VALID (n, u, c0, c1, c2, c3) |
      int n, int c0, int c1, int c2, int c3)

(*------------------------------------------------------------------*)

#define utf8_decode_next_error_char ~1

(* Returns @(utf8_decode_next_error_char, ..) on error. *)
extern fun
utf8_array_decode_next {utf8len    : int | 0 < utf8len}
                       {n_utf8arr  : int | utf8len <= n_utf8arr}
                       {i          : int | i < utf8len}
                       (utf8len    : size_t utf8len,
                        utf8arr    : &(@[char][n_utf8arr]),
                        i          : size_t i) :<>
    [c      : int | is_valid_unicode_code_point c ||
                    c == utf8_decode_next_error_char]
    [i_next : int | i_next == i + 1 || i_next == i + 2 ||
                    i_next == i + 3 || i_next == i + 4;
                    i_next <= utf8len]
    @(int c, size_t i_next)

extern fun
utf8_string_decode_next {utf8len    : int | 0 < utf8len}
                        {n_utf8str  : int | utf8len <= n_utf8str}
                        {i          : int | i < utf8len}
                        (utf8len    : size_t utf8len,
                         utf8str    : string n_utf8str,
                         i          : size_t i) :<>
    [c      : int | is_valid_unicode_code_point c ||
                    c == utf8_decode_next_error_char]
    [i_next : int | i_next == i + 1 || i_next == i + 2 ||
                    i_next == i + 3 || i_next == i + 4;
                    i_next <= utf8len]
    @(int c, size_t i_next)

overload utf8_decode_next with utf8_array_decode_next
overload utf8_decode_next with utf8_string_decode_next

(*###################### IMPLEMENTATION ############################*)

// Integer division by 64 is equivalent to shifting right
// by 6 bits.

extern prfun _shift_right_twelve :
  {x, y, z : nat | y == (x \ndiv 64);
                   z == (y \ndiv 64)}
  (int x, int y, int z) -<prf> [z == (x \ndiv (64 * 64))] void

primplement _shift_right_twelve (x, y, z) =
  ()

extern prfun _shift_right_eighteen :
  {x, y, z, u : nat | y == (x \ndiv 64);
                      z == (y \ndiv 64);
                      u == (z \ndiv 64)}
  (int x, int y, int z, int u) -<prf> [u == (x \ndiv (64 * 64 * 64))] void

primplement _shift_right_eighteen (x, y, z, u) = ()


extern prfun _shift_right_twenty_four :
  {x, y, z, u, v : nat | y == (x \ndiv 64);
                         z == (y \ndiv 64);
                         u == (z \ndiv 64);
                         v == (u \ndiv 64)}
  (int x, int y, int z, int u, int v) -<prf>
    [v == (x \ndiv (64 * 64 * 64 * 64))] void

primplement _shift_right_twenty_four (x, y, z, u, v) = ()

(*------------------------------------------------------------------*)

implement {}
is_valid_unicode_code_point u =
  ((0x0 <= u) * (u < 0xD800)) + ((0xE000 <= u) * (u <= 0x10FFFF))

(*------------------------------------------------------------------*)

primplement
extended_utf8_char_length_relation_to_length () = ()

primplement
extended_utf8_char_length_to_length_relation () = ()

primplement
utf8_char_length_relation_to_length () = ()

primplement
utf8_char_length_to_length_relation () = ()

(*------------------------------------------------------------------*)

implement {}
is_valid_utf8_continuation_byte c =
  (0x80 <= c) * (c <= 0xBF)

(*------------------------------------------------------------------*)

primplement
decode_extended_utf8_istot {n} {c0, c1, c2, c3, c4, c5} () =
  sif n == 1 then
    let
      prfn
      make_pf {u : int | u == extended_utf8_char_1byte_decoding (c0)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_1byte ()
      stadef u = extended_utf8_char_1byte_decoding (c0)
    in
      make_pf {u} ()
    end
  else sif n == 2 then
    let
      prfn
      make_pf {u : int | u == extended_utf8_char_2byte_decoding (c0, c1)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_2byte ()
      stadef u = extended_utf8_char_2byte_decoding (c0, c1)
    in
      make_pf {u} ()
    end
  else sif n == 3 then
    let
      prfn
      make_pf {u : int | u == extended_utf8_char_3byte_decoding (c0, c1, c2)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_3byte ()
      stadef u = extended_utf8_char_3byte_decoding (c0, c1, c2)
    in
      make_pf {u} ()
    end
  else sif n == 4 then
    let
      prfn
      make_pf {u : int |
               u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_4byte ()
      stadef u = extended_utf8_char_4byte_decoding (c0, c1, c2, c3)
    in
      make_pf {u} ()
    end
  else sif n == 5 then
    let
      prfn
      make_pf {u : int |
               u == extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_5byte ()
      stadef u = extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4)
    in
      make_pf {u} ()
    end
  else
    let
      prfn
      make_pf {u : int |
               u == extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5)}
              () :<prf>
          EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) =
        EXTENDED_UTF8_CHAR_6byte ()
      stadef u = extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5)
    in
      make_pf {u} ()
    end

primplement
decode_extended_utf8_isfun {na} (pf_a, pf_b) =
  sif na == 1 then
    let
      prval EXTENDED_UTF8_CHAR_1byte () = pf_a
      prval EXTENDED_UTF8_CHAR_1byte () = pf_b
    in
    end
  else sif na == 2 then
    let
      prval EXTENDED_UTF8_CHAR_2byte () = pf_a
      prval EXTENDED_UTF8_CHAR_2byte () = pf_b
    in
    end
  else sif na == 3 then
    let
      prval EXTENDED_UTF8_CHAR_3byte () = pf_a
      prval EXTENDED_UTF8_CHAR_3byte () = pf_b
    in
    end
  else sif na == 4 then
    let
      prval EXTENDED_UTF8_CHAR_4byte () = pf_a
      prval EXTENDED_UTF8_CHAR_4byte () = pf_b
    in
    end
  else sif na == 5 then
    let
      prval EXTENDED_UTF8_CHAR_5byte () = pf_a
      prval EXTENDED_UTF8_CHAR_5byte () = pf_b
    in
    end
  else
    let
      prval EXTENDED_UTF8_CHAR_6byte () = pf_a
      prval EXTENDED_UTF8_CHAR_6byte () = pf_b
    in
    end


primplement
lemma_extended_utf8_char_length pf_char =
  case+ pf_char of
  | EXTENDED_UTF8_CHAR_1byte () => ()
  | EXTENDED_UTF8_CHAR_2byte () => ()
  | EXTENDED_UTF8_CHAR_3byte () => ()
  | EXTENDED_UTF8_CHAR_4byte () => ()
  | EXTENDED_UTF8_CHAR_5byte () => ()
  | EXTENDED_UTF8_CHAR_6byte () => ()

implement {}
decode_extended_utf8_1byte c0 =
  (EXTENDED_UTF8_CHAR_1byte () | c0)

implement {}
decode_extended_utf8_2byte (c0, c1) =
  let
    val u0 = c0 - 0xC0
    val u1 = c1 - 0x80
  in
    (EXTENDED_UTF8_CHAR_2byte () | 64 * u0 + u1)
  end

implement {}
decode_extended_utf8_3byte (c0, c1, c2) =
  let
    val u0 = c0 - 0xE0
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
  in
    (EXTENDED_UTF8_CHAR_3byte () | 64 * (64 * u0 + u1) + u2)
  end

implement {}
decode_extended_utf8_4byte (c0, c1, c2, c3) =
  let
    val u0 = c0 - 0xF0
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
    val u3 = c3 - 0x80
  in
    (EXTENDED_UTF8_CHAR_4byte () | 64 * (64 * (64 * u0 + u1) + u2) + u3)
  end

implement {}
decode_extended_utf8_5byte (c0, c1, c2, c3, c4) =
  let
    val u0 = c0 - 0xF8
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
    val u3 = c3 - 0x80
    val u4 = c4 - 0x80
  in
    (EXTENDED_UTF8_CHAR_5byte () |
     64 * (64 * (64 * (64 * u0 + u1) + u2) + u3) + u4)
  end

implement {}
decode_extended_utf8_6byte (c0, c1, c2, c3, c4, c5) =
  let
    val u0 = c0 - 0xFC
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
    val u3 = c3 - 0x80
    val u4 = c4 - 0x80
    val u5 = c5 - 0x80
  in
    (EXTENDED_UTF8_CHAR_6byte () |
     64 * (64 * (64 * (64 * (64 * u0 + u1) + u2) + u3) + u4) + u5)
  end

(*------------------------------------------------------------------*)

primplement
extended_utf8_shortest_is_char pf_shortest =
  case+ pf_shortest of
  | EXTENDED_UTF8_SHORTEST_1byte pf_char => pf_char
  | EXTENDED_UTF8_SHORTEST_2byte pf_char => pf_char
  | EXTENDED_UTF8_SHORTEST_3byte pf_char => pf_char
  | EXTENDED_UTF8_SHORTEST_4byte pf_char => pf_char
  | EXTENDED_UTF8_SHORTEST_5byte pf_char => pf_char
  | EXTENDED_UTF8_SHORTEST_6byte pf_char => pf_char

primplement
lemma_extended_utf8_shortest_length pf_shortest =
  case+ pf_shortest of
  | EXTENDED_UTF8_SHORTEST_1byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_1byte () = pf_char }
  | EXTENDED_UTF8_SHORTEST_2byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_2byte () = pf_char }
  | EXTENDED_UTF8_SHORTEST_3byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_3byte () = pf_char }
  | EXTENDED_UTF8_SHORTEST_4byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_4byte () = pf_char }
  | EXTENDED_UTF8_SHORTEST_5byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_5byte () = pf_char }
  | EXTENDED_UTF8_SHORTEST_6byte pf_char =>
    { prval EXTENDED_UTF8_CHAR_6byte () = pf_char }

implement {}
encode_extended_utf8_character u =
  if u <= 0x7F then
    let
      val c0 = u
    in
      @(EXTENDED_UTF8_SHORTEST_1byte (EXTENDED_UTF8_CHAR_1byte ()) |
        1, c0, ~1, ~1, ~1, ~1, ~1)
    end
  else if u <= 0x7FF then
    let
      val c1 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c0 = 0xC0 + u1
    in
      @(EXTENDED_UTF8_SHORTEST_2byte (EXTENDED_UTF8_CHAR_2byte ()) |
        2, c0, c1, ~1, ~1, ~1, ~1)
    end
  else if u <= 0xFFFF then
    let
      val c2 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c1 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c0 = 0xE0 + u2
    in
      @(EXTENDED_UTF8_SHORTEST_3byte (EXTENDED_UTF8_CHAR_3byte ()) |
        3, c0, c1, c2, ~1, ~1, ~1)
    end
  else if u <= 0x1FFFFF then
    let
      val c3 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c2 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c1 = 0x80 + (u2 \nmod 64)
      val u3 = u2 \ndiv 64
      val c0 = 0xF0 + u3
      prval () = _shift_right_twelve (u, u1, u2)
    in
      @(EXTENDED_UTF8_SHORTEST_4byte (EXTENDED_UTF8_CHAR_4byte ()) |
        4, c0, c1, c2, c3, ~1, ~1)
    end
  else if u <= 0x3FFFFFF then
    let
      val c4 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c3 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c2 = 0x80 + (u2 \nmod 64)
      val u3 = u2 \ndiv 64
      val c1 = 0x80 + (u3 \nmod 64)
      val u4 = u3 \ndiv 64
      val c0 = 0xF8 + u4
      prval () = _shift_right_twelve (u, u1, u2)
      prval () = _shift_right_eighteen (u, u1, u2, u3)
    in
      @(EXTENDED_UTF8_SHORTEST_5byte (EXTENDED_UTF8_CHAR_5byte ()) |
        5, c0, c1, c2, c3, c4, ~1)
    end
  else
    let
      val c5 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c4 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c3 = 0x80 + (u2 \nmod 64)
      val u3 = u2 \ndiv 64
      val c2 = 0x80 + (u3 \nmod 64)
      val u4 = u3 \ndiv 64
      val c1 = 0x80 + (u4 \nmod 64)
      val u5 = u4 \ndiv 64
      val c0 = 0xFC + u5
      prval () = _shift_right_twelve (u, u1, u2)
      prval () = _shift_right_eighteen (u, u1, u2, u3)
      prval () = _shift_right_twenty_four (u, u1, u2, u3, u4)
    in
      @(EXTENDED_UTF8_SHORTEST_6byte (EXTENDED_UTF8_CHAR_6byte ()) |
        6, c0, c1, c2, c3, c4, c5)
    end

(*------------------------------------------------------------------*)

primplement
utf8_char_valid_implies_shortest pf_valid =
  case+ pf_valid of
  | UTF8_CHAR_valid (pf_shortest, _) => pf_shortest

primplement
lemma_valid_utf8_character_1byte {u, c0} () =
  let
    prfn
    make_pf {u : int | u == extended_utf8_char_1byte_decoding (c0)}
            () :<prf> EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1) =
      EXTENDED_UTF8_CHAR_1byte ()
    prval pf_char = make_pf {u} ()
    prval pf_shortest = EXTENDED_UTF8_SHORTEST_1byte pf_char
    prval pf_bytes = UTF8_CHAR_VALID_BYTES_1byte ()
    prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes)
  in
    pf_valid
  end

primplement
lemma_valid_utf8_character_2byte {u, c0, c1} () =
  let
    prfn
    make_pf {u : int | u == extended_utf8_char_2byte_decoding (c0, c1)}
            () :<prf> EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1) =
      EXTENDED_UTF8_CHAR_2byte ()
    prval pf_char = make_pf {u} ()
    prval pf_shortest = EXTENDED_UTF8_SHORTEST_2byte pf_char
    prval pf_bytes = UTF8_CHAR_VALID_BYTES_2byte ()
    prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes)
  in
    pf_valid
  end

primplement
lemma_valid_utf8_character_3byte {u, c0, c1, c2} () =
  let
    prfn
    make_pf {u : int | u == extended_utf8_char_3byte_decoding (c0, c1, c2)}
            () :<prf> EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1) =
      EXTENDED_UTF8_CHAR_3byte ()
    prval pf_char = make_pf {u} ()
    prfn lemma_c1 () :<prf> [c1 == 0x80 + ((u \ndiv 64) \nmod 64)] void =
      {
        // Convert to Horner form.
        stadef u1 = 64 * (64 * (c0 - 0xE0) + (c1 - 0x80)) + (c2 - 0x80)
        prval EQINT () = eqint_make {u, u1} ()
      }
    prval () = lemma_c1 ()
    prval pf_shortest = EXTENDED_UTF8_SHORTEST_3byte pf_char
    prval pf_bytes = UTF8_CHAR_VALID_BYTES_3byte ()
    prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes)
  in
    pf_valid
  end

primplement
lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () =
  let
    prfn
    make_pf {u : int | u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)}
            () :<prf> EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1) =
      EXTENDED_UTF8_CHAR_4byte ()
    prval pf_char = make_pf {u} ()
    prfn lemma_c1 () :<prf> [c1 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64)] void =
      {
        prval EQINT () =
          eqint_make {(64 * 64 * (c0 - 0xE0)) \ndiv (64 * 64), c0 - 0xE0} ()
        prval pfd = divmod_istot {u \ndiv (64 * 64), 64} ()
        prval pfm = divmod_elim pfd
        prval () = mul_elim pfm
      }
    prval () = lemma_c1 ()
    prfn lemma_c2 () :<prf> [c2 == 0x80 + ((u \ndiv 64) \nmod 64)] void =
      {
        // Convert to Horner form.
        stadef u1 = 64 * (64 * (64 * (c0 - 0xF0) + (c1 - 0x80)) + (c2 - 0x80))
                     + (c3 - 0x80)
        prval EQINT () = eqint_make {u, u1} ()
      }
    prval () = lemma_c2 ()
    prval pf_shortest = EXTENDED_UTF8_SHORTEST_4byte pf_char
    prval pf_bytes = UTF8_CHAR_VALID_BYTES_4byte ()
    prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes)
  in
    pf_valid
  end

primplement
utf8_character_1byte_valid_bytes pf_valid = ()

primplement
utf8_character_2byte_valid_bytes pf_valid =
  {
    prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid
    prval UTF8_CHAR_VALID_BYTES_2byte () = pf_bytes
  }

primplement
utf8_character_3byte_valid_bytes pf_valid =
  {
    prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid
    prval UTF8_CHAR_VALID_BYTES_3byte () = pf_bytes
  }

primplement
utf8_character_4byte_valid_bytes pf_valid =
  {
    prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid
    prval UTF8_CHAR_VALID_BYTES_4byte () = pf_bytes
  }

primplement
utf8_character_1byte_invalid_bytes pf_invalid =
  {
    prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid
    prval UTF8_CHAR_INVALID_BYTES_1byte () = pf_bytes
  }

primplement
utf8_character_2byte_invalid_bytes pf_invalid =
  {
    prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid
    prval UTF8_CHAR_INVALID_BYTES_2byte () = pf_bytes
  }

primplement
utf8_character_3byte_invalid_bytes pf_invalid =
  {
    prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid
    prval UTF8_CHAR_INVALID_BYTES_3byte () = pf_bytes
  }

primplement
utf8_character_4byte_invalid_bytes pf_invalid =
  {
    prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid
    prval UTF8_CHAR_INVALID_BYTES_4byte () = pf_bytes
  }

implement {}
is_valid_utf8_character_1byte {c0} c0 =
  let
    stadef u = extended_utf8_char_1byte_decoding c0
  in
    (lemma_valid_utf8_character_1byte {u, c0} () | true)
  end

implement {}
is_valid_utf8_character_2byte {c0, c1} (c0, c1) =
  let
    stadef u = extended_utf8_char_2byte_decoding (c0, c1)
  in
    if not (is_valid_utf8_continuation_byte c1) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (),
                          UTF8_CHAR_INVALID_BYTES_2byte ()) |
       false)
    else if c0 < 0xC2 then
      // The sequence is overlong.
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_2byte (),
                          UTF8_CHAR_INVALID_BYTES_2byte ()) |
       false)
    else
      (lemma_valid_utf8_character_2byte {u, c0, c1} () | true)
  end

implement {}
is_valid_utf8_character_3byte {c0, c1, c2} (c0, c1, c2) =
  let
    stadef u = extended_utf8_char_3byte_decoding (c0, c1, c2)
  in
    if not (is_valid_utf8_continuation_byte c1) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (),
                          UTF8_CHAR_INVALID_BYTES_3byte ()) |
       false)
    else if not (is_valid_utf8_continuation_byte c2) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c2 (),
                          UTF8_CHAR_INVALID_BYTES_3byte ()) |
       false)
    else if (0xE1 <= c0) * (c0 <= 0xEC) then
      (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | true)
    else if c0 = 0xEE then
      (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | true)
    else if c0 = 0xEF then
      (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | true)
    else if (c0 = 0xE0) * (0xA0 <= c1) then
      (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | true)
    else if (c0 = 0xED) * (c1 < 0xA0) then
      (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | true)
    else
      // Either the sequence is overlong or it decodes to
      // an invalid code point.
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_3byte (),
                          UTF8_CHAR_INVALID_BYTES_3byte ()) |
       false)
  end

implement {}
is_valid_utf8_character_4byte {c0, c1, c2, c3} (c0, c1, c2, c3) =
  let
    stadef u = extended_utf8_char_4byte_decoding (c0, c1, c2, c3)
  in
    if not (is_valid_utf8_continuation_byte c1) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (),
                          UTF8_CHAR_INVALID_BYTES_4byte ()) |
       false)
    else if not (is_valid_utf8_continuation_byte c2) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c2 (),
                          UTF8_CHAR_INVALID_BYTES_4byte ()) |
       false)
    else if not (is_valid_utf8_continuation_byte c3) then
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c3 (),
                          UTF8_CHAR_INVALID_BYTES_4byte ()) |
       false)
    else if (0xF1 <= c0) * (c0 <= 0xF3) then
      (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () | true)
    else if (c0 = 0xF0) * (0x90 <= c1) then
      (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () | true)
    else if (c0 = 0xF4) * (c1 < 0x90) then
      (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () | true)
    else
      // Either the sequence is overlong or it decodes to
      // an invalid code point.
      (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_4byte (),
                          UTF8_CHAR_INVALID_BYTES_4byte ()) |
       false)
  end

implement {}
decode_utf8_1byte c0 = c0

implement {}
decode_utf8_2byte (c0, c1) =
  let
    val u0 = c0 - 0xC0
    val u1 = c1 - 0x80
  in
    64 * u0 + u1
  end

implement {}
decode_utf8_3byte (c0, c1, c2) =
  let
    val u0 = c0 - 0xE0
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
  in
    64 * (64 * u0 + u1) + u2
  end

implement {}
decode_utf8_4byte (c0, c1, c2, c3) =
  let
    val u0 = c0 - 0xF0
    val u1 = c1 - 0x80
    val u2 = c2 - 0x80
    val u3 = c3 - 0x80
  in
    64 * (64 * (64 * u0 + u1) + u2) + u3
  end

implement {}
encode_utf8_character {u} u =
  if u <= 0x7F then
    let
      val c0 = u
      stadef c0 = u
    in
      @(lemma_valid_utf8_character_1byte {u, c0} () | 1, c0, ~1, ~1, ~1)
    end
  else if u <= 0x7FF then
    let
      val c1 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c0 = 0xC0 + u1
      stadef c1 = 0x80 + (u \nmod 64)
      stadef u1 = u \ndiv 64
      stadef c0 = 0xC0 + u1
    in
      @(lemma_valid_utf8_character_2byte {u, c0, c1} () | 2, c0, c1, ~1, ~1)
    end
  else if u <= 0xFFFF then
    let
      val c2 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c1 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c0 = 0xE0 + u2
      stadef c2 = 0x80 + (u \nmod 64)
      stadef u1 = u \ndiv 64
      stadef c1 = 0x80 + (u1 \nmod 64)
      stadef u2 = u1 \ndiv 64
      stadef c0 = 0xE0 + u2
    in
      @(lemma_valid_utf8_character_3byte {u, c0, c1, c2} () | 3, c0, c1, c2, ~1)
    end
  else
    let
      val c3 = 0x80 + (u \nmod 64)
      val u1 = u \ndiv 64
      val c2 = 0x80 + (u1 \nmod 64)
      val u2 = u1 \ndiv 64
      val c1 = 0x80 + (u2 \nmod 64)
      val u3 = u2 \ndiv 64
      val c0 = 0xF0 + u3
      stadef c3 = 0x80 + (u \nmod 64)
      stadef u1 = u \ndiv 64
      stadef c2 = 0x80 + (u1 \nmod 64)
      stadef u2 = u1 \ndiv 64
      stadef c1 = 0x80 + (u2 \nmod 64)
      stadef u3 = u2 \ndiv 64
      stadef c0 = 0xF0 + u3
      prval () = _shift_right_twelve (u, u1, u2)
    in
      @(lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () |
        4, c0, c1, c2, c3)
    end

(*------------------------------------------------------------------*)

implement
utf8_array_decode_next (utf8len, utf8arr, i) =
  let
    macdef getchr (j) = g1ofg0 (char2u2i (utf8arr[,(j)]))
    macdef error_return = @(utf8_decode_next_error_char, i + i2sz 1)

    prval _ = lemma_g1uint_param (i)

    val c0 = getchr (i)
  in
    if 0x00 <= c0 &&& c0 <= 0xFF then
      let
        val seqlen = utf8_character_length c0
      in
        case+ seqlen of
        | 1 => @(c0, i + i2sz 1)

        | 2 =>
          if utf8len < i + i2sz 2 then
            error_return
          else
            let
              val c1 = getchr (i + i2sz 1)
              val (pf | valid) =
                is_valid_utf8_character_2byte (c0, c1)
            in
              if valid then
                let
                  prval _ = utf8_character_2byte_valid_bytes pf
                  val code_point = decode_utf8_2byte (c0, c1)
                in
                  @(code_point, i + i2sz 2)
                end
              else
                error_return
            end

        | 3 =>
          if utf8len < i + i2sz 3 then
            error_return
          else
            let
              val c1 = getchr (i + i2sz 1)
              val c2 = getchr (i + i2sz 2)
              val (pf | valid) =
                is_valid_utf8_character_3byte (c0, c1, c2)
            in
              if valid then
                let
                  prval _ = utf8_character_3byte_valid_bytes pf
                  val code_point = decode_utf8_3byte (c0, c1, c2)
                in
                  @(code_point, i + i2sz 3)
                end
              else
                error_return
            end

        | 4 =>
          if utf8len < i + i2sz 4 then
            error_return
          else
            let
              val c1 = getchr (i + i2sz 1)
              val c2 = getchr (i + i2sz 2)
              val c3 = getchr (i + i2sz 3)
              val (pf | valid) =
                is_valid_utf8_character_4byte (c0, c1, c2, c3)
            in
              if valid then
                let
                  prval _ = utf8_character_4byte_valid_bytes pf
                  val code_point = decode_utf8_4byte (c0, c1, c2, c3)
                in
                  @(code_point, i + i2sz 4)
                end
              else
                error_return
            end

        | _ => error_return
      end
    else
      (* This branch should never be run on a system
         with 8-bit char. *)
      error_return
  end

implement
utf8_string_decode_next {..} {n_utf8str} (utf8len, utf8str, i) =
  let
    val [p : addr] p = string2ptr utf8str
    val (pf, consume_pf | p) =
      $UNSAFE.ptr1_vtake{@[char][n_utf8str]} p
    val result = utf8_array_decode_next (utf8len, !p, i)
    prval _ = consume_pf pf
  in
    result
  end

(*###################### DEMONSTRATION #############################*)

fn
encode_LATIN_CAPITAL_LETTER_A () : void =
  {
    val u = 0x0041

    (* Return both a proof of valid encoding and the encoding. *)
    val (pf_valid | n, c0, c1, c2, c3) = encode_utf8_character (u)

    (* Verify the encoding. *)
    val _ = assertloc (n = 1)
    val _ = assertloc (c0 = 0x41)
  }

fn
decode_LATIN_CAPITAL_LETTER_A () : void =
  {
    val str = "\x41\0"
    val n = length str
    val (c, i) = utf8_decode_next (n, str, i2sz 0)

    (* Verify that the decoding is correct. *)
    val _ = assertloc (c = 0x0041)

    (* Verify that the next index is 1. *)
    val _ = assertloc (i = i2sz 1)
  }

fn
encode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS () : void =
  {
    val u = 0x00F6

    (* Return both a proof of valid encoding and the encoding. *)
    val (pf_valid | n, c0, c1, c2, c3) = encode_utf8_character (u)

    (* Verify the encoding. *)
    val _ = assertloc (n = 2)
    val _ = assertloc (c0 = 0xC3)
    val _ = assertloc (c1 = 0xB6)
  }

fn
decode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS () : void =
  {
    val str = "\xC3\xB6\0"
    val n = length str
    val (c, i) = utf8_decode_next (n, str, i2sz 0)

    (* Verify that the decoding is correct. *)
    val _ = assertloc (c = 0x00F6)

    (* Verify that the next index is 2. *)
    val _ = assertloc (i = i2sz 2)
  }

fn
encode_CYRILLIC_CAPITAL_LETTER_ZHE () : void =
  {
    val u = 0x0416

    (* Return both a proof of valid encoding and the encoding. *)
    val (pf_valid | n, c0, c1, c2, c3) = encode_utf8_character (u)

    (* Verify the encoding. *)
    val _ = assertloc (n = 2)
    val _ = assertloc (c0 = 0xD0)
    val _ = assertloc (c1 = 0x96)
  }

fn
decode_CYRILLIC_CAPITAL_LETTER_ZHE () : void =
  {
    val str = "\xD0\x96\0"
    val n = length str
    val (c, i) = utf8_decode_next (n, str, i2sz 0)

    (* Verify that the decoding is correct. *)
    val _ = assertloc (c = 0x0416)

    (* Verify that the next index is 2. *)
    val _ = assertloc (i = i2sz 2)
  }

fn
encode_EURO_SIGN () : void =
  {
    val u = 0x20AC

    (* Return both a proof of valid encoding and the encoding. *)
    val (pf_valid | n, c0, c1, c2, c3) = encode_utf8_character (u)

    (* Verify the encoding. *)
    val _ = assertloc (n = 3)
    val _ = assertloc (c0 = 0xE2)
    val _ = assertloc (c1 = 0x82)
    val _ = assertloc (c2 = 0xAC)
  }

fn
decode_EURO_SIGN () : void =
  {
    val str = "\xE2\x82\xAC\0"
    val n = length str
    val (c, i) = utf8_decode_next (n, str, i2sz 0)

    (* Verify that the decoding is correct. *)
    val _ = assertloc (c = 0x20AC)

    (* Verify that the next index is 3. *)
    val _ = assertloc (i = i2sz 3)
  }

fn
encode_MUSICAL_SYMBOL_G_CLEF () : void =
  {
    val u = 0x1D11E

    (* Return both a proof of valid encoding and the encoding. *)
    val (pf_valid | n, c0, c1, c2, c3) = encode_utf8_character (u)

    (* Verify the encoding. *)
    val _ = assertloc (n = 4)
    val _ = assertloc (c0 = 0xF0)
    val _ = assertloc (c1 = 0x9D)
    val _ = assertloc (c2 = 0x84)
    val _ = assertloc (c3 = 0x9E)
  }

fn
decode_MUSICAL_SYMBOL_G_CLEF () : void =
  {
    val str = "\xF0\x9D\x84\x9E\0"
    val n = length str
    val (c, i) = utf8_decode_next (n, str, i2sz 0)

    (* Verify that the decoding is correct. *)
    val _ = assertloc (c = 0x1D11E)

    (* Verify that the next index is 4. *)
    val _ = assertloc (i = i2sz 4)
  }

implement
main0 () =
  begin
    encode_LATIN_CAPITAL_LETTER_A ();
    decode_LATIN_CAPITAL_LETTER_A ();

    encode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS ();
    decode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS ();

    encode_CYRILLIC_CAPITAL_LETTER_ZHE ();
    decode_CYRILLIC_CAPITAL_LETTER_ZHE ();

    encode_EURO_SIGN ();
    decode_EURO_SIGN ();

    encode_MUSICAL_SYMBOL_G_CLEF ();
    decode_MUSICAL_SYMBOL_G_CLEF ();

    println! ("SUCCESS")
  end
Output:
$ patscc -O2 utf8_encoding.dats && ./a.out
SUCCESS

AutoHotkey

Encode_UTF(hex){
	Bytes :=  hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0
	Prefix := [0, 0xC0, 0xE0, 0xF0]
	loop % Bytes {
		if (A_Index < Bytes)
			UTFCode := Format("{:X}", (hex&0x3F) + 0x80) . UTFCode		; 3F=00111111, 80=10000000
		else
			UTFCode := Format("{:X}", hex + Prefix[Bytes]) . UTFCode	; C0=11000000, E0=11100000, F0=11110000
		hex := hex>>6
	}
	return "0x" UTFCode
}
;----------------------------------------------------------------------------------------
Decode_UTF(hex){
	Bytes :=  hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0
	bin := ConvertBase(16, 2, hex)
	loop, % Bytes {
		B := SubStr(bin, -7)
		if Bytes > 1
			B := LTrim(B, 1) , B := StrReplace(B, 0,,, 1) 
		bin := SubStr(bin, 1, StrLen(bin)-8)
		Uni := B . Uni
	}
	return "0x" ConvertBase(2, 16, Uni)
}
;----------------------------------------------------------------------------------------
; www.autohotkey.com/boards/viewtopic.php?f=6&t=3607#p18985
ConvertBase(InputBase, OutputBase, number){
    static u := A_IsUnicode ? "_wcstoui64" : "_strtoui64"
    static v := A_IsUnicode ? "_i64tow"    : "_i64toa"
    VarSetCapacity(s, 65, 0)
    value := DllCall("msvcrt.dll\" u, "Str", number, "UInt", 0, "UInt", InputBase, "CDECL Int64")
    DllCall("msvcrt.dll\" v, "Int64", value, "Str", s, "UInt", OutputBase, "CDECL")
    return s
}

Examples:

data = 
(comment
0x0041
0x00F6
0x0416
0x20AC
0x1D11E
)

output := "unicode`t`tUTF`t`tunicode`n"
for i, Hex in StrSplit(data, "`n", "`r"){
	UTFCode := Encode_UTF(Hex)
	output .= Hex "`t`t" UTFCode "`t`t" Decode_UTF(UTFCode) "`n"
}
MsgBox % output
return
Output:
Unicode		Encode_UTF	Decode_UTF
0x0041		0x41		0x41
0x00F6		0xC3B6		0xf6
0x0416		0xD096		0x416
0x20AC		0xE282AC	0x20ac
0x1D11E		0xF09D849E	0x1d11e

BaCon

BaCon supports UTF8 natively.

DECLARE x TYPE STRING

CONST letter$ = "A ö Ж € 𝄞"

PRINT "Char", TAB$(1), "Unicode", TAB$(2), "UTF-8 (hex)"
PRINT "-----------------------------------"

FOR x IN letter$
    PRINT x, TAB$(1), "U+", HEX$(UCS(x)), TAB$(2), COIL$(LEN(x), HEX$(x[_-1] & 255))
NEXT
Output:
Char    Unicode         UTF-8 (hex)
-----------------------------------
A       U+41            41
ö       U+F6            C3 B6
Ж       U+416           D0 96
€       U+20AC          E2 82 AC
𝄞       U+1D11E         F0 9D 84 9E

C

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

typedef struct {
	char mask;    /* char data will be bitwise AND with this */
	char lead;    /* start bytes of current char in utf-8 encoded character */
	uint32_t beg; /* beginning of codepoint range */
	uint32_t end; /* end of codepoint range */
	int bits_stored; /* the number of bits from the codepoint that fits in char */
}utf_t;

utf_t * utf[] = {
	/*             mask        lead        beg      end       bits */
	[0] = &(utf_t){0b00111111, 0b10000000, 0,       0,        6    },
	[1] = &(utf_t){0b01111111, 0b00000000, 0000,    0177,     7    },
	[2] = &(utf_t){0b00011111, 0b11000000, 0200,    03777,    5    },
	[3] = &(utf_t){0b00001111, 0b11100000, 04000,   0177777,  4    },
	[4] = &(utf_t){0b00000111, 0b11110000, 0200000, 04177777, 3    },
	      &(utf_t){0},
};

/* All lengths are in bytes */
int codepoint_len(const uint32_t cp); /* len of associated utf-8 char */
int utf8_len(const char ch);          /* len of utf-8 encoded char */

char *to_utf8(const uint32_t cp);
uint32_t to_cp(const char chr[4]);

int codepoint_len(const uint32_t cp)
{
	int len = 0;
	for(utf_t **u = utf; *u; ++u) {
		if((cp >= (*u)->beg) && (cp <= (*u)->end)) {
			break;
		}
		++len;
	}
	if(len > 4) /* Out of bounds */
		exit(1);

	return len;
}

int utf8_len(const char ch)
{
	int len = 0;
	for(utf_t **u = utf; *u; ++u) {
		if((ch & ~(*u)->mask) == (*u)->lead) {
			break;
		}
		++len;
	}
	if(len > 4) { /* Malformed leading byte */
		exit(1);
	}
	return len;
}

char *to_utf8(const uint32_t cp)
{
	static char ret[5];
	const int bytes = codepoint_len(cp);

	int shift = utf[0]->bits_stored * (bytes - 1);
	ret[0] = (cp >> shift & utf[bytes]->mask) | utf[bytes]->lead;
	shift -= utf[0]->bits_stored;
	for(int i = 1; i < bytes; ++i) {
		ret[i] = (cp >> shift & utf[0]->mask) | utf[0]->lead;
		shift -= utf[0]->bits_stored;
	}
	ret[bytes] = '\0';
	return ret;
}

uint32_t to_cp(const char chr[4])
{
	int bytes = utf8_len(*chr);
	int shift = utf[0]->bits_stored * (bytes - 1);
	uint32_t codep = (*chr++ & utf[bytes]->mask) << shift;

	for(int i = 1; i < bytes; ++i, ++chr) {
		shift -= utf[0]->bits_stored;
		codep |= ((char)*chr & utf[0]->mask) << shift;
	}

	return codep;
}

int main(void)
{
	const uint32_t *in, input[] = {0x0041, 0x00f6, 0x0416, 0x20ac, 0x1d11e, 0x0};

	printf("Character  Unicode  UTF-8 encoding (hex)\n");
	printf("----------------------------------------\n");

	char *utf8;
	uint32_t codepoint;
	for(in = input; *in; ++in) {
		utf8 = to_utf8(*in);
		codepoint = to_cp(utf8);
		printf("%s          U+%-7.4x", utf8, codepoint);

		for(int i = 0; utf8[i] && i < 4; ++i) {
			printf("%hhx ", utf8[i]);
		}
		printf("\n");
	}
	return 0;
}

Output

Character  Unicode  UTF-8 encoding (hex)
----------------------------------------
A          U+0041   41
ö          U+00f6   c3 b6
Ж          U+0416   d0 96
€          U+20ac   e2 82 ac
𝄞          U+1d11e  f0 9d 84 9e

C#

using System;
using System.Text;

namespace Rosetta
{
    class Program
    {
        static byte[] MyEncoder(int codepoint) => Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codepoint));
        static string MyDecoder(byte[] utf8bytes) => Encoding.UTF8.GetString(utf8bytes);
        static void Main(string[] args)
        {
            Console.OutputEncoding = Encoding.UTF8;  // makes sure it doesn't print rectangles...
            foreach (int unicodePoint in new int[] {  0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E})
            {
                byte[] asUtf8bytes = MyEncoder(unicodePoint);
                string theCharacter = MyDecoder(asUtf8bytes);
                Console.WriteLine("{0,8} {1,5}     {2,-15}", unicodePoint.ToString("X4"), theCharacter, BitConverter.ToString(asUtf8bytes));
            }
        }
    }
}
/* Output:
 *  0041     A     41
    00F6     ö     C3-B6
    0416     Ж     D0-96
    20AC     €     E2-82-AC
   1D11E     𝄞     F0-9D-84-9E */

C++

#include <codecvt>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <vector>

struct utf8 {
	char mask;            /* char data will be bitwise ANDed with this binary number */
	char start;           /* start index of the bytes of current char in a utf-8 encoded character */
	uint32_t begin;       /* beginning of codepoint range */
	uint32_t end;         /* end of codepoint range */
	uint32_t bits_stored; /* the number of bits from the codepoint that fit in the char */
};

const std::vector<utf8> utf8s = {
	/*    mask                          start        begin    end       bits */
	utf8{ 0b00111111, static_cast<char>(0b10000000), 0,       0,        6 },
	utf8{ 0b01111111, static_cast<char>(0b00000000), 0000,    0177,     7 },
	utf8{ 0b00011111, static_cast<char>(0b11000000), 0200,    03777,    5 },
	utf8{ 0b00001111, static_cast<char>(0b11100000), 04000,   0177777,  4 },
	utf8{ 0b00000111, static_cast<char>(0b11110000), 0200000, 04177777, 3 } };

uint32_t codepoint_size(const uint32_t& codepoint) {
	uint32_t size = 0;
	for ( const utf8& utf : utf8s ) {
		if ( ( codepoint >= utf.begin ) && ( codepoint <= utf.end ) ) {
			break;
		}
		size++;
	}
	return size;
}

uint32_t utf8_size(const char& ch) {
	uint32_t size = 0;
	for ( const utf8& utf : utf8s ) {
		if( ( ch & ~utf.mask ) == utf.start ) {
			break;
		}
		size++;
	}
	return size;
}

std::vector<char> to_utf8(const uint32_t& codepoint) {
	const uint32_t byte_count = codepoint_size(codepoint);
	std::vector<char> result{ };

	uint32_t shift = utf8s[0].bits_stored * ( byte_count - 1 );
	result.emplace_back(( codepoint >> shift & utf8s[byte_count].mask ) | utf8s[byte_count].start);
	shift -= utf8s[0].bits_stored;
	for ( uint32_t i = 1; i < byte_count; ++i ) {
		result.emplace_back(( codepoint >> shift & utf8s[0].mask ) | utf8s[0].start);
		shift -= utf8s[0].bits_stored;
	}
	return result;
}

uint32_t to_codepoint(const std::vector<char>& chars) {
	const uint32_t byte_count = utf8_size(chars[0]);
	uint32_t shift = utf8s[0].bits_stored * ( byte_count - 1 );
	uint32_t codepoint = ( chars[0] & utf8s[byte_count].mask ) << shift;

	for ( uint32_t index = 1; index < byte_count; ++index ) {
		shift -= utf8s[0].bits_stored;
		codepoint |= ( chars[index] & utf8s[0].mask ) << shift;
	}
	return codepoint;
}

int main() {
	const std::vector<uint32_t> tests = { 0x0041, 0x00f6, 0x0416, 0x20ac, 0x1d11e };

	std::cout << "Character   Unicode   UTF-8 encoding (hex)" << "\n";
	std::cout << "------------------------------------------" << "\n";

	for ( const uint32_t& test : tests ) {
		std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
		std::cout << convert.to_bytes(test) << "\t" << "    ";

		const std::vector<char> utf8 = to_utf8(test);
		const uint32_t codepoint = to_codepoint(utf8);
		std::cout << "U+" << std::setw(4) << std::setfill('0') << std::hex << codepoint << "\t";

		for ( uint32_t i = 0; i < utf8.size(); ++i ) {
			std::cout << std::setw(2) << ( utf8[i] & 0xff ) << " ";
		}
		std::cout << "\n";
	}
}
Output:
Character   Unicode   UTF-8 encoding (hex)
------------------------------------------
A	    U+0041	41 
ö	    U+00f6	c3 b6 
Ж	    U+0416	d0 96 
€	    U+20ac	e2 82 ac 
𝄞	    U+1d11e	f0 9d 84 9e 

Common Lisp

Helper functions

(defun ascii-byte-p (octet)
  "Return t if octet is a single-byte 7-bit ASCII char.
  The most significant bit is 0, so the allowed pattern is 0xxx xxxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b00000000))
    ;; bitwise and the with the bitmask #b11000000 to extract the first two bits.
    ;; check if the first two bits are equal to the template #b10000000.
    (= (logand bitmask octet) template)))

(defun multi-byte-p (octet)
  "Return t if octet is a part of a multi-byte UTF-8 sequence.
  The multibyte pattern is 1xxx xxxx. A multi-byte can be either a lead byte or a trail byte."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b10000000))
    ;; bitwise and the with the bitmask #b11000000 to extract the first two bits.
    ;; check if the first two bits are equal to the template #b10000000.
    (= (logand bitmask octet) template)))

(defun lead-byte-p (octet)
  "Return t if octet is one of the leading bytes of an UTF-8 sequence, nil otherwise.
  Allowed leading byte patterns are 0xxx xxxx, 110x xxxx, 1110 xxxx and 1111 0xxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (some #'(lambda (a b) (= (logand a octet) b)) bitmasks templates)))

(defun n-trail-bytes (octet)
  "Take a leading utf-8 byte, return the number of continuation bytes 1-3."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (loop for i from 0 to 3
       when (= (nth i templates) (logand (nth i bitmasks) octet))
       return i)))

Encoder

(defun unicode-to-utf-8 (int)
  "Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)."
  (assert (<= (integer-length int) 21))
  (let ((n-trail-bytes (cond ((<= #x00000 int #x00007F) 0)
                             ((<= #x00080 int #x0007FF) 1)
                             ((<= #x00800 int #x00FFFF) 2)
                             ((<= #x10000 int #x10FFFF) 3)))
        (lead-templates (list #b00000000 #b11000000 #b11100000 #b11110000))
        (trail-template #b10000000)
        ;; number of content bits in the lead byte.
        (n-lead-bits (list 7 5 4 3))
        ;; number of content bits in the trail byte.
        (n-trail-bits 6)
        ;; list to put the UTF-8 encoded bytes in.
        (byte-list nil))
    (if (= n-trail-bytes 0)
        ;; if we need 0 trail bytes, ist just an ascii single byte.
        (push int byte-list)
        (progn
          ;; if we need more than one byte, first fill the trail bytes with 6 bits each.
          (loop for i from 0 to (1- n-trail-bytes)
             do (push (+ trail-template
                         (ldb (byte n-trail-bits (* i n-trail-bits)) int))
                      byte-list))
          ;; then copy the remaining content bytes to the lead byte.
          (push (+ (nth n-trail-bytes lead-templates)
                   (ldb (byte (nth n-trail-bytes n-lead-bits) (* n-trail-bytes n-trail-bits)) int))
                byte-list)))
    ;; return the list of UTF-8 encoded bytes.
    byte-list))

Decoder

(defun utf-8-to-unicode (byte-list)
  "Take a list of one to four utf-8 encoded bytes (octets), return a code point."
  (let ((b1 (car byte-list)))
    (cond ((ascii-byte-p b1) b1) ; if a single byte, just return it.
          ((multi-byte-p b1)
           (if (lead-byte-p b1)
               (let ((n (n-trail-bytes b1))
                     ;; Content bits we want to extract from each lead byte.
                     (lead-templates (list #b01111111 #b00011111 #b00001111 #b00000111))
                     ;; Content bits we want to extract from each trail byte.
                     (trail-template #b00111111))
                 (if (= n (1- (list-length byte-list)))
                     ;; add lead byte
                     (+ (ash (logand (nth 0 byte-list) (nth n lead-templates)) (* 6 n))
                        ;; and the trail bytes
                        (loop for i from 1 to n sum
                             (ash (logand (nth i byte-list) trail-template) (* 6 (- n i)))))
                     (error "calculated number of bytes doesnt match the length of the byte list")))
               (error "first byte in the list isnt a lead byte"))))))

The test

(defun test-utf-8 ()
  "Return t if the chosen unicode points are encoded and decoded correctly."
  (let* ((unicodes-orig (list 65 246 1046 8364 119070))
         (unicodes-test (mapcar #'(lambda (x) (utf-8-to-unicode (unicode-to-utf-8 x)))
                                unicodes-orig)))
    (mapcar #'(lambda (x)
                (format t
                        "character ~A, code point: ~6x, utf-8: ~{~x ~}~%"
                        (code-char x)
                        x
                        (unicode-to-utf-8 x)))
            unicodes-orig)
    ;; return t if all are t
    (every #'= unicodes-orig unicodes-test)))

Test output

CL-USER> (test-utf-8)
character A, code point:     41, utf-8: 41 
character ö, code point:     F6, utf-8: C3 B6 
character Ж, code point:    416, utf-8: D0 96 
character , code point:   20AC, utf-8: E2 82 AC 
character 𝄞, code point:  1D11E, utf-8: F0 9D 84 9E 
T

D

import std.conv;
import std.stdio;

immutable CHARS = ["A","ö","Ж","€","𝄞"];

void main() {
    writeln("Character   Code-Point   Code-Units");
    foreach (c; CHARS) {
        auto bytes = cast(ubyte[]) c; //The raw bytes of a character can be accessed by casting
        auto unicode = cast(uint) to!dstring(c)[0]; //Convert from a UTF8 string to a UTF32 string, and cast the first character to a number
        writefln("%s              %7X   [%(%X, %)]", c, unicode, bytes);
    }
}
Output:
Character   Code-Point   Code-Units
A                   41   [41]
ö                   F6   [C3, B6]
Ж                  416   [D0, 96]
€                 20AC   [E2, 82, AC]
𝄞                1D11E   [F0, 9D, 84, 9E]

DuckDB

Works with: DuckDB version V1.0

Codepoint to utf-8-encoded string and return

DuckDB provides chr() for mapping a decimal integer codepoint to a UTF-8 encoded string, and unicode() for the return trip:

  codepoint -> chr(codepoint) as s -> unicode(s)

For example:

D select 246 as codepoint, chr(codepoint) as string, unicode(string) == codepoint;
┌───────────┬─────────┬───────────────────────────────┐
│ codepoint │ string  │ (unicode(string) = codepoint) │
│   int32   │ varchar │            boolean            │
├───────────┼─────────┼───────────────────────────────┤
│       246 │ ö       │ true                          │
└───────────┴─────────┴───────────────────────────────┘

Round-trip from codepoint via hex representation of the utf-8 bytes

The round-trip from codepoint via a hex-string representation of the UTF-8-encoded string can be made as follows:

  codepoint -> hex(encode(chr(codepoint)) as x -> decode(from_hex(x))
create or replace function utf8x(codepoint) as (
   select hex(encode(chr(codepoint)))
);

# Example:
select s,
       unicode(s) as codepoint,
       chr(codepoint) as c,
       unicode(c),
       utf8x(codepoint),
       decode(from_hex(utf8x(codepoint))) as "decode(from_hex())"
from values ('A'), ('ö'), ('Ж'), ('€'), ('𝄞') t(s);
┌─────────┬───────────┬─────────┬────────────┬──────────────────┬────────────────────┐
│    s    │ codepoint │    c    │ unicode(c) │ utf8x(codepoint) │ decode(from_hex()) │
│ varchar │   int32   │ varchar │   int32    │     varchar      │      varchar       │
├─────────┼───────────┼─────────┼────────────┼──────────────────┼────────────────────┤
│ A       │        65 │ A       │         65 │ 41               │ A                  │
│ ö       │       246 │ ö       │        246 │ C3B6             │ ö                  │
│ Ж       │      1046 │ Ж       │       1046 │ D096             │ Ж                  │
│ €       │      8364 │ €       │       8364 │ E282AC           │ €                  │
│ 𝄞       │    119070 │ 𝄞       │     119070 │ F09D849E         │ 𝄞                  │
└─────────┴───────────┴─────────┴────────────┴──────────────────┴────────────────────┘

Elena

ELENA 6.x :

import system'routines;
import extensions;
 
extension op : String
{
    string printAsString()
    {
       console.print(self," ")
    }
 
    string printAsUTF8Array()
    {
        self.toByteArray().forEach::(b){ console.print(b.toString(16)," ") }
    }
 
    string printAsUTF32()
    {
        self.toArray().forEach::(c){ console.print("U+",c.toInt().toString(16)," ") }
    }
}
 
public program()
{
    "A".printAsString().printAsUTF8Array().printAsUTF32();
    console.printLine();
 
    "ö".printAsString().printAsUTF8Array().printAsUTF32();
    console.printLine();
 
    "Ж".printAsString().printAsUTF8Array().printAsUTF32();
    console.printLine();
 
    "€".printAsString().printAsUTF8Array().printAsUTF32();
    console.printLine();
 
    "𝄞".printAsString().printAsUTF8Array().printAsUTF32();
    console.printLine();
}
Output:
A 41 U+41 
ö C3 B6 U+F6 
Ж D0 96 U+416 
€ E2 82 AC U+20AC 
𝄞 F0 9D 84 9E U+1D11E

FreeBASIC

Translation of: VBScript
Function unicode_2_utf8(x As Long) As String
    Dim As String y
    Dim As Long r
    Select Case x
    Case 0 To &H7F
        y = Chr(x)
    Case &H80 To &H7FF
        y = Chr(192 + x \ 64) + Chr(128 + x Mod 64)
    Case &H800 To &H7FFF, 32768 To 65535
        r = x \ 64
        y = Chr(224 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + x Mod 64)
    Case &H10000 To &H10FFFF
        r = x \ 4096
        y = Chr(240 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + (x \ 64) Mod 64) + Chr(128 + x Mod 64)
    Case Else
        Print "what else? " & x & " " & Hex(x)
    End Select
    Return y
End Function

Function utf8_2_unicode(x As String) As Long
    Dim As Long primero, segundo, tercero, cuarto
    Dim As Long total
    Select Case Len(x)
    Case 1 'one byte
        If Asc(x) < 128 Then
            total = Asc(x)
        Else
            Print "highest bit set error"
        End If
    Case 2 'two bytes and assume primero byte is leading byte
        If Asc(x) \ 32 = 6 Then
            primero = Asc(x) Mod 32
            If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
                segundo = Asc(Mid(x, 2, 1)) Mod 64
            Else
                Print "mask error"
            End If
        Else
            Print "leading byte error"
        End If
        total = 64 * primero + segundo
    Case 3 'three bytes and assume primero byte is leading byte
        If Asc(x) \ 16 = 14 Then
            primero = Asc(x) Mod 16
            If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
                segundo = Asc(Mid(x, 2, 1)) Mod 64
                If Asc(Mid(x, 3, 1)) \ 64 = 2 Then
                    tercero = Asc(Mid(x, 3, 1)) Mod 64
                Else
                    Print "mask error last byte"
                End If
            Else
                Print "mask error middle byte"
            End If
        Else
            Print "leading byte error"
        End If
        total = 4096 * primero + 64 * segundo + tercero
    Case 4 'four bytes and assume primero byte is leading byte
        If Asc(x) \ 8 = 30 Then
            primero = Asc(x) Mod 8
            If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
                segundo = Asc(Mid(x, 2, 1)) Mod 64
                If Asc(Mid(x, 3, 1)) \ 64 = 2 Then
                    tercero = Asc(Mid(x, 3, 1)) Mod 64
                    If Asc(Mid(x, 4, 1)) \ 64 = 2 Then
                        cuarto = Asc(Mid(x, 4, 1)) Mod 64
                    Else
                        Print "mask error last byte"
                    End If
                Else
                    Print "mask error tercero byte"
                End If
            Else
                Print "mask error second byte"
            End If
        Else
            Print "mask error leading byte"
        End If
        total = Clng(262144 * primero + 4096 * segundo + 64 * tercero + cuarto)
    Case Else
        Print "more bytes than expected"
    End Select
    Return total
End Function

Dim As Long cp(4) = {65, 246, 1046, 8364, 119070} '[{&H0041,&H00F6,&H0416,&H20AC,&H1D11E}]
Dim As String r, s
Dim As integer i, j
Print "ch  unicode  UTF-8 encoded  decoded"
For i = Lbound(cp) To Ubound(cp)
    Dim As Long cpi = cp(i)
    r = unicode_2_utf8(cpi)
    s = Hex(cpi)
    Print Chr(cpi); Space(10 - Len(s)); s;
    s = ""
    For j = 1 To Len(r)
        s &= Hex(Asc(Mid(r, j, 1))) & " "
    Next j
    Print Space(16 - Len(s)); s;
    s = Hex(utf8_2_unicode(r))
    Print Space(8 - Len(s)); s
Next i

Sleep

F#

// Unicode character point to UTF8. Nigel Galloway: March 19th., 2018
let fN g = match List.findIndex (fun n->n>g) [0x80;0x800;0x10000;0x110000] with
           |0->[g]
           |1->[0xc0+(g&&&0x7c0>>>6);0x80+(g&&&0x3f)]
           |2->[0xe0+(g&&&0xf000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
           |_->[0xf0+(g&&&0x1c0000>>>18);0x80+(g&&&0x3f000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
Output:
for n in fN 0x41    do printf "%x " n -> 41
for n in fN 0xf6    do printf "%x " n -> c3 b6 
for n in fN 0x416   do printf "%x " n -> d0 96 
for n in fN 0x20ac  do printf "%x " n -> e2 82 ac 
for n in fN 0x1d11e do printf "%x " n -> f0 9d 84 9e 

Forth

Forth-2012 includes the words xc!+ and xc@+ that convert code point numbers into bytes and vice versa, respectively, for the encoding the system is using. If the system uses UTF-8 (e.g., in Gforth, if it is started in a UTF-8 environment), this converts between code point number and UTF-8.

So the following code just uses these words to perform the tests.

Works with: gforth version 0.7.9_20191121
Works with: lxf version 1.6-982-823
: showbytes ( c-addr u -- )
    over + swap ?do
	i c@ 3 .r loop ;

: test {: xc -- :}
    xc xemit xc 6 .r xc pad xc!+ pad tuck - ( c-addr u )
    2dup showbytes drop xc@+ xc <> abort" test failed" drop cr ;

hex
$41 test $f6 test $416 test $20ac test $1d11e test
\ can also be written as
\ 'A' test 'ö' test 'Ж' test '€' test '𝄞' test
Output:
A    41 41
ö    F6 C3 B6
Ж   416 D0 96
€  20AC E2 82 AC
𝄞 1D11E F0 9D 84 9E

If you also want to see the implementation of xc!+ and xc@+, here it is (u8!+ is the UTF-8 implementation of xc!+, and likewise for u8@+):

-77 Constant UTF-8-err

$80 Constant max-single-byte

: u8@+ ( u8addr -- u8addr' u )
    count  dup max-single-byte u< ?EXIT  \ special case ASCII
    dup $C2 u< IF  UTF-8-err throw  THEN  \ malformed character
    $7F and  $40 >r
    BEGIN  dup r@ and  WHILE  r@ xor
	    6 lshift r> 5 lshift >r >r count
	    dup $C0 and $80 <> IF   UTF-8-err throw  THEN
	    $3F and r> or
    REPEAT  rdrop ;

: u8!+ ( u u8addr -- u8addr' )
    over max-single-byte u< IF  tuck c! 1+  EXIT  THEN \ special case ASCII
    >r 0 swap  $3F
    BEGIN  2dup u>  WHILE
	    2/ >r  dup $3F and $80 or swap 6 rshift r>
    REPEAT  $7F xor 2* or  r>
    BEGIN   over $80 u>= WHILE  tuck c! 1+  REPEAT  nip ;

Go

Implementation

This implementation is missing all checks for invalid data and so is not production-ready, but illustrates the basic UTF-8 encoding scheme.

package main

import (
    "bytes"
    "encoding/hex"
    "fmt"
    "log"
    "strings"
)

var testCases = []struct {
    rune
    string
}{
    {'A', "41"},
    {'ö', "C3 B6"},
    {'Ж', "D0 96"},
    {'€', "E2 82 AC"},
    {'𝄞', "F0 9D 84 9E"},
}

func main() {
    for _, tc := range testCases {
        // derive some things from test data
        u := fmt.Sprintf("U+%04X", tc.rune)
        b, err := hex.DecodeString(strings.Replace(tc.string, " ", "", -1))
        if err != nil {
            log.Fatal("bad test data")
        }
        // exercise encoder and decoder on test data
        e := encodeUTF8(tc.rune)
        d := decodeUTF8(b)
        // show function return values
        fmt.Printf("%c  %-7s  %X\n", d, u, e)
        // validate return values against test data
        if !bytes.Equal(e, b) {
            log.Fatal("encodeUTF8 wrong")
        }
        if d != tc.rune {
            log.Fatal("decodeUTF8 wrong")
        }
    }
}

const (
    // first byte of a 2-byte encoding starts 110 and carries 5 bits of data
    b2Lead = 0xC0 // 1100 0000
    b2Mask = 0x1F // 0001 1111

    // first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
    b3Lead = 0xE0 // 1110 0000
    b3Mask = 0x0F // 0000 1111

    // first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
    b4Lead = 0xF0 // 1111 0000
    b4Mask = 0x07 // 0000 0111

    // non-first bytes start 10 and carry 6 bits of data
    mbLead = 0x80 // 1000 0000
    mbMask = 0x3F // 0011 1111
)

func encodeUTF8(r rune) []byte {
    switch i := uint32(r); {
    case i <= 1<<7-1: // max code point that encodes into a single byte
        return []byte{byte(r)}
    case i <= 1<<11-1: // into two bytes
        return []byte{
            b2Lead | byte(r>>6),
            mbLead | byte(r)&mbMask}
    case i <= 1<<16-1: // three
        return []byte{
            b3Lead | byte(r>>12),
            mbLead | byte(r>>6)&mbMask,
            mbLead | byte(r)&mbMask}
    default:
        return []byte{
            b4Lead | byte(r>>18),
            mbLead | byte(r>>12)&mbMask,
            mbLead | byte(r>>6)&mbMask,
            mbLead | byte(r)&mbMask}
    }
}

func decodeUTF8(b []byte) rune {
    switch b0 := b[0]; {
    case b0 < 0x80:
        return rune(b0)
    case b0 < 0xE0:
        return rune(b0&b2Mask)<<6 |
            rune(b[1]&mbMask)
    case b0 < 0xF0:
        return rune(b0&b3Mask)<<12 |
            rune(b[1]&mbMask)<<6 |
            rune(b[2]&mbMask)
    default:
        return rune(b0&b4Mask)<<18 |
            rune(b[1]&mbMask)<<12 |
            rune(b[2]&mbMask)<<6 |
            rune(b[3]&mbMask)
    }
}
Output:
A  U+0041   41
ö  U+00F6   C3B6
Ж  U+0416   D096
€  U+20AC   E282AC
𝄞  U+1D11E  F09D849E

Library/language

package main

import (
    "fmt"
    "unicode/utf8"
)

func utf8encode(codepoint rune) []byte {
    buffer := make([]byte, 4)
    length := utf8.EncodeRune(buffer, codepoint)
    return buffer[:length]
}

func utf8decode(bytes []byte) rune {
    result, _ := utf8.DecodeRune(bytes)
    return result
}

func main() {
        fmt.Printf("%-7s %7s\t%s\t%s\n", "Char", "Unicode", "UTF-8 encoded", "Decoded");
    for _, codepoint := range []rune{'A', 'ö', 'Ж', '€', '𝄞'} {
        encoded := utf8encode(codepoint)
        decoded := utf8decode(encoded)
        fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
    }
}
Output:
Char    Unicode	UTF-8 encoded	Decoded
A       U+0041	41          	A
ö       U+00F6	C3B6        	ö
Ж       U+0416	D096        	Ж
€       U+20AC	E282AC      	€
𝄞       U+1D11E	F09D849E    	𝄞

Alternately:

package main

import (
    "fmt"
)

func utf8encode(codepoint rune) []byte {
    return []byte(string([]rune{codepoint}))
}

func utf8decode(bytes []byte) rune {
    return []rune(string(bytes))[0]
}

func main() {
        fmt.Printf("%-7s %7s\t%s\t%s\n", "Char", "Unicode", "UTF-8 encoded", "Decoded");
    for _, codepoint := range []rune{'A', 'ö', 'Ж', '€', '𝄞'} {
        encoded := utf8encode(codepoint)
        decoded := utf8decode(encoded)
        fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
    }
}
Output:
Char    Unicode	UTF-8 encoded	Decoded
A       U+0041	41          	A
ö       U+00F6	C3B6        	ö
Ж       U+0416	D096        	Ж
€       U+20AC	E282AC      	€
𝄞       U+1D11E	F09D849E    	𝄞

Groovy

Translation of: Java
import java.nio.charset.StandardCharsets

class UTF8EncodeDecode {
    static byte[] utf8encode(int codePoint) {
        char[] characters = [codePoint]
        new String(characters, 0, 1).getBytes StandardCharsets.UTF_8
    }

    static int utf8decode(byte[] bytes) {
        new String(bytes, StandardCharsets.UTF_8).codePointAt(0)
    }

    static void main(String[] args) {
        printf "%-7s %-43s %7s\t%s\t%7s%n", "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded"

        ([0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E]).each { int codePoint ->
            byte[] encoded = utf8encode codePoint
            Formatter formatter = new Formatter()
            encoded.each { byte b ->
                formatter.format "%02X ", b
            }
            String encodedHex = formatter.toString()
            int decoded = utf8decode encoded
            printf "%-7c %-43s U+%04X\t%-12s\tU+%04X%n", codePoint, Character.getName(codePoint), codePoint, encodedHex, decoded
        }
    }
}
Output:
Char    Name                                        Unicode	UTF-8 encoded	Decoded
A       LATIN CAPITAL LETTER A                      U+0041	41          	U+0041
ö       LATIN SMALL LETTER O WITH DIAERESIS         U+00F6	C3 B6       	U+00F6
Ж       CYRILLIC CAPITAL LETTER ZHE                 U+0416	D0 96       	U+0416
€       EURO SIGN                                   U+20AC	E2 82 AC    	U+20AC
𝄞      MUSICAL SYMBOL G CLEF                       U+1D11E	ED 84 9E    	U+D11E

Haskell

Example makes use of bytestring and text packages:

module Main (main) where
 
import qualified Data.ByteString as ByteString (pack, unpack)
import           Data.Char (chr, ord)
import           Data.Foldable (for_)
import           Data.List (intercalate)
import qualified Data.Text as Text (head, singleton)
import qualified Data.Text.Encoding as Text (decodeUtf8, encodeUtf8)
import           Text.Printf (printf)
 
encodeCodepoint :: Int -> [Int]
encodeCodepoint = map fromIntegral . ByteString.unpack . Text.encodeUtf8 . Text.singleton . chr

decodeToCodepoint :: [Int] -> Int
decodeToCodepoint = ord . Text.head . Text.decodeUtf8 . ByteString.pack . map fromIntegral

main :: IO ()
main = do
    putStrLn "Character  Unicode  UTF-8 encoding (hex)  Decoded"
    putStrLn "-------------------------------------------------"
    for_ [0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E] $ \codepoint -> do
        let values = encodeCodepoint codepoint
            codepoint' = decodeToCodepoint values
        putStrLn $ printf "%c          %-7s  %-20s  %c"
            codepoint
            (printf "U+%04X" codepoint :: String)
            (intercalate " " (map (printf "%02X") values))
            codepoint'
Output:
Character  Unicode  UTF-8 encoding (hex)  Decoded
-------------------------------------------------
A          U+0041   41                    A
ö          U+00F6   C3 B6                 ö
Ж          U+0416   D0 96                 Ж
€          U+20AC   E2 82 AC              €
𝄞          U+1D11E  F0 9D 84 9E           𝄞

J

Solution:

utf8=: 8&u:               NB. converts to UTF-8 from unicode or unicode codepoint integer
ucp=: 9&u:                NB. converts to unicode from UTF-8 or unicode codepoint integer
ucp_hex=: hfd@(3 u: ucp)  NB. converts to unicode codepoint hexadecimal from UTF-8, unicode or unicode codepoint integer

Examples:

   utf8 65 246 1046 8364 119070
AöЖ€𝄞
   ucp 65 246 1046 8364 119070
AöЖ€𝄞
   ucp 'AöЖ€𝄞'
AöЖ€𝄞
   utf8 ucp 65 246 1046 8364 119070
AöЖ€𝄞
   ucp_hex utf8 65 246 1046 8364 119070
00041
000f6
00416
020ac
1d11e
   utf8@dfh ucp_hex utf8 65 246 1046 8364 119070
AöЖ€𝄞

Java

Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");

        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
Output:
Char    Name                                        Unicode	UTF-8 encoded	Decoded
A       LATIN CAPITAL LETTER A                      U+0041	41          	A
ö       LATIN SMALL LETTER O WITH DIAERESIS         U+00F6	C3 B6       	ö
Ж       CYRILLIC CAPITAL LETTER ZHE                 U+0416	D0 96       	Ж
€       EURO SIGN                                   U+20AC	E2 82 AC    	€
𝄞      MUSICAL SYMBOL G CLEF                       U+1D11E	F0 9D 84 9E 	𝄞

JavaScript

An implementation in ECMAScript 2015 (ES6):

/***************************************************************************\
|*  Pure UTF-8 handling without detailed error reporting functionality.    *|
|***************************************************************************|
|*  utf8encode                                                             *|
|*    < String character or UInt32 code point                              *|
|*    > Uint8Array encoded_character                                       *|
|*    | ErrorString                                                        *|
|*                                                                         *|
|*  utf8encode takes a string or uint32 representing a single code point   *|
|*    as its argument and returns an array of length 1 up to 4 containing  *|
|*    utf8 code units representing that character.                         *|
|***************************************************************************|
|*  utf8decode                                                             *|
|*    < Unit8Array [highendbyte highmidendbyte lowmidendbyte lowendbyte]   *|
|*    > uint32 character                                                   *|
|*    | ErrorString                                                        *|
|*                                                                         *|
|*  utf8decode takes an array of one to four uint8 representing utf8 code  *|
|*    units and returns a uint32 representing that code point.             *|
\***************************************************************************/

const
  utf8encode=
    n=>
      (m=>
        m<0x80
       ?Uint8Array.from(
          [ m>>0&0x7f|0x00])
       :m<0x800
       ?Uint8Array.from(
          [ m>>6&0x1f|0xc0,m>>0&0x3f|0x80])
       :m<0x10000
       ?Uint8Array.from(
          [ m>>12&0x0f|0xe0,m>>6&0x3f|0x80,m>>0&0x3f|0x80])
       :m<0x110000
       ?Uint8Array.from(
          [ m>>18&0x07|0xf0,m>>12&0x3f|0x80,m>>6&0x3f|0x80,m>>0&0x3f|0x80])
       :(()=>{throw'Invalid Unicode Code Point!'})())
      ( typeof n==='string'
       ?n.codePointAt(0)
       :n&0x1fffff),
  utf8decode=
    ([m,n,o,p])=>
      m<0x80
     ?( m&0x7f)<<0
     :0xc1<m&&m<0xe0&&n===(n&0xbf)
     ?( m&0x1f)<<6|( n&0x3f)<<0
     :( m===0xe0&&0x9f<n&&n<0xc0
      ||0xe0<m&&m<0xed&&0x7f<n&&n<0xc0
      ||m===0xed&&0x7f<n&&n<0xa0
      ||0xed<m&&m<0xf0&&0x7f<n&&n<0xc0)
    &&o===o&0xbf
     ?( m&0x0f)<<12|( n&0x3f)<<6|( o&0x3f)<<0
     :( m===0xf0&&0x8f<n&&n<0xc0
      ||m===0xf4&&0x7f<n&&n<0x90
      ||0xf0<m&&m<0xf4&&0x7f<n&&n<0xc0)
    &&o===o&0xbf&&p===p&0xbf
     ?( m&0x07)<<18|( n&0x3f)<<12|( o&0x3f)<<6|( p&0x3f)<<0
     :(()=>{throw'Invalid UTF-8 encoding!'})()

The testing inputs:

const
  str=
    'AöЖ€𝄞'
 ,cps=
    Uint32Array.from(str,s=>s.codePointAt(0))
 ,cus=
    [ [ 0x41]
     ,[ 0xc3,0xb6]
     ,[ 0xd0,0x96]
     ,[ 0xe2,0x82,0xac]
     ,[ 0xf0,0x9d,0x84,0x9e]]
   .map(a=>Uint8Array.from(a))
 ,zip3=
    ([a,...as],[b,...bs],[c,...cs])=>
      0<as.length+bs.length+cs.length
     ?[ [ a,b,c],...zip3(as,bs,cs)]
     :[ [ a,b,c]]
 ,inputs=zip3(str,cps,cus);

The testing code:

console.log(`\
${'Character'.padEnd(16)}\
${'CodePoint'.padEnd(16)}\
${'CodeUnits'.padEnd(16)}\
${'uft8encode(ch)'.padEnd(16)}\
${'uft8encode(cp)'.padEnd(16)}\
utf8decode(cu)`)
for(let [ch,cp,cu] of inputs)
  console.log(`\
${ch.padEnd(16)}\
${cp.toString(0x10).padStart(8,'U+000000').padEnd(16)}\
${`[${[...cu].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${`[${[...utf8encode(ch)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`)

and finally, the output from the test:

Character       CodePoint       CodeUnits       uft8encode(ch)  uft8encode(cp)  utf8decode(cu)
A               U+000041        [41]            [41]            [41]            U+000041
ö               U+0000f6        [c3,b6]         [c3,b6]         [c3,b6]         U+0000f6
Ж               U+000416        [d0,96]         [d0,96]         [d0,96]         U+000416
€               U+0020ac        [e2,82,ac]      [e2,82,ac]      [e2,82,ac]      U+0020ac
𝄞              U+01d11e        [f0,9d,84,9e]   [f0,9d,84,9e]   [f0,9d,84,9e]   U+01d11e

Note that the misalign there on the last line is caused by the string length of astral characters being 2 so the padding functions break.

jq

Works with: jq

Works with gojq, the Go implementation of jq

Preliminaries

# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then 0
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
    | reverse
    | .[1:] # remove the leading 0
  end ;

# Input: an array of binary digits, msb first.
def binary_to_decimal:
  reduce reverse[] as $b ({power:1, result:0};
       .result += .power * $b
       | .power *= 2)
  | .result;

Encode to UTF-8

# input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_encode:
   def lpad($width): [(range(0;8)|0), .[]][- $width:];
   def multibyte:  [1,0,       (.[-6: ]|lpad(6))[]];
   def firstOf2:   [1,1,0,     (.[: -6]|lpad(5))[]];
   def firstOf3:   [1,1,1,0,   (.[:-12]|lpad(4))[]];
   def firstOf4:   [1,1,1,1,0, (.[:-18]|lpad(3))[]];
   . as $n
   | binary_digits
   | length as $len
   | if $len <8 then [$n]
     else if   $len <= 12 then [ firstOf2, multibyte ]
          elif $len <= 16 then [ firstOf3, (.[:-6] | multibyte), multibyte ]
          else [firstOf4,
                (.[   :-12] | multibyte),
                (.[-12: -6] | multibyte),
                multibyte]
          end
	  | map(binary_to_decimal)
     end;

Decode an array of UTF-8 bytes

# input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
   # Magic numbers:
   # x80: 128,       # 10000000
   # xe0: 224,       # 11100000
   # xf0: 240        # 11110000
     (-6) as $mb     # non-first bytes start 10 and carry 6 bits of data
                     # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                     # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                     # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
   | map(binary_digits) as $d
   | .[0]
   | if   . < 128 then $d[0]
     elif . < 224 then $d[0][-5:] + $d[1][$mb:]
     elif . < 240 then $d[0][-4:] + $d[1][$mb:] + $d[2][$mb:]
     else              $d[0][-3:] + $d[1][$mb:] + $d[2][$mb:] + $d[3][$mb:]
     end
   | binary_to_decimal ;

Task

def task:
  [ "A", "ö", "Ж", "€", "𝄞" ][]
  | . as $glyph
  | explode[]
  | utf8_encode as $encoded
  | ($encoded|utf8_decode) as $decoded
  | "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]|implode)" ;

task
Output:
Glyph A => [65] => 65 => A
Glyph ö => [195,182] => 246 => ö
Glyph Ж => [208,150] => 1046 => Ж
Glyph € => [226,130,172] => 8364 => €
Glyph 𝄞 => [240,157,132,158] => 119070 => 𝄞

Julia

Julia supports the UTF-8 encoding (and others through packages).

for t in ("A", "ö", "Ж", "€", "𝄞")
    println(t, " → ", codeunits(t))
end
Output:
A → UInt8[0x41]
ö → UInt8[0xc3, 0xb6]
Ж → UInt8[0xd0, 0x96]
€ → UInt8[0xe2, 0x82, 0xac]
𝄞 → UInt8[0xf0, 0x9d, 0x84, 0x9e]

Kotlin

// version 1.1.2

fun utf8Encode(codePoint: Int) = String(intArrayOf(codePoint), 0, 1).toByteArray(Charsets.UTF_8)

fun utf8Decode(bytes: ByteArray) = String(bytes, Charsets.UTF_8).codePointAt(0)

fun main(args: Array<String>) {
    val codePoints = intArrayOf(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
    println("Char  Name                                 Unicode  UTF-8         Decoded")
    for (codePoint in codePoints) {
        var n = if(codePoint <= 0xFFFF) 4 else 5 
        System.out.printf("%-${n}c  %-35s  U+%05X  ", codePoint, Character.getName(codePoint), codePoint)  
        val bytes = utf8Encode(codePoint)
        var s = ""
        for (byte in bytes) s += "%02X ".format(byte)
        val decoded = utf8Decode(bytes)
        n = if(decoded.toInt() <= 0xFFFF) 12 else 11 
        System.out.printf("%-${n}s  %c\n", s, decoded)  
    } 
}
Output:
Char  Name                                 Unicode  UTF-8         Decoded
A     LATIN CAPITAL LETTER A               U+00041  41            A
ö     LATIN SMALL LETTER O WITH DIAERESIS  U+000F6  C3 B6         ö
Ж     CYRILLIC CAPITAL LETTER ZHE          U+00416  D0 96         Ж
€     EURO SIGN                            U+020AC  E2 82 AC      €
𝄞     MUSICAL SYMBOL G CLEF                U+1D11E  F0 9D 84 9E   𝄞

langur

writeln "character   Unicode    UTF-8 encoding (hex)"

for cp in "AöЖ€𝄞" {
    val utf8 =  cp -> cp2s -> s2b
    val cpstr = utf8 -> b2s
    val utf8rep = join(map(utf8, by=fn b:"{{b:X02}}"), by=" ")
    writeln "{{cpstr:-11}} U+{{cp:X04:-8}} {{utf8rep}}"
}
Output:
character   Unicode    UTF-8 encoding (hex)
A           U+0041     41
ö           U+00F6     C3 B6
Ж           U+0416     D0 96
€           U+20AC     E2 82 AC
𝄞           U+1D11E    F0 9D 84 9E

Lingo

Since UTF-8 is Lingo's native string encoding, and UTF-8 strings can be read into byteArrays (and v.v.), such UTF-8 encoding and decoding is built-in.
Relevant Lingo functions are:
- charToNum (string): converts single-character string to unicode code point (int)
- numToChar (int): converts unicode code point (int) to single-character string
- byteArray (string): creates byte array of UTF-8 bytes for string
- byteArray.toHexString (intStart, intLen): returns hex string representation of byte array (e.g. for printing)
- byteArray.readRawString (intLen, [strCharSet="UTF-8"]): reads a fixed number of bytes as a string

chars = ["A", "ö", "Ж", "€", "𝄞"]
put "Character   Unicode (int)   UTF-8 (hex)   Decoded"
repeat with c in chars
    ba = bytearray(c)
    put col(c, 12) & col(charToNum(c), 16) & col(ba.toHexString(1, ba.length), 14) & ba.readRawString(ba.length)
end repeat

Helper function for table formatting

on col (val, len)
    str = string(val)
    repeat with i = str.length+1 to len
        put " " after str
    end repeat
    return str
end
Output:
Character   Unicode (int)   UTF-8 (hex)   Decoded
A           65              41            A
ö           246             c3 b6         ö
Ж           1046            d0 96         Ж
€           8364            e2 82 ac      €
𝄞           119070          f0 9d 84 9e   𝄞

Lua

Works with: Lua version 5.3
-- Accept an integer representing a codepoint.
-- Return the values of the individual octets.
function encode (codepoint)
  local codepoint_str = utf8.char(codepoint)
  local result = {}

  for i = 1, #codepoint_str do
    result[#result + 1] = string.unpack("B", codepoint_str, i)
  end

  return table.unpack(result)
end

-- Accept a variable number of octets.
-- Return the corresponding Unicode character.
function decode (...)
  local len = select("#", ...) -- the number of octets
  local fmt = string.rep("B", len)

  return string.pack(fmt, ...)
end

-- Run the given test cases.
function test_encode_decode ()
  -- "A", "ö", "Ж", "€", "𝄞"
  local tests = {tonumber("41", 16),  tonumber("f6", 16), tonumber("416", 16),
                  tonumber("20ac", 16), tonumber("1d11e", 16)}

  for i, test in ipairs(tests) do
    print("Char: ", test)
    print("Encoding: ", encode(test))
    print("Decoding: ", decode(encode(test)))
  end
end
Output:
Char: 	65
Encoding: 	65
Decoding: 	A
Char: 	246
Encoding: 	195	182
Decoding: 	ö
Char: 	1046
Encoding: 	208	150
Decoding: 	Ж
Char: 	8364
Encoding: 	226	130	172
Decoding: 	€
Char: 	119070
Encoding: 	240	157	132	158
Decoding: 	𝄞

M2000 Interpreter

Module EncodeDecodeUTF8 {
      a$=string$("Hello" as UTF8enc)
      Print Len(A$)=2.5   ' 2.5 words=5 bytes
      b$=string$(a$ as UTF8dec)
      Print b$
      Print Len(b$)=5 ' 5 words = 10 bytes   
      
      Print Len(string$("A" as UTF8enc))=.5  ' 1 byte
      Print Len(string$("ö" as UTF8enc))=1   ' 2 bytes
      Print Len(string$("Ж" as UTF8enc))=1   ' 2 bytes
      Print Len(string$("€" as UTF8enc))=1.5   ' 3 bytes
      Print Len(string$("𝄞" as UTF8enc))=2      '4 bytes
      a$=string$("𝄞" as UTF8enc)
      Buffer Bytes as Byte*4
      Return Bytes, 0:=a$
      \\ F0 9D 84 9E
      Hex Eval(bytes, 0), Eval(bytes, 1), Eval(bytes, 2), Eval(bytes, 3) 
}
EncodeDecodeUTF8
Output:
      True
Hello
      True
      True
      True
      True
      True
      True
    0x00F0    0x009D    0x0084    0x009E

Mathematica /Wolfram Language

utf = ToCharacterCode[ToString["AöЖ€", CharacterEncoding -> "UTF8"]]
ToCharacterCode[FromCharacterCode[utf, "UTF8"]]
Output:
{65, 195, 182, 208, 150, 226, 130, 172}
{65, 246, 1046, 8364}

Nim

Using the standard library

Nim strings are encoded in UTF-8. The natural way to deal with UTF-8 and Unicode code points consists to use the module “unicode” which provides procedures to convert from strings to sequences of code points (names “runes”) and conversely. For this purpose, using sequences or bytes is not natural. Here is a way to proceed using the module “unicode”.

import unicode, sequtils, strformat, strutils

const UChars = ["\u0041", "\u00F6", "\u0416", "\u20AC", "\u{1D11E}"]

proc toSeqByte(r: Rune): seq[byte] =
  let s = r.toUTF8
  result = @(s.toOpenArrayByte(0, s.high))

proc toRune(s: seq[byte]): Rune =
  s.mapIt(chr(it)).join().toRunes[0]

echo "Character  Unicode  UTF-8 encoding (hex)"
for uchar in UChars:
  # Convert the UTF-8 string to a rune (codepoint).
  var r = uchar.toRunes[0]
  # Convert the rune to a sequence of bytes.
  let s = r.toSeqByte
  # Convert back the sequence of bytes to a rune.
  r = s.toRune
  # Display.
  echo &"""{uchar:>5}      U+{r.int.toHex(5)}  {s.map(toHex).join(" ")}"""
Output:
Character  Unicode  UTF-8 encoding (hex)
    A      U+00041  41
    ö      U+000F6  C3 B6
    Ж      U+00416  D0 96
    €      U+020AC  E2 82 AC
    𝄞      U+1D11E  F0 9D 84 9E

Implementation

In this section, we provide two procedures to convert a Unicode code point to a UTF-8 sequence of bytes and conversely, without using the module “unicode”. We provide also a procedure to convert a sequence of bytes to a string in order to print it. The algorithm is the one used by the Go solution.

import sequtils, strformat, strutils

const

  # First byte of a 2-byte encoding starts 110 and carries 5 bits of data.
  B2Lead = 0xC0 # 1100 0000
  B2Mask = 0x1F # 0001 1111

  # First byte of a 3-byte encoding starts 1110 and carries 4 bits of data.
  B3Lead = 0xE0 # 1110 0000
  B3Mask = 0x0F # 0000 1111

  # First byte of a 4-byte encoding starts 11110 and carries 3 bits of data.
  B4Lead = 0xF0 # 1111 0000
  B4Mask = 0x07 # 0000 0111

  # Non-first bytes start 10 and carry 6 bits of data.
  MbLead = 0x80 # 1000 0000
  MbMask = 0x3F # 0011 1111


type CodePoint = distinct int32


proc toUtf8(c: CodePoint): seq[byte] =
  let i = int32(c)
  result = if i <= 1 shl 7 - 1:
             @[byte(i)]
           elif i <= 1 shl 11 - 1:
             @[B2Lead or byte(i shr 6),
               MbLead or byte(i) and MbMask]
           elif i <= 1 shl  16 - 1:
             @[B3Lead or byte(i shr 12),
               MbLead or byte(i shr 6) and MbMask,
               MbLead or byte(i) and MbMask]
           else:
             @[B4Lead or byte(i shr 18),
               MbLead or byte(i shr 12) and MbMask,
               MbLead or byte(i shr 6) and MbMask,
               MbLead or byte(i) and MbMask]


proc toCodePoint(b: seq[byte]): CodePoint =
  let b0 = b[0].int32
  result = CodePoint(
    if b0 < 0x80: b0
    elif b0 < 0xE0: (b0 and B2Mask) shl 6 or b[1].int32 and MbMask
    elif b0 < 0xF0: (b0 and B3Mask) shl 12 or
                    (b[1].int32 and MbMask) shl 6 or b[2].int32 and MbMask
    else: (b0 and B4Mask) shl 18 or (b[1].int32 and MbMask) shl 12 or
          (b[2].int32 and MbMask) shl 6 or b[3].int32 and MbMask)


proc toString(s: seq[byte]): string =
  s.mapIt(chr(it)).join()


const UChars = [CodePoint(0x00041),
                CodePoint(0x000F6),
                CodePoint(0x00416),
                CodePoint(0x020AC),
                CodePoint(0x1D11E)]

echo "Character  Unicode  UTF-8 encoding (hex)"

for uchar in UChars:
  # Convert the code point to a sequence of bytes.
  let s = uchar.toUtf8
  # Convert back the sequence of bytes to a code point.
  let c = s.toCodePoint
  # Display.
  echo &"""{s.toString:>5}      U+{c.int.toHex(5)}  {s.map(toHex).join(" ")}"""
Output:

Same output as in the previous solution.

Perl

#!/usr/bin/perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';         # getting the unicode name of the character
use utf8;                            # using non-ascii-characters in source code
binmode STDOUT, ":encoding(UTF-8)";  # printing non-ascii-characters to screen

my @chars = map {ord} qw/A ö Ж € 𝄞/; # @chars contains the unicode points
my $print_format = '%5s  %-35s';
printf "$print_format %8s  %s\n" , 'char', 'name', 'unicode', 'utf-8 encoding';
map{
	my $name = charinfo($_)->{'name'}; # get unicode name
	printf "$print_format  %06x  " , chr, lc $name, $_;
	my $utf8 = chr;                    # single char (using implicit $_)
	utf8::encode($utf8);               # inplace encoding into utf8 parts
	map{                               # for each utf8 char print ord
		printf " %x", ord;
	} split //, $utf8;
	print "\n";
} @chars;
Output:
 char  name                                 unicode  utf-8 encoding
    A  latin capital letter a               000041   41
    ö  latin small letter o with diaeresis  0000f6   c3 b6
    Ж  cyrillic capital letter zhe          000416   d0 96
    €  euro sign                            0020ac   e2 82 ac
    𝄞  musical symbol g clef                01d11e   f0 9d 84 9e

Phix

Library: Phix/basics

Standard autoinclude, see the manual and/or builtins/utfconv.e ( http://phix.x10.mx/docs/html/utfconv.htm and/or https://github.com/petelomax/Phix/blob/master/builtins/utfconv.e )
As requested in the task description:

constant tests = {#0041, #00F6, #0416, #20AC, #1D11E}
 
function hex(sequence s, string fmt)    -- output helper
    return join(apply(true,sprintf,{{fmt},s}),',')
end function
 
for i=1 to length(tests) do
    integer codepoint = tests[i]
    sequence s = utf32_to_utf8({codepoint}),
             r = utf8_to_utf32(s)
    printf(1,"#%04x -> {%s} -> {%s}\n",{codepoint, hex(s,"#%02x"),hex(r,"#%04x")})
end for
Output:
#0041 -> {#41} -> {#0041}
#00F6 -> {#C3,#B6} -> {#00F6}
#0416 -> {#D0,#96} -> {#0416}
#20AC -> {#E2,#82,#AC} -> {#20AC}
#1D11E -> {#F0,#9D,#84,#9E} -> {#1D11E}

Processing

import java.nio.charset.StandardCharsets;

Integer[] code_points = {0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E};

void setup() {
  size(850, 230);
  background(255);
  fill(0);
  textSize(16);
  int tel_1 = 80;
  int tel_2 = 50;
  text("Char     Name                                                            Unicode          UTF-8 (encoding)      Decoded", 40, 40);
  for (int cp : code_points) {  
    byte[] encoded = new String(new int[]{cp}, 0, 1).getBytes(StandardCharsets.UTF_8);
    for (byte b : encoded) {                                                    
      text(hex(b), tel_2+530, tel_1);
      tel_2 += 30;
    }
    text(char(cp), 50, tel_1);
    text(Character.getName(cp), 100, tel_1);
    String unicode = hex(cp);
    while (unicode.length() > 4 && unicode.indexOf("0") == 0) unicode = unicode.substring(1);
    text("U+"+unicode, 450, tel_1);
    Character decoded = char(new String(encoded, StandardCharsets.UTF_8).codePointAt(0));
    text(decoded, 750, tel_1);
    tel_1 += 30;  tel_2 = 50;
  }
}


PureBasic

The encoding and decoding procedure are kept simple and designed to work with an array of 5 elements for input/output of the UTF-8 encoding for a single code point at a time. It was decided not to use a more elaborate example that would have been able to operate on a buffer to encode/decode more than one code point at a time.

#UTF8_codePointMaxByteCount = 4 ;UTF-8 encoding uses only a maximum of 4 bytes to encode a codepoint

Procedure UTF8_encode(x, Array encoded_codepoint.a(1)) ;x is codepoint to encode, the array will contain output
  ;Array encoded_codepoint() is used for output.
  ;After encode element zero holds the count of significant bytes in elements 1 to 4
  If ArraySize(encoded_codepoint()) < #UTF8_codePointMaxByteCount
    ReDim encoded_codepoint.a(#UTF8_codePointMaxByteCount)
  EndIf
  
  Select x
    Case 0 To $7F
      encoded_codepoint(0) = 1
      encoded_codepoint(1) = x ;all 7 bits
    Case $80 To $7FF
      encoded_codepoint(0) = 2
      encoded_codepoint(2) = (x & %00111111) | %10000000         ;lowest 6 bits
      encoded_codepoint(1) = (x >> 6) | %11000000                ;highest bits 7 -> 11
    Case $800 To $FFFF
      encoded_codepoint(0) = 3
      encoded_codepoint(3) = (x & %00111111) | %10000000         ;lowest 6 bits
      encoded_codepoint(2) = ((x >> 6) & %00111111) | %10000000  ;bits 7 -> 12
      encoded_codepoint(1) = (x >> 12) | %11100000               ;highest bits 13 -> 16
      
    Case $10000 To $10FFFF
      encoded_codepoint(0) = 4
      encoded_codepoint(4) = (x & %00111111) | %10000000         ;lowest 6 bits
      encoded_codepoint(3) = ((x >> 6) & %00111111) | %10000000  ;bits 7 -> 12
      encoded_codepoint(2) = ((x >> 12) & %00111111) | %10000000 ;bits 13 -> 18
      encoded_codepoint(1) = (x >> 18) | %11110000               ;highest bits 19 -> 21
    Default                                      
      encoded_codepoint(0) = 0  ;error, codepoint is not valid and can't be encoded
  EndSelect
EndProcedure

Procedure UTF8_decode(Array encoded_codepoint.a(1))
  ;Array encoded_codepoint() holds the UTF-8 encoding in elements 1 to 4, element zero isn't used for decoding.
  Protected x = -1 ;initialzie with error value for possible improper encoding
  
  If ArraySize(encoded_codepoint()) < #UTF8_codePointMaxByteCount
    ProcedureReturn x ;Input array was not dimensioned properly.
  EndIf
  
  ;Determine the number of bytes in the UTF8 encoding by looking at first byte
  ;and then proceeding accordingly.
  Select encoded_codepoint(1)
    Case %00000000 To %01111111 ;1 byte encoding
      x = encoded_codepoint(1)
    Case %11000000 To %11011111 ;2 byte encoding
      x = (encoded_codepoint(1) & %00011111) << 6 ;last 5 bits only
      x | (encoded_codepoint(2) & %00111111)
    Case %11100000 To %11101111 ;3 byte encoding
      x = (encoded_codepoint(1) & %00001111) << 6 ;last 4 bits only
      x << 6 + (encoded_codepoint(2) & %00111111) 
      x << 6 + (encoded_codepoint(3) & %00111111) 
    Case %11110000 To %11110111 ;4 byte encoding
      x = (encoded_codepoint(1) & %00000111) << 6 ;last 3 bits only
      x << 6 + (encoded_codepoint(2) & %00111111) 
      x << 6 + (encoded_codepoint(3) & %00111111) 
      x << 6 + (encoded_codepoint(4) & %00111111) 
  EndSelect
  
  ProcedureReturn x
EndProcedure

;helper procedure to format output for this example
Procedure.s formatOutput(c$, c, Array encoded_utf.a(1), dcp) ;character, codepooint, UTf8 encoding, decoded codepoint
  Protected o$, i, encoding$
  
  o$ = "   " + LSet(c$, 8) + LSet("U+" + RSet(Hex(c), 5, "0"), 10)
  For i = 1 To encoded_utf(0)
    encoding$ + RSet(Hex(encoded_utf(i)), 2, "0") + " "
  Next
  o$ + "  " + LSet(encoding$, 11, " ") + "   " + RSet(Hex(dcp), 5, "0")
  
  ProcedureReturn o$  
EndProcedure

DataSection
  ;unicode code points in hex
  unicode_codepoints:
  Data.i 5, $41, $F6, $416, $20AC, $1D11E
  ;The names for these codepoints are: latin capital letter a; latin small letter o With diaeresis
  ;cyrillic capital letter zhe; euro sign; musical symbol g clef.
EndDataSection

;read initial unicode codepoint values
Restore unicode_codepoints
Read num_codepoints
num_codepoints - 1

Dim codepoint(num_codepoints)
For i = 0 To num_codepoints
  Read codepoint(i)
Next

;This array is used for input and output from the UTF8 encode and decode procedures.  After encoding its elements
;hold the byte count of the encoding followed by the respective bytes.  For decoding element zero is not used and
;elements 1 To 4 holds the bytes to be decoded.
Dim encoded_codepoint.a(#UTF8_codePointMaxByteCount)
If OpenConsole("", #PB_UTF8)
  PrintN(LSet("", 11) + LSet("Unicode", 12) + LSet("UTF-8",14) + LSet("Decoded",12))
  PrintN(LSet("Character", 11) + LSet("Code Point", 12) + LSet("Encoding",14) + LSet("Code Point",12))
  PrintN(LSet("---------", 11) + LSet("----------", 12) + LSet("-----------",14) + LSet("-----------",12))
  
  For i = 0 To num_codepoints
    UTF8_encode(codepoint(i), encoded_codepoint())
    dcp = UTF8_decode(encoded_codepoint()) ;Decoded UTF-8 encoding should match original codepoint that was encoded.
    PrintN(formatOutput(Chr(codepoint(i)), codepoint(i), encoded_codepoint(), dcp))
  Next
  Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
  CloseConsole()
EndIf

Sample output:

           Unicode     UTF-8         Decoded
Character  Code Point  Encoding      Code Point
---------  ----------  -----------   -----------
   A       U+00041     41            00041
   ö       U+000F6     C3 B6         000F6
   ?       U+00416     D0 96         00416
   ?       U+020AC     E2 82 AC      800AC
   ?       U+1D11E     F0 9D 84 9E   1D11E

Python

#!/usr/bin/env python3
from unicodedata import name


def unicode_code(ch):
    return 'U+{:04x}'.format(ord(ch))


def utf8hex(ch):
    return " ".join([hex(c)[2:] for c in ch.encode('utf8')]).upper()


if __name__ == "__main__":
    print('{:<11} {:<36} {:<15} {:<15}'.format('Character', 'Name', 'Unicode', 'UTF-8 encoding (hex)'))
    chars = ['A', 'ö', 'Ж', '€', '𝄞']
    for char in chars:
        print('{:<11} {:<36} {:<15} {:<15}'.format(char, name(char), unicode_code(char), utf8hex(char)))
Output:
Character   Name                                 Unicode         UTF-8 encoding (hex)
A           LATIN CAPITAL LETTER A               U+0041          41             
ö           LATIN SMALL LETTER O WITH DIAERESIS  U+00f6          C3 B6          
Ж           CYRILLIC CAPITAL LETTER ZHE          U+0416          D0 96          
€           EURO SIGN                            U+20ac          E2 82 AC       
𝄞           MUSICAL SYMBOL G CLEF                U+1d11e         F0 9D 84 9E

Racket

#lang racket

(define char-map
  '((LATIN-CAPITAL-LETTER-A              .  #\U0041)
    (LATIN-SMALL-LETTER-O-WITH-DIAERESIS .  #\U00F6)
    (CYRILLIC-CAPITAL-LETTER-ZHE         .  #\U0416)
    (EURO-SIGN                           .  #\U20AC)
    (MUSICAL-SYMBOL-G-CLEF               .  #\U1D11E)))

(for ((name.char (in-list char-map)))
  (define name (car name.char))
  (define chr (cdr name.char))
  (let ((bites (bytes->list (string->bytes/utf-8 (list->string (list chr))))))
    (printf "~s\t~a\t~a\t~a\t~a~%" chr chr
            (map (curryr number->string 16) bites)
            (bytes->string/utf-8 (list->bytes bites))
            name)))
Output:
#\A	A	(41)	A	LATIN-CAPITAL-LETTER-A
#\ö	ö	(c3 b6)	ö	LATIN-SMALL-LETTER-O-WITH-DIAERESIS
#\Ж	Ж	(d0 96)	Ж	CYRILLIC-CAPITAL-LETTER-ZHE
#\€	€	(e2 82 ac)	€	EURO-SIGN
#\𝄞	𝄞	(f0 9d 84 9e)	𝄞	MUSICAL-SYMBOL-G-CLEF

Raku

(formerly Perl 6)

Works with: Rakudo version 2017.02

Pretty much all built in to the language.

say sprintf("%-18s %-36s|%8s| %7s |%14s | %s\n", 'Character|', 'Name', 'Ordinal', 'Unicode', 'UTF-8 encoded', 'decoded'), '-' x 100;

for < A ö Ж € 𝄞 😜 👨‍👩‍👧‍👦> -> $char {
    printf "   %-5s | %-43s | %6s | %-7s | %12s  |%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '),
      ('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode;
}
Output:
Character|         Name                                | Ordinal| Unicode | UTF-8 encoded | decoded
----------------------------------------------------------------------------------------------------
   A     | LATIN CAPITAL LETTER A                      |     65 | U+41    |           41  |   A
   ö     | LATIN SMALL LETTER O WITH DIAERESIS         |    246 | U+F6    |        C3 B6  |   ö
   Ж    | CYRILLIC CAPITAL LETTER ZHE                 |   1046 | U+416   |        D0 96  |   Ж
   €     | EURO SIGN                                   |   8364 | U+20AC  |     E2 82 AC  |   €
   𝄞     | MUSICAL SYMBOL G CLEF                       | 119070 | U+1D11E |  F0 9D 84 9E  |   𝄞
   😜    | FACE WITH STUCK-OUT TONGUE AND WINKING EYE  | 128540 | U+1F61C |  F0 9F 98 9C  |   😜
   👨‍👩‍👧‍👦    | MAN,ZERO WIDTH JOINER,WOMAN,ZERO WIDTH JOINER,GIRL,ZERO WIDTH JOINER,BOY | 128104 8205 128105 8205 128103 8205 128102 | U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 | F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D F0 9F 91 A7 E2 80 8D F0 9F 91 A6  |   👨‍👩‍👧‍👦

Ruby

character_arr = ["A","ö","Ж","€","𝄞"]
for c in character_arr do
    puts "Character: " + c.encode("utf-8")
    puts "Code-Point: #{c.encode("utf-8").ord.to_s(16).upcase}"
    puts "Code-Units: " + c.each_byte.map { |n| '%02X ' % (n & 0xFF) }.join
    puts ""
end
Output:
Character: A
Code-Point: 41
Code-Units: 41 

Character: ö
Code-Point: F6
Code-Units: C3 B6 

Character: Ж
Code-Point: 416
Code-Units: D0 96 

Character: €
Code-Point: 20AC
Code-Units: E2 82 AC 

Character: 𝄞
Code-Point: 1D11E
Code-Units: F0 9D 84 9E 

Rust

fn main() {
    let chars = vec!('A', 'ö', 'Ж', '€', '𝄞');
    chars.iter().for_each(|c| {
        let mut encoded = vec![0; c.len_utf8()];
        c.encode_utf8(&mut encoded);
        let decoded = String::from_utf8(encoded.to_vec()).unwrap();
        let encoded_string = encoded.iter().fold(String::new(), |acc, val| format!("{}{:X}", acc, val));
        println!("Character: {}, Unicode:{}, UTF-8 encoded:{}, Decoded: {}", c, c.escape_unicode(), encoded_string , decoded);
    });
}
Output:
Character: A, Unicode:\u{41}, UTF-8 encoded:41, Decoded: A
Character: ö, Unicode:\u{f6}, UTF-8 encoded:C3B6, Decoded: ö
Character: Ж, Unicode:\u{416}, UTF-8 encoded:D096, Decoded: Ж
Character: €, Unicode:\u{20ac}, UTF-8 encoded:E282AC, Decoded: €
Character: 𝄞, Unicode:\u{1d11e}, UTF-8 encoded:F09D849E, Decoded: 𝄞

Scala

Imperative solution

object UTF8EncodeAndDecode extends App {

  val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)

  def utf8Encode(codepoint: Int): Array[Byte] =
    new String(Array[Int](codepoint), 0, 1).getBytes(StandardCharsets.UTF_8)

  def utf8Decode(bytes: Array[Byte]): Int =
    new String(bytes, StandardCharsets.UTF_8).codePointAt(0)

  println("Char Name                                 Unicode  UTF-8       Decoded")
  for (codePoint <- codePoints) {
    val w = if (Character.isBmpCodePoint(codePoint)) 4 else 5 // Compute spacing
    val bytes = utf8Encode(codePoint)

    def leftAlignedHex = f"U+${codePoint}%04X"

    val s = new StringBuilder()
    bytes.foreach(byte => s ++= "%02X ".format(byte))

    printf(s"%-${w}c %-36s %-7s  %-${16 - w}s%c%n",
      codePoint, Character.getName(codePoint), leftAlignedHex, s, utf8Decode(bytes))
  }

Functional solution

import java.nio.charset.StandardCharsets

object UTF8EncodeAndDecode extends App {

  val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)

  def utf8Encode(codepoint: Int): Array[Byte] =
    new String(Array[Int](codepoint), 0, 1).getBytes(StandardCharsets.UTF_8)

  def utf8Decode(bytes: Array[Byte]): Int =
    new String(bytes, StandardCharsets.UTF_8).codePointAt(0)

  println("Char Name                                 Unicode  UTF-8       Decoded")
  codePoints.foreach{ codePoint =>
    val w = if (Character.isBmpCodePoint(codePoint)) 4 else 5 // Compute spacing
    val bytes = utf8Encode(codePoint)

    def leftAlignedHex: String = f"U+${codePoint}%04X"

    def utf: String = bytes.foldLeft("")(_ + "%02X ".format(_))

    printf(s"%-${w}c %-36s %-7s  %-${16 - w}s%c%n",
      codePoint, Character.getName(codePoint), leftAlignedHex, utf, utf8Decode(bytes))  }

  println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
}

Composable and testable solution

package example

object UTF8EncodeAndDecode extends TheMeat with App {
  val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)

  println("Char Name                                 Unicode  UTF-8       Decoded")
  codePoints.foreach { codepoint => print(composeString(codepoint)) }

  println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
}

trait TheMeat {
  import java.nio.charset.StandardCharsets

  def composeString(codePoint: Int): String = {
    val w = if (Character.isBmpCodePoint(codePoint)) 4 else 5 // Compute spacing
    val bytes = utf8Encode(codePoint)

    def leftAlignedHex: String = f"U+${codePoint}%04X"

    def utf: String = bytes.foldLeft("")(_ + "%02X ".format(_))

    s"%-${w}c %-36s %-7s  %-${16 - w}s%c%n"
      .format(codePoint, Character.getName(codePoint), leftAlignedHex, utf, utf8Decode(bytes))
  }

  def utf8Encode(codepoint: Int): Array[Byte] =
    new String(Array[Int](codepoint), 0, 1).getBytes(StandardCharsets.UTF_8)

  def utf8Decode(bytes: Array[Byte]): Int =
    new String(bytes, StandardCharsets.UTF_8).codePointAt(0)

}

Seed7

$ include "seed7_05.s7i";
  include "unicode.s7i";
  include "console.s7i";
  include "bytedata.s7i";

const proc: main is func
  local
    var char: ch is ' ';
    var string: utf8 is "";
  begin
    OUT := STD_CONSOLE;
    writeln("Character  Unicode  UTF-8 encoding (hex)  Decoded");
    writeln("-------------------------------------------------");
    for ch range "AöЖ€𝄞" do
      utf8 := toUtf8(str(ch));
      writeln(ch rpad 11 <& "U+" <& ord(ch) radix 16 lpad0 4 rpad 7 <&
              hex(utf8) rpad 22 <& fromUtf8(utf8));
    end for;
  end func;
Output:
Character  Unicode  UTF-8 encoding (hex)  Decoded
-------------------------------------------------
A          U+0041   41                    A
ö          U+00f6   c3b6                  ö
Ж          U+0416   d096                  Ж
€          U+20ac   e282ac                €
𝄞          U+1d11e  f09d849e              𝄞

Sidef

func utf8_encoder(Number code) {
    code.chr.encode('UTF-8').bytes.map{.chr}
}

func utf8_decoder(Array bytes) {
    bytes.map{.ord}.decode('UTF-8')
}

for n in ([0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E]) {
    var encoded = utf8_encoder(n)
    var decoded = utf8_decoder(encoded)
    assert_eq(n, decoded.ord)
    say "#{decoded} -> #{encoded}"
}
Output:
A -> ["A"]
ö -> ["\xC3", "\xB6"]
Ж -> ["\xD0", "\x96"]
€ -> ["\xE2", "\x82", "\xAC"]
𝄞 -> ["\xF0", "\x9D", "\x84", "\x9E"]

Swift

In Swift there's a difference between UnicodeScalar, which is a single unicode code point, and Character which may consist out of multiple UnicodeScalars, usually because of combining characters.

import Foundation

func encode(_ scalar: UnicodeScalar) -> Data {
  return Data(String(scalar).utf8)
}

func decode(_ data: Data) -> UnicodeScalar? {
  guard let string = String(data: data, encoding: .utf8) else {
    assertionFailure("Failed to convert data to a valid String")
    return nil
  }
  assert(string.unicodeScalars.count == 1, "Data should contain one scalar!")
  return string.unicodeScalars.first
}

for scalar in "AöЖ€𝄞".unicodeScalars {
  let bytes = encode(scalar)
  let formattedBytes = bytes.map({ String($0, radix: 16)}).joined(separator: " ")
  let decoded = decode(bytes)!
  print("character: \(decoded), code point: U+\(String(scalar.value, radix: 16)), \tutf-8: \(formattedBytes)")
}
Output:
character: A, code point: U+41, 	utf-8: 41
character: ö, code point: U+f6, 	utf-8: c3 b6
character: Ж, code point: U+416, 	utf-8: d0 96
character: €, code point: U+20ac, 	utf-8: e2 82 ac
character: 𝄞, code point: U+1d11e, 	utf-8: f0 9d 84 9e

Tcl

Note: Tcl can handle Unicodes only up to U+FFFD, i.e. the Basic Multilingual Plane (BMP, 16 bits wide). Therefore, the fifth test fails as expected.

proc encoder int {
   set u [format %c $int]
   set bytes {}
   foreach byte [split [encoding convertto utf-8 $u] ""] {
      lappend bytes [format %02X [scan $byte %c]]
   }
   return $bytes
}
proc decoder bytes {
   set str {}
   foreach byte $bytes {
      append str [format %c [scan $byte %x]]
   }
   return [encoding convertfrom utf-8 $str]
}
foreach test {0x0041 0x00f6 0x0416 0x20ac 0x1d11e} {
   set res $test
   lappend res [encoder $test] -> [decoder [encoder $test]]
   puts $res
}
0x0041 41 -> A
0x00f6 {C3 B6} -> ö
0x0416 {D0 96} -> Ж
0x20ac {E2 82 AC} -> €
0x1d11e {EF BF BD} -> �

Alternative Implementation

While perhaps not as readable as the above, this version handles beyond-BMP codepoints by manually composing the utf-8 byte sequences and emitting raw bytes to the console. encoding convertto utf-8 command still does the heavy lifting where it can.

proc utf8 {codepoint} {
    scan $codepoint %llx cp
    if {$cp < 0x10000} {
        set str [subst \\u$codepoint]               ;# substitute per Tcl backslash rule
        set bytes [encoding convertto utf-8 $str]   ;# encode
    } else {                                        ;# codepoints beyond the BMP need manual approach
        set bits [format %021b $cp]                 ;# format as binary string
        set unibits    11110[string range $bits 0 2];# insert extra bits for utf-8 4-byte encoding
        append unibits 10[string range $bits 3 8]
        append unibits 10[string range $bits 9 14]
        append unibits 10[string range $bits 15 20]
        set bytes [binary format B* $unibits]       ;# turn into a sequence of bytes
    }
    return $bytes
}

proc hexchars {s} {
    binary scan $s H* hex
    regsub -all .. $hex {\0 }
}

# for the test, we assume the tty is in utf-8 mode and can handle beyond-BMP chars
# so set output mode to binary so we can write raw bytes!
chan configure stdout -encoding binary
foreach codepoint { 41 F6 416 20AC 1D11E } {
    set utf8 [utf8 $codepoint]
    puts "[format U+%04s $codepoint]\t$utf8\t[hexchars $utf8]"
}
Output:
U+0041  A       41

U+00F6 ö c3 b6 U+0416 Ж d0 96 U+20AC € e2 82 ac U+1D11E 𝄞 f0 9d 84 9e

UNIX Shell

Works with: Bourne Again SHell
Works with: Korn Shell
Works with: Zsh

Works with locale set to UTF-8.

function encode {
    typeset -i code_point=$1
    printf "$(printf '\\U%08X\\n' "$code_point")"
}
function decode {
    typeset character=$1
    printf 'U+%04X\n' "'$character"
    set +x
}
printf 'Char\tCode Point\tUTF-8 Bytes\n'
for test in A ö Ж  𝄞; do
    code_point=$(decode "$test")
    utf8=$(encode "$(( 16#${code_point#U+} ))")
    bytes=$(printf '%b' "$utf8" | od -An -tx1 | sed -nE '/./s/^  *|  *$//p')
    printf '%-4b\t%-10s\t%s\n' "$utf8" "$code_point" "$bytes"
done
Output:
Char    Code Point      UTF-8 Bytes
A       U+0041          41                                                            
ö       U+00F6          c3  b6                                                        
Ж      U+0416          d0  96                                                        
€       U+20AC          e2  82  ac                                                    
𝄞       U+1D11E         f0  9d  84  9e

VBA

Private Function unicode_2_utf8(x As Long) As Byte()
    Dim y() As Byte
    Dim r As Long
    Select Case x
        Case 0 To &H7F
            ReDim y(0)
            y(0) = x
        Case &H80 To &H7FF
            ReDim y(1)
            y(0) = 192 + x \ 64
            y(1) = 128 + x Mod 64
        Case &H800 To &H7FFF
            ReDim y(2)
            y(2) = 128 + x Mod 64
            r = x \ 64
            y(1) = 128 + r Mod 64
            y(0) = 224 + r \ 64
        Case 32768 To 65535 '&H8000 To &HFFFF equals in VBA as -32768 to -1
            ReDim y(2)
            y(2) = 128 + x Mod 64
            r = x \ 64
            y(1) = 128 + r Mod 64
            y(0) = 224 + r \ 64
        Case &H10000 To &H10FFFF
            ReDim y(3)
            y(3) = 128 + x Mod 64
            r = x \ 64
            y(2) = 128 + r Mod 64
            r = r \ 64
            y(1) = 128 + r Mod 64
            y(0) = 240 + r \ 64
        Case Else
            MsgBox "what else?" & x & " " & Hex(x)
    End Select
    unicode_2_utf8 = y
End Function
Private Function utf8_2_unicode(x() As Byte) As Long
    Dim first As Long, second As Long, third As Long, fourth As Long
    Dim total As Long
    Select Case UBound(x) - LBound(x)
        Case 0 'one byte
            If x(0) < 128 Then
                total = x(0)
            Else
                MsgBox "highest bit set error"
            End If
        Case 1 'two bytes and assume first byte is leading byte
            If x(0) \ 32 = 6 Then
                first = x(0) Mod 32
                If x(1) \ 64 = 2 Then
                    second = x(1) Mod 64
                Else
                    MsgBox "mask error"
                End If
            Else
                MsgBox "leading byte error"
            End If
            total = 64 * first + second
        Case 2 'three bytes and assume first byte is leading byte
            If x(0) \ 16 = 14 Then
                first = x(0) Mod 16
                If x(1) \ 64 = 2 Then
                    second = x(1) Mod 64
                    If x(2) \ 64 = 2 Then
                        third = x(2) Mod 64
                    Else
                        MsgBox "mask error last byte"
                    End If
                Else
                    MsgBox "mask error middle byte"
                End If
            Else
                MsgBox "leading byte error"
            End If
                total = 4096 * first + 64 * second + third
        Case 3 'four bytes and assume first byte is leading byte
            If x(0) \ 8 = 30 Then
                first = x(0) Mod 8
                If x(1) \ 64 = 2 Then
                    second = x(1) Mod 64
                    If x(2) \ 64 = 2 Then
                        third = x(2) Mod 64
                        If x(3) \ 64 = 2 Then
                            fourth = x(3) Mod 64
                        Else
                            MsgBox "mask error last byte"
                        End If
                    Else
                        MsgBox "mask error third byte"
                    End If
                Else
                    MsgBox "mask error second byte"
                End If
            Else
                MsgBox "mask error leading byte"
            End If
            total = CLng(262144 * first + 4096 * second + 64 * third + fourth)
        Case Else
            MsgBox "more bytes than expected"
        End Select
        utf8_2_unicode = total
End Function
Public Sub program()
    Dim cp As Variant
    Dim r() As Byte, s As String
    cp = [{65, 246, 1046, 8364, 119070}] '[{&H0041,&H00F6,&H0416,&H20AC,&H1D11E}]
    Debug.Print "ch  unicode  UTF-8 encoded  decoded"
    For Each cpi In cp
        r = unicode_2_utf8(CLng(cpi))
        On Error Resume Next
        s = CStr(Hex(cpi))
        Debug.Print ChrW(cpi); String$(10 - Len(s), " "); s,
        If Err.Number = 5 Then Debug.Print "?"; String$(10 - Len(s), " "); s,
        s = ""
        For Each yz In r
            s = s & CStr(Hex(yz)) & " "
        Next yz
        Debug.Print String$(13 - Len(s), " "); s;
        s = CStr(Hex(utf8_2_unicode(r)))
        Debug.Print String$(8 - Len(s), " "); s
    Next cpi
End Sub
Output:
ch  unicode  UTF-8 encoded  decoded

A 41 41 41 ö F6 C3 B6 F6 ? 416 D0 96 416 € 20AC E2 82 AC 20AC ? 1D11E F0 9D 84 9E 1D11E

V (Vlang)

import encoding.hex
fn decode(s string) ?[]u8 {
    return hex.decode(s)
}

fn main() {
    println("${'Char':-7} ${'Unicode':7}\tUTF-8 encoded\tDecoded")
    for codepoint in [`A`, `ö`, `Ж`, `€`, `𝄞`] {
        encoded := codepoint.bytes().hex()
        decoded := decode(encoded)?
        println("${codepoint:-7} U+${codepoint:04X}\t${encoded:-12}\t${decoded.bytestr()}")
    }
}
Output:
Char    Unicode	UTF-8 encoded	Decoded
A       U+0041	41          	A
ö       U+00F6	c3b6        	ö
Ж       U+0416	d096        	Ж
€       U+20AC	e282ac      	€
𝄞      U+1D11E	f09d849e    	𝄞

VBScript

Option Explicit
Dim m_1,m_2,m_3,m_4
Dim d_2,d_3,d_4
Dim h_0,h_2,h_3,h_4
Dim mc_0,mc_2,mc_3,mc_4

m_1=&h3F
d_2=m_1+1
m_2=m_1 * d_2
d_3= (m_2 Or m_1)+1
m_3= m_2* d_2
d_4=(m_3 Or m_2 Or m_1)+1

h_0=&h80
h_2=&hC0
h_3=&hE0
h_4=&hF0

mc_0=&h3f
mc_2=&h1F
mc_3=&hF
mc_4=&h7

Function cp2utf8(cp)  'cp as long, returns string
If cp<&h80 Then 
  cp2utf8=Chr(cp)
ElseIf (cp <=&H7FF) Then 
  cp2utf8=Chr(h_2 or (cp \ d_2) )&Chr(h_0 Or (cp And m_1))
ElseIf (cp <=&Hffff&) Then 
  cp2utf8= Chr(h_3 Or (cp\ d_3)) & Chr(h_0 Or (cp And  m_2)\d_2) & Chr(h_0 Or (cp And m_1))
Else  
  cp2utf8= Chr(h_4 Or (cp\d_4))& Chr(h_0 Or ((cp And m_3) \d_3))& Chr(h_0 Or ((cp And  m_2)\d_2)) & Chr(h_0 Or (cp And m_1))
End if  
End Function

Function utf82cp(utf) 'utf as string, returns long
   Dim a,b,m
   m=strreverse(utf)
   b= Len(utf)
   a=asc(mid(m,1,1))
   utf82cp=a And &h7f
   if b=1 Then Exit Function 
   a=asc(mid(m,2,1))
   If b=2 Then utf82cp= utf82cp Or (a And mc_2)*d_2 :Exit function
   utf82cp= utf82cp Or (a And m_1)*d_2
   a=asc(mid(m,3,1))
   If b=3 Then utf82cp= utf82cp Or (a And mc_3)*d_3 :Exit function
   utf82cp= utf82cp Or (a And m_1)*d_3 Or  (a=asc(mid(m,4,1)) And mc_4)*d_4
End Function

Sub print(s): 
    On Error Resume Next
    WScript.stdout.Write (s)  
    If  err= &h80070006& Then WScript.Echo " Please run this script with CScript": WScript.quit
End Sub
  
Function utf8displ(utf)
  Dim s,i
  s=""
  For i=1 To Len(utf)
    s=s &" "& Hex(Asc(Mid(utf,i,1)))
  Next
  utf8displ= pad(s,12)  
End function  

function pad(s,n) if n<0 then pad= right(space(-n) & s ,-n) else  pad= left(s& space(n),n) end if :end function

Sub check(i) 
Dim c,c0,c1,c2,u 
c=b(i):c0=pad(c(0),29) :c1=c(1) :c2=pad(c(2),12):u=cp2utf8(c1)
print c0 & " CP:" & pad("U+" & Hex(c1),-8) & "    my utf8:" & utf8displ (u) & " should be:" & c2 & " back to CP:" & pad("U+" & Hex(utf82cp(u)),-8)& vbCrLf
End Sub 

Dim b
b=Array(_
  Array("LATIN CAPITAL LETTER A ",&h41," 41"),_
  Array("LATIN SMALL LETTER O WITH DIAERESIS ",&hF6," C3 B6"),_
  Array("CYRILLIC CAPITAL LETTER ZHE ",&h416," D0 96"),_
  Array("EURO SIGN",&h20AC," E2 82 AC "),_
  Array("MUSICAL SYMBOL G CLEF ",&h1D11E," F0 9D 84 9E"))

check 0
check 1
check 2
check 3
check 4
Output:

LATIN CAPITAL LETTER A        CP:    U+41    my utf8: 41          should be: 41          back to CP:    U+41
LATIN SMALL LETTER O WITH DIA CP:    U+F6    my utf8: C3 B6       should be: C3 B6       back to CP:    U+F6
CYRILLIC CAPITAL LETTER ZHE   CP:   U+416    my utf8: D0 96       should be: D0 96       back to CP:   U+416
EURO SIGN                     CP:  U+20AC    my utf8: E2 82 AC    should be: E2 82 AC    back to CP:  U+20AC
MUSICAL SYMBOL G CLEF         CP: U+1D11E    my utf8: F0 9D 84 9E should be: F0 9D 84 9E back to CP: U+1D11E

Wren

The utf8_decode function was translated from the Go entry.

import "./fmt" for Fmt

var utf8_encode = Fn.new { |cp| String.fromCodePoint(cp).bytes.toList }

var utf8_decode = Fn.new { |b|
    var mbMask = 0x3f // non-first bytes start 10 and carry 6 bits of data
    var b0 = b[0]
    if (b0 < 0x80) {
        return b0
    } else if (b0 < 0xe0) {
        var b2Mask = 0x1f // first byte of a 2-byte encoding starts 110 and carries 5 bits of data
        return (b0 & b2Mask) <<  6 | (b[1] & mbMask)
    } else if (b0 < 0xf0) {
        var b3Mask = 0x0f // first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
        return (b0 & b3Mask) << 12 | (b[1] & mbMask) <<  6 | (b[2] & mbMask)
    } else {
        var b4Mask = 0x07 // first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
        return (b0 & b4Mask) << 18 | (b[1] & mbMask) << 12 | (b[2] & mbMask) << 6 | (b[3] & mbMask)
    }
}

var tests = [
    ["LATIN CAPITAL LETTER A", 0x41],
    ["LATIN SMALL LETTER O WITH DIAERESIS", 0xf6],
    ["CYRILLIC CAPITAL LETTER ZHE", 0x416],
    ["EURO SIGN", 0x20ac],
    ["MUSICAL SYMBOL G CLEF", 0x1d11e]
]

System.print("Character   Name                                  Unicode    UTF-8 encoding (hex)")
System.print("---------------------------------------------------------------------------------")

for (test in tests) {
    var cp = test[1]
    var bytes = utf8_encode.call(cp)
    var utf8 = bytes.map { |b| Fmt.Xz(2, b) }.join(" ")
    var cp2 = utf8_decode.call(bytes)
    var uni = String.fromCodePoint(cp2)
    System.print("%(Fmt.s(-11, uni)) %(Fmt.s(-37, test[0])) U+%(Fmt.s(-8, Fmt.Xz(4, cp2))) %(utf8)")
}
Output:
Character   Name                                  Unicode    UTF-8 encoding (hex)
---------------------------------------------------------------------------------
A           LATIN CAPITAL LETTER A                U+0041     41
ö           LATIN SMALL LETTER O WITH DIAERESIS   U+00F6     C3 B6
Ж           CYRILLIC CAPITAL LETTER ZHE           U+0416     D0 96
€           EURO SIGN                             U+20AC     E2 82 AC
𝄞           MUSICAL SYMBOL G CLEF                 U+1D11E    F0 9D 84 9E

zkl

println("Char  Unicode  UTF-8");
foreach utf,unicode_int in (T( T("\U41;",0x41), T("\Uf6;",0xf6), 
      T("\U416;",0x416), T("\U20AC;",0x20ac), T("\U1D11E;",0x1d11e))){
   utf_int:=utf.reduce(fcn(s,c){ 0x100*s + c.toAsc() },0);
   char :=unicode_int.toString(-8);	// Unicode int to UTF-8 string
   // UTF-8 bytes to UTF-8 string:
   char2:=Data(Void,utf_int.toBigEndian(utf_int.len())).text;

   println("%s %s %9s  %x".fmt(char,char2,"U+%x".fmt(unicode_int),utf_int));
}

Int.len() --> number of bytes in int. This could be hard coded because UTF-8 has a max of 6 bytes and (0x41).toBigEndian(6) --> 0x41,0,0,0,0,0 which is a zero terminated string ("A");

Output:
Char  Unicode  UTF-8
A A      U+41  41
ö ö      U+f6  c3b6
Ж Ж     U+416  d096
€ €    U+20ac  e282ac
𝄞 𝄞   U+1d11e  f09d849e
Cookies help us deliver our services. By using our services, you agree to our use of cookies.