UTF-8 encode and decode

𝄞 U+01d11e [f0,9d,84,9e] [f0,9d,84,9e] [f0,9d,84,9e] U+01d11e
</pre>Note that the misalignment on the last line occurs because astral characters have a string length of 2, which breaks the padding functions.
 
=={{header|jq}}==
{{works with|jq}}
'''Works with gojq, the Go implementation of jq'''
 
'''Preliminaries'''
<lang jq># input: a non-negative decimal integer
# output: the corresponding array of binary digits, most significant bit first
def binary_digits:
  if . == 0 then [0]
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
  | reverse
  | .[1:] # remove the leading 0
  end ;
 
# input: an array of binary digits, most significant bit first
# output: the corresponding decimal integer
def binary_to_decimal:
  reduce reverse[] as $b ({power:1, result:0};
    .result += .power * $b
    | .power *= 2)
  | .result;</lang>
'''Encode to UTF-8'''
<lang jq># input: a decimal integer representing a Unicode codepoint.
# output: an array of decimal integers representing the UTF-8 bytes of that codepoint.
def utf8_encode:
  def lpad($width): [(range(0;8)|0), .[]][- $width:];
  def multibyte: [1,0, (.[-6: ]|lpad(6))[]];
  def firstOf2: [1,1,0, (.[: -6]|lpad(5))[]];
  def firstOf3: [1,1,1,0, (.[:-12]|lpad(4))[]];
  def firstOf4: [1,1,1,1,0, (.[:-18]|lpad(3))[]];
  . as $n
  | binary_digits
  | length as $len
  | if $len < 8 then [$n]
    else if $len <= 11 then [ firstOf2, multibyte ]   # 2 bytes carry 5 + 6 = 11 bits
         elif $len <= 16 then [ firstOf3, (.[:-6] | multibyte), multibyte ]
         else [firstOf4,
               (.[ :-12] | multibyte),
               (.[-12: -6] | multibyte),
               multibyte]
         end
         | map(binary_to_decimal)
    end;</lang>
'''Decode an array of UTF-8 bytes'''
<lang jq># input: an array of decimal integers representing the UTF-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
  # Magic numbers:
  # x80: 128, # 10000000
  # xe0: 224, # 11100000
  # xf0: 240  # 11110000
  (-6) as $mb # non-first bytes start 10 and carry 6 bits of data
  # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
  # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
  # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
  | map(binary_digits) as $d
  | .[0]
  | if . < 128 then $d[0]
    elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
    elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
    else [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
    end
  | binary_to_decimal ;</lang>
'''Task'''
<lang jq>def task:
  [ "A", "ö", "Ж", "€", "𝄞" ][]
  | . as $glyph
  | explode[]
  | utf8_encode as $encoded
  | ($encoded|utf8_decode) as $decoded
  | "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]|implode)" ;
 
task</lang>
{{out}}
<pre>
Glyph A => [65] => 65 => A
Glyph ö => [195,182] => 246 => ö
Glyph Ж => [208,150] => 1046 => Ж
Glyph € => [226,130,172] => 8364 => €
Glyph 𝄞 => [240,157,132,158] => 119070 => 𝄞
</pre>
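'''Cross-check (arithmetic sketch)'''

As a sanity check on the bit-array encoder above, the same bytes can be derived with integer arithmetic alone. This is only a sketch: the names <code>utf8_encode_arith</code> and <code>cont</code> are hypothetical and not part of the solution above, and the input is assumed to be a valid codepoint in the range 0 to 1114111 (surrogates are not rejected).
<lang jq># Hypothetical helper for cross-checking; not part of the solution above.
# input: a decimal integer representing a Unicode codepoint (assumed valid)
# output: an array of decimal integers representing its UTF-8 bytes
def utf8_encode_arith:
  # continuation byte 10xxxxxx: the low 6 bits of floor(. / $div)
  def cont($div): 128 + ((. / $div | floor) % 64);
  if   . < 128   then [.]                                   # 0xxxxxxx
  elif . < 2048  then [192 + (. / 64 | floor), cont(1)]     # 110xxxxx 10xxxxxx
  elif . < 65536 then [224 + (. / 4096 | floor), cont(64), cont(1)]
  else [240 + (. / 262144 | floor), cont(4096), cont(64), cont(1)]
  end;

"A", "ö", "Ж", "€", "𝄞" | explode[] | utf8_encode_arith</lang>
Invoked with <code>jq -nc</code>, this should print the same five byte arrays that appear in the output above, one per line.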
 
 
=={{header|Julia}}==