UTF-8 encode and decode: Difference between revisions
Thundergnat (talk | contribs) m (syntax highlighting fixup automation)
Line 24:
{{trans|Python}}
<syntaxhighlight lang="11l">
R ‘U+’hex(ch.code).zfill(4)
Line 33:
V chars = [‘A’, ‘ö’, ‘Ж’, ‘€’]
L(char) chars
print(‘#<11 #<15 #<15’.format(char, unicode_code(char), utf8hex(char)))</syntaxhighlight>
{{out}}
Line 45:
=={{header|8th}}==
<syntaxhighlight lang="8th">
hex \ so bytes print nicely
Line 81:
bye
</syntaxhighlight>
Output:<pre>
A 41
Line 97:
=={{header|Action!}}==
<syntaxhighlight lang="action!">
BYTE ARRAY hex=['0 '1 '2 '3 '4 '5 '6 '7 '8 '9 'A 'B 'C 'D 'E 'F]
Line 246:
StrUnicode(res) PutE()
OD
RETURN</syntaxhighlight>
{{out}}
[https://gitlab.com/amarok8bit/action-rosetta-code/-/raw/master/images/UTF-8_encode_and_decode.png Screenshot from Atari 8-bit computer]
Line 269:
{{works with|Ada|Ada|2012}}
<syntaxhighlight lang="ada">
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Integer_Text_IO;
Line 317:
end;
end loop;
end UTF8_Encode_And_Decode;</syntaxhighlight>
{{out}}
<pre>Character Unicode UTF-8 encoding (hex)
Line 334:
Despite this complexity, what actually gets generated is highly optimizable C code. Note that the following demonstration requires no ATS-specific library support whatsoever.
<syntaxhighlight lang="ats">(*
UTF-8 encoding and decoding in ATS2.
Line 2,111:
println! ("SUCCESS")
end</syntaxhighlight>
{{out}}
Line 2,118:
=={{header|AutoHotkey}}==
<syntaxhighlight lang="autohotkey">
Bytes := hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0
Prefix := [0, 0xC0, 0xE0, 0xF0]
Line 2,152:
DllCall("msvcrt.dll\" v, "Int64", value, "Str", s, "UInt", OutputBase, "CDECL")
return s
}</syntaxhighlight>
Examples:<syntaxhighlight lang="autohotkey">
(comment
0x0041
Line 2,168:
}
MsgBox % output
return</syntaxhighlight>
{{out}}
<pre>
Line 2,181:
=={{header|BaCon}}==
BaCon supports UTF-8 natively.
<syntaxhighlight lang="bacon">
CONST letter$ = "A ö Ж € 𝄞"
Line 2,190:
FOR x IN letter$
PRINT x, TAB$(1), "U+", HEX$(UCS(x)), TAB$(2), COIL$(LEN(x), HEX$(x[_-1] & 255))
NEXT</syntaxhighlight>
{{out}}
<pre>Char Unicode UTF-8 (hex)
Line 2,201:
=={{header|C}}==
<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>
Line 2,312:
return 0;
}
</syntaxhighlight>
Output
<syntaxhighlight lang="text">
Character Unicode UTF-8 encoding (hex)
----------------------------------------
Line 2,323:
𝄞 U+1d11e f0 9d 84 9e
</syntaxhighlight>
=={{header|C sharp|C#}}==
<syntaxhighlight lang="csharp">
using System.Text;
Line 2,354:
20AC € E2-82-AC
1D11E 𝄞 F0-9D-84-9E */
</syntaxhighlight>
Line 2,362:
Helper functions
<syntaxhighlight lang="lisp">
(defun ascii-byte-p (octet)
"Return t if octet is a single-byte 7-bit ASCII char.
Line 2,403:
when (= (nth i templates) (logand (nth i bitmasks) octet))
return i)))
</syntaxhighlight>
Encoder
<syntaxhighlight lang="lisp">
(defun unicode-to-utf-8 (int)
"Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)."
Line 2,438:
;; return the list of UTF-8 encoded bytes.
byte-list))
</syntaxhighlight>
Decoder
<syntaxhighlight lang="lisp">
(defun utf-8-to-unicode (byte-list)
"Take a list of one to four utf-8 encoded bytes (octets), return a code point."
Line 2,462:
(error "calculated number of bytes doesnt match the length of the byte list")))
(error "first byte in the list isnt a lead byte"))))))
</syntaxhighlight>
The test
<syntaxhighlight lang="lisp">
(defun test-utf-8 ()
"Return t if the chosen unicode points are encoded and decoded correctly."
Line 2,481:
;; return t if all are t
(every #'= unicodes-orig unicodes-test)))
</syntaxhighlight>
Test output
<
CL-USER> (test-utf-8)
character A, code point: 41, utf-8: 41
Line 2,493:
character 𝄞, code point: 1D11E, utf-8: F0 9D 84 9E
T
</syntaxhighlight>
=={{header|D}}==
<syntaxhighlight lang="d">
import std.stdio;
Line 2,508:
writefln("%s %7X [%(%X, %)]", c, unicode, bytes);
}
}</syntaxhighlight>
{{out}}
Line 2,520:
=={{header|Elena}}==
ELENA 4.x :
<syntaxhighlight lang="elena">
import extensions;
Line 2,557:
"𝄞".printAsString().printAsUTF8Array().printAsUTF32();
console.printLine();
}</syntaxhighlight>
{{out}}
<pre>
Line 2,567:
=={{header|F_Sharp|F#}}==
<syntaxhighlight lang="fsharp">
// Unicode character point to UTF8. Nigel Galloway: March 19th., 2018
let fN g = match List.findIndex (fun n->n>g) [0x80;0x800;0x10000;0x110000] with
Line 2,574:
|2->[0xe0+(g&&&0xf000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
|_->[0xf0+(g&&&0x1c0000>>>18);0x80+(g&&&0x3f000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
</syntaxhighlight>
{{out}}
<pre>
Line 2,590:
{{works with|gforth|0.7.9_20191121}}
{{works with|lxf|1.6-982-823}}
<syntaxhighlight lang="forth">
over + swap ?do
i c@ 3 .r loop ;
Line 2,602:
\ can also be written as
\ 'A' test 'ö' test 'Ж' test '€' test '𝄞' test
</syntaxhighlight>
{{out}}
<pre>
Line 2,614:
If you also want to see the implementation of <code>xc!+</code> and <code>xc@+</code>, here it is (<code>u8!+</code> is the UTF-8 implementation of <code>xc!+</code>, and likewise for <code>u8@+</code>):
<syntaxhighlight lang="forth">
$80 Constant max-single-byte
Line 2,635:
REPEAT $7F xor 2* or r>
BEGIN over $80 u>= WHILE tuck c! 1+ REPEAT nip ;
</syntaxhighlight>
=={{header|Go}}==
===Implementation===
This implementation is missing all checks for invalid data and so is not production-ready, but illustrates the basic UTF-8 encoding scheme.
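For comparison, the kind of checks this implementation leaves out can be sketched in Python. This is a minimal sketch and not part of the Go entry: the function name <code>utf8_decode_checked</code> is hypothetical, and the rules it enforces (continuation-byte pattern, no overlong forms, no surrogates, nothing above U+10FFFF) are the usual well-formedness rules for UTF-8.

```python
def utf8_decode_checked(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence to a code point, rejecting malformed input."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # Determine sequence length, payload bits of the lead byte,
    # and the smallest code point that legitimately needs that length.
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0
    elif 0xC0 <= lead < 0xE0:
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:          # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    if cp > 0x10FFFF:
        raise ValueError("code point out of range")
    return cp


if __name__ == "__main__":
    for raw in (b"A", b"\xc3\xb6", b"\xd0\x96", b"\xe2\x82\xac", b"\xf0\x9d\x84\x9e"):
        print(raw.hex(), "->", f"U+{utf8_decode_checked(raw):04X}")
```

An unchecked decoder would silently accept inputs such as the overlong pair <code>C0 80</code>; the sketch raises <code>ValueError</code> for them instead.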
<syntaxhighlight lang="go">
import (
Line 2,741:
rune(b[3]&mbMask)
}
}</syntaxhighlight>
{{out}}
<pre>
Line 2,752:
===Library/language===
<syntaxhighlight lang="go">
import (
Line 2,777:
fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
}
}</syntaxhighlight>
{{out}}
<pre>
Line 2,789:
Alternately:
<syntaxhighlight lang="go">
import (
Line 2,810:
fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
}
}</syntaxhighlight>
{{out}}
<pre>
Line 2,823:
=={{header|Groovy}}==
{{trans|Java}}
<syntaxhighlight lang="groovy">
class UTF8EncodeDecode {
Line 2,849:
}
}
}</syntaxhighlight>
{{out}}
<pre>Char Name Unicode UTF-8 encoded Decoded
Line 2,862:
Example makes use of [http://hackage.haskell.org/package/bytestring <tt>bytestring</tt>] and [http://hackage.haskell.org/package/text <tt>text</tt>] packages:
<syntaxhighlight lang="haskell">
import qualified Data.ByteString as ByteString (pack, unpack)
Line 2,889:
(printf "U+%04X" codepoint :: String)
(intercalate " " (map (printf "%02X") values))
codepoint'</syntaxhighlight>
{{out}}
<pre>
Line 2,903:
=={{header|J}}==
'''Solution:'''
<syntaxhighlight lang="j">
ucp=: 9&u: NB. converts to unicode from UTF-8 or unicode codepoint integer
ucp_hex=: hfd@(3 u: ucp) NB. converts to unicode codepoint hexadecimal from UTF-8, unicode or unicode codepoint integer</syntaxhighlight>
'''Examples:'''
<syntaxhighlight lang="j">
AöЖ€𝄞
ucp 65 246 1046 8364 119070
Line 2,923:
1d11e
utf8@dfh ucp_hex utf8 65 246 1046 8364 119070
AöЖ€𝄞</syntaxhighlight>
=={{header|Java}}==
{{works with|Java|7+}}
<syntaxhighlight lang="java">
import java.util.Formatter;
Line 2,956:
}
}
}</syntaxhighlight>
{{out}}
<pre>
Line 2,969:
=={{header|JavaScript}}==
An implementation in ECMAScript 2015 (ES6):
<syntaxhighlight lang="javascript">
/***************************************************************************\
|* Pure UTF-8 handling without detailed error reporting functionality. *|
Line 3,029:
?( m&0x07)<<18|( n&0x3f)<<12|( o&0x3f)<<6|( p&0x3f)<<0
:(()=>{throw'Invalid UTF-8 encoding!'})()
</syntaxhighlight>
The testing inputs:
<syntaxhighlight lang="javascript">
const
str=
Line 3,050:
:[ [ a,b,c]]
,inputs=zip3(str,cps,cus);
</syntaxhighlight>
The testing code:
<syntaxhighlight lang="javascript">
console.log(`\
${'Character'.padEnd(16)}\
Line 3,068:
${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`)
</syntaxhighlight>
and finally, the output from the test:
<pre>
Line 3,084:
'''Preliminaries'''
<syntaxhighlight lang="jq">
# output: the corresponding binary array, most significant bit first
def binary_digits:
Line 3,098:
.result += .power * $b
| .power *= 2)
| .result;</syntaxhighlight>
'''Encode to UTF-8'''
<syntaxhighlight lang="jq">
# output: the corresponding decimal number of that codepoint.
def utf8_encode:
Line 3,120:
end
| map(binary_to_decimal)
end;</syntaxhighlight>
'''Decode an array of UTF-8 bytes'''
<syntaxhighlight lang="jq">
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
Line 3,140:
else $d[0][-3:] + $d[1][$mb:] + $d[2][$mb:] + $d[3][$mb:]
end
| binary_to_decimal ;</syntaxhighlight>
'''Task'''
<syntaxhighlight lang="jq">
[ "A", "ö", "Ж", "€", "𝄞" ][]
| . as $glyph
Line 3,150:
| "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]|implode)" ;
task</syntaxhighlight>
{{out}}
<pre>
Line 3,165:
Julia supports UTF-8 encoding by default.
<syntaxhighlight lang="julia">
enc = Vector{UInt8}(t)
dec = String(enc)
println(dec, " → ", enc)
end</syntaxhighlight>
{{out}}
Line 3,179:
=={{header|Kotlin}}==
<syntaxhighlight lang="kotlin">
fun utf8Encode(codePoint: Int) = String(intArrayOf(codePoint), 0, 1).toByteArray(Charsets.UTF_8)
Line 3,198:
System.out.printf("%-${n}s %c\n", s, decoded)
}
}</syntaxhighlight>
{{out}}
Line 3,212:
=={{header|langur}}==
{{works with|langur|0.8.4}}
<syntaxhighlight lang="langur">
for .cp in "AöЖ€𝄞" {
Line 3,219:
val .utf8rep = join " ", map f $"\.b:X02;", .utf8
writeln $"\.cpstr:-11; U+\.cp:X04:-8; \.utf8rep;"
}</syntaxhighlight>
{{out}}
Line 3,237:
- byteArray.toHexString (intStart, intLen): returns hex string representation of byte array (e.g. for printing)<br />
- byteArray.readRawString (intLen, [strCharSet="UTF-8"]): reads a fixed number of bytes as a string
<syntaxhighlight lang="lingo">
put "Character Unicode (int) UTF-8 (hex) Decoded"
repeat with c in chars
ba = bytearray(c)
put col(c, 12) & col(charToNum(c), 16) & col(ba.toHexString(1, ba.length), 14) & ba.readRawString(ba.length)
end repeat</syntaxhighlight>
Helper function for table formatting
<syntaxhighlight lang="lingo">
str = string(val)
repeat with i = str.length+1 to len
Line 3,250:
end repeat
return str
end</syntaxhighlight>
{{out}}
<pre>
Line 3,263:
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
-- Accept an integer representing a codepoint.
-- Return the values of the individual octets.
Line 3,298:
end
end
</syntaxhighlight>
{{out}}
<pre>
Line 3,319:
=={{header|M2000 Interpreter}}==
<syntaxhighlight lang="m2000 interpreter">
Module EncodeDecodeUTF8 {
a$=string$("Hello" as UTF8enc)
Line 3,339:
}
EncodeDecodeUTF8
</syntaxhighlight>
{{out}}
<pre>
Line 3,354:
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">
ToCharacterCode[FromCharacterCode[utf, "UTF8"]]</syntaxhighlight>
{{out}}
<pre>{65, 195, 182, 208, 150, 226, 130, 172}
Line 3,367:
For this purpose, using sequences or bytes is not natural. Here is a way to proceed using the module “unicode”.
<syntaxhighlight lang="nim">
const UChars = ["\u0041", "\u00F6", "\u0416", "\u20AC", "\u{1D11E}"]
Line 3,387:
r = s.toRune
# Display.
echo &"""{uchar:>5} U+{r.int.toHex(5)} {s.map(toHex).join(" ")}"""</syntaxhighlight>
{{out}}
Line 3,400:
In this section, we provide two procedures to convert a Unicode code point to a UTF-8 sequence of bytes and conversely, without using the module “unicode”. We also provide a procedure to convert a sequence of bytes to a string in order to print it. The algorithm is the one used by the Go solution.
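The bit manipulation behind such a hand-written conversion can be sketched as follows. The example is in Python rather than Nim, the names <code>utf8_encode</code> and <code>utf8_decode</code> are hypothetical, and the decoder assumes well-formed input and performs no validation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as UTF-8 without using str.encode."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_decode(data: bytes) -> int:
    """Decode one well-formed UTF-8 sequence; the input is NOT validated."""
    lead = data[0]
    if lead < 0x80:
        return lead
    if lead < 0xE0:
        cp = lead & 0x1F               # two-byte sequence: 5 payload bits
    elif lead < 0xF0:
        cp = lead & 0x0F               # three-byte sequence: 4 payload bits
    else:
        cp = lead & 0x07               # four-byte sequence: 3 payload bits
    for b in data[1:]:
        cp = (cp << 6) | (b & 0x3F)    # each continuation byte adds 6 bits
    return cp


if __name__ == "__main__":
    for cp in (0x41, 0xF6, 0x416, 0x20AC, 0x1D11E):
        encoded = utf8_encode(cp)
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        print(f"U+{cp:04X}  {hex_bytes:<12} {chr(utf8_decode(encoded))}")
```

The sketch can be cross-checked against Python's built-in codec, since <code>utf8_encode(cp)</code> should equal <code>chr(cp).encode("utf-8")</code> for any non-surrogate code point.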
<syntaxhighlight lang="nim">
const
Line 3,472:
# Display.
echo &"""{s.toString:>5} U+{c.int.toHex(5)} {s.map(toHex).join(" ")}"""
</syntaxhighlight>
{{out}}
Line 3,478:
=={{header|Perl}}==
<syntaxhighlight lang="perl">
use strict;
use warnings;
Line 3,497:
} split //, $utf8;
print "\n";
} @chars;</syntaxhighlight>
{{out}}
Line 3,515:
As requested in the task description:
<!--<syntaxhighlight lang="phix">-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">tests</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">#0041</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#00F6</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#0416</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#20AC</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#1D11E</span><span style="color: #0000FF;">}</span>
Line 3,528:
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x -> {%s} -> {%s}\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">codepoint</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%02x"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x"</span><span style="color: #0000FF;">)})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<!--</syntaxhighlight>-->
{{out}}
Line 3,540:
=={{header|Processing}}==
<syntaxhighlight lang="processing">
Integer[] code_points = {0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E};
Line 3,567:
tel_1 += 30; tel_2 = 50;
}
}</syntaxhighlight>
Line 3,574:
The encoding and decoding procedures are kept simple and are designed to work with an array of 5 elements for input/output of the UTF-8 encoding of a single code point at a time. It was decided not to use a more elaborate example that could operate on a buffer to encode/decode more than one code point at a time.
<syntaxhighlight lang="purebasic">
Procedure UTF8_encode(x, Array encoded_codepoint.a(1)) ;x is codepoint to encode, the array will contain output
Line 3,685:
Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
CloseConsole()
EndIf</syntaxhighlight>
Sample output:
<pre> Unicode UTF-8 Decoded
Line 3,698:
=={{header|Python}}==
<syntaxhighlight lang="python">
#!/usr/bin/env python3
from unicodedata import name
Line 3,715:
chars = ['A', 'ö', 'Ж', '€', '𝄞']
for char in chars:
    print('{:<11} {:<36} {:<15} {:<15}'.format(char, name(char), unicode_code(char), utf8hex(char)))</syntaxhighlight>
{{out}}
<pre>Character Name Unicode UTF-8 encoding (hex)
Line 3,725:
=={{header|Racket}}==
<syntaxhighlight lang="racket">
(define char-map
Line 3,741:
(map (curryr number->string 16) bites)
(bytes->string/utf-8 (list->bytes bites))
name)))</syntaxhighlight>
{{out}}
<pre>#\A A (41) A LATIN-CAPITAL-LETTER-A
Line 3,753:
{{works with|Rakudo|2017.02}}
Pretty much all built in to the language.
<syntaxhighlight lang="raku" line>
for < A ö Ж € 𝄞 😜 👨👩👧👦> -> $char {
printf " %-5s | %-43s | %6s | %-7s | %12s |%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '),
('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode;
}</syntaxhighlight>
{{out}}
<pre>Character| Name | Ordinal| Unicode | UTF-8 encoded | decoded
Line 3,773:
=={{header|Ruby}}==
<syntaxhighlight lang="ruby">
character_arr = ["A","ö","Ж","€","𝄞"]
for c in character_arr do
Line 3,781:
puts ""
end
</syntaxhighlight>
{{out}}
<pre>
Line 3,807:
=={{header|Rust}}==
<syntaxhighlight lang="rust">
let chars = vec!('A', 'ö', 'Ж', '€', '𝄞');
chars.iter().for_each(|c| {
Line 3,817:
});
}
</syntaxhighlight>
{{out}}
<pre>
Line 3,829:
=={{header|Scala}}==
=== Imperative solution===
<syntaxhighlight lang="scala">
val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
Line 3,851:
printf(s"%-${w}c %-36s %-7s %-${16 - w}s%c%n",
codePoint, Character.getName(codePoint), leftAlignedHex, s, utf8Decode(bytes))
}</syntaxhighlight>
=== Functional solution===
<syntaxhighlight lang="scala">
object UTF8EncodeAndDecode extends App {
Line 3,878:
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
}</syntaxhighlight>
=== Composable and testable solution===
<syntaxhighlight lang="scala">
object UTF8EncodeAndDecode extends TheMeat with App {
Line 3,913:
}
</syntaxhighlight>
=={{header|Seed7}}==
<syntaxhighlight lang="seed7">
include "unicode.s7i";
include "console.s7i";
Line 3,934:
hex(utf8) rpad 22 <& utf8ToStri(utf8));
end for;
end func;</syntaxhighlight>
{{out}}
Line 3,948:
=={{header|Sidef}}==
<
code.chr.encode('UTF-8').bytes.map{.chr}
}
Line 3,961:
assert_eq(n, decoded.ord)
say "#{decoded} -> #{encoded}"
}</syntaxhighlight>
{{out}}
<pre>
Line 3,973:
=={{header|Swift}}==
In Swift there is a difference between UnicodeScalar, which is a single Unicode code point, and Character, which may consist of multiple UnicodeScalars, usually because of combining characters.
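The same distinction can be illustrated outside Swift. The following is a minimal Python sketch, not part of the Swift entry; Python strings are sequences of code points (comparable to UnicodeScalar) and have no built-in user-perceived character type, and the helper name <code>describe</code> is hypothetical.

```python
import unicodedata


def describe(text: str) -> list:
    """Return a (code point label, UTF-8 bytes) pair for every code point in text."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8")) for ch in text]


# One user-perceived character written as two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(len(decomposed), describe(decomposed))
    print(len(composed), describe(composed))
```

Both strings display as the same character, yet the first encodes to three UTF-8 bytes spread over two code points and the second to two bytes in one code point, which is why encoding is defined per code point rather than per displayed character.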
<syntaxhighlight lang="swift">
func encode(_ scalar: UnicodeScalar) -> Data {
Line 3,994:
print("character: \(decoded), code point: U+\(String(scalar.value, radix: 16)), \tutf-8: \(formattedBytes)")
}
</syntaxhighlight>
{{out}}
<pre>
Line 4,006:
=={{header|Tcl}}==
Note: Tcl can handle Unicode code points only up to U+FFFD, i.e. within the Basic Multilingual Plane (BMP, 16 bits wide). Therefore, the fifth test fails as expected.
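To show why a 16-bit limit matters, here is a minimal Python sketch (not part of the Tcl entry) of what a code point above U+FFFF requires: a surrogate pair in UTF-16 and four bytes in UTF-8. It uses U+1D11E (𝄞), the last test character used elsewhere on this page; the helper name <code>surrogate_pair</code> is hypothetical.

```python
def surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000              # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


if __name__ == "__main__":
    clef = 0x1D11E
    high, low = surrogate_pair(clef)
    utf8 = chr(clef).encode("utf-8")
    print(f"U+{clef:X}: UTF-16 {high:04X} {low:04X}, UTF-8 {utf8.hex(' ')} ({len(utf8)} bytes)")
```

An implementation that stores each character in a single 16-bit unit has no room for the 20-bit offset, so it cannot represent such a code point without surrogate handling.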
<syntaxhighlight lang="tcl">
set u [format %c $int]
set bytes {}
Line 4,025:
lappend res [encoder $test] -> [decoder [encoder $test]]
puts $res
}</syntaxhighlight>
<pre>
0x0041 41 -> A
Line 4,037:
While perhaps not as readable as the above, this version handles beyond-BMP codepoints by manually composing the UTF-8 byte sequences and emitting raw bytes to the console. The <tt>encoding convertto utf-8</tt> command still does the heavy lifting where it can.
<syntaxhighlight lang="tcl">
scan $codepoint %llx cp
if {$cp < 0x10000} {
Line 4,064:
set utf8 [utf8 $codepoint]
puts "[format U+%04s $codepoint]\t$utf8\t[hexchars $utf8]"
}</syntaxhighlight>
{{out}}<pre>U+0041 A 41
Line 4,074:
=={{header|VBA}}==
<
Dim y() As Byte
Dim r As Long
Line 4,195:
Debug.Print String$(8 - Len(s), " "); s
Next cpi
End Sub</syntaxhighlight>
A 41 41 41
ö F6 C3 B6 F6
Line 4,204:
=={{header|Vlang}}==
<
fn decode(s string) ?[]u8 {
return hex.decode(s)
Line 4,216:
println("${codepoint:-7} U+${codepoint:04X}\t${encoded:-12}\t${decoded.bytestr()}")
}
}</syntaxhighlight>
{{out}}
<pre>Char Unicode UTF-8 encoded Decoded
Line 4,228:
=={{header|Wren}}==
The utf8_decode function was translated from the Go entry.
<
var utf8_encode = Fn.new { |cp| String.fromCodePoint(cp).bytes.toList }
Line 4,267:
var uni = String.fromCodePoint(cp2)
System.print("%(Fmt.s(-11, uni)) %(Fmt.s(-37, test[0])) U+%(Fmt.s(-8, Fmt.Xz(4, cp2))) %(utf8)")
}</syntaxhighlight>
{{out}}
Line 4,281:
=={{header|zkl}}==
<syntaxhighlight lang="zkl">
foreach utf,unicode_int in (T( T("\U41;",0x41), T("\Uf6;",0xf6),
T("\U416;",0x416), T("\U20AC;",0x20ac), T("\U1D11E;",0x1d11e))){
Line 4,290:
println("%s %s %9s %x".fmt(char,char2,"U+%x".fmt(unicode_int),utf_int));
}</syntaxhighlight>
Int.len() --> number of bytes in int. This could be hard coded because UTF-8
has a max of 6 bytes and (0x41).toBigEndian(6) --> 0x41,0,0,0,0,0 which is