UTF-8 encode and decode: Difference between revisions

Rename Perl 6 -> Raku, alphabetize, minor clean-up
Line 72:
[f0,9d,84,9e] 𝄞
</pre>
=={{header|Ada}}==
Line 394 ⟶ 258:
𝄞 U+1d11e f0 9d 84 9e
 
</lang>
 
=={{header|Common Lisp}}==
 
Helper functions
 
<lang lisp>
(defun ascii-byte-p (octet)
  "Return t if octet is a single-byte 7-bit ASCII char.
The most significant bit is 0, so the allowed pattern is 0xxx xxxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b00000000))
    ;; bitwise AND with the bitmask #b10000000 to extract the most significant bit,
    ;; then check that it is equal to the template #b00000000.
    (= (logand bitmask octet) template)))

(defun multi-byte-p (octet)
  "Return t if octet is a part of a multi-byte UTF-8 sequence.
The multibyte pattern is 1xxx xxxx. A multi-byte can be either a lead byte or a trail byte."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b10000000))
    ;; bitwise AND with the bitmask #b10000000 to extract the most significant bit,
    ;; then check that it is equal to the template #b10000000.
    (= (logand bitmask octet) template)))

(defun lead-byte-p (octet)
  "Return t if octet is one of the leading bytes of an UTF-8 sequence, nil otherwise.
Allowed leading byte patterns are 0xxx xxxx, 110x xxxx, 1110 xxxx and 1111 0xxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (some #'(lambda (a b) (= (logand a octet) b)) bitmasks templates)))

(defun n-trail-bytes (octet)
  "Take a leading utf-8 byte, return the number of continuation bytes 0-3.
Return nil if octet is not a lead byte."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (loop for i from 0 to 3
          when (= (nth i templates) (logand (nth i bitmasks) octet))
            return i)))
</lang>
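
As a quick illustration (a sketch added for clarity, not part of the original entry), the helpers can be applied to the three bytes that encode € (E2 82 AC). It assumes the definitions above are already loaded; only the first byte should be recognised as a lead byte.

<lang lisp>
;; Classify each byte of the UTF-8 sequence for U+20AC (EURO SIGN).
;; Each result is (octet lead-byte-p n-trail-bytes).
(mapcar #'(lambda (octet)
            (list octet (lead-byte-p octet) (n-trail-bytes octet)))
        (list #xE2 #x82 #xAC))
;; => ((226 T 2) (130 NIL NIL) (172 NIL NIL))
</lang>

The trail bytes #x82 and #xAC match the pattern 10xx xxxx, which none of the four lead-byte templates accept, so both predicates return NIL for them.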
 
Encoder
 
<lang lisp>
(defun unicode-to-utf-8 (int)
  "Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)."
  ;; only code points in the range U+0000..U+10FFFF can be encoded.
  (assert (<= 0 int #x10FFFF))
  (let ((n-trail-bytes (cond ((<= #x00000 int #x00007F) 0)
                             ((<= #x00080 int #x0007FF) 1)
                             ((<= #x00800 int #x00FFFF) 2)
                             ((<= #x10000 int #x10FFFF) 3)))
        (lead-templates (list #b00000000 #b11000000 #b11100000 #b11110000))
        (trail-template #b10000000)
        ;; number of content bits in the lead byte.
        (n-lead-bits (list 7 5 4 3))
        ;; number of content bits in the trail byte.
        (n-trail-bits 6)
        ;; list to put the UTF-8 encoded bytes in.
        (byte-list nil))
    (if (= n-trail-bytes 0)
        ;; if we need 0 trail bytes, it's just an ASCII single byte.
        (push int byte-list)
        (progn
          ;; if we need more than one byte, first fill the trail bytes with 6 bits each.
          (loop for i from 0 to (1- n-trail-bytes)
                do (push (+ trail-template
                            (ldb (byte n-trail-bits (* i n-trail-bits)) int))
                         byte-list))
          ;; then copy the remaining content bits to the lead byte.
          (push (+ (nth n-trail-bytes lead-templates)
                   (ldb (byte (nth n-trail-bytes n-lead-bits) (* n-trail-bytes n-trail-bits)) int))
                byte-list)))
    ;; return the list of UTF-8 encoded bytes.
    byte-list))
</lang>
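
The bit slicing done by the encoder can be checked by hand for one case. The following sketch is an illustration only, not part of the original entry: it spells out the three-byte case for U+20AC with literal field positions and does not call unicode-to-utf-8.

<lang lisp>
;; U+20AC = 0010 000010 101100 in binary, split as 4 + 6 + 6 content bits.
(let ((cp #x20AC))
  (list (+ #b11100000 (ldb (byte 4 12) cp))   ; lead byte:  1110 0010
        (+ #b10000000 (ldb (byte 6 6) cp))    ; trail byte: 10 000010
        (+ #b10000000 (ldb (byte 6 0) cp))))  ; trail byte: 10 101100
;; => (226 130 172), i.e. E2 82 AC
</lang>

The result agrees with the bytes listed for € in the test output of this entry.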
 
Decoder
 
<lang lisp>
(defun utf-8-to-unicode (byte-list)
  "Take a list of one to four utf-8 encoded bytes (octets), return a code point."
  (let ((b1 (car byte-list)))
    (cond ((ascii-byte-p b1) b1) ; if a single byte, just return it.
          ((multi-byte-p b1)
           (if (lead-byte-p b1)
               (let ((n (n-trail-bytes b1))
                     ;; Content bits we want to extract from each lead byte.
                     (lead-templates (list #b01111111 #b00011111 #b00001111 #b00000111))
                     ;; Content bits we want to extract from each trail byte.
                     (trail-template #b00111111))
                 (if (= n (1- (list-length byte-list)))
                     ;; add lead byte
                     (+ (ash (logand (nth 0 byte-list) (nth n lead-templates)) (* 6 n))
                        ;; and the trail bytes
                        (loop for i from 1 to n sum
                              (ash (logand (nth i byte-list) trail-template) (* 6 (- n i)))))
                     (error "calculated number of bytes doesn't match the length of the byte list")))
               (error "first byte in the list isn't a lead byte"))))))
</lang>
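
The inverse arithmetic can be written out the same way. This is a hand-expanded sketch for the three-byte sequence E2 82 AC, added for illustration only; it uses the same masks as the decoder but does not call utf-8-to-unicode.

<lang lisp>
;; Strip the marker bits from each byte and shift the content bits into place.
(let ((bytes (list #xE2 #x82 #xAC)))
  (+ (ash (logand (first bytes)  #b00001111) 12)  ; 4 content bits of the lead byte
     (ash (logand (second bytes) #b00111111) 6)   ; 6 content bits of the first trail byte
     (logand (third bytes) #b00111111)))          ; 6 content bits of the last trail byte
;; => 8364 (#x20AC)
</lang>

Note that the decoder above checks only the first byte and the length of the list; it does not verify that the remaining bytes are trail bytes of the form 10xx xxxx.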
 
The test
 
<lang lisp>
(defun test-utf-8 ()
  "Return t if the chosen unicode points are encoded and decoded correctly."
  (let* ((unicodes-orig (list 65 246 1046 8364 119070))
         (unicodes-test (mapcar #'(lambda (x) (utf-8-to-unicode (unicode-to-utf-8 x)))
                                unicodes-orig)))
    (mapcar #'(lambda (x)
                (format t
                        "character ~A, code point: ~6x, utf-8: ~{~x ~}~%"
                        (code-char x)
                        x
                        (unicode-to-utf-8 x)))
            unicodes-orig)
    ;; return t if every code point survived the round trip
    (every #'= unicodes-orig unicodes-test)))
</lang>
 
Test output
 
<lang lisp>
CL-USER> (test-utf-8)
character A, code point: 41, utf-8: 41
character ö, code point: F6, utf-8: C3 B6
character Ж, code point: 416, utf-8: D0 96
character €, code point: 20AC, utf-8: E2 82 AC
character 𝄞, code point: 1D11E, utf-8: F0 9D 84 9E
T
</lang>
 
Line 418 ⟶ 419:
€ 20AC [E2, 82, AC]
𝄞 1D11E [F0, 9D, 84, 9E]</pre>
 
=={{header|Elena}}==
ELENA 4.x :
Line 1,117 ⟶ 1,119:
Decoding: 𝄞
</pre>
 
=={{header|M2000 Interpreter}}==
<lang M2000 Interpreter>
Line 1,190 ⟶ 1,193:
€ euro sign 0020ac e2 82 ac
𝄞 musical symbol g clef 01d11e f0 9d 84 9e
</pre>
 
Line 1,240 ⟶ 1,222:
#1D11E -> {#F0,#9D,#84,#9E} -> {#1D11E}
</pre>
 
=={{header|PureBasic}}==
The encoding and decoding procedures are kept simple and are designed to work with an array of 5 elements for input/output of the UTF-8 encoding of a single code point at a time. A more elaborate example that could operate on a buffer to encode or decode more than one code point at a time was deliberately not used.
Line 1,417 ⟶ 1,400:
#\€ € (e2 82 ac) € EURO-SIGN
#\𝄞 𝄞 (f0 9d 84 9e) 𝄞 MUSICAL-SYMBOL-G-CLEF</pre>
 
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2017.02}}
Pretty much all built into the language.
<lang perl6>say sprintf("%-18s %-36s|%8s| %7s |%14s | %s\n", 'Character|', 'Name', 'Ordinal', 'Unicode', 'UTF-8 encoded', 'decoded'), '-' x 100;
 
for < A ö Ж € 𝄞 😜 👨‍👩‍👧‍👦> -> $char {
printf " %-5s | %-43s | %6s | %-7s | %12s |%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '),
('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode;
}</lang>
{{out}}
<pre>Character| Name | Ordinal| Unicode | UTF-8 encoded | decoded
----------------------------------------------------------------------------------------------------
A | LATIN CAPITAL LETTER A | 65 | U+41 | 41 | A
ö | LATIN SMALL LETTER O WITH DIAERESIS | 246 | U+F6 | C3 B6 | ö
Ж | CYRILLIC CAPITAL LETTER ZHE | 1046 | U+416 | D0 96 | Ж
€ | EURO SIGN | 8364 | U+20AC | E2 82 AC | €
𝄞 | MUSICAL SYMBOL G CLEF | 119070 | U+1D11E | F0 9D 84 9E | 𝄞
😜 | FACE WITH STUCK-OUT TONGUE AND WINKING EYE | 128540 | U+1F61C | F0 9F 98 9C | 😜
👨‍👩‍👧‍👦 | MAN,ZERO WIDTH JOINER,WOMAN,ZERO WIDTH JOINER,GIRL,ZERO WIDTH JOINER,BOY | 128104 8205 128105 8205 128103 8205 128102 | U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 | F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D F0 9F 91 A7 E2 80 8D F0 9F 91 A6 | 👨‍👩‍👧‍👦
</pre>
 
=={{header|Ruby}}==
Line 1,596 ⟶ 1,601:
𝄞 -> ["\xF0", "\x9D", "\x84", "\x9E"]
</pre>
 
=={{header|Swift}}==
In Swift there's a difference between UnicodeScalar, which is a single Unicode code point, and Character, which may consist of multiple UnicodeScalars, usually because of combining characters.
Line 1,827 ⟶ 1,833:
? 1D11E F0 9D 84 9E 1D11E
</pre>
 
=={{header|zkl}}==
<lang zkl>println("Char Unicode UTF-8");