String length: Difference between revisions
=={{header|JavaScript}}==
===Byte length===
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length property of string objects gives the number of 16-bit values used to encode a string, so the number of bytes in the UTF-16 encoding is twice that number.
<syntaxhighlight lang="javascript">
var s = "Hello, world!";
var byteCount = s.length * 2; // 26
</syntaxhighlight>
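It's easier to use Buffer.byteLength (specific to Node.js, not part of ECMAScript).
<syntaxhighlight lang="javascript">
var a = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F469}"; // '👩❤️👩': woman, ZWJ, heart, variation selector, ZWJ, woman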
Buffer.byteLength(a, 'utf16le'); // 16
Buffer.byteLength(a, 'utf8'); // 20
Buffer.byteLength(s, 'utf16le'); // 26
Buffer.byteLength(s, 'utf8'); // 13
</syntaxhighlight>
TextEncoder is not part of ECMAScript itself either, but it is available in browsers as well as in Node.js, and can be used to return the UTF-8 byte size:
<syntaxhighlight lang="javascript">
(new TextEncoder().encode(a)).length; // 20
(new TextEncoder().encode(s)).length; // 13
</syntaxhighlight>
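Where neither Buffer nor TextEncoder is available, the UTF-8 size can also be derived from the code points themselves. The following is a minimal sketch rather than part of the original entry: utf8ByteLength is a hypothetical helper name, and it assumes an ES6 engine (for...of over strings and codePointAt).

```javascript
// Hypothetical helper (not a built-in): count the bytes a string
// would occupy in UTF-8 by looking at the range of each code point.
function utf8ByteLength(str) {
  var bytes = 0;
  for (var ch of str) {              // string iteration yields whole code points
    var cp = ch.codePointAt(0);
    if (cp <= 0x7F) bytes += 1;        // ASCII
    else if (cp <= 0x7FF) bytes += 2;
    else if (cp <= 0xFFFF) bytes += 3; // rest of the Basic Multilingual Plane
    else bytes += 4;                   // code points stored as surrogate pairs
  }
  return bytes;
}

utf8ByteLength("Hello, world!"); // 13
utf8ByteLength("\uD834\uDD2A");  // 4
```

A lone surrogate is counted as 3 bytes here, which matches the size of the replacement character U+FFFD that TextEncoder would emit for it.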
=== Unicode codepoint length ===
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The most commonly used characters are represented by one 16-bit value, while rarer ones like some mathematical symbols are represented by two.
<syntaxhighlight lang="javascript">
var str1 = "Hello, world!";
var len1 = str1.length; // 13

var str2 = "\uD834\uDD2A"; // U+1D12A represented by a UTF-16 surrogate pair
var len2 = str2.length; // 2
</syntaxhighlight>
More generally, spread syntax in an array literal iterates a string by Unicode code point, so the length of the resulting array is the number of code points:
<syntaxhighlight lang="javascript">
[...str2].length // 1
</syntaxhighlight>
=== Unicode grapheme length ===
Counting Unicode code points gives the wrong size for text that uses combining characters, such as joining sequences or diacritics, because several code points then form a single visible character, so we must count graphemes instead. The default granularity of Intl.Segmenter() is grapheme.
<syntaxhighlight lang="javascript">
[...new Intl.Segmenter().segment(a)].length; // 1
</syntaxhighlight>
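The three notions of length can be compared side by side. The following is a small sketch under stated assumptions: stringLengths is a hypothetical helper name, and Intl.Segmenter must be provided by the engine (it belongs to the ECMAScript Internationalization API and may be missing in older environments).

```javascript
// Hypothetical helper (not a built-in): report the three "lengths" of a string.
function stringLengths(str) {
  var segmenter = new Intl.Segmenter(); // default granularity is "grapheme"
  return {
    codeUnits: str.length,                          // UTF-16 code units
    codePoints: [...str].length,                    // Unicode code points
    graphemes: [...segmenter.segment(str)].length   // user-perceived characters
  };
}

var accented = stringLengths("e\u0301"); // "e" followed by a combining acute accent
// accented.codeUnits  is 2
// accented.codePoints is 2
// accented.graphemes  is 1

var doubleSharp = stringLengths("\uD834\uDD2A"); // U+1D12A as a surrogate pair
// doubleSharp.codeUnits  is 2
// doubleSharp.codePoints is 1
// doubleSharp.graphemes  is 1
```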
===ES6 destructuring/iterators===
ES6 provides several ways to get a string split into an array of code points instead of UTF-16 code units:
<syntaxhighlight lang="javascript">let