String length: Difference between revisions
(→{{header|JavaScript}}: Unicode consideration)
=={{header|JavaScript}}==
===Byte length===
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length property of string objects gives the number of 16-bit values used to encode a string, so the number of bytes needed to store it in UTF-16 can be determined by doubling that number.
<syntaxhighlight lang="javascript">
var s = "Hello, world!";
var byteCount = s.length * 2; // 26
</syntaxhighlight>

It's easier to use Buffer.byteLength (Node.js specific, not part of ECMAScript):

<syntaxhighlight lang="javascript">
a = '👩❤️👩'; // a ZWJ emoji sequence: woman, ZWJ, heavy black heart, variation selector, ZWJ, woman
Buffer.byteLength(a, 'utf16le'); // 16
Buffer.byteLength(a, 'utf8'); // 20
Buffer.byteLength(s, 'utf16le'); // 26
Buffer.byteLength(s, 'utf8'); // 13
</syntaxhighlight>

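As a quick cross-check, the length-doubling approach from above gives the same UTF-16 byte counts for both strings (reusing the a and s variables defined in the earlier examples):

<syntaxhighlight lang="javascript">
a.length * 2; // 16, matches Buffer.byteLength(a, 'utf16le')
s.length * 2; // 26, matches Buffer.byteLength(s, 'utf16le')
</syntaxhighlight>
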
In pure ECMAScript, TextEncoder() can be used to return the UTF-8 byte size:

<syntaxhighlight lang="javascript">
(new TextEncoder().encode(a)).length; // 20
(new TextEncoder().encode(s)).length; // 13
</syntaxhighlight>

=== Unicode codepoint length ===
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The most commonly used characters are represented by one 16-bit value, while rarer ones like some mathematical symbols are represented by two.
If the string only contains commonly used characters, the number of characters will be equal to the number of 16-bit values used to represent the characters.

<syntaxhighlight lang="javascript">
var str1 = "Hello, world!";
var len1 = str1.length; // 13

// "\uD835\uDD38" is U+1D538 (𝔸), an example character outside the Basic
// Multilingual Plane, so UTF-16 encodes it as a surrogate pair of two 16-bit values
var str2 = "\uD835\uDD38";
var len2 = str2.length; // 2
</syntaxhighlight>

More generally, the spread operator in an array literal can be used to enumerate Unicode code points:

<syntaxhighlight lang="javascript">
[...str2].length; // 1
</syntaxhighlight>

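Array.from() consumes the same string iterator, so it counts code points in the same way and can serve as an alternative to the spread syntax:

<syntaxhighlight lang="javascript">
Array.from(str2).length; // 1
</syntaxhighlight>
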
=== Unicode grapheme length ===
Counting Unicode code points returns the wrong size when a string uses combining characters, such as ZWJ joining sequences or diacritics, so we must count graphemes instead. The default granularity of Intl.Segmenter() is grapheme.

<syntaxhighlight lang="javascript">
[...new Intl.Segmenter().segment(a)].length; // 1
</syntaxhighlight>

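Counting code points for the same string shows the over-count that grapheme segmentation avoids (assuming a is the ZWJ emoji sequence from the byte-length examples):

<syntaxhighlight lang="javascript">
[...a].length; // 6 code points (woman, ZWJ, heart, variation selector, ZWJ, woman), but only 1 grapheme
</syntaxhighlight>
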
===ES6 destructuring/iterators===
ES6 provides several ways to get a string split into an array of code points instead of UTF-16 code units:

<syntaxhighlight lang="javascript">let |
<syntaxhighlight lang="javascript">let |