String length: Difference between revisions
Revision as of 15:51, 28 December 2012
You are encouraged to solve this task according to the task description, using any language you may know.
In this task, the goal is to find the character and byte length of a string. This means encodings like UTF-8 need to be handled properly, as there is not necessarily a one-to-one relationship between bytes and characters. By character, we mean an individual Unicode code point, not a user-visible grapheme containing combining characters. For example, the character length of "møøse" is 5 but the byte length is 7 in UTF-8 and 10 in UTF-16.
Non-BMP code points (those between 0x10000 and 0x10FFFF) must also be handled correctly: answers should produce actual character counts in code points, not in code unit counts. Therefore a string like "𝔘𝔫𝔦𝔠𝔬𝔡𝔢" (consisting of the 7 Unicode characters U+1D518 U+1D52B U+1D526 U+1D520 U+1D52C U+1D521 U+1D522) is 7 characters long, not 14 UTF-16 code units; and it is 28 bytes long whether encoded in UTF-8 or in UTF-16.
Please mark your examples with ===Character Length=== or ===Byte Length===. If your language is capable of providing the string length in graphemes, mark those examples with ===Grapheme Length===. For example, the string "J̲o̲s̲é̲" ("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}") has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
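The figures quoted in the task description can be checked with a short sketch. It is written in Python 3.3 or later (where len() counts code points) purely as an illustration; the helper name `lengths` is hypothetical, and the grapheme figure is only a rough estimate that skips combining marks, not a full UAX #29 segmentation.

```python
import unicodedata

def lengths(s):
    """Return (code points, UTF-8 bytes, UTF-16 bytes, approximate graphemes)."""
    code_points = len(s)                      # Python 3.3+ counts code points
    utf8_bytes = len(s.encode("utf-8"))
    utf16_bytes = len(s.encode("utf-16-le"))  # "-le" avoids the 2-byte BOM
    # Rough grapheme estimate: count only non-combining code points.
    graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return code_points, utf8_bytes, utf16_bytes, graphemes

# The three sample strings from the task description, written with escapes.
samples = {
    "moose": "m\u00f8\u00f8se",
    "Unicode (Fraktur)": "\U0001D518\U0001D52B\U0001D526\U0001D520"
                         "\U0001D52C\U0001D521\U0001D522",
    "Jose (underlined)": "J\u0332o\u0332s\u0332e\u0301\u0332",
}

for name, text in samples.items():
    print(name, lengths(text))
```

The combining-mark estimate happens to give 4 for the last sample, but it would miscount scripts whose clusters are built from non-combining code points.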
4D
Byte Length
<lang 4d>$length:=Length("Hello, world!")</lang>
ActionScript
Character Length
<lang actionscript>myStrVar.length()</lang>
Ada
Byte Length
<lang ada>Str    : String := "Hello World";
Length : constant Natural := Str'Size / 8;</lang>
The 'Size attribute returns the size of an object in bits. Provided that "byte" means an octet of bits, the length in bytes is 'Size divided by 8. Note that an octet is not necessarily the machine storage unit. To make the program portable, System.Storage_Unit should be used instead of the "magic number" 8; System.Storage_Unit yields the number of bits in a storage unit on the current machine. Further, the size of a string object is not the same as the length of what the string contains, in whatever units it is measured. A String object may carry a "dope" that holds the array bounds, and the object size can even be 0 if the compiler optimized the object away. So in most cases "byte length" makes no sense in Ada.
Character Length
<lang ada>Latin_1_Str    : String           := "Hello World";
UCS_16_Str     : Wide_String      := "Hello World";
Unicode_Str    : Wide_Wide_String := "Hello World";
Latin_1_Length : constant Natural := Latin_1_Str'Length;
UCS_16_Length  : constant Natural := UCS_16_Str'Length;
Unicode_Length : constant Natural := Unicode_Str'Length;</lang>
The attribute 'Length yields the number of elements of an array. Since strings in Ada are arrays of characters, 'Length is the string length. Ada supports strings of Latin-1, 16-bit (UCS-2) and full Unicode characters. In the example above the character length of all three strings is 11. The length of the objects in bits will differ.
ALGOL 68
Bits and Bytes Length
<lang algol68>BITS bits := bits pack((TRUE, TRUE, FALSE, FALSE)); # packed array of BOOL # BYTES bytes := bytes pack("Hello, world"); # packed array of CHAR # print((
"BITS and BYTES are fixed width:", new line, "bits width:", bits width, ", max bits: ", max bits, ", bits:", bits, new line, "bytes width: ",bytes width, ", UPB:",UPB STRING(bytes), ", string:", STRING(bytes),"!", new line
))</lang> Output:
BITS and BYTES are fixed width: bits width: +32, max bits: TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, bits:TTFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF bytes width: +32, UPB: +32, string:Hello, world!
Character Length
<lang algol68>STRING str := "hello, world"; INT length := UPB str; printf(($"Length of """g""" is "g(3)l$,str,length));
printf(($l"STRINGS can start at -1, in which case LWB must be used:"l$)); STRING s := "abcd"[@-1]; print(("s:",s, ", LWB:", LWB s, ", UPB:",UPB s, ", LEN:",UPB s - LWB s + 1))</lang> Output:
Length of "hello, world" is +12 STRINGS can start at -1, in which case LWB must be used: s:abcd, LWB: -1, UPB: +2, LEN: +4
AppleScript
Byte Length
<lang applescript>count of "Hello World"</lang>
Mac OS X 10.5 (Leopard) includes AppleScript 2.0, which uses only Unicode (UTF-16) character strings. This example has not been tested and may not work on previous versions of AppleScript.
<lang applescript>set inString to "Hello World" as Unicode text
set byteCount to 0
set idList to id of inString

repeat with incr in idList
    set byteCount to byteCount + 2
    if incr as integer > 65535 then
        set byteCount to byteCount + 2
    end if
end repeat
byteCount</lang>
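The counting rule used by the AppleScript loop (2 bytes per code point, plus 2 more for anything above 65535) can be sketched in Python and compared against a real UTF-16 encoder. The function name `utf16_byte_length` is hypothetical, and the sketch assumes Python 3.3 or later, where iterating a string yields code points.

```python
def utf16_byte_length(s):
    """Count UTF-16 bytes: 2 per code point, 2 extra outside the BMP."""
    total = 0
    for ch in s:
        total += 2
        if ord(ch) > 0xFFFF:  # needs a surrogate pair
            total += 2
    return total

# Cross-check against the built-in encoder (little-endian, no BOM).
for text in ("Hello World", "\U0001D518\U0001D52B"):
    assert utf16_byte_length(text) == len(text.encode("utf-16-le"))
    print(utf16_byte_length(text))
```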
Character Length
<lang applescript>count of "Hello World"</lang> Or: <lang applescript>count "Hello World"</lang>
AutoHotkey
Character Length
<lang AutoHotkey>Msgbox % StrLen("Hello World")</lang> Or: <lang AutoHotkey>String := "Hello World" StringLen, Length, String Msgbox % Length</lang>
AWK
Byte Length
From within any code block:
<lang awk>w=length("Hello, world!")      # static string example
x=length("Hello," s " world!") # dynamic string example
y=length($1)                   # input field example
z=length(s)                    # variable name example</lang>
Ad hoc program from command line:
echo "Hello, wørld!" | awk '{print length($0)}' # 14
From executable script: (prints for every line arriving on stdin)
<lang awk>#!/usr/bin/awk -f
{print"The length of this line is "length($0)}</lang>
Batch File
Byte Length
<lang dos>@echo off
setlocal enabledelayedexpansion
call :length %1 res
echo length of %1 is %res%
goto :eof

:length
set str=%~1
set cnt=0
:loop
if "%str%" equ "" (
    set %2=%cnt%
    goto :eof
)
set str=!str:~1!
set /a cnt = cnt + 1
goto loop</lang>
BASIC
Character Length
BASIC only supports single-byte characters. The character "ø" is converted to "°" for printing to the console and length functions, but will still output to a file as "ø". <lang qbasic> INPUT a$
PRINT LEN(a$)</lang>
ZX Spectrum Basic
The ZX Spectrum needs line numbers:
<lang zxbasic>10 INPUT a$ 20 PRINT LEN(a$)</lang>
BBC BASIC
Character Length
<lang bbcbasic> INPUT text$
PRINT LEN(text$)</lang>
Byte Length
<lang bbcbasic> CP_ACP = 0
CP_UTF8 = &FDE9 textA$ = "møøse" textW$ = " " textU$ = " " SYS "MultiByteToWideChar", CP_ACP, 0, textA$, -1, !^textW$, LEN(textW$)/2 TO nW% SYS "WideCharToMultiByte", CP_UTF8, 0, textW$, -1, !^textU$, LEN(textU$), 0, 0 PRINT "Length in bytes (ANSI encoding) = " ; LEN(textA$) PRINT "Length in bytes (UTF-16 encoding) = " ; 2*(nW%-1) PRINT "Length in bytes (UTF-8 encoding) = " ; LEN($$!^textU$)</lang>
Output:
Length in bytes (ANSI encoding) = 5 Length in bytes (UTF-16 encoding) = 10 Length in bytes (UTF-8 encoding) = 7
Bracmat
The solutions work with UTF-8 encoded strings.
Byte Length
<lang bracmat>(ByteLength=
length
. @(!arg:? [?length)
& !length
);
out$ByteLength$𝔘𝔫𝔦𝔠𝔬𝔡𝔢</lang> Answer:
28
Character Length
<lang bracmat>(CharacterLength=
length c
. 0:?length
& @( !arg : ? ( %?c & utf$!c:?k & 1+!length:?length & ~ ) ? ) | !length
);
out$CharacterLength$𝔘𝔫𝔦𝔠𝔬𝔡𝔢</lang> Answer:
7
An improved version scans the input string character-wise, not byte-wise, so many string positions that cannot be the start of a UTF-8 character are not even tried. The patterns [!p and [?p implement a ratchet mechanism: [!p indicates the start of a character and [?p remembers the end of the character, which becomes the start position of the next character.
<lang bracmat>(CharacterLength=
length c p
. 0:?length:?p
& @( !arg : ? ( [!p %?c & utf$!c:?k & 1+!length:?length ) ([?p&~) ? ) | !length
);</lang>
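The idea of stepping from one character start to the next, rather than testing every byte position, can be sketched in Python as follows. This is not a translation of the Bracmat pattern; the function name `utf8_char_count` is hypothetical, and the sketch assumes well-formed UTF-8 input (continuation bytes are not validated).

```python
def utf8_char_count(data):
    """Count characters in UTF-8 bytes by hopping from lead byte to lead byte."""
    count = 0
    pos = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:               # 0xxxxxxx: 1-byte sequence
            step = 1
        elif lead >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
            step = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
            step = 3
        elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
            step = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % pos)
        pos += step                   # jump straight to the next character
        count += 1
    return count

print(utf8_char_count("m\u00f8\u00f8se".encode("utf-8")))
```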
C
Byte Length
<lang c>#include <string.h>
int main(void) {
const char *string = "Hello, world!"; size_t length = strlen(string); return 0;
}</lang> or by hand:
<lang c>int main(void) {
const char *string = "Hello, world!"; size_t length = 0; const char *p = string; while (*p++ != '\0') length++; return 0;
}</lang>
or (for arrays of char only)
<lang c>#include <stdlib.h>
int main(void) {
char s[] = "Hello, world!"; size_t length = sizeof s - 1; return 0;
}</lang>
Character Length
For wide character strings (usually Unicode uniform-width encodings such as UCS-2 or UCS-4):
<lang c>#include <stdio.h>
#include <wchar.h>

int main(void) {
    const wchar_t *s = L"\x304A\x306F\x3088\x3046"; /* Japanese hiragana ohayou */
    size_t length;

    length = wcslen(s);
    printf("Length in characters = %zu\n", length);
    /* s is a pointer, so sizeof(s) would give the pointer size, not the string size */
    printf("Length in bytes      = %zu\n", length * sizeof(wchar_t));
    return 0;
}</lang>
Dealing with raw multibyte string
The following code is written in UTF-8, and the environment locale is assumed to be UTF-8 too. Note that "møøse" is written directly in the source code here for clarity, which is not a good idea in general. mbstowcs(), when passed NULL as the first argument, effectively counts the number of characters in the given string under the current locale.
<lang c>#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");
    char moose[] = "møøse";
    printf("bytes: %d\n", (int)(sizeof(moose) - 1));
    printf("chars: %d\n", (int)mbstowcs(NULL, moose, 0));

    return 0;
}</lang>
Output:
bytes: 7 chars: 5
C++
Byte Length
<lang cpp>#include <string> // (not <string.h>!)
using std::string;

int main() {
    string s = "Hello, world!";
    string::size_type length = s.length(); // option 1: In Characters/Bytes
    string::size_type size = s.size();     // option 2: In Characters/Bytes
    // In bytes same as above since sizeof(char) == 1
    string::size_type bytes = s.length() * sizeof(string::value_type);
}</lang>
For wide character strings:
<lang cpp>#include <string>
using std::wstring;

int main() {
    wstring s = L"\u304A\u306F\u3088\u3046";
    wstring::size_type length = s.length() * sizeof(wstring::value_type); // in bytes
}</lang>
Character Length
For wide character strings:
<lang cpp>#include <string>
using std::wstring;

int main() {
    wstring s = L"\u304A\u306F\u3088\u3046";
    wstring::size_type length = s.length();
}</lang>
For narrow character strings:
<lang cpp>#include <iostream>
#include <codecvt>
#include <locale>   // for std::wstring_convert
#include <string>

int main() {
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b"; // U+007a, U+00df, U+6c34, U+1d10b
    std::cout << "Byte length: " << utf8.size() << '\n';
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}</lang>
<lang cpp>#include <cwchar>  // for mbstate_t
#include <locale>
#include <string>

// give the character length for a given named locale
std::size_t char_length(std::string const& text, char const* locale_name)
{
  // locales work on pointers; get length and data from string and
  // then don't touch the original string any more, to avoid
  // invalidating the data pointer
  std::size_t len = text.length();
  char const* input = text.data();

  // get the named locale
  std::locale loc(locale_name);

  // get the conversion facet of the locale
  typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
  cvt_type const& cvt = std::use_facet<cvt_type>(loc);

  // allocate buffer for conversion destination
  std::size_t bufsize = cvt.max_length()*len;
  wchar_t* destbuf = new wchar_t[bufsize];
  wchar_t* dest_end;

  // do the conversion
  mbstate_t state = mbstate_t();
  cvt.in(state, input, input+len, input, destbuf, destbuf+bufsize, dest_end);

  // determine the length of the converted sequence
  std::size_t length = dest_end - destbuf;

  // get rid of the buffer
  delete[] destbuf;

  // return the result
  return length;
}</lang>
Example usage (note that the locale names are OS specific):
<lang cpp>#include <iostream>

int main()
{
  // Tür (German for door) in UTF8
  std::cout << char_length("\x54\xc3\xbc\x72", "de_DE.utf8") << "\n"; // outputs 3

  // Tür in ISO-8859-1
  std::cout << char_length("\x54\xfc\x72", "de_DE") << "\n"; // outputs 3
}</lang>
Note that the strings are given as explicit hex sequences, so that the encoding used for the source code won't matter.
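For comparison, the same "decode under a named encoding, then count" approach can be sketched in Python. The helper `char_length` below is a hypothetical stand-in for the C++ function of the same name, and it takes Python codec names ("utf-8", "iso-8859-1") rather than OS locale names, which is an assumption of this sketch.

```python
def char_length(raw, encoding):
    """Decode raw bytes with the given codec and count the resulting code points."""
    return len(raw.decode(encoding))

# "Tür" as explicit byte sequences, so the source file encoding does not matter.
print(char_length(b"\x54\xc3\xbc\x72", "utf-8"))    # Tür in UTF-8
print(char_length(b"\x54\xfc\x72", "iso-8859-1"))   # Tür in ISO-8859-1
```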
C#
Platform: .NET
Character Length
<lang csharp>string s = "Hello, world!"; int characterLength = s.Length;</lang>
Byte Length
Strings in .NET are stored in Unicode. <lang csharp>using System.Text;
string s = "Hello, world!"; int byteLength = Encoding.Unicode.GetByteCount(s);</lang> To get the number of bytes that the string would require in a different encoding, e.g., UTF8: <lang csharp>int utf8ByteLength = Encoding.UTF8.GetByteCount(s);</lang>
Clean
Byte Length
Clean Strings are unboxed arrays of characters. Characters are always a single byte. The function size returns the number of elements in an array.
<lang clean>import StdEnv
strlen :: String -> Int strlen string = size string
Start = strlen "Hello, world!"</lang>
Clojure
Byte Length
<lang clojure>(count (.getBytes "Hello, world!")) ; 13 bytes
(count (.getBytes "π"))             ; two bytes</lang>
Character length
<lang clojure>(count "Hello, world!")</lang>
ColdFusion
Byte Length
<lang cfm>#len("Hello World")#</lang>
Character Length
<lang cfm>#len("Hello World")#</lang>
Common Lisp
Byte Length
In Common Lisp, there is no standard way to examine byte representations of characters, except perhaps to write a string to a file, then reopen the file as binary. However, specific implementations will have ways to do so. For example:
<lang lisp>(length (sb-ext:string-to-octets "Hello Wørld"))</lang> returns 12.
Character Length
Common Lisp represents strings as sequences of characters, not bytes, so there is no ambiguity about the encoding. The length function always returns the number of characters in a string. <lang lisp>(length "Hello World")</lang> returns 11, and
(length "Hello Wørld")
returns 11 too.
Component Pascal
Component Pascal encodes strings in UTF-16, which represents each character of the Basic Multilingual Plane with a single 16-bit value.
Character Length
<lang oberon2> MODULE TestLen;
IMPORT Out;
PROCEDURE DoCharLength*; VAR s: ARRAY 16 OF CHAR; len: INTEGER; BEGIN s := "møøse"; len := LEN(s$); Out.String("s: "); Out.String(s); Out.Ln; Out.String("Length of characters: "); Out.Int(len, 0); Out.Ln END DoCharLength;
END TestLen. </lang>
The $ in LEN(s$) selects the sequence of characters up to the terminating null character, so LEN(s$) returns the actual length of the string rather than the size allocated for the array variable.
Running command TestLen.DoCharLength gives following output:
s: møøse Length of characters: 5
Byte Length
<lang oberon2> MODULE TestLen;
IMPORT Out;
PROCEDURE DoByteLength*; VAR s: ARRAY 16 OF CHAR; len, v: INTEGER; BEGIN s := "møøse"; len := LEN(s$); v := SIZE(CHAR) * len; Out.String("s: "); Out.String(s); Out.Ln; Out.String("Length of characters in bytes: "); Out.Int(v, 0); Out.Ln END DoByteLength;
END TestLen. </lang>
Running command TestLen.DoByteLength gives following output:
s: møøse Length of characters in bytes: 10
D
Byte Length
<lang d>import std.stdio;
void showByteLen(T)(T[] str) {
writefln("Byte length: %2d - %(%02x%)", str.length * T.sizeof, cast(ubyte[])str);
}
void main() {
    string s1a = "møøse"; // UTF-8
    showByteLen(s1a);
    wstring s1b = "møøse"; // UTF-16
    showByteLen(s1b);
    dstring s1c = "møøse"; // UTF-32
    showByteLen(s1c);
    writeln();

    string s2a = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showByteLen(s2a);
    wstring s2b = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showByteLen(s2b);
    dstring s2c = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showByteLen(s2c);
    writeln();

    string s3a = "J̲o̲s̲é̲";
    showByteLen(s3a);
    wstring s3b = "J̲o̲s̲é̲";
    showByteLen(s3b);
    dstring s3c = "J̲o̲s̲é̲";
    showByteLen(s3c);
}</lang>
Output:
Byte length: 7 - 6dc3b8c3b87365
Byte length: 10 - 6d00f800f80073006500
Byte length: 20 - 6d000000f8000000f80000007300000065000000

Byte length: 28 - f09d9498f09d94abf09d94a6f09d94a0f09d94acf09d94a1f09d94a2
Byte length: 28 - 35d818dd35d82bdd35d826dd35d820dd35d82cdd35d821dd35d822dd
Byte length: 28 - 18d501002bd5010026d5010020d501002cd5010021d5010022d50100

Byte length: 14 - 4accb26fccb273ccb265cc81ccb2
Byte length: 18 - 4a0032036f00320373003203650001033203
Byte length: 36 - 4a000000320300006f000000320300007300000032030000650000000103000032030000
Character Length
<lang d>import std.stdio, std.range, std.conv;
void showCodePointsLen(T)(T[] str) {
writefln("Character length: %2d - %(%x %)", str.walkLength(), cast(uint[])to!(dchar[])(str));
}
void main() {
    string s1a = "møøse"; // UTF-8
    showCodePointsLen(s1a);
    wstring s1b = "møøse"; // UTF-16
    showCodePointsLen(s1b);
    dstring s1c = "møøse"; // UTF-32
    showCodePointsLen(s1c);
    writeln();

    string s2a = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showCodePointsLen(s2a);
    wstring s2b = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showCodePointsLen(s2b);
    dstring s2c = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
    showCodePointsLen(s2c);
    writeln();

    string s3a = "J̲o̲s̲é̲";
    showCodePointsLen(s3a);
    wstring s3b = "J̲o̲s̲é̲";
    showCodePointsLen(s3b);
    dstring s3c = "J̲o̲s̲é̲";
    showCodePointsLen(s3c);
}</lang>
Output:
Character length: 5 - 6d f8 f8 73 65
Character length: 5 - 6d f8 f8 73 65
Character length: 5 - 6d f8 f8 73 65

Character length: 7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522
Character length: 7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522
Character length: 7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522

Character length: 9 - 4a 332 6f 332 73 332 65 301 332
Character length: 9 - 4a 332 6f 332 73 332 65 301 332
Character length: 9 - 4a 332 6f 332 73 332 65 301 332
==Hello World</lang> Oz uses a single-byte encoding by default. So for normal strings, this will also show the correct character length.
PARI/GP
Character Length
Characters = bytes in Pari; the underlying strings are C strings interpreted as US-ASCII. <lang parigp>len(s)=#s; \\ Alternately, len(s)=length(s); or even len=length;</lang>
Byte Length
This works on objects of any sort, not just strings, and includes overhead. <lang parigp>len(s)=sizebyte(s);</lang>
Pascal
Byte Length
<lang pascal> const
s = 'abcdef';
begin
writeln (length(s))
end. </lang> Output:
6
Perl
Byte Length
Strings in Perl consist of characters. Measuring the byte length therefore requires conversion to some binary representation (called encoding, both noun and verb).
<lang perl>use utf8; # so we can use literal characters like ☺ in source
use Encode qw(encode);

print length encode 'UTF-8', "Hello, world! ☺";
# 17. The last character takes 3 bytes, the others 1 byte each.

print length encode 'UTF-16', "Hello, world! ☺";
# 32. 2 bytes for the BOM, then 15 byte pairs for each character.</lang>
Character Length
<lang perl>my $length = length "Hello, world!";</lang>
Grapheme Length
Since Perl 5.12, /\X/ matches an extended grapheme cluster. See "Unicode overhaul" in perl5120delta and also UAX #29.
Perl understands that "\x{1112}\x{1161}\x{11ab}\x{1100}\x{1173}\x{11af}" (한글) contains 2 graphemes, just like "\x{d55c}\x{ae00}" (한글). The longer string uses Korean combining jamo characters.
<lang perl>use v5.12;
my $string = "\x{1112}\x{1161}\x{11ab}\x{1100}\x{1173}\x{11af}";  # 한글
my $len;
$len++ while ($string =~ /\X/g);
printf "Grapheme length: %d\n", $len;</lang>
Output:
Grapheme length: 2
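The equivalence between the six-jamo string and the two precomposed syllables can be illustrated with Unicode normalization. The Python sketch below is not grapheme segmentation in the UAX #29 sense; it only shows that NFC composition maps the jamo sequence onto the same two code points. The function name `nfc_length` is hypothetical.

```python
import unicodedata

def nfc_length(s):
    """Number of code points after NFC (canonical composition)."""
    return len(unicodedata.normalize("NFC", s))

jamo = "\u1112\u1161\u11ab\u1100\u1173\u11af"   # conjoining jamo
syllables = "\ud55c\uae00"                       # precomposed syllables

print(len(jamo), nfc_length(jamo))
print(unicodedata.normalize("NFC", jamo) == syllables)
```

This works for Hangul because composition is defined algorithmically; for text such as "J̲o̲s̲é̲", where no precomposed forms exist for every cluster, NFC length and grapheme count still differ.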
Perl 6
Byte Length
<lang perl>say "møøse".utf8.bytes;</lang>
Character Length
<lang perl>say "møøse".chars;</lang>
PHP
Byte Length
<lang php>$length = strlen('Hello, world!');</lang>
Character Length
<lang php>$length = mb_strlen('Hello, world!', 'UTF-8'); // or whatever encoding</lang>
PicoLisp
<lang PicoLisp>(let Str "møøse"
(prinl "Character Length of \"" Str "\" is " (length Str)) (prinl "Byte Length of \"" Str "\" is " (size Str)) )</lang>
Output:
Character Length of "møøse" is 5 Byte Length of "møøse" is 7 -> 7
PL/I
<lang PL/I>
declare WS widechar (13) initial ('Hello world.');
put ('Character length=', length (WS));
put skip list ('Byte length=', size(WS));

declare SM graphic (13) initial ('Hello world');
put ('Character length=', length(SM));
put skip list ('Byte length=', size(trim(SM)));
</lang>
PL/SQL
LENGTH calculates length using characters as defined by the input character set. LENGTHB uses bytes instead of characters. LENGTHC uses Unicode complete characters. LENGTH2 uses UCS2 code points. LENGTH4 uses UCS4 code points.
Byte Length
<lang plsql>DECLARE
string VARCHAR2(50) := 'Hello, world!'; stringlength NUMBER;
BEGIN
stringlength := LENGTHB(string);
END;</lang>
Character Length
<lang plsql>DECLARE
string VARCHAR2(50) := 'Hello, world!'; stringlength NUMBER; unicodelength NUMBER; ucs2length NUMBER; ucs4length NUMBER;
BEGIN
stringlength := LENGTH(string); unicodelength := LENGTHC(string); ucs2length := LENGTH2(string); ucs4length := LENGTH4(string);
END;</lang>
Pop11
Byte Length
Currently Pop11 supports only strings consisting of 1-byte units. Strings can carry arbitrary binary data, so user can for example use UTF-8 (however builtin procedures will treat each byte as a single character). The length function for strings returns length in bytes:
<lang pop11>lvars str = 'Hello, world!'; lvars len = length(str);</lang>
PostScript
Character Length
<lang> (Hello World) length = 11 </lang>
PowerShell
Character Length
<lang powershell>$s = "Hëlló Wørłð"
$s.Length</lang>
Byte Length
For UTF-16, which is the default in .NET and therefore PowerShell:
<lang powershell>$s = "Hëlló Wørłð"
[System.Text.Encoding]::Unicode.GetByteCount($s)</lang>
For UTF-8:
<lang powershell>[System.Text.Encoding]::UTF8.GetByteCount($s)</lang>
PureBasic
Character Length
<lang PureBasic> a = Len("Hello World") ;a will be 11</lang>
Byte Length
Returns the number of bytes required to store the string in memory in the given format in bytes. 'Format' can be #PB_Ascii, #PB_UTF8 or #PB_Unicode. PureBasic code can be compiled using either Unicode (2-byte) or Ascii (1-byte) encodings for strings. If 'Format' is not specified, the mode of the executable (unicode or ascii) is used.
Note: The number of bytes returned does not include the terminating Null-Character of the string. The size of the Null-Character is 1 byte for Ascii and UTF8 mode and 2 bytes for Unicode mode.
<lang PureBasic>a = StringByteLength("ä", #PB_UTF8)    ;a will be 2
b = StringByteLength("ä", #PB_Ascii)   ;b will be 1
c = StringByteLength("ä", #PB_Unicode) ;c will be 2
</lang>
Python
2.x
In Python 2.x, there are two types of strings: regular (8-bit) strings, and Unicode strings. Unicode string literals are prefixed with "u".
Byte Length
For 8-bit strings, the byte length is the same as the character length:
>>> len('ascii')
5
For Unicode strings, byte length depends on the encoding. Python uses 2 or 4 bytes per character internally for Unicode strings, depending on how it was built. The internal representation is not interesting for the user.
# The letter Alef
>>> len(u'\u05d0'.encode('utf-8'))
2
>>> len(u'\u05d0'.encode('iso-8859-8'))
1
Example from the problem statement:
<lang python>#!/bin/env python
# -*- coding: UTF-8 -*-

s = u"møøse"
assert len(s) == 5
assert len(s.encode('UTF-8')) == 7
assert len(s.encode('UTF-16')) == 12 # The extra 2 bytes are a leading Unicode byte-order mark (BOM).</lang>
Character Length
len() returns the number of code units (not code points!) in a Unicode string or plain ASCII string. On a wide build, this is the same as the number of code points, but on a narrow one it is not. Python 2 therefore has no reliable way to get the number of characters in a string, because the answer varies according to the build. To get the length of an encoded string, you have to decode it first:
>>> len('ascii')
5
>>> len(u'\u05d0') # the letter Alef as unicode literal
1
>>> len('\xd7\x90'.decode('utf-8')) # Same encoded as utf-8 string
1
>>> len(unichr(0x1F4A9)) # shows that len() gives the wrong answer for non-BMP chars on a narrow build
2
3.x
In Python 3.x, strings are Unicode strings.
Byte Length
Byte length depends on the encoding. Before Python 3.3, Python used 2 or 4 bytes per character internally for Unicode strings, depending on how it was built; since 3.3 the internal width adapts to the contents of each string. The internal representation is not interesting for the user.
You can use len() to get the length of a byte sequence. To get a byte sequence from a string, you have to encode it with the desired encoding:
# The letter Alef
>>> len('\u05d0'.encode('utf-8'))
2
>>> len('\u05d0'.encode('iso-8859-8'))
1
Example from the problem statement:
<lang python>#!/bin/env python
# -*- coding: UTF-8 -*-

s = "møøse"
assert len(s) == 5
assert len(s.encode('UTF-8')) == 7
assert len(s.encode('UTF-16')) == 12 # The extra 2 bytes are a leading Unicode byte-order mark (BOM).</lang>
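The 12-byte UTF-16 result can be examined more closely with a small sketch (Python 3). The helper name `utf16_lengths` is hypothetical; the sketch relies on the standard codecs "utf-16" (which prepends a byte-order mark) and "utf-16-le" (which does not).

```python
import codecs

def utf16_lengths(s):
    """Return (bytes with BOM, bytes without BOM, whether a BOM was found)."""
    with_bom = s.encode("utf-16")        # native byte order, BOM prepended
    without_bom = s.encode("utf-16-le")  # explicit byte order, no BOM
    has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
    return len(with_bom), len(without_bom), has_bom

print(utf16_lengths("m\u00f8\u00f8se"))
```

Choosing an explicit byte order ("utf-16-le" or "utf-16-be") is therefore the simpler way to obtain the 10 bytes quoted in the task description.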
Character Length
On wide builds, and on Python 3.3 and later (where narrow builds no longer exist), len() returns the number of code points in a string. On a narrow build of Python 3.0 to 3.2 it returns the number of UTF-16 code units, which differs from the number of characters for text outside the BMP, so it is not a reliable character count there. To get the length of an encoded byte sequence, you have to decode it first:
>>> len('ascii')
5
>>> len(chr(0x1F4A9)) # 2 code units on a narrow build; 1 on a wide build or on Python 3.3+
2
>>> len('\u05d0') # the letter Alef as unicode literal
1
>>> len(b'\xd7\x90'.decode('utf-8')) # Same encoded as utf-8 byte sequence
1
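Where only UTF-16 code units are available (as on a narrow build), code points can still be counted by ignoring the trailing half of each surrogate pair. The following is a minimal sketch in Python 3 with hypothetical helper names, assuming well-formed UTF-16 (no unpaired surrogates).

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a tuple of integers."""
    raw = s.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)

def code_points_in_utf16(units):
    """Count code points in a sequence of UTF-16 code units."""
    # Each code point contributes exactly one unit that is not a low
    # (trailing) surrogate, so counting those gives the code point count.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

units = utf16_units(chr(0x1F4A9))
print(len(units), code_points_in_utf16(units))
```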
R
Byte length
<lang R>a <- "m\u00f8\u00f8se"
print(nchar(a, type="bytes")) # print 7</lang>
Character length
<lang R>print(nchar(a, type="chars")) # print 5</lang>
REBOL
Byte Length
REBOL 2.x does not natively support UCS (Unicode), so character and byte length are the same. See utf-8.r for an external UTF-8 library.
<lang REBOL>text: "møøse" print rejoin ["Byte length for '" text "': " length? text]</lang>
Output:
Byte length for 'møøse': 5
Retro
Byte Length
<lang Retro>"møøse" getLength putn</lang>
Character Length
Retro does not have built-in support for Unicode, but counting of characters can be done with a small amount of effort.
<lang Retro>chain: UTF8' {{
: utf+ ( $-$ ) [ 1+ dup @ %11000000 and %10000000 = ] while ;
: count ( $-$ ) 0 !here repeat dup @ 0; drop utf+ here ++ again ;
---reveal---
: getLength ( $-n ) count drop @here ;
}}
;chain
"møøse" ^UTF8'getLength putn</lang>
REXX
<lang REXX>sss='123456789abcdef'
say 'the length of sss is:' length(sss)</lang>
Output:
the length of sss is: 15
Ruby
Byte Length
Since Ruby 1.8.7, String#bytesize is the byte length.
<lang ruby># -*- coding: utf-8 -*-

puts "あいうえお".bytesize
# => 15</lang>
Character Length
Since Ruby 1.9, String#length (alias String#size) is the character length. The magic comment, "coding: utf-8", sets the encoding of all string literals in this file.
<lang ruby># -*- coding: utf-8 -*-

puts "あいうえお".length
# => 5
puts "あいうえお".size # alias for length
# => 5</lang>
Code Set Independence
The next examples show the byte length and character length of "møøse" in different encodings.
To run these programs, you must convert them to different encodings.
- If you use Emacs: Paste each program into Emacs. The magic comment, like -*- coding: iso-8859-1 -*-, will tell Emacs to save with that encoding.
- If your text editor saves UTF-8: Convert the file before running it. For example:
$ ruby -pe '$_.encode!("iso-8859-1", "utf-8")' scratch.rb | ruby
Program:
<lang ruby># -*- coding: iso-8859-1 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>
Output:
Byte length: 5
Character length: 5
Program:
<lang ruby># -*- coding: utf-8 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>
Output:
Byte length: 7
Character length: 5
Program:
<lang ruby># -*- coding: gb18030 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>
Output:
Byte length: 11
Character length: 5
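The same comparison can be made without re-saving the source file, by encoding one string into each encoding at run time. This is a sketch in Python 3 rather than Ruby; the helper name `byte_and_char_length` is hypothetical, and the codec names are those of Python's standard library.

```python
def byte_and_char_length(s, encoding):
    """Return (byte length in the given encoding, character length)."""
    return len(s.encode(encoding)), len(s)

text = "m\u00f8\u00f8se"
for enc in ("iso-8859-1", "utf-8", "gb18030"):
    byte_len, char_len = byte_and_char_length(text, enc)
    print("%-10s bytes: %2d  characters: %d" % (enc, byte_len, char_len))
```

The character length stays at 5 in every case; only the byte length depends on the encoding.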
Ruby 1.8
The next example works with both Ruby 1.8 and Ruby 1.9. In Ruby 1.8, strings have no encoding, and String#length is the byte length. Regular expressions in Ruby 1.8 know three multibyte encodings, selected by a flag:
- /./n uses no multibyte encoding.
- /./e uses EUC-JP.
- /./s uses Shift-JIS or Windows-31J.
- /./u uses UTF-8.
Then either string.scan(/./u).size or string.gsub(/./u, ' ').size counts the UTF-8 characters in string.
<lang ruby># -*- coding: utf-8 -*-

class String
  # Define String#bytesize for Ruby 1.8.6.
  unless method_defined?(:bytesize)
    alias bytesize length
  end
end

s = "文字化け"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.gsub(/./u, ' ').size</lang>
SAS
<lang sas>data _null_;
   a="Hello, World!";
   b=length(a);
   put _all_;
run;</lang>
Scheme
Byte Length
The string-size function is specific to Gauche.
<lang scheme>(string-size "Hello world")</lang>
The bytes-length function, applied to a byte string literal, is specific to PLT Scheme (Racket).
<lang scheme>(bytes-length #"Hello world")</lang>
Character Length
The string-length function is defined in R5RS and R6RS.
<lang scheme>(string-length "Hello world")</lang>
Seed7
Character Length
<lang seed7>length("Hello, world!")</lang>
Slate
<lang slate>'Hello, world!' length.</lang>
Smalltalk
Byte Length
<lang smalltalk>string := 'Hello, world!'. string size.</lang>
Character Length
In GNU Smalltalk:
<lang smalltalk>string := 'Hello, world!'. string numberOfCharacters.</lang>
requires loading the Iconv package:
<lang smalltalk>PackageLoader fileInPackage: 'Iconv'</lang>
SNOBOL4
Byte Length
<lang snobol4>
	output = "Byte length: " size(trim(input))
end
</lang>
Character Length
As far as is known, this example works only with CSnobol4 by Phil Budne.
<lang snobol4>
-include "utf.sno"
	output = "Char length: " utfsize(trim(input))
end
</lang>
Standard ML
Byte Length
<lang sml>val strlen = size "Hello, world!";</lang>
Character Length
<lang sml>val strlen = UTF8.size "Hello, world!";</lang>
Tcl
Byte Length
Formally, Tcl does not guarantee to use any particular representation for its strings internally (the underlying implementation objects can hold strings in at least three different formats, mutating between them as necessary), so the "byte length" of a string can only be calculated with respect to some user-selected encoding. This is done as follows (for UTF-8):
<lang tcl>string length [encoding convertto utf-8 $theString]</lang>
Thus, we have these examples:
<lang tcl>set s1 "hello, world"
set s2 "\u304A\u306F\u3088\u3046"
set enc utf-8
puts [format "length of \"%s\" in bytes is %d" \
     $s1 [string length [encoding convertto $enc $s1]]]
puts [format "length of \"%s\" in bytes is %d" \
     $s2 [string length [encoding convertto $enc $s2]]]</lang>
Character Length
Basic version:
<lang tcl>string length "Hello, world!"</lang>
Or, more elaborately (works with any Tcl 8.x interpreter; tested on 8.4.12):
<lang tcl>fconfigure stdout -encoding utf-8; #So that Unicode string will print correctly
set s1 "hello, world"
set s2 "\u304A\u306F\u3088\u3046"
puts [format "length of \"%s\" in characters is %d" $s1 [string length $s1]]
puts [format "length of \"%s\" in characters is %d" $s2 [string length $s2]]</lang>
TI-89 BASIC
The TI-89 uses a fixed 8-bit encoding, so there is no difference between character length and byte length.
<lang ti89b>■ dim("møøse") 5</lang>
Toka
Byte Length
<lang toka>" hello, world!" string.getLength</lang>
Trith
Character Length
<lang trith>"møøse" length</lang>
Byte Length
<lang trith>"møøse" size</lang>
TUSCRIPT
Character Length
<lang tuscript>
$$ MODE TUSCRIPT
string="hello, world"
l=LENGTH (string)
PRINT "character length of string '",string,"': ",l
</lang>
Output:
Character length of string 'hello, world': 12
UNIX Shell
Byte Length
With external utility:
<lang bash>string='Hello, world!'
length=`expr "x$string" : '.*' - 1`
echo $length # if you want it printed to the terminal</lang>
With SUSv3 parameter expansion modifier:
<lang bash>string='Hello, world!'
length="${#string}"
echo $length # if you want it printed to the terminal</lang>
Vala
Character Length
<lang vala> string s = "Hello, world!"; int characterLength = s.length; </lang>
VBA
Cf. VBScript (below).
VBScript
Byte Length
<lang vbscript>LenB(string|varname)</lang>
Returns the number of bytes required to store a string in memory. Returns null if string|varname is null.
Character Length
<lang vbscript>Len(string|varname)</lang>
Returns the length of the string|varname . Returns null if string|varname is null.
x86 Assembly
Byte Length
The following code uses AT&T syntax and was tested using AS (the portable GNU assembler) under Linux.
<lang x86 Assembly>.data
string:         .asciz "Test"

.text
.globl  main

main:
        pushl   %ebp
        movl    %esp, %ebp

        pushl   %edi
        xorb    %al, %al
        movl    $-1, %ecx
        movl    $string, %edi
        cld
        repne   scasb
        not     %ecx
        dec     %ecx
        popl    %edi

        # string length is stored in %ecx register

        leave
        ret
</lang>
XPL0
<lang XPL0>include c:\cxpl\stdlib; IntOut(0, StrLen("Character length = Byte length = String length = "))</lang>
Output:
49
XSLT
Character Length
<lang xml><?xml version="1.0" encoding="UTF-8"?></lang>
...
<lang xml><xsl:value-of select="string-length('møøse')" /> </lang>
xTalk
Byte Length
<lang xtalk>put the length of "Hello World"</lang>
or
<lang xtalk>put the number of characters in "Hello World"</lang>
Character Length
<lang xtalk>put the length of "Hello World"</lang>
or
<lang xtalk>put the number of characters in "Hello World"</lang>
Yorick
Character Length
<lang yorick>strlen("Hello, world!")</lang>