String length: Difference between revisions


Revision as of 15:51, 28 December 2012

In this task, the goal is to find the character and byte length of a string. This means encodings like UTF-8 need to be handled properly, as there is not necessarily a one-to-one relationship between bytes and characters. By character, we mean an individual Unicode code point, not a user-visible grapheme containing combining characters. For example, the character length of "møøse" is 5 but the byte length is 7 in UTF-8 and 10 in UTF-16.

Non-BMP code points (those between 0x10000 and 0x10FFFF) must also be handled correctly: answers should produce actual character counts in code points, not in code unit counts. Therefore a string like "𝔘𝔫𝔦𝔠𝔬𝔡𝔢" (consisting of the 7 Unicode characters U+1D518 U+1D52B U+1D526 U+1D520 U+1D52C U+1D521 U+1D522) is 7 characters long, not 14 UTF-16 code units; and it is 28 bytes long whether encoded in UTF-8 or in UTF-16.
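As a rough illustration of the distinction drawn above, the following Python sketch reports code points, UTF-8 bytes, UTF-16 bytes and UTF-16 code units for the two sample strings. It assumes Python 3.3 or later, where len() on a string counts code points; the helper name lengths is made up for this example.

```python
def lengths(s):
    """Return (code points, UTF-8 bytes, UTF-16 bytes, UTF-16 code units)."""
    # "utf-16-le" fixes the byte order, so no byte-order mark is prepended.
    utf16 = s.encode("utf-16-le")
    return len(s), len(s.encode("utf-8")), len(utf16), len(utf16) // 2

moose = "m\u00f8\u00f8se"  # "møøse"
fraktur = "\U0001D518\U0001D52B\U0001D526\U0001D520\U0001D52C\U0001D521\U0001D522"  # "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"

print(lengths(moose))    # (5, 7, 10, 5)
print(lengths(fraktur))  # (7, 28, 28, 14)
```

The non-BMP string shows why code units are the wrong measure: it has 14 UTF-16 code units but only 7 characters.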

Please mark your examples with ===Character Length=== or ===Byte Length===. If your language is capable of providing the string length in graphemes, mark those examples with ===Grapheme Length===. For example, the string "J̲o̲s̲é̲" ("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}") has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
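The three measures for "J̲o̲s̲é̲" can be checked with a short Python sketch. Note the assumption: the grapheme count here is only an approximation that skips combining marks (canonical combining class other than 0); it is not the full UAX #29 segmentation algorithm and will miscount other kinds of clusters. The function name approx_grapheme_count is hypothetical.

```python
import unicodedata

def approx_grapheme_count(s):
    """Count code points that are not combining marks (combining class 0)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

jose = "J\u0332o\u0332s\u0332e\u0301\u0332"  # "J̲o̲s̲é̲"

print(approx_grapheme_count(jose))  # 4
print(len(jose))                    # 9
print(len(jose.encode("utf-8")))    # 14
```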

4D

Byte Length

<lang 4d>$length:=Length("Hello, world!")</lang>

ActionScript

Character Length

<lang actionscript>myStrVar.length</lang>

Ada

Works with: GCC version 4.1.2

Byte Length

<lang ada>Str    : String := "Hello World";
Length : constant Natural := Str'Size / 8;</lang>
The 'Size attribute returns the size of an object in bits. If "byte" is taken to mean an octet, the length in bytes is 'Size divided by 8. Note that an octet is not necessarily the machine storage unit: a portable program should use System.Storage_Unit, which yields the number of bits in a storage unit on the current machine, instead of the magic number 8. Furthermore, the size of a string object is not the same as the length of its contents, in whatever unit of measurement. A String object may carry a "dope" vector that holds the array bounds, and its size can even be 0 if the compiler optimized the object away. In most cases, therefore, "byte length" makes little sense in Ada.

Character Length

<lang ada>Latin_1_Str    : String           := "Hello World";
UCS_16_Str     : Wide_String      := "Hello World";
Unicode_Str    : Wide_Wide_String := "Hello World";
Latin_1_Length : constant Natural := Latin_1_Str'Length;
UCS_16_Length  : constant Natural := UCS_16_Str'Length;
Unicode_Length : constant Natural := Unicode_Str'Length;</lang>
The 'Length attribute yields the number of elements of an array. Since strings in Ada are arrays of characters, 'Length is the string length. Ada supports strings of Latin-1 (String), 16-bit (Wide_String) and full Unicode (Wide_Wide_String) characters. In the example above, the character length of all three strings is 11. The sizes of the objects in bits will differ.

ALGOL 68

Bits and Bytes Length

<lang algol68>BITS bits := bits pack((TRUE, TRUE, FALSE, FALSE)); # packed array of BOOL # BYTES bytes := bytes pack("Hello, world"); # packed array of CHAR # print((

 "BITS and BYTES are fixed width:", new line,
 "bits width:", bits width, ", max bits: ", max bits, ", bits:", bits, new line,
 "bytes width: ",bytes width, ", UPB:",UPB STRING(bytes), ", string:", STRING(bytes),"!", new line

))</lang> Output:

BITS and BYTES are fixed width:
bits width:        +32, max bits: TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, bits:TTFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
bytes width:         +32, UPB:        +32, string:Hello, world!

Character Length

<lang algol68>STRING str := "hello, world"; INT length := UPB str; printf(($"Length of """g""" is "g(3)l$,str,length));

printf(($l"STRINGS can start at -1, in which case LWB must be used:"l$)); STRING s := "abcd"[@-1]; print(("s:",s, ", LWB:", LWB s, ", UPB:",UPB s, ", LEN:",UPB s - LWB s + 1))</lang> Output:

Length of "hello, world" is +12
STRINGS can start at -1, in which case LWB must be used:
s:abcd, LWB:         -1, UPB:         +2, LEN:         +4

AppleScript

Byte Length

This example may be incorrect due to a recent change in the task requirements or a lack of testing. Please verify it and remove this message. If the example does not match the requirements or does not work, replace this message with Template:incorrect or fix the code yourself.

<lang applescript>count of "Hello World"</lang>
Mac OS X 10.5 (Leopard) includes AppleScript 2.0, which uses only Unicode (UTF-16) character strings. This example has not been tested and may not work on previous versions of AppleScript.
<lang applescript>set inString to "Hello World" as Unicode text
set byteCount to 0
set idList to id of inString

repeat with incr in idList
	set byteCount to byteCount + 2
	if incr as integer > 65535 then
		set byteCount to byteCount + 2
	end if
end repeat

byteCount</lang>
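The counting rule used by the loop above (two bytes per code point, plus two more for any code point above 65535) can be sketched in Python for comparison. This is an illustration only, assuming Python 3.3 or later so that iterating over a string yields code points; the function name utf16_byte_length is made up here.

```python
def utf16_byte_length(s):
    """Two bytes per BMP code point, four per code point above U+FFFF (no BOM)."""
    return sum(4 if ord(ch) > 0xFFFF else 2 for ch in s)

print(utf16_byte_length("Hello World"))  # 22
print(utf16_byte_length("\U0001D518"))   # 4
```

The result agrees with the length of the string encoded as UTF-16 without a byte-order mark.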

Character Length

<lang applescript>count of "Hello World"</lang> Or: <lang applescript>count "Hello World"</lang>

AutoHotkey

Character Length

<lang AutoHotkey>Msgbox % StrLen("Hello World")</lang>
Or:
<lang AutoHotkey>String := "Hello World"
StringLen, Length, String
Msgbox % Length</lang>

AWK

Byte Length

From within any code block:
<lang awk>w=length("Hello, world!")      # static string example
x=length("Hello," s " world!") # dynamic string example
y=length($1)                   # input field example
z=length(s)                    # variable name example</lang>
Ad hoc program from command line:

 echo "Hello, wørld!" | awk '{print length($0)}'   # 14

From executable script: (prints for every line arriving on stdin)
<lang awk>#!/usr/bin/awk -f
{print"The length of this line is "length($0)}</lang>

Batch File

Byte Length

<lang dos>@echo off
setlocal enabledelayedexpansion
call :length %1 res
echo length of %1 is %res%
goto :eof

:length
set str=%~1
set cnt=0

:loop
if "%str%" equ "" (
	set %2=%cnt%
	goto :eof
)
set str=!str:~1!
set /a cnt = cnt + 1
goto loop</lang>

BASIC

Character Length

Works with: QBasic

Works with: Liberty BASIC

Works with: PowerBASIC version PB/CC, PB/DOS

BASIC only supports single-byte characters. The character "ø" is converted to "°" for printing to the console and length functions, but will still output to a file as "ø". <lang qbasic> INPUT a$

PRINT LEN(a$)</lang>

ZX Spectrum Basic

The ZX Spectrum needs line numbers:

<lang zxbasic>10 INPUT a$
20 PRINT LEN(a$)</lang>

BBC BASIC

Character Length

<lang bbcbasic> INPUT text$

     PRINT LEN(text$)</lang>

Byte Length

Works with: BBC BASIC for Windows

<lang bbcbasic> CP_ACP = 0

     CP_UTF8 = &FDE9
     
     textA$ = "møøse"
     textW$ = "                 "
     textU$ = "                 "
     
     SYS "MultiByteToWideChar", CP_ACP, 0, textA$, -1, !^textW$, LEN(textW$)/2 TO nW%
     SYS "WideCharToMultiByte", CP_UTF8, 0, textW$, -1, !^textU$, LEN(textU$), 0, 0
     PRINT "Length in bytes (ANSI encoding) = " ; LEN(textA$)
     PRINT "Length in bytes (UTF-16 encoding) = " ; 2*(nW%-1)
     PRINT "Length in bytes (UTF-8 encoding) = " ; LEN($$!^textU$)</lang>

Output:

Length in bytes (ANSI encoding) = 5
Length in bytes (UTF-16 encoding) = 10
Length in bytes (UTF-8 encoding) = 7

Bracmat

The solutions work with UTF-8 encoded strings.

Byte Length

<lang bracmat>(ByteLength=

 length

. @(!arg:? [?length)

 & !length

);

out$ByteLength$𝔘𝔫𝔦𝔠𝔬𝔡𝔢</lang> Answer: 28

Character Length

<lang bracmat>(CharacterLength=

 length c

. 0:?length

   & @( !arg
      :   ?
          ( %?c
          & utf$!c:?k
          & 1+!length:?length
          & ~
          )
          ?
      )
 | !length

);

out$CharacterLength$𝔘𝔫𝔦𝔠𝔬𝔡𝔢</lang> Answer: 7

An improved version scans the input string character-wise, not byte-wise, so string positions that cannot be the start of a UTF-8 sequence are never tried. The patterns [!p and [?p implement a ratchet mechanism: [!p anchors the match at the start of a character, and [?p remembers the end of that character, which becomes the start position of the next character. <lang bracmat>(CharacterLength=

 length c p

. 0:?length:?p

   & @( !arg
      :   ?
          ( [!p %?c
          & utf$!c:?k
          & 1+!length:?length
          )
          ([?p&~)
          ?
      )
 | !length

);</lang>
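The idea of stepping from one character start to the next, rather than testing every byte position, can be sketched in Python as follows. This is not a translation of the Bracmat code: it assumes well-formed UTF-8 input, derives the width of each sequence from its lead byte, and uses the made-up name utf8_char_count.

```python
def utf8_char_count(data):
    """Count characters in well-formed UTF-8 by jumping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single-byte (ASCII) character
            width = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
            width = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
            width = 3
        elif lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
            width = 4
        else:                        # a continuation byte cannot start a character
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        i += width                   # the end of this character is the start of the next
        count += 1
    return count

fraktur = "\U0001D518\U0001D52B\U0001D526\U0001D520\U0001D52C\U0001D521\U0001D522"
print(utf8_char_count(fraktur.encode("utf-8")))  # 7
```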

C

Byte Length

Works with: ANSI C

Works with: GCC version 3.3.3

<lang c>#include <string.h>

int main(void) {

 const char *string = "Hello, world!";
 size_t length = strlen(string);
        
 return 0;

}</lang> or by hand:

<lang c>int main(void) {

 const char *string = "Hello, world!";
 size_t length = 0;
 
 const char *p = string;
 while (*p++ != '\0') length++;                                         
 
 return 0;

}</lang>

or (for arrays of char only)

<lang c>#include <stdlib.h>

int main(void) {

 char s[] = "Hello, world!";
 size_t length = sizeof s - 1;
 
 return 0;

}</lang>

Character Length

For wide character strings (usually Unicode uniform-width encodings such as UCS-2 or UCS-4):

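<lang c>#include <stdio.h>
#include <wchar.h>

int main(void) {

   wchar_t *s = L"\x304A\x306F\x3088\x3046"; /* Japanese hiragana ohayou */
   size_t length;

   length = wcslen(s);
   /* size_t is cast to unsigned long so that %lu matches the argument type */
   printf("Length in characters = %lu\n", (unsigned long)length);
   /* s is a pointer, so sizeof(s) would give the pointer size, not the string size */
   printf("Length in bytes      = %lu\n", (unsigned long)(length * sizeof(wchar_t)));
   
   return 0;

}</lang>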

Dealing with raw multibyte string

The following code is written in UTF-8, and the environment locale is assumed to be UTF-8 too. Note that "møøse" is written directly in the source code here for clarity, which is not a good idea in general. mbstowcs(), when passed NULL as its first argument, effectively counts the number of characters in the given string under the current locale. <lang c>#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
	setlocale(LC_CTYPE, "");
	char moose[] = "møøse";
	/* sizeof yields a size_t, so cast it to match %d */
	printf("bytes: %d\n", (int)(sizeof(moose) - 1));
	printf("chars: %d\n", (int)mbstowcs(0, moose, 0));

	return 0;
}</lang> Output:

bytes: 7
chars: 5

C++

Byte Length

Works with: ISO C++

Works with: g++ version 4.0.2

<lang cpp>#include <string> // (not <string.h>!)
using std::string;

int main() {

 string s = "Hello, world!";
 string::size_type length = s.length(); // option 1: In Characters/Bytes
 string::size_type size = s.size();     // option 2: In Characters/Bytes
 // In bytes same as above since sizeof(char) == 1
 string::size_type bytes = s.length() * sizeof(string::value_type);

}</lang> For wide character strings:

<lang cpp>#include <string>
using std::wstring;

int main() {

 wstring s = L"\u304A\u306F\u3088\u3046";
 wstring::size_type length = s.length() * sizeof(wstring::value_type); // in bytes

}</lang>

Character Length

Works with: C++98

Works with: g++ version 4.0.2

For wide character strings:

<lang cpp>#include <string>
using std::wstring;

int main() {

 wstring s = L"\u304A\u306F\u3088\u3046";
 wstring::size_type length = s.length();

}</lang>

For narrow character strings:

Works with: C++11

Works with: clang++ version 3.0

<lang cpp>#include <iostream>
#include <string>
#include <locale>  // std::wstring_convert
#include <codecvt>

int main() {

   std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b"; // U+007a, U+00df, U+6c34, U+1d10b
   std::cout << "Byte length: " << utf8.size() << '\n';
   std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
   std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';

}</lang>

Works with: C++98

Works with: g++ version 4.1.2 20061115 (prerelease) (SUSE Linux)

<lang cpp>#include <cwchar>  // for mbstate_t
#include <locale>
#include <string>

// give the character length for a given named locale
std::size_t char_length(std::string const& text, char const* locale_name) {

 // locales work on pointers; get length and data from string and
 // then don't touch the original string any more, to avoid
 // invalidating the data pointer
 std::size_t len = text.length();
 char const* input = text.data();

 // get the named locale
 std::locale loc(locale_name);

 // get the conversion facet of the locale
 typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
 cvt_type const& cvt = std::use_facet<cvt_type>(loc);

 // allocate buffer for conversion destination
 std::size_t bufsize = cvt.max_length()*len;
 wchar_t* destbuf = new wchar_t[bufsize];
 wchar_t* dest_end;

 // do the conversion
 mbstate_t state = mbstate_t();
 cvt.in(state, input, input+len, input, destbuf, destbuf+bufsize, dest_end);

 // determine the length of the converted sequence
 std::size_t length = dest_end - destbuf;

 // get rid of the buffer
 delete[] destbuf;

 // return the result
 return length;

}</lang>

Example usage (note that the locale names are OS specific):

<lang cpp>#include <iostream>

int main() {

 // Tür (German for door) in UTF8
 std::cout << char_length("\x54\xc3\xbc\x72", "de_DE.utf8") << "\n"; // outputs 3

 // Tür in ISO-8859-1
 std::cout << char_length("\x54\xfc\x72", "de_DE") << "\n"; // outputs 3

}</lang>

Note that the strings are given as explicit hex sequences, so that the encoding used for the source code won't matter.

C#

Platform: .NET

Works with: C# version 1.0+

Character Length

<lang csharp>string s = "Hello, world!"; int characterLength = s.Length;</lang>

Byte Length

Strings in .NET are stored as UTF-16, which .NET calls Encoding.Unicode. <lang csharp>using System.Text;

string s = "Hello, world!"; int byteLength = Encoding.Unicode.GetByteCount(s);</lang> To get the number of bytes that the string would require in a different encoding, e.g., UTF8: <lang csharp>int utf8ByteLength = Encoding.UTF8.GetByteCount(s);</lang>

Clean

Byte Length

Clean Strings are unboxed arrays of characters. Characters are always a single byte. The function size returns the number of elements in an array.

<lang clean>import StdEnv

strlen :: String -> Int
strlen string = size string

Start = strlen "Hello, world!"</lang>

Clojure

Byte Length

<lang clojure>(count (.getBytes "Hello, world!")) ; 13 bytes
(count (.getBytes "π")) ; two bytes</lang>

Character length

<lang clojure>(count "Hello, world!")</lang>

ColdFusion

Byte Length

This example may be incorrect due to a recent change in the task requirements or a lack of testing. Please verify it and remove this message. If the example does not match the requirements or does not work, replace this message with Template:incorrect or fix the code yourself.

<lang cfm>#len("Hello World")#</lang>

Character Length

<lang cfm>#len("Hello World")#</lang>

Common Lisp

Byte Length

In Common Lisp, there is no standard way to examine byte representations of characters, except perhaps to write a string to a file, then reopen the file as binary. However, specific implementations will have ways to do so. For example:

Works with: SBCL

<lang lisp>(length (sb-ext:string-to-octets "Hello Wørld"))</lang> returns 12.

Character Length

Common Lisp represents strings as sequences of characters, not bytes, so there is no ambiguity about the encoding. The length function always returns the number of characters in a string. <lang lisp>(length "Hello World")</lang> returns 11, and

(length "Hello Wørld")

returns 11 too.

Component Pascal

Component Pascal stores strings as arrays of 16-bit CHAR values (UTF-16 code units), so each character in the Basic Multilingual Plane occupies 2 bytes.

Character Length

<lang oberon2> MODULE TestLen;

IMPORT Out;

PROCEDURE DoCharLength*; VAR s: ARRAY 16 OF CHAR; len: INTEGER; BEGIN s := "møøse"; len := LEN(s$); Out.String("s: "); Out.String(s); Out.Ln; Out.String("Length of characters: "); Out.Int(len, 0); Out.Ln END DoCharLength;

END TestLen. </lang>

The $ in LEN(s$) selects the sequence of characters up to the terminating null character, so LEN(s$) returns the actual number of characters in the string rather than the allocated size of the array.

Running command TestLen.DoCharLength gives following output:

s: møøse
Length of characters: 5

Byte Length

<lang oberon2> MODULE TestLen;

IMPORT Out;

PROCEDURE DoByteLength*; VAR s: ARRAY 16 OF CHAR; len, v: INTEGER; BEGIN s := "møøse"; len := LEN(s$); v := SIZE(CHAR) * len; Out.String("s: "); Out.String(s); Out.Ln; Out.String("Length of characters in bytes: "); Out.Int(v, 0); Out.Ln END DoByteLength;

END TestLen. </lang>

Running command TestLen.DoByteLength gives following output:

s: møøse
Length of characters in bytes: 10

D

Byte Length

<lang d>import std.stdio;

void showByteLen(T)(T[] str) {

   writefln("Byte length: %2d - %(%02x%)",
            str.length * T.sizeof, cast(ubyte[])str);

}

void main() {

   string s1a = "møøse"; // UTF-8
   showByteLen(s1a);
   wstring s1b = "møøse"; // UTF-16
   showByteLen(s1b);
   dstring s1c = "møøse"; // UTF-32
   showByteLen(s1c);
   writeln();

   string s2a = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showByteLen(s2a);
   wstring s2b = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showByteLen(s2b);
   dstring s2c = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showByteLen(s2c);
   writeln();

   string s3a = "J̲o̲s̲é̲";
   showByteLen(s3a);
   wstring s3b = "J̲o̲s̲é̲";
   showByteLen(s3b);
   dstring s3c = "J̲o̲s̲é̲";
   showByteLen(s3c);

}</lang>

Output:

Byte length:  7 - 6dc3b8c3b87365
Byte length: 10 - 6d00f800f80073006500
Byte length: 20 - 6d000000f8000000f80000007300000065000000

Byte length: 28 - f09d9498f09d94abf09d94a6f09d94a0f09d94acf09d94a1f09d94a2
Byte length: 28 - 35d818dd35d82bdd35d826dd35d820dd35d82cdd35d821dd35d822dd
Byte length: 28 - 18d501002bd5010026d5010020d501002cd5010021d5010022d50100

Byte length: 14 - 4accb26fccb273ccb265cc81ccb2
Byte length: 18 - 4a0032036f00320373003203650001033203
Byte length: 36 - 4a000000320300006f000000320300007300000032030000650000000103000032030000

Character Length

<lang d>import std.stdio, std.range, std.conv;

void showCodePointsLen(T)(T[] str) {

   writefln("Character length: %2d - %(%x %)",
            str.walkLength(), cast(uint[])to!(dchar[])(str));

}

void main() {

   string s1a = "møøse"; // UTF-8
   showCodePointsLen(s1a);
   wstring s1b = "møøse"; // UTF-16
   showCodePointsLen(s1b);
   dstring s1c = "møøse"; // UTF-32
   showCodePointsLen(s1c);
   writeln();

   string s2a = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showCodePointsLen(s2a);
   wstring s2b = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showCodePointsLen(s2b);
   dstring s2c = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
   showCodePointsLen(s2c);
   writeln();

   string s3a = "J̲o̲s̲é̲";
   showCodePointsLen(s3a);
   wstring s3b = "J̲o̲s̲é̲";
   showCodePointsLen(s3b);
   dstring s3c = "J̲o̲s̲é̲";
   showCodePointsLen(s3c);

}</lang>

Output:

Character length:  5 - 6d f8 f8 73 65
Character length:  5 - 6d f8 f8 73 65
Character length:  5 - 6d f8 f8 73 65

Character length:  7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522
Character length:  7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522
Character length:  7 - 1d518 1d52b 1d526 1d520 1d52c 1d521 1d522

Character length:  9 - 4a 332 6f 332 73 332 65 301 332
Character length:  9 - 4a 332 6f 332 73 332 65 301 332
Character length:  9 - 4a 332 6f 332 73 332 65 301 332

==Hello World</lang> Oz uses a single-byte encoding by default. So for normal strings, this will also show the correct character length.

PARI/GP

Character Length

Characters = bytes in Pari; the underlying strings are C strings interpreted as US-ASCII. <lang parigp>len(s)=#s; \\ Alternately, len(s)=length(s); or even len=length;</lang>

Byte Length

This works on objects of any sort, not just strings, and includes overhead. <lang parigp>len(s)=sizebyte(s);</lang>

Pascal

Byte Length

<lang pascal> const

 s = 'abcdef';

begin

 writeln (length(s))

end. </lang> Output:

6

Perl

Byte Length

Works with: Perl version 5.8

Strings in Perl consist of characters. Measuring the byte length therefore requires conversion to some binary representation (called encoding, both noun and verb).

<lang perl>use utf8; # so we can use literal characters like ☺ in source
use Encode qw(encode);

print length encode 'UTF-8', "Hello, world! ☺";
# 17. The last character takes 3 bytes, the others 1 byte each.

print length encode 'UTF-16', "Hello, world! ☺";
# 32. 2 bytes for the BOM, then 15 byte pairs for each character.</lang>

Character Length

Works with: Perl version 5.X

<lang perl>my $length = length "Hello, world!";</lang>

Grapheme Length

Since Perl 5.12, /\X/ matches an extended grapheme cluster. See "Unicode overhaul" in perl5120delta and also UAX #29.

Perl understands that "\x{1112}\x{1161}\x{11ab}\x{1100}\x{1173}\x{11af}" (한글) contains 2 graphemes, just like "\x{d55c}\x{ae00}" (한글). The longer string uses Korean combining jamo characters.

Works with: Perl version 5.12

<lang perl>use v5.12;
my $string = "\x{1112}\x{1161}\x{11ab}\x{1100}\x{1173}\x{11af}";  # 한글
my $len;
$len++ while ($string =~ /\X/g);
printf "Grapheme length: %d\n", $len;</lang>

Output:

Grapheme length: 2
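The equivalence between the two spellings of 한글 can also be checked with Unicode normalization. The Python sketch below is only an illustration: NFC composition is not grapheme segmentation, but for this particular jamo sequence it yields the two precomposed syllables mentioned above.

```python
import unicodedata

jamo = "\u1112\u1161\u11ab\u1100\u1173\u11af"  # conjoining jamo spelling of 한글
composed = unicodedata.normalize("NFC", jamo)  # canonical composition

print(len(jamo))                   # 6
print(len(composed))               # 2
print(composed == "\ud55c\uae00")  # True
```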

Perl 6

Byte Length

<lang perl>say "møøse".utf8.bytes;</lang>

Character Length

<lang perl>say "møøse".chars;</lang>

PHP

Byte Length

<lang php>$length = strlen('Hello, world!');</lang>

Character Length

<lang php>$length = mb_strlen('Hello, world!', 'UTF-8'); // or whatever encoding</lang>

PicoLisp

<lang PicoLisp>(let Str "møøse"

  (prinl "Character Length of \"" Str "\" is " (length Str))
  (prinl "Byte Length of \"" Str "\" is " (size Str)) )</lang>

Output:

Character Length of "møøse" is 5
Byte Length of "møøse" is 7
-> 7

PL/I

<lang PL/I>declare WS widechar (13) initial ('Hello world.');
put ('Character length=', length (WS));
put skip list ('Byte length=', size(WS));

declare SM graphic (13) initial ('Hello world');
put ('Character length=', length(SM));
put skip list ('Byte length=', size(trim(SM)));</lang>

PL/SQL

LENGTH calculates length using characters as defined by the input character set. LENGTHB uses bytes instead of characters. LENGTHC uses Unicode complete characters. LENGTH2 uses UCS2 code points. LENGTH4 uses UCS4 code points.

Byte Length

<lang plsql>DECLARE

 string VARCHAR2(50) := 'Hello, world!';
 stringlength NUMBER;

BEGIN

 stringlength := LENGTHB(string);

END;</lang>

Character Length

<lang plsql>DECLARE

 string VARCHAR2(50) := 'Hello, world!';
 stringlength NUMBER;
 unicodelength NUMBER;
 ucs2length NUMBER;
 ucs4length NUMBER;

BEGIN

 stringlength := LENGTH(string);
 unicodelength := LENGTHC(string);
 ucs2length := LENGTH2(string);
 ucs4length := LENGTH4(string);

END;</lang>

Pop11

Byte Length

Currently Pop11 supports only strings consisting of 1-byte units. Strings can carry arbitrary binary data, so a user can, for example, store UTF-8 in them (although built-in procedures will treat each byte as a single character). The length function for strings returns the length in bytes:

<lang pop11>lvars str = 'Hello, world!'; lvars len = length(str);</lang>

PostScript

Character Length

<lang>(Hello World) length =
% prints: 11</lang>

PowerShell

Character Length

<lang powershell>$s = "Hëlló Wørłð"
$s.Length</lang>

Byte Length

Translation of: C#

For UTF-16, which is the default in .NET and therefore PowerShell: <lang powershell>$s = "Hëlló Wørłð"
[System.Text.Encoding]::Unicode.GetByteCount($s)</lang> For UTF-8: <lang powershell>[System.Text.Encoding]::UTF8.GetByteCount($s)</lang>

PureBasic

Character Length

<lang PureBasic> a = Len("Hello World") ;a will be 11</lang>

Byte Length

StringByteLength() returns the number of bytes required to store the string in memory in the given format. 'Format' can be #PB_Ascii, #PB_UTF8 or #PB_Unicode. PureBasic code can be compiled using either Unicode (2-byte) or Ascii (1-byte) encodings for strings. If 'Format' is not specified, the mode of the executable (Unicode or Ascii) is used.

Note: The number of bytes returned does not include the terminating Null-Character of the string. The size of the Null-Character is 1 byte for Ascii and UTF8 mode and 2 bytes for Unicode mode.

<lang PureBasic>a = StringByteLength("ä", #PB_UTF8)    ;a will be 2
b = StringByteLength("ä", #PB_Ascii)   ;b will be 1
c = StringByteLength("ä", #PB_Unicode) ;c will be 2</lang>

Python

2.x

In Python 2.x, there are two types of strings: regular (8-bit) strings, and Unicode strings. Unicode string literals are prefixed with "u".

Byte Length

Works with: Python version 2.x

For 8-bit strings, the byte length is the same as the character length:

>>> len('ascii')
5

For Unicode strings, byte length depends on the encoding. Internally, Python 2 uses 2 or 4 bytes per character for Unicode strings, depending on how it was built; this internal representation is of no concern to the user.

# The letter Alef
>>> len(u'\u05d0'.encode('utf-8'))
2
>>> len(u'\u05d0'.encode('iso-8859-8'))
1

Example from the problem statement: <lang python>#!/usr/bin/env python
# -*- coding: UTF-8 -*-

s = u"møøse"
assert len(s) == 5
assert len(s.encode('UTF-8')) == 7
assert len(s.encode('UTF-16')) == 12 # 10 bytes of data plus a leading 2-byte Unicode byte-order mark (BOM)</lang>

Character Length

Works with: Python version 2.4

len() returns the number of code units (not code points!) in a Unicode string or plain ASCII string. On a wide build, this is the same as the number of code points, but on a narrow build it is not. With len() alone, Python 2 therefore has no build-independent way to get the number of characters in a string. To get the character length of an encoded string, you have to decode it first:

>>> len('ascii')
5
>>> len(u'\u05d0') # the letter Alef as unicode literal
1
>>> len('\xd7\x90'.decode('utf-8')) # Same encoded as utf-8 string
1
>>> len(unichr(0x1F4A9))  # shows that len() gives the wrong answer for non-BMP chars on a narrow build
2

3.x

In Python 3.x, strings are Unicode strings.

Byte Length

Byte length depends on the encoding. Before Python 3.3, Python used 2 or 4 bytes per character internally for strings, depending on how it was built; since Python 3.3 the internal width is chosen per string. Either way, the internal representation is of no concern to the user.

You can use len() to get the length of a byte sequence. To get a byte sequence from a string, you have to encode it with the desired encoding:

# The letter Alef
>>> len('\u05d0'.encode('utf-8'))
2
>>> len('\u05d0'.encode('iso-8859-8'))
1

Example from the problem statement: <lang python>#!/usr/bin/env python
# -*- coding: UTF-8 -*-

s = "møøse"
assert len(s) == 5
assert len(s.encode('UTF-8')) == 7
assert len(s.encode('UTF-16')) == 12 # 10 bytes of data plus a leading 2-byte Unicode byte-order mark (BOM)</lang>
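The two extra bytes in the UTF-16 result can be inspected directly. The following Python 3 sketch compares the generic 'utf-16' codec, which prepends a byte-order mark in the platform's native byte order, with 'utf-16-le', which fixes the byte order and adds no mark.

```python
import codecs

s = "m\u00f8\u00f8se"  # "møøse"
with_bom = s.encode("utf-16")        # native byte order, BOM prepended
without_bom = s.encode("utf-16-le")  # explicit byte order, no BOM

print(len(with_bom), len(without_bom))  # 12 10
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
```

When only the size of the text itself matters, naming the byte order explicitly avoids counting the mark.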

Character Length

Before Python 3.3, len() returns the number of code units in a string, which on a narrow build can differ from the number of characters; only on a wide build does it count code points. Since Python 3.3 there is no narrow/wide distinction and len() always returns the number of code points. To get the character length of an encoded byte sequence, you have to decode it first:

>>> len('ascii')
5
>>> len(chr(0x1F4A9)) # how many code units on narrow build, how many characters on wide build only
2
>>> len('\u05d0') # the letter Alef as unicode literal
1
>>> len(b'\xd7\x90'.decode('utf-8')) # Same encoded as utf-8 byte sequence
1
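One way to count code points without relying on what len() does for the string itself is to go through UTF-32, which uses exactly four bytes per code point. This is a sketch under assumptions: the function name code_point_count is made up, and although the approach is expected to hold on old narrow builds too, it has only been reasoned through for Python 3.3 and later, where it simply agrees with len().

```python
def code_point_count(s):
    """Count code points via UTF-32, which uses exactly 4 bytes per code point."""
    # The "-le" suffix fixes the byte order, so no BOM is added to the output.
    return len(s.encode("utf-32-le")) // 4

print(code_point_count(chr(0x1F4A9)))  # 1
print(code_point_count("\u05d0"))      # 1
```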

R

Byte length

Character length

<lang R>a <- "m\u00f8\u00f8se"  # "møøse"
print(nchar(a, type="chars")) # print 5</lang>

REBOL

Byte Length

REBOL 2.x does not natively support UCS (Unicode), so character and byte length are the same. See utf-8.r for an external UTF-8 library.

<lang REBOL>text: "møøse" print rejoin ["Byte length for '" text "': " length? text]</lang>

Output:

Byte length for 'møøse': 5

Retro

Byte Length

<lang Retro>"møøse" getLength putn</lang>

Character Length

Retro does not have built-in support for Unicode, but counting of characters can be done with a small amount of effort.

<lang Retro>chain: UTF8' {{

 : utf+ ( $-$ )
   [ 1+ dup @ %11000000 and %10000000 = ] while ;

 : count ( $-$ )
   0 !here
   repeat dup @ 0; drop utf+ here ++ again ;

---reveal---

 : getLength ( $-n )
   count drop @here ;

}}

;chain

"møøse" ^UTF8'getLength putn</lang>

REXX

<lang REXX>sss='123456789abcdef'
say 'the length of sss is:' length(sss)</lang> Output:

the length of sss is: 15

Ruby

Byte Length

Since Ruby 1.8.7, String#bytesize is the byte length.

Works with: Ruby version 1.8.7 or 1.9

<lang ruby># -*- coding: utf-8 -*-

puts "あいうえお".bytesize
# => 15</lang>

Character Length

Since Ruby 1.9, String#length (alias String#size) is the character length. The magic comment, "coding: utf-8", sets the encoding of all string literals in this file.

Works with: Ruby version 1.9

<lang ruby># -*- coding: utf-8 -*-

puts "あいうえお".length
# => 5

puts "あいうえお".size # alias for length
# => 5</lang>

Code Set Independence

The next examples show the byte length and character length of "møøse" in different encodings.

To run these programs, you must convert them to different encodings.

If you use Emacs: Paste each program into Emacs. The magic comment, like -*- coding: iso-8859-1 -*-, will tell Emacs to save with that encoding.

If your text editor saves UTF-8: Convert the file before running it. For example:
$ ruby -pe '$_.encode!("iso-8859-1", "utf-8")' scratch.rb | ruby

Works with: Ruby version 1.9

Program (ISO-8859-1):

<lang ruby># -*- coding: iso-8859-1 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>

Output:

Byte length: 5
Character length: 5

Program (UTF-8):

<lang ruby># -*- coding: utf-8 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>

Output:

Byte length: 7
Character length: 5

Program (GB18030):

<lang ruby># -*- coding: gb18030 -*-
s = "møøse"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.length</lang>

Output:

Byte length: 11
Character length: 5
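The same comparison can be made from a single source file by encoding the string explicitly instead of re-saving the program. The Python 3 sketch below is an illustration rather than a translation of the Ruby programs; it assumes Python's codec names "iso-8859-1", "utf-8" and "gb18030" for the three encodings.

```python
s = "m\u00f8\u00f8se"  # "møøse"

# The character length stays the same; only the byte length depends on the encoding.
for encoding in ("iso-8859-1", "utf-8", "gb18030"):
    byte_length = len(s.encode(encoding))
    print("%-10s byte length: %2d, character length: %d" % (encoding, byte_length, len(s)))
```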

Ruby 1.8

The next example works with both Ruby 1.8 and Ruby 1.9. In Ruby 1.8, strings have no encodings, and String#length is the byte length. Ruby 1.8 regular expressions know three multibyte encodings (EUC-JP, Shift-JIS and UTF-8), selected by a flag after the closing slash:

/./n uses no multibyte encoding.
/./e uses EUC-JP.
/./s uses Shift-JIS or Windows-31J.
/./u uses UTF-8.

Then either string.scan(/./u).size or string.gsub(/./u, ' ').size counts the UTF-8 characters in string.

<lang ruby># -*- coding: utf-8 -*-

class String

 # Define String#bytesize for Ruby 1.8.6.
 unless method_defined?(:bytesize)
   alias bytesize length
 end

end

s = "文字化け"
puts "Byte length: %d" % s.bytesize
puts "Character length: %d" % s.gsub(/./u, ' ').size</lang>

SAS

<lang sas>data _null_;

  a="Hello, World!";
   b=length(a);
  put _all_;

run;</lang>

Scheme

Byte Length

Works with: Gauche version 0.8.7 [utf-8,pthreads]

string-size is a Gauche-specific function. <lang scheme>(string-size "Hello world")</lang>

Works with: PLT Scheme version 4.2.4

<lang scheme>(bytes-length #"Hello world")</lang>

Character Length

Works with: Gauche version 0.8.7 [utf-8,pthreads]

string-length is defined in R5RS and R6RS. <lang scheme>(string-length "Hello world")</lang>

Seed7

Character Length

<lang seed7>length("Hello, world!")</lang>

Slate

<lang slate>'Hello, world!' length.</lang>

Smalltalk

Byte Length

<lang smalltalk>string := 'Hello, world!'. string size.</lang>

Character Length

In GNU Smalltalk:

<lang smalltalk>string := 'Hello, world!'. string numberOfCharacters.</lang>

requires loading the Iconv package:

<lang smalltalk>PackageLoader fileInPackage: 'Iconv'</lang>

SNOBOL4

Byte Length

<lang snobol4>
	output = "Byte length: " size(trim(input))
end
</lang>

Character Length

This example works, as far as is known, only with CSnobol4 by Phil Budne. <lang snobol4>
-include "utf.sno"
	output = "Char length: " utfsize(trim(input))
end
</lang>

Standard ML

Byte Length

Works with: SML/NJ version 110.60

Works with: Moscow ML version 2.01

Works with: MLton version 20061107

<lang sml>val strlen = size "Hello, world!";</lang>

Character Length

Works with: SML/NJ version 110.74

<lang sml>val strlen = UTF8.size "Hello, world!";</lang>

Tcl

Byte Length

Formally, Tcl does not guarantee any particular internal representation for its strings (the underlying implementation objects can hold strings in at least three different formats, mutating between them as necessary), so the "byte length" of a string can only be calculated with respect to some user-selected encoding. For UTF-8, this is done as follows: <lang tcl>string length [encoding convertto utf-8 $theString]</lang> Thus, we have these examples: <lang tcl>set s1 "hello, world"
set s2 "\u304A\u306F\u3088\u3046"
set enc utf-8
puts [format "length of \"%s\" in bytes is %d" \
    $s1 [string length [encoding convertto $enc $s1]]]
puts [format "length of \"%s\" in bytes is %d" \
    $s2 [string length [encoding convertto $enc $s2]]]</lang>
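The point that a byte length exists only relative to a chosen encoding can also be sketched in Python, used here purely for illustration. The helper name `byte_length` is hypothetical and not part of the Tcl example.

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies once encoded."""
    return len(text.encode(encoding))


s1 = "hello, world"
s2 = "\u304A\u306F\u3088\u3046"  # the same four hiragana as in the Tcl example

for s in (s1, s2):
    print(len(s), "code points,",
          byte_length(s, "utf-8"), "bytes in UTF-8,",
          byte_length(s, "utf-16-le"), "bytes in UTF-16 (no BOM)")
```

The same string yields different byte lengths under different encodings, while the code point count stays fixed.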

Character Length

Basic version:

<lang tcl>string length "Hello, world!"</lang>

Or, more elaborately (works with any Tcl 8.x interpreter; tested on 8.4.12):

<lang tcl>fconfigure stdout -encoding utf-8; #So that Unicode string will print correctly
set s1 "hello, world"
set s2 "\u304A\u306F\u3088\u3046"
puts [format "length of \"%s\" in characters is %d" $s1 [string length $s1]]
puts [format "length of \"%s\" in characters is %d" $s2 [string length $s2]]</lang>

TI-89 BASIC

The TI-89 uses a fixed 8-bit encoding, so there is no difference between character length and byte length.

<lang ti89b>■ dim("møøse")
        5</lang>

Toka

Byte Length

<lang toka>" hello, world!" string.getLength</lang>

Trith

Character Length

<lang trith>"møøse" length</lang>

Byte Length

<lang trith>"møøse" size</lang>

TUSCRIPT

Character Length

<lang tuscript>
$$ MODE TUSCRIPT
string="hello, world"
l=LENGTH (string)
PRINT "character length of string '",string,"': ",l
</lang>
Output:

Character length of string 'hello, world': 12

UNIX Shell

Byte Length

With external utility:

Works with: Bourne Shell

<lang bash>string='Hello, world!'
length=`expr "x$string" : '.*' - 1`
echo $length # if you want it printed to the terminal</lang>

With SUSv3 parameter expansion modifier:

Works with: Almquist SHell

Works with: Bourne Again SHell version 3.2

Works with: pdksh version 5.2.14 99/07/13.2

Works with: Z SHell

<lang bash>string='Hello, world!'
length="${#string}"
echo $length # if you want it printed to the terminal</lang>

Vala

Character Length

<lang vala> string s = "Hello, world!"; int characterLength = s.length; </lang>

VBA

Cf. VBScript (below).

VBScript

Byte Length

<lang vbscript>LenB(string|varname)</lang>

Returns the number of bytes required to store a string in memory. Returns null if string|varname is null.

Character Length

<lang vbscript>Len(string|varname)</lang>

Returns the length of string|varname in characters. Returns null if string|varname is null.
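VBScript strings are held in memory as 16-bit code units, so the byte count reported by LenB is normally twice the count reported by Len. The Python sketch below models that relationship under the assumption that UTF-16 little-endian without a byte order mark matches the in-memory layout; the function names `utf16_units` and `utf16_bytes` are hypothetical and are not VBScript APIs.

```python
def utf16_bytes(text: str) -> int:
    """Bytes needed to hold `text` as UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-16-le"))


def utf16_units(text: str) -> int:
    """Number of 16-bit code units in `text`; each unit is two bytes."""
    return utf16_bytes(text) // 2


s = "Hello, world!"
print(utf16_units(s), utf16_bytes(s))   # 13 units, 26 bytes

astral = "\U0001D518"                   # one non-BMP code point
print(len(astral), utf16_units(astral), utf16_bytes(astral))  # 1 code point, 2 units, 4 bytes
```

Note that a count of 16-bit units differs from the code point count the task asks for whenever the string contains non-BMP characters.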

x86 Assembly

Byte Length

The following code uses AT&T syntax and was tested using AS (the portable GNU assembler) under Linux.

<lang x86 Assembly>
.data
string: .asciz "Test"

.text
.globl main

main:

       pushl   %ebp
       movl    %esp, %ebp

       pushl   %edi
       xorb    %al, %al
       movl    $-1, %ecx
       movl    $string, %edi
       cld
       repne   scasb
       not     %ecx
       dec     %ecx
       popl    %edi

       # string length is stored in %ecx register

       leave
       ret

</lang>
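The repne scasb loop scans forward for the terminating zero byte and derives the length from what remains in %ecx. The same scan can be sketched in Python over a bytes buffer; the name `c_strlen` is hypothetical, and the buffer is assumed to contain a NUL terminator (without one the loop raises IndexError).

```python
def c_strlen(buffer: bytes) -> int:
    """Return the number of bytes before the first NUL byte."""
    length = 0
    # Advance one byte at a time until the zero terminator is found,
    # mirroring what repne scasb does with %al = 0.
    while buffer[length] != 0:
        length += 1
    return length


print(c_strlen(b"Test\x00"))  # 4
```

As in the assembly version, the result is a byte count; it equals the character count only for single-byte encodings.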

XPL0

<lang XPL0>include c:\cxpl\stdlib; IntOut(0, StrLen("Character length = Byte length = String length = "))</lang>

Output:

XSLT

Character Length

...

xTalk

Works with: HyperCard

Byte Length

This example may be incorrect due to a recent change in the task requirements or a lack of testing. Please verify it and remove this message. If the example does not match the requirements or does not work, replace this message with Template:incorrect or fix the code yourself.

<lang xtalk>put the length of "Hello World"</lang>

or

<lang xtalk>put the number of characters in "Hello World"</lang>

Character Length

This example may be incorrect due to a recent change in the task requirements or a lack of testing. Please verify it and remove this message. If the example does not match the requirements or does not work, replace this message with Template:incorrect or fix the code yourself.

<lang xtalk>put the length of "Hello World"</lang>

or

<lang xtalk>put the number of characters in "Hello World"</lang>

Yorick

Character Length

<lang yorick>strlen("Hello, world!")</lang>

@@ Line 597: / Line 597: @@
 Character length:  9 - 4a 332 6f 332 73 332 65 301 332</pre>
+=={{header|Dc}}==
+===Character Length===
+The following code outputs 5, which is the length of the string "abcde".
+<lang Dc>[abcde]Zp</lang>
 =={{header|E}}==