String Byte Length: Difference between revisions

Redirecting to String Length
(→‎{{header|Python}}: UTF-16: extra 2 bytes is probably Unicode BOM)
{{Template:split-review}}
{{task}}
 
In this task, the goal is to find the <em>byte</em> length of a string. This means encodings like [[UTF-8]] may need to be handled specially, as there is not necessarily a one-to-one relationship between bytes and characters, and some languages recognize this. For example, the character length of "møøse" is 5 but the byte length is 7 in UTF-8 and 10 in UTF-16.
 
For character length, see [[String Character Length]].
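The figures quoted in the task description can be checked with a short sketch. Python 3 is used here purely for illustration. The helper name byte_length is hypothetical, and the "utf-16-le" codec is chosen on the assumption that no byte-order mark should be counted.

```python
def byte_length(text, encoding):
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

s = "m\u00f8\u00f8se"                       # "møøse", written with escapes
char_count = len(s)                         # 5 characters (code points)
utf8_bytes = byte_length(s, "utf-8")        # each ø takes 2 bytes in UTF-8
utf16_bytes = byte_length(s, "utf-16-le")   # 2 bytes per character, no BOM

print(char_count, utf8_bytes, utf16_bytes)  # 5 7 10
```

Using the plain "utf-16" codec instead would add 2 bytes for the byte-order mark.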
 
=={{header|4D}}==
$length:=Length("Hello, world!")
 
=={{header|Ada}}==
'''Compiler:''' GCC 4.1.2
 
Str : String := "Hello World";
Length : constant Natural := Str'Size / System.Storage_Unit;
 
The 'Size attribute returns the size of an object in bits. System.Storage_Unit is the number of bits in a storage element (a byte on most machines); using it requires "with System;" in the context clause.
 
=={{header|AppleScript}}==
{{needs-review|AppleScript}}
count of "Hello World"
 
=={{header|AWK}}==
From within any code block:
w=length("Hello, world!") # static string example
x=length("Hello," s " world!") # dynamic string example
y=length($1) # input field example
z=length(s) # variable name example
Ad hoc program from command line:
echo "Hello, wørld!" | awk '{print length($0)}' # 14 when awk counts bytes (e.g. LC_ALL=C); gawk in a UTF-8 locale counts characters and prints 13
From executable script: (prints for every line arriving on stdin)
#!/usr/bin/awk -f
{print"The length of this line is "length($0)}
 
=={{header|C}}==
'''Standard:''' [[ANSI C]] (AKA [[C89]]):
 
'''Compiler:''' GCC 3.3.3
 
#include <string.h>
int main(void)
{
const char *string = "Hello, world!";
size_t length = strlen(string);
return 0;
}
 
or by hand:
 
#include <stddef.h>
int main(void)
{
const char *string = "Hello, world!";
size_t length = 0;
const char *p = string; /* keep the pointer const: the literal must not be modified */
while (*p++ != '\0') length++;
return 0;
}
 
or (for arrays of char only)
 
#include <stdlib.h>
int main(void)
{
char const s[] = "Hello, world!";
size_t length = sizeof s - 1;
return 0;
}
 
=={{header|C++}}==
 
'''Standard:''' [[ISO C plus plus|ISO C++]] (AKA [[C plus plus 98|C++98]]):
 
'''Compiler:''' g++ 4.0.2
 
#include <string> // note: '''not''' <string.h>
int main()
{
std::string s = "Hello, world!";
std::string::size_type length = s.length(); // option 1: In Characters/Bytes
std::string::size_type size = s.size(); // option 2: In Characters/Bytes
// In bytes same as above since sizeof(char) == 1
std::string::size_type bytes = s.length() * sizeof(std::string::value_type);
}
 
For wide character strings:
 
#include <string>
int main()
{
std::wstring s = L"\u304A\u306F\u3088\u3046";
std::wstring::size_type length = s.length() * sizeof(std::wstring::value_type); // in bytes
}
 
=={{header|C sharp|C#}}==
'''Platform:''' [[.NET]]
'''Language Version:''' 1.0+
 
string s = "Hello, world!";
int blength = System.Text.Encoding.UTF8.GetByteCount(s); // In bytes, for the UTF-8 encoding.
 
=={{header|Clean}}==
Clean Strings are unboxed arrays of characters. Characters are always a single byte. The function size returns the number of elements in an array.
 
import StdEnv
strlen :: String -> Int
strlen string = size string
Start = strlen "Hello, world!"
 
=={{header|ColdFusion}}==
{{needs-review|ColdFusion}}
#len("Hello World")#
 
=={{header|Common Lisp}}==
{{needs-review|Common Lisp}}
(length "Hello World")
 
=={{header|Component Pascal}}==
{{needs-review|Component Pascal}}
LEN("Hello, World!")
 
=={{header|Forth}}==
'''Interpreter:''' ANS Forth
 
Strings in Forth come in two forms, neither of which is the null-terminated form commonly used in the C standard library.
 
===Counted string===
A counted string is a single pointer to a short string in memory. The string's first byte is the count of the number of characters in the string. This is how symbols are stored in a Forth dictionary.
 
CREATE s ," Hello world" \ create string "s"
s C@ ( -- length=11 )
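The counted-string layout described above can be sketched outside Forth. The following Python 3 illustration is not Forth's implementation. The helper names to_counted and counted_length are hypothetical, and ASCII text is assumed so that one character is one byte.

```python
def to_counted(text):
    """Build a counted string: one length byte followed by the character data."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("a counted string holds at most 255 bytes")
    return bytes([len(data)]) + data

def counted_length(counted):
    """Read the length from the first byte, as C@ does in the Forth example."""
    return counted[0]

s = to_counted("Hello world")
print(counted_length(s))  # 11
print(len(s))             # 12: the count byte plus 11 data bytes
```

Because the count lives in a single byte, this layout cannot describe more than 255 bytes of data.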
 
===Stack string===
A string on the stack is represented by a pair of cells: the address of the string data and the length of the string data (in characters). The word '''COUNT''' converts a counted string into a stack string. The STRING utility wordset of ANS Forth works on these addr-len pairs. This representation has the advantages of not requiring null-termination, easy representation of substrings, and not being limited to 255 characters.
 
S" string" ( addr len)
DUP . \ 6
 
=={{header|Haskell}}==
 
It is not possible to determine the "byte length" of an ordinary string, because in Haskell, a string is a boxed list of Unicode characters. Each character in a string is therefore stored in whatever form the compiler considers the most efficient representation of a cons-cell and a Unicode character, not as a byte.
 
For efficient storage of sequences of bytes, there's ''Data.ByteString'', which uses ''Word8'' as a base type. Byte strings have an additional ''Data.ByteString.Char8'' interface, which will truncate each Unicode ''Char'' to 8 bits as soon as it is converted to a byte string. However, this is not adequate for the task, because truncation will simply garble characters outside Latin-1 instead of encoding them into, say, UTF-8.
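The difference between truncating and encoding can be illustrated with a small sketch. It is written in Python 3 and only mimics the 8-bit truncation described above; it does not call any Haskell library, and the helper name truncate_to_bytes is hypothetical.

```python
def truncate_to_bytes(text):
    """Keep only the low 8 bits of each code point (lossy above U+00FF)."""
    return bytes(ord(ch) & 0xFF for ch in text)

text = "\u304A\u306F"  # two hiragana characters, outside Latin-1

truncated = truncate_to_bytes(text)
encoded = text.encode("utf-8")

print(truncated)     # b'Jo': the original characters cannot be recovered
print(len(encoded))  # 6: three bytes per character in UTF-8
```

The truncated form has the "right" length of one byte per character, but it no longer represents the original text, which is why a real encoder is needed for this task.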
 
There are several (non-standard, so far) Unicode encoding libraries available on [http://hackage.haskell.org/ Hackage]. As an example, we'll use [http://hackage.haskell.org/packages/archive/encoding/0.2/doc/html/Data-Encoding.html encoding-0.2], as ''Data.Encoding'':
 
import Data.Encoding
import Data.ByteString as B
strUTF8 :: ByteString
strUTF8 = encode UTF8 "Hello World!"
strUTF32 :: ByteString
strUTF32 = encode UTF32 "Hello World!"
strlenUTF8 = B.length strUTF8
strlenUTF32 = B.length strUTF32
 
=={{header|IDL}}==
{{needs-review|IDL}}
'''Compiler:''' any IDL compiler should do
 
length = strlen("Hello, world!")
 
=={{header|Java}}==
 
Java encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length method of String objects returns the number of 16-bit values used to encode a string, so the number of bytes can be determined by doubling that number.
 
String s = "Hello, world!";
int byteCount = s.length() * 2;
 
Another way to know the byte length of a string is to explicitly specify the charset we desire.
 
String s = "Hello, world!";
int byteCountUTF16 = s.getBytes("UTF-16").length; // 28: includes a 2-byte byte-order mark
int byteCountUTF8 = s.getBytes("UTF-8").length;
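The relationship between 16-bit code units and bytes can be sketched as follows. Python 3 is used for illustration only; the helper name utf16_code_units is hypothetical and stands in for the count that the length method is described as returning above.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to encode text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2

bmp = "Hello, world!"   # every character fits in one code unit
astral = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, needs a surrogate pair

bmp_units = utf16_code_units(bmp)        # 13 code units, so 26 bytes
astral_units = utf16_code_units(astral)  # 2 code units for 1 character
with_bom = len(bmp.encode("utf-16"))     # 28: a 2-byte BOM is prepended

print(bmp_units, astral_units, with_bom)  # 13 2 28
```

Doubling the code-unit count gives the UTF-16 byte length without a byte-order mark. Encoders that emit a BOM report 2 bytes more.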
 
=={{header|JavaScript}}==
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length property of string objects gives the number of 16-bit values used to encode a string, so the number of bytes can be determined by doubling that number.
 
var s = "Hello, world!";
var byteCount = s.length * 2; //26
 
=={{header|JudoScript}}==
{{needs-review|JudoScript}}
//Store length of hello world in length and print it
. length = "Hello World".length();
 
=={{header|LSE64}}==
LSE stores strings as arrays of characters in 64-bit cells plus a count.
" Hello world" @ 1 + 8 * , # 96 = (11+1)*(size of a cell) = 12*8
 
=={{header|Lua}}==
{{needs-review|Lua}}
'''Interpreter:''' [[Lua]] 5.0 or later.
 
string="Hello world"
length=#string
 
=={{header|mIRC Scripting Language}}==
{{needs-review|mIRC Scripting Language}}
alias stringlength { echo -a Your Name is: $len($$?="Whats your name") letters long! }
 
=={{header|OCaml}}==
{{needs-review|OCaml}}
'''Interpreter'''/'''Compiler:''' [[Ocaml]] 3.09
 
String.length "Hello world";;
 
=={{header|Perl}}==
'''Interpreter:''' [[perl]] 5.8
 
Strings in Perl consist of characters. Measuring the byte length therefore requires conversion to some binary representation (called encoding, both noun and verb).
 
use utf8; # so we can use literal characters like ☺ in source
use Encode qw(encode);
print length encode 'UTF-8', "Hello, world! ☺";
# 17. The last character takes 3 bytes, the others 1 byte each.
print length encode 'UTF-16', "Hello, world! ☺";
# 32. 2 bytes for the BOM, then 15 byte pairs, one per character.
 
=={{header|PHP}}==
{{needs-review|PHP}}
$length = strlen('Hello, world!');
 
=={{header|PL/SQL}}==
{{needs-review|PL/SQL}}
DECLARE
string VARCHAR2( 50 ) := 'Hello, world!';
stringlength NUMBER;
BEGIN
stringlength := length( string );
END;
 
=={{header|Pop11}}==
Currently Pop11 supports only strings consisting of 1-byte units.
Strings can carry arbitrary binary data, so a user can, for example,
store UTF-8 in them (however, built-in procedures will treat each byte as
a single character). The length function for strings returns
the length in bytes:
 
lvars str = 'Hello, world!';
lvars len = length(str);
 
=={{header|Python}}==
'''Interpreter:''' [[Python]] 2.x
 
Byte length depends on the encoding. Python uses 2 or 4 bytes per character internally for unicode strings, depending on how it was built. The internal representation is not relevant to the user; what matters is the length of the string once encoded.
 
# The letter Alef
>>> len(u'\u05d0'.encode('utf-8'))
2
>>> len(u'\u05d0'.encode('iso-8859-8'))
1
 
Example from the problem statement:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
s = u"møøse"
assert len(s) == 5
assert len(s.encode('UTF-8')) == 7
assert len(s.encode('UTF-16')) == 12 # 10 bytes of text plus a 2-byte leading byte-order mark (BOM).
 
=={{header|Ruby}}==
string="Hello world"
print string.bytesize # length in bytes; String#length counts characters in Ruby 1.9 and later
or
puts "Hello World".bytesize
 
=={{header|Scheme}}==
{{needs-review|Scheme}}
(string-length "Hello world")
 
=={{header|Smalltalk}}==
{{needs-review|Smalltalk}}
string := 'Hello, world!'.
string size.
 
=={{header|Standard ML}}==
{{needs-review|Standard ML}}
'''Interpreter:''' [[Standard ML of New Jersey | SML/NJ]] 110.60, [[Moscow ML]] 2.01 (January 2004)
 
'''Compiler:''' [[MLton]] 20061107
 
val strlen = size "Hello, world!";
 
=={{header|Tcl}}==
Basic version:
 
string bytelength "Hello, world!"
 
or, more elaborately (requires any 8.x '''interpreter'''; tested on 8.4.12):
 
fconfigure stdout -encoding utf-8; #So that Unicode string will print correctly
set s1 "hello, world"
set s2 "\u304A\u306F\u3088\u3046"
puts [format "length of \"%s\" in bytes is %d" $s1 [string bytelength $s1]]
puts [format "length of \"%s\" in bytes is %d" $s2 [string bytelength $s2]]
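As a cross-check on the two strings used above, the following Python 3 sketch computes their UTF-8 byte lengths. It assumes that the byte length reported for these particular strings matches their plain UTF-8 encoding; that is an assumption about the example, not a statement about Tcl's internals.

```python
s1 = "hello, world"
s2 = "\u304A\u306F\u3088\u3046"  # four hiragana characters

s1_bytes = len(s1.encode("utf-8"))  # 12: one byte per ASCII character
s2_bytes = len(s2.encode("utf-8"))  # 12: three bytes per hiragana character

print(len(s1), s1_bytes)  # 12 12
print(len(s2), s2_bytes)  # 4 12
```

The two strings have the same byte length even though one has three times as many characters as the other.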
 
=={{header|Toka}}==
" hello, world!" string.getLength
 
=={{header|UNIX Shell}}==
With external utilities:
 
'''Interpreter:''' any Bourne shell
 
string='Hello, world!'
length=`printf '%s' "$string" | wc -c | tr -dc '0-9'` # printf is used because echo -n is not portable
echo $length # if you want it printed to the terminal
 
With SUSv3 parameter expansion modifier:
 
'''Interpreter:''' [[Almquist SHell]] (NetBSD 3.0), [[Bourne Again SHell]] 3.2, [[Korn SHell]] (5.2.14 99/07/13.2), [[Z SHell]]
 
string='Hello, world!'
length="${#string}"
echo $length # if you want it printed to the terminal
 
 
=={{header|VBScript}}==
LenB(string|varname)
 
Returns the number of bytes required to store a string in memory.
Returns null if string|varname is null.
 
=={{header|xTalk}}==
{{needs-review|xTalk}}
'''Interpreter:''' HyperCard
 
put the length of "Hello World"
 
or
 
put the number of characters in "Hello World"