String Byte Length: Difference between revisions
(→{{header|Python}}: UTF-16: extra 2 bytes is probably Unicode BOM) |
(Redirecting to String Length) |
#REDIRECT [[String Length]] |
|||
{{Template:split-review}} |
|||
{{task}} |
|||
In this task, the goal is to find the <em>byte</em> length of a string. This means encodings like [[UTF-8]] may need to be handled specially, as there is not necessarily a one-to-one relationship between bytes and characters, and some languages recognize this. For example, the character length of "møøse" is 5 but the byte length is 7 in UTF-8 and 10 in UTF-16. |
|||
For character length, see [[String Character Length]]. |
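The distinction can be sketched in Python 3, chosen here only for illustration. The helper name `byte_length` is hypothetical and not part of the task. The `utf-16-le` codec is assumed so that no byte-order mark is counted, which matches the UTF-16 figure quoted above.

```python
def byte_length(text, encoding):
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

s = "m\u00f8\u00f8se"  # "møøse", written with escapes to stay ASCII-safe

print(len(s))                        # character length
print(byte_length(s, "utf-8"))       # each ø takes two bytes in UTF-8
print(byte_length(s, "utf-16-le"))   # two bytes per character, no BOM
```

The same string therefore has one character length but a different byte length for each encoding.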
|||
=={{header|4D}}== |
|||
$length:=Length("Hello, world!") |
|||
=={{header|Ada}}== |
|||
'''Compiler:''' GCC 4.1.2 |
|||
Str : String := "Hello World"; |
|||
Length : constant Natural := Str'Size / System.Storage_Unit; |
|||
The 'Size attribute returns the size of an object in bits. System.Storage_Unit is the number of bits in a byte on the current machine. |
|||
=={{header|AppleScript}}== |
|||
{{needs-review|AppleScript}} |
|||
count of "Hello World" |
|||
=={{header|AWK}}== |
|||
From within any code block: |
|||
w=length("Hello, world!") # static string example |
|||
x=length("Hello," s " world!") # dynamic string example |
|||
y=length($1) # input field example |
|||
z=length(s) # variable name example |
|||
Ad hoc program from command line: |
|||
echo "Hello, wørld!" | awk '{print length($0)}' # 14 |
|||
From executable script: (prints for every line arriving on stdin) |
|||
#!/usr/bin/awk -f |
|||
{print"The length of this line is "length($0)} |
|||
=={{header|C}}== |
|||
'''Standard:''' [[ANSI C]] (AKA [[C89]]): |
|||
'''Compiler:''' GCC 3.3.3 |
|||
#include <string.h> |
|||
int main(void) |
|||
{ |
|||
const char *string = "Hello, world!"; |
|||
size_t length = strlen(string); |
|||
return 0; |
|||
} |
|||
or by hand: |
|||
int main(void) |
|||
{ |
|||
const char *string = "Hello, world!"; |
|||
size_t length = 0; |
|||
const char *p = string; /* keep the pointer const; no cast needed */ |
|||
while (*p++ != '\0') length++; |
|||
return 0; |
|||
} |
|||
or (for arrays of char only) |
|||
#include <stdlib.h> |
|||
int main(void) |
|||
{ |
|||
char const s[] = "Hello, world!"; |
|||
size_t length = sizeof s - 1; |
|||
return 0; |
|||
} |
|||
=={{header|C++}}== |
|||
'''Standard:''' [[ISO C plus plus|ISO C++]] (AKA [[C plus plus 98|C++98]]): |
|||
'''Compiler:''' g++ 4.0.2 |
|||
#include <string> // note: '''not''' <string.h> |
|||
int main() |
|||
{ |
|||
std::string s = "Hello, world!"; |
|||
std::string::size_type length = s.length(); // option 1: In Characters/Bytes |
|||
std::string::size_type size = s.size(); // option 2: In Characters/Bytes |
|||
// In bytes same as above since sizeof(char) == 1 |
|||
std::string::size_type bytes = s.length() * sizeof(std::string::value_type); |
|||
} |
|||
For wide character strings: |
|||
#include <string> |
|||
int main() |
|||
{ |
|||
std::wstring s = L"\u304A\u306F\u3088\u3046"; |
|||
std::wstring::size_type length = s.length() * sizeof(std::wstring::value_type); // in bytes |
|||
} |
|||
=={{header|C sharp|C#}}== |
|||
'''Platform:''' [[.NET]] |
|||
'''Language Version:''' 1.0+ |
|||
string s = "Hello, world!"; |
|||
int blength = System.Text.Encoding.UTF8.GetByteCount(s); // In bytes, when encoded as UTF-8. |
|||
=={{header|Clean}}== |
|||
Clean Strings are unboxed arrays of characters. Characters are always a single byte. The function size returns the number of elements in an array. |
|||
import StdEnv |
|||
strlen :: String -> Int |
|||
strlen string = size string |
|||
Start = strlen "Hello, world!" |
|||
=={{header|ColdFusion}}== |
|||
{{needs-review|ColdFusion}} |
|||
#len("Hello World")# |
|||
=={{header|Common Lisp}}== |
|||
{{needs-review|Common Lisp}} |
|||
(length "Hello World") |
|||
=={{header|Component Pascal}}== |
|||
{{needs-review|Component Pascal}} |
|||
LEN("Hello, World!") |
|||
=={{header|Forth}}== |
|||
'''Interpreter:''' ANS Forth |
|||
Strings in Forth come in two forms, neither of which is the null-terminated form commonly used in the C standard library. |
|||
===Counted string=== |
|||
A counted string is a single pointer to a short string in memory. The string's first byte is the count of the number of characters in the string. This is how symbols are stored in a Forth dictionary. |
|||
CREATE s ," Hello world" \ create string "s" |
|||
s C@ ( -- length=11 ) |
|||
===Stack string=== |
|||
A string on the stack is represented by a pair of cells: the address of the string data and the length of the string data (in characters). The word '''COUNT''' converts a counted string into a stack string. The STRING utility wordset of ANS Forth works on these addr-len pairs. This representation has the advantages of not requiring null-termination, easy representation of substrings, and not being limited to 255 characters. |
|||
S" string" ( addr len) |
|||
DUP . \ 6 |
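The two layouts can be modelled in Python 3 as a rough sketch. This is an illustration, not Forth: the function names `make_counted` and `count` are hypothetical, and a `bytes` object stands in for Forth memory.

```python
def make_counted(data):
    """Build a counted string: one length byte followed by the characters."""
    if len(data) > 255:
        raise ValueError("a counted string holds at most 255 characters")
    return bytes([len(data)]) + data

def count(counted):
    """Model of COUNT: turn a counted string into a (data, length) pair."""
    length = counted[0]
    return counted[1:1 + length], length

s = make_counted(b"Hello world")
print(s[0])                             # the count stored in the first byte
print(count(make_counted(b"string")))   # the data and its length as a pair
```

The single length byte is what limits a counted string to 255 characters, a limit the address-and-length pair does not share.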
|||
=={{header|Haskell}}== |
|||
It is not possible to determine the "byte length" of an ordinary string, because in Haskell, a string is a boxed list of Unicode characters. So each character in a string is represented as whatever the compiler considers the most efficient representation of a cons-cell and a Unicode character, and not as a byte. |
|||
For efficient storage of sequences of bytes, there's ''Data.ByteString'', which uses ''Word8'' as a base type. Byte strings have an additional ''Data.ByteString.Char8'' interface, which will truncate each Unicode ''Char'' to 8 bits as soon as it is converted to a byte string. However, this is not adequate for the task, because truncation will simply garble characters outside Latin-1 instead of encoding them into, say, UTF-8. |
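Why truncation is inadequate can be sketched in Python 3. This only models the 8-bit truncation described above and does not call any Haskell library; the function name `truncate_to_8_bits` is hypothetical.

```python
def truncate_to_8_bits(text):
    """Keep only the low 8 bits of each code point, discarding the rest."""
    return bytes(ord(ch) & 0xFF for ch in text)

alef = "\u05d0"  # HEBREW LETTER ALEF
eth = "\u00d0"   # LATIN CAPITAL LETTER ETH, which is within Latin-1

print(truncate_to_8_bits(alef))   # one byte, no longer identifiable as Alef
print(truncate_to_8_bits(eth))    # the same byte as above
print(alef.encode("utf-8"))       # a real encoding keeps the character distinct
```

Two different characters collapse to the same byte under truncation, so the resulting length says nothing about the byte length in a real encoding.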
|||
There are several (non-standard, so far) Unicode encoding libraries available on [http://hackage.haskell.org/ Hackage]. As an example, we'll use [http://hackage.haskell.org/packages/archive/encoding/0.2/doc/html/Data-Encoding.html encoding-0.2], as ''Data.Encoding'': |
|||
import Data.Encoding |
|||
import Data.ByteString as B |
|||
strUTF8 :: ByteString |
|||
strUTF8 = encode UTF8 "Hello World!" |
|||
strUTF32 :: ByteString |
|||
strUTF32 = encode UTF32 "Hello World!" |
|||
strlenUTF8 = B.length strUTF8 |
|||
strlenUTF32 = B.length strUTF32 |
|||
=={{header|IDL}}== |
|||
{{needs-review|IDL}} |
|||
'''Compiler:''' any IDL compiler should do |
|||
length = strlen("Hello, world!") |
|||
=={{header|Java}}== |
|||
Java encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length method of String objects returns the number of 16-bit values used to encode a string, so the number of bytes can be determined by doubling that number. |
|||
String s = "Hello, world!"; |
|||
int byteCount = s.length() * 2; |
|||
Another way to know the byte length of a string is to explicitly specify the charset we desire. |
|||
String s = "Hello, world!"; |
|||
int byteCountUTF16 = s.getBytes("UTF-16").length; |
|||
int byteCountUTF8 = s.getBytes("UTF-8").length; |
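The relationship between 16-bit values and bytes can be sketched in Python 3, used here for illustration rather than Java. The helper name `utf16_code_units` is hypothetical, and `utf-16-le` is assumed so that no byte-order mark is included in the count.

```python
def utf16_code_units(text):
    """Count the 16-bit values needed to represent text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

hello = "Hello, world!"
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

print(utf16_code_units(hello) * 2)   # 13 code units, doubled
print(utf16_code_units(clef) * 2)    # one character, but two code units
print(len(hello.encode("utf-16")))   # Python's plain "utf-16" codec adds a 2-byte BOM
```

An explicit UTF-16 conversion that emits a byte-order mark reports two more bytes than the doubled count.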
|||
=={{header|JavaScript}}== |
|||
JavaScript encodes strings in UTF-16, which represents each character with one or two 16-bit values. The length property of string objects gives the number of 16-bit values used to encode a string, so the number of bytes can be determined by doubling that number. |
|||
var s = "Hello, world!"; |
|||
var byteCount = s.length * 2; //26 |
|||
=={{header|JudoScript}}== |
|||
{{needs-review|JudoScript}} |
|||
//Store length of hello world in length and print it |
|||
. length = "Hello World".length(); |
|||
=={{header|LSE64}}== |
|||
LSE stores strings as arrays of characters in 64-bit cells plus a count. |
|||
" Hello world" @ 1 + 8 * , # 96 = (11+1)*(size of a cell) = 12*8 |
|||
=={{header|Lua}}== |
|||
{{needs-review|Lua}} |
|||
'''Interpreter:''' [[Lua]] 5.0 or later. |
|||
string="Hello world" |
|||
length=#string |
|||
=={{header|mIRC Scripting Language}}== |
|||
{{needs-review|mIRC Scripting Language}} |
|||
alias stringlength { echo -a Your Name is: $len($$?="Whats your name") letters long! } |
|||
=={{header|OCaml}}== |
|||
{{needs-review|OCaml}} |
|||
'''Interpreter'''/'''Compiler:''' [[Ocaml]] 3.09 |
|||
String.length "Hello world";; |
|||
=={{header|Perl}}== |
|||
'''Interpreter:''' [[perl]] 5.8 |
|||
Strings in Perl consist of characters. Measuring the byte length therefore requires conversion to some binary representation (called encoding, both noun and verb). |
|||
use utf8; # so we can use literal characters like ☺ in source |
|||
use Encode qw(encode); |
|||
print length encode 'UTF-8', "Hello, world! ☺"; |
|||
# 17. The last character takes 3 bytes, the others 1 byte each. |
|||
print length encode 'UTF-16', "Hello, world! ☺"; |
|||
# 32. 2 bytes for the BOM, then one byte pair for each of the 15 characters. |
|||
=={{header|PHP}}== |
|||
{{needs-review|PHP}} |
|||
$length = strlen('Hello, world!'); |
|||
=={{header|PL/SQL}}== |
|||
{{needs-review|PL/SQL}} |
|||
DECLARE |
|||
string VARCHAR2( 50 ) := 'Hello, world!'; |
|||
stringlength NUMBER; |
|||
BEGIN |
|||
stringlength := length( string ); |
|||
END; |
|||
=={{header|Pop11}}== |
|||
Currently Pop11 supports only strings consisting of 1-byte units. |
|||
Strings can carry arbitrary binary data, so the user can, for example, use UTF-8 (however, built-in procedures will treat each byte as a single character). The length function for strings returns the length in bytes: |
|||
lvars str = 'Hello, world!'; |
|||
lvars len = length(str); |
|||
=={{header|Python}}== |
|||
'''Interpreter:''' [[Python]] 2.x |
|||
Byte length depends on the encoding. Python uses 2 or 4 bytes per character internally for unicode strings, depending on how it was built. This internal representation is rarely of interest to the user. |
|||
# The letter Alef |
|||
>>> len(u'\u05d0'.encode('utf-8')) |
|||
2 |
|||
>>> len(u'\u05d0'.encode('iso-8859-8')) |
|||
1 |
|||
Example from the problem statement: |
|||
#!/usr/bin/env python |
|||
# -*- coding: UTF-8 -*- |
|||
s = u"møøse" |
|||
assert len(s) == 5 |
|||
assert len(s.encode('UTF-8')) == 7 |
|||
assert len(s.encode('UTF-16')) == 12 # The extra 2 bytes are a leading Unicode byte-order mark (BOM). |
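The two extra bytes can be checked directly. The sketch below assumes Python 3 (not the 2.x interpreter named above) and uses the standard `codecs` module. `codecs.BOM_UTF16` is the byte-order mark in the machine's native byte order, which is the order the plain `utf-16` codec writes.

```python
import codecs

s = "m\u00f8\u00f8se"  # "møøse"

with_bom = s.encode("utf-16")        # BOM followed by the code units
without_bom = s.encode("utf-16-le")  # explicit byte order, so no BOM is written

print(with_bom.startswith(codecs.BOM_UTF16))
print(len(with_bom), len(without_bom))
```

Naming the byte order explicitly (`utf-16-le` or `utf-16-be`) gives the count without the mark.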
|||
=={{header|Ruby}}== |
|||
string="Hello world" |
|||
print string.length |
|||
or |
|||
puts "Hello World".length |
|||
=={{header|Scheme}}== |
|||
{{needs-review|Scheme}} |
|||
(string-length "Hello world") |
|||
=={{header|Smalltalk}}== |
|||
{{needs-review|Smalltalk}} |
|||
string := 'Hello, world!'. |
|||
string size. |
|||
=={{header|Standard ML}}== |
|||
{{needs-review|Standard ML}} |
|||
'''Interpreter:''' [[Standard ML of New Jersey | SML/NJ]] 110.60, [[Moscow ML]] 2.01 (January 2004) |
|||
'''Compiler:''' [[MLton]] 20061107 |
|||
val strlen = size "Hello, world!"; |
|||
=={{header|Tcl}}== |
|||
Basic version: |
|||
string bytelength "Hello, world!" |
|||
or, more elaborately ('''Interpreter:''' any 8.x; tested on 8.4.12): |
|||
fconfigure stdout -encoding utf-8; #So that Unicode string will print correctly |
|||
set s1 "hello, world" |
|||
set s2 "\u304A\u306F\u3088\u3046" |
|||
puts [format "length of \"%s\" in bytes is %d" $s1 [string bytelength $s1]] |
|||
puts [format "length of \"%s\" in bytes is %d" $s2 [string bytelength $s2]] |
|||
=={{header|Toka}}== |
|||
" hello, world!" string.getLength |
|||
=={{header|UNIX Shell}}== |
|||
With external utilities: |
|||
'''Interpreter:''' any Bourne shell |
|||
string='Hello, world!' |
|||
length=`printf '%s' "$string" | wc -c | tr -dc '0-9'` |
|||
echo $length # if you want it printed to the terminal |
|||
With SUSv3 parameter expansion modifier: |
|||
'''Interpreter:''' [[Almquist SHell]] (NetBSD 3.0), [[Bourne Again SHell]] 3.2, [[Korn SHell]] (5.2.14 99/07/13.2), [[Z SHell]] |
|||
string='Hello, world!' |
|||
length="${#string}" |
|||
echo $length # if you want it printed to the terminal |
|||
=={{header|VBScript}}== |
|||
LenB(string|varname) |
|||
Returns the number of bytes required to store a string in memory |
|||
Returns null if string|varname is null |
|||
=={{header|xTalk}}== |
|||
{{needs-review|xTalk}} |
|||
'''Interpreter:''' HyperCard |
|||
put the length of "Hello World" |
|||
or |
|||
put the number of characters in "Hello World" |