Compare length of two strings

Revision as of 17:36, 28 October 2021

Task

Given two strings of different length, determine which string is longer or shorter. Print both strings and their length, one on each line. Print the longer one first.

Measure the length of your string in terms of bytes or characters, as appropriate for your language. If your language doesn't have an operator for measuring the length of a string, note it.
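
As an illustration of the task statement above, here is a minimal sketch in Python. The function name `report_by_length` and the exact output wording are assumptions made for this example, not part of the task. Python's built-in `len` counts Unicode code points for `str` values.

```python
def report_by_length(a: str, b: str) -> list[str]:
    """Return one line per string, longer string first.

    Length is measured with len(), i.e. in Unicode code points.
    The name and the output format are illustrative assumptions.
    """
    ordered = sorted([a, b], key=len, reverse=True)
    return [f'"{s}" has length: {len(s)}' for s in ordered]


if __name__ == "__main__":
    for line in report_by_length("short", "longer"):
        print(line)
```

Sorting on `len` with `reverse=True` keeps the sketch short and extends naturally to more than two strings.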

ALGOL 68

Algol 68 does not have a built-in "LENGTH" operator, but it does have the operators LWB and UPB, which return the lower bound and upper bound of an array. As strings are arrays of characters, LENGTH can easily be constructed from these.
In most Algol 68 implementations, such as Algol 68G and Rutgers Algol 68, the CHAR type is an 8-bit byte.
<lang algol68>BEGIN # compare string lengths #

   # returns the length of s using the builtin UPB and LWB operators #
   OP LENGTH = ( STRING s )INT: ( UPB s + 1 ) - LWB s;
   # prints s and its length #
   PROC print string = ( STRING s )VOID:
        print( ( """", s, """ has length: ", whole( LENGTH s, 0 ), " bytes.", newline ) );
   STRING shorter     = "short";
   STRING not shorter = "longer";
   IF LENGTH shorter >  LENGTH not shorter THEN print string( shorter ) FI;
   print string( not shorter );
   IF LENGTH shorter <= LENGTH not shorter THEN print string( shorter ) FI

END</lang>

Output:

"longer" has length: 6 bytes.
"short" has length: 5 bytes.

Julia

Per the Julia docs, a String in Julia is a sequence of characters encoded as UTF-8. Most string methods in Julia actually accept an AbstractString, which is the supertype of strings in Julia regardless of the encoding, including the default UTF-8.

The Char data type in Julia is a 32-bit, potentially Unicode data type, so if we enumerate a String as a Char array, we get a series of 32-bit characters:
<lang julia>s = "niño"
println("Position  Char Bytes\n==============================")
for (i, c) in enumerate(s)
    println("$i          $c     $(sizeof(c))")
end
</lang>

Output:

Position  Char Bytes
==============================
1          n     4
2          i     4
3          ñ     4
4          o     4

However, indexing into the string works as if the string were an ordinary C string, that is, an array of unsigned 8-bit integers. If an index falls inside a character that occupies more than one byte, an error is thrown for bad indexing. This can be demonstrated by converting the above string to codeunits:
<lang julia>println("Position  Codeunit Bytes\n==============================")
for (i, c) in enumerate(codeunits(s))
    println("$i            $(string(c, base=16))     $(sizeof(c))")
end
</lang>

Output:

Position  Codeunit Bytes
==============================
1            6e     1
2            69     1
3            c3     1
4            b1     1
5            6f     1

Note that the length of "niño" as a String is 4 characters, and the length of "niño" as codeunits (i.e., 8-bit bytes) is 5. Indexing into the 4th position results in an error:
<lang julia>julia> s[4]
ERROR: StringIndexError: invalid index [4], valid nearby indices [3]=>'ñ', [5]=>'o'
</lang>

So, whether a string is longer or shorter depends on the encoding, as below: <lang julia>length("ñññ") < length("nnnn") # true, and the usual meaning of length of a String

length(codeunits("ñññ")) > length(codeunits("nnnn")) # true as well </lang>
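
For comparison, the same reversal can be sketched outside Julia. The Python snippet below is only an illustration: the helper names `code_points` and `utf8_bytes` are invented for this example, and the string is written with `\u00f1` escapes so that each "ñ" is certain to be the single precomposed code point.

```python
def code_points(s: str) -> int:
    """Length in Unicode code points (what len() reports for str)."""
    return len(s)


def utf8_bytes(s: str) -> int:
    """Length in bytes once the string is encoded as UTF-8."""
    return len(s.encode("utf-8"))


a = "\u00f1\u00f1\u00f1"  # three precomposed n-with-tilde characters
b = "nnnn"

# By code points, a is the shorter string: 3 < 4.
print(code_points(a) < code_points(b))
# By UTF-8 bytes, a is the longer string: 6 > 4.
print(utf8_bytes(a) > utf8_bytes(b))
```

Both comparisons print `True`, so which string counts as "longer" depends entirely on the unit of measurement.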

Raku

So... in what way does this task differ significantly from String length, other than being horribly underspecified?

In the modern world, string "length" is pretty much a useless measurement, especially in the absence of a specified encoding, which is why Raku does not even have a "length" operator for strings.

<lang perl6>say 'Strings (👨‍👩‍👧‍👦, 🤔🇺🇸, BOGUS!) sorted: "longest" first:';
say "$_: characters:{.chars},  Unicode code points:{.codes},  UTF-8 bytes:{.encode('UTF8').bytes},  UTF-16 bytes:{.encode('UTF16').bytes}" for <👨‍👩‍👧‍👦 BOGUS! 🤔🇺🇸>.sort: -*.chars;</lang>

Output:

Strings (👨‍👩‍👧‍👦, 🤔🇺🇸, BOGUS!) sorted: "longest" first:
BOGUS!: characters:6,  Unicode code points:6,  UTF-8 bytes:6,  UTF-16 bytes:12
🤔🇺🇸: characters:2,  Unicode code points:3,  UTF-8 bytes:12,  UTF-16 bytes:12
👨‍👩‍👧‍👦: characters:1,  Unicode code points:7,  UTF-8 bytes:25,  UTF-16 bytes:22
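
The code point and byte counts above can be cross-checked with a short Python sketch. This is an illustration under stated assumptions: the function name `metrics` is invented here, the strings are spelled out as escapes so the zero-width joiners are explicit, and the grapheme ("characters") column is omitted because Python's standard library has no grapheme cluster counter.

```python
def metrics(s: str) -> dict[str, int]:
    """Measure one string in three different units."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        # utf-16-le is used so that no byte order mark is counted.
        "utf16_bytes": len(s.encode("utf-16-le")),
    }


# Family emoji: man, woman, girl, boy joined by U+200D ZERO WIDTH JOINER.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
# Thinking face followed by the two regional indicators for "US".
thinking_flag = "\U0001F914\U0001F1FA\U0001F1F8"

for s in (family, thinking_flag, "BOGUS!"):
    print(s, metrics(s))
```

The family emoji is the longest string by code points and by bytes, yet the shortest by user-perceived characters, which is the point the Raku output makes.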

Wren

Library: Wren-upc

In Wren a string (i.e. an object of the String class) is an immutable sequence of bytes which is usually interpreted as UTF-8 but does not have to be.

With regard to string length, the String.count method returns the number of 'codepoints' in the string. If the string contains bytes which are invalid UTF-8, each such byte adds one to the count.

To find the number of bytes one can use String.bytes.count.

Unicode grapheme clusters, where what appears to be a single 'character' may in fact be an amalgam of several codepoints, are not directly supported by Wren, but it is possible to measure the length of a string in grapheme clusters (i.e. the number of user-perceived characters) using the Graphemes.clusterCount method of the Wren-upc module.
<lang ecmascript>import "./upc" for Graphemes

var printCounts = Fn.new { |s1, s2, c1, c2|
    var l1 = (c1 > c2) ? [s1, c1] : [s2, c2]
    var l2 = (c1 > c2) ? [s2, c2] : [s1, c1]
    System.print(  "%(l1[0]) : length %(l1[1])")
    System.print(  "%(l2[0]) : length %(l2[1])\n")
}

var codepointCounts = Fn.new { |s1, s2|
    var c1 = s1.count
    var c2 = s2.count
    System.print("Comparison by codepoints:")
    printCounts.call(s1, s2, c1, c2)
}

var byteCounts = Fn.new { |s1, s2|
    var c1 = s1.bytes.count
    var c2 = s2.bytes.count
    System.print("Comparison by bytes:")
    printCounts.call(s1, s2, c1, c2)
}

var graphemeCounts = Fn.new { |s1, s2|
    var c1 = Graphemes.clusterCount(s1)
    var c2 = Graphemes.clusterCount(s2)
    System.print("Comparison by grapheme clusters:")
    printCounts.call(s1, s2, c1, c2)
}

for (pair in [ ["nino", "niño"], ["👨‍👩‍👧‍👦", "🤔🇺🇸"] ]) {
    codepointCounts.call(pair[0], pair[1])
    byteCounts.call(pair[0], pair[1])
    graphemeCounts.call(pair[0], pair[1])
}</lang>

Output:

Comparison by codepoints:
niño : length 4
nino : length 4

Comparison by bytes:
niño : length 5
nino : length 4

Comparison by grapheme clusters:
niño : length 4
nino : length 4

Comparison by codepoints:
👨‍👩‍👧‍👦 : length 7
🤔🇺🇸 : length 3

Comparison by bytes:
👨‍👩‍👧‍👦 : length 25
🤔🇺🇸 : length 12

Comparison by grapheme clusters:
🤔🇺🇸 : length 2
👨‍👩‍👧‍👦 : length 1

Z80 Assembly

<lang z80>Terminator equ 0      ;null terminator
PrintChar  equ &BB5A  ;Amstrad CPC BIOS call, prints accumulator to screen as an ASCII character.

        org &8000

        ld hl,String1
        ld de,String2
        call CompareStringLengths

        jp nc,Print_HL_First
        ex de,hl                ;HL was shorter, so swap to print the longer string first
Print_HL_First:
        push bc
        push hl
        call PrintString
        pop hl
        push hl
        ld a,' '
        call PrintChar
        call GetStringLength
        ld a,b
        call ShowHex_NoLeadingZeroes
        call NewLine
        pop hl
        pop bc

        ex de,hl                ;now print the other string
        push bc
        push hl
        call PrintString
        pop hl
        push hl
        ld a,' '
        call PrintChar
        call GetStringLength
        ld a,b
        call ShowHex_NoLeadingZeroes
        call NewLine
        pop hl
        pop bc
ReturnToBasic:
        RET

String1:
        byte "Hello",Terminator
String2:
        byte "Goodbye",Terminator

;;;;;; RELEVANT SUBROUTINES - PRINTSTRING AND NEWLINE CREATED BY KEITH S. OF CHIBIAKUMAS

CompareStringLengths:
        ;HL = string 1
        ;DE = string 2
        ;CLOBBERS A,B,C
        push hl
        push de
        ex de,hl
        call GetStringLength    ;B = len(DE)
        ld c,b                  ;save len(DE) in C (GetStringLength returns the length in B)

        ex de,hl
        call GetStringLength    ;B = len(HL)
        ld a,b
        cp c
        pop de
        pop hl
        ret
        ;returns carry set if HL < DE, zero set if equal, zero & carry clear if HL >= DE
        ;returns len(DE) in C, and len(HL) in B.

GetStringLength:
        ld b,0
loop_getStringLength:
        ld a,(hl)
        cp Terminator
        ret z
        inc hl
        inc b
        jr loop_getStringLength

NewLine:
        push af
        ld a,13                 ;Carriage return
        call PrintChar
        ld a,10                 ;Line Feed
        call PrintChar
        pop af
        ret

PrintString:
        ld a,(hl)
        cp Terminator
        ret z
        inc hl
        call PrintChar
        jr PrintString

ShowHex_NoLeadingZeroes:
        ;useful for printing values where leading zeroes don't make sense,
        ;such as money etc.
        push af
        and %11110000
        ifdef gbz80             ;game boy
        swap a
        else                    ;zilog z80
        rrca
        rrca
        rrca
        rrca
        endif
        or a
        call nz,PrintHexChar    ;if top nibble of A is zero, don't print it.
        pop af
        and %00001111
        or a
        ret z                   ;if bottom nibble of A is zero, don't print it!
        jp PrintHexChar

PrintHexChar:
        or a                    ;Clear Carry Flag
        daa
        add a,&F0
        adc a,&40               ;This sequence converts a 4-bit hex digit to its ASCII equivalent.
        jp PrintChar</lang>

Output:

Goodbye 7
Hello 5