Longest common prefix: Difference between revisions

Content added Content deleted

Inline

Revision as of 13:21, 20 March 2015

It is often useful to find the common prefix of a set of strings, that is, the longest initial portion of all strings that are identical.

Given a set of strings, R, for a prefix S, it should hold that:

\forall x\ \in \ R:\ S\leq

_pref

\ x

~ "for all members x of set R, it hold true that string S is a prefix of x"

(help here: does not express that S is the longest common prefix of x)

An example use case for this: given a set of phone numbers, identify a common dialing code. This can be accomplished by first determining the common prefix (if any), and then matching it against know dialing codes (iteratively dropping characters from rhs until a match is found, as the lcp function may match more than the dialing code).

Test cases

For a function, lcp, accepting a list of strings, the following should hold true (the empty string, $\varepsilon$ , is considered a prefix of all strings):

lcp("interspecies","interstelar","interstate") = "inters"
lcp("throne","throne") = "throne"
lcp("throne","dungeon") =  $\varepsilon$ 
lcp("cheese") = "cheese"
lcp( $\varepsilon$ ) =  $\varepsilon$ 
lcp( $\emptyset$ ) =  $\varepsilon$ 
lcp("prefix","suffix") =  $\varepsilon$

Task inspired by this stackoverflow question: Find the longest common starting substring in a set of strings

J

In other words: compare adjacent strings pair-wise, combine results logically, find first mismatch in any of them, take that many characters from the first of the strings.

As the number of adjacent pairs is O(n) where n is the number of strings, this approach could be faster in the limit cases than sorting.

Examples:

<lang J> lcp 'interspecies','interstellar',:'interstate' inters

  lcp 'throne',:'throne'

throne

  lcp 'throne',:'dungeon'

  lcp ,:'cheese'

cheese

  lcp ,:

  lcp 0 0$

  lcp 'prefix',:'suffix'

</lang>

Python

Python: Functional

To see if all the n'th characters are the same I sort and compare the first to the last in the lambda function. <lang python>from itertools import takewhile

def lcp(*s):

   return .join(ch[0] for ch in takewhile(lambda x: x.sort() or not x or x[0] == x[-1],

(list(y) for y in zip(*s))))

assert lcp("interspecies","interstelar","interstate") == "inters" assert lcp("throne","throne") == "throne" assert lcp("throne","dungeon") == "" assert lcp("cheese") == "cheese" assert lcp("") == "" assert lcp("prefix","suffix") == ""</lang>

The above runs without output.

Scala

<lang scala>class TestLCP extends FunSuite {

 test("shared start") {
   assert(lcp("interspecies","interstelar","interstate") === "inters")
   assert(lcp("throne","throne") === "throne")
   assert(lcp("throne","dungeon").isEmpty)
   assert(lcp("cheese") === "cheese")
   assert(lcp("").isEmpty)
   assert(lcp(Nil :_*).isEmpty)
   assert(lcp("prefix","suffix").isEmpty)
 }

 def lcp(list: String*) = list.foldLeft("")((_,_) =>
   (list.min.view,list.max.view).zipped.takeWhile(v => v._1 == v._2).unzip._1.mkString)

}</lang>

zkl

The string method prefix returns the number of common prefix characters. <lang zkl>fcn lcp(s,strings){ s[0,s.prefix(vm.pasteArgs(1))] }</lang> Or, without using prefix:

Translation of: Scala

<lang zkl>fcn lcp(strings){

  vm.arglist.reduce(fcn(prefix,s){ Utils.Helpers.zipW(prefix,s) // lazy zip
     .pump(String,fcn([(a,b)]){ a==b and a or Void.Stop })
  })

}</lang> <lang zkl>tester:=TheVault.Test.UnitTester.UnitTester(); tester.testRun(lcp.fp("interspecies","interstelar","interstate"),Void,"inters",__LINE__); tester.testRun(lcp.fp("throne","throne"),Void,"throne",__LINE__); tester.testRun(lcp.fp("throne","dungeon"),Void,"",__LINE__); tester.testRun(lcp.fp("cheese"),Void,"cheese",__LINE__); tester.testRun(lcp.fp(""),Void,"",__LINE__); tester.testRun(lcp.fp("prefix","suffix"),Void,"",__LINE__); tester.stats();</lang> The fp (partial application) method is used to delay running lcp until the tester actually tests.

Output:

===================== Unit Test 1 =====================
Test 1 passed!
===================== Unit Test 2 =====================
Test 2 passed!
===================== Unit Test 3 =====================
Test 3 passed!
===================== Unit Test 4 =====================
Test 4 passed!
===================== Unit Test 5 =====================
Test 5 passed!
===================== Unit Test 6 =====================
Test 6 passed!
6 tests completed.
Passed test(s): 6 (of 6)

@@ Line 3: / Line 3: @@
 Given a set of strings, ''R'', for a prefix ''S'', it should hold that:
-:<math>\forall x\ \in\ R:\ S \le</math><sub>pref</sub> <math>\ x</math>
+:<math>\forall x\ \in\ R:\ S \le</math><sub>pref</sub> <math>\ x</math> ~ "for all members ''x'' of set ''R'', it hold true that string ''S'' is a prefix of ''x''"
 :(help here: ''does not express that ''S'' is the <u>longest</u> common prefix of ''x)
 An example use case for this: given a set of phone numbers, identify a common dialing code.  This can be accomplished by first determining the common prefix (if any), and then matching it against know dialing codes (iteratively dropping characters from rhs until a match is found, as the ''lcp'' function may match more than the dialing code).