Longest common substring: Difference between revisions
(→{{header|Racket}}: actual implementation added) |
(jq) |
||
Line 196: | Line 196: | ||
<lang J> 'thisisatest' lcs 'testing123testing' |
<lang J> 'thisisatest' lcs 'testing123testing' |
||
test</lang> |
test</lang> |
||
=={{header|jq}}== |
|||
{{trans|C#, Go, Ruby}} |
|||
{{works with|jq|1.4}} |
|||
'''Utility functions''': |
|||
<lang jq># Create an m x n matrix |
|||
def matrix(m; n; init): |
|||
if m == 0 then [] |
|||
elif m == 1 then [range(0;n) | init] |
|||
elif m > 0 then |
|||
matrix(1;n;init) as $row |
|||
| [range(0;m) | $row ] |
|||
else error("matrix\(m);_;_) invalid") |
|||
end; |
|||
def set(i;j; value): |
|||
value as $value |
|||
| setpath([i,j]; $value);</lang> |
|||
'''Longest Common Substring''': |
|||
<lang jq>def lcs(a; b): |
|||
matrix(a|length; b|length; 0) as $lengths |
|||
# state: [ $lengths, greatestLength, answer ] |
|||
| [$lengths, 0] |
|||
| reduce range(0; a|length) as $i |
|||
(.; |
|||
reduce range(0; b|length) as $j |
|||
(.; |
|||
if a[$i:$i+1] == b[$j:$j+1] then |
|||
if $i == 0 or $j == 0 then .[0] |= set($i; $j; 1) |
|||
else (.[0] |= set($i; $j; .[$i-1][$j-1] + 1)) |
|||
end |
|||
| if .[0][$i][$j] > .[1] then |
|||
.[1] = .[0][$i][$j] |
|||
| .[1] as $mx |
|||
| .[2] = a[1 + $i - $mx : $i+1] # output |
|||
else . |
|||
end |
|||
else . |
|||
end )) | .[2];</lang> |
|||
'''Example''': |
|||
<lang jq>lcs("thisisatest"; "testing123testing")</lang> |
|||
{{out}} |
|||
<lang sh>$ jq -n -f Longest_common_substring.jq |
|||
"test"</lang> |
|||
=={{header|Racket}}== |
=={{header|Racket}}== |
Revision as of 05:09, 11 May 2015
Write a function returns the longest common substring of two strings. Use it within a program that demonstrates sample output from the function, which will consist of the longest common substring between "thisisatest" and "testing123testing". Note that substrings are consecutive characters within a string. This distinguishes them from subsequences, which is any sequence of characters within a string, even if there are extraneous characters in between them. Hence, the longest common subsequence between "thisisatest" and "testing123testing" is "tsitest", whereas the longest common substring is just "test".
References:
C#
Using dynamic programming
<lang csharp>using System;
namespace LongestCommonSubstring {
class Program { static void Main(string[] args) { Console.WriteLine(lcs("thisisatest", "testing123testing")); Console.ReadKey(true); }
public static string lcs(string a, string b) { var lengths = new int[a.Length, b.Length]; int greatestLength = 0; string output = ""; for (int i = 0; i < a.Length; i++) { for (int j = 0; j < b.Length; j++) { if (a[i] == b[j]) { lengths[i, j] = i == 0 || j == 0 ? 1 : lengths[i - 1, j - 1] + 1; if (lengths[i, j] > greatestLength) { greatestLength = lengths[i, j]; output = a.Substring(i - greatestLength + 1, greatestLength); } } else { lengths[i, j] = 0; } } } return output; } }
}</lang>
- Output:
test
Searching for smaller substrings of a in b
<lang csharp>//C# program tests the LCSUBSTR (Longest Common Substring) subroutine. using System; namespace LongestCommonSubstring {
class Program { static void Main(string[] args) { string a = args.Length >= 1 ? args[0] : ""; /*get two arguments (strings). */ string b = args.Length == 2 ? args[1] : ""; if (a == "") a = "thisisatest"; /*use this string for a default. */ if (b == "") b = "testing123testing"; /* " " " " " " */ Console.WriteLine("string A = {0}", a); /*echo string A to screen. */ Console.WriteLine("string B = {0}", b); /*echo string B to screen. */ Console.WriteLine("LCsubstr = {0}", LCsubstr(a, b)); /*tell Longest Common Substring. */ Console.ReadKey(true); } /*stick a fork in it, we're done.*/
/*─────────────────────────────────LCSUBSTR subroutine─────────────────────────────────*/ public static string LCsubstr(string x, string y) /*Longest Common Substring. */ { string output = ""; int lenx = x.Length; /*shortcut for using the X length*/ for (int j = 0; j < lenx; j++) /*step through start points in X.*/ { for (int k = lenx - j; k > -1; k--) /*step through string lengths. */ { string common = x.Substring(j, k); /*extract a common substring. */ if (y.IndexOf(common) > -1 && common.Length > output.Length) output = common; /*longest?*/ } /*k*/ } /*j*/ return output; /*$ is "" if no common string. */ } }
}</lang> output when using the default inputs:
string A = thisisatest string B = testing123testing LCsubstr = test
Searching for smaller substrings of a in b (simplified)
<lang csharp>//C# program tests the LCS (Longest Common Substring) subroutine. using System; namespace LongestCommonSubstring {
class Program { static void Main(string[] args) { string a = args.Length >= 1 ? args[0] : ""; /*get two arguments (strings). */ string b = args.Length == 2 ? args[1] : ""; if (a == "") a = "thisisatest"; /*use this string for a default. */ if (b == "") b = "testing123testing"; /* " " " " " " */ Console.WriteLine("string A = {0}", a); /*echo string A to screen. */ Console.WriteLine("string B = {0}", b); /*echo string B to screen. */ Console.WriteLine("LCS = {0}", lcs(a, b)); /*tell Longest Common Substring. */ Console.ReadKey(true); } /*stick a fork in it, we're done.*/
/*─────────────────────────────────LCS subroutine─────────────────────────────────*/ private static string lcs(string a, string b) { if(b.Length<a.Length){ string t=a; a=b; b=t; } for (int n = a.Length; n > 0; n--) { for (int m = a.Length-n; m <= a.Length-n; m++) { string s=a.Substring(m,n); if(b.Contains(s)) return(s); } } return ""; } }
</lang> output when using the default inputs:
string A = thisisatest string B = testing123testing LCS = test
Go
<lang go>package main
import "fmt"
func lcs(a, b string) (output string) {
lengths := make([]int, len(a)*len(b)) greatestLength := 0 for i, x := range a { for j, y := range b { if x == y { if i == 0 || j == 0 { lengths[i*len(b)+j] = 1 } else { lengths[i*len(b)+j] = lengths[(i-1)*len(b)+j-1] + 1 } if lengths[i*len(b)+j] > greatestLength { greatestLength = lengths[i*len(b)+j] output = a[i-greatestLength+1 : i+1] } } } } return
}
func main() {
fmt.Println(lcs("thisisatest", "testing123testing"))
}</lang>
- Output:
test
J
This algorithm starts by comparing each character in the one string to each character in the other, and then iterates on this result until it finds the length of the longest common substring. So if Lx is the length of one argument string, Ly is the length of the other argument string, and Lr is the length of the result string, this algorithm uses space on the order of Lx*Ly and time on the order of Lx*Ly*Lr.
In other words: this can be suitable for small problems, but you might want something better if you're comparing gigabyte length strings with high commonality.
<lang J>lcstr=:4 :0
C=. ({.~ 1+$) x=/y M=. >./ (* * * >. * + (_1&|.)@:|:^:2)^:_ C N=. >./ M y {~ (M i. N)-i.-N
)</lang>
Intermedate results:
C shows which characters are in common between the two strings. M marks the length of the longest common substring ending at each position in the right argument N is the length of the longest common substring
Example use:
<lang J> 'thisisatest' lcs 'testing123testing' test</lang>
jq
Utility functions: <lang jq># Create an m x n matrix def matrix(m; n; init):
if m == 0 then [] elif m == 1 then [range(0;n) | init] elif m > 0 then matrix(1;n;init) as $row | [range(0;m) | $row ] else error("matrix\(m);_;_) invalid") end;
def set(i;j; value):
value as $value | setpath([i,j]; $value);</lang>
Longest Common Substring: <lang jq>def lcs(a; b):
matrix(a|length; b|length; 0) as $lengths # state: [ $lengths, greatestLength, answer ] | [$lengths, 0] | reduce range(0; a|length) as $i (.; reduce range(0; b|length) as $j (.; if a[$i:$i+1] == b[$j:$j+1] then if $i == 0 or $j == 0 then .[0] |= set($i; $j; 1) else (.[0] |= set($i; $j; .[$i-1][$j-1] + 1)) end | if .[0][$i][$j] > .[1] then .[1] = .[0][$i][$j] | .[1] as $mx | .[2] = a[1 + $i - $mx : $i+1] # output else . end else . end )) | .[2];</lang>
Example: <lang jq>lcs("thisisatest"; "testing123testing")</lang>
- Output:
<lang sh>$ jq -n -f Longest_common_substring.jq "test"</lang>
Racket
A chance to show off how to use HashTable
types in typed/racket
<lang racket>#lang typed/racket (: lcs (String String -> String)) (define (lcs a b)
(: all-substrings# (String -> (HashTable String Boolean))) (define (all-substrings# str) (define l (string-length str)) (for*/hash : (HashTable String Boolean) ((s (in-range 0 l)) (e (in-range (add1 s) (add1 l)))) (values (substring str s e) #t))) (define a# (all-substrings# a)) (define b# (all-substrings# b)) (define-values (s l) (for/fold : (Values String Nonnegative-Integer) ((s "") (l : Nonnegative-Integer 0)) ((a_ (in-hash-keys a#)) #:when (and (> (string-length a_) l) (hash-ref b# a_ #f))) (values a_ (string-length a_)))) s)
(module+ test
("thisisatest" . lcs . "testing123testing"))</lang>
- Output:
"test"
REXX
<lang rexx>/*REXX program tests the LCSUBSTR (Longest Common Substring) subroutine.*/ parse arg a b . /*get two arguments (strings). */ if a== then a= "thisisatest" /*use this string for a default. */ if b== then b= "testing123testing" /* " " " " " " */ say ' string A =' a /*echo string A to screen. */ say ' string B =' b /*echo string B to screen. */ say ' LCsubstr =' LCsubstr(a,b) /*tell Longest Common Substring. */ exit /*stick a fork in it, we're done.*/ /*──────────────────────────────────LCSUBSTR subroutine─────────────────*/ LCsubstr: procedure; parse arg x,y,$ /*Longest Common Substring. */
- =length(x) /*shortcut for using the X length*/
do j=1 for # /*step through start points in X.*/ do k=# by -1 for # /*step through string lengths. */ _=strip(substr(x,j,k)) /*extract a common substring. */ if pos(_,y)\==0 & length(_)>length($) then $=_ /*longest?*/ end /*j*/ /* [↑] see if it is the longest.*/ end /*k*/
return $ /*$ is null if no common string.*/</lang> output when using the default inputs:
string A = thisisatest string B = testing123testing LCsubstr = test
Ruby
<lang ruby>def longest_common_substring(a,b)
lengths = Array.new(a.length){Array.new(b.length, 0)} greatestLength = 0 output = "" a.each_char.with_index do |x,i| b.each_char.with_index do |y,j| next if x != y lengths[i][j] = (i.zero? || j.zero?) ? 1 : lengths[i-1][j-1] + 1 if lengths[i][j] > greatestLength greatestLength = lengths[i][j] output = a[i - greatestLength + 1, greatestLength] end end end output
end
p longest_common_substring("thisisatest", "testing123testing")</lang>
- Output:
"test"
zkl
<lang zkl>fcn lcd(a,b){
if(b.len()<a.len()){ t:=a; a=b; b=t; } foreach n,m in ([a.len()..1,-1],a.len()-n+1){ s:=a[m,n]; if(b.holds(s)) return(s); } ""
}</lang> <lang zkl>lcd("testing123testing","thisisatest").println();</lang>
- Output:
test