We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,
Why not do something like this? First, take the first character (or word) of each of the three columns; it marks where an overlap could begin.
Then append the first string to a StringBuffer, one character at a time.
At each step, check whether you have reached a part that overlaps with the second or third string.
If so, switch to appending from the string that also contains the rest of the first string.
When the first string is done, or if there was no overlap, continue with the second string and then the third.
So for the second example in the question, I would keep d and g in two variables.
As I append abc from the first string, I see that d also starts the second string, so I switch to the second string and append def from there; then I move on and finish with string 3.
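A minimal Python sketch of the idea above, assuming the overlap always sits at the end of one string and the start of the next. The names merge_two and merge_columns are illustrative and do not come from the question. Wherever the next string's first character appears in the current string, the sketch checks whether the rest of the current string is a prefix of the next one.

```python
def merge_two(left: str, right: str) -> str:
    """Append right to left, writing a shared suffix/prefix only once."""
    if not left or not right:
        return left + right
    first = right[0]  # the character that could signal the start of an overlap
    pos = left.find(first)
    while pos != -1:
        # The tail of left starting at pos must be a prefix of right.
        if right.startswith(left[pos:]):
            return left[:pos] + right
        pos = left.find(first, pos + 1)  # try the next candidate position
    return left + right  # no overlap found


def merge_columns(a: str, b: str, c: str) -> str:
    """Merge three column values from left to right."""
    return merge_two(merge_two(a, b), c)
```

Because candidates are tried from left to right, the first hit is also the longest overlap. The worst case is still quadratic, since every candidate position may need a full comparison.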
If you are doing this in a database why not just use a stored procedure to do this?
Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!
Consider Knuth-Morris-Pratt (KMP). It keeps track of the longest prefix of the pattern matched so far as it scans the text. With S2 as the pattern and S1 as the text, that means it knows how much of S2 has been matched when it reaches the end of S1, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the pattern early on, and have it return the amount matched instead of 0 at the end.
That gives you an O(n) algorithm. Nice!
int OverlappedStringLength(string s1, string s2) {
    // Nothing can overlap with an empty string
    if (s1.Length == 0 || s2.Length == 0) return 0;
    // Trim s1 so it isn't longer than s2
    if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

    int[] T = ComputeBackTrackTable(s2); // O(n)
    int m = 0; // start of the current match attempt in s1
    int i = 0; // number of characters of s2 matched so far
    while (m + i < s1.Length) {
        if (s2[i] == s1[m + i]) {
            i += 1;
            //<-- removed the return case here, because |s1| <= |s2|
        } else {
            m += i - T[i];
            if (i > 0) i = T[i];
        }
    }
    return i; //<-- changed the return here to return characters matched
}

int[] ComputeBackTrackTable(string s) {
    var T = new int[s.Length];
    if (s.Length == 0) return T;
    T[0] = -1;
    if (s.Length == 1) return T; // T[1] would be out of range
    T[1] = 0;
    int cnd = 0;
    int pos = 2;
    while (pos < s.Length) {
        if (s[pos - 1] == s[cnd]) {
            T[pos] = cnd + 1;
            pos += 1;
            cnd += 1;
        } else if (cnd > 0) {
            cnd = T[cnd];
        } else {
            T[pos] = 0;
            pos += 1;
        }
    }
    return T;
}
OverlappedStringLength("abcdef", "defghl") returns 3
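The C# function returns only the overlap length. As a rough Python sketch of the same idea carried through to the concatenated result, the block below uses the prefix-function form of the KMP failure table, which is a slightly different formulation from the table in the C# code. The names failure_table, overlap_length and concat_with_overlap are illustrative.

```python
def failure_table(pattern: str) -> list:
    """fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def overlap_length(s1: str, s2: str) -> int:
    """Length of the longest suffix of s1 that is also a prefix of s2."""
    if not s1 or not s2:
        return 0
    fail = failure_table(s2)
    k = 0  # number of characters of s2 currently matched
    # Only the last len(s2) characters of s1 can take part in an overlap.
    for ch in s1[-len(s2):]:
        while k > 0 and (k == len(s2) or ch != s2[k]):
            k = fail[k - 1]
        if ch == s2[k]:
            k += 1
    return k  # whatever is matched when s1 runs out is the overlap


def concat_with_overlap(s1: str, s2: str) -> str:
    """Concatenate s1 and s2, writing the overlapping part only once."""
    return s1 + s2[overlap_length(s1, s2):]
```

For three columns, the call would be nested: concat_with_overlap(concat_with_overlap(a, b), c).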
Or you could do it in MySQL with the following stored function:
DELIMITER //
DROP FUNCTION IF EXISTS concat_with_overlap //
CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
RETURNS VARCHAR(200) DETERMINISTIC
BEGIN
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(b);
  IF al=0 THEN
    RETURN b;
  END IF;
  IF bl=0 THEN
    RETURN a;
  END IF;
  -- The overlap cannot be longer than the shorter string
  IF al < bl THEN
    SET i = al;
  ELSE
    SET i = bl;
  END IF;
  -- Try the longest possible overlap first, then shrink
  search: WHILE i > 0 DO
    IF RIGHT(a,i) = LEFT(b,i) THEN
      RETURN CONCAT(a, SUBSTR(b,i+1));
    END IF;
    SET i = i - 1;
  END WHILE search;
  RETURN CONCAT(a,b);
END//
I tried it with your test data:
mysql> select a,b,c,
    -> concat_with_overlap( concat_with_overlap( a, b ), c ) as result
    -> from testing //
+-------------+---------+--------+-------------+
| a           | b       | c      | result      |
+-------------+---------+--------+-------------+
| a           | b       | c      | abc         |
| abcde       | defgh   | ghlmn  | abcdefghlmn |
| abcdede     | dedefgh |        | abcdedefgh  |
| abcde       | d       | ghlmn  | abcdedghlmn |
| abcdef      |         | defghl | abcdefghl   |
| abXabXabXac | XabXac  |        | abXabXabXac |
+-------------+---------+--------+-------------+
6 rows in set (0.00 sec)
You may use a DFA. For example, a string XYZ should be read by the regular expression ^((X)?Y)?Z. Applied to the start of the second string, that regular expression matches the longest prefix that is also a suffix of XYZ. With such a regular expression you can either match and take the match result, or generate a DFA whose state indicates the proper position for the "cut".
In Scala, the first implementation -- using regex directly -- might go like this:
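import scala.util.matching.Regex

// Builds ^((a)?b)?c for "abc"; each character is quoted so that regex
// metacharacters in s1 are matched literally.
def toRegex(s1: String): Regex =
  ("^" + s1.map(c => Regex.quote(c.toString)).reduceLeft((a, b) => "(" + a + ")?" + b)).r

def concatWithoutMatch(s1: String, s2: String): String =
  if (s1.isEmpty) s2 // reduceLeft would throw on an empty string
  else {
    val prefix = toRegex(s1).findFirstIn(s2).getOrElse("")
    s1 + s2.drop(prefix.length)
  }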
For example:
scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd
scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef
scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
I'm trying to make this C# as pleasant to read as possible.
public static string Concatenate(string s1, string s2)
{
    if (string.IsNullOrEmpty(s1)) return s2;
    if (string.IsNullOrEmpty(s2)) return s1;
    if (s1.Contains(s2)) return s1;
    if (s2.Contains(s1)) return s2;

    char startChar = s2[0];
    // Try each occurrence of s2's first character in s1, leftmost first,
    // so the longest possible overlap is checked first.
    int s1FirstIndexOfStartChar = s1.IndexOf(startChar);
    while (s1FirstIndexOfStartChar >= 0)
    {
        int overlapLength = s1.Length - s1FirstIndexOfStartChar;
        if (CheckOverlap(s1, s2, overlapLength))
        {
            return s1 + s2.Substring(overlapLength);
        }
        // Resume after the current position; searching from the same
        // index would find the same character again and never terminate.
        s1FirstIndexOfStartChar =
            s1.IndexOf(startChar, s1FirstIndexOfStartChar + 1);
    }
    return s1 + s2;
}

private static bool CheckOverlap(string s1, string s2, int overlapLength)
{
    // The overlap cannot be empty or longer than s2
    if (overlapLength <= 0 || overlapLength > s2.Length)
        return false;
    return s1.Substring(s1.Length - overlapLength) ==
           s2.Substring(0, overlapLength);
}
EDIT: I see that this is almost the same as jerryjvl's solution. The only difference is that this will work with the "abcde", "d" case.
I think this will be pretty quick:
You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine if there is overlap. If there isn't, you need to keep searching. If there is you need to determine if there is any possibility for another match.
To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters, i.e. if the location of the match in string1 leaves only a short part of string1 remaining, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.
Repeat as required.
Job done!
This doesn't require much in terms of memory allocation (all searching done in place, just need to allocate the resultant string buffer) and only requires (at most) one pass of one of the strings being overlapped.
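A simplified Python sketch of this right-to-left search. It assumes that trying every occurrence of string2's first character is acceptable, rather than reasoning about recurrences in the shorter string as described above. The function name concat_scanning_backwards is illustrative.

```python
def concat_scanning_backwards(s1: str, s2: str) -> str:
    """Scan s1 from right to left for s2's first character and keep the
    longest position where the tail of s1 is a prefix of s2."""
    if not s1 or not s2:
        return s1 + s2
    best = 0  # longest overlap confirmed so far
    # An overlap cannot be longer than s2, so the scan can stop there.
    lowest = max(0, len(s1) - len(s2))
    pos = s1.rfind(s2[0])
    while pos >= lowest:
        if s2.startswith(s1[pos:]):
            best = len(s1) - pos  # a longer overlap may still exist further left
        pos = s1.rfind(s2[0], 0, pos)  # next candidate strictly to the left
    return s1 + s2[best:]
```

Unlike the description, this version slices s1 for each comparison, so it does allocate temporary strings. It also revisits characters whenever several candidates exist, so it is not a strict single pass.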