We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,
How about (pardon the C#):
public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}
This implementation does not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance a lot.
Also, the match check first tests the extremities of the potentially matched area (two single characters), which in normal English text should give a good chance of rejecting a candidate without checking any other characters.
Only once it establishes the longest match it can make, or that no match is possible at all, will two strings be concatenated. I have used simple '+' here, because I think the optimisation of the rest of the algorithm has already removed most of the inefficiencies in your original. Give this a try and let me know if it is good enough for your purposes.
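For readers who want to try the same idea outside C#, here is a minimal Python sketch of the approach described above: look for the last character of the first string inside the second, test the first character cheaply, and only then compare the whole candidate. The name `overlap_concat` is made up for this illustration, and the sketch has not been benchmarked.

```python
def overlap_concat(s1, s2):
    """Join s1 and s2, dropping the longest prefix of s2 that is also a suffix of s1."""
    # Handle None/empty inputs; never return None.
    if not s1:
        return s2 or ""
    if not s2:
        return s1

    last1, first2 = s1[-1], s2[0]
    # Highest index in s2 where an overlap could end, bounded by len(s1).
    idx = s2.rfind(last1, 0, min(len(s1), len(s2)))
    while idx != -1:
        start = len(s1) - 1 - idx  # where the candidate overlap begins in s1
        # Quick check on the first character, then a full index-by-index check
        # that does not build any temporary substrings.
        if s1[start] == first2 and all(s1[start + k] == s2[k] for k in range(idx + 1)):
            return s1 + s2[idx + 1:]
        # Look for the next (shorter) candidate.
        idx = s2.rfind(last1, 0, idx)

    # No overlap found, so concatenate the full strings.
    return s1 + s2
```

Because candidates are visited from the highest index downwards, the first one that passes the full check is the longest overlap.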
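Here's a Perl pseudo-one-liner:
# Assumes neither string contains "\0", which is used as a separator here
$_ = $s1 . "\0" . $s2;
# Collapse only a repeat that straddles the separator (suffix of $s1 == prefix of $s2)
s/(.*)\0\1/$1/s;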
Perl's regex engine is heavily optimized, but note that it is a backtracking matcher rather than a pure FSM, and a backreference such as \1 can make matching slow on long inputs, so measure it on your own data before relying on its complexity.
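A backreference substitution only removes the overlap at the join if the pattern is tied to the boundary between the two strings; otherwise it can also collapse a repeat that sits entirely inside one of them. A minimal Python sketch of that idea follows. It assumes neither string contains a NUL character, which is used here as a separator, and the function name is made up for this example.

```python
import re


def regex_overlap_concat(a, b):
    """Join a and b, removing the longest suffix of a that is also a prefix of b.

    Assumption: neither a nor b contains the NUL separator.
    """
    sep = "\0"
    # Group 1 must end exactly at the separator and be repeated right after it,
    # so only an overlap at the join is collapsed. The regex engine returns the
    # leftmost match, which corresponds to the longest overlapping suffix.
    return re.sub(r"(.*)\x00\1", r"\1", a + sep + b, count=1, flags=re.DOTALL)
```

When there is no overlap, group 1 matches the empty string and the substitution simply removes the separator.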
This problem seems like a variation of the longest common sub-sequence problem, which can be solved via dynamic programming.
http://www.algorithmist.com/index.php/Longest_Common_Subsequence
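The overlap needed here is contiguous and anchored at the end of the first string and the start of the second, so a full LCS table may be more than is required. One table-driven alternative is the prefix function from Knuth-Morris-Pratt matching. This is a swapped-in technique rather than anything the answer above proposes, and the function names below are made up for this sketch.

```python
def prefix_function(p):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi


def overlap_length(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    if not a or not b:
        return 0
    pi = prefix_function(b)
    k = 0  # number of characters of b currently matched
    # Only the last len(b) characters of a can take part in an overlap.
    for ch in a[-len(b):]:
        while k > 0 and (k == len(b) or ch != b[k]):
            k = pi[k - 1]
        if ch == b[k]:
            k += 1
    return k


def kmp_overlap_concat(a, b):
    """Concatenate a and b without repeating their overlap."""
    return a + b[overlap_length(a, b):]
```

Each character of the tail of `a` and of `b` is visited a bounded number of times on average, so the work stays linear in the two lengths even for highly repetitive inputs.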
Here is a Java implementation which finds the maximum overlap between two strings with length N and M in something like O(min(N,M)) operations ~ O(N).
I had the same idea as @sepp2k's now-deleted answer and worked a bit further on it. It seems to work fine. The idea is to iterate through the first string and start tracking once you find something that matches the start of the second string. I figured out that you might need to track several candidates simultaneously when false and true matches overlap. At the end you choose the longest track.
I haven't worked out the absolute worst case yet, with maximal overlap between matches, but I don't expect it to spiral out of control since I think that you cannot overlap arbitrarily many matches. Normally you only track one or two matches at a time: candidates are removed as soon as there is a mismatch.
static class Candidate {
    int matchLen = 0;
}

private String overlapOnce(@NotNull final String a, @NotNull final String b) {
    final int maxOverlap = Math.min(a.length(), b.length());
    final Collection<Candidate> candidates = new LinkedList<>();
    for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(0)) {
            candidates.add(new Candidate());
        }
        for (final Iterator<Candidate> it = candidates.iterator(); it.hasNext(); ) {
            final Candidate candidate = it.next();
            if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                //advance
                ++candidate.matchLen;
            } else {
                //not matching anymore, remove
                it.remove();
            }
        }
    }
    final int matchLen = candidates.isEmpty() ? 0 :
            candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
    return a + b.substring(matchLen);
}

private String overlapOnce(@NotNull final String... strings) {
    return Arrays.stream(strings).reduce("", this::overlapOnce);
}
And some tests:
@Test
public void testOverlapOnce() throws Exception {
    assertEquals("", overlapOnce("", ""));
    assertEquals("ab", overlapOnce("a", "b"));
    assertEquals("abc", overlapOnce("ab", "bc"));
    assertEquals("abcdefghqabcdefghi", overlapOnce("abcdefgh", "efghqabcdefghi"));
    assertEquals("aaaaaabaaaaaa", overlapOnce("aaaaaab", "baaaaaa"));
    assertEquals("ccc", overlapOnce("ccc", "ccc"));
    assertEquals("abcabc", overlapOnce("abcabc", "abcabc"));

    /*
     * "a" + "b" + "c" => "abc"
     * "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
     * "abcdede" + "dedefgh" + "" => "abcdedefgh"
     * "abcde" + "d" + "ghlmn" => "abcdedghlmn"
     * "abcdef" + "" + "defghl" => "abcdefghl"
     */
    assertEquals("abc", overlapOnce("a", "b", "c"));
    assertEquals("abcdefghlmn", overlapOnce("abcde", "defgh", "ghlmn"));
    assertEquals("abcdedefgh", overlapOnce("abcdede", "dedefgh"));
    assertEquals("abcdedghlmn", overlapOnce("abcde", "d", "ghlmn"));
    assertEquals("abcdefghl", overlapOnce("abcdef", "", "defghl"));

    // Consider str1=abXabXabXac and str2=XabXac. Your approach will output abXabXabXacXabXac because by
    // resetting j=0, it goes too far back.
    assertEquals("abXabXabXac", overlapOnce("abXabXabXac", "XabXac"));

    // Try to trick the algorithm with an earlier false match overlapping with the real match
    // - match first "aba" and miss that the last "a" is the start of the
    // real match
    assertEquals("ababa--", overlapOnce("ababa", "aba--"));
}
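The worst case left open above can be probed with a small Python port of the candidate-tracking idea that also records how many candidates are alive at once. This is only a sketch: `overlap_with_stats` is a name made up here, and it tracks plain integers instead of `Candidate` objects.

```python
def overlap_with_stats(a, b):
    """Return (overlap_length, peak_live_candidates) for suffix-of-a vs prefix-of-b."""
    max_overlap = min(len(a), len(b))
    candidates = []  # match length so far of each live candidate
    peak = 0
    for i in range(len(a) - max_overlap, len(a)):
        if a[i] == b[0]:
            candidates.append(0)  # a new candidate may start at position i
        # Advance candidates whose next character matches and drop the rest.
        candidates = [m + 1 for m in candidates if a[i] == b[m]]
        peak = max(peak, len(candidates))
    # Every survivor reaches the end of a; the longest one is the overlap.
    return (max(candidates) if candidates else 0), peak


if __name__ == "__main__":
    print(overlap_with_stats("ababa", "aba--"))
    print(overlap_with_stats("a" * 50, "a" * 50))
```

For ordinary text the peak stays small, but when both inputs repeat a single character every position starts a candidate and none is ever dropped, so the number of live candidates grows with the overlap length and the total work becomes quadratic for such inputs.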
Here's a solution in Python. It should be faster just by not needing to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.
def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'
If you're doing it outside the database, try perl:
sub concat {
    my($x,$y) = @_;
    return $x if $y eq '';
    return $y if $x eq '';
    my($i) = length($x) < length($y) ? length($x) : length($y);
    while($i > 0) {
        if( substr($x,-$i) eq substr($y,0,$i) ) {
            return $x . substr($y,$i);
        }
        $i--;
    }
    return $x . $y;
}
It's exactly the same algorithm as yours; I'm just curious whether Java or Perl is faster ;-)