问题
Given the following, i can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
print longest_common_substring(s1, s2)
[out]:
foo bar
But how do i ensure that the longest common substring respect English word boundary and don't cut up a word? For example, the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
outputs the follow which is NOT desired since it breaks up the word kappa
from s2:
a foo bar
The desired output is still:
foo bar
I've tried also an ngram way of getting the longest common substring respecting word boundary but is there other way that deals with strings without calculating ngrams? (see answer)
回答1:
This is too simple to understand. I used your code to do 75% of the job. I first split the sentence into words, then pass it to your function to get the largest common substring(in this case it will be longest consecutive words), so your function gives me ['foo', 'bar'], I join the elements of that array to produce the desired result.
Here is the online working copy for you to test and verify and fiddle with it.
http://repl.it/RU0/1
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
def longest_common_sentence(s1, s2):
s1_words = s1.split(' ')
s2_words = s2.split(' ')
return ' '.join(longest_common_substring(s1_words, s2_words))
s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'
Edge cases
'.' and '?' are also treated as valid words as in your case if there is a space between last word and the punctuation mark. If you don't leave a space they will be counted as part of last word. In that case 'sheep' and 'sheep?' would not be same words anymore. Its up to you decide what to do with such characters, before calling such function. In that case
import re
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)
and then continue as usual.
回答2:
My answer does not draw from any official sources, but just a simple observation: at least in my installation, there is a difference between the output of your LCS function as it is on the pair (s1, s2) and (s1, s3):
In [1]: s1 = "this is a foo bar sentence ."
In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"
In [4]: s3 = "what a kappa foo bar black sheep ?"
In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '
In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '
As you may notice, if complete words are matched, then also the surrounding whitespace is matched.
You can then modify the function before its output is returned, like this:
answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
return longest_common_substring(s1, answer[1:])
else:
return answer
I am sure there are other edge cases, like the substring appearing at the end of the string, recursively calling the function either with s1
or s2
, whether to trim the answer
front or back, and others -- but at least in the cases you show, this simple modification does what you want:
In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '
Do you think this direction is worth exploring?
回答3:
Just add an acceptance condition in your code:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest and word_aligned(x, y, m[x][y]): # acceptance condition
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
def word_aligned(x, y, length):
"""check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
# check start of match in s1
if s1[x - 1].isspace():
# match doesn't start with a character, reject
return False
if x - 2 > 0 and not s1[x - 2].isspace():
# char before match is not start of line or space, reject
return False
# check start of match in s2
... same as above ...
# check end of match in s1
... your code is a bit hard for me follow, what is end of match? ...
# check end of match in s2
... same as above ...
return True
print longest_common_substring(s1, s2)
回答4:
All you need to do is add checks for beginning and end of a word.
You then update m
only for valid match ends.
Like so:
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
# current character in s1
x_char = s1[x - 1]
# we are at the beginning of a word in s1 if
# (we are at the beginning of s1) or
# (previous character is a space)
x_word_begin = (x == 1) or (s1[x - 2] == " ")
# we are at the end of a word in s1 if
# (we are at the end of s1) or
# (next character is a space)
x_word_end = (x == len(s1)) or (s1[x] == " ")
for y in xrange(1, 1 + len(s2)):
# current character in s2
y_char = s2[y - 1]
# we are at the beginning of a word in s2 if
# (we are at the beginning of s2) or
# (previous character is a space)
y_word_begin = (y == 1) or (s2[y - 2] == " ")
# we are at the end of a word in s2 if
# (we are at the end of s2) or
# (next character is a space)
y_word_end = (y == len(s2)) or (s2[y] == " ")
if x_char == y_char:
# no match starting with x_char
if m[x - 1][y - 1] == 0:
# a match can start only with a space
# or at the beginning of a word
if x_char == " " or (x_word_begin and y_word_begin):
m[x][y] = m[x - 1][y - 1] + 1
else:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
# the match can end only with a space
# or at the end of a word
if x_char == " " or (x_word_end and y_word_end):
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
回答5:
This is more of an interesting problem then I initially gave it credit for. When you think about it there are 4 possible outcomes.
- The trivial case, the whole string is matched no boundaries (your first example)
- Cross a word boundary at the beginning (your second example)
- Cross a word boundary at the end
- Have a word boundary at each end
Now your code takes care of the trivial case so we can leverage that; all that's left is to wrap your results in a few checks for the other cases. So what should these checks look like? Let's take your failure case:
string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"
So from a string find
perspective we can find all those letters in that order in both string1
and string2
, but if we separate everything around spaces into lists, and look for the lists in order only string1
will match.
Now I'm primarily a C guy, so I want to write that in a function:
def full_string(str1, str2, chkstr):
l1 = str1.split()
l2 = str2.split()
chkl = chkstr.split()
return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))
With this function we can check if either of the two strings do not contain all of the words of our result from longest_common_substring(s1, s2)
in order. Perfect. So the last step is to combine these two functions and check for each of the 4 cases listed above:
def longest_whole_substring(s1, s2):
subs = longest_common_substring(s1, s2)
if not full_string(s1, s2, subs):
if full_string(s1, s2, ' '.join(subs.split()[1:])):
subs = ' '.join(subs.split()[1:])
elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
subs = ' '.join(subs.split()[:-1])
else:
subs = ' '.join(subs.split()[1:-1])
return subs
Now the function longest_whole_substring(s1, s2)
will provide the longest whole substring not chopping off any words. Let’s just test it out in each of the cases:
Trivial:
>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>>
>>> longest_whole_substring(a,b)
'foo bar'
Word boundary at the beginning:
>>> b = 's a foo bar'
>>>
>>> longest_whole_substring(a,b)
'a foo bar '
Word boundry at the end:
>>> b = 'foo bar f'
>>>
>>> longest_whole_substring(a,b)
'foo bar'
And a Word boundry at both ends:
>>> b = 's a foo bar f'
>>>
>>> longest_whole_substring(a,b)
'a foo bar'
Lookin' Good!
回答6:
I did it recursively:
def common_phrase(self, longer, shorter):
""" recursively find longest common substring, consists of whole words only and in the same order """
if shorter in longer:
return shorter
elif len(shorter.split()) > 1:
common_phrase_without_last_word = common_phrase(shorter.rsplit(' ', 1)[0], longer)
common_phrase_without_first_word = common_phrase(shorter.split(' ', 1)[1], longer)
without_first_is_longer = len(common_phrase_without_last_word) < len(common_phrase_without_first_word)
return ((not without_first_is_longer) * common_phrase_without_last_word +
without_first_is_longer * common_phrase_without_first_word)
else:
return ''
Just classify the two strings to 'shorter' and 'longer' before applying:
if len(str1) > len(str2):
longer, shorter = str1, str2
else:
longer, shorter = str2, str1
回答7:
Here's an ngram way:
def ngrams(text, n):
return [text[i:i+n] for i in xrange(len(text)-n)]
def longest_common_ngram(s1, s2):
s1ngrams = list(chain(*[[" ".join(j) for j in ngrams(s1.split(), i)]
for i in range(1, len(s1.split()))]))
s2ngrams = list(chain(*[[" ".join(j) for j in ngrams(s2.split(), i)]
for i in range(1, len(s2.split()))]))
return set(s1ngrams).intersection(set(s2ngrams))
回答8:
One efficient method for finding longest common substrings is a suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree and http://en.wikipedia.org/wiki/Longest_common_substring_problem). I don't see any reason you couldn't create a suffix tree using words instead of characters, in which case the longest-common-subsequence extracted from the tree would respect token boundaries. This approach would be particularly efficient if you want to find common substrings between one fixed string and a large number of other strings.
See the accepted answer to python: library for generalized suffix trees for a list of Python suffix tree implementations.
来源:https://stackoverflow.com/questions/22726177/longest-common-substring-without-cutting-a-word-python