Question
I'm using difflib's SequenceMatcher to get_opcodes() and then highlight the changes with CSS to create some kind of web diff.
First, I set a min_delta so that I consider two strings different only if 3 or more characters in the whole string differ, counted one after another (delta means the real, encountered delta, which sums up all one-character changes):
from difflib import SequenceMatcher

# The first positional argument of SequenceMatcher is isjunk, so pass None explicitly.
matcher = SequenceMatcher(None, source_str, diff_str)
min_delta = 3
delta = 0
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        continue  # nothing to capture here
    elif tag == "delete":
        if source_str[i1:i2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # delete i2-i1 chars
    elif tag == "replace":
        if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # replace i2-i1 chars
    elif tag == "insert":
        if diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (j2 - j1)  # insert j2-j1 chars
return_value = delta > min_delta
This helps me determine whether two strings really differ. It is not very efficient, but I couldn't think of anything better.
Then, I colorize the differences between two strings in the same way:
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        pass  # bustling with strings, inserting them in <span>s and colorizing
    elif tag == "delete":
        pass  # ...
return_value = old_string, new_string
And the result looks pretty ugly (blue for replaced, green for new and red for deleted, nothing for equal):
So, this is happening because SequenceMatcher matches every single character. But I want it to match every single word instead (and probably the whitespace around each word), or something even more pleasing to the eye because, as you can see in the screenshot, the first book has actually moved to the fourth position.
It seems to me that something could be done with the isjunk and autojunk parameters of SequenceMatcher, but I can't figure out how to write lambdas for my purposes.
Thus, I have two questions:
- Is it possible to match by words? Is it possible to do it using get_opcodes() and SequenceMatcher? If not, what could be used instead?
- Okay, this is rather a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty hacks with min_delta and return True as soon as at least one word differs, right?
Answer 1:
SequenceMatcher can accept lists of str as input.
You can first split the input into words, and then use SequenceMatcher to diff the words. Your colored diff would then be by words instead of by characters.
>>> from difflib import SequenceMatcher
>>> def my_get_opcodes(a, b):
...     s = SequenceMatcher(None, a, b)
...     for tag, i1, i2, j1, j2 in s.get_opcodes():
...         print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...             tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
...
>>> my_get_opcodes("qabxcd", "abycdf")
delete  a[0:1] --> b[0:0]      'q' --> ''
equal   a[1:3] --> b[0:2]     'ab' --> 'ab'
replace a[3:4] --> b[2:3]      'x' --> 'y'
equal   a[4:6] --> b[3:5]     'cd' --> 'cd'
insert  a[6:6] --> b[5:6]       '' --> 'f'
# This is the bad result you currently have.
>>> my_get_opcodes("one two three\n", "ore tree emu\n")
equal   a[0:1] --> b[0:1]      'o' --> 'o'
replace a[1:2] --> b[1:2]      'n' --> 'r'
equal   a[2:5] --> b[2:5]    'e t' --> 'e t'
delete  a[5:10] --> b[5:5]  'wo th' --> ''
equal   a[10:13] --> b[5:8]    'ree' --> 'ree'
insert  a[13:13] --> b[8:12]       '' --> ' emu'
equal   a[13:14] --> b[12:13]     '\n' --> '\n'
>>> my_get_opcodes("one two three\n".split(), "ore tree emu\n".split())
replace a[0:3] --> b[0:3] ['one', 'two', 'three'] --> ['ore', 'tree', 'emu']
# This may be the result you want.
>>> my_get_opcodes("one two emily three ha\n".split(), "ore tree emily emu haha\n".split())
replace a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal   a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace a[3:5] --> b[3:5] ['three', 'ha'] --> ['emu', 'haha']
# A more complicated example exhibiting all four kinds of opcodes.
>>> my_get_opcodes("one two emily three yo right end\n".split(), "ore tree emily emu haha yo yes right\n".split())
replace a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal   a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace a[3:4] --> b[3:5] ['three'] --> ['emu', 'haha']
equal   a[4:5] --> b[5:6]   ['yo'] --> ['yo']
insert  a[5:5] --> b[6:7]       [] --> ['yes']
equal   a[5:6] --> b[7:8] ['right'] --> ['right']
delete  a[6:7] --> b[8:8]  ['end'] --> []
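To connect the word-level opcodes back to the colored web diff from the question, here is a minimal sketch. The helper names (highlight_by_words, words_differ) and the CSS class names are hypothetical, not part of difflib, and the sketch assumes that collapsing all whitespace runs to a single space is acceptable. words_differ covers the second question: once the input is split into words, any non-"equal" opcode means at least one word differs, so min_delta is no longer needed.

```python
from difflib import SequenceMatcher
from html import escape

# Hypothetical CSS class names; the question only mentions blue/green/red.
CSS_CLASS = {
    "replace": "diff-replace",
    "delete": "diff-delete",
    "insert": "diff-insert",
}


def _wrap(words, tag):
    """Join words into escaped HTML, wrapped in a <span> unless unchanged."""
    text = escape(" ".join(words))
    if tag == "equal" or not words:
        return text
    return '<span class="{}">{}</span>'.format(CSS_CLASS[tag], text)


def highlight_by_words(old, new):
    """Return (old_html, new_html) with word-level changes wrapped in spans."""
    a, b = old.split(), new.split()
    old_parts, new_parts = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        old_parts.append(_wrap(a[i1:i2], tag))
        new_parts.append(_wrap(b[j1:j2], tag))
    # Pure inserts/deletes leave an empty piece on one side; drop those.
    return (" ".join(p for p in old_parts if p),
            " ".join(p for p in new_parts if p))


def words_differ(old, new):
    """True if any opcode is not 'equal'; split() already ignores whitespace."""
    matcher = SequenceMatcher(None, old.split(), new.split())
    return any(tag != "equal" for tag, *_ in matcher.get_opcodes())


if __name__ == "__main__":
    old_html, new_html = highlight_by_words("one two emily", "one tree emily")
    print(old_html)
    print(new_html)
```

Splitting with split() throws away the original spacing and line breaks. If those need to survive in the rendered output, one option is to tokenize with something like re.findall(r"\s+|\S+", text) and skip the whitespace-only tokens when coloring.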
You can also diff by line, by book, or by segments. You only need to prepare a function that can preprocess the whole passage string into desired diff chunks.
For example:
- To diff by line: you could probably use splitlines().
- To diff by book: you could probably implement a function that strips off the 1., 2. numbering.
- To diff by segments: you could pass the API something like ([book_1, author_1, year_1, book_2, author_2, ...], [book_1, author_1, year_1, book_2, author_2, ...]). Your coloring would then be by segment.
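The preprocessing idea above can be sketched as a small set of "chunker" functions that are passed to one shared diff helper. The names by_line, by_book and diff_chunks are hypothetical, the sample book titles are made up, and by_book assumes each book sits on its own line with a leading "1.", "2." style number.

```python
import re
from difflib import SequenceMatcher


def by_line(text):
    """Chunk a passage into lines."""
    return text.splitlines()


def by_book(text):
    """Chunk a numbered list into entries, stripping the leading '1.', '2.', ...

    Assumes one book per line; blank lines are skipped.
    """
    return [re.sub(r"^\s*\d+\.\s*", "", line).strip()
            for line in text.splitlines() if line.strip()]


def diff_chunks(old, new, chunker):
    """Return (tag, old_chunks, new_chunks) triples for the chosen chunking."""
    a, b = chunker(old), chunker(new)
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


if __name__ == "__main__":
    old = "1. Dune\n2. Emma\n3. Ulysses\n"
    new = "1. Emma\n2. Ulysses\n3. Dune\n"
    for op in diff_chunks(old, new, by_book):
        print(op)
```

SequenceMatcher has no notion of a "move", so a book that changes position shows up as a delete in one place and an insert in another. With the numbering stripped, though, the unchanged books still match as equal instead of being colored as edits.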
Source: https://stackoverflow.com/questions/39001097/match-changes-by-words-not-by-characters