Question
I'd like to record the locations of the differences between two strings in a list (so I can remove them), preferably recording the highest separation point for each section, as these areas will have dynamic content.
Compare these
total chars 178. Two unique sections
t1 = 'WhereTisthetotalnumberofght5y5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although'
and
total chars 211. Two unique sections
t2 = 'WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although'
I know difflib can do this, but the output is bad.
I'd like to store the character positions in a list, preferably the larger separation values.
pattern location
t1 = 'WhereTisthetotalnumberof 24 ght5y5wsjhhhhjhkmhm 43 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofap 151 xxxxxxx 158 proximation,although'
t2 = 'WhereTisthetotalnumberof 24 dofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs 76 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentre 155 xxxxxxx 162 sultsduetodifferinglevelsofapproximation,although'
output:
output list = [24, 76, 151, 162]
Update
Response to @Olivier's post
Positions of all y's, marked with ***
t1
WhereTisthetotalnumberofght5***y***5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although
t2
WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssugu***y***gui***y***gis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although
Output after matcher.get_matching_blocks() and string = ''.join([t1[a:a+n] for a, _, n in blocks]):
WhereTisthetotalnumberof***y*** Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapproximation,although
Answer 1:
Using difflib is probably your best bet, as you are unlikely to come up with a more efficient solution than the algorithms it provides. What you want is SequenceMatcher.get_matching_blocks. Here is what it returns according to the documentation:
Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.
Here is a way you could use this to rebuild the string with the differing parts removed.
from difflib import SequenceMatcher
x = "abc_def"
y = "abc--ef"
matcher = SequenceMatcher(None, x, y)
blocks = matcher.get_matching_blocks()
# blocks: [Match(a=0, b=0, size=3), Match(a=5, b=5, size=2), Match(a=7, b=7, size=0)]
string = ''.join([x[a:a+n] for a, _, n in blocks])
# string: "abcef"
Edit: It was also pointed out that a problem arises with two strings like these.
t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'
Then the above code would return 'WordWordyWordWord'. This is because get_matching_blocks will catch the 'y' that is present in both strings between the expected blocks. A way around this is to filter the returned blocks by length.
string = ''.join([x[a:a+n] for a, _, n in blocks if n > 1])
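To make the effect of the length filter concrete, here is a small runnable sketch using the two example strings above. The names unfiltered and filtered are mine. Passing autojunk=False is an assumption on my part: it makes no difference for strings this short, but SequenceMatcher's automatic junk heuristic applies once the second sequence is 200 items or longer, which t2 in the question is.

```python
from difflib import SequenceMatcher

t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'

# autojunk=False is an assumption: it disables the "popular element"
# heuristic, which only matters for long second sequences (>= 200 items).
matcher = SequenceMatcher(None, t1, t2, autojunk=False)
blocks = matcher.get_matching_blocks()

# Keep every matching block, including the stray one-character 'y'.
unfiltered = ''.join(t1[a:a + n] for a, _, n in blocks)

# Keep only blocks longer than one character.
filtered = ''.join(t1[a:a + n] for a, _, n in blocks if n > 1)

print(unfiltered)  # WordWordyWordWord
print(filtered)    # WordWordWordWord
```

The threshold of 1 is only enough for this toy case. Longer random sections can share longer accidental runs, so the cutoff would need to be chosen for your data.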
If you want more complex analysis of the returned blocks you could also do the following.
def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    ...
string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
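Since the question asks for positions rather than the cleaned-up string, the same matching blocks can also be turned into a list of differing ranges. The following is only a sketch of one way to do that. diff_spans and min_match are hypothetical names, not part of difflib, and autojunk=False is again my assumption. Matching blocks shorter than min_match are treated as accidental and absorbed into the surrounding difference.

```python
from difflib import SequenceMatcher


def diff_spans(a, b, min_match=2):
    """Return (a_start, a_end, b_start, b_end) for each differing region.

    Hypothetical helper: matching blocks shorter than min_match are
    ignored, so they become part of the surrounding difference.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    a_pos = b_pos = 0  # end of the last accepted matching block
    for i, j, n in matcher.get_matching_blocks():
        # The final block is always the sentinel (len(a), len(b), 0);
        # it must be kept so a trailing difference is still reported.
        is_sentinel = i == len(a) and j == len(b)
        if n < min_match and not is_sentinel:
            continue
        # Anything between the previous match and this one is a difference.
        if i > a_pos or j > b_pos:
            spans.append((a_pos, i, b_pos, j))
        a_pos, b_pos = i + n, j + n
    return spans


print(diff_spans('WordWordaayaaWordWord', 'WordWordbbbybWordWord'))
# [(8, 13, 8, 13)]
```

Each tuple gives the start and end of a differing section in both strings, so it can be flattened into a plain list of positions if that is what you need. I have not tuned min_match for the long strings in the question; they would probably need a larger value.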
Source: https://stackoverflow.com/questions/48737881/python-3-differences-between-two-strings