Question
Background
The Python module regex allows fuzzy matching.
You can specify the allowable number of substitutions (s), insertions (i), deletions (d), and total errors (e).
The fuzzy_counts attribute of a match object is a 3-tuple, for example (0, 0, 0), where:
match.fuzzy_counts[0] = count for 's'
match.fuzzy_counts[1] = count for 'i'
match.fuzzy_counts[2] = count for 'd'
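As a minimal sketch of reading those counts, the snippet below unpacks fuzzy_counts into named values. The pattern, the input string, and the helper name count_errors are made up for illustration. The import is guarded only so the snippet still loads where the third-party regex package is not installed.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:  # keep the sketch importable without it
    regex = None


def count_errors(pattern, text):
    """Return the fuzzy error counts as a dict, or None if nothing matched.

    count_errors is a hypothetical helper, not part of the regex module.
    """
    if regex is None:
        return None
    match = regex.fullmatch(pattern, text, flags=regex.BESTMATCH)
    if match is None:
        return None
    substitutions, insertions, deletions = match.fuzzy_counts
    return {"s": substitutions, "i": insertions, "d": deletions}


# "helo" is "hello" with one character missing from the text,
# so a single deletion is the expected result.
print(count_errors(r"(?:hello){s<=1,i<=1,d<=1,e<=1}", "helo"))
```

Using fullmatch here forces the whole string to be aligned with the pattern, which keeps the example unambiguous.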
Problem
The deletions and insertions are counted as expected, but not the substitutions.
In the example below, the only change is a single character deleted in the query, yet the substitutions count is 6 (7 if the BESTMATCH option is removed).
How are the substitutions counted?
I would be grateful if someone could explain how this works.
>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(6, 0, 1)
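To check the expectation independently of the regex module, here is a minimal sketch of a plain edit-distance alignment between the pattern and the query. It is not how regex computes its counts. It only finds one cheapest alignment, it requires the whole query to be consumed, and it supports just literal characters and [..] classes. The names parse_pattern and min_edit_counts are hypothetical.

```python
def parse_pattern(pattern):
    """Split a pattern of literals and [..] classes into a list of character sets."""
    items, pos = [], 0
    while pos < len(pattern):
        if pattern[pos] == "[":
            end = pattern.index("]", pos)
            items.append(set(pattern[pos + 1:end]))
            pos = end + 1
        else:
            items.append({pattern[pos]})
            pos += 1
    return items


def min_edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) for one cheapest alignment.

    Following the regex module's convention, a deletion is a pattern
    character missing from the text, and an insertion is an extra text character.
    """
    items = parse_pattern(pattern)
    # best[j] = (total_cost, s, i, d) for aligning the items seen so far with text[:j]
    best = [(j, 0, j, 0) for j in range(len(text) + 1)]
    for row, allowed in enumerate(items, start=1):
        new = [(row, 0, 0, row)]  # empty text: every pattern item is deleted
        for j, ch in enumerate(text, start=1):
            cost, s, i, d = best[j - 1]
            diag = (cost, s, i, d) if ch in allowed else (cost + 1, s + 1, i, d)
            cost, s, i, d = best[j]
            deletion = (cost + 1, s, i, d + 1)
            cost, s, i, d = new[j - 1]
            insertion = (cost + 1, s, i + 1, d)
            new.append(min(diag, deletion, insertion))
        best = new
    return best[-1][1:]


reference = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(min_edit_counts(reference, query))  # (0, 0, 1)
```

The pattern has 27 positions and the query has 26 characters, so the cheapest alignment is a single deletion with no substitutions, which is the result the question expects.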
Answer 1:
This looks like a bug in the regex module's cost calculation. It was still present in regex version 2015.10.05 and was fixed in the next release, 2015.10.22, as shown below:
$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Found existing installation: regex 2015.10.5
Uninstalling regex-2015.10.5:
Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)
Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, whose description reads:
Hg issue 161: Unexpected fuzzy match results
Fixed the bug and did some related tidying up.
The referenced bug is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.
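If you want to check whether an installed copy is new enough, a small sketch along these lines may help. It assumes Python 3.8+ (for importlib.metadata) and date-style version strings with numeric components such as 2015.10.22. The helper name has_fuzzy_count_fix and the constant FIXED_IN are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (2015, 10, 22)  # first release reported above to return (0, 0, 1)


def has_fuzzy_count_fix(version_string):
    """Return True if a date-style regex version is at or after FIXED_IN."""
    parts = tuple(int(part) for part in version_string.split(".")[:3])
    return parts >= FIXED_IN


try:
    installed = version("regex")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("regex is not installed")
else:
    print(installed, "has fix:", has_fuzzy_count_fix(installed))
```

Comparing integer tuples rather than raw strings means that 2015.10.05 and pip's normalized 2015.10.5 are treated as the same version.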
Answer 2:
The issue seems to be related to the values of the allowed-error limits.
Reducing the substitution limit to s<3 (with e<4) lowers the reported substitution count:
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(1, 0, 1)
Reducing the allowed error for 's' even further, to s<2, returns the expected 's' count for this match:
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(0, 0, 1)
Why it behaves in this way is still a mystery to me.
Source: https://stackoverflow.com/questions/31193749/python-regex-module-fuzzy-match-substitution-count-not-as-expected