Error: nothing to repeat at position


I have a text file which contains a list of slang words and their substitutes in real English. I converted this text file into a dictionary using ":" as a split point, and upo


1 Answer

梦如初夏

2021-01-23 10:36

I suggest replacing

slangs_re = re.compile('|'.join(slang_dict.keys()))

with

slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))

and make sure the keys are sorted by length in descending order, so that a longer phrase is tried before any shorter phrase that is a prefix of it.

from collections import OrderedDict
import re

test = "fitess no kome*"

slang_dict = {"Aha aha":"no", "fitess":"fitness", "damm":"damn", "kome*":"come", "ow wow":"rrf"}
# Longest keys first; .items() replaces the Python 2-only .iteritems()
slang_dict = OrderedDict(sorted(slang_dict.items(), key=lambda x: len(x[0]), reverse=True))

slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))
def correct_slang(s, slang_dict=slang_dict):
    def replace(match):
        return slang_dict[match.group(0)]

    return slangs_re.sub(replace, s)

test = correct_slang(test)
print(test)
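
Why the length ordering matters can be seen in a small sketch. The keys and the build helper here are hypothetical and not part of the original dictionary; they only illustrate that the regex alternation tries its branches left to right and stops at the first one that succeeds.

```python
import re

# Hypothetical keys: "ow" is a prefix of the longer phrase "ow wow".
unsorted_keys = ["ow", "ow wow"]
sorted_keys = sorted(unsorted_keys, key=len, reverse=True)

def build(keys):
    # Same pattern shape as in the answer: escaped keys between lookarounds.
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))

text = "ow wow that hurt"

# Shorter key listed first: it wins and the longer phrase is never tried.
first_unsorted = build(unsorted_keys).search(text).group(0)

# Longest key listed first: the full phrase is matched.
first_sorted = build(sorted_keys).search(text).group(0)

print(repr(first_unsorted), repr(first_sorted))
```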


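This matches the terms only as whole words and escapes the special characters in each search phrase with re.escape, so that characters such as * are treated literally instead of as regex operators. An unescaped quantifier with nothing in front of it is what makes the regular expression engine raise "nothing to repeat".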
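
As a rough illustration of the escaping point, the sketch below compiles the same keys with and without re.escape. The key "*grin*" is a made-up example, chosen because it starts with a quantifier; the actual keys in the asker's file are not known.

```python
import re

# Hypothetical slang keys; "*grin*" begins with a regex quantifier.
keys = ["*grin*", "kome*"]

# Joining the raw keys gives the pattern "*grin*|kome*", which is invalid
# because the leading "*" has nothing to repeat.
try:
    re.compile("|".join(keys))
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# Escaping each key turns the metacharacters into literals.
escaped = re.compile("|".join(re.escape(k) for k in keys))
matches = escaped.findall("kome* here *grin*")

print(raw_error)
print(matches)
```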

If you do not need whole-word matching, remove (?<!\w) (which requires that no word character precedes the match) and (?!\w) (which requires that no word character follows it).
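
The difference between the two variants can be sketched as follows. The one-entry dictionary and the sample sentence are assumptions made for the illustration, not data from the question.

```python
import re

# Hypothetical one-entry dictionary for the comparison.
slang = {"damm": "damn"}
alternation = "|".join(re.escape(k) for k in slang)

# Variant from the answer: the key must not touch other word characters.
whole_word = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))

# Variant without the lookarounds: the key may match inside a longer word.
anywhere = re.compile(alternation)

def replace(match):
    return slang[match.group(0)]

text = "damm, goddammit"
whole = whole_word.sub(replace, text)  # only the standalone "damm" changes
loose = anywhere.sub(replace, text)    # the "damm" inside "goddammit" changes too

print(whole)
print(loose)
```

Which variant is appropriate depends on whether replacements inside longer words are wanted; for a slang dictionary, the whole-word form is usually the safer default.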
