Error: nothing to repeat at position

暗喜 2021-01-23 10:10

I have a text file which contains a list of slang words and their substitutes in real English. I converted this text file into a dictionary using ":" as a split point, and upo

1 answer
  • 2021-01-23 10:36

    I suggest replacing

    slangs_re = re.compile('|'.join(slang_dict.keys()))
    

    with

    slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))
    

    and make sure the keys are sorted by length in descending order, so that longer phrases are tried before shorter ones they start with.

    from collections import OrderedDict
    import re
    
    test = "fitess no kome*"
    
    slang_dict = {"Aha aha":"no", "fitess":"fitness", "damm":"damn", "kome*":"come", "ow wow":"rrf"}
    # Longest keys first; .items() replaces the Python 2-only .iteritems()
    slang_dict = OrderedDict(sorted(slang_dict.items(), key=lambda x: len(x[0]), reverse=True))
    
    # Escape every key, then wrap the alternation in whole-word lookarounds
    slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))
    def correct_slang(s, slang_dict=slang_dict):
        def replace(match):
            # The matched text is exactly one of the dictionary keys
            return slang_dict[match.group(0)]
    
        return slangs_re.sub(replace, s)
    
    test = correct_slang(test)
    print(test)  # fitness no come
    

    See the Python demo

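    This matches the terms as whole words and escapes the special characters in each search phrase, so that characters such as * are treated literally by the regular expression engine. An unescaped key that begins with a quantifier such as * is a typical cause of the "nothing to repeat" error.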
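
    To illustrate the escaping point, here is a minimal sketch. The keys "*lol" and "kome*" are hypothetical stand-ins for entries in the slang file (only "kome*" appears above); the names keys, raised, message and found are made up for this example.

```python
import re

# Hypothetical keys: one starts with a quantifier, one ends with one.
keys = ["*lol", "kome*"]

# Joining the raw keys yields the pattern "*lol|kome*", where the leading
# "*" has nothing before it to repeat, so compilation fails.
try:
    re.compile("|".join(keys))
    raised = False
    message = ""
except re.error as exc:
    raised = True
    message = str(exc)

# re.escape turns every special character into a literal one.
escaped = [re.escape(k) for k in keys]
pattern = re.compile("|".join(escaped))
found = pattern.findall("kome* *lol")

print(raised, message)
print(found)
```

    Note that an unescaped "kome*" compiles without error on its own, but it then means "kom" followed by any number of "e" rather than the literal text, so escaping is needed in both cases.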

    If you are not interested in whole word matching, remove (?<!\w) (checking for the leading word boundary) and (?!\w) (checking for the trailing word boundary).
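
    The effect of key order and of the lookarounds can be compared side by side. This is only a sketch: the entry "ow": "oh" and the helper names build and fix are assumptions added for illustration, not part of the original dictionary or code.

```python
import re

# Hypothetical dictionary: "ow" is a prefix of the longer key "ow wow".
slang = {"ow": "oh", "ow wow": "rrf"}

def build(keys, whole_words=True):
    """Compile an alternation of escaped keys, optionally as whole words."""
    body = "|".join(re.escape(k) for k in keys)
    if whole_words:
        return re.compile(r"(?<!\w)(?:{})(?!\w)".format(body))
    return re.compile(body)

def fix(pattern, text):
    """Replace every match with its dictionary value."""
    return pattern.sub(lambda m: slang[m.group(0)], text)

text = "ow wow, how"
longest_first = sorted(slang, key=len, reverse=True)

# Shorter key listed first: "ow" wins and "ow wow" is never tried.
short_first_result = fix(build(["ow", "ow wow"]), text)

# Longest key first: the full phrase is replaced.
long_first_result = fix(build(longest_first), text)

# Without the lookarounds, the "ow" inside "how" is replaced too.
no_boundary_result = fix(build(longest_first, whole_words=False), text)

print(short_first_result)
print(long_first_result)
print(no_boundary_result)
```

    The regex engine takes the first alternative that matches at a given position, which is why the order of the keys decides the result even when the lookarounds are kept.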
