Regex for accent-insensitive replacement in Python

灰色年华 2021-01-12 15:07

In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substituti

2 answers
  •  挽巷 (original poster)
    2021-01-12 15:49

    unidecode is often mentioned for removing accents in Python, but it does more than that: for example, it converts '°' to 'deg', which might not be the desired output.

    unicodedata seems to have enough functionality to remove accents.
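
To make that concrete, here is a small sketch of what unicodedata exposes. The helper name describe_decomposition is made up for this illustration; normalize() and category() are the actual standard-library calls. NFD splits a precomposed accented letter into a base letter plus a combining mark of category 'Mn', which is what an accent-stripping filter can drop.

```python
import unicodedata

def describe_decomposition(char):
    """Return (character, Unicode category) pairs for the NFD form of `char`.

    Hypothetical helper, only meant to show what NFD produces.
    """
    return [(c, unicodedata.category(c)) for c in unicodedata.normalize('NFD', char)]

# U+00E9 (e with acute) decomposes into 'e' plus U+0301 COMBINING ACUTE ACCENT,
# whose category is 'Mn' (nonspacing mark).
print(describe_decomposition('\u00e9'))

# '°' and 'ß' have no canonical decomposition, so a filter on 'Mn'
# leaves them untouched.
print(describe_decomposition('°'))
print(describe_decomposition('ß'))
```

Filtering on 'Mn' therefore only removes the marks themselves and does not transliterate symbols such as '°'.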

    With any pattern

    This method should work with any pattern and any text.

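    You can temporarily remove the accents from both the text and the regex pattern. The match information from re.finditer() (start and end indices) can then be used to modify the original, accented text. This relies on the accent-free text having the same length as the original, which holds when every accented letter is a single precomposed character (NFC form).

    Note that the matches must be processed in reverse order, so that replacing one match does not shift the indices of the matches that come before it.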

    import re
    import unicodedata
    
    original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
    
    accented_pattern = r'a café|François Déporte'
    
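    def remove_accents(s):
        # NFD splits each accented letter into a base letter plus combining marks;
        # dropping category 'Mn' (nonspacing mark) keeps only the base letters.
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')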
    
    print(remove_accents('äöüßéèiìììíàáç'))
    # aoußeeiiiiiaac
    
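    pattern = re.compile(remove_accents(accented_pattern))
    
    modified_text = original_text
    # Match on the accent-free text; the indices are valid for original_text
    # as long as both strings have the same length.
    matches = list(pattern.finditer(remove_accents(original_text)))
    
    # Replace from the last match to the first so earlier indices stay valid.
    for match in reversed(matches):
        modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]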
    
    print(modified_text)
    # I'm drinking a 80° café in X with Chloë, X and X.
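
If the input may contain decomposed characters (for example 'e' followed by U+0301), the accent-free text is shorter than the original and the match indices no longer line up. One possible way around that, sketched here under the assumption that the replacement is a literal string, is to record which original position each kept character came from. The names strip_accents_with_map and accent_insensitive_sub are invented for this sketch and are not part of re or unicodedata.

```python
import re
import unicodedata

def strip_accents_with_map(text):
    """Return (stripped, index_map).

    index_map[i] is the position in `text` of the character that produced
    stripped[i]; a final sentinel maps the end of `stripped` to len(text).
    """
    stripped = []
    index_map = []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize('NFD', ch):
            if unicodedata.category(d) != 'Mn':
                stripped.append(d)
                index_map.append(i)
    index_map.append(len(text))
    return ''.join(stripped), index_map

def accent_insensitive_sub(pattern, repl, text, flags=0):
    """Replace accent-insensitive matches of `pattern` in `text` with the
    literal string `repl` (no backreference expansion in this sketch)."""
    stripped_text, index_map = strip_accents_with_map(text)
    stripped_pattern, _ = strip_accents_with_map(pattern)
    pieces = []
    last = 0
    for m in re.finditer(stripped_pattern, stripped_text, flags):
        # Translate positions in the stripped text back to the original text.
        start, end = index_map[m.start()], index_map[m.end()]
        pieces.append(text[last:start])
        pieces.append(repl)
        last = end
    pieces.append(text[last:])
    return ''.join(pieces)

text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
print(accent_insensitive_sub(r'a café|François Déporte', 'X', text))
# Flags such as re.I can be combined with the accent-insensitive match.
print(accent_insensitive_sub('CAFÉ', 'X', 'Un Café', flags=re.I))
```

Building the result from left to right avoids the need to reverse the matches, because every slice is taken from the unmodified original text.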
    

    If the pattern is a word or a set of words

    You could:

    • remove the accents from your pattern words and store them in a set for fast lookup
    • find every word in your text with \w+
    • remove the accents from each word and look it up in the set:
      • if it is in the set, replace the word with X
      • if it is not, leave the word untouched

    import re
    from unidecode import unidecode
    
    original_text = "I'm drinking a café in a cafe with Chloë."
    
    def remove_accents(string):
        return unidecode(string)
    
    accented_words = ['café', 'français']
    
    words_to_remove = set(remove_accents(word) for word in accented_words)
    
    def remove_words(matchobj):
        word = matchobj.group(0)
        if remove_accents(word) in words_to_remove:
            return 'X'
        else:
            return word
    
    print(re.sub(r'\w+', remove_words, original_text))
    # I'm drinking a X in a X with Chloë.
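
The same word-based idea can also be written without the third-party unidecode package. The following is a sketch using only unicodedata; replace_words is a name chosen for this example. It normalizes the text to NFC first, on the assumption that returning NFC text is acceptable, so that \w+ sees each accented letter as a single character.

```python
import re
import unicodedata

def remove_accents(s):
    # Decompose, then drop combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def replace_words(text, accented_words, replacement='X'):
    """Replace every word of `text` whose accent-free form is in
    `accented_words` (compared accent-free) with `replacement`."""
    targets = {remove_accents(word) for word in accented_words}

    def substitute(match):
        word = match.group(0)
        return replacement if remove_accents(word) in targets else word

    # NFC first: a bare combining mark is not matched by \w, so decomposed
    # input would otherwise be split in the middle of a word.
    return re.sub(r'\w+', substitute, unicodedata.normalize('NFC', text))

print(replace_words("I'm drinking a café in a cafe with Chloë.", ['café', 'français']))
```

Unlike unidecode, this version only strips combining marks, so characters such as 'ß' or '°' are compared as they are.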
    
