Remove accents and keep under dots in Python

Question

I am working on an NLP task that uses a corpus in the Yoruba language. Yoruba uses diacritics (accents) and under dots in its alphabet. For instance, this is a Yoruba string: "ọmọàbúròẹlẹ́wà". I need to remove the accents and keep the under dots.

I have tried using the unidecode library in Python, but it removes both the accents and the under dots.

import unidecode
ac_stng = "ọmọàbúròẹlẹ́wà"
unac_stng = unidecode.unidecode(ac_stng)

I expect the output to be "ọmọaburoẹlẹwa". However, when I used the unidecode library in Python, I got "omoaburoelewa".

Answer 1:

I would use Unicode normalization for this.

Accented and dotted letters like these are usually stored as precomposed Unicode characters. If you decompose them, you get the base letter followed by separate combining characters for the accents and dots. You can then drop the combining characters you don't want and recompose the string into precomposed characters.

You can do this in Python with unicodedata.normalize. Specifically, you want the "NFD" normalization form (Normalization Form D, canonical decomposition), which gives you the canonical decomposition of each character. To recompose the characters afterwards, use "NFC" (Normalization Form C, canonical composition).
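As a minimal sketch of that round trip on a single letter (the variable names here are only illustrative), NFD splits a precomposed character into its parts and NFC joins them again:

```python
import unicodedata

# "o with dot below" as one precomposed code point (U+1ECD).
precomposed = "\u1ecd"

# NFD splits it into the base letter followed by the combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER O', 'COMBINING DOT BELOW']

# NFC joins the pieces back into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed, len(decomposed), len(recomposed))
# True 2 1
```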

I'll show you what I mean. First, let's look at the individual code points in the example text you provided above:

>>> from pprint import pprint
>>> import unicodedata
>>> text = 'ọmọàbúròẹlẹ́wà'
>>> pprint([unicodedata.name(c) for c in text])
['LATIN SMALL LETTER O WITH DOT BELOW',
 'LATIN SMALL LETTER M',
 'LATIN SMALL LETTER O WITH DOT BELOW',
 'LATIN SMALL LETTER A WITH GRAVE',
 'LATIN SMALL LETTER B',
 'LATIN SMALL LETTER U WITH ACUTE',
 'LATIN SMALL LETTER R',
 'LATIN SMALL LETTER O WITH GRAVE',
 'LATIN SMALL LETTER E WITH DOT BELOW',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER E WITH ACUTE',
 'COMBINING DOT BELOW',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A WITH GRAVE']

As you can see, one of the characters is already partially decomposed (the one with the separate "COMBINING DOT BELOW"). Now let's look at it fully decomposed:

>>> text = unicodedata.normalize('NFD', text)
>>> pprint([unicodedata.name(c) for c in text])
['LATIN SMALL LETTER O',
 'COMBINING DOT BELOW',
 'LATIN SMALL LETTER M',
 'LATIN SMALL LETTER O',
 'COMBINING DOT BELOW',
 'LATIN SMALL LETTER A',
 'COMBINING GRAVE ACCENT',
 'LATIN SMALL LETTER B',
 'LATIN SMALL LETTER U',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER R',
 'LATIN SMALL LETTER O',
 'COMBINING GRAVE ACCENT',
 'LATIN SMALL LETTER E',
 'COMBINING DOT BELOW',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER E',
 'COMBINING DOT BELOW',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A',
 'COMBINING GRAVE ACCENT']

Now according to your requirements, it sounds like you want to keep all Latin letters (and probably the rest of ASCII too, I'm guessing) plus the "COMBINING DOT BELOW" code point, which we can refer to using the literal '\N{COMBINING DOT BELOW}' for easier readability of your code.

Here's an example function that I think will do what you want:

import unicodedata

def remove_accents_but_not_dots(input_text):
    # Step 1: Decompose input_text into base letters and combining characters
    decomposed_text = unicodedata.normalize('NFD', input_text)

    # Step 2: Filter out the combining characters we don't want
    filtered_text = ''
    for c in decomposed_text:
        if ord(c) <= 0x7f or c == '\N{COMBINING DOT BELOW}':
            # Only keep ASCII or "COMBINING DOT BELOW"
            filtered_text += c

    # Step 3: Re-compose the string into precomposed characters
    return unicodedata.normalize('NFC', filtered_text)

(Of course, string concatenation in Python is slow, but I'll leave the optimizations to you. This example was written for readability.)
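One possible optimization, sketched here under the assumption that the filtering rule stays exactly the same, is to collect the kept characters with a generator expression and join them once. The names remove_accents_but_not_dots_fast and DOT_BELOW are made up for this sketch.

```python
import unicodedata

DOT_BELOW = "\N{COMBINING DOT BELOW}"

def remove_accents_but_not_dots_fast(input_text):
    # Same three steps as above, but the kept characters are joined once
    # instead of being appended to a string one at a time.
    decomposed = unicodedata.normalize("NFD", input_text)
    kept = (c for c in decomposed if ord(c) <= 0x7F or c == DOT_BELOW)
    return unicodedata.normalize("NFC", "".join(kept))
```

It should return the same string as the readable version for any input, since only the way the result is assembled differs.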

And here's what the result looks like:

>>> remove_accents_but_not_dots('ọmọàbúròẹlẹ́wà')
'ọmọaburoẹlẹwa'
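Note that the ASCII test also drops every other non-ASCII character, such as curly quotes. If that matters for your corpus, one variation (an assumption on my part, not something the question asks for) is to remove only combining characters, detected with unicodedata.combining, and leave everything else alone. The function name strip_marks_except_dot_below is hypothetical.

```python
import unicodedata

DOT_BELOW = "\N{COMBINING DOT BELOW}"

def strip_marks_except_dot_below(input_text):
    decomposed = unicodedata.normalize("NFD", input_text)
    # unicodedata.combining() returns the canonical combining class, which
    # is 0 for ordinary characters and non-zero for combining accents.
    kept = "".join(
        c for c in decomposed
        if c == DOT_BELOW or not unicodedata.combining(c)
    )
    return unicodedata.normalize("NFC", kept)

# Curly quotes survive, the grave accent is removed, the dot below is kept.
quoted = "\u201c\u1ecd\u0300\u201d"
print(strip_marks_except_dot_below(quoted) == "\u201c\u1ecd\u201d")  # True
```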

Answer 2:

Since you want a specific kind of accent handling, it will likely be easiest to write the parser yourself. Essentially, you can get the Unicode code point of every letter in a string with ord() and check it against a list of code points for the accented letters you want to replace. The way I see it, there are two steps:

The first is to deal with characters that have only diacritic marks and no dots. From my admittedly cursory research into this language, a given vowel seems to have three possible diacritic marks: acute, grave, and macron. For each vowel, you can then create an array of the Unicode code points of each diacritic variant. For the letter "a", you'd have the following:

a_diacritics = [224, 225, 257] # Unicode values for à, á, and ā

Then you could compare the unicode values of each letter in your input to that array, and if it is a match, swap it with a normal "a":

input_string = "ọmọàbúròẹlẹ́wà"
output = ""
for letter in input_string:
    if ord(letter) in a_diacritics:
        output += 'a'
    else:
        output += letter

After running that bit of code, the variable output would equal "ọmọabúròẹlẹ́wa". You'd then write similar arrays and parsing logic with the unicode values for the other vowels.
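As a rough sketch of how that could look for all five vowels at once, the arrays can be folded into one lookup table. The names VOWEL_VARIANTS and PLAIN_FOR are made up for this example, and only lowercase letters are covered.

```python
# Each plain vowel, followed by its precomposed grave, acute, and macron
# variants (lowercase only; uppercase would need its own entries).
VOWEL_VARIANTS = {
    "a": "\u00e0\u00e1\u0101",  # à á ā
    "e": "\u00e8\u00e9\u0113",  # è é ē
    "i": "\u00ec\u00ed\u012b",  # ì í ī
    "o": "\u00f2\u00f3\u014d",  # ò ó ō
    "u": "\u00f9\u00fa\u016b",  # ù ú ū
}

# Invert the table so each accented letter points to its plain vowel.
PLAIN_FOR = {
    variant: plain
    for plain, variants in VOWEL_VARIANTS.items()
    for variant in variants
}

# The example string, written with escapes so the code points are explicit.
input_string = "\u1ecdm\u1ecd\u00e0b\u00far\u00f2\u1eb9l\u00e9\u0323w\u00e0"
output = "".join(PLAIN_FOR.get(letter, letter) for letter in input_string)
```

Letters that are not in the table, including the dotted vowels and the separate 'combining dot below' character, pass through unchanged.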

The second part is the characters with both diacritics and dots. Letters like "ẹ́" are usually two separate characters. In the case of "ẹ́", it's "é" and the 'combining dot below' character; in the case of the visually identical "ẹ́", however, it's "ẹ" and the 'combining acute accent' character. For the letters with the added dot character, the previous step with the arrays takes care of them. For the added diacritic characters, you can have one final array for their Unicode values:

diacritic_marks = [769, 768, 772] # Unicode values for acute, grave, and macron diacritics

Then have the parsing loop ignore these characters:

output = ""  # reset so the result of the first loop is not duplicated
for letter in input_string:
    if ord(letter) in a_diacritics:
        output += 'a'
    elif ord(letter) in diacritic_marks:
        pass
    else:
        output += letter
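One thing to watch for with this approach: the two encodings of the same visible letter come out as different code point sequences, so they will not compare equal unless you normalize afterwards. The sketch below shows this for "e" only; strip_e_tones is a hypothetical helper, and the final NFC step borrows unicodedata.normalize as an optional extra.

```python
import unicodedata

E_DIACRITICS = [232, 233, 275]     # è, é, ē
DIACRITIC_MARKS = [768, 769, 772]  # combining grave, acute, macron

def strip_e_tones(text):
    out = []
    for letter in text:
        if ord(letter) in E_DIACRITICS:
            out.append("e")
        elif ord(letter) in DIACRITIC_MARKS:
            continue  # drop the separate accent character
        else:
            out.append(letter)
    return "".join(out)

form_1 = "\u00e9\u0323"  # é followed by COMBINING DOT BELOW
form_2 = "\u1eb9\u0301"  # ẹ followed by COMBINING ACUTE ACCENT

result_1 = strip_e_tones(form_1)  # "e" + COMBINING DOT BELOW
result_2 = strip_e_tones(form_2)  # precomposed "ẹ"
print(result_1 == result_2)                                # False
print(unicodedata.normalize("NFC", result_1) == result_2)  # True
```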

Source: https://stackoverflow.com/questions/57453751/remove-accents-and-keep-under-dots-in-python

Tags

python

python-3.x

string

nlp

python-unicode