Python: Comparing strings with accented characters does not work

庸人自扰 2021-01-07 05:27

I'm quite new to Python. I am trying to remove files that appear on one list from another list. The lists were produced by redirecting ll -R on mac and on windows (but have

2 answers
  • 2021-01-07 05:31

    You have a few problems with your program:

    Your program will raise an AttributeError and consequently pass on every iteration of the loop. Neither word1 nor word2 has a method called .decode(). In Python 3, you can encode a string into a sequence of bytes, or you can decode a sequence of bytes into a string.

    The use of codecs is a red herring. Both of your input files are UTF-8 encoded. The bytes from the file are successfully decoded when you read them from the file.

    Your strings are similar in appearance, but are composed of different unicode code points. Specifically, "Adhésion" includes the two unicode code points 0065 and 0301, "LATIN SMALL LETTER E" and "COMBINING ACUTE ACCENT". On the other hand, the 2nd word, "Adhésion" contains the single code point 00E9, "LATIN SMALL LETTER E WITH ACUTE". As Daniel points out in his answer, you can check for the semantic equivalence of these distinct strings by normalizing them first.
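
The difference between the two spellings can be made visible with a short sketch. The variable names and the code_points helper are illustrative and not part of the original program; the strings are written with escape sequences so the code points stay explicit.

```python
import unicodedata

# Two spellings of the same word, written with escapes so the difference is visible.
decomposed = "Adhe\u0301sion"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Adh\u00e9sion"      # single code point U+00E9


def code_points(text):
    """Return (hex code point, Unicode name) pairs for each character (illustrative helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(len(decomposed), len(composed))   # 9 8
print(decomposed == composed)           # False
print(code_points(decomposed)[3:5])     # the 'e' and the combining accent
print(code_points(composed)[3:4])       # the precomposed letter

# After NFC normalization both spellings collapse to the composed form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```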

    Here is how I would solve your problems:

    #!/usr/bin/python3
    
    import unicodedata
    
    # Both input files are UTF-8; state the encoding explicitly so the
    # result does not depend on the platform default.
    with open('testmissingfiles', 'r', encoding='utf-8') as fp:
        list1 = [line.strip() for line in fp]
    with open('testfilesindata', 'r', encoding='utf-8') as fp:
        list2 = [line.strip() for line in fp]
    
    # Compare the first entry of each list.
    word1 = list1[0]
    word2 = list2[0]
    
    if word1 == word2:
        print("%s and %s are identical" % (word1, word2))
    elif unicodedata.normalize('NFC', word1) == unicodedata.normalize('NFC', word2):
        # Equal only after normalization: same text, different code points.
        print("%s and %s look the same, but use different code points" % (word1, word2))
    else:
        print("%s and %s are unrelated" % (word1, word2))
    
  • 2021-01-07 05:32

    Use unicodedata.normalize to normalize the two strings to the same normal form:

    import unicodedata
    
    # In Python 3, str objects are already decoded; call .decode('utf8')
    # first only if word1 and word2 are bytes.
    normalized1 = unicodedata.normalize('NFC', word1)
    normalized2 = unicodedata.normalize('NFC', word2)
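
Applied to the task in the question (removing the files on one list from another), the same normalization could be used when building the lookup set. This is only a sketch: nfc, remove_listed, and the sample file names are hypothetical, and it assumes both lists are already decoded str objects.

```python
import unicodedata


def nfc(name):
    """Normalize a file name to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name)


def remove_listed(candidates, to_remove):
    """Return the entries of candidates whose NFC form does not appear in to_remove."""
    blocked = {nfc(name) for name in to_remove}
    return [name for name in candidates if nfc(name) not in blocked]


# Hypothetical sample data: the first entry uses e + U+0301,
# while the entry to remove uses the single code point U+00E9.
mac_listing = ["Adhe\u0301sion.pdf", "Notes.txt"]
windows_listing = ["Adh\u00e9sion.pdf"]

print(remove_listed(mac_listing, windows_listing))  # ['Notes.txt']
```

The entries that survive are returned in their original spelling; only the comparison uses the normalized form.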
    