Question
I am working with multilingual text data, including Russian (written in the Cyrillic alphabet) and Turkish. I have to compare the words in two files, my_file and check_file, and if a word in my_file can be found in check_file, write it to an output file, keeping the meta-information about that word from both input files.
Some words are lowercased while others are capitalised, so I have to lowercase all the words to compare them. As I use Python 3.6.5 and Python 3 uses Unicode by default, it handles lowercasing and later capitalising the words correctly for Cyrillic. For Turkish, however, some letters are not handled correctly. Uppercase 'İ' should correspond to lowercase 'i', uppercase 'I' should correspond to lowercase 'ı', and lowercase 'i' should correspond to uppercase 'İ', which is not the case if I type the following in the console:
>>> print('İ'.lower())
i̇ # somewhat not rendered correctly, corresponds to unicode 'i\u0307'
>>> print('I'.lower())
i
>>> print('i'.upper())
I
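To see exactly what the console output above contains, the results can be inspected code point by code point. This is a small sketch using only the standard unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# '\u0130' is 'İ' (LATIN CAPITAL LETTER I WITH DOT ABOVE), written as an
# escape so the example does not depend on how the file is encoded.
lowered = '\u0130'.lower()      # default, language-neutral Unicode mapping

print(len(lowered))             # number of code points in the result
print(describe(lowered))        # names of those code points
print(describe('I'.lower()))    # plain ASCII i, not the dotless ı
print(describe('i'.upper()))    # plain ASCII I, not the dotted İ
```

The default mapping turns 'İ' into 'i' followed by a combining dot above, which is why the console shows a character that looks slightly off.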
I am doing as follows (simplified sample code):
# python my_file check_file language
import sys

language = sys.argv[3]

# code to get the files as lists
my_file_list = [['ıspanak', 'N'], ['ısır', 'N'], ['acık', 'V']]
check_file_list = [['109', 'Ispanak', 'food_drink'], ['470', 'Isır', 'action_words'], [409, 'Acık', 'action_words']]

# get the lists as dict
my_dict = {}
check_dict = {}

for l in my_file_list:
    word = l[0].lower()
    pos = l[1]
    my_dict[word] = pos

for l in check_file_list:
    word_id = l[0]
    word = l[1].lower()
    word_cat = l[2]
    check_dict[word] = [word_id, word_cat]

# compare the two dicts
for word, pos in my_dict.items():
    if word in check_dict:
        word_id = check_dict[word][0]
        word_cat = check_dict[word][1]
        print(word, pos, word_id, word_cat)
This gives me only one result, but it should return all three words:
acık V 409 action_words
What I've done so far based on this question:
- Read the accepted answer which proposes to use PyICU but I want my code to be useable without people having to install stuff so I didn't implement it.
- Tried to import locale and call locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8') as mentioned in the question, but it didn't change anything.
- Implemented two functions, turkish_lower(self) and turkish_upper(self), for the three problematic letters as described in the second answer, which seems to be the only solution:

import re

def turkish_lower(self):
    self = re.sub(r'İ', 'i', self)
    self = re.sub(r'I', 'ı', self)
    self = self.lower()
    return self

def turkish_upper(self):
    self = re.sub(r'i', 'İ', self)
    self = self.upper()
    return self
But how can I use these two functions without having to check if language == 'Turkish' every time? Should I override the built-in functions lower() and upper()? If yes, what is the Pythonic way of doing it? Should I implement classes for the various languages I'm working with and override the built-in functions inside the class for Turkish?
Answer 1:
You can create a simple "language-aware" string that subclasses str and does the proper uppercasing and lowercasing, for example:
class LanguageAwareStr(str):
    lang = None

class RussianStr(LanguageAwareStr):
    lang = 'ru'

class TurkishStr(LanguageAwareStr):
    lang = 'tr'
    _case_lookup_upper = {'İ': 'i', 'I': 'ı'}  # lookup uppercase letters
    _case_lookup_lower = {v: k for (k, v) in _case_lookup_upper.items()}

    # here we override the lower() and upper() methods
    def lower(self):
        chars = [self._case_lookup_upper.get(c, c) for c in self]
        result = ''.join(chars).lower()
        cls = type(self)  # so we return a TurkishStr result
        return cls(result)

    def upper(self):
        chars = [self._case_lookup_lower.get(c, c) for c in self]
        result = ''.join(chars).upper()
        cls = type(self)  # so we return a TurkishStr result
        return cls(result)
Then when you read a string, knowing what language it is, you wrap it in the proper LanguageAwareStr subclass, and then just use it regularly:
from langaware import RussianStr, TurkishStr

if language == 'turkish':
    LangStr = TurkishStr  # can also create a dict to lookup the correct class
Then when you read language strings, you simply wrap them in a call to LangStr():
for l in my_file_list:
    word = LangStr(l[0]).lower()
    pos = l[1]
    my_dict[word] = pos

for l in check_file_list:
    word_id = l[0]
    word = LangStr(l[1]).lower()
    word_cat = l[2]
    check_dict[word] = [word_id, word_cat]
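The comment in the snippet above mentions a dict for looking up the correct class. A minimal sketch of that idea follows, run end to end on the sample data from the question. The names LANG_CLASSES and lang_str are hypothetical, and the Turkish class here uses str.translate rather than the list comprehension shown earlier; that is a swapped-in variant that should behave the same way for these letters, not the answer's own code.

```python
class LanguageAwareStr(str):
    """Base class: behaves exactly like str."""
    lang = None


class TurkishStr(LanguageAwareStr):
    lang = 'tr'
    # Translation tables applied before the default Unicode case mapping.
    _to_lower = str.maketrans({'\u0130': 'i', 'I': '\u0131'})  # İ -> i, I -> ı
    _to_upper = str.maketrans({'i': '\u0130', '\u0131': 'I'})  # i -> İ, ı -> I

    def lower(self):
        # str.lower is called explicitly to avoid recursing into this method.
        return type(self)(str.lower(self.translate(self._to_lower)))

    def upper(self):
        return type(self)(str.upper(self.translate(self._to_upper)))


# Hypothetical registry: language name -> string class.
LANG_CLASSES = {'turkish': TurkishStr}


def lang_str(language):
    """Return the class registered for a language, or plain str otherwise."""
    return LANG_CLASSES.get(language.lower(), str)


LangStr = lang_str('Turkish')  # chosen once, no per-word language check

my_file_list = [['ıspanak', 'N'], ['ısır', 'N'], ['acık', 'V']]
check_file_list = [['109', 'Ispanak', 'food_drink'],
                   ['470', 'Isır', 'action_words'],
                   ['409', 'Acık', 'action_words']]

my_dict = {LangStr(word).lower(): pos for word, pos in my_file_list}
check_dict = {LangStr(word).lower(): (word_id, cat)
              for word_id, word, cat in check_file_list}

# Collect every word that appears in both inputs, with metadata from each.
matches = [(word, pos) + check_dict[word]
           for word, pos in my_dict.items() if word in check_dict]
for match in matches:
    print(*match)
```

Because unknown languages fall back to plain str, Russian input keeps the default behaviour without a dedicated class, and the comparison loop itself never needs to know which language it is handling.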
Answer 2:
I would suggest trying to install the Turkish language pack for the locale:
sudo apt-get install language-pack-tr
sudo dpkg-reconfigure locales # *
You can also check which locales are installed using the terminal command: $ locale -a
https://forum.yazbel.com/t/cozuldu-locale-setlocale-locale-lc-all-tr-tr-yapisinda-sorun-yasiyorum-turkce-karakter-sorunu/476
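Whether the locale is installed, and whether it changes the result, can also be checked from Python. The sketch below assumes the locale name tr_TR.UTF-8 used in the question. In Python 3, str.lower() and str.upper() follow the language-neutral Unicode case mappings and do not consult the locale, so the last line is expected to print the same thing either way, which matches what the question reports.

```python
import locale

try:
    locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8')
    installed = True
except locale.Error:
    installed = False  # the language pack is missing on this machine

print('tr_TR.UTF-8 installed:', installed)

# Case conversion of str objects is independent of the active locale.
print(repr('I'.lower()), repr('i'.upper()))
```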
Source: https://stackoverflow.com/questions/50135094/handle-turkish-uppercase-and-lowercase-correctly-need-to-modify-override-built