python-unicode

Why does ElementTree reject UTF-16 XML declarations with “encoding incorrect”?

左心房为你撑大大i submitted on 2019-12-04 17:49:48
Question: In Python 2.7, when I pass ElementTree's fromstring() method a unicode string whose XML declaration specifies encoding="UTF-16", I get a ParseError saying that the specified encoding is incorrect:

    >>> from xml.etree import ElementTree
    >>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
    >>> ElementTree.fromstring(data)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in
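
A likely explanation is that the unicode string is re-encoded before it reaches the underlying parser, so the bytes no longer match the UTF-16 declaration. One possible workaround, sketched here as an assumption rather than as the accepted answer, is to encode the string to UTF-16 yourself so that the declaration and the bytes agree:

```python
from xml.etree import ElementTree

# The declaration says UTF-16, so hand the parser bytes that really are UTF-16.
data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
utf16_bytes = data.encode('utf-16')  # the 'utf-16' codec prepends a byte order mark

root = ElementTree.fromstring(utf16_bytes)
print(root.tag)
```

Dropping the encoding attribute from the declaration before parsing is another option when you control the input.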

python 2.7 lowercase

无人久伴 submitted on 2019-12-04 15:48:03
Question: When I use .lower() in Python 2.7, the string is not converted to lowercase for the letters ŠČŽ. I read the data from a dictionary. I tried str(tt["code"]).lower() and tt["code"].lower(). Any suggestions?

Answer 1: Use unicode strings:

    drostie@signy:~$ python
    Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print "ŠČŽ"
    ŠČŽ
    >>> print "ŠČŽ".lower()
    ŠČŽ
    >>> print u"ŠČŽ".lower()
    ščž

See that little u? That means

Reading russian language data from csv

穿精又带淫゛_ submitted on 2019-12-03 20:50:18
I have some data in a CSV file that is in Russian:

    2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
    2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
    2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы

The delimiter is the ; symbol. I want to read the data and put it into an array. I tried to read the data with this code:

    def loadCsv(filename):
        lines = csv.reader(open(filename, "rb"), delimiter=";")
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [str(x) for x in dataset[i]]
        return dataset

Then I read and print the result: mydata =
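
For comparison, here is a minimal sketch of the same loader that decodes the file while reading it. It targets Python 3, where the csv module works on text; the function name load_csv and the default encoding of UTF-8 are assumptions, since the excerpt does not say how the file is encoded (a Russian file might equally be cp1251).

```python
import csv
import io


def load_csv(filename, encoding='utf-8', delimiter=';'):
    """Read a delimited text file into a list of rows (hypothetical helper)."""
    # newline='' lets the csv module handle line endings itself.
    with io.open(filename, encoding=encoding, newline='') as handle:
        return [row for row in csv.reader(handle, delimiter=delimiter)]
```

Calling load_csv(path, encoding='cp1251') would be the variation to try if UTF-8 decoding fails.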

python 2.7 lowercase

旧巷老猫 submitted on 2019-12-03 09:50:42
When I use .lower() in Python 2.7, the string is not converted to lowercase for the letters ŠČŽ. I read the data from a dictionary. I tried str(tt["code"]).lower() and tt["code"].lower(). Any suggestions?

Use unicode strings:

    drostie@signy:~$ python
    Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print "ŠČŽ"
    ŠČŽ
    >>> print "ŠČŽ".lower()
    ŠČŽ
    >>> print u"ŠČŽ".lower()
    ščž

See that little u? That means that it's created as a unicode object rather than a str object.

Use unicode:

    >>> print u'ŠČŽ'.lower().encode(
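
When the value comes from a dictionary as a byte string rather than as a literal, the same idea applies: decode first, then lowercase. The sketch below assumes the bytes are UTF-8; lower_text is a hypothetical helper name, not something from the original answers.

```python
def lower_text(value, encoding='utf-8'):
    """Lowercase text, decoding byte strings first (hypothetical helper)."""
    if isinstance(value, bytes):  # in Python 2, bytes is an alias of str
        value = value.decode(encoding)
    return value.lower()


raw = b'\xc5\xa0\xc4\x8c\xc5\xbd'  # UTF-8 bytes for the letters in the question
lowered = lower_text(raw)
print(lowered)
```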

Unicode Encode Error when writing pandas df to csv

半世苍凉 submitted on 2019-12-02 23:36:30
I cleaned 400 Excel files, read them into Python using pandas, and appended all the raw data into one big DataFrame. Then, when I try to export it to a CSV file:

    df.to_csv("path",header=True,index=False)

I get this error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 20: ordinal not in range(128)

Can someone suggest a way to fix this and explain what it means? Thanks.

You have unicode values in your DataFrame. Files store bytes, which means all unicode values have to be encoded into bytes before they can be stored in a file. You have to specify an encoding, such as utf-8. For example, df
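
The encoding step described in that answer can be reproduced without pandas. This small sketch uses only the character named in the error message; the pandas call in the final comment relies on the encoding parameter of DataFrame.to_csv and is shown as an illustration, not as the truncated original example.

```python
value = u'\xc7'  # the character named in the error message

# ASCII has no representation for this character, hence the error.
try:
    value.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent it, here as two bytes.
utf8_bytes = value.encode('utf-8')
print(ascii_ok, repr(utf8_bytes))

# With pandas, the same choice is made through the encoding argument:
#   df.to_csv("path", header=True, index=False, encoding="utf-8")
```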

How to properly iterate over unicode characters in Python

放肆的年华 submitted on 2019-12-02 07:01:49
Question: I would like to iterate over a string and output all emojis. I'm trying to iterate over the characters and check them against an emoji list. However, Python seems to split the unicode characters into smaller ones, breaking my code. Example:

    >>> list(u'Test \U0001f60d')
    [u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any ideas why u'\U0001f60d' gets split? Or what's a better way to extract all emojis? This was my original extraction code:

    def get_emojis(text):
        emojis = []
        for character

How to properly iterate over unicode characters in Python

余生长醉 submitted on 2019-12-02 06:45:33
I would like to iterate over a string and output all emojis. I'm trying to iterate over the characters and check them against an emoji list. However, Python seems to split the unicode characters into smaller ones, breaking my code. Example:

    >>> list(u'Test \U0001f60d')
    [u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any ideas why u'\U0001f60d' gets split? Or what's a better way to extract all emojis? This was my original extraction code:

    def get_emojis(text):
        emojis = []
        for character in text:
            if character in EMOJI_SET:
                emojis.append(character)
        return emojis

Python pre-3.3 uses UTF-16LE
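
On builds that store such characters as two UTF-16 code units (narrow builds of Python before 3.3), one way to get whole characters back is to rejoin each surrogate pair while iterating. The following is a minimal sketch of that idea; iter_characters is a hypothetical name, and get_emojis is adapted to take the emoji set as a parameter instead of the global EMOJI_SET.

```python
def iter_characters(text):
    """Yield characters, keeping UTF-16 surrogate pairs together."""
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        is_pair = (
            u'\ud800' <= ch <= u'\udbff'                # high surrogate
            and i + 1 < n
            and u'\udc00' <= text[i + 1] <= u'\udfff'   # low surrogate
        )
        if is_pair:
            yield text[i:i + 2]
            i += 2
        else:
            yield ch
            i += 1


def get_emojis(text, emoji_set):
    """Collect every character of text that appears in emoji_set."""
    return [ch for ch in iter_characters(text) if ch in emoji_set]
```

On Python 3.3 and later, plain iteration already yields whole code points, so the pairing branch is simply never taken for ordinary strings.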

python sys.getsizeof method returning different sizes on different versions of python

∥☆過路亽.° submitted on 2019-12-02 05:46:34
Question: sys.getsizeof returns different sizes for a unicode string on different versions of Python. sys.getsizeof(u'Hello World') returns 96 on Python 2.7.3 and 72 on Python 2.7.11.

Answer 1: sys.getsizeof is giving you implementation details by definition, and none of those details are guaranteed to remain stable between versions or even builds. It's unlikely that anything significant changed between 2.7.3 and 2.7.11 though; YOU's comment on character width likely explains the discrepancy;
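
To check whether character width differs between two interpreters, sys.maxunicode can be compared alongside the reported size. This is a small diagnostic sketch, assuming the discrepancy comes from narrow versus wide unicode builds as the answer suggests; unicode_build is a hypothetical helper name.

```python
import sys


def unicode_build():
    """Report the unicode build flavor of the running interpreter."""
    # Narrow Python 2 builds report 0xFFFF; wide builds report 0x10FFFF.
    # Python 3.3+ always reports 0x10FFFF and sizes each string individually.
    return 'narrow' if sys.maxunicode == 0xFFFF else 'wide'


size = sys.getsizeof(u'Hello World')
print(unicode_build(), size)
```

Running it under both interpreters shows whether the two builds differ in character width.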

Removing all Emojis from Text

穿精又带淫゛_ submitted on 2019-12-02 00:24:28
Question: This question has been asked here (Python : How to remove all emojis) without a solution. I have a step towards the solution, but need help finishing it off. I went and got all the emoji hex code points from the emoji site: https://www.unicode.org/emoji/charts/emoji-ordering.txt I then read in the file like so:

    file = open('emoji-ordering.txt')
    temp = file.readline()
    final_list = []
    while temp != '':
        #print(temp)
        if not temp[0] == '#' :
            utf_8_values = ((temp.split(';')[0]).rstrip()).split(' ')

Removing all Emojis from Text

守給你的承諾、 submitted on 2019-12-01 20:48:09
This question has been asked here (Python : How to remove all emojis) without a solution. I have a step towards the solution, but need help finishing it off. I went and got all the emoji hex code points from the emoji site: https://www.unicode.org/emoji/charts/emoji-ordering.txt I then read in the file like so:

    file = open('emoji-ordering.txt')
    temp = file.readline()
    final_list = []
    while temp != '':
        #print(temp)
        if not temp[0] == '#' :
            utf_8_values = ((temp.split(';')[0]).rstrip()).split(' ')
            values = ["u\\"+(word[0]+((8 - len(word[2:]))*'0' + word[2:]).rstrip()) for word in utf_8_values]
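
One way the approach could be finished is to turn each code point into a real character with chr() instead of building escape strings, then strip the resulting sequences with a regular expression. The sketch below targets Python 3 and is not the poster's solution. The sample lines are illustrative stand-ins in the layout the parsing code implies (space-separated U+XXXX values before a semicolon), not lines copied from the real file, and all function names are hypothetical.

```python
import re

# Illustrative stand-ins for lines of emoji-ordering.txt (assumed layout).
SAMPLE_LINES = [
    '# comment line',
    'U+1F600 ; 6.1 # grinning face',
    'U+1F44B U+1F3FB ; 8.0 # waving hand: light skin tone',
]


def parse_emoji_line(line):
    """Return the emoji string for one data line, or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    code_points = line.split(';')[0].split()
    # Drop the "U+" prefix and convert each hex value to a character.
    return ''.join(chr(int(cp[2:], 16)) for cp in code_points)


def build_emoji_pattern(lines):
    """Compile a regex that matches any emoji sequence found in lines."""
    emojis = [e for e in map(parse_emoji_line, lines) if e]
    # Longest first, so multi-code-point sequences win over their prefixes.
    emojis.sort(key=len, reverse=True)
    return re.compile('|'.join(re.escape(e) for e in emojis))


def remove_emojis(text, pattern):
    """Delete every match of pattern from text."""
    return pattern.sub('', text)


pattern = build_emoji_pattern(SAMPLE_LINES)
print(remove_emojis('Hi \U0001F600!', pattern))
```

With the real file, build_emoji_pattern could be given the open file object directly, since it only needs an iterable of lines.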