python-unicode

Python3: UnicodeEncodeError: 'ascii' codec can't encode character '\xfc'

浪子不回头ぞ submitted on 2019-12-19 14:01:07
Question: I'm trying to get a very simple example running on OS X with Python 3.5.1, but I'm really stuck. I have read many articles dealing with similar problems, but I cannot fix this myself. Do you have any hints on how to resolve this issue? I would like the correctly encoded latin-1 output as defined in mylist, without any errors. My code: # coding=<latin-1> mylist = [u'Glück', u'Spaß', u'Ähre',] print(mylist) The error: Traceback (most recent call last): File "/Users/abc/test.py", line 4
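A minimal sketch of the usual fix (not the asker's code): in Python 3 the list elements are already Unicode strings, and the UnicodeEncodeError comes from writing them to an ASCII-configured output stream. Wrapping the stream with an explicit encoding avoids depending on the locale; a BytesIO stands in here so the result is inspectable, but with a real terminal you would wrap sys.stdout.buffer the same way.

```python
import io

mylist = ['Glück', 'Spaß', 'Ähre']

# Wrap a binary stream with an explicit latin-1 text encoding instead of
# relying on the (possibly ASCII) locale default.
out = io.TextIOWrapper(io.BytesIO(), encoding='latin-1')
print(mylist, file=out)
out.flush()
encoded = out.buffer.getvalue()   # the latin-1 bytes actually written
```

The same wrapper around `sys.stdout.buffer` (or setting `PYTHONIOENCODING`) makes the original `print(mylist)` succeed.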

Python, convert 4-byte char to avoid MySQL error “Incorrect string value:”

喜欢而已 submitted on 2019-12-17 16:26:55
Question: I need to convert (in Python) a 4-byte character into some other character, so that it can be inserted into my utf-8 MySQL database without an error such as: "Incorrect string value: '\xF0\x9F\x94\x8E' for column 'line' at row 1". The answer to Warning raised by inserting 4-byte unicode to mysql shows how to do it: >>> import re >>> highpoints = re.compile(u'[\U00010000-\U0010ffff]') >>> example = u'Some example text with a sleepy face: \U0001f62a' >>> highpoints.sub(u'', example) u'Some example text with
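The same technique in Python 3 form (a sketch, assuming a wide-Unicode build): drop any code point outside the Basic Multilingual Plane so a 3-byte "utf8" MySQL column accepts the string. The more complete fix, where available, is switching the column to utf8mb4, which stores 4-byte characters directly.

```python
import re

# Match supplementary-plane characters (those needing 4 bytes in UTF-8).
highpoints = re.compile('[\U00010000-\U0010ffff]')

def strip_astral(text, replacement=''):
    """Remove 4-byte (supplementary-plane) characters from text."""
    return highpoints.sub(replacement, text)

cleaned = strip_astral('Some example text with a sleepy face: \U0001f62a')
```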

Why is character ID 160 not recognised as Unicode in PDFMiner?

那年仲夏 submitted on 2019-12-14 03:00:21
Question: I am converting .pdf files into .xml files using PDFMiner. For each word in the .pdf file, PDFMiner checks (among many other things) whether it is Unicode. If it is, it returns the character; if it is not, it raises an exception and returns the string "(cid:%d)", where %d is the character id, which I think is the Unicode decimal. This is well explained in the edit part of this question: What is this (cid:51) in the output of pdf2txt?. I report the code here for convenience: def render
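A hedged post-processing sketch (not part of PDFMiner's API): PDFMiner emits "(cid:NNN)" when a glyph has no ToUnicode mapping. The NNN is a font-internal character id, not a Unicode code point in general, so any mapping back to text is a per-font guess; the safest generic treatment is to strip or flag the markers.

```python
import re

# "(cid:NNN)" markers left behind by PDFMiner for unmapped glyphs.
CID_PATTERN = re.compile(r'\(cid:\d+\)')

def strip_cids(text, placeholder=''):
    """Replace PDFMiner (cid:NNN) markers with a placeholder."""
    return CID_PATTERN.sub(placeholder, text)

cleaned = strip_cids('price(cid:160)list')
```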

open() and codecs.open() in Python 2.7 behave strangely differently

不羁的心 submitted on 2019-12-14 02:02:41
Question: I have a text file whose first line contains Unicode characters and whose other lines are all ASCII. I try to read the first line into one variable and all other lines into another. However, when I use the following code: # -*- coding: utf-8 -*- import codecs import os filename = '1.txt' f = codecs.open(filename, 'r3', encoding='utf-8') print f names_f = f.readline().split(' ') data_f = f.readlines() print len(names_f) print len(data_f) f.close() print 'And now for something completely differerent:' g = open
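A sketch of the underlying difference (the filename and contents are hypothetical): in Python 2.7, codecs.open() yields decoded unicode objects while plain open() yields raw byte strings, so splitting and length counting can disagree between the two. io.open() behaves like codecs.open() here and is the forward-compatible spelling (it is the builtin open() in Python 3).

```python
import io
import os
import tempfile

# Write a file whose first line is non-ASCII, like the one in the question.
path = os.path.join(tempfile.mkdtemp(), '1.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Gl\u00fcck Stra\u00dfe\n')
    f.write(u'plain ascii line\n')

# Reading back with an explicit encoding yields decoded text, so
# split() counts characters, not UTF-8 bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    names = f.readline().split(u' ')
    data = f.readlines()
```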

Python io module's TextIOWrapper and BufferedRWPair classes are not playing nicely with pySerial

只谈情不闲聊 submitted on 2019-12-13 16:27:57
Question: I'm writing a serial adapter for some scientific hardware whose command set uses UTF-8 character encoding. All responses from the hardware are terminated with a carriage return (u'\r'). I would like to be able to use pySerial's readline() function with an EOL character specified, so I have this setup, à la this thread: import serial import io ser = serial.Serial(port='COM10', baudrate=128000) sio = io.TextIOWrapper(io.BufferedRWPair(ser, ser, 1), encoding='utf-8', newline=u'\r') ser.open() #
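A testable stand-in for the setup above: the Serial object is replaced by BytesIO objects (the hardware responses are invented), so TextIOWrapper's newline='\r' readline() behaviour can be checked in isolation. With real hardware, the same Serial instance is passed as both reader and writer, exactly as in the question.

```python
import io

# Fake "hardware" that has already sent two CR-terminated responses.
reader = io.BytesIO('TEMP 23.5\rOK\r'.encode('utf-8'))
writer = io.BytesIO()

# newline='\r' makes readline() treat a bare carriage return as EOL,
# without translating it.
sio = io.TextIOWrapper(io.BufferedRWPair(reader, writer, 1),
                       encoding='utf-8', newline='\r')
first = sio.readline()
second = sio.readline()
```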

Unicode category for commas and quotation marks

一笑奈何 submitted on 2019-12-13 10:23:43
Question: I have this helper function that gets rid of control characters in XML text: def remove_control_characters(s): #Remove control characters in XML text t = "" for ch in s: if unicodedata.category(ch)[0] == "C": t += " " if ch == "," or ch == "\"": t += "" else: t += ch return "".join(ch for ch in t if unicodedata.category(ch)[0]!="C") I would like to know whether there is a Unicode category covering quotation marks and commas. Answer 1: In Unicode, the control characters' general category is 'Cc',
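A sketch answering the question directly: there is no single Unicode category for "commas and quotation marks". The comma and the ASCII double quote are Po (Other Punctuation), while typographic quotes are Pi/Pf, so the function has to name the characters (or test the 'P' category prefix) itself. This version also fixes the if/if bug in the original by making the branches exclusive.

```python
import unicodedata

def remove_control_characters(s):
    """Replace control characters with spaces; drop commas and quotes."""
    out = []
    for ch in s:
        if unicodedata.category(ch)[0] == 'C':
            out.append(' ')                    # control character -> space
        elif ch in u',"\u201c\u201d':
            continue                           # drop commas/quotation marks
        else:
            out.append(ch)
    return ''.join(out)

result = remove_control_characters(u'a,b"c\x01d')
```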

Python - isalpha() returns True on unicode modifiers

孤者浪人 submitted on 2019-12-13 09:00:21
Question: Why does u'\u02c7'.isalpha() return True, if the symbol ˇ is not alphabetic? Does this method work properly only with ASCII characters? Answer 1: U+02C7 CARON is a code point in the Lm (Modifier Letter) category, so according to the Unicode standard, it is alphabetic. The documentation for str.isalpha() makes it clear what is included: Alphabetic characters are those characters defined in the Unicode character database as "Letter", i.e., those with general category property being one of "Lm", "Lt", "Lu",
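The answer's claim can be checked directly with the unicodedata module: U+02C7 CARON really is in category Lm, which str.isalpha() counts as alphabetic.

```python
import unicodedata

caron = u'\u02c7'                       # the ˇ character from the question
category = unicodedata.category(caron)  # 'Lm' -> Letter, modifier
```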

Python: UnicodeDecodeError: 'utf8'

末鹿安然 submitted on 2019-12-13 08:38:26
Question: I'm having trouble saving accented letters. I'm using PostgreSQL and Python 2.7. PostgreSQL ENCODING = 'LATIN1'. I already added this line, but it did not work: #!/usr/bin/python # -*- coding: UTF-8 -*- More about the error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xed Please, any idea how to fix it? @Edit: cur = conn.cursor() cur.execute("SELECT * FROM users") rows = cur.fetchall() obj_list = list() for row in rows: ob = dict() ob['ID'] = row[0] ob['NAME'] = row[1] ob['CITY']
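A pure-Python sketch of the mismatch behind the error (the database calls are summarized as comments, and the connection details are assumptions based on the question): the driver was decoding LATIN1 bytes from the database with the utf8 codec. With psycopg2, asking the driver to use the database's own encoding is the usual fix.

```python
# With psycopg2, the hedged fix would be (sketch only, not run here):
#     conn = psycopg2.connect(...)
#     conn.set_client_encoding('LATIN1')   # match the LATIN1 database
#
# The failing step itself, reproduced without a database:
data = u'Mar\u00eda'.encode('latin-1')    # b'Mar\xeda' -- note the 0xed byte
try:
    data.decode('utf-8')                  # what raised the UnicodeDecodeError
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

text = data.decode('latin-1')             # the correct decoding
```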

Reading utf-8 escape sequences from a file

瘦欲@ submitted on 2019-12-12 20:02:06
Question: I have a utf-8 encoded file that contains multiple lines like \x02I don't like \x0307bananas\x03.\x02 Hey, how are you doing? You called? How do I read the lines of that file into a list, decoding all the escape sequences? I tried the code below: with codecs.open(file, 'r', encoding='utf-8') as q: quotes = q.readlines() print(str(random.choice(quotes))) But it prints the line without decoding the escape characters: \x02I don't like \x0307bananas\x03\x02 (Note: the escape characters are IRC color codes
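A sketch, assuming the file really stores the six printable characters backslash-x-0-2 rather than a raw 0x02 control byte: the 'unicode_escape' codec converts such sequences into the actual control characters that IRC clients render as formatting.

```python
import codecs

# One line as it appears literally in the file (raw string, so the
# backslashes are real characters, not escapes yet).
line = r"\x02I don't like \x0307bananas\x03.\x02"

# unicode_escape interprets \xNN sequences, producing the control chars.
decoded = codecs.decode(line, 'unicode_escape')
```

Caveat: unicode_escape assumes latin-1 for the non-escape characters, so for genuinely non-ASCII UTF-8 text it should be applied with care (e.g. per-line, before mixing in other Unicode).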

Unsuppress UnicodeEncodeError exceptions when run from Aptana Studio PyDev

可紊 submitted on 2019-12-12 19:23:47
Question: The following statement should raise a UnicodeEncodeError exception: print 'str+{}'.format(u'unicode:\u2019') In a Python shell, the exception is raised as expected: >>> print 'str+{}'.format(u'unicode:\u2019') Traceback (most recent call last): File "<pyshell#10>", line 1, in <module> print 'str+{}'.format(u'unicode:\u2019') UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128) However, if I place that line at the start of my
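A sketch of the failing step, in Python 3 syntax (the question is Python 2, but the mechanism is the same): the exception is raised when u'\u2019' is encoded with the ASCII codec, which only happens if the output stream or default encoding is ASCII. An IDE run configuration such as PyDev's console typically supplies a UTF-8 stream, so the same statement silently succeeds there.

```python
text = u'unicode:\u2019'

# ASCII cannot represent U+2019 -- this is the step that raises in an
# ASCII-configured shell.
try:
    text.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True

# A UTF-8-configured console (as PyDev usually provides) succeeds.
ok = text.encode('utf-8')
```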