python-unicode | 易学教程

Does python re (regex) have an alternative to \u unicode escape sequences?

Question: Python treats \uxxxx as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module doesn't treat \uxxxx as a Unicode character. Example:

    codepoint = 2014  # Say I got this dynamically from somewhere
    test = u"This string ends with \u2014"
    pattern = r"\u%s$" % codepoint
    assert(pattern[-5:] == "2014$")  # Ends with an escape sequence for U+2014
    assert(re.search(pattern, test) !=
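
One way to sidestep the escape-sequence question is to build the character itself and escape it for the regex engine. The sketch below assumes Python 3 (on Python 2.7, unichr would take the place of chr) and assumes the code point arrives as a string of hex digits; the function name ends_with_codepoint is hypothetical and used only for illustration.

```python
import re


def ends_with_codepoint(text, hex_codepoint):
    """Return True if the pattern '<char>$' matches, where <char> is the
    character whose code point is given as hex digits (e.g. "2014")."""
    char = chr(int(hex_codepoint, 16))  # "2014" -> U+2014 (EM DASH)
    # re.escape guards against characters that are regex metacharacters.
    pattern = re.escape(char) + "$"
    return re.search(pattern, text) is not None


print(ends_with_codepoint("This string ends with \u2014", "2014"))
```

Building the pattern from the real character avoids relying on how a given version of the re module treats \u inside a pattern string.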

Django admin does not allow saving unicode slugs

Question: I'm trying to save a Persian slug for this model:

    class Category(models.Model):
        name = models.CharField('name', max_length=100)
        slug = models.SlugField('slug', unique=True)
        description = models.TextField('description')

        class Meta:
            verbose_name = 'category'
            verbose_name_plural = 'categories'

        @permalink
        def get_absolute_url(self):
            return ('category_detail', None, { 'slug': self.slug })

        def __unicode__(self):
            return u'%s' % self.name

But Django does not save the page and complains that: Enter a

Do I have to encode unicode variable before write to file?

Question: I read the "Unicode Pain" article days ago, and I keep the "Unicode Sandwich" in mind. Now I have to handle some Chinese, and I've got a list:

    chinese = [u'中文', u'你好']

Do I need to encode before writing to a file?

    add_line_break = [word + u'\n' for word in chinese]
    encoded_chinese = [word.encode('utf-8') for word in add_line_break]
    with open('filename', 'wb') as f:
        f.writelines(encoded_chinese)

Somehow I found out that in Python 2 I can do this:

    chinese = ['中文', '你好']
    with open('filename
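
A possible alternative to encoding each string by hand is to let the file object do it. The sketch below assumes UTF-8 output and uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3; the temporary path is hypothetical and exists only so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

chinese = [u'中文', u'你好']

# Hypothetical output path in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'filename')

# The file object encodes on write, so the list stays Unicode text throughout.
# newline='\n' keeps line endings identical on every platform.
with io.open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.writelines(word + u'\n' for word in chinese)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))
```

This keeps the "Unicode Sandwich" intact: text inside the program, bytes only at the file boundary.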

Python 3 and b'\x92'.decode('latin1')

Question: I'm getting results I didn't expect from decoding b'\x92' with the latin1 codec. See the session below:

    Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
    >>> b'\xa3'.decode('latin1').encode('ascii', 'namereplace')
    b'\\N{POUND SIGN}'
    >>> b'\x92'.decode('latin1').encode('ascii', 'namereplace')
    b'\\x92'
    >>> ord(b'\x92'.decode('latin1'))
    146

The result of decoding b'\xa3' gave me exactly what I was expecting. But the two results for b'\x92' were not what
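
For comparison, the sketch below decodes the same byte with two codecs. In latin1, byte 0x92 maps to U+0092, a C1 control character with no Unicode name, which is why namereplace has nothing to substitute. The cp1252 comparison is an assumption about where such a byte may have come from (Windows-1252 text), not something stated in the question.

```python
import unicodedata

raw = b'\x92'

as_latin1 = raw.decode('latin1')  # U+0092, a C1 control character
as_cp1252 = raw.decode('cp1252')  # U+2019 under Windows-1252

# Control characters have no Unicode name, so supply a default value.
latin1_name = unicodedata.name(as_latin1, '<unnamed>')
cp1252_name = unicodedata.name(as_cp1252, '<unnamed>')

print(hex(ord(as_latin1)), latin1_name)
print(hex(ord(as_cp1252)), cp1252_name)
print(as_cp1252.encode('ascii', 'namereplace'))
```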

Python: using regex and tokens with accented chars (negative lookbehind)

Question: I need to detect capitalized words in Spanish, but only when they are not preceded by a token, which can have Unicode chars. (I'm using Python 2.7.12 on Linux.) This works OK (non-Unicode token, e.g. guion:):

    >>> import regex
    >>> s = u"guion: El computador. Ángel."
    >>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
    >>> print p.sub( r"**\1**", s)
    guion: El computador. **Ángel**.

But the same logic fails to spot accented tokens (e.g. guión:):

    >>> s = u"guión:

UnicodeEncodeError on API-call (json)

Question: I am trying to print out the result of this API call, but I am getting a UnicodeEncodeError. Probably a super noob question, but I would really appreciate any help with this :)

    import http.client
    import json

    api_key = 'hidden'
    connection = http.client.HTTPConnection('api.football-data.org')
    headers = { 'X-Auth-Token': api_key, 'X-Response-Control': 'minified' }
    connection.request('GET', '/v1/competitions', None, headers)
    response = json.loads(connection.getresponse().read().decode())
    print

UnicodeDecodeError: 'ascii' codec can't decode byte in 0xc3 in position 304: ordinal not in range(128)

Question: I just left the PC at work (using Python 2.7) and had a script that I was just finishing up (reproduced below). It ran fine at work; I just wanted to add one or two things. But I came home and am using my Mac's version of Python (3.2.2), and I get the following error:

    Traceback (most recent call last):
      File "/Users/Downloads/sda/alias.py", line 25, in <module>
        for row_2 in in_csv:
      File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/encodings/ascii.py", line 26, in decode
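
The traceback shows Python 3 decoding the file with the ascii codec. A minimal sketch of opening a CSV file with an explicit encoding follows. UTF-8 is an assumption here (byte 0xc3 is a common lead byte for accented Latin letters in UTF-8, but the excerpt does not say how the file was encoded), and the sample file and its contents are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for the asker's CSV.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['name', 'city'], ['José', 'Málaga']])

# Passing encoding= stops open() from falling back to the locale default,
# which is what selects the ascii codec on some systems.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows)
```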

How to print() a string in Python3 without exceptions?

Question: Seemingly simple question: How do I print() a string in Python 3? It should be a simple:

    print(my_string)

But that doesn't work. Depending on the content of my_string, environment variables, and the OS you use, that will throw a UnicodeEncodeError exception:

    >>> print("\u3423")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character '\u3423' in position 0: ordinal not in range(128)

Is there a clean portable way to fix this?
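
One possible approach, sketched below, is to round-trip the text through the output stream's own encoding with the backslashreplace error handler, so characters the stream cannot represent are printed as escapes instead of raising. The helper name safe_print is hypothetical, and the ASCII-only stream at the end merely simulates a terminal like the one in the traceback.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent."""
    stream = stream if stream is not None else sys.stdout
    # Some streams (e.g. io.StringIO) report no encoding; fall back to ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Unencodable characters become \uXXXX escapes rather than exceptions.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)


# Simulate an ASCII-only terminal and capture what would be written to it.
buffer = io.BytesIO()
ascii_stream = io.TextIOWrapper(buffer, encoding="ascii", newline="\n")
safe_print("\u3423", stream=ascii_stream)
ascii_stream.flush()
captured = buffer.getvalue()
```

The trade-off is that unencodable characters are shown as escapes rather than as themselves; whether that counts as clean depends on what the output is for.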

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

Question: I'm stuck here trying to unescape HTML special characters. The problematic text is

    Rudimental & Emeli Sandé

which should be converted to

    Rudimental & Emeli Sandé

The text is downloaded via WGET (outside of Python). To test this, save an ANSI file with this line and import it.

    import HTMLParser
    trackentry = open('import.txt', 'r').readlines()
    print(trackentry)
    track = trackentry[0]
    html_parser = HTMLParser.HTMLParser()
    track = html_parser.unescape(track)
    print(track)

I get this error when a line
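
For reference, a minimal sketch of unescaping HTML entities on Python 3, where the standard library provides html.unescape (the question itself targets Python 2.7.3, so this is not a drop-in fix). The escaped input string is an assumption: the excerpt above shows the text with its entities already decoded, so the exact entities in the original file are unknown.

```python
import html

# Assumed entity-escaped form of the track name; the real input may differ.
escaped = 'Rudimental &amp; Emeli Sand&eacute;'

# html.unescape takes a str and returns a str with entities replaced.
track = html.unescape(escaped)
print(track)
```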

Using textwrap.wrap with bytes count

Question: How can I use the textwrap module to split before a line reaches a certain number of bytes (without splitting a multi-byte character)? I would like something like this:

    >>> textwrap.wrap('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
    ☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺

Answer 1: The result depends on the encoding used, because the number of bytes per character is a function of the encoding, and in many encodings, of the character as well. I'll assume we're using UTF-8, in which '☺' is encoded as e298ba and is three bytes long
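
The byte-counting idea can be sketched as a small greedy wrapper. This is not the approach from the truncated answer and it does not use textwrap itself (which has no bytewidth option); the function name wrap_bytes is hypothetical, UTF-8 is assumed as in the answer, and breaks happen only at whitespace, so a single word longer than the limit is left on its own line unsplit.

```python
def wrap_bytes(text, bytewidth, encoding="utf-8"):
    """Greedily wrap text at whitespace so each line fits in bytewidth bytes."""
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Measure the encoded length, not the number of characters.
        if not current or len(candidate.encode(encoding)) <= bytewidth:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


wrapped = wrap_bytes('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', 10)
for line in wrapped:
    print(line, len(line.encode('utf-8')))
```

Because lines are only ever joined or broken at spaces, a multi-byte character is never split in the middle.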