python-unicode

Does python re (regex) have an alternative to \u unicode escape sequences?

我与影子孤独终老i submitted on 2019-12-11 11:11:54
Question: Python treats \uxxxx as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module doesn't treat \uxxxx as a Unicode character. Example:

    codepoint = 2014  # Say I got this dynamically from somewhere
    test = u"This string ends with \u2014"
    pattern = r"\u%s$" % codepoint
    assert(pattern[-5:] == "2014$")  # Ends with an escape sequence for U+2014
    assert(re.search(pattern, test) !=
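
A minimal sketch of one possible workaround, not taken from the truncated thread. It assumes Python 3 and that the code point arrives as a string of hexadecimal digits (both are assumptions). Instead of pasting digits after `\u`, it converts the code point into the actual character and escapes that character for use in the pattern.

```python
import re

codepoint = "2014"  # assumed: hex digits obtained dynamically
test = "This string ends with \u2014"

# Turn the hex digits into the real character, then escape it
# so it is always treated as a literal inside the pattern.
char = chr(int(codepoint, 16))
pattern = re.escape(char) + "$"

match = re.search(pattern, test)
```

Building the pattern from the character itself avoids relying on the regex engine to interpret `\u` escapes at all.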

Django admin does not allow saving unicode slugs

做~自己de王妃 submitted on 2019-12-11 10:58:19
Question: I'm trying to save a Persian slug for this model:

    class Category(models.Model):
        name = models.CharField('name', max_length=100)
        slug = models.SlugField('slug', unique=True)
        description = models.TextField('description')

        class Meta:
            verbose_name = 'category'
            verbose_name_plural = 'categories'

        @permalink
        def get_absolute_url(self):
            return ('category_detail', None, { 'slug': self.slug })

        def __unicode__(self):
            return u'%s' % self.name

But Django does not save the page and complains: Enter a

Do I have to encode a unicode variable before writing it to a file?

时光总嘲笑我的痴心妄想 submitted on 2019-12-11 09:05:47
Question: I read the "Unicode Pain" article a few days ago, and I keep the "Unicode Sandwich" in mind. Now I have to handle some Chinese text, and I've got a list:

    chinese = [u'中文', u'你好']

Do I need to encode it before writing to a file?

    add_line_break = [word + u'\n' for word in chinese]
    encoded_chinese = [word.encode('utf-8') for word in add_line_break]
    with open('filename', 'wb') as f:
        f.writelines(encoded_chinese)

Somehow I found out that in Python 2 I can do this:

    chinese = ['中文', '你好']
    with open('filename
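
A minimal sketch of the "Unicode Sandwich" idea mentioned in the question, not taken from the truncated thread: keep text as Unicode inside the program and let the file object encode at the boundary. `io.open` accepts an `encoding` argument on both Python 2 and Python 3. The temporary path is an assumption made so the example is self-contained.

```python
import io
import os
import tempfile

chinese = [u'中文', u'你好']

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'filename')

# Text mode with an explicit encoding: the file object encodes for us.
with io.open(path, 'w', encoding='utf-8') as f:
    f.writelines(word + u'\n' for word in chinese)

# Read it back the same way to confirm the round trip.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read().splitlines()
```

With this approach there is no manual `encode` call, so the list never has to exist in both text and byte form.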

Python 3 and b'\x92'.decode('latin1')

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-11 05:55:29
Question: I'm getting results I didn't expect from decoding b'\x92' with the latin1 codec. See the session below:

    Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
    >>> b'\xa3'.decode('latin1').encode('ascii', 'namereplace')
    b'\\N{POUND SIGN}'
    >>> b'\x92'.decode('latin1').encode('ascii', 'namereplace')
    b'\\x92'
    >>> ord(b'\x92'.decode('latin1'))
    146

Decoding b'\xa3' gave me exactly what I was expecting. But the two results for b'\x92' were not what
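
A short sketch of a likely explanation, not taken from the truncated thread: latin1 maps every byte to the code point with the same number, so b'\x92' becomes U+0092, an unnamed control character. The Windows code page cp1252 is the codec that maps 0x92 to a curly apostrophe. Whether cp1252 is what the asker actually wanted is an assumption.

```python
import unicodedata

raw = b'\x92'

as_latin1 = raw.decode('latin1')  # U+0092, a C1 control character
as_cp1252 = raw.decode('cp1252')  # U+2019

# Control characters belong to category 'Cc' and have no Unicode name,
# which is why 'namereplace' has nothing to substitute for U+0092.
category = unicodedata.category(as_latin1)
name = unicodedata.name(as_cp1252)
```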

Python: using regex and tokens with accented chars (negative lookbehind)

ⅰ亾dé卋堺 submitted on 2019-12-11 05:08:32
Question: I need to detect capitalized words in Spanish, but only when they are not preceded by a token, which can contain Unicode characters. (I'm using Python 2.7.12 on Linux.) This works fine with a non-Unicode token (e.g. guion:):

    >>> import regex
    >>> s = u"guion: El computador. Ángel."
    >>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
    >>> print p.sub( r"**\1**", s)
    guion: El computador. **Ángel**.

But the same logic fails to spot accented tokens (e.g. guión:):

    >>> s = u"guión:

UnicodeEncodeError on API-call (json)

浪子不回头ぞ submitted on 2019-12-10 23:29:11
Question: I am trying to print the result of this API call, but I am getting a UnicodeEncodeError. Probably a super noob question, but I would really appreciate any help with this :)

    import http.client
    import json

    api_key = 'hidden'
    connection = http.client.HTTPConnection('api.football-data.org')
    headers = { 'X-Auth-Token': api_key, 'X-Response-Control': 'minified' }
    connection.request('GET', '/v1/competitions', None, headers)
    response = json.loads(connection.getresponse().read().decode())
    print

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 304: ordinal not in range(128)

自古美人都是妖i submitted on 2019-12-10 17:12:39
Question: I just left the PC at work (using Python 2.7), where I had a script that I was just finishing up (reproduced below). It ran fine at work; I just wanted to add one or two things. But now I am at home using my Mac's version of Python (3.2.2), and I get the following error:

    Traceback (most recent call last):
      File "/Users/Downloads/sda/alias.py", line 25, in <module>
        for row_2 in in_csv:
      File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/encodings/ascii.py", line 26, in decode

How to print() a string in Python3 without exceptions?

回眸只為那壹抹淺笑 submitted on 2019-12-10 16:02:44
Question: Seemingly simple question: how do I print() a string in Python 3? It should be a simple:

    print(my_string)

But that doesn't work. Depending on the content of my_string, the environment variables, and the OS you use, that will throw a UnicodeEncodeError exception:

    >>> print("\u3423")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character '\u3423' in position 0: ordinal not in range(128)

Is there a clean, portable way to fix this?
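
One possible approach, sketched here rather than taken from the thread: before printing, round-trip the text through the output encoding with the `backslashreplace` error handler, so characters the terminal cannot represent become escape sequences instead of raising. The helper name `safe_text` is hypothetical, and falling back to ASCII when the stream reports no encoding is an assumption.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with unencodable characters replaced by backslash escapes."""
    # Use the caller's encoding, else stdout's, else fall back to ASCII (assumed default).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# Prints the character itself where the terminal supports it,
# or the escape sequence \u3423 where it does not.
print(safe_text("\u3423"))
```

The trade-off is that the output is lossy on limited terminals, but the call never raises UnicodeEncodeError.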

Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

痴心易碎 submitted on 2019-12-10 14:20:46
Question: I'm stuck trying to unescape HTML special characters. The problematic text is

    Rudimental &amp; Emeli Sand&eacute;

which should be converted to

    Rudimental & Emeli Sandé

The text is downloaded via wget (outside of Python). To test this, save an ANSI file containing this line and import it.

    import HTMLParser

    trackentry = open('import.txt', 'r').readlines()
    print(trackentry)
    track = trackentry[0]
    html_parser = HTMLParser.HTMLParser()
    track = html_parser.unescape(track)
    print(track)

I get this error when a line
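
For comparison, a minimal Python 3 sketch, not the Python 2.7.3 setup from the question: the standard library's `html.unescape` performs this conversion on text strings. It assumes the downloaded line contains HTML entities such as `&amp;` and `&eacute;`, and that it has already been decoded from bytes to text.

```python
import html

# Assumed input: the line as text, with HTML entities still escaped.
track = "Rudimental &amp; Emeli Sand&eacute;"

unescaped = html.unescape(track)
```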

Using textwrap.wrap with bytes count

蹲街弑〆低调 submitted on 2019-12-10 13:52:18
Question: How can I use the textwrap module to split before a line reaches a certain number of bytes (without splitting a multi-byte character)? I would like something like this:

    >>> textwrap.wrap('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
    ☺ ☺☺
    ☺☺ ☺
    ☺ ☺☺
    ☺☺

Answer 1: The result depends on the encoding used, because the number of bytes per character is a function of the encoding, and in many encodings, of the character as well. I'll assume we're using UTF-8, in which '☺' is encoded as e298ba and is three bytes long
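
The byte-counting idea in the answer can be sketched as a small greedy wrapper. This is not the answer's own code, which is cut off above, and `textwrap.wrap` has no `bytewidth` parameter. The function name `wrap_bytes` is hypothetical, and the sketch assumes words are separated by whitespace and measured in UTF-8 by default.

```python
def wrap_bytes(text, bytewidth, encoding='utf-8'):
    """Greedily pack whitespace-separated words into lines of at most bytewidth bytes."""
    lines = []
    current = ''
    for word in text.split():
        candidate = word if not current else current + ' ' + word
        # Measure the encoded length, not the character count.
        if len(candidate.encode(encoding)) <= bytewidth or not current:
            # A single word longer than bytewidth is kept whole on its own line.
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


wrapped = wrap_bytes('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', 10)
```

Because lines only break between words, a multi-byte character is never split.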