python-unicode

How can I fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?

邮差的信 posted on 2019-12-06 11:37:54
I get a UnicodeDecodeError when using pdfminer (the latest version from git), installed via pip install git+https://github.com/pdfminer/pdfminer.six.git:

    Traceback (most recent call last):
      File "pdfminer_sample3.py", line 34, in <module>
        print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
      File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt
        text = retstr.getvalue()
      File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
        self.buf += ''.join(self.buflist)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How can I fix that?
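The traceback points at Python 2's StringIO joining unicode and byte strings, which triggers an implicit ASCII decode. A minimal sketch of one common way around it, assuming the extracted text can be collected as UTF-8 bytes in an io.BytesIO buffer and decoded explicitly at the end. The chunks below are hypothetical stand-ins, not output from the PDF in the question.

```python
import io

# Hypothetical stand-ins for UTF-8 encoded chunks a PDF converter might emit.
chunks = [b"Total: 42 ", b"\xe2\x82\xac", b"\n"]  # \xe2\x82\xac is the euro sign

buffer = io.BytesIO()
for chunk in chunks:
    buffer.write(chunk)          # only bytes go in, so nothing is decoded implicitly

# Decode once, with an explicit codec, instead of relying on the ASCII default.
text = buffer.getvalue().decode("utf-8")
print(ascii(text))
```

Keeping the buffer all-bytes (or all-text with io.StringIO) is the point; mixing the two is what produced the error above.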

python charmap codec can't decode byte X in position Y character maps to <undefined>

China☆狼群 posted on 2019-12-06 08:44:28
Question: I'm experimenting with Python libraries for data analysis. The problem I'm facing is this exception: UnicodeDecodeError was unhandled by user code. Message: 'charmap' codec can't decode byte 0x81 in position 165: character maps to <undefined>. I have looked into answers to similar issues, and the OP always seems to be either reading text with a different encoding or printing it. In my code the error shows up at an import statement, which is what confuses me. I'm using 64-bit Python 3.3 on Visual Studio 2015.
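The 'charmap' message usually means bytes were decoded with a Windows code page that has no mapping for 0x81. A small sketch reproducing that failure and the explicit-encoding fix, assuming the data is really UTF-8 (an assumption; the question does not say what the file contains):

```python
# "\u00c1" (A with acute) is encoded in UTF-8 as the two bytes C3 81.
raw = "\u00c1".encode("utf-8")

try:
    raw.decode("cp1252")         # cp1252 leaves byte 0x81 undefined
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Naming the real encoding avoids the platform default entirely.
text = raw.decode("utf-8")
```

The same idea applies when reading files: open(path, encoding="utf-8") rather than relying on the locale's default code page.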

BeautifulSoup "encode("utf-8")"

守給你的承諾、 posted on 2019-12-06 08:04:06
    from bs4 import BeautifulSoup
    import urllib.request

    link = ('https://mywebsite.org')
    req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    url = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(url, "html.parser")
    body = soup.find_all('div', {"class":"wrapper"})
    print(body)

Hi guys, I have a problem with this code. If I run it, I get the error UnicodeEncodeError: 'charmap' codec can't encode character '\u2022' in position 138: character maps to <undefined>. I searched and found that I had to add .encode("utf-8"), but if I add it I get the error AttributeError: 'ResultSet'
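The second error arises because .encode() is a string method, not something a list-like result supports. A hedged sketch of one workaround: convert each element to text and escape whatever the console encoding cannot represent. The helper name printable is hypothetical, and a plain list stands in for the ResultSet so the example runs without bs4.

```python
def printable(items, encoding):
    """Return str(item) for each item, escaping characters the encoding lacks."""
    return [
        str(item).encode(encoding, errors="backslashreplace").decode(encoding)
        for item in items
    ]

# A plain list stands in for the ResultSet returned by find_all().
body = ['<div class="wrapper">\u2022 first</div>']

# cp437 is used here as an example of a console code page without U+2022.
safe = printable(body, "cp437")
print(safe)
```

On Python 3.7 and later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is another option, provided the terminal can actually display UTF-8.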

Unicode formatting

╄→尐↘猪︶ㄣ posted on 2019-12-05 21:55:13
I am working with string formatting. For English the formatting is neat, but for Unicode characters the formatting is haphazard. Can anyone please tell me the reason? Example:

    form = u'{:<15}{:<3}({})'
    a = [
        u'സി ട്രീമിം',
        u'ബി ഡോഗേറ്റ്',
        u'ജെ ഹോളണ്ട്',
        u'എം നസീർ ',
        u'എം ബസ്ചാഗൻ…',
        u'ടി ഹെഡ് ',
        u'കെ ഭാരത് ',
        u'എം സിറാജ് ',
        u'എ ഈശ്വരൻ ',
        u'സി ഹാൻഡ്‌സ്‌കോംബ് ബി',
    ]
    for i in range(0, 10):
        print form.format(a[i][:12], 1, 2)

This gives haphazard output, while

    s = [ u'abcdef', u'akash', u'rohit', u'anubhav', u'bhargav', u'achut', u'punnet', u'tom', u'rach', u'kamal' ]
    for i in range(0, 10):
        print form.format(s[i][
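One likely reason is that format widths count code points, while combining marks and joiners occupy no column of their own on screen. A rough sketch, assuming it is enough to skip non-spacing marks, enclosing marks, and format characters when measuring; the helpers visible_len and pad_left_aligned are hypothetical, and real display width still depends on the font and on how the script is shaped.

```python
import unicodedata

def visible_len(text):
    """Count code points that are not non-spacing/enclosing marks or format characters."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ("Mn", "Me", "Cf")
    )

def pad_left_aligned(text, width):
    """Left-align text in a field, padding by visible length instead of len()."""
    return text + " " * max(0, width - visible_len(text))

sample = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(sample), visible_len(sample))
print(ascii(pad_left_aligned(sample, 4)))
```

For scripts with conjuncts, such as Malayalam, this is only an approximation; a monospaced layout may not be achievable in a terminal at all.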

Python escape sequence \N{name} not working as per definition

旧巷老猫 posted on 2019-12-05 10:58:50
I am trying to print Unicode characters given their names, as follows:

    # -*- coding: utf-8 -*-
    print "\N{SOLIDUS}"
    print "\N{BLACK SPADE SUIT}"

However, the output I get is not very encouraging. The escape sequence is printed as is.

    ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
    Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on
    Type "help", "copyright", "credits" or "license" for more information.
    >>> # -*- coding: utf-8 -*-
    ... print "\N{SOLIDUS}"
    \N{SOLIDUS}
    >>> print "\N{BLACK SPADE SUIT}"
    \N{BLACK SPADE SUIT}
    >>>

I can however see that another asker has
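In Python 2 the \N{...} escape is only interpreted inside unicode literals (u"..."); in a plain byte string it is left untouched, which matches the output above. A short Python 3 sketch, where every str literal is Unicode, with unicodedata.lookup as the runtime equivalent for names held in variables:

```python
import unicodedata

# In Python 3, str literals are Unicode, so the escape is resolved at compile time.
solidus = "\N{SOLIDUS}"

# unicodedata.lookup does the same by name at runtime.
spade = unicodedata.lookup("BLACK SPADE SUIT")

print(ascii(solidus), ascii(spade))
```

Under Python 2 the equivalent would be to write the literals as u"\N{SOLIDUS}" and u"\N{BLACK SPADE SUIT}".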

How to fix an encoding migrating Python subprocess to unicode_literals?

自古美人都是妖i posted on 2019-12-05 09:23:39
We're preparing to move to Python 3.4 and added unicode_literals. Our code relies extensively on piping to and from external utilities using the subprocess module. The following code snippet works fine on Python 2.7 to pipe UTF-8 strings to a subprocess:

    kw = {}
    kw[u'stdin'] = subprocess.PIPE
    kw[u'stdout'] = subprocess.PIPE
    kw[u'stderr'] = subprocess.PIPE
    kw[u'executable'] = u'/path/to/binary/utility'
    args = [u'', u'-l', u'nl']
    line = u'¡Basta Ya!'
    popen = subprocess.Popen(args,**kw)
    popen.stdin.write('%s\n' % line.encode(u'utf-8'))
    ...blah blah...

The following changes throw this error: from _
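With unicode_literals, the '%s\n' template becomes text while line.encode(...) is bytes, so the two no longer combine cleanly. A hedged sketch of the usual pattern: build the whole line as text, then encode once before writing to the pipe. The child process here is a hypothetical Python one-liner that echoes stdin, standing in for the external utility.

```python
import subprocess
import sys

line = "\u00a1Basta Ya!"

# Hypothetical child process: echoes its stdin to stdout unchanged.
child = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

popen = subprocess.Popen(child, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Format as text first, then encode the complete payload exactly once.
payload = (line + "\n").encode("utf-8")
out, _ = popen.communicate(payload)

echoed = out.decode("utf-8")
print(ascii(echoed))
```

communicate() is used instead of stdin.write() so the pipe is closed and the output collected in one step.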

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)

人盡茶涼 posted on 2019-12-05 05:13:06
I am trying to pull a list of 500 restaurants in Amsterdam from TripAdvisor; however, after the 308th restaurant I get the following error:

    Traceback (most recent call last):
      File "C:/Users/dtrinh/PycharmProjects/TripAdvisorData/LinkPull-HK.py", line 43, in <module>
        writer.writerow(rest_array)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)

I tried several things I found on StackOverflow, but nothing is working as of right now. I would be grateful if someone could take a look at my code and suggest potential solutions.
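The u'\u2019' in the message is a right single quotation mark, which the Python 2 csv writer tries to encode as ASCII. A minimal Python 3 sketch of the usual remedy, opening the output file with an explicit encoding; the row contents and file name are hypothetical, since the asker's code is not shown.

```python
import csv
import os
import tempfile

# Hypothetical row containing U+2019 (right single quotation mark).
rest_array = ["Caf\u00e9 de l\u2019Europe", "Amsterdam"]

path = os.path.join(tempfile.mkdtemp(), "restaurants.csv")

# An explicit encoding means the writer never falls back to ASCII.
with open(path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerow(rest_array)

with open(path, encoding="utf-8", newline="") as handle:
    rows = list(csv.reader(handle))
```

Under Python 2, the comparable step would be to encode each unicode value with .encode("utf-8") before passing the row to writerow.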

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')

匆匆过客 posted on 2019-12-04 22:44:49
Strange error from numpy via matplotlib when trying to get a histogram of a tiny toy dataset. I'm just not sure how to interpret the error, which makes it hard to see what to do next. I didn't find much related, though this nltk question and this gdsCAD question are superficially similar. I intend the debugging info at the bottom to be more helpful than the driver code, but if I've missed something, please ask. This is reproducible as part of an existing test suite.

        if n > 1:
            return diff(a[slice1]-a[slice2], n-1, axis=axis)
        else:
    >       return a[slice1]-a[slice2]
    E       TypeError: ufunc 'subtract' did not

Unicode search not working

a 夏天 posted on 2019-12-04 19:52:21
Consider this.

    # -*- coding: utf-8 -*-
    import re
    data = "cdbsb \xe2\x80\xa6 abc"
    print data  # prints cdbsb … abc
    #                          ^
    print re.findall(ur"[\u2026]", data)

Why can't re find this Unicode character? I have already checked that \xe2\x80\xa6 === … === U+2026.

My guess is that the issue is that data is a byte string. Your console encoding is probably UTF-8, so when the string is printed the console interprets the bytes as UTF-8 and displays the character … (you can check this via sys.stdout.encoding). But re most probably does not do this decoding for you. If you convert
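The decoding step the answer is leading up to can be sketched as follows, in Python 3 syntax and assuming the bytes are UTF-8 as the question states:

```python
import re

data = b"cdbsb \xe2\x80\xa6 abc"   # UTF-8 bytes; E2 80 A6 encodes U+2026

# Decode to text first so the pattern and the subject are both Unicode.
text = data.decode("utf-8")
matches = re.findall("[\u2026]", text)

print(ascii(matches))
```

In Python 2 the equivalent is re.findall(ur"[\u2026]", data.decode("utf-8")).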

Unicode in python

*爱你&永不变心* posted on 2019-12-04 19:27:38
I now use elixir with my MySQL database and redispy with Redis, and I have selected UTF-8 everywhere. I want to get some data written in Chinese, like

    {'Info': '8折', 'Name': '家乐福'}

but what I get is this:

    {'Info': u'8\u6298', 'Name': u'\u5bb6\u4e50\u798f'}

After I store this dict in Redis and read it back with redispy, it becomes:

    {"Info": "8\u6298", "Name": "\u5bb6\u4e50\u798f"}

I know that if I add u' before 8\u6298 and print it, it will show me "8折", but is there a function or another solution to this problem?

The latter looks like JSON; try decoding it first:

    import json
    resp = '{"Info": "8
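A self-contained sketch of that suggestion, reusing the payload quoted in the question as the string read back from Redis (an assumption about what redispy returns). json.loads turns the \uXXXX escapes back into characters, and ensure_ascii=False keeps them readable when serializing again.

```python
import json

# The JSON text as it came back from Redis, with literal \uXXXX escapes.
resp = '{"Info": "8\\u6298", "Name": "\\u5bb6\\u4e50\\u798f"}'

data = json.loads(resp)                          # escapes become real characters
readable = json.dumps(data, ensure_ascii=False)  # keep non-ASCII text unescaped

print(ascii(data["Name"]))
```

The escaped and unescaped forms hold the same data; only the textual representation differs.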